A demonstration of the most common mistake people make when using training and test data in machine learning, and why cross-validation helps avoid it

In our previous article, we covered the principles of training and test data and why experts apply them rigorously.

Here is a short recap of training and test data:

Your training data is the core knowledge of your machine learning model. An intuitive way to think of training data is as a set of books we give a person to read. This person can only build knowledge that exists in these books. In other words, if you gave your model a set of books on how to speak English, then you should assume only that it has learnt how to speak English.

Your test data is then a set of quizzes or exams that you use on your model to test whether it has properly learnt the knowledge you want it to learn.

The most common mistake that people make is selecting unrepresentative training and/or test data that does not match what they are really predicting. For example, suppose we are trying to build a traffic prediction model for the highway during peak hours on weekdays. Both our training and test data should be representative of what we are predicting: peak hour traffic on weekdays. Hence, we need training and test data of peak hour traffic from past weekdays. If the model were trained on non-peak hour traffic, or on peak hour traffic from weekends, its predictive power would be weak. Keep in mind that it is still possible to see high accuracy as a fluke due to random chance (e.g. maybe you used a weekend day with a big event, which produced traffic conditions similar to a weekday peak hour).
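To make the traffic example concrete, here is a minimal sketch of selecting only representative records before any training or testing happens. The record format (a timestamp paired with a vehicle count), the sample values, and the peak-hour windows are all assumptions made for illustration; real peak hours depend on the road being modelled.

```python
from datetime import datetime

# Assumed peak windows (hypothetical): 07:00-09:59 and 16:00-18:59.
PEAK_HOURS = set(range(7, 10)) | set(range(16, 19))

def is_weekday_peak(timestamp: datetime) -> bool:
    """True if the timestamp falls in a peak hour on Monday to Friday."""
    return timestamp.weekday() < 5 and timestamp.hour in PEAK_HOURS

def select_representative(records):
    """Keep only (timestamp, vehicle_count) records that match the prediction target."""
    return [r for r in records if is_weekday_peak(r[0])]

# Hypothetical observations: (timestamp, vehicles counted in that hour).
records = [
    (datetime(2024, 3, 4, 8), 1200),   # Monday 08:00, weekday peak: kept
    (datetime(2024, 3, 4, 13), 500),   # Monday 13:00, off-peak: dropped
    (datetime(2024, 3, 9, 8), 400),    # Saturday 08:00, weekend: dropped
    (datetime(2024, 3, 8, 17), 1350),  # Friday 17:00, weekday peak: kept
]
representative = select_representative(records)
print(representative)
```

Both the training set and the test set would then be drawn from `representative`, so neither contains conditions the model is not meant to predict.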

Because we cannot always be sure that our training and test data are representative, and we also need to make sure our results are not just flukes, we use cross-validation. It repeats training and testing over several different random splits of the data, which adds enough randomization to reduce the probability that a single lucky or unlucky split drives the result.
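As a rough illustration, the sketch below runs k-fold cross-validation using only the Python standard library. The vehicle counts are made-up values, and the "model" is a deliberately trivial placeholder that predicts the mean of its training data; the function names are hypothetical and chosen for this example only. The point is the procedure: every sample is used for testing exactly once, and we look at the error on each fold rather than on a single split.

```python
import random
from statistics import mean

def k_fold_indices(n_samples, k, seed=0):
    """Shuffle the indices 0..n_samples-1 and deal them into k folds."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    return [indices[i::k] for i in range(k)]

def cross_validate(values, k=5, seed=0):
    """Return one mean absolute error per fold for a mean-predicting baseline."""
    folds = k_fold_indices(len(values), k, seed)
    errors = []
    for held_out in folds:
        held_out_set = set(held_out)
        train = [v for i, v in enumerate(values) if i not in held_out_set]
        test = [values[i] for i in held_out]
        prediction = mean(train)  # "training" the placeholder model
        errors.append(mean(abs(v - prediction) for v in test))
    return errors

# Hypothetical weekday peak-hour vehicle counts.
counts = [1200, 1350, 1180, 1420, 1290, 1310, 1250, 1380, 1220, 1330]
fold_errors = cross_validate(counts, k=5)
print("error per fold:", fold_errors)
print("average error:", mean(fold_errors))
```

If one fold's error is far from the others, that is a hint that some part of the data differs from the rest, or that a single train/test split could have given a misleadingly good or bad result.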

This content is part of our Fundamentals of Machine Learning and AI course. The course covers these topics in more detail and offers the opportunity to discuss them face to face with machine learning experts. Contact us or visit our course catalog to learn more about our interactive short courses.