As we all know, it is important to split the dataset into a training set, a validation set, and a test set before building a model. The reasons for this step are:
- To avoid data leakage and prevent overfitting
- To tune the model's hyperparameters and achieve better predictive performance
Each of the resulting sets plays a different role in model building.
- The training set is used directly in the model training process.
- The validation set is also used during training, but mainly to tune the model's hyperparameters and choose the best model.
- The test set is used only when the model is completely trained, to evaluate the final model's performance.
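As a minimal sketch of such a split (assuming scikit-learn and its bundled iris dataset, both illustrative choices), a three-way split might look like this:

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split

X, y = load_iris(return_X_y=True)

# First carve out a held-out test set (20% here, an illustrative ratio)
X_trainval, X_test, y_trainval, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# Then split the remainder into training and validation sets
X_train, X_val, y_train, y_val = train_test_split(
    X_trainval, y_trainval, test_size=0.25, random_state=42
)
# Result: 60% train, 20% validation, 20% test
```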
If you simply use the resulting training and validation sets to train and tune the model, and then use the test set to evaluate it (the validation set approach), you get only one, or at most two, estimates of model performance. What's more, since you reserve a certain percentage of the data for validation, you have less data to train the model, and models tend to perform worse when trained on less data. The problem gets worse when the dataset is small.
To solve this problem, you can use cross validation, a standard method in applied machine learning that lets you use all of the data for both training and evaluation.
A dataset carries rich information, and we don't know in advance which observations contain the important signal, so if we withhold 20% or 30% of the data and never learn from it, we may lose a lot of information. Cross validation also makes it easier to tune models. There are several ways to do it; I will introduce two approaches here.
- K-Folds Cross Validation
- LOOCV (Leave One Out Cross Validation)
K-Fold Cross Validation
K-Fold cross validation randomly splits the data into K folds; in each round the model trains on K-1 folds and validates on the remaining one, and the process repeats K times so that every fold serves as the validation set exactly once. Across the K rounds, then, K-Fold cross validation uses all of the data to train the model.
[Figure: 5-fold cross validation]
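As a minimal sketch (assuming scikit-learn, with the same illustrative iris dataset and a logistic regression model standing in for whatever model you are tuning), 5-fold cross validation could look like this:

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import KFold, cross_val_score

X, y = load_iris(return_X_y=True)
model = LogisticRegression(max_iter=1000)

# Shuffle, then split into 5 folds; each fold is held out once
cv = KFold(n_splits=5, shuffle=True, random_state=42)
scores = cross_val_score(model, X, y, cv=cv)

print(scores)           # one accuracy score per fold
print(np.mean(scores))  # average performance across the 5 folds
```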
LOOCV (Leave One Out Cross Validation)
The underlying concept of LOOCV is very similar to K-Fold cross validation; the difference is that K is usually 5 or 10 for K-Fold, whereas for LOOCV, K equals the size of your dataset. In other words, LOOCV creates N folds and leaves exactly one observation out each time.
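A minimal sketch with scikit-learn's LeaveOneOut splitter (same illustrative dataset and model as above):

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import LeaveOneOut, cross_val_score

X, y = load_iris(return_X_y=True)
model = LogisticRegression(max_iter=1000)

# One fold per observation: N training runs, each leaving out one row
scores = cross_val_score(model, X, y, cv=LeaveOneOut())
print(scores.mean())  # average over N single-observation evaluations
```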
I prefer K-Fold cross validation over LOOCV, for the following reasons.
- Less variance than the validation set approach, because every fold is used for both training and validation, so there is no randomness in which observations land in the training or validation set. It also has less variance than LOOCV, because in LOOCV every model is trained on almost identical observations, making the per-fold errors highly correlated.
- Less bias than the validation set approach, because all of the data is used for model training rather than only a portion of it.
- Good for small datasets.
- Not as computationally expensive as LOOCV, since K is usually 5-10.
Things to Notice
The way you set up cross validation should mimic your real use case. You should also prevent data leakage.
- Do any exploration with the training set only.
- Pay attention to any data that involves time. If there is a time component, build folds from the same time period and respect the time sequence: for example, use a whole month as the validation fold and the data prior to that month for training (see the time-aware sketch after this list).
- Training data should be independent of test data. If you have multiple data points from the same subject, cluster those points and assign the whole group to a single fold, instead of letting data from the same subject appear in both the training set and the test set (see the grouped sketch after this list).
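For time-dependent data, a minimal time-aware sketch with scikit-learn's TimeSeriesSplit (the splitter choice and the toy data are illustrative; the point is that each validation fold lies strictly after its training data):

```python
import numpy as np
from sklearn.model_selection import TimeSeriesSplit

# Hypothetical time-ordered data: rows sorted by timestamp
X = np.arange(24).reshape(12, 2)
y = np.arange(12)

# Each validation fold comes after its training fold in time
tscv = TimeSeriesSplit(n_splits=4)
for train_idx, val_idx in tscv.split(X):
    print("train:", train_idx, "validate:", val_idx)
```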
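Similarly, a minimal grouped sketch with GroupKFold, assuming a hypothetical subject_id array that records which subject each row came from:

```python
import numpy as np
from sklearn.model_selection import GroupKFold

X = np.arange(16).reshape(8, 2)
y = np.array([0, 1, 0, 1, 0, 1, 0, 1])
subject_id = np.array([1, 1, 2, 2, 3, 3, 4, 4])  # hypothetical subject labels

# All rows from the same subject stay in the same fold
gkf = GroupKFold(n_splits=4)
for train_idx, val_idx in gkf.split(X, y, groups=subject_id):
    print("train subjects:", set(subject_id[train_idx]),
          "validate subjects:", set(subject_id[val_idx]))
```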