Something About Cross Validation


As we all know, it is important to split the dataset into a training set, a validation set, and a test set before building a model. The reasons for this preprocessing step are:
  • To avoid data leakage and prevent overfitting
  • To tune the model's hyperparameters and achieve better predictive performance

Each part of the dataset after the split plays a different role in model building.
  • The training set is used directly in the model training process.
  • The validation set is also used during training, mainly to tune the model hyperparameters and choose the best model.
  • The test set is only used once the model is completely trained, and it is used to evaluate the final model's performance.

If you simply use the split training and validation sets to train and tune the model, and use the test set to evaluate it (the validation set approach), you get only one, or at most two, estimates of the model's performance. What's more, since you reserve a certain percentage of the data for validation, you have less data to train the model, and a model trained on fewer observations tends to perform worse. This problem becomes more severe when the dataset is small.
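As a quick illustration, here is a minimal sketch of the validation set approach using scikit-learn's train_test_split; the toy data and the 60/20/20 split ratio are just assumptions for the example.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split

# Toy data standing in for a real dataset (hypothetical).
X, y = make_classification(n_samples=1000, n_features=20, random_state=42)

# Hold out 20% as the test set, then split the remainder 75/25 into
# train / validation, giving roughly a 60/20/20 split overall.
X_temp, X_test, y_temp, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
X_train, X_val, y_train, y_val = train_test_split(X_temp, y_temp, test_size=0.25, random_state=42)

print(len(X_train), len(X_val), len(X_test))  # 600 200 200
```

Note how 40% of the data never contributes to fitting the model, which is exactly the drawback described above.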

To solve this problem, you can use cross validation, a resampling method widely used in applied machine learning. It allows us to use all the data for both training and evaluation.

A dataset carries rich information, and we don't know in advance which pieces of data are important and which are not, so if we withhold 20% or 30% of the data and never learn from it, we may lose a lot of information. Cross validation also makes it easier to tune models. There are several ways to do it; I will introduce two approaches here.
  • K-Folds Cross Validation
  • LOOCV (Leave One Out Cross Validation)

K-Fold Cross Validation

K-Fold cross validation randomly splits the data into K folds. In each iteration, the model is trained on K-1 folds and validated on the remaining fold, and this process repeats K times so that every fold serves as the validation set exactly once. As a result, K-Fold cross validation uses all of the data for training over the course of the procedure.

5-fold cross validation
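Below is a minimal sketch of 5-fold cross validation with scikit-learn's KFold and cross_val_score; the toy dataset and the logistic regression model are assumptions made just for the example.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import KFold, cross_val_score

# Toy data and model chosen just for illustration.
X, y = make_classification(n_samples=500, n_features=10, random_state=0)
model = LogisticRegression(max_iter=1000)

# 5-fold cross validation: each fold serves as the validation set exactly once.
kf = KFold(n_splits=5, shuffle=True, random_state=0)
scores = cross_val_score(model, X, y, cv=kf, scoring="accuracy")

print(scores)          # one accuracy estimate per fold
print(scores.mean())   # average performance across the 5 folds
```

You get K performance estimates instead of one, and their average is a more stable measure of how the model generalizes.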


LOOCV (Leave One Out Cross Validation)

The underlying concept of LOOCV is very similar to K-Fold cross validation; however, for K-Fold cross validation K is usually 5 or 10, whereas for LOOCV K equals the size of your dataset. In other words, LOOCV creates N folds and leaves exactly one observation out for validation each time.

LOOCV - Leave One Out Cross Validation
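The same idea can be sketched with scikit-learn's LeaveOneOut splitter; again, the small toy dataset and model are assumptions for illustration only.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import LeaveOneOut, cross_val_score

# Small toy dataset: LOOCV fits the model once per observation, so keep N modest.
X, y = make_classification(n_samples=100, n_features=10, random_state=0)
model = LogisticRegression(max_iter=1000)

loo = LeaveOneOut()  # K equals the number of observations
scores = cross_val_score(model, X, y, cv=loo, scoring="accuracy")

print(len(scores))     # 100 fits, one per left-out observation
print(scores.mean())   # average accuracy over all left-out points
```

Note that the model is refit N times, which is why LOOCV becomes expensive on anything but small datasets.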


I prefer K-Fold cross validation over LOOCV; here are the reasons why.
  • Less variance than the validation set method, because each fold is used for both training and validation, so there is no randomness in which observations end up in training versus validation. It also has less variance than LOOCV, because in LOOCV each model is trained on almost identical sets of observations.
  • Less bias than the validation set method, because it eventually uses all the data for model training instead of only a part of it.
  • Works well for a small dataset.
  • Not as computationally expensive as LOOCV, since K is usually 5-10 rather than N.

Things to Notice

The way you set up cross validation should mimic the real-world use case of your problem. Also, you should prevent data leakage.
  • Do any exploration with the training set only.
  • Pay attention to data that has a time component. If it does, build folds from the same time period and respect the time sequence. For example, use a whole month of data as the validation fold and the data prior to that month for training.
  • Training data should be independent of test data. If you have multiple data points from the same subject, group those points and assign the whole group to a single fold, instead of letting data from the same subject appear in both the training set and the test set (see the sketch after this list).
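As a rough sketch of these last two points with scikit-learn (the tiny arrays and subject ids are made up for illustration), TimeSeriesSplit keeps every validation fold after its training data, and GroupKFold keeps all rows from one subject in a single fold.

```python
import numpy as np
from sklearn.model_selection import TimeSeriesSplit, GroupKFold

X = np.arange(24).reshape(12, 2)                  # hypothetical time-ordered data
y = np.arange(12)
groups = [0, 0, 1, 1, 2, 2, 3, 3, 4, 4, 5, 5]     # hypothetical subject ids

# Time-aware folds: validation data always comes after the training data.
for train_idx, val_idx in TimeSeriesSplit(n_splits=3).split(X):
    print("train:", train_idx, "validate:", val_idx)

# Group-aware folds: all rows from the same subject stay in the same fold.
for train_idx, val_idx in GroupKFold(n_splits=3).split(X, y, groups=groups):
    print("train groups:", set(np.array(groups)[train_idx]),
          "validate groups:", set(np.array(groups)[val_idx]))
```

Both splitters plug into cross_val_score the same way as KFold above, so the rest of the workflow does not change.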
