Posts

Showing posts from September, 2020

Something About Cross Validation

As we all know, it is important to split the dataset into a training set, a validation set and a test set before building a model. There are two reasons for this preprocessing step: to avoid data leakage and prevent overfitting, and to tune the model's hyperparameters to achieve better predictive performance. Each subset plays a different role in model building. The training set is used directly in the model training process. The validation set is also used while building the model, but mainly to tune the hyperparameters and choose the best model. The test set is used only once the model is completely trained, to evaluate the final model performance. If you simply use the training and validation sets from the split to train and tune the model, and use the test set to evaluate it (the validation set approach), you get only one or at most two estimates of the model performance. What's more, since you reserve a certain percentage of the data for validation, ...
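The validation set approach described above can be sketched in a few lines of Python. This is only an illustration using the standard library: the helper name `three_way_split`, the 60/20/20 proportions and the fixed seed are assumptions made for the example, not something the post specifies.

```python
import random


def three_way_split(items, val_frac=0.2, test_frac=0.2, seed=0):
    """Shuffle items and split them into train, validation and test lists.

    Hypothetical helper for illustration; the fractions are assumed defaults.
    """
    if not 0 <= val_frac + test_frac < 1:
        raise ValueError("val_frac + test_frac must be in [0, 1)")

    # Copy before shuffling so the caller's data is left untouched.
    shuffled = list(items)
    random.Random(seed).shuffle(shuffled)

    n_test = int(len(shuffled) * test_frac)
    n_val = int(len(shuffled) * val_frac)

    # Slice three disjoint parts: test first, then validation, the rest is train.
    test = shuffled[:n_test]
    val = shuffled[n_test:n_test + n_val]
    train = shuffled[n_test + n_val:]
    return train, val, test


train, val, test = three_way_split(range(100))
print(len(train), len(val), len(test))  # 60 20 20
```

Each sample lands in exactly one subset, so a model fitted on `train` and tuned on `val` never sees `test` until the final evaluation. Because the split is done once, it yields a single validation estimate, which is the limitation the paragraph above points out.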