
Use Monte Carlo Simulation + Hypothesis Test to Evaluate Missing Value Imputation


It is quite normal to find missing values here and there when you get a dataset in the real world. How to deal with missing values is a big topic. As far as I'm concerned, there is no single optimal method that works for all situations; it is usually case by case.

Most of the time I don't want to throw away observations that contain missing values, unless the dataset is really large, the missing-value percentage of certain features is quite high (e.g. >98%), and those features do not play an important role in the analysis. Otherwise I'll try my best to impute all the missing values, to retain as much important information in the data as possible.

There are many missing value imputation methods out there, including statistical methods and machine learning methods. No matter which method we use, we want the data distribution after imputation to not be distorted, or in other words, to not be statistically significantly different from the original distribution.

How to statistically prove that our imputation does not distort the original data distribution?

In this article, I'll introduce one of the two statistical methods I created, which utilizes hypothesis testing and Monte Carlo simulation to test whether the imputed data distributions are significantly different from the original data distributions. I'll name this method the "Monte Carlo Hypothesis Test".

There are many parametric and non-parametric hypothesis tests used to determine whether a data distribution comes from a certain population. For example, the Shapiro–Wilk test is used to test the normality of a data distribution. The null hypothesis of this test is that the population is normally distributed, and if the resulting p-value is less than the specified significance level (e.g. 0.05), then the null hypothesis is rejected. In other words, the test indicates that the data is not normally distributed.
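As a quick illustration, here is a minimal toy example of my own (on synthetic data, not from any real dataset) using scipy.stats.shapiro:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(42)
x = rng.normal(loc=0.0, scale=1.0, size=100)  # synthetic data that really is normal

stat, p_value = stats.shapiro(x)
print(f"Shapiro-Wilk statistic = {stat:.4f}, p-value = {p_value:.4f}")
# A p-value below 0.05 would lead us to reject the null hypothesis of normality;
# for this sample it is typically well above 0.05.
```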

However, with a large dataset, even mild deviations from normality may be detected by the test. So if our data contains tens of thousands, or even millions, of observations, and we use hypothesis tests to validate whether the whole data distribution has changed after imputation, we may find that the test always rejects the null hypothesis, because the tests are too "sensitive". That's why we usually use the Shapiro–Wilk test in conjunction with a Q–Q plot to make the decision.
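For reference, a Q–Q plot can be drawn with scipy.stats.probplot; again, this is just a minimal sketch on synthetic data:

```python
import numpy as np
import matplotlib.pyplot as plt
from scipy import stats

rng = np.random.default_rng(0)
x = rng.normal(size=50_000)  # a large sample

# Plot sample quantiles against theoretical normal quantiles;
# points hugging the reference line indicate approximate normality.
stats.probplot(x, dist="norm", plot=plt)
plt.title("Q-Q plot against a normal distribution")
plt.show()
```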

The intuition behind the "Monte Carlo Hypothesis Test" method is to overcome this over-sensitivity of hypothesis tests on large datasets.

Detailed steps of this method: 

  1. Randomly draw a small sample (e.g. 50 or 100 observations, fewer than 200) from the original dataset and the imputed dataset (note: the two groups of sample data share the same index).
  2. Use an appropriate hypothesis test to check whether the two groups of sample data come from the same population. Record the p-value of the test.
  3. Repeat steps 1 & 2 many times (e.g. 500, 800, or 1,000 iterations). Record all the p-values along the way.
  4. When the simulation is done, average all the p-values.
  5. If the averaged p-value is less than the specified significance level, then we can say the data distributions before & after imputation statistically differ from each other, and you may need to use another imputation method instead.

Here is the implementation code for this method:
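(The snippet below is a minimal sketch of the steps above, assuming the original and imputed columns are pandas Series that share the same index; it uses the two-sample Kolmogorov–Smirnov test as one possible choice of test in step 2, but any suitable two-sample test could be swapped in.)

```python
import numpy as np
import pandas as pd
from scipy import stats


def monte_carlo_hypothesis_test(original: pd.Series,
                                imputed: pd.Series,
                                sample_size: int = 100,
                                n_simulations: int = 1000,
                                random_state=None) -> float:
    """Return the average p-value over repeated small-sample two-sample tests.

    `original` is the column before imputation (it may contain NaNs) and
    `imputed` is the same column after imputation; both share the same index.
    """
    rng = np.random.default_rng(random_state)
    p_values = []

    for _ in range(n_simulations):
        # Step 1: draw a small set of row indices shared by both columns.
        idx = rng.choice(imputed.index.to_numpy(), size=sample_size, replace=False)
        sample_original = original.loc[idx].dropna()  # drop the still-missing entries
        sample_imputed = imputed.loc[idx]

        if len(sample_original) < 3:  # too few observed values to test; skip this draw
            continue

        # Step 2: test whether the two samples come from the same population
        # (two-sample Kolmogorov-Smirnov test as one possible choice).
        _, p_value = stats.ks_2samp(sample_original, sample_imputed)
        p_values.append(p_value)

    # Steps 3 & 4: average all recorded p-values.
    return float(np.mean(p_values))


# Hypothetical usage: df_raw["age"] contains NaNs, df_imputed["age"] is filled in.
# avg_p = monte_carlo_hypothesis_test(df_raw["age"], df_imputed["age"])
# Step 5: if avg_p < 0.05, the imputation likely distorted the distribution.
```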



The source code of this method can also be found on my Github.

Hope this helps! 





