
Use Monte Carlo Simulation + Hypothesis Test to Evaluate Missing Value Imputation


It is quite normal to find plenty of missing values scattered here and there when you get a dataset in the real world. How to deal with missing values is a big topic. As far as I’m concerned, there is no single optimal method that works for all situations; it is usually decided case by case.

For me, most of the time I don’t want to throw away observations that contain missing values, unless the dataset is really large, the missing-value percentage of certain features is quite high (e.g. >98%), and those features do not play an important role in the analysis. Otherwise I’ll try my best to impute all the missing values and retain as much important information in the data as possible.

There are a lot of missing value imputation methods out there, including statistical methods and machine learning methods. No matter which method we use, we want the data distribution after imputation to not be twisted, or in other words, not statistically significantly different from the original distribution.

How to statistically prove that our imputation does not distort the original data distribution?

In this article, I’ll introduce one of the two statistical methods I created, which uses hypothesis tests and Monte Carlo simulation to check whether the imputed data distributions are significantly different from the original data distributions. I’ll name this method the “Monte Carlo Hypothesis Test”.

There are many parametric and non-parametric hypothesis tests used to determine whether a data sample comes from a certain population. For example, the Shapiro–Wilk test is used to test the normality of a data distribution. The null hypothesis of this test is that the population is normally distributed, and if the resulting p-value is less than the specified significance level (e.g. 0.05), then the null hypothesis is rejected. In other words, the test indicates that the data is not normally distributed.
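For instance, running the Shapiro–Wilk test with SciPy looks like this (a minimal sketch on a synthetic sample; the data and the 0.05 threshold are illustrative):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(42)
sample = rng.normal(loc=0, scale=1, size=300)   # synthetic, truly normal data

stat, p_value = stats.shapiro(sample)
print(f"Shapiro-Wilk statistic = {stat:.4f}, p-value = {p_value:.4f}")

# p-value >= 0.05: fail to reject the null hypothesis of normality.
# p-value <  0.05: the test suggests the data is not normally distributed.
```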

However, with a large dataset, even mild deviations from normality may be detected by the test. So if our data contains tens of thousands of observations, or even millions, and we use a hypothesis test to validate whether the whole data distribution has changed after the imputation, we may find that the test always rejects the null hypothesis, because the test is too “sensitive”. That’s why we usually use the Shapiro–Wilk test in conjunction with a Q–Q plot to make the decision.
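To see this sensitivity in action, here is a small sketch: the same mildly non-normal distribution (a Student’s t with 20 degrees of freedom, chosen purely for illustration, as are the sample sizes) usually passes the Shapiro–Wilk test at n = 200 but tends to get rejected at n = 5,000, even though a Q–Q plot of the larger sample still looks nearly straight.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)

# A mildly heavy-tailed distribution: Student's t with 20 degrees of freedom.
small_sample = rng.standard_t(df=20, size=200)
large_sample = rng.standard_t(df=20, size=5000)

_, p_small = stats.shapiro(small_sample)
_, p_large = stats.shapiro(large_sample)

print(f"n = 200  -> p-value = {p_small:.4f}")   # usually not rejected
print(f"n = 5000 -> p-value = {p_large:.4f}")   # often rejected

# A Q-Q plot helps judge whether the deviation actually matters in practice:
# import matplotlib.pyplot as plt; stats.probplot(large_sample, dist="norm", plot=plt); plt.show()
```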

The intuition behind the “Monte Carlo Hypothesis Test” method is to overcome this sensitivity of the hypothesis test.

Detailed steps of this method: 

  1. Randomly sample a small number of data points (50, 100, at most 200) from the original dataset and the imputed dataset (note: the two samples are drawn at the same row indices).
  2. Use an appropriate hypothesis test to check whether the two samples come from the same population. Record the p-value of the test.
  3. Repeat steps 1 & 2 many times (500, 800, 1000, etc.), recording all the p-values along the way.
  4. When the simulation process is done, average all the p-values.
  5. If the averaged p-value is less than the specified significance level, then we can say the data distributions before and after the imputation statistically differ from each other, and you may need to use another imputation method instead.

Here is how this method can be implemented:
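The snippet below is a minimal sketch of the five steps, not the exact code from the repository: it assumes the original and imputed columns are pandas Series that share the same index, uses the two-sample Kolmogorov–Smirnov test from scipy.stats as the hypothesis test, and drops the still-missing values from the original sample before comparing. The function name, its parameters, and the synthetic usage data are all illustrative.

```python
import numpy as np
import pandas as pd
from scipy import stats


def monte_carlo_hypothesis_test(original, imputed, sample_size=100,
                                n_simulations=1000, random_state=None):
    """Average the p-values of repeated small-sample two-sample tests
    between an original column (containing NaNs) and its imputed version."""
    rng = np.random.default_rng(random_state)
    p_values = []

    for _ in range(n_simulations):
        # Step 1: draw the SAME random row indices from both columns.
        idx = rng.choice(original.index.to_numpy(), size=sample_size,
                         replace=False)
        orig_sample = original.loc[idx].dropna()   # observed values only
        imp_sample = imputed.loc[idx]              # observed + imputed values

        # Step 2: test whether the two samples come from the same
        # population (two-sample Kolmogorov-Smirnov test here).
        result = stats.ks_2samp(orig_sample, imp_sample)

        # Step 3: record the p-value of this simulation round.
        p_values.append(result.pvalue)

    # Step 4: average all the recorded p-values.
    return float(np.mean(p_values))


# Example usage on a synthetic column (values and names are illustrative).
rng = np.random.default_rng(7)
raw = pd.Series(rng.normal(loc=50, scale=10, size=10_000))
missing_idx = rng.choice(raw.index.to_numpy(), size=1_000, replace=False)
raw.loc[missing_idx] = np.nan            # inject ~10% missing values
imputed = raw.fillna(raw.mean())         # simple mean imputation

avg_p = monte_carlo_hypothesis_test(raw, imputed, random_state=42)

# Step 5: compare the averaged p-value with the significance level (e.g. 0.05).
if avg_p < 0.05:
    print(f"avg p = {avg_p:.3f}: imputation distorted the distribution.")
else:
    print(f"avg p = {avg_p:.3f}: no significant distortion detected.")
```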



The source code of this method can also be found on my GitHub.

Hope this helps! 





