It is quite normal to find missing values here and there when you get a real-world dataset. How to deal with missing values is a big topic. In my experience, there is no single optimal method that works for all situations; it is usually decided case by case.
Most of the time, I don't want to throw away observations that contain missing values, unless the dataset is really large, the missing-value percentage of certain features is very high (e.g. >98%), and those features do not play an important role in the analysis. Otherwise, I try my best to impute all the missing values, to retain as much of the important information in the data as possible.
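As a quick sketch of that first screening step, here is how you might check per-column missing percentages in pandas and drop only the near-empty columns. The column names, the toy data, and the threshold are all illustrative (on a large dataset the threshold would be something like 98%):

```python
import numpy as np
import pandas as pd

# Toy DataFrame; column names and values are made up for illustration.
df = pd.DataFrame({
    "age": [25, np.nan, 31, 40, np.nan, 52],
    "income": [50_000, 62_000, np.nan, 71_000, 58_000, 66_000],
    "rare_flag": [np.nan, np.nan, np.nan, np.nan, np.nan, 1.0],
})

# Percentage of missing values per column.
missing_pct = df.isna().mean() * 100
print(missing_pct)

# Drop only columns that are almost entirely missing; impute the rest.
threshold = 80  # would be e.g. 98 on a large dataset
mostly_missing = missing_pct[missing_pct > threshold].index
df = df.drop(columns=mostly_missing)
```

Everything that survives this cut is then a candidate for imputation rather than deletion.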
There are many missing-value imputation methods out there, including statistical methods and machine learning methods. No matter which method we use, we want the data distribution after imputation not to be distorted, or in other words, not to be statistically significantly different from the original distribution.
How to statistically prove that our imputation does not distort the original data distribution?
In this article, I'll introduce one of the two statistical methods I created, which uses hypothesis testing and Monte Carlo simulation to test whether the imputed data distributions are significantly different from the original data distributions. I'll call this method the "Monte Carlo Hypothesis Test".
There are many parametric and non-parametric hypothesis tests for determining whether a data distribution comes from a certain population. For example, the Shapiro–Wilk test is used to test the normality of a data distribution. Its null hypothesis is that the population is normally distributed; if the resulting p-value is less than the specified significance level (e.g. 0.05), the null hypothesis is rejected. In other words, the test indicates that the data is not normally distributed.
However, with a large dataset, even mild deviations from normality may be detected by the test. So if our data contains tens of thousands, or even millions, of data points, and we use hypothesis tests to validate whether the overall data distribution changed after imputation, we may find that the test always rejects the null hypothesis, because the tests are too "sensitive". That's why the Shapiro–Wilk test is usually used in conjunction with a Q–Q plot to make the decision.
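This sensitivity effect is easy to demonstrate. The sketch below uses D'Agostino's K² normality test (`scipy.stats.normaltest`) rather than Shapiro–Wilk, because SciPy's `shapiro` warns that its p-values may be inaccurate above a few thousand points; the same effect applies to Shapiro–Wilk. The distribution and seed are arbitrary choices:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(42)

# Mildly non-normal data: Student's t with 10 degrees of freedom
# (slightly heavier tails than a normal distribution).
small = rng.standard_t(df=10, size=100)
large = rng.standard_t(df=10, size=100_000)

# D'Agostino's K^2 normality test; null hypothesis = data is normal.
_, p_small = stats.normaltest(small)
_, p_large = stats.normaltest(large)

print(f"n=100:     p = {p_small:.3f}")
print(f"n=100000:  p = {p_large:.3g}")
# With the large sample the mild deviation is detected (p << 0.05),
# while the small sample typically looks "normal enough".
```

The underlying data is equally (non-)normal in both cases; only the sample size changed.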
The intuition behind the "Monte Carlo Hypothesis Test" method is to overcome this sensitivity of the hypothesis test.
Detailed steps of this method:
- Randomly sample a small number of data points (50, 100, fewer than 200) from the original dataset and the imputed dataset (note: the two groups of sample data share the same indices).
- Use an appropriate hypothesis test to check whether the two groups of sample data come from the same population. Get the p-value for the test.
- Repeat steps 1 and 2 many times (500, 800, 1000, etc.). Record all the p-values along the way.
- When the simulation process is done, average all the p-values.
- If the averaged p-value is less than the specified significance level, then we can say the data distributions before and after the imputation statistically differ from each other, and you may need to use another imputation method instead.
Here is the implementation code for this method:
Hope this helps!