It is quite normal to see missing values here and there when you get a dataset from the real world. How to deal with them is a big topic, and as far as I’m concerned there is no single method that works best in all situations; it is usually case-by-case. Most of the time I don’t want to throw away observations that contain missing values, unless the dataset is really large, the percentage of missing values in certain features is very high (e.g. >98%), and those features do not play an important role in the analysis. Otherwise I try my best to impute all the missing values, so as to retain as much important information in the data as possible. There are many missing value imputation methods out there, including statistical methods and machine learning methods. No matter which method we use, we want the data distribution after imputation not to be twisted, or in other words, not to be statistically significantly different from the original distribution...
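To make this concrete, here is a minimal sketch of that workflow in Python. The DataFrame, column name, and 20% missing rate are all made up for illustration; the idea is simply to check how much is missing, impute (median imputation here, as one of many possible choices), and then use a two-sample Kolmogorov–Smirnov test to check that the imputed distribution is not significantly different from the originally observed one.

```python
import numpy as np
import pandas as pd
from scipy.stats import ks_2samp

# Hypothetical example: a numeric column with ~20% missing values.
rng = np.random.default_rng(0)
age = rng.normal(40, 10, size=1000)
age[rng.random(1000) < 0.2] = np.nan
df = pd.DataFrame({"age": age})

# Only consider dropping the feature if it is almost entirely missing (e.g. >98%).
missing_frac = df["age"].isna().mean()
print(f"missing fraction: {missing_frac:.2%}")

# Otherwise impute -- here a simple median imputation as a placeholder.
imputed = df["age"].fillna(df["age"].median())

# Compare the imputed distribution against the originally observed values.
stat, p_value = ks_2samp(df["age"].dropna(), imputed)
print(f"KS statistic = {stat:.3f}, p-value = {p_value:.3f}")
# A large p-value suggests the imputation did not noticeably twist the distribution.
```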
PCA and NMF are both dimensionality reduction methods that can be used for BIG DATA problems. They use matrix algebra to transform the original feature space of the dataset into a much lower-dimensional space while keeping the majority of the information in the data. To make this article more vivid, I would like to use image data to show the functionality of, and the differences between, these two methods. In fact, PCA and NMF are very natural transformations for image data, because images have a large feature space (width x height x channels) and neighboring pixels are highly correlated. Let’s take a look at the flower dataset first. The flower dataset I used in this article is from Kaggle ( https://www.kaggle.com/alxmamaev/flowers-recognition ). It contains roses and sunflowers, and each image is represented as a 128x128x3 numpy array. So if you flatten an image, you get a 1D array with 49152 features, which is a REALLY LARGE number of features...
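As a rough sketch of what this looks like in code, the snippet below stands in random arrays for the actual flower images (same 128x128x3 shape, scaled to [0, 1]) and picks an arbitrary 50 components; it just shows how scikit-learn’s PCA and NMF reduce the 49152-dimensional pixel space to a much smaller one.

```python
import numpy as np
from sklearn.decomposition import PCA, NMF

# Stand-in for the flattened flower images: n_samples rows of 128*128*3 = 49152 features.
n_samples = 200
X = np.random.rand(n_samples, 128 * 128 * 3)   # values in [0, 1], as NMF needs non-negative data

# PCA: project the pixel space onto a small number of principal components.
pca = PCA(n_components=50)
X_pca = pca.fit_transform(X)
print(X_pca.shape)                              # (200, 50)
print(pca.explained_variance_ratio_.sum())      # how much variance the 50 components keep

# NMF: factorize the non-negative pixel matrix into two low-rank non-negative factors.
nmf = NMF(n_components=50, init="nndsvda", max_iter=400)
X_nmf = nmf.fit_transform(X)
print(X_nmf.shape)                              # (200, 50)
```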