
Use Flowers to Analyze PCA and NMF


PCA and NMF are both dimension reduction methods that can be used for BIG DATA problems. They use matrix algebra to transform the original feature space of a dataset into a much lower dimensional space while keeping the majority of the information in the data. To make this article more vivid, I would like to use image data to show the functionality of and the differences between these two methods. In fact, PCA and NMF are very natural transformations for image data, because images have a large feature space (width x height x channels) and their pixels are highly correlated.

Let’s take a look at this flower dataset first.

The flower dataset I used in this article is from Kaggle (https://www.kaggle.com/alxmamaev/flowers-recognition).

We have roses and sunflowers in this image dataset. Each image is represented as a 128x128x3 numpy array. So if you flatten each image, you will get a 1D array with 49152 features, which is a REALLY LARGE number of features.

But no worries, we can use PCA and NMF to reduce the dimensionality of these images. Let’s start with PCA.

PCA (Principal Component Analysis)

PCA uses an orthogonal transformation to convert the original features into a set of uncorrelated components. Each component is a new feature in the transformed dataset, and each new feature is a linear combination of the original features. The new features (principal components) are ranked in decreasing order of how much variance they explain in the data.

Algorithm

  1. Center the features (subtract the mean of each column).
  2. Find the first component that explains the most variation in the data. Specifically, create a new variable z1 using a vector of weights denoted by w1, and choose the weights that maximize the variance of z1 (constrain w1 to have a norm of 1). Remember that z1 is a linear combination of the original features with weights w1.
  3. Do this again to find z2 and w2, but this time constrain the new combination to be uncorrelated with the previous one (the orthogonal transformation ensures the components are uncorrelated), so z2 explains the second most variation in the data.
  4. Keep repeating this process until we have p new features (z) and their weights (w); a minimal sketch of these steps is shown below.
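To make these steps concrete, here is a minimal NumPy sketch of the algorithm above. It is a toy example with my own variable names, not the exact code I used for the flowers; a library implementation like scikit-learn's PCA does the same job more efficiently (typically via SVD).

```python
import numpy as np

def pca_fit_transform(X, n_components):
    """Minimal PCA: center the data, find orthogonal unit-norm weight vectors (w),
    and project onto them to get the new features (z)."""
    # Step 1: center each column (feature)
    X_centered = X - X.mean(axis=0)

    # Steps 2-4: the eigenvectors of the covariance matrix are exactly the
    # unit-norm weight vectors that maximize variance, each orthogonal
    # (and therefore uncorrelated) to the previous ones.
    cov = np.cov(X_centered, rowvar=False)
    eigvals, eigvecs = np.linalg.eigh(cov)

    # eigh returns eigenvalues in ascending order; keep the largest ones
    order = np.argsort(eigvals)[::-1][:n_components]
    W = eigvecs[:, order]          # weights, one column per component
    Z = X_centered @ W             # new features z = Xw
    return Z, W

# toy usage: 100 samples, 5 correlated features -> 2 components
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 5))
X[:, 1] = X[:, 0] + 0.1 * rng.normal(size=100)   # make two features correlated
Z, W = pca_fit_transform(X, n_components=2)
print(Z.shape, W.shape)   # (100, 2) (5, 2)
```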

Apply PCA to the flower image data

When I applied PCA to the flower data, I used the first 200 principal components as the new features and transformed the original dataset. These 200 principal components explain over 80% of the variance in the data, which means I successfully reduced the feature space from 49152 to 200 while keeping over 80% of the information in the data.
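Here is roughly what this step looks like with scikit-learn. It is a sketch that assumes the photos have already been loaded and resized into a NumPy array called images of shape (n_images, 128, 128, 3); the loading code and file paths are omitted, and the variable names are mine.

```python
import numpy as np
from sklearn.decomposition import PCA

# `images` is assumed to be a (n_images, 128, 128, 3) uint8 array of flower photos
X = images.reshape(len(images), -1).astype(float)   # flatten: (n_images, 49152)

pca = PCA(n_components=200)
X_pca = pca.fit_transform(X)                        # transformed data: (n_images, 200)

explained = pca.explained_variance_ratio_.sum()
print(f"first 200 components explain {explained:.1%} of the variance")
```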

The first 8 principal components 

You may be curious about how much the original data changed after PCA was applied, since in this case I lost about 20% of the information. To answer this question, we can reconstruct one of the flowers from the transformed data and compare it with its original image.
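A quick sketch of that reconstruction, continuing from the pca, X, and X_pca objects in the previous snippet (the flower index 0 is arbitrary, and I assume pixel values are in the 0-255 range):

```python
import numpy as np
import matplotlib.pyplot as plt

# map the 200-dimensional representation back to the original 49152 pixels
X_restored = pca.inverse_transform(X_pca)

original = X[0].reshape(128, 128, 3)
restored = np.clip(X_restored[0], 0, 255).reshape(128, 128, 3)

fig, axes = plt.subplots(1, 2)
axes[0].imshow(original.astype("uint8")); axes[0].set_title("original")
axes[1].imshow(restored.astype("uint8")); axes[1].set_title("after PCA (200 components)")
plt.show()
```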

Compare a flower image before and after the PCA transformation

The reconstructed flower image is blurrier, but we can still see discernible flower patterns in the image after the PCA transformation.

NMF (Non-negative Matrix Factorization)

NMF is another dimension reduction method. It works by taking the data matrix X and decomposing it into two new matrices, W and H, so that X can be approximately represented by their product WH (X ~ WH). So intrinsically, NMF is an optimization problem: find the W and H that best approximate X.

NMF is useful for non-negative data, such as image, audio, or text data.
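To make the shapes concrete: if X is n x p and we choose a rank m, then W is n x m and H is m x p, so their product WH has the same shape as X. A tiny toy sketch (the numbers are arbitrary):

```python
import numpy as np

n, p, m = 6, 10, 3                   # n samples, p features, chosen rank m
rng = np.random.default_rng(0)
W = rng.random((n, m))               # (n, m): the new lower-dimensional representation
H = rng.random((m, p))               # (m, p): the loadings / components
X_approx = W @ H                     # (n, p): same shape as the original X
print(X_approx.shape)                # (6, 10)
```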

Algorithm

  1. Choose the rank of the matrix that will approximate X, denoted by m. This will be the number of columns in W and the number of rows in H. A smaller m uses less information to approximate X, while a larger m gives a better approximation. W plays a role similar to the principal component scores in PCA, and H is like the matrix of loadings.
  2. In the optimization, we minimize the Euclidean distance between the original matrix and the approximation, i.e. the Frobenius norm of X - WH, while constraining all the values in W and H to be non-negative.
  3. The resulting matrix W is our new lower dimensional representation of X, and we can use its columns as features. H is the matrix of loadings that we can dig into to understand the decomposition and to reconstruct the approximation to X; a minimal sketch of this procedure is shown below.
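Here is a minimal sketch of one classic way to solve this optimization, the multiplicative update rules for the Frobenius-norm objective. It is a toy illustration with my own variable names, not necessarily the solver a library like scikit-learn uses under the hood.

```python
import numpy as np

def nmf_multiplicative(X, m, n_iter=200, eps=1e-9):
    """Toy NMF: multiplicative updates that decrease ||X - WH|| (Frobenius norm)
    while keeping every entry of W and H non-negative."""
    n, p = X.shape
    rng = np.random.default_rng(0)
    W = rng.random((n, m))                       # step 1: pick rank m, init W, H > 0
    H = rng.random((m, p))
    for _ in range(n_iter):                      # step 2: minimize the Frobenius norm
        H *= (W.T @ X) / (W.T @ W @ H + eps)
        W *= (X @ H.T) / (W @ H @ H.T + eps)
    return W, H                                  # step 3: W = new features, H = loadings

# toy usage on a small non-negative matrix
X = np.random.default_rng(1).random((20, 8))
W, H = nmf_multiplicative(X, m=3)
print(np.linalg.norm(X - W @ H))                 # reconstruction error
```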

Apply NMF to the flower image data

NMF is a slower algorithm than PCA, so I chose to start with a small rank (m = 100). Unlike PCA, NMF doesn't have a comparable notion of explained variance, but we can look at the reconstruction error, which measures how well the factorized matrices approximate the original one.
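In practice I would use scikit-learn here rather than hand-rolled updates. A sketch, reusing the flattened pixel matrix X from the PCA section (raw pixel intensities are already non-negative, which is what NMF requires; the init and iteration settings are my own choices):

```python
from sklearn.decomposition import NMF

# X is the (n_images, 49152) matrix of flattened pixel values, all non-negative
nmf = NMF(n_components=100, init="nndsvda", max_iter=300)
W = nmf.fit_transform(X)            # (n_images, 100): the new features
H = nmf.components_                 # (100, 49152): the loadings shown below

# Frobenius-norm distance between X and the approximation W @ H
print(nmf.reconstruction_err_)
```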

Similarly, we can take a look at the components and the reconstructed image after NMF transformation.
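Below is a rough sketch of how the two figures that follow can be produced from the W and H above (the display scaling is my own choice, assuming 0-255 pixel values):

```python
import numpy as np
import matplotlib.pyplot as plt

# show the first 8 rows of H, each reshaped back into a 128x128x3 image
fig, axes = plt.subplots(1, 8, figsize=(16, 2))
for i, ax in enumerate(axes):
    component = H[i].reshape(128, 128, 3)
    ax.imshow(component / component.max())      # rescale each component for display
    ax.axis("off")
plt.show()

# reconstruct one flower from its 100 NMF features: x ~ W[0] @ H
restored = np.clip(W[0] @ H, 0, 255).reshape(128, 128, 3)
plt.imshow(restored.astype("uint8"))
plt.show()
```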

The first 8 components in the H matrix


Compare a flower image before and after the NMF transformation

The reconstructed flower image is very blurry, but we can still see some flower patterns in the image after the NMF transformation. As mentioned above, if we increase the rank, the reconstructed image will look closer to the original.

Summary & Caution

Both PCA and NMF can be used whenever we want to reduce the number of features. But when you're working with non-negative data like images or audio, NMF can be the better choice because its components are often more interpretable than PCA's when you want to inspect them.

PCA should only be used for numeric data because the transformation doesn't make as much sense if you apply it to categorical data and dummy variables.

Both PCA and NMF are unsupervised learning algorithms, which means they don’t consider the response variable at all. So if you apply them in a supervised learning task, they won’t necessarily give you better predictions, because the information they leave out might be exactly what matters for the prediction.




















