
Posts

Use Monte Carlo Simulation + Hypothesis Test to Evaluate Missing Value Imputation

It is quite normal to find a lot of missing values here and there when you get a dataset in the real world. How to deal with missing values is a big topic. As far as I'm concerned, there is no single method that works best for all situations; it is usually case-by-case. Most of the time I don't want to throw away observations that contain missing values, unless the data is really large, the missing value percentage of certain features is quite high (e.g. >98%), and those features do not play an important role in the analysis. Otherwise I'll try my best to impute all the missing values to retain as much of the important information in the data as possible. There are a lot of missing value imputation methods out there, including statistical methods and machine learning methods. No matter what method we use, we want the data distribution after imputation not to be distorted, or in other words, not to be statistically significantly different from the original distribution...
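A minimal sketch of the idea in the title, under my own assumptions (a synthetic, fully observed feature and SimpleImputer standing in for whatever imputation method is being evaluated): knock out values at random, impute them, and run a two-sample Kolmogorov-Smirnov test between the original and imputed distributions, repeated many times Monte Carlo style.

```python
import numpy as np
from scipy import stats
from sklearn.impute import SimpleImputer

rng = np.random.default_rng(42)
original = rng.normal(loc=50, scale=10, size=2000)  # stand-in for a fully observed feature

p_values = []
for _ in range(500):                        # Monte Carlo repetitions
    masked = original.copy()
    holes = rng.random(masked.size) < 0.2   # knock out ~20% of the values at random
    masked[holes] = np.nan

    imputer = SimpleImputer(strategy="median")   # swap in any imputation method here
    imputed = imputer.fit_transform(masked.reshape(-1, 1)).ravel()

    # Two-sample KS test: is the imputed distribution significantly
    # different from the original one?
    p_values.append(stats.ks_2samp(original, imputed).pvalue)

rejection_rate = np.mean(np.array(p_values) < 0.05)
print(f"share of runs rejecting H0 at alpha=0.05: {rejection_rate:.2%}")
```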
Recent posts

Use Flowers to Analyze PCA and NMF

PCA and NMF are both dimension reduction methods that can be used for BIG DATA problems. They use matrix algebra to transform the original feature space of the dataset into a much lower-dimensional space while keeping the majority of the information in the dataset. To make this article more vivid, I would like to use image data to show the functionality of, and the differences between, these two methods. In fact, PCA and NMF are very natural transformation methods for image data, because image data has a large feature space (width x height x channel) and the pixels in an image are highly correlated. Let's take a look at this flower dataset first. The flower dataset I used in this article is from Kaggle ( https://www.kaggle.com/alxmamaev/flowers-recognition ). We have roses and sunflowers in this image dataset. Each image is represented as a 128x128x3 numpy array, so if you flatten each image, you get a 1D array with 49152 features, which is a REALLY LARGE number of features....
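As a rough illustration of the setup described above (random arrays standing in for the flower images, and component counts chosen arbitrarily), flattening the images and fitting scikit-learn's PCA and NMF might look like this:

```python
import numpy as np
from sklearn.decomposition import PCA, NMF

# Pretend stack of flower images: (n_images, 128, 128, 3), pixel values in [0, 1]
images = np.random.default_rng(0).random((100, 128, 128, 3))
X = images.reshape(len(images), -1)          # flatten: 128 * 128 * 3 = 49152 features

pca = PCA(n_components=16)                   # keep 16 components instead of 49152 features
X_pca = pca.fit_transform(X)
print(X_pca.shape)                           # (100, 16)
print(pca.explained_variance_ratio_.sum())   # how much variance the 16 components keep

# NMF requires non-negative input, which pixel intensities naturally satisfy
nmf = NMF(n_components=16, init="nndsvda", max_iter=300)
X_nmf = nmf.fit_transform(X)
print(X_nmf.shape)                           # (100, 16)
```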

Why Do We Need Precision-Recall Curve?

What are precision & recall and why are they used? For binary classification problems, the task is usually to differentiate the positive class from the negative class. When we have a binary classifier making predictions, there will always be true positives (TP), false positives (FP), true negatives (TN), and false negatives (FN). A classifier that produces more true positives and true negatives (in other words, fewer false positives and false negatives) has better predictive power and incurs less business cost. As a data scientist, I usually first define the cost of a false negative and the cost of a false positive from a business standpoint, so that I can optimize the threshold accordingly and balance the trade-off. In a nutshell, both precision and recall are metrics for evaluating predictive performance. Below are the detailed definitions of precision and recall. precision = TP/(TP + FP); it tells how sure you are of your true positives. For exa...
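A small worked example of these definitions, using made-up labels and scores and scikit-learn's metrics, including the threshold sweep that produces the precision-recall curve:

```python
import numpy as np
from sklearn.metrics import precision_score, recall_score, precision_recall_curve

y_true  = np.array([1, 0, 1, 1, 0, 0, 1, 0, 1, 0])                         # toy labels
y_score = np.array([0.9, 0.4, 0.65, 0.8, 0.3, 0.55, 0.7, 0.2, 0.45, 0.6])  # predicted probabilities

y_pred = (y_score >= 0.5).astype(int)                  # default threshold of 0.5
print("precision:", precision_score(y_true, y_pred))   # TP / (TP + FP)
print("recall:   ", recall_score(y_true, y_pred))      # TP / (TP + FN)

# Sweep thresholds to trace out the precision-recall curve
precision, recall, thresholds = precision_recall_curve(y_true, y_score)
for p, r, t in zip(precision[:-1], recall[:-1], thresholds):
    print(f"threshold={t:.2f}  precision={p:.2f}  recall={r:.2f}")
```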

Something About Cross Validation

As we all know, it is important to split the dataset into a training set, a validation set, and a test set before building a model. The reasons for this split are: to avoid data leakage and prevent overfitting, and to tune the hyperparameters of the model to achieve better predictive performance. Each subset plays a different role in model building. The training set is used directly in the model training process. The validation set is also used while training the model, but mainly to tune the model hyperparameters and choose the best model. The test set is only used once the model is completely trained, and it is used to evaluate the final model performance. If you simply use the split training and validation sets to train / tune the model and use the test set to evaluate it (the validation set approach), you can only get one or at most two estimates of the model performance. What's more, since you reserve a certain percentage of the data for validation, ...
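A minimal sketch of the contrast this excerpt sets up, using scikit-learn's built-in breast cancer dataset as a stand-in: hold out a test set that is only touched at the end, then use 5-fold cross validation on the rest to get several performance estimates instead of just one.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score, train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_breast_cancer(return_X_y=True)

# Hold out a test set that is only used at the very end
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

model = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))

# 5-fold cross validation: each fold takes a turn as the validation set,
# so we get five performance estimates instead of one
scores = cross_val_score(model, X_train, y_train, cv=5)
print("fold accuracies:", scores)
print("mean accuracy:  ", scores.mean())

# Final evaluation on the untouched test set
model.fit(X_train, y_train)
print("test accuracy:  ", model.score(X_test, y_test))
```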

Build an autocorrect model to correct misspelled words

Autocorrect is an application that changes misspelled words into correct ones. You have it on your phone and on your computer, inside your document editors as well as email applications. For example, it takes a sentence like “happy birthday deah friend.” and corrects the misspelled word deah to dear. We can use a simple yet powerful model to solve this kind of problem. One thing to notice here is that we are not searching for contextual errors, just spelling errors, so a word like deer will pass through this filter, as it is spelled correctly, regardless of how odd the context may seem. This simple yet powerful autocorrect model works in 4 key steps: (1) identify a misspelled word, (2) find strings n edit distances away, (3) filter candidates, and (4) calculate word probabilities. Now, let's look at the details of how each step is implemented. Step 1: Identify a misspelled word. How does the machine know if a word is misspelled or not? Simply put, if a word is not found in a dictionary, th...
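A toy sketch of the four steps, with a hypothetical edits_one_away helper and made-up word probabilities (a real model would estimate the probabilities from a corpus):

```python
def edits_one_away(word, alphabet="abcdefghijklmnopqrstuvwxyz"):
    """All strings one edit (delete, switch, replace, insert) away from `word`."""
    splits   = [(word[:i], word[i:]) for i in range(len(word) + 1)]
    deletes  = [left + right[1:] for left, right in splits if right]
    switches = [left + right[1] + right[0] + right[2:] for left, right in splits if len(right) > 1]
    replaces = [left + c + right[1:] for left, right in splits if right for c in alphabet]
    inserts  = [left + c + right for left, right in splits for c in alphabet]
    return set(deletes + switches + replaces + inserts)

# Toy vocabulary with word probabilities (step 4 would estimate these from a corpus)
vocab = {"dear": 0.004, "deer": 0.001, "dead": 0.002, "death": 0.003}

word = "deah"
if word not in vocab:                                  # step 1: not in the dictionary -> misspelled
    candidates = edits_one_away(word) & vocab.keys()   # steps 2-3: generate and filter candidates
    best = max(candidates, key=vocab.get)              # step 4: pick the most probable word
    print(best)                                        # -> dear with these toy probabilities
```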

Comparison between Logistic Regression and Support Vector Machine

Logistic regression and the support vector machine (SVM) are both popular models that can be applied to classification tasks. This article gives an introduction to these two methods and summarizes the differences between them. What is Logistic Regression? Logistic regression is a generalized linear model for binary classification. In logistic regression, we take the output of a linear function and pass the value to the sigmoid function. The sigmoid function is S-shaped; it is a bounded and differentiable activation function. We use the sigmoid function in logistic regression because it can take any real-valued number and map it into a value between 0 and 1. Since the probability of any event lies between 0 and 1, the sigmoid function is an intuitive and natural choice for logistic regression. After we get the probabilities, we then set a threshold to make decisions: if the probability is greater than the threshold, we assign the label 1, else we a...
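A tiny illustration of the sigmoid-plus-threshold pipeline described above, with made-up weights and one made-up observation:

```python
import numpy as np

def sigmoid(z):
    """Map any real-valued number into the (0, 1) range."""
    return 1.0 / (1.0 + np.exp(-z))

# Illustrative weights, bias, and a single observation
w = np.array([0.8, -1.2, 0.5])
b = 0.1
x = np.array([1.5, 0.3, 2.0])

z = np.dot(w, x) + b          # output of the linear function
p = sigmoid(z)                # probability of the positive class
label = int(p > 0.5)          # threshold at 0.5 to make the decision
print(p, label)
```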

How to use Naive Bayes to perform sentiment analysis

Naive Bayes has a wide range of applications in sentiment analysis, author identification, information retrieval, and word disambiguation. It's a very popular method since it's relatively simple to train, easy to use, and easy to interpret. In this article, I will summarize how to use Naive Bayes to perform sentiment analysis on text data. General Idea: Bayes' rule can be used to determine whether a single word is positive or negative. How about determining whether a piece of writing is positive or negative? When you use Naive Bayes to predict the sentiment of a piece of writing, what you're actually doing is estimating the probability of each sentiment from the joint probability of the words under each sentiment. The Naive Bayes inference for this kind of binary classification is just the ratio between these two probabilities, the product of the prior ratio and the likelihood ratios: score = P(pos)/P(neg) * Π_i P(w_i|pos)/P(w_i|neg). However, this formula is a product of ratios, which brings a risk of underflow. To solve this problem, you can t...
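A minimal sketch of the log-space trick the excerpt is leading up to, with a made-up two-documents-per-class corpus and Laplace smoothing (my own illustration, not the post's code): sum the log prior ratio and log likelihood ratios instead of multiplying raw ratios, so the score cannot underflow.

```python
import math
from collections import Counter

# Tiny toy corpus (hypothetical counts, just to illustrate the mechanics)
pos_docs = ["happy great happy movie", "great fun"]
neg_docs = ["sad bad movie", "bad boring bad"]

pos_counts = Counter(w for d in pos_docs for w in d.split())
neg_counts = Counter(w for d in neg_docs for w in d.split())
vocab = set(pos_counts) | set(neg_counts)

def log_likelihood_ratio(word):
    # Laplace smoothing so unseen words don't zero out the probabilities
    p_pos = (pos_counts[word] + 1) / (sum(pos_counts.values()) + len(vocab))
    p_neg = (neg_counts[word] + 1) / (sum(neg_counts.values()) + len(vocab))
    return math.log(p_pos / p_neg)

log_prior = math.log(len(pos_docs) / len(neg_docs))    # ratio of the class priors

def predict(text):
    # Sum of logs instead of a product of ratios -> no underflow
    score = log_prior + sum(log_likelihood_ratio(w) for w in text.split() if w in vocab)
    return "positive" if score > 0 else "negative"

print(predict("great happy movie"))   # -> positive
print(predict("bad boring movie"))    # -> negative
```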