
How to use Naive Bayes to perform sentiment analysis


Naive Bayes has a wide range of applications in sentiment analysis, author identification, information retrieval, and word disambiguation. It is a popular method because it is relatively simple to train and easy to use and interpret. In this article, I will summarize how to use Naive Bayes to perform sentiment analysis on text data.


General Idea 

Bayes Rule for determining if a word is positive or negative:
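For a single word w, this is:

\[
P(\mathrm{pos} \mid w) = \frac{P(w \mid \mathrm{pos})\, P(\mathrm{pos})}{P(w)}
\]

and the analogous expression holds with neg in place of pos.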

How about determining whether a whole piece of writing is positive or negative?

When you use Naive Bayes to predict the sentiment of a piece of writing, what you're actually doing is estimating the probability of each sentiment from the joint probability of the words given that sentiment. The Naive Bayes formula is just the ratio between these two quantities, each of which is the product of a prior and the word likelihoods.


The Naive Bayes inference formula for this kind of binary classification is as follows.
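For a writing made up of the words w_1, …, w_m, the classifier predicts positive when

\[
\frac{P(\mathrm{pos})}{P(\mathrm{neg})} \prod_{i=1}^{m} \frac{P(w_i \mid \mathrm{pos})}{P(w_i \mid \mathrm{neg})} > 1
\]

and negative otherwise.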

However, the formula above is a product of ratios, which brings a risk of numerical underflow. To avoid this, you can take the logarithm of the formula, which turns the product into a sum of the logarithms of the ratios:
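\[
\log \frac{P(\mathrm{pos})}{P(\mathrm{neg})} + \sum_{i=1}^{m} \log \frac{P(w_i \mid \mathrm{pos})}{P(w_i \mid \mathrm{neg})} > 0
\]

This log version of the decision rule is the "formula B" referred to in the rest of this article.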


You now have the Naive Bayes formula for predicting the sentiment of a piece of writing. The remaining problem is to obtain the ratios in formula B.


Only five steps are needed to get the answer with Naive Bayes:

  1. Preprocess text data

  2. Compute (word, sentiment) frequency

  3. Calculate probability of (word, sentiment) with Laplacian Smoothing

  4. Get lambdas and log prior

  5. Predict the sentiment

Among them, step 1 and step 2 are the same as the first two steps in my first article.

Now, let’s see what’s actually happening in each step.

Step 1: Preprocess text data

There are several steps in text data preprocessing:

  • Tokenizing the string

    • This means splitting the string into individual words, removing blanks and tabs

  • Lowercasing

    • ‘HAPPY’, ‘Happy’ -> ‘happy’

  • Removing stop words and punctuation

    • Stop words: 'is', 'and', 'a', ‘are’, …

    • Punctuation: ',', '.', ':', '!', ...

  • Removing handles and URLs if necessary

    • @Nickname, https://...

  • Stemming

    • 'tuned', 'tuning', 'tune' -> 'tun'

After these steps, each piece of text is reduced to a short list of lowercase word stems.
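Here is a minimal sketch of such a pipeline in Python using NLTK; the helper name process_text and the example sentence are my own, and any tokenizer, stop-word list, and stemmer with the behaviour described above would do:

```python
import re
import string

from nltk.corpus import stopwords          # requires nltk.download('stopwords') once
from nltk.stem import PorterStemmer
from nltk.tokenize import TweetTokenizer

def process_text(text):
    """Tokenize, lowercase, remove stop words/punctuation/handles/URLs, and stem."""
    # Remove URLs; the tokenizer below strips @handles and lowercases for us
    text = re.sub(r'https?://\S+', '', text)

    tokenizer = TweetTokenizer(preserve_case=False, strip_handles=True, reduce_len=True)
    tokens = tokenizer.tokenize(text)

    stop_words = set(stopwords.words('english'))
    stemmer = PorterStemmer()
    return [stemmer.stem(tok) for tok in tokens
            if tok not in stop_words and tok not in string.punctuation]

print(process_text("@Nickname I am HAPPY because I am tuning the model! https://example.com"))
# -> e.g. ['happi', 'tune', 'model'] (exact stems depend on the stemmer)
```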


Step 2: Compute (word_i, sentiment) frequency

This is the same process of building a frequency dictionary from the processed text data as described in my last article. The keys of the dictionary are (word, sentiment) pairs, and the values are the frequencies of those pairs.


The resulting frequency table has one count for every (word, sentiment) pair observed in the corpus.
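A minimal sketch of this step, reusing the hypothetical process_text helper from Step 1 on a made-up two-sentence corpus (label 1 for positive, 0 for negative):

```python
from collections import defaultdict

def build_freqs(texts, labels):
    """Count how often each (word, sentiment) pair occurs in the corpus."""
    freqs = defaultdict(int)
    for text, label in zip(texts, labels):
        for word in process_text(text):      # helper from the Step 1 sketch
            freqs[(word, label)] += 1
    return freqs

texts = ["I am happy because I am learning", "I am sad, I am not learning"]   # toy examples
labels = [1, 0]                                                               # 1 = positive, 0 = negative
freqs = build_freqs(texts, labels)
# e.g. freqs[('happi', 1)] == 1 and freqs[('learn', 0)] == 1
```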


Step 3: Calculate probability of (word_i, sentiment) with Laplacian Smoothing 

Laplacian smoothing is a technique to avoid probabilities being zero by making a slight change to the formula: 

  • Adding 1 to the numerator so that no probability is exactly 0

  • Adding the total number of unique words in the vocabulary to the denominator, so that all the probabilities in each column still sum up to 1

  • Probability with Laplacian Smoothing: 
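Written out, with freq(w_i, class) the count from Step 2, N_class the total number of words counted in that class, and V the number of unique words in the vocabulary:

\[
P(w_i \mid \mathrm{class}) = \frac{\mathrm{freq}(w_i, \mathrm{class}) + 1}{N_{\mathrm{class}} + V}
\]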


Applying this formula to every (word, sentiment) pair gives the smoothed probability table, in which no entry is zero.


Step 4: Get lambdas and log prior

Lambda is the logarithm of the ratio between the probability of a word given the positive class and the probability of the same word given the negative class:
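\[
\lambda(w) = \log \frac{P(w \mid \mathrm{pos})}{P(w \mid \mathrm{neg})}
\]

where the conditional probabilities are the smoothed values from Step 3.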


The log prior is the logarithm of the ratio between the prior probabilities of the two sentiments; it is the first term in formula B. If the dataset is balanced, the log prior is 0, but the term becomes important when the dataset is unbalanced.
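In symbols, with D_pos and D_neg the numbers of positive and negative writings in the training set:

\[
\mathrm{logprior} = \log \frac{P(\mathrm{pos})}{P(\mathrm{neg})} = \log \frac{D_{\mathrm{pos}}}{D_{\mathrm{neg}}}
\]

Putting Steps 3 and 4 together, here is a minimal training sketch; the function name train_naive_bayes is my own, and it reuses the freqs dictionary and the 1/0 label convention from the Step 2 sketch:

```python
from math import log

def train_naive_bayes(freqs, labels):
    """Compute the log prior and per-word lambdas from the (word, sentiment) counts."""
    vocab = {word for word, _ in freqs}
    V = len(vocab)

    # Total number of words counted in each class
    n_pos = sum(count for (_, label), count in freqs.items() if label == 1)
    n_neg = sum(count for (_, label), count in freqs.items() if label == 0)

    # Laplacian-smoothed conditional probabilities and their log ratios (lambdas)
    lambdas = {}
    for word in vocab:
        p_w_pos = (freqs.get((word, 1), 0) + 1) / (n_pos + V)
        p_w_neg = (freqs.get((word, 0), 0) + 1) / (n_neg + V)
        lambdas[word] = log(p_w_pos / p_w_neg)

    # Log prior from the document counts (0 when the dataset is balanced)
    d_pos = sum(1 for y in labels if y == 1)
    d_neg = sum(1 for y in labels if y == 0)
    logprior = log(d_pos / d_neg)

    return logprior, lambdas
```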


Step 5: Predict the sentiment of a writing

Everything is ready now! 

You can now predict the sentiment of a piece of writing by summing the lambdas of all the words that appear in it, plus the log prior term. Based on formula B, the decision threshold is 0: if the score is greater than 0, the writing is classified as positive, otherwise as negative.
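A minimal prediction sketch, continuing with the hypothetical helpers from the earlier steps:

```python
def predict_sentiment(text, logprior, lambdas):
    """Score = log prior + sum of lambdas of the words in the text; threshold at 0."""
    score = logprior + sum(lambdas.get(word, 0.0) for word in process_text(text))
    return "positive" if score > 0 else "negative"

logprior, lambdas = train_naive_bayes(freqs, labels)   # objects from the previous sketches
print(predict_sentiment("I am learning and I am happy", logprior, lambdas))   # expected: positive
```

Words that never appeared in the training data are simply skipped here, since they contribute 0 to the score.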

Problems with the Naive Bayes method for sentiment analysis

Although Naive Bayes is easy to train and use, it has some problems that may cause potential misclassifications.


  • Naive Bayes assumes the words in a piece of text are independent of each other.

    • This assumption is not necessarily true, and it can lead you to under- or overestimate the conditional probabilities of individual words.


  • Naive Bayes relies on the distribution of the training data set.

    • Naive Bayes will be either too optimistic or too pessimistic if the training data set is very unbalanced, which is usually the case in reality: in real social media streams, positive cases tend to occur more often, since negative cases may contain violent or otherwise offensive content that is banned by the platform or muted by other users. So if you ever work with real-world data, it is better to check whether the data is balanced. The good news is that most of the available annotated corpora are artificially balanced, containing equal proportions of positive and negative tweets rather than the proportions a random sample would have.


  • Word order matters 

    • Word order matters to the sentiment; however, it is ignored by the Naive Bayes method described here.

    • For example, “I'm happy because I did not go.” is positive, while “I'm not happy because I did go.” is negative. The position of the word ‘not’ is crucial to the sentiment, but Naive Bayes cannot tell these two sentences apart.


  • Adversarial attack

    • This refers to language phenomena such as sarcasm, irony, and euphemism, for which Naive Bayes tends to give completely opposite predictions.

    • For example, “The movie was ridiculously powerful and I cried right through until the ending” is a positive comment on a movie, but after preprocessing you are left with a sequence of seemingly negative words, and Naive Bayes will give a negative prediction based on them.


Acknowledgements

All the pictures in this article are also from the course “Natural Language Processing with Classification and Vector Spaces”. I highly recommend the Natural Language Processing Specialization on Coursera. This article is part of my review after taking this course, and I will continue summarizing what I’ve learnt next.



