
How to use logistic regression to perform sentiment analysis



Introduction

Sentiment analysis is the process of determining whether a piece of writing is positive, negative, or neutral. It can be used to better understand users and improve products. For example, you can gauge customers' experience with products on Amazon from their online feedback, or identify Twitter users' sentiment from the tweets they post.


Using logistic regression to perform sentiment analysis is straightforward; only 3 steps are needed:

  1. Text data preprocessing

  2. Feature Extraction

  3. Building a logistic regression model


Now, let’s take a look at the details of each step. 

Step 1: Text Data Preprocessing

There are several steps in text data preprocessing (a code sketch follows this list):

  • Tokenizing the string

    • This means splitting the string into individual words, without blanks or tabs

  • Lowercasing

    • ‘HAPPY’, ‘Happy’ -> ‘happy’

  • Removing stop words and punctuation

    • Stop words: 'is', 'and', 'a', ‘are’, …

    • Punctuation: ',', '.', ':', '!', ...

  • Removing handles and URLs if necessary

    • @Nickname, https://...

  • Stemming

    • 'tuned', 'tuning', 'tune' -> 'tun'
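
For concreteness, here is a minimal preprocessing sketch in Python. It assumes NLTK (TweetTokenizer, stopwords, Porter stemmer) and a couple of regular expressions for handles and URLs; the example text is made up, and other tokenizers or stemmers would work just as well.

    import re
    import string

    import nltk
    from nltk.corpus import stopwords
    from nltk.stem import PorterStemmer
    from nltk.tokenize import TweetTokenizer

    nltk.download('stopwords', quiet=True)

    def preprocess(text):
        # Remove handles and URLs
        text = re.sub(r'@\w+', '', text)
        text = re.sub(r'https?://\S+', '', text, flags=re.IGNORECASE)
        # Tokenize and lowercase
        tokenizer = TweetTokenizer(preserve_case=False, strip_handles=True, reduce_len=True)
        tokens = tokenizer.tokenize(text)
        # Remove stop words and punctuation, then stem
        stemmer = PorterStemmer()
        stop_words = set(stopwords.words('english'))
        return [stemmer.stem(tok) for tok in tokens
                if tok not in stop_words and tok not in string.punctuation]

    print(preprocess("@Nickname I am SO HAPPY with the new tuning! https://example.com"))
    # roughly: ['happi', 'new', 'tune']
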

Step 2: Feature Extraction

The goal is to create an M x 3 feature matrix for all pieces of text data. Each row in the feature matrix is the feature vector for one writing. M is the number of writings, and 3 is the number of features:

  1. Bias unit, equal to 1 for every writing

  2. Sum of positive frequencies of every unique word in writing m

  3. Sum of negative frequencies of every unique word in writing m




To get this feature matrix, you need to build a frequency dictionary from the processed text data. The keys in this dictionary are (word, sentiment) pairs, and the values are the frequencies of those pairs.


Take the word ‘cake’ for example: the dictionary counts the frequency of ‘cake’ in positive writings under the key (‘cake’, 1), and it counts the frequency of ‘cake’ in negative writings under the key (‘cake’, 0).


Once you have this frequency dictionary and the processed text from step 1, you can build a vector representation for each writing by looking its words up in the dictionary. Each vector has a bias unit plus two additional features: the sum of the frequencies with which the unique words in the processed writing appear in positive writings, and the sum of the frequencies with which they appear in negative ones.
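
As a concrete sketch (not necessarily the course's exact implementation), the frequency dictionary and the M x 3 feature matrix can be built like this with NumPy; the toy writings, labels, and function names are hypothetical.

    import numpy as np

    def build_freqs(processed_writings, labels):
        # Map each (word, sentiment) pair to its count across all writings
        freqs = {}
        for tokens, y in zip(processed_writings, labels):
            for word in tokens:
                freqs[(word, y)] = freqs.get((word, y), 0) + 1
        return freqs

    def extract_features(tokens, freqs):
        # Feature vector: [bias, sum of positive counts, sum of negative counts]
        x = np.zeros(3)
        x[0] = 1  # bias unit
        for word in set(tokens):  # unique words only
            x[1] += freqs.get((word, 1), 0)
            x[2] += freqs.get((word, 0), 0)
        return x

    # Hypothetical toy data: each writing is already preprocessed (step 1)
    writings = [['happi', 'cake'], ['sad', 'cake'], ['happi', 'tune']]
    labels = [1, 0, 1]

    freqs = build_freqs(writings, labels)
    X = np.array([extract_features(w, freqs) for w in writings])  # shape (M, 3)

In this toy dictionary, (‘happi’, 1) maps to 2 and (‘cake’, 0) maps to 1, so the first writing ['happi', 'cake'] gets the feature vector [1, 3, 1].
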


The flow chart below shows the process of step 1 and step 2.


[Flow chart: Vocabulary from Raw Data → Frequency of Word in Positive Writings / Frequency of Word in Negative Writings → Frequency Dictionary mapping (word, sentiment) to frequency → Feature Matrix]


Step 3: Build a logistic regression model

  • Use the feature matrix mentioned above as the input matrix

  • The target variable is binary in this case, telling whether a writing is positive (1) or negative (0)

  • The logistic regression model aims to learn the set of estimated parameters for the 3 features that minimizes the cost function. During the learning process, the estimated parameters are updated in the direction opposite to the gradient of the cost function

  • This process repeats for a set number of iterations, by which point the parameters reach a point near the optimal cost (the global minimum, since the logistic regression cost function is convex) and training ends there. This algorithm is known as gradient descent

  • After the logistic regression model has been trained, in other words, once you have the best set of parameters, you pass the product of the feature matrix and the estimated parameters through the sigmoid function to predict your test writings and evaluate the results (a training sketch follows this list)

    • Sigmoid function: h(z) = 1 / (1 + e^(-z)), where z is the dot product of a writing's feature vector and the estimated parameters; the output is the predicted probability that the writing is positive
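
Below is a minimal gradient-descent training sketch for this 3-feature model; the learning rate and iteration count are illustrative assumptions, and in practice something like scikit-learn's LogisticRegression would give the same kind of result.

    import numpy as np

    def sigmoid(z):
        return 1.0 / (1.0 + np.exp(-z))

    def train_logistic_regression(X, y, alpha=1e-3, num_iters=1000):
        # X: (M, 3) feature matrix, y: (M,) labels in {0, 1}
        m = X.shape[0]
        theta = np.zeros(X.shape[1])
        for _ in range(num_iters):
            h = sigmoid(X @ theta)            # predicted probabilities
            grad = (X.T @ (h - y)) / m        # gradient of the cost function
            theta -= alpha * grad             # step against the gradient
        return theta

    def predict(X, theta):
        # Label a writing positive when its predicted probability exceeds 0.5
        return (sigmoid(X @ theta) > 0.5).astype(int)

    # Usage with the feature matrix X and labels from step 2 (names assumed):
    # theta = train_logistic_regression(X, np.array(labels))
    # y_pred = predict(X_test, theta)
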

Acknowledgements

I learned all of this from the course Natural Language Processing with Classification and Vector Spaces a few weeks ago, and all the pictures in this article are also from that course. I highly recommend the Natural Language Processing Specialization on Coursera. This article is part of my review after the course, and I will continue summarizing what I've learned next.



Code

https://github.com/hudan42/NLP/blob/master/LR_SentimentAnalysis.ipynb
