
Build an autocorrect model to correct misspelled words


Autocorrect is an application that changes misspelled words into correct ones. You have it on your phone and on your computer, inside document editors as well as email applications. For example, it takes a sentence like “happy birthday deah friend.” and corrects the misspelled word deah to dear.

We can use a simple yet powerful model to solve this kind of problem. One thing to notice is that we are only looking for spelling errors, not contextual errors. So a word like deer would pass through this filter, because it is spelled correctly, regardless of whether it fits the context.

This simple yet powerful autocorrect model works in 4 key steps:

  1. Identify a misspelled word
  2. Find strings n edit distance away
  3. Filter candidates
  4. Calculate word probabilities

Now, let’s look at the details of how each step is implemented.


Step 1: Identify a misspelled word

How does the machine know if a word is misspelled or not? 

Simply put, if a word is not found in a dictionary (a vocabulary of known words), the machine flags it as misspelled and in need of correction.
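
Here is a minimal sketch of that check in Python. The tiny hard-coded vocabulary and the function name is_misspelled are only for illustration; in practice the vocabulary would come from a dictionary file or a large corpus.

  # A tiny vocabulary for illustration; in practice you would build this set
  # from a dictionary file or from the unique words of a large corpus.
  vocab = {"happy", "birthday", "dear", "friend", "deer"}

  def is_misspelled(word, vocab):
      # A word is flagged as misspelled if it is not in the vocabulary.
      return word.lower() not in vocab

  print(is_misspelled("deah", vocab))   # True  -> needs correction
  print(is_misspelled("dear", vocab))   # False -> passes through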


Step 2: Find strings n edit distance away

Saying “n edit distance” means an edit distance of n, where n can be 1, 2, 3, and so on.

An edit is an operation performed on a string to change it into another string, and the edit distance counts the number of these operations. In other words, the edit distance tells you how many operations away one string is from another.

There are 4 types of edit operations:

  1. Insert (add a letter):                  ‘to’ → ‘top’, ‘too’
  2. Delete (remove a letter):               ‘heat’ → ‘hat’, ‘eat’
  3. Switch (swap two adjacent letters):     ‘eat’ → ‘eta’, ‘aet’
  4. Replace (change one letter to another): ‘power’ → ‘tower’, ‘poker’

You can modify any string using these four edits, and by combining them you can generate a list of all strings that are n edit distance away from the misspelled word. For autocorrect, n is usually 1 to 3.
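
Below is a sketch of how all the strings one edit away could be generated with these four operations; the function name edits_one and the lowercase-only alphabet are illustrative assumptions. Strings two edits away can then be obtained by applying the same function again to each of its own results.

  import string

  def edits_one(word):
      # All strings that are exactly one edit away from `word`.
      letters = string.ascii_lowercase
      splits = [(word[:i], word[i:]) for i in range(len(word) + 1)]

      inserts  = {left + c + right for left, right in splits for c in letters}
      deletes  = {left + right[1:] for left, right in splits if right}
      switches = {left + right[1] + right[0] + right[2:]
                  for left, right in splits if len(right) > 1}
      replaces = {left + c + right[1:] for left, right in splits if right for c in letters}

      return inserts | deletes | switches | replaces

  print(len(edits_one("deah")))   # number of candidate strings one edit away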


Step 3: Filter candidates

After you get the list of possible strings from step 2, you only want to keep the real, correctly spelled words. So, just as in step 1, check every string in the list to see whether it appears in a known dictionary.

As a result, you are left with a list of real words only.
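
A minimal sketch of this filtering step, reusing the illustrative vocab and edits_one from the sketches above:

  # vocab and edits_one come from the step 1 and step 2 sketches above.
  def filter_candidates(candidates, vocab):
      # Keep only the candidate strings that are real words in the vocabulary.
      return {word for word in candidates if word in vocab}

  real_words = filter_candidates(edits_one("deah"), vocab)
  print(real_words)   # {'dear'} with the tiny vocabulary above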


Step 4: Calculate word probabilities

In step 4, you calculate the probability of each real word in the list, which tells you how likely each word is to appear, and then choose the most likely word as the substitution. For example, the word "and" is more common than the word "ant" in almost any body of text, also called a corpus. This is how an autocorrect model knows which word to substitute for the misspelled one.



How do you calculate word probabilities?

  1. Count the total number of words in the corpus, V
  2. Count the frequency of each word, C(w)
  3. The probability of a word is then P(w) = C(w) / V

Here’s a simple example, where the corpus is defined as a single sentence: “I am happy because I am learning.”
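
Working through this example: the corpus has V = 7 words in total, “I” and “am” each appear twice, and the other words appear once, so for instance P(“am”) = 2/7 ≈ 0.29. A minimal sketch of the same computation in Python:

  from collections import Counter

  corpus = "I am happy because I am learning"
  words = corpus.lower().split()

  V = len(words)                                   # total number of words: 7
  counts = Counter(words)                          # C(w): frequency of each word
  probs = {w: c / V for w, c in counts.items()}    # P(w) = C(w) / V

  print(counts["am"], probs["am"])                 # 2 0.2857...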


Once you have the probabilities of all the words in the list, you can pick the candidate with the highest probability as the substitute, and that’s it!
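
Putting the four steps together, here is a hedged end-to-end sketch that reuses the illustrative vocab, edits_one, and probs from the earlier snippets; restricting candidates to one edit away is an assumption made to keep the example short.

  def correct_word(word, vocab, probs):
      # Return the most probable real-word candidate,
      # or the word itself if it is not misspelled.
      if word in vocab:                    # step 1: already a real word
          return word
      candidates = edits_one(word)         # step 2: strings one edit away
      real_words = candidates & vocab      # step 3: keep real words only
      if not real_words:
          return word                      # no candidate found; leave unchanged
      return max(real_words, key=lambda w: probs.get(w, 0))   # step 4

  print(correct_word("deah", vocab, probs))   # 'dear'

In practice, the vocabulary and the word probabilities would both be built from the same large corpus, so every word in the vocabulary has a nonzero probability.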


Acknowledgements

All the pictures in this article are from the course “Natural Language Processing with Probabilistic Models”.





