Skip to main content

Build an autocorrect model to correct misspelled words


Autocorrect is an application that changes misspelled words into correct ones. You have it on the phone and on your computer inside your document editors, as well as email applications. For example, it takes a sentence like this one, “happy birthday deah friend.”, and corrects the misspelled word deah to dear.

We can use a simple yet powerful model to solve this kind of problem. One thing to notice here is that you are not searching for contextual errors, just spelling errors,  So words like deer will pass through this filter, as it is spelled correctly, regardless of how the context may seem.

This simple yet powerful autocorrect model works in 4 key steps:

  1. Identify a misspelled word
  2. Find strings n edit distance away
  3. Filter candidates
  4. Calculate word probabilities

Now, let’s look at the details of how each step is implemented.


Step 1: Identify a misspelled word

How does the machine know if a word is misspelled or not? 

Simply put, if a word is not given in a dictionary, then the machine can flag it as a misspelled word, which needs correction.


Step 2: Find strings n edit distance away

When saying n edit distance, it means an edit distance of n, n can be 1, 2, 3, and so on.

An edit is a type of operation performed on a string to change it into another string. And edit distance counts the number of these operations. The n edit distance tells you how many operations away one string is from another.

There are 4 types of edit operations:

  1. Insert (add a letter)                 ‘to’ → ‘top’, ‘too’
  2. Delete (remove a letter)                 ‘heat’ → ‘hat’, ‘eat’’
  3. Switch (swap 2 adjacent letters)         ‘eat’ → ‘eta’, ‘aet’
  4. Replace (change 1 letter to another) ‘power’ → ‘tower’, ‘poker’

You can modify any string by using these 4 edits. You can find a list of all available strings that are n edit distance away by the combination of these operations. For autocorrect, n is usually 1 to 3.


Step 3: Filter candidates

After you get the list of possible strings from step 2, you only want to keep those real, correctly spelled words. So once again, just like what you did in step 1, check every string in the list to see if it is in a known dictionary.

As a result, you are left with a list of real words only.


Step 4: Calculate word probabilities

In Step 4, you will calculate the probabilities of each real word in the list, which tell you how likely each word is to appear in the context and choose the most likely word to be the substitution word.  For example, the word "and" is more common than the word "ant" in any given body of texts, also called a corpus. This is how an autocorrect model knows which word to substitute for the misspelled one.



How to calculate word probabilities?

  1. Calculate the total number of words in the corpus (V)
  2. Calculate the word frequencies (C(w))
  3. Probability of a word = C(w) / V

Here’s a simple example, the corpus here is defined as one sentence: I am happy because I am learning.


As long as you get all the probabilities of words in the list, you can find the substitute word with the highest probability, and that’s it!


Acknowledgements

All the pictures in this article are from the course “ Natural Language Processing with Probabilistic Models”.






Comments

Popular posts from this blog

Comparison between Logistic Regression and Support Vector Machine

Logistic regression and support vector machine (SVM) are both popular models that can be applied to classification tasks. This article gives the introduction of these two methods and summarizes the differences between them. What is Logistic Regression? Logistic regression is a generalized linear model for binary classification. In logistic regression, we take the output of the linear function and then pass the value to the sigmoid function. The sigmoid function is S-shaped,  it is a bounded and differentiable activation function.  We use sigmoid function in logistic regression because it can take any real-valued number and map it into a value between the range of 0 and 1, as is known to all, the probability of any event is between 0 and 1, so sigmoid function is an intuitive and right choice for logistic regression. After we get the probabilities, we then set a threshold to make decisions, if the probability is greater than the threshold, we assign it a label 1, else we a...

Something About Cross Validation

As we all know, it is important to split the dataset into a training set, validation set and test set before building a model. The reasons for the preprocessing are: To avoid data leakage and prevent overfitting problem To tune the hyperparameters in the model to achieve better predictive performance Each of the dataset after the split plays a different role in the model building. Training set is used directly in the model training process. Validation set is also used in training the model but mainly to tune the model hyperparameters and choose the best model. Test set is only used when the model is completely trained, and it is used to evaluate the final model performance. If you simply use the splitted training set, validation set to train / tune the model, and use the test set to evaluate the model ( validation set approach ), you can only have one or  at most two estimates of the model performance. What’s more, since you reserve a certain percentage of the data for validation, ...