
Build an autocorrect model to correct misspelled words


Autocorrect is an application that changes misspelled words into correct ones. You have it on your phone and on your computer, inside document editors as well as email applications. For example, it takes a sentence like “happy birthday deah friend.” and corrects the misspelled word deah to dear.

We can use a simple yet powerful model to solve this kind of problem. One thing to notice is that we are only looking for spelling errors, not contextual errors. So a word like deer would pass through this filter, because it is spelled correctly, regardless of whether it fits the context.

This simple yet powerful autocorrect model works in 4 key steps:

  1. Identify a misspelled word
  2. Find strings n edit distance away
  3. Filter candidates
  4. Calculate word probabilities

Now, let’s look at the details of how each step is implemented.


Step 1: Identify a misspelled word

How does the machine know if a word is misspelled or not? 

Simply put, if a word is not found in a dictionary (a vocabulary of known words), the machine flags it as misspelled and in need of correction.
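
Here is a minimal sketch of that check in Python. The tiny hard-coded vocabulary and the function name is_misspelled are only for illustration; in practice the vocabulary would come from a dictionary file or a large corpus.

  # A tiny vocabulary for illustration; in practice you would build this set
  # from a dictionary file or from the unique words of a large corpus.
  vocab = {"happy", "birthday", "dear", "friend", "deer"}

  def is_misspelled(word, vocab):
      # A word is flagged as misspelled if it is not in the vocabulary.
      return word.lower() not in vocab

  print(is_misspelled("deah", vocab))   # True  -> needs correction
  print(is_misspelled("dear", vocab))   # False -> passes through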


Step 2: Find strings n edit distance away

Saying “n edit distance” means an edit distance of n, where n can be 1, 2, 3, and so on.

An edit is an operation performed on a string to change it into another string, and the edit distance counts the number of these operations. In other words, the edit distance tells you how many operations away one string is from another.

There are 4 types of edit operations:

  1. Insert (add a letter):                  ‘to’ → ‘top’, ‘too’
  2. Delete (remove a letter):               ‘heat’ → ‘hat’, ‘eat’
  3. Switch (swap two adjacent letters):     ‘eat’ → ‘eta’, ‘aet’
  4. Replace (change one letter to another): ‘power’ → ‘tower’, ‘poker’

You can modify any string using these four edits, and by combining them you can generate a list of all strings that are n edit distance away from the misspelled word. For autocorrect, n is usually 1 to 3.
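
Below is a sketch of how all the strings one edit away could be generated with these four operations; the function name edits_one and the lowercase-only alphabet are illustrative assumptions. Strings two edits away can then be obtained by applying the same function again to each of its own results.

  import string

  def edits_one(word):
      # All strings that are exactly one edit away from `word`.
      letters = string.ascii_lowercase
      splits = [(word[:i], word[i:]) for i in range(len(word) + 1)]

      inserts  = {left + c + right for left, right in splits for c in letters}
      deletes  = {left + right[1:] for left, right in splits if right}
      switches = {left + right[1] + right[0] + right[2:]
                  for left, right in splits if len(right) > 1}
      replaces = {left + c + right[1:] for left, right in splits if right for c in letters}

      return inserts | deletes | switches | replaces

  print(len(edits_one("deah")))   # number of candidate strings one edit away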


Step 3: Filter candidates

After you get the list of possible strings from step 2, you only want to keep the real, correctly spelled words. So, just as in step 1, check every string in the list to see whether it appears in a known dictionary.

As a result, you are left with a list of real words only.
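
A minimal sketch of this filtering step, reusing the illustrative vocab and edits_one from the sketches above:

  # vocab and edits_one come from the step 1 and step 2 sketches above.
  def filter_candidates(candidates, vocab):
      # Keep only the candidate strings that are real words in the vocabulary.
      return {word for word in candidates if word in vocab}

  real_words = filter_candidates(edits_one("deah"), vocab)
  print(real_words)   # {'dear'} with the tiny vocabulary above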


Step 4: Calculate word probabilities

In step 4, you calculate the probability of each real word in the list, which tells you how likely each word is to appear, and then choose the most likely word as the substitution. For example, the word "and" is more common than the word "ant" in almost any body of text, also called a corpus. This is how an autocorrect model knows which word to substitute for the misspelled one.



How do you calculate word probabilities?

  1. Count the total number of words in the corpus, V
  2. Count the frequency of each word, C(w)
  3. The probability of a word is then P(w) = C(w) / V

Here’s a simple example, where the corpus is defined as a single sentence: “I am happy because I am learning.”
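
Working through this example: the corpus has V = 7 words in total, “I” and “am” each appear twice, and the other words appear once, so for instance P(“am”) = 2/7 ≈ 0.29. A minimal sketch of the same computation in Python:

  from collections import Counter

  corpus = "I am happy because I am learning"
  words = corpus.lower().split()

  V = len(words)                                   # total number of words: 7
  counts = Counter(words)                          # C(w): frequency of each word
  probs = {w: c / V for w, c in counts.items()}    # P(w) = C(w) / V

  print(counts["am"], probs["am"])                 # 2 0.2857...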


Once you have the probabilities of all the words in the list, you can pick the candidate with the highest probability as the substitute, and that’s it!
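
Putting the four steps together, here is a hedged end-to-end sketch that reuses the illustrative vocab, edits_one, and probs from the earlier snippets; restricting candidates to one edit away is an assumption made to keep the example short.

  def correct_word(word, vocab, probs):
      # Return the most probable real-word candidate,
      # or the word itself if it is not misspelled.
      if word in vocab:                    # step 1: already a real word
          return word
      candidates = edits_one(word)         # step 2: strings one edit away
      real_words = candidates & vocab      # step 3: keep real words only
      if not real_words:
          return word                      # no candidate found; leave unchanged
      return max(real_words, key=lambda w: probs.get(w, 0))   # step 4

  print(correct_word("deah", vocab, probs))   # 'dear'

In practice, the vocabulary and the word probabilities would both be built from the same large corpus, so every word in the vocabulary has a nonzero probability.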


Acknowledgements

All the pictures in this article are from the course “Natural Language Processing with Probabilistic Models”.





