A statistical auto-correct system from scratch

Pashupati Gupta
6 min read · Mar 14, 2021

The task of an auto-correct system is to find out which words in a document are misspelled. These misspelled words might be presented to the user by underlining them. Correction is the task of substituting a correctly spelled word for the misspelling.

Modern auto-correct systems use high-end deep learning techniques that are a little complicated to understand. If interested, you can look into some of these papers.

In this blog post, we’ll look into a simple and intuitive auto-correct system based on probabilities. More details in the next sections.

View this project on -> Kaggle Link, GitHub Link

Note: If you are interested in looking into the code in parallel, I highly recommend viewing it on Kaggle.

Data Requirement

The very first requirement of the auto-correct system that we’ll be building is data. We need a trusted (good English and grammar) text corpus. There are several public domain text corpora available. Since this is an unsupervised type of problem, all we need is raw text. One can use any competition dataset or any other public dataset that has a text column. In the current version, I have used a small fraction of the wiki corpus.
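For concreteness, here is a minimal sketch (not the exact notebook code) of how a vocabulary and a word-probability distribution can be built from such a corpus; the file name corpus.txt is just a placeholder:

```python
import re
from collections import Counter

def build_vocab(corpus_path):
    """Read a plain-text corpus and return word counts and unigram probabilities."""
    with open(corpus_path, encoding="utf-8") as f:
        text = f.read().lower()
    words = re.findall(r"[a-z]+", text)           # simple tokenizer: keep alphabetic tokens
    word_counts = Counter(words)                  # frequency of each word in the corpus
    total = sum(word_counts.values())
    word_probs = {w: c / total for w, c in word_counts.items()}
    return word_counts, word_probs

# word_counts, word_probs = build_vocab("corpus.txt")  # "corpus.txt" is a placeholder path
```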

Architecture

This auto-correct architecture has 4 components (a minimal sketch of the pipeline follows the list) -
1) Filtering Misspells: One simple approach is to check whether a word is in the vocabulary or not. We’ll use the text corpus to define the vocabulary.
2) Word Suggestion Mechanism: This mechanism suggests candidate words based on deletion, insertion, switch, or replacement of one or two characters in the original word.
3) Probability Distribution Mechanism: A probability distribution {key(word) : value(probability)} is calculated using the text corpus. The probability of each candidate is looked up in this distribution, and the most probable candidate is the final suggestion.
4) Replace Misspells: Simply replace the misspelled word with the most probable suggestion.
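Here is a minimal sketch of these four components in Python. The function names and the tiny word_probs dictionary are mine, for illustration only; the full implementation is in the Kaggle/GitHub links above.

```python
letters = "abcdefghijklmnopqrstuvwxyz"

def edits_one(word):
    """All strings one edit (delete, switch, replace, insert) away from `word`."""
    splits   = [(word[:i], word[i:]) for i in range(len(word) + 1)]
    deletes  = [l + r[1:] for l, r in splits if r]
    switches = [l + r[1] + r[0] + r[2:] for l, r in splits if len(r) > 1]
    replaces = [l + c + r[1:] for l, r in splits if r for c in letters]
    inserts  = [l + c + r for l, r in splits for c in letters]
    return set(deletes + switches + replaces + inserts)

def edits_two(word):
    """All strings two edits away from `word`."""
    return {e2 for e1 in edits_one(word) for e2 in edits_one(e1)}

def correct(word, word_probs):
    """Return the most probable in-vocabulary candidate for `word`."""
    if word in word_probs:                                          # 1) filter misspells
        return word
    candidates = (edits_one(word) & word_probs.keys()) or \
                 (edits_two(word) & word_probs.keys()) or {word}    # 2) suggest candidates
    return max(candidates, key=lambda w: word_probs.get(w, 0))      # 3) + 4) most probable

# Tiny illustrative distribution; in practice this comes from the corpus.
word_probs = {"time": 0.01, "home": 0.004, "them": 0.006}
print(correct("hime", word_probs))   # -> 'time'
```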

This is a very simplified architecture compared to what is used in practice, so its performance will not be very good. But we can make some improvements to this architecture to get better results.

Drawbacks
- It has a fixed outcome, i.e. ‘hime’ will always be corrected to ‘time’ (because ‘time’ is a more frequent word and hence a more probable one), never to ‘home’ or anything else.
- It is solely based on the frequency of words in the corpus.
- It doesn’t care about the context.
- It doesn’t care about parts of speech, grammar, or punctuation.
- It can’t suggest anything that is not in the vocabulary.

Improvements
1) It can be further improved by introducing bi-gram probabilities, so that it gets some context from the previous words.
2) Suggestions that are a smaller edit distance away from the misspelled word are more likely. Hence, the system can be further improved by introducing a dynamic-programming-based min edit distance functionality.

Improvement 1: Introducing n-gram probabilities to get context from previous words

This idea is taken from n-gram-based language models. In an n-gram language model:
- We assume the probability of the next word depends only on the previous n-gram.
- The previous n-gram is the sequence of the previous ‘n’ words.

The conditional probability of the word at position ‘t’ in the sentence, given that the words preceding it are w(t-n), …, w(t-2), w(t-1), is:

P( w(t) | w(t-n)…w(t-1) )

This probability can be estimated by counting the occurrences of these sequences of words in the training data.
- The probability can be estimated as a ratio, where
- the numerator is the number of times the word w(t) appears right after the words w(t-n) through w(t-1) in the training data, and
- the denominator is the number of times the words w(t-n) through w(t-1) appear in the training data.

In other words, to estimate probabilities based on n-grams, divide the count of the (n+1)-gram (the numerator) by the count of the n-gram (the denominator):

P̂( w(t) | w(t-n)…w(t-1) ) = C( w(t-n)…w(t-1) w(t) ) / C( w(t-n)…w(t-1) )

- The function C(…) denotes the number of occurrences of the given sequence.
- P̂ (P hat) means the estimation of P.
- The denominator of the above equation is the number of occurrences of the previous n words, and the numerator is the same sequence followed by the word w(t).
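As a rough sketch (the names count_ngrams and estimate_probability are illustrative, not the notebook’s exact code), the counting and the ratio look like this:

```python
from collections import Counter

def count_ngrams(tokens, n):
    """Count how often each n-gram (tuple of n tokens) occurs in the token list."""
    counts = Counter()
    for i in range(len(tokens) - n + 1):
        counts[tuple(tokens[i:i + n])] += 1
    return counts

def estimate_probability(word, previous_ngram, ngram_counts, nplus1_counts):
    """P_hat( word | previous_ngram ) = C(previous_ngram + word) / C(previous_ngram)."""
    previous_ngram = tuple(previous_ngram)
    numerator = nplus1_counts.get(previous_ngram + (word,), 0)
    denominator = ngram_counts.get(previous_ngram, 0)
    return numerator / denominator if denominator > 0 else 0.0

tokens = "i like a cat i like a dog".split()
unigram_counts = count_ngrams(tokens, 1)
bigram_counts  = count_ngrams(tokens, 2)
print(estimate_probability("cat", ("a",), unigram_counts, bigram_counts))  # 0.5
```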

Now the issue with the above formula is that it doesn’t work when the count of an n-gram is zero.
- Suppose we encounter an n-gram that did not occur in the training data.
- Then the equation cannot be evaluated (it becomes zero divided by zero).

A way to handle zero counts is to add k-smoothing.
- K-smoothing adds a positive constant k to each numerator and k * |V| to each denominator, where |V| is the number of words in the vocabulary:

P̂( w(t) | w(t-n)…w(t-1) ) = ( C( w(t-n)…w(t-1) w(t) ) + k ) / ( C( w(t-n)…w(t-1) ) + k * |V| )

For n-grams that have a zero count, the above equation becomes k / (k * |V|) = 1/|V|.
- This means that any n-gram with a zero count has the same probability of 1/|V|.
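A minimal sketch of the k-smoothed estimate, assuming the same kind of illustrative count dictionaries as in the previous snippet:

```python
def estimate_probability_smoothed(word, previous_ngram, ngram_counts, nplus1_counts,
                                  vocab_size, k=1.0):
    """K-smoothed estimate: (C(previous_ngram + word) + k) / (C(previous_ngram) + k*|V|)."""
    previous_ngram = tuple(previous_ngram)
    numerator = nplus1_counts.get(previous_ngram + (word,), 0) + k
    denominator = ngram_counts.get(previous_ngram, 0) + k * vocab_size
    return numerator / denominator

# With both counts at zero, the estimate falls back to k / (k * |V|) = 1 / |V|.
print(estimate_probability_smoothed("unicorn", ("a",), {}, {}, vocab_size=5))  # 0.2
```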

Improvement 2: Introducing min_edit_distance functionality

The idea is derived from the intuition that suggestions that are a smaller edit distance away from the misspelled word are more likely. Hence, the system can be further improved by introducing a dynamic-programming-based min edit distance functionality.

So, given a source string source[0..i] and a target string target[0..j], we will compute the edit distance between every pair of their prefixes. To do this efficiently, we will use a table to keep the distances already computed for shorter prefixes and use those to calculate the distances for longer ones.

We’ll first create a matrix D, where D[i][j] is the minimum edit distance between the first i characters of source and the first j characters of target, and update each element in the matrix as follows:

D[i][j] = min( D[i-1][j] + delete_cost,
               D[i][j-1] + insert_cost,
               D[i-1][j-1] + replace_cost, where replace_cost is 0 if the i-th character of source equals the j-th character of target )
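Here is a sketch of that dynamic-programming computation. I assume the common convention that insert and delete cost 1 and a replacement costs 2; these defaults may differ from the notebook.

```python
def min_edit_distance(source, target, ins_cost=1, del_cost=1, rep_cost=2):
    """DP table D[i][j] = min edit distance between source[:i] and target[:j]."""
    m, n = len(source), len(target)
    D = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(1, m + 1):                      # cost of deleting all i source chars
        D[i][0] = D[i - 1][0] + del_cost
    for j in range(1, n + 1):                      # cost of inserting all j target chars
        D[0][j] = D[0][j - 1] + ins_cost
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            r_cost = 0 if source[i - 1] == target[j - 1] else rep_cost
            D[i][j] = min(D[i - 1][j] + del_cost,      # delete
                          D[i][j - 1] + ins_cost,      # insert
                          D[i - 1][j - 1] + r_cost)    # replace (or match)
    return D[m][n]

print(min_edit_distance("hime", "time"))  # 2: one replacement at cost 2
```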

Implementing these improvements will make our auto-correct system better. Let’s look at a demo of how well it is working.

Demo

Evaluation

Let’s do some unit testing of this system to see how accurately it is performing.
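A tiny illustrative harness for such a test might look like this; the test pairs below are made up, and the real test cases live in the Kaggle notebook.

```python
# Each pair is (misspelled input, expected correction); purely illustrative examples.
test_cases = [("hime", "time"), ("speling", "spelling"), ("wrld", "world")]

def evaluate(correct_fn, cases):
    """Fraction of misspellings that the system corrects to the expected word."""
    hits = sum(1 for wrong, expected in cases if correct_fn(wrong) == expected)
    return hits / len(cases)

# accuracy = evaluate(correct, test_cases)   # `correct` is the function sketched earlier
# print(f"accuracy: {accuracy:.0%}")
```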

Conclusion

This project is an implementation of a statistical auto-correct system. The architecture that has been developed gives an accuracy of around 52%-55%. The improved version of this architecture gets context from the previous words by using bi-gram probabilities, and the min edit distance functionality provides further enhancement. Overall, this simple probability-based auto-correct system performed okay. To get better performance, one can go for deep learning-based auto-correct systems.

PS: Thanks for reading through this. Any suggestions, feedback, criticism, or appreciation are most welcome. If interested, follow me on Kaggle and LinkedIn.
