"In Natural Language Processing (NLP), detecting misspellings is crucial for accurate text analysis. One common approach utilizes the noisy channel model. This model assumes text goes through a "channel" that might introduce errors like typos or substitutions. By comparing the observed misspelled word with a large vocabulary or a language model, the system can estimate the most likely intended word, correcting the misspelling and improving overall understanding of the text."- Gemini 2024
In one often-cited example from Speech and Language Processing by Jurafsky and Martin, the noisy channel model is explained using spelling correction probabilities with a unigram model. To extend that to a bigram model, we can replace the unigram probability P(c) with the product of two bigram probabilities, P(c|p) * P(n|c), where c is the candidate correction, p is the previous word, and n is the next word.
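Concretely, the full noisy channel score for each candidate c becomes P(x|c) * P(c|p) * P(n|c), where P(x|c) is the channel (error model) probability of observing the typo x given the intended word c. The sketch below shows that ranking; channel_prob and bigram_prob are hypothetical stand-ins for a real error model and bigram language model, not functions from any library.

# Minimal sketch of noisy-channel candidate ranking with bigram context.
# channel_prob(typo, c) and bigram_prob(w1, w2) are assumed helpers here.
def rank_candidates(typo, prev_word, next_word, candidates,
                    channel_prob, bigram_prob):
    scores = {}
    for c in candidates:
        # P(x|c) * P(c|p) * P(n|c)
        scores[c] = (channel_prob(typo, c)
                     * bigram_prob(prev_word, c)
                     * bigram_prob(c, next_word))
    return max(scores, key=scores.get)

In Jurafsky and Martin's treatment, the channel probabilities come from confusion matrices of single-character edits, while the language model probabilities come from corpus counts like the ones below.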
Then for the phrase "versatile acress whose", we look at the probabilities for possible corrections of "acress" and choose the most likely. For instance, given two possible corrections, "actress" and "across", the solution using bigram counts from the Corpus of Contemporary American English (COCA) is worked through in this video.
To find probabilities for other possible corrections, the necessary counts can be looked up in COCA's online search interface.
The following example uses NLTK and its bundled Brown corpus on a simpler, more common phrase, given a misspelling and two candidate corrections.
import nltk

# The Brown corpus must be downloaded once, e.g. with nltk.download('brown').
words = nltk.corpus.brown.words()
bigrams = nltk.bigrams(words)
fdist = nltk.FreqDist(bigrams)

# "has" is the misspelling; "was" and "his" are the candidate corrections.
phrase = ["it", "has", "raining"]
for word in ['was', 'his']:
    print(phrase[0], word, fdist[(phrase[0], word)])
    print(word, phrase[2], fdist[(word, phrase[2])])
Computing the final probabilities from these counts shows that "was" is the clear winner:
>> it was 743
>> was raining 1
>> it his 2
>> his raining 0
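To turn these raw counts into the conditional probabilities the model needs, one simple option (an unsmoothed maximum-likelihood estimate, shown only as a sketch, not the only choice) is NLTK's ConditionalFreqDist:

import nltk

# Unsmoothed MLE estimate: P(w2|w1) = count(w1, w2) / count(w1)
words = nltk.corpus.brown.words()
cfd = nltk.ConditionalFreqDist(nltk.bigrams(words))

for word in ['was', 'his']:
    # P(word|"it") * P("raining"|word), the bigram replacement for P(c)
    score = cfd['it'].freq(word) * cfd[word].freq('raining')
    print(word, score)

Because "his raining" never occurs in Brown, its score is zero; a real system would apply smoothing so that a single zero count does not rule a candidate out entirely.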