For a good overview of what OCR is, check out this overview
I found myself cutting the spines off books, again. This time it was because I couldn’t find an e-book copy of ‘Animal Liberation’ anywhere on the net, and I’ve amassed quite a few physical copies--mostly from garage sales--that I could afford to experiment with. The pages were to be scanned, OCR-ed, and put on my Kindle.
The experience got me thinking about what it would take to write a good OCR post-processor. This is something that I’ve been thinking (off and on) about for a few months now, and I have a lot of ideas running around in my head. So this will be the first in a series of blog posts of my testing out some of those ideas. The first few, this included, will be about autocorrection of individual words (isolated term correction).
One of the first things that came to mind is to use traditional spelling correction procedures on the OCR output: finding approximate matches from a corpus of acceptable vocabulary (using sim-hash (more) or a k-gram index), and returning the n closest matches (minimizing Levenshtein distance). Although undoubtedly helpful and necessary for most applications, I thought it would be more interesting to explore whether there were any tenable solutions that didn’t rely on a priori information like a vocabulary list and a precomputed k-gram index. A solution like this should, in theory, be able to handle OCR output that contains tokens for which no dictionary exists, or is difficult to get. An example of this would be output from a medical report, a text in Esperanto (or Klingon), or ciphered text.
A document contains many repeated words. The OCR output misrecognizes some of these repeated words, but not all. Given a set of tokens that are likely referring to one word, how can we choose the item with correct spelling?
The token with the lowest average Levenshtein distance from all other tokens is very often the correct spelling.
An OCR reads a document that contains a word many times, and the OCR makes a few mistakes. Of the tokens in the OCR output, most of them will be the proper spelling. The tokens that are misspellings will be the result of (mostly single-edit) transformations (insertions, deletions, substitutions, etc..) of the original word. It stands to reason that, most of the time, these misspellings will be more similar to the original word than to other misspellings because the original word is where the mistake is derived from. For example, misrecognition of the word “cat” can yield “bat” and “cab”. Both “bat” and “cab” are single-edit transformations of “cat” (the Levenshtein distance is 1) but they are two transformations away from each other.
We’ll model incorrect OCR output on a hypothetical document that contains several repeats of a single word. We’ll find the word that satisfies the condition in the theory and compare it with the actual word. We will also compare this to a control, where a spelling is chosen at random from a list of tokens that are known to refer to the same word.
Modeling a crummy OCR:
If an OCR has an 80% character recognition accuracy, its processing of text can be modeled as a series of Bernoulli trials where the probability of success is 80%. When it is unsuccessful, the most common mistake is a mistaken letter, but occasionally it is the insertion of a noise character or the complete removal of a letter.
import random import numpy as np def process(word, bernoulli_p): new_word = "" for letter in word: failed_trial = np.random.geometric(p=bernoulli_p)-1 if failed_trial: choice = random.randint(97, 132) if 128 <= choice <= 132: letter = "" elif 123 <= choice <= 127: letter = letter + chr(random.randint(97, 122)) else: letter = chr(choice) new_word += letter return new_word
In this python snippet, for every character in the original word, a number is drawn from a geometric distribution with the probability given to the function. If this is 1, the letter is added to the new word. If it is greater than 1, an integer is chosen from 97 to 132 (inclusive). If it is somewhere between 97 and 122 (the ASCII decimal values for ‘a’-’z’) the corresponding character is added to the new word. In the case that it is between 123 to 132, half the time it will skip a letter, and half the time it will read the correct letter but also add a random noise character from ‘a’-’z’.
>>> for i in xrange(0, 7):
>>> print process("morrissey", .8)
For the two words, “morrissey” and “counterrevolutionaries” (I choose the latter word only because it has a lot of letters), hypothetical documents containing 5, 20, and 100 occurrences of the word are processed with accuracies of 99%, 98%, 95%, 90%, 85%, 80% 75%, and 70%. The token with the lowest average Levenshtein distance is chosen and compared to the original word. Its success or failure is noted. As a comparison, a token is chosen from the OCR output randomly and its probability of being the correct spelling of the original word is noted (this is roughly the character accuracy probability to the nth power, where n is the number of characters in the original word). This is repeated 1,000 times and averaged (mean).
The ‘least Levenshtein distance’ method consistently outperforms chance. With the shorter word (“morrissey”), this method will choose the correct spelling over 95% of the time in a document that contains the word 20 times, even when the character-by-character accuracy of the model OCR is 80% (this means that the entire word will be OCR-ed correctly only about 13% of the time).
The number of times the word appears in the document has a great bearing on the accuracy of the results; the more often the word appears, the more consistently the correct word is identified.
In the 22-character word “counterrevolutionaries”, the results are less exciting. For example, in a document that contains the word 20 times, with an OCR character accuracy of 80%, the method will choose the correct word only 14% of the time. However, this is partially because, in this case, the word is only OCR-ed correctly 0.74% of the time and it is unlikely that the correct spelling of the word even appears in the OCR output. Remember that this method relies on at least one correct spelling in the OCR output to work.
In the interest of open and audit-able science, the code, data, and analysis for the experiment is available on a git repository I set up for this and other experiments I plan to perform on OCR post-processing methods.
Even though I find these results exciting, this method of spelling correction is probably unsuitable in most circumstances where the document is not very large and no particular word occurs very often.
I have other ideas for this and related problems that I plan to test and write posts about in the coming weeks.share this: