Friday, September 29, 2006

 

OCR problems in Google Books

The accuracy of OCR is vastly increased when the words being read exist in a dictionary. When the OCR engine has to decide between several possible interpretations of a word, it can use the dictionary to help determine which is the most likely.
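As a rough illustration of that idea, here is a minimal Python sketch. It assumes the OCR engine hands back a ranked list of candidate readings for each word; the function name `pick_reading` and the tiny word list are made up for this example and are not taken from any real OCR package.

```python
# Hypothetical word list; a real system would load a full dictionary
# for the language of the book being scanned.
DICTIONARY = {"modern", "book", "such", "sale"}


def pick_reading(candidates, dictionary=DICTIONARY):
    """Return the first candidate found in the dictionary.

    `candidates` is assumed to be ordered from most to least likely
    according to the character recogniser. If none of them is a known
    word, fall back to the recogniser's top choice.
    """
    for word in candidates:
        if word.lower() in dictionary:
            return word
    return candidates[0]
```

With this word list, `pick_reading(["rnodern", "modern"])` prefers "modern" even though the recogniser ranked "rnodern" first. A real engine would probably weigh dictionary evidence against its own confidence scores rather than let the dictionary override everything.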

As the Google Books project announces its extension to scan European books in Madrid, it will have to adjust the dictionary it uses - it's no good scanning Spanish books with the same dictionary used for American English books.

However, Google has already scanned some "English" documents where the OCR process has gone very wrong. Consider this page of old printed English, which uses the "long s", a character that looks much like a modern "f". It looks as though the OCR was not told that the page dates from 1796 and so should watch out for the long "s"; as a result it has identified lots of "fuch" and "fale" rather than "such" and "sale" on the page. A simple dictionary check would have helped here - but only if the process expects "f" and "s" to be confused.
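To make that last point concrete, here is a minimal Python sketch of a dictionary check that knows "f" may really be a long "s". It is only an illustration under stated assumptions: the function name `fix_long_s` and the small word list are invented for this example, and nothing here reflects how Google's pipeline actually works.

```python
from itertools import product

# Hypothetical word list standing in for a real English dictionary.
WORDS = {"such", "sale", "first", "for", "of"}


def fix_long_s(word, dictionary=WORDS):
    """Try reading each 'f' as 's' until a dictionary word turns up.

    Words already in the dictionary are returned unchanged. Otherwise
    every combination of keeping or swapping each 'f' is tried; the
    first dictionary match is returned in lower case. If nothing
    matches, the original word is returned untouched.
    """
    lower = word.lower()
    if lower in dictionary:
        return word
    positions = [i for i, ch in enumerate(lower) if ch == "f"]
    for choice in product("fs", repeat=len(positions)):
        chars = list(lower)
        for pos, ch in zip(positions, choice):
            chars[pos] = ch
        candidate = "".join(chars)
        if candidate in dictionary:
            return candidate
    return word
```

With this word list, `fix_long_s("fuch")` gives "such" and `fix_long_s("firft")` gives "first", while "for" is left alone. The sketch cannot help when the misreading is itself a valid word (a long-s "same" read as "fame" would pass straight through), so a real system would need context as well as a dictionary.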