
How do you convert scanned images of a text into digital, searchable text? Optical Character Recognition (OCR) software reads the text in an image and converts it into character strings. This page from Germán Stiglich's gazetteer of Peru, the Diccionario Geográfico del Perú (1922), was digitized using ABBYY FineReader's OCR capabilities. Scroll across the page to see the near-perfect recognition.

More commonly, however, OCR texts are full of errors. In these cases, there are three choices: 1) 'clean' or edit the text manually; 2) identify common mistakes and either write a script to fix them all at once or train ABBYY FineReader to read this type of text and run the OCR again; or 3) accept the mistakes as falling within an acceptable margin of error.
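As a rough illustration of the scripted part of option 2, here is a minimal Python sketch that applies a table of regular-expression corrections in one pass. The `CORRECTIONS` table, the `clean_ocr` function, and the sample sentence are all hypothetical and invented for this example. They are not taken from the Stiglich text or from ABBYY FineReader's output, so the patterns would need to be replaced with the mistakes that actually recur in your own OCR results.

```python
import re

# Hypothetical correction table: each pattern is an error assumed to recur
# in the OCR output, paired with its replacement. Build yours from the
# mistakes you actually observe in your own text.
CORRECTIONS = [
    (re.compile(r"\bPeni\b"), "Perú"),        # "rú" misread as "ni"
    (re.compile(r"(?<=\d)[lI](?=\d)"), "1"),  # letter l or I between two digits
    (re.compile(r"(\w)-\n(\w)"), r"\1\2"),    # word hyphenated across a line break
    (re.compile(r"[ \t]{2,}"), " "),          # runs of spaces or tabs
]


def clean_ocr(text):
    """Apply every correction in order and return the cleaned text."""
    for pattern, replacement in CORRECTIONS:
        text = pattern.sub(replacement, text)
    return text


# Invented sample containing one instance of each assumed error.
sample = "Rio del Peni,  a 2l5 km de la capi-\ntal."
cleaned = clean_ocr(sample)
print(cleaned)
```

Keeping the corrections in a single list makes it easy to add a pattern each time a new recurring mistake turns up. Note that the line-break rule would also join compounds that are genuinely hyphenated at the end of a line, so it is worth spot-checking the output against the scanned page.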

Add structural tagging with Python's regular expressions