It sounds like there are two sorts of things you need to clean up:
a) OCR errors
b) Formatting (like unnecessary line breaks)
For the former, I understand that Adobe Acrobat and ABBYY FineReader
have built-in spellchecking tools. PrimeOCR, an expensive OCR
package, has a related package
We'd been using Andrew Darby's method and ran into this problem earlier this
year. A (now ex-)colleague coded Calibr
(https://github.com/LincolnUniLTL/calibr) in response, and
we've been running it since. It does depend on tidy CSV, though.
Deborah
-----Original Message-----
Hi Erica,
We are working on a similar project converting concert performances
from the past 20 years for our School of Music. Though we use simple
OCR for PDFs (supporting full text searching), we are selectively
cleaning up OCR for metadata purposes. That is, taking the first page
of