Re: [CODE4LIB] Looking for a script to clean up OCR text files

2014-11-23 Thread Kevin Hawkins
It sounds like there are two sorts of things you need to clean up: a) OCR errors and b) formatting (like unnecessary line breaks). For the former, I understand that Adobe Acrobat and ABBYY FineReader have built-in spellchecking tools. PrimeOCR, an expensive OCR package, has a related package
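
For the formatting side (unnecessary line breaks), a rough sketch in Python might look like the following. It is not taken from any of the tools mentioned in this thread; the function name `unwrap_ocr_text` is made up for illustration, and it assumes that paragraphs in the OCR output are separated by blank lines while single newlines are just hard wraps.

```python
import re


def unwrap_ocr_text(text: str) -> str:
    """Join hard-wrapped lines inside paragraphs; keep blank-line paragraph breaks.

    Hypothetical helper: assumes blank lines mark real paragraph boundaries.
    """
    # Normalize Windows and old Mac line endings to "\n".
    text = text.replace("\r\n", "\n").replace("\r", "\n")

    # Rejoin words split by a hyphen at a line end ("con-\ncert" -> "concert").
    # Caveat: this also merges genuine compounds ("well-\nknown" -> "wellknown"),
    # so review the output or check candidates against a word list.
    text = re.sub(r"(\w)-\n(\w)", r"\1\2", text)

    # Split on one or more blank lines, then collapse each paragraph to one line.
    paragraphs = re.split(r"\n\s*\n", text)
    cleaned = [
        " ".join(line.strip() for line in p.split("\n") if line.strip())
        for p in paragraphs
    ]
    return "\n\n".join(p for p in cleaned if p)


if __name__ == "__main__":
    sample = "The con-\ncert was\nheld in May.\n\nSecond para-\ngraph here."
    print(unwrap_ocr_text(sample))
```

This only handles line breaks and end-of-line hyphens; actual OCR errors (misread characters) would still need a spellchecker or manual review.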

Re: [CODE4LIB] Library Hours Fail

2014-11-23 Thread Fitchett, Deborah
We'd been using Andrew Darby's method and ran into this problem earlier this year. A (now ex-)colleague coded Calibr (https://github.com/LincolnUniLTL/calibr) in response, and we've been running it since. It does depend on tidy CSV, though. Deborah

Re: [CODE4LIB] Looking for a script to clean up OCR text files

2014-11-23 Thread Monica Rivero
Hi Erica, We are working on a similar project converting concert performances from the past 20 years for our School of Music. Though we use simple OCR for PDFs (supporting full-text searching), we are selectively cleaning up OCR for metadata purposes. That is taking the first page of