Re: [CODE4LIB] Looking for a script to clean up OCR text files

2014-11-25 Thread Erica FINDLEY
Thanks everyone for your ideas and suggestions. There are many things I am going to take a look at here and perhaps this is a good time for me to learn some regular expressions. I also want to respond regarding my desire to clean up the formatting of the OCR data (line breaks, junk characters,

Re: [CODE4LIB] Looking for a script to clean up OCR text files

2014-11-24 Thread Sylvain Machefert
Hello Scott, I would be curious to hear more about what you expect from OpenRefine in that case. I know OpenRefine is powerful for many things, but I can't see how it applies to the current case. Can you expand? Thanks, Sylvain Machefert - Bordeaux, France Web services librarian - http://geobib.fr/en

Re: [CODE4LIB] Looking for a script to clean up OCR text files

2014-11-23 Thread Kevin Hawkins
It sounds like there are two sorts of things you need to clean up: a) OCR errors and b) formatting (like unnecessary line breaks). For the former, I understand that Adobe Acrobat and ABBYY FineReader have built-in tools for spellchecking. PrimeOCR, an expensive OCR package, has a related package
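For the formatting side (unnecessary line breaks, stray characters), a few regular expressions may go a long way. What follows is only a minimal sketch in Python, not something anyone in this thread posted: the function name `clean_ocr_text` and the specific rules are assumptions for illustration, and they would need tuning against real OCR output. It does nothing about misrecognized words.

```python
import re


def clean_ocr_text(text: str) -> str:
    """Tidy the formatting of OCR output (hypothetical helper, illustrative rules).

    This does not attempt to correct misrecognized words.
    """
    # Normalize Windows and old Mac line endings to "\n".
    text = text.replace("\r\n", "\n").replace("\r", "\n")
    # Drop control characters other than tab and newline (form feeds, etc.).
    text = re.sub(r"[\x00-\x08\x0b\x0c\x0e-\x1f\x7f]", "", text)
    # Rejoin words split by a hyphen at a line break: "con-\ncert" -> "concert".
    # Caveat: this also merges genuine compounds such as "well-\nknown".
    text = re.sub(r"(\w)-\n(\w)", r"\1\2", text)
    # Turn a lone line break into a space; blank lines (paragraph breaks) stay.
    text = re.sub(r"(?<!\n)\n(?!\n)", " ", text)
    # Collapse runs of spaces and tabs into one space.
    text = re.sub(r"[ \t]+", " ", text)
    # Collapse three or more newlines into a single blank line.
    text = re.sub(r"\n{3,}", "\n\n", text)
    # Remove a space left directly before or after a line break.
    text = re.sub(r" ?\n ?", "\n", text)
    return text.strip()


if __name__ == "__main__":
    sample = "Sym-\nphony No. 5\r\nin C  minor\n\n\n\nConductor:\x0c Jane Doe \n"
    print(clean_ocr_text(sample))
```

One design caveat: in concert programs, line breaks often carry meaning (one performer or piece per line), so the step that joins lone line breaks may need to be dropped or restricted to running prose.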

Re: [CODE4LIB] Looking for a script to clean up OCR text files

2014-11-23 Thread Monica Rivero
Hi Erica, We are working on a similar project converting concert performances from the past 20 years for our School of Music. Though we use simple OCR for PDFs (supporting full text searching), we are selectively cleaning up OCR for metadata purposes. That is taking the first page of

Re: [CODE4LIB] Looking for a script to clean up OCR text files

2014-11-22 Thread scott bacon
Erica, You may find what you need from OpenRefine: http://openrefine.org/ On Fri, Nov 21, 2014 at 5:15 PM, Erica FINDLEY eri...@multco.us wrote: Greetings, I am working on a project to digitize concert programs. These are the type of programs you get when attending a musical concert that

[CODE4LIB] Looking for a script to clean up OCR text files

2014-11-21 Thread Erica FINDLEY
Greetings, I am working on a project to digitize concert programs. These are the type of programs you get when attending a musical concert that list performers and details about the concert. Since these items are text heavy we have decided to use OCR software to output a text file that will