Thanks everyone for your ideas and suggestions. There are many things I am
going to take a look at here and perhaps this is a good time for me to
learn some regular expressions.
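As a possible starting point for the regular-expression route, here is a minimal Python sketch of the kind of cleanup discussed in this thread (stray line breaks, junk characters). The function name `clean_ocr_text` and the set of characters treated as "expected" are assumptions made for illustration, not anything prescribed by the OCR tools mentioned here, and would need tuning against real program text.

```python
import re


def clean_ocr_text(text: str) -> str:
    """Tidy common OCR formatting problems (hypothetical helper)."""
    # Normalize Windows and old Mac line endings to "\n".
    text = text.replace("\r\n", "\n").replace("\r", "\n")
    # Rejoin words split by a hyphen at a line break ("Sym-\nphony").
    # Caveat: this also removes the hyphen from genuine compounds
    # that happen to break across lines.
    text = re.sub(r"(\w)-\n(\w)", r"\1\2", text)
    # Drop characters outside an assumed "expected" set: word characters,
    # whitespace, and common punctuation. Adjust this set as needed.
    text = re.sub(r"[^\w\s.,;:!?'\"()&/-]", "", text)
    # Turn a single line break inside a paragraph into a space,
    # leaving blank lines (paragraph breaks) untouched.
    text = re.sub(r"(?<!\n)\n(?!\n)", " ", text)
    # Collapse runs of spaces and tabs left behind by the steps above.
    text = re.sub(r"[ \t]+", " ", text)
    return text.strip()


sample = "Sym-\nphony No. 5 ~|\nin C minor\n\nConductor:  J. Smith"
cleaned = clean_ocr_text(sample)
print(cleaned)
```

Since `\w` is Unicode-aware in Python 3, accented letters in performer names should survive the junk-character step.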
I also want to respond regarding my desire to clean up the formatting of
the OCR data (line breaks, junk characters,
Hello Scott,
I would be curious to hear more about what you expect from OpenRefine in
that case. I know OpenRefine is powerful for many things, but I can't see
how it applies to the current case. Can you expand?
Thanks
Sylvain Machefert - Bordeaux, France
Web services librarian - http://geobib.fr/en
It sounds like there are two sorts of things you need to clean up:
a) OCR errors
b) Formatting (like unnecessary line breaks)
For the former, I understand that Adobe Acrobat and ABBYY FineReader
have built-in spellchecking tools. PrimeOCR, an expensive OCR
package, has a related package
Hi Erica,
We are working on a similar project converting concert performances
from the past 20 years for our School of Music. Though we use simple
OCR for PDFs (supporting full text searching), we are selectively
cleaning up OCR for metadata purposes. That is, taking the first page
of
Erica,
You may find what you need from OpenRefine: http://openrefine.org/
On Fri, Nov 21, 2014 at 5:15 PM, Erica FINDLEY eri...@multco.us wrote:
Greetings,
I am working on a project to digitize concert programs. These are the type
of programs you get when attending a musical concert that list performers
and details about the concert.
Since these items are text heavy we have decided to use OCR software to
output a text file that will