Extracting text from scanned PDF docs

Kevin Carlson Fri, 24 Sep 2010 13:44:39 -0700

We receive PDF files that appear to contain scanning artifacts, which
severely hurt OCR recognition.  Specifically, under magnification you
can see regularly spaced "notches" and corresponding "bumps",
especially noticeable on vertical lines.


Currently I'm using Ghostscript to convert the files to TIFF for
processing.  Are there any Python-based alternatives out there?
Ultimately I would like to do all cleaning and converting in Python,
with "Pytesser" to do the OCR.
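
For reference, the Ghostscript step can at least be driven from Python
through subprocess.  This is only a minimal sketch: the function names,
the 300 DPI default, the "tiffg4" output device and the file names are
assumptions chosen for illustration, and the executable is assumed to be
on the PATH as "gs" (on Windows it is usually named differently).

```python
import subprocess


def build_gs_command(pdf_path, tiff_pattern, dpi=300):
    """Return the Ghostscript argument list for a PDF -> TIFF conversion.

    The helper name, the default DPI and the tiffg4 device are
    illustrative assumptions, not settings taken from the original setup.
    """
    return [
        "gs",                   # assumed executable name on the PATH
        "-dNOPAUSE",            # do not pause between pages
        "-dBATCH",              # exit once the input is exhausted
        "-sDEVICE=tiffg4",      # 1-bit TIFF with CCITT Group 4 compression
        "-r%d" % dpi,           # rendering resolution in DPI
        "-sOutputFile=%s" % tiff_pattern,
        pdf_path,
    ]


def pdf_to_tiff(pdf_path, tiff_pattern, dpi=300):
    """Run Ghostscript; raises CalledProcessError on a non-zero exit."""
    subprocess.run(build_gs_command(pdf_path, tiff_pattern, dpi), check=True)
```

Calling pdf_to_tiff("scan.pdf", "page-%03d.tif") would write one TIFF
per page; keeping the command builder separate makes it possible to
inspect the arguments without Ghostscript installed.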

Any suggestions on cleaning up the files to improve recognition rates?
I'd also like to try "training" the OCR engine on the notched
characters, but the links I have found on doing so seem incomplete.
Any recommendations would be appreciated!
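
One cleanup idea for the notches and bumps, sketched here as an
assumption and not something tested on these files, is a 3x3 majority
(binary median) filter over the bitonal page.  The function name and
the list-of-rows image representation are hypothetical; a real pipeline
would more likely use an imaging library.

```python
def majority_filter(image):
    """Apply a 3x3 majority vote to a binary image.

    `image` is a list of rows, each a list of 0 (background) or 1 (ink).
    Each output pixel becomes 1 only if more than half of the pixels in
    its 3x3 neighbourhood (clipped at the borders) are 1.
    """
    height = len(image)
    width = len(image[0]) if height else 0
    result = []
    for r in range(height):
        row = []
        for c in range(width):
            ones = 0
            count = 0
            # Visit the in-bounds part of the 3x3 window around (r, c).
            for rr in range(max(0, r - 1), min(height, r + 2)):
                for cc in range(max(0, c - 1), min(width, c + 2)):
                    ones += image[rr][cc]
                    count += 1
            row.append(1 if 2 * ones > count else 0)
        result.append(row)
    return result
```

On a three-pixel-wide vertical stroke this fills a single-pixel notch
and removes a single-pixel bump.  The trade-off is that one-pixel-wide
strokes are erased entirely, so it only makes sense at a resolution
where character strokes are at least two pixels thick.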

Thanks!
Kevin

-- 
You received this message because you are subscribed to the Google Groups 
"tesseract-ocr" group.
