I’m working on an application to read scanned forms using Tesseract –
both the pre-printed background text, and the data in the fields.
The forms are things like W-2s, which have lots of lines and boxes and a
fairly irregular layout.  My main interest is just to have Tesseract
reliably read the text, with its position on the form – it’s then my
job to program up a way to make sense of it.  I’m not interested in
the lines and boxes.

Currently I’m getting a lot of garbage:
- lines and boxes that get interpreted as text (mainly punctuation, of
  course)
- words that get merged with lines and boxes, resulting in a superfluous
  “F” or “L”, or ultra-large containing rectangles
- a large number of clearly (human-)legible words that seem to be
  completely missed by Tesseract

I know I can get to work on filtering the output according to my
knowledge of the content of the material being scanned – using my own
dictionaries etc.  My question is: are there any config settings or
strategies I could apply to Tesseract that would help in these
circumstances?
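
To make the fallback concrete, the kind of output filtering mentioned
above could look roughly like the sketch below. It does not call
Tesseract at all: it assumes the recognized words are already available
as (text, left, top, width, height) records, and the Word type, the
filter_words name and the max_height / max_aspect thresholds are all
hypothetical values chosen for illustration, not anything Tesseract
provides.

```python
import string
from typing import NamedTuple


class Word(NamedTuple):
    """One recognized token with its bounding box in pixels (hypothetical record)."""
    text: str
    left: int
    top: int
    width: int
    height: int


def filter_words(words, vocabulary, max_height=60, max_aspect=25.0):
    """Drop tokens that look like line/box artifacts.

    max_height and max_aspect are illustrative thresholds that would need
    tuning per form and scan resolution.
    """
    vocab = {v.lower() for v in vocabulary}
    kept = []
    for w in words:
        text = w.text.strip()
        # Rules and box edges tend to come out as pure punctuation.
        if not text or all(ch in string.punctuation for ch in text):
            continue
        # Ultra-large or extremely wide boxes usually mean a merged line or box.
        if w.height <= 0 or w.height > max_height:
            continue
        if w.width / w.height > max_aspect:
            continue
        # A box corner merged into a word often shows up as a stray F or L.
        if text.lower() not in vocab and len(text) > 1:
            if text[0] in "FL" and text[1:].lower() in vocab:
                text = text[1:]
            elif text[-1] in "FL" and text[:-1].lower() in vocab:
                text = text[:-1]
        # Note: the bounding box is left unchanged even when a letter is stripped.
        kept.append(w._replace(text=text))
    return kept
```

This only addresses the first two problems in the list; it cannot
recover words that Tesseract missed entirely.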

Thanks for any suggestions!

