3.00 might help a bit, but it would be better to subtract the image of a blank
form from your input. The trick is to align the images first... Ray.
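
A rough sketch of that idea in Python with NumPy, under some loud assumptions: both images are already binarized into boolean arrays of the same shape (True = ink), the scan differs from the blank form only by a small whole-pixel shift, and the function names best_shift and subtract_blank are made up for illustration (they are not part of Tesseract). Real scans usually also have some rotation and scaling, which this does not handle.

```python
import numpy as np

def best_shift(blank, scan, max_shift=5):
    """Brute-force search for the integer (dy, dx) shift of `blank`
    that overlaps the most ink pixels with `scan`."""
    best, best_score = (0, 0), -1
    for dy in range(-max_shift, max_shift + 1):
        for dx in range(-max_shift, max_shift + 1):
            # np.roll wraps around the edges; acceptable for small shifts
            # when the form has a blank margin (an assumption here).
            shifted = np.roll(blank, (dy, dx), axis=(0, 1))
            score = int(np.sum(shifted & scan))
            if score > best_score:
                best, best_score = (dy, dx), score
    return best

def subtract_blank(blank, scan, max_shift=5):
    """Align the blank form to the scan, then erase its ink from the scan."""
    dy, dx = best_shift(blank, scan, max_shift)
    aligned = np.roll(blank, (dy, dx), axis=(0, 1))
    return scan & ~aligned
```

What is left after subtract_blank should be mostly the filled-in data, which can then be handed to Tesseract. In practice it probably helps to thicken the aligned blank mask by a pixel or two before subtracting, since scanned lines rarely match the template exactly.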

On Thu, Feb 19, 2009 at 10:44 AM, keithjd <[email protected]> wrote:

>
> I’m working on an application to read scanned forms using Tesseract –
> both the pre-printed background text, and the data in the fields.
> Various things like W-2s, that have lots of lines and boxes, and a
> fairly irregular layout.  My main interest is just to have Tesseract
> reliably read the text, with its position on the form – it’s then my
> job to program up a way to make sense of it.  I’m not interested in
> the lines and boxes.
>
> Currently I’m getting a lot of garbage:
> - lines and boxes that get interpreted as text (mainly punctuation, of
>   course)
> - words that get merged with lines and boxes, resulting in superfluous
>   “F” or “L”, or ultra-large containing rectangles
> - a large number of clearly (human) legible words which seem to be
>   completely missed by Tesseract
>
> I know I can get to work on filtering the output according to my
> knowledge of the content of the material being scanned – using my own
> dictionaries etc.  My question is: are there any config settings or
> strategies that I could apply to Tesseract that would help in these
> circumstances?
>
> Thanks for any suggestions!
>