Version 3.00 might help a bit, but it would be better to subtract the image of a blank form from your input. The trick is to align the images first.

Ray
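To make the align-then-subtract idea concrete, here is a minimal sketch in plain Python. It is an illustration under stated assumptions, not anything Tesseract provides: the images are assumed to be already binarized into lists of rows of 0/1 pixels (1 = ink), the scan is assumed to differ from the blank form only by a small whole-pixel shift, and the function names (`find_offset`, `subtract_blank`) are made up for this example. Real scans usually also need rotation and scale correction, which this does not attempt.

```python
# Sketch only: images are lists of rows of 0/1 pixels (1 = ink).
# Assumption: the scan is the blank form shifted by a few whole pixels,
# plus the filled-in data. Rotation and scaling are NOT handled here.

def pixel_or_paper(img, y, x):
    """Return img[y][x], or 0 (blank paper) when (y, x) is outside the image."""
    if 0 <= y < len(img) and 0 <= x < len(img[0]):
        return img[y][x]
    return 0

def find_offset(blank, scan, max_shift=3):
    """Brute-force the (dy, dx) shift at which the blank form best overlaps the scan."""
    best, best_score = (0, 0), -1
    for dy in range(-max_shift, max_shift + 1):
        for dx in range(-max_shift, max_shift + 1):
            # Count blank-form ink pixels that land on ink in the scan.
            score = sum(
                1
                for y, row in enumerate(blank)
                for x, ink in enumerate(row)
                if ink and pixel_or_paper(scan, y + dy, x + dx)
            )
            if score > best_score:
                best, best_score = (dy, dx), score
    return best

def subtract_blank(blank, scan, max_shift=3):
    """Erase from the scan every pixel that is ink in the aligned blank form."""
    dy, dx = find_offset(blank, scan, max_shift)
    return [
        [
            0 if pixel_or_paper(blank, y - dy, x - dx) else ink
            for x, ink in enumerate(row)
        ]
        for y, row in enumerate(scan)
    ]
```

Calling `subtract_blank(blank, scan)` returns a copy of the scan with the pre-printed lines and boxes removed, which could then be passed to OCR for the field data. The brute-force search is quadratic in `max_shift` and is only meant to show the idea; a dedicated image registration routine would be the practical choice for full-size scans.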
On Thu, Feb 19, 2009 at 10:44 AM, keithjd <[email protected]> wrote:
>
> I’m working on an application to read scanned forms using Tesseract –
> both the pre-printed background text, and the data in the fields.
> Various things like W-2s, that have lots of lines and boxes, and a
> fairly irregular layout. My main interest is just to have Tesseract
> reliably read the text, with its position on the form – it’s then my
> job to program up a way to make sense of it. I’m not interested in
> the lines and boxes.
>
> Currently I’m getting a lot of garbage:
> - lines and boxes that get interpreted as text (mainly punctuation, of
>   course)
> - words that get merged with lines and boxes, resulting in superfluous
>   “F” or “L”, or ultra-large containing rectangles
> - a large number of clearly (human) legible words which seem to be
>   completely missed by Tesseract
>
> I know I can get to work on filtering the output according to my
> knowledge of the content of the material being scanned – using my own
> dictionaries etc. My question is, are there any config settings or
> strategies that might be useful for me to apply to Tesseract that
> would help in these circumstances?
>
> Thanks for any suggestions!

