Thanks Ray
Unfortunately I'm not dealing with fixed formats for the forms. I have too wide a variety to be able to do the subtract trick, or to just OCR specific rectangles on the forms. I'm looking to pull off the text and its position and then analyse what I get from that.

So I'll play around a bit more, keep an eye open for 3.00 when it's ready, and also take a look at commercial packages to see if they are structured in a way that suits my needs any better.

On Feb 25, 6:46 pm, Ray Smith <[email protected]> wrote:
> 3.00 might help a bit, but better would be to subtract the image of a blank
> form from your input. The trick is to align the images first...Ray.
>
> On Thu, Feb 19, 2009 at 10:44 AM, keithjd <[email protected]> wrote:
>
> > I’m working on an application to read scanned forms using Tesseract –
> > both the pre-printed background text, and the data in the fields.
> > Various things like W-2s, that have lots of lines and boxes, and a
> > fairly irregular layout. My main interest is just to have Tesseract
> > reliably read the text, with its position on the form – it’s then my
> > job to program up a way to make sense of it. I’m not interested in
> > the lines and boxes.
>
> > Currently I’m getting a lot of garbage:
> > - lines and boxes that get interpreted as text (mainly punctuation of
> > course)
> > - words that get merged with lines and boxes, resulting in superfluous
> > “F” or “L”, or ultra-large containing rectangles.
> > - A large number of clearly (human) legible words which seem to be
> > completely missed by Tesseract.
>
> > I know I can get to work on filtering the output according to my
> > knowledge of the content of the material being scanned – using my own
> > dictionaries etc. My question is, are there any config settings, or
> > strategies that might be useful for me to apply to Tesseract that
> > would help in these circumstances.
>
> > Thanks for any suggestions!
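
For reference, here is a rough sketch of the kind of output filtering discussed in this thread: dropping punctuation-only results (likely lines and boxes read as text) and ultra-large containing rectangles. It assumes the recognised words and their positions have already been extracted into simple records; the `WordBox` type, the `filter_boxes` function, and the `max_height_ratio` threshold are all hypothetical names and values chosen for illustration, not part of Tesseract.

```python
import re
from dataclasses import dataclass
from statistics import median


@dataclass
class WordBox:
    """One recognised word and its bounding rectangle (hypothetical record)."""
    text: str
    left: int
    top: int
    width: int
    height: int


def filter_boxes(boxes, max_height_ratio=3.0):
    """Drop results that look like form lines or boxes rather than text.

    Two heuristics, both assumptions to tune per document set:
      1. Discard results with no letter or digit (punctuation-only garbage).
      2. Discard results much taller than the median text height, which
         usually means a word was merged with a line or box.
    """
    candidates = [b for b in boxes if re.search(r"[A-Za-z0-9]", b.text)]
    if not candidates:
        return []
    typical_height = median(b.height for b in candidates)
    limit = max_height_ratio * typical_height
    return [b for b in candidates if b.height <= limit]


if __name__ == "__main__":
    sample = [
        WordBox("Wages", 10, 10, 60, 12),
        WordBox("|", 5, 0, 2, 300),       # vertical rule read as punctuation
        WordBox("L", 0, 0, 400, 200),     # box corner merged into a "letter"
        WordBox("tips", 80, 10, 40, 12),
        WordBox("Employer", 10, 40, 90, 13),
    ]
    for box in filter_boxes(sample):
        print(box.text, box.left, box.top)
```

The median is used for the typical height so that a few oversized rectangles do not skew the threshold. This only addresses the first two kinds of garbage listed above; it does nothing for words that are missed entirely.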

