I’m working on an application to read scanned forms using Tesseract – both the pre-printed background text and the data in the fields. These are forms such as W-2s, which have lots of lines and boxes and a fairly irregular layout. My main interest is just to have Tesseract reliably read the text, along with its position on the form – it’s then my job to program up a way to make sense of it. I’m not interested in the lines and boxes themselves.
Currently I’m getting a lot of garbage:

- Lines and boxes that get interpreted as text (mainly punctuation, of course).
- Words that get merged with lines and boxes, resulting in a superfluous “F” or “L”, or in ultra-large containing rectangles.
- A large number of words that are clearly legible to a human but seem to be completely missed by Tesseract.

I know I can get to work on filtering the output according to my knowledge of the content of the material being scanned – using my own dictionaries etc.

My question is: are there any config settings or strategies I could apply to Tesseract that would help in these circumstances?

Thanks for any suggestions!
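To make the fallback concrete, here is a rough sketch of the kind of output filtering I have in mind. It does not call Tesseract at all; it assumes the recognized words and their bounding boxes have already been collected into a list. The `Word` record, the `filter_words` function, and the size thresholds are all made up for illustration and would need tuning per form and scan resolution.

```python
import re
from typing import NamedTuple


class Word(NamedTuple):
    # Hypothetical record for one recognized word and its bounding box (pixels).
    text: str
    left: int
    top: int
    width: int
    height: int


def filter_words(words, vocabulary, max_width=600, max_height=60):
    """Drop likely line/box artifacts from a list of recognized words.

    The thresholds are illustrative guesses, not values taken from Tesseract.
    """
    kept = []
    for w in words:
        token = w.text.strip()
        # Tokens with no letters or digits are usually lines read as punctuation.
        if not re.search(r"[A-Za-z0-9]", token):
            continue
        # Ultra-large rectangles usually mean a word was merged with a line or box.
        if w.width > max_width or w.height > max_height:
            continue
        # Strip stray punctuation picked up from neighbouring rules.
        core = re.sub(r"^\W+|\W+$", "", token)
        # Keep anything containing a digit (field data) or a known form word.
        if any(ch.isdigit() for ch in core) or core.lower() in vocabulary:
            kept.append(w._replace(text=core))
    return kept


if __name__ == "__main__":
    vocab = {"wages", "tips", "other", "compensation"}
    sample = [
        Word("Wages,", 40, 100, 80, 22),
        Word("|", 30, 90, 3, 400),
        Word("F", 200, 100, 14, 22),
        Word("12345.67", 300, 130, 110, 22),
    ]
    for word in filter_words(sample, vocab):
        print(word.text, word.left, word.top)
```

The obvious weakness is that a dictionary check like this also throws away legitimate field data such as names, and it does nothing for the words Tesseract misses entirely, which is why I would rather fix the problem at the recognition stage.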

