Hello Mrozik,

>> I need to develop .NET WinForms app for reading form data from scanned
>> documents. Is it possible to use tesseract in this scenario? (form?
>> region? recognition, recognition of test answers - marked checkbox
>> data?)

As DataIntelligence said, Tesseract can't do that by itself. But I have been looking for a very similar application myself, and I finally got what I needed by using Tesseract and writing specific code around it.

For example, I assume that scanned forms of a given kind always share the same layout. My application first determines the "type" of the form using colorimetry (our forms contain similar data, but the "types" are pretty much defined by a barcode and colors). For your purpose, I would suggest writing a small classification step that first determines the type.
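
To give an idea of what a color-based type check could look like, here is a minimal Python sketch (Python only for brevity, the same idea ports to .NET). It is not my actual code: the reference colors, the type names and the functions `mean_color` and `classify_form` are all made up for illustration, and it assumes you have already sampled some (r, g, b) pixels from a region of the scan that is characteristic of the form type.

```python
import math

# Hypothetical reference colors (R, G, B), one per form type. Real values
# would have to be measured on sample scans of your own forms.
REFERENCE_COLORS = {
    "blue_form": (60, 90, 200),
    "green_form": (70, 180, 90),
    "yellow_form": (230, 210, 80),
}


def mean_color(pixels):
    """Average a sequence of (r, g, b) tuples sampled from the scan."""
    pixels = list(pixels)
    if not pixels:
        raise ValueError("no pixels sampled")
    count = len(pixels)
    return tuple(sum(p[channel] for p in pixels) / count for channel in range(3))


def classify_form(pixels, references=REFERENCE_COLORS):
    """Return the type whose reference color is closest to the sample mean."""
    average = mean_color(pixels)
    return min(references, key=lambda name: math.dist(average, references[name]))


# Usage: two pixels sampled from a (hypothetical) colored header band.
form_type = classify_form([(58, 95, 190), (65, 88, 205)])
```

Plain Euclidean distance in RGB is the simplest possible choice. If your scans vary a lot in brightness, comparing in another color space, or combining the color with the barcode, would probably be more robust.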
Once the type is known, you also know the layout of the sentences and checkboxes. In my application I wrote an object that extracts those rectangles and hands them to Tesseract one by one. If you know where your checkboxes are, the process is almost the same: extract those rectangles and run a histogram equalization on each of them, so you end up with a "black" box outline and a "white" inside. Then you just have to check whether the inside is really "white" or whether it contains some kind of mark. The right approach probably depends on the context, though. For instance, if the pens provided for filling in the form are colored, and your scans are in color too, then determining whether a box is checked is easier.

So I'd say, from what you seem to need:

- You don't need Tesseract, unless you want to read the "questions" on your forms.
- First, write an algorithm to rotate your scans so that they are properly "horizontal". This saves a lot of computation for type recognition and checkbox extraction. It's easy: find, for example, two corners of the form in your image (say, the two ends of its top edge), compute the slope of the line between them, and take atan(slope) to get the angle. Then rotate the image by the negative of that angle and you're set.
- Next, classify by hand the "types" of forms your application will process (one layout = one type).
- Based on that, find the fastest and most efficient way to make a program recognize those "types".
- Then, once you've got the type, extract the checkboxes. Don't use absolute coordinates; use relative ones (typically x/y as a percentage of the width/height instead of an absolute pair of values).
- Recognize the content of the checkboxes.

Hope this helps,
Pierre.
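
For the rotation step, the angle computation can be sketched like this in Python. This is only an illustration under a few assumptions: the two corner points have already been located somehow (that part is not shown), the function names are mine, and I use `math.atan2` instead of a bare atan(slope) so that a vertical edge does not cause a division by zero.

```python
import math


def skew_angle_degrees(left_corner, right_corner):
    """Angle of the edge joining two detected corners, in degrees.

    Both corners are (x, y) pixel coordinates of what should be a
    horizontal edge of the form.
    """
    (x1, y1), (x2, y2) = left_corner, right_corner
    # atan2(dy, dx) equals atan(dy / dx) for dx > 0 and also handles dx == 0.
    return math.degrees(math.atan2(y2 - y1, x2 - x1))


def rotate_point(point, center, angle_degrees):
    """Rotate one point around a center; an image rotation does this per pixel."""
    angle = math.radians(angle_degrees)
    dx, dy = point[0] - center[0], point[1] - center[1]
    return (
        center[0] + dx * math.cos(angle) - dy * math.sin(angle),
        center[1] + dx * math.sin(angle) + dy * math.cos(angle),
    )


# Usage: the top edge drops 35 px over 1000 px, so the scan is slightly tilted.
top_left, top_right = (100, 50), (1100, 85)
angle = skew_angle_degrees(top_left, top_right)

# Rotating by the negative angle puts both corners on the same row again.
center = (600, 400)
fixed_left = rotate_point(top_left, center, -angle)
fixed_right = rotate_point(top_right, center, -angle)
```

In a real application you would pass `-angle` to the rotation routine of whatever imaging library you use. Check its sign convention first, since some libraries rotate counter-clockwise for positive angles and image coordinates have y pointing down.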
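
And for the last two steps (relative coordinates, then deciding whether a box is checked), a rough Python sketch follows. Everything in it is an assumption for illustration: the image is a plain list of rows of grayscale values (0 = black, 255 = white), the function names, the border width and both thresholds are invented, and a fixed darkness threshold stands in for the histogram equalization mentioned above.

```python
def relative_to_absolute(rel_box, width, height):
    """Convert (left, top, right, bottom) fractions in 0..1 to pixel coordinates."""
    left, top, right, bottom = rel_box
    return (
        round(left * width),
        round(top * height),
        round(right * width),
        round(bottom * height),
    )


def crop(image, box):
    """Cut a (left, top, right, bottom) pixel box out of a list-of-rows image."""
    left, top, right, bottom = box
    return [row[left:right] for row in image[top:bottom]]


def is_checked(cell, border=2, dark_threshold=128, min_dark_ratio=0.05):
    """Guess whether a cropped checkbox contains a mark.

    The outer `border` pixels are skipped so the printed outline of the box
    is not counted. The thresholds are guesses and need tuning on real scans.
    """
    inner = [
        row[border:len(row) - border]
        for row in cell[border:len(cell) - border]
    ]
    total = sum(len(row) for row in inner)
    if total == 0:
        raise ValueError("checkbox crop is too small for this border width")
    dark = sum(1 for row in inner for value in row if value < dark_threshold)
    return dark / total >= min_dark_ratio


# Usage: a blank white 100x50 "scan" with a checkbox defined in relative terms.
page = [[255] * 100 for _ in range(50)]
box = relative_to_absolute((0.1, 0.2, 0.3, 0.6), width=100, height=50)
checked = is_checked(crop(page, box))
```

Because the box is stored as fractions of the page size, the same definition keeps working when a form is scanned at a different resolution, which is the whole point of avoiding absolute coordinates.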

