Re: Form data recognition?

MARTIN Pierre Wed, 07 Apr 2010 12:08:39 -0700

Hello Mrozik,

>> I need to develop .NET WinForms app for reading form data from scanned
>> documents. Is it possible to use tesseract in this scenario? (form?
>> region? recognition, recognition of test answers - marked checkbox
>> data?)
As DataIntelligence said, Tesseract cannot do that by itself. I have been 
looking for a very similar application myself, though, and I finally got 
what I needed by using Tesseract and writing specific code around it.
For example, I assume scanned forms always follow the same "kind" of layout. 
My application first determines the "type" of form using colorimetry (our 
forms share a lot of data, but the "types" are essentially defined by a 
barcode and colors). For your purpose, I would suggest writing some 
classification code that first determines the type.


Once the type is known, you also know the layout of the sentences and 
checkboxes. In my application, I wrote an object that extracts those 
rectangles and passes them one by one to Tesseract. If you know where your 
checkboxes are, the process is almost the same: extract those rectangles and 
run a histogram equalization on each of them, so you end up with a "black" 
box outline and a "white" inside. Then you just have to check whether the 
inside is really "white" or whether it contains some kind of mark.
This will probably require a different approach depending on the context. 
For instance, if the pens provided for filling in the form are colored, and 
your scans are in color too, then determining whether a box is checked is 
easier.
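
A minimal sketch of that checkbox test, in Python purely for illustration. 
The function names, the 20% border margin, and the 5% dark-pixel threshold 
are assumptions made for this example, not values from a real application, 
and the cropped checkbox is assumed to be a list of rows of 0-255 gray 
values. The thresholds would need tuning on real scans.

```python
def equalize(gray):
    """Histogram-equalize a grayscale image given as rows of 0-255 ints."""
    hist = [0] * 256
    for row in gray:
        for v in row:
            hist[v] += 1
    total = sum(hist)
    # Cumulative distribution of the gray levels.
    cdf, running = [], 0
    for count in hist:
        running += count
        cdf.append(running)
    cdf_min = next(c for c in cdf if c > 0)
    if total == cdf_min:
        # Single gray level: nothing to stretch.
        return [row[:] for row in gray]
    lut = [max(0, round((c - cdf_min) * 255 / (total - cdf_min))) for c in cdf]
    return [[lut[v] for v in row] for row in gray]


def is_checked(gray, margin=0.2, dark_below=128, min_dark_ratio=0.05):
    """Return True if the inside of a cropped checkbox contains a mark.

    margin: fraction of each side skipped so the box outline is ignored
            (hypothetical value).
    min_dark_ratio: share of dark inner pixels that counts as a mark
            (hypothetical value).
    """
    eq = equalize(gray)
    h, w = len(eq), len(eq[0])
    top, left = int(h * margin), int(w * margin)
    inner = [v for row in eq[top:h - top] for v in row[left:w - left]]
    dark = sum(1 for v in inner if v < dark_below)
    return dark / len(inner) > min_dark_ratio
```

Skipping a margin on each side keeps the printed outline from being counted 
as a mark, at the cost of missing marks that only touch the edge of the box.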

So I'd say, from what you seem to need:
- You don't need Tesseract, unless you want to read the "questions" of your 
forms.
- First, write an algorithm to rotate your scans so they are properly 
"horizontal". This saves a lot of computing for type recognition and 
checkbox extraction. It's easy: find, for example, two corners of the form 
in your image, calculate the slope of the line between them, and take 
atan(slope) to get the angle. Then rotate your image by the negative of that 
angle and you're set.
- Next, try to classify by hand the "types" of forms your application will 
process (one layout = one type).
- Based on that, try to find the fastest, most efficient way to make a 
program recognize those "types".
- Then, once you've got the type, extract the checkboxes. Don't use absolute 
coordinates; use relative ones instead (typically, x/y as a percentage of 
width/height rather than an absolute pair of values).
- Recognize the content of the checkboxes.
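
The deskew angle and the relative coordinates from the list above can be 
sketched as follows, again in Python for illustration only. The function 
names and the (x, y, w, h) rectangle layout are assumptions. math.atan2 is 
used in place of atan(slope) so that a vertical line does not cause a 
division by zero; when the second point lies to the right of the first, it 
gives the same angle.

```python
import math


def skew_angle_degrees(point_a, point_b):
    """Angle of the line through two points that should lie on a horizontal.

    Points are (x, y) pixel coordinates, e.g. two detected corners of the
    form. The result is in degrees.
    """
    (x1, y1), (x2, y2) = point_a, point_b
    return math.degrees(math.atan2(y2 - y1, x2 - x1))


def to_pixels(rel_rect, width, height):
    """Convert a relative (x, y, w, h) rectangle into pixel coordinates.

    Each value in rel_rect is a fraction (0.0 to 1.0) of the page width
    or height, so the same template works at any scan resolution.
    """
    rx, ry, rw, rh = rel_rect
    return (round(rx * width), round(ry * height),
            round(rw * width), round(rh * height))


# Hypothetical usage: two corners found on a slightly tilted scan.
angle = skew_angle_degrees((100, 200), (1100, 230))
# Rotate the image by -angle with your imaging library; check its sign
# convention first, since image y axes usually point downward.

# Hypothetical checkbox at 25% / 50% of the page, on a 2000x3000 scan.
box = to_pixels((0.25, 0.5, 0.1, 0.05), 2000, 3000)
```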

Hope this helps,
Pierre.
