I think the matters discussed in this thread are quite interesting,
though the answers I'd give may seem too simple. Tesseract is a great
OCR system, but I must confess I only explore particular aspects of
its behavior when a problem forces me to, and for a number of reasons
I try to avoid digging into its code, so I'd much appreciate it if any
of the lead developers corrected me where I'm wrong.

In this reply I'll try to address the "lo/hi-res problem." At some
point this problem arises for everyone who tries to push Tesseract
a bit beyond its conventional usage.

I won't give you any ready-made recipes here, and if you're serious
about using Tesseract I suggest reading the document at
http://tesseract-ocr.googlecode.com/svn/trunk/doc/tesseracticdar2007.pdf,
especially chapters 5 and 7, and also US Patent 5237627. These
documents are relatively old and Tesseract is constantly evolving, but
they give you a good starting point for exploring Tesseract's logic
*in code*. Yes, you would have to explore it in code; the articles
(patents, forum posts, whatever you'll find) won't give you the whole
picture.

Now back to real life. Every "blob" (a connected component (CC) or a
group of CCs) in Tesseract is, before training or recognition, scaled
to a universal coordinate space, which is then used to extract
"features." Features are essentially segments of the polygonal
approximation of a CC's outline. When Tesseract is trained on a
character, these features are extracted from its outline and saved as
a "prototype." When an unknown character undergoes recognition, its
features are compared to the features of every stored prototype (with
some optimization techniques so this process stays fast, even though
the number of prototypes can be really big).
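To make the scaling-plus-features idea concrete, here's a minimal
sketch. This is my own illustration, not Tesseract's actual code: the
function names and the 256-unit target box are assumptions, and real
Tesseract features carry more information than this.

```python
import math

def normalize_outline(points, size=256):
    # Scale an outline (list of (x, y) points) into a fixed "universal"
    # box, preserving aspect ratio. The 256-unit box is an assumption.
    xs = [p[0] for p in points]
    ys = [p[1] for p in points]
    w = max(xs) - min(xs) or 1
    h = max(ys) - min(ys) or 1
    scale = size / max(w, h)
    return [((x - min(xs)) * scale, (y - min(ys)) * scale)
            for x, y in points]

def segment_features(points):
    # Treat each edge of the closed polygon as a feature:
    # (midpoint x, midpoint y, direction angle).
    feats = []
    for (x1, y1), (x2, y2) in zip(points, points[1:] + points[:1]):
        feats.append(((x1 + x2) / 2, (y1 + y2) / 2,
                      math.atan2(y2 - y1, x2 - x1)))
    return feats

outline = [(0, 0), (10, 0), (10, 20), (0, 20)]  # a crude rectangular blob
feats = segment_features(normalize_outline(outline))
```

However coarse, this captures the key point: once everything lives in
the same coordinate space, features from glyphs of any input size can
be compared directly.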

So what do these facts add to the lo/hi-res problem? Let's consider
two cases. First: Tesseract is trained with lo-res images in order to
also recognize lo-res images. The low resolution of the training data
means little information is available to build prototypes. This leads
to much similarity between the prototypes of characters that look
relatively close at low resolution. Given that the characters being
matched are also low-res, the outcome in some cases is effectively a
random value. Hence, theoretically, we get low recognition accuracy.
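A toy way to see this effect: quantizing outline coordinates to a
coarse grid, as a stand-in for low scanning resolution, can make two
glyphs that differ only in a small detail collapse into the same point
set. Purely illustrative, not Tesseract code; the glyph shapes and
grid sizes are made up.

```python
def quantize(points, cell):
    # Snap coordinates to a grid with the given cell size, simulating
    # the information lost at low resolution.
    return sorted({(round(x / cell), round(y / cell)) for x, y in points})

glyph_a = [(0, 0), (10, 0), (10, 20), (0, 20)]
glyph_b = [(0, 0), (10, 0), (10, 20), (2, 20), (0, 18)]  # clipped corner

fine   = (quantize(glyph_a, 1),  quantize(glyph_b, 1))   # hi-res: distinct
coarse = (quantize(glyph_a, 10), quantize(glyph_b, 10))  # lo-res: merged
```

At the fine grid the two outlines differ; at the coarse grid they are
identical, so prototypes built from them would be too.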

Now consider hi-res images used for training and lo-res images used
for recognition. We still have little information for building the
features of the characters being matched, so Tesseract can deem some
of them similar. But here we exclude any source of confusion on the
prototype side, since the prototypes are built with the greatest
degree of detail. Poor-quality outlines are matched against "ideal"
outlines, so in most cases the result is not "so random."
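This asymmetry can be sketched with a crude nearest-prototype matcher.
Again, this is my own toy and not Tesseract's classifier: noisy
features from a degraded glyph are still pulled toward the detailed
prototype they came from, as long as the prototypes themselves stay
distinct.

```python
def distance(f1, f2):
    # Euclidean distance between two feature vectors.
    return sum((a - b) ** 2 for a, b in zip(f1, f2)) ** 0.5

def classify(unknown, prototypes):
    # Each unknown feature is scored against its nearest prototype
    # feature; the prototype with the lowest total cost wins.
    def cost(name):
        return sum(min(distance(u, p) for p in prototypes[name])
                   for u in unknown)
    return min(prototypes, key=cost)

prototypes = {
    "I": [(128, 0), (128, 128), (128, 256)],           # straight stroke
    "J": [(128, 0), (128, 128), (96, 240), (32, 256)], # hooked bottom
}
degraded = [(120, 10), (131, 130), (125, 250)]  # noisy lo-res "I"
```

With detailed prototypes the noisy sample still lands on the right
character; had both prototypes been degraded too, the two costs could
easily tie.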

That's why I think it's better to use high-resolution images for
training and low-resolution images for recognition. But that's only
when you're forced to choose. Although often unrealistic, the best
scenario is to train on and recognize images of the same decent
quality, which gives predictable and accurate results. For
low-resolution recognition, beyond this theoretical reasoning there's
much room for experimentation; although logically sound, my
conclusions can prove wrong for some specific characters. You might
also want to resort to a dictionary or context.
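As a sketch of the dictionary idea (my own suggestion, not a built-in
Tesseract facility): when the classifier wavers between visually
similar characters, expand the per-position alternatives into
candidate words and keep the first one a lexicon accepts.

```python
from itertools import product

def resolve(slots, lexicon):
    # slots: per-position character alternatives in confidence order,
    # e.g. [["1", "l"], ["o", "0"], ["w"]]. Try every combination and
    # return the first word the lexicon accepts.
    for chars in product(*slots):
        word = "".join(chars)
        if word in lexicon:
            return word
    # No dictionary hit: fall back to the top choice in each slot.
    return "".join(s[0] for s in slots)

lexicon = {"low", "high", "lot"}          # stand-in word list
slots = [["1", "l"], ["o", "0"], ["w"]]   # OCR confusions: l/1, o/0
```

Here the ambiguous "1ow"/"l0w" readings resolve to "low" because only
that combination is a real word; real systems weight candidates by
classifier confidence rather than brute-forcing every combination.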

HTH

Warm regards,
Dmitri Silaev
www.CustomOCR.com





On Sun, Mar 4, 2012 at 10:02 PM, Falke <[email protected]> wrote:
> My subject looks deceptively like a stupid question -- but it really
> isn't:
>
> Supposing you need to recognize a bunch of existing scanned documents,
> which are relatively low-resolution. You can not obtain higher
> resolution versions, and are stuck with the low one, having to make
> do.
>
> However, it's not TOO low for SOME degree of accuracy (let's say --
> 75%,  with packaged languages), so you're not giving up just yet.
>
> ADDITIONALLY, you DO have a high-rez scan sample of a document that
> has exactly the same font(s)/typeset as your low-resolution scans
> (just not the content)
>
> So, my question is:
>
> When you train, is it better to:
>
> 1) Use the high-resolution sample to create your boxes? As I see it,
> this would yield boxes and training data that represents the target
> typeset with higher precision BUT THEORETICALLY -- their theoretical
> ideal form, rather than their degraded shapes as seen in the low-rez
> pbm file.
>
> 2)  Use the low-resolution sample to create your boxes and train?
> Your boxes should then be closer to the degraded version of the
> typeset, as seen in your low-rez documents.  Right?
>
> 3) Combine high-rez with low-rez? ( As to what proportions of the two
> -- that would be the subsequent question here, if #3 is the best
> approach.)
>
> Perhaps the answer would stem from whether degradation (in low-rez)
> happens (has happened) chaotically, randomly (to some degree), as
> opposed to consistently, uniformly.   In other words, does the lower-
> resolution scanning produce too much random variation in form, which
> is hard to "reel back in", to reassemble into paragonal uniformity, by
> means of box training.  (So, then, you'd let tesseract do its glyph-by-
> glyph computation/guess that a certain glyph is a degraded version of
> the ideal  stored in the training data)
>
> And the above, it seems, would depend on tesseract's internal
> algorithms...
>
> any thoughts on the matter?
>
> TIA
>

-- 
You received this message because you are subscribed to the Google
Groups "tesseract-ocr" group.
To post to this group, send email to [email protected]
To unsubscribe from this group, send email to
[email protected]
For more options, visit this group at
http://groups.google.com/group/tesseract-ocr?hl=en
