better to train with low-quality or high-quality scans?

Falke Sun, 04 Mar 2012 18:51:58 -0800

My subject looks deceptively like a stupid question -- but it really
isn't:


Suppose you need to recognize a batch of existing scanned documents
that are relatively low-resolution. You cannot obtain higher-resolution
versions, so you have to make do with what you have.

However, the resolution is not TOO low for SOME degree of accuracy
(say, about 75% with the packaged languages), so you're not giving up
just yet.

ADDITIONALLY, you DO have a high-rez scan sample of a document that
uses exactly the same font(s)/typeface as your low-resolution scans
(just not the same content).

So, my question is:

When you train, is it better to:

1) Use the high-resolution sample to create your boxes? As I see it,
this would yield boxes and training data that represent the target
typeface with higher precision, BUT only in its theoretical ideal
form, rather than the degraded shapes seen in the low-rez pbm file.

2) Use the low-resolution sample to create your boxes and train?
Your boxes should then be closer to the degraded version of the
typeface, as seen in your low-rez documents. Right?

3) Combine high-rez with low-rez? (In what proportions would be the
follow-up question, if #3 is the best approach.)

Perhaps the answer depends on whether the degradation in the low-rez
scans happened chaotically (randomly, to some degree) or consistently
and uniformly. In other words, does lower-resolution scanning produce
so much random variation in glyph shape that box training cannot
"reel it back in" to a uniform model? (In that case you'd train on
the ideal forms and let tesseract make its glyph-by-glyph guess that
a given glyph is a degraded version of the ideal stored in the
training data.)
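
To make the "random vs. consistent" question a bit more concrete, here
is a rough sketch (plain Python, nothing to do with tesseract's actual
internals) that counts how many different low-rez shapes a single ideal
glyph can turn into, depending only on where it happens to fall on the
coarser pixel grid. Everything here is an assumption for illustration:
the function names are made up, the glyph is a list of rows of 0/1, and
"scanning" is modeled as block averaging plus a 50% threshold, which is
much simpler than a real scanner.

```python
def place(glyph, dx, dy, factor):
    """Copy a binary glyph onto a blank canvas, shifted by (dx, dy).

    The canvas sides are padded up to a multiple of `factor` so the
    coarse grid always tiles it exactly. Assumes 0 <= dx, dy < factor.
    """
    h, w = len(glyph), len(glyph[0])
    canvas_h = -(-(h + factor) // factor) * factor  # ceiling to a multiple
    canvas_w = -(-(w + factor) // factor) * factor
    canvas = [[0] * canvas_w for _ in range(canvas_h)]
    for y in range(h):
        for x in range(w):
            canvas[y + dy][x + dx] = glyph[y][x]
    return canvas


def downsample(canvas, factor, threshold=0.5):
    """Hypothetical scan model: average factor x factor blocks, then threshold."""
    out = []
    for top in range(0, len(canvas), factor):
        row = []
        for left in range(0, len(canvas[0]), factor):
            ink = sum(canvas[y][x]
                      for y in range(top, top + factor)
                      for x in range(left, left + factor))
            row.append(1 if ink >= threshold * factor * factor else 0)
        out.append(row)
    return out


def crop(bitmap):
    """Trim empty border rows/columns; return a hashable tuple of tuples."""
    rows = [y for y, r in enumerate(bitmap) if any(r)]
    cols = [x for x in range(len(bitmap[0])) if any(r[x] for r in bitmap)]
    if not rows:
        return ()
    return tuple(tuple(bitmap[y][cols[0]:cols[-1] + 1])
                 for y in range(rows[0], rows[-1] + 1))


def distinct_low_res_shapes(glyph, factor):
    """Set of cropped low-rez shapes over every grid offset of the glyph."""
    return {crop(downsample(place(glyph, dx, dy, factor), factor))
            for dy in range(factor)
            for dx in range(factor)}


if __name__ == "__main__":
    block = [[1] * 4 for _ in range(4)]  # a toy 4x4 "glyph"
    print(len(distinct_low_res_shapes(block, 2)), "distinct shapes at factor 2")
```

If one ideal glyph yields only a handful of low-rez shapes, the
degradation is fairly consistent in this toy model; if the count grows
quickly with the factor, it looks more like the "chaotic" case. Real
scans add noise, skew and blur on top, so treat this only as a way to
think about the question, not as an answer to it.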

And the above, it seems, would depend on tesseract's internal
algorithms...

Any thoughts on the matter?

TIA


