My subject looks deceptively like a stupid question -- but it really isn't:
Suppose you need to recognize a batch of existing scanned documents that are relatively low-resolution. You cannot obtain higher-resolution versions, so you have to make do with what you have. The resolution is not too low for some degree of accuracy, though (say 75% with the packaged languages), so you're not giving up just yet.

Additionally, you do have a high-res scan sample of a document that uses exactly the same font(s)/typeset as your low-resolution scans (just not the same content).

So, my question is: when you train, is it better to:

1) Use the high-resolution sample to create your boxes? As I see it, this would yield boxes and training data that represent the target typeset with higher precision, but they would capture the glyphs' theoretical ideal form rather than their degraded shapes as seen in the low-res pbm file.

2) Use the low-resolution sample to create your boxes and train? Your boxes should then be closer to the degraded version of the typeset, as seen in your low-res documents. Right?

3) Combine high-res with low-res? (What proportions of the two to use would be the follow-up question, if #3 is the best approach.)

Perhaps the answer depends on whether the degradation in the low-res scans happened chaotically and randomly (to some degree), as opposed to consistently and uniformly. In other words, does lower-resolution scanning produce too much random variation in glyph shape, variation that is hard to "reel back in" to a uniform ideal by means of box training? (If so, you would train on the ideal and let tesseract make its glyph-by-glyph guess that a certain glyph is a degraded version of the ideal stored in the training data.)

And that, it seems, would depend on tesseract's internal algorithms... Any thoughts on the matter?

TIA
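To make the "random vs. uniform" part of the question concrete, here is a toy sketch in Python. It rests on an assumption of mine: that low-res scanning can be roughly modelled as averaging blocks of high-res pixels and then thresholding back to black and white. The names `downsample` and `GLYPH` are made up for illustration and have nothing to do with tesseract's own code. The sketch only shows that the same ideal glyph can come out as different low-res shapes depending on where it happens to fall on the sampling grid.

```python
def downsample(bitmap, factor, dx=0, dy=0, threshold=0.5):
    """Block-average a 0/1 bitmap by `factor`, then re-binarize.

    dx, dy shift the sampling grid (0 <= dx, dy < factor) to mimic the
    glyph landing at a different position relative to the scanner's pixels.
    This is an assumed toy model of low-res scanning, not a real scanner.
    """
    height = len(bitmap)
    width = len(bitmap[0])
    out = []
    for top in range(-dy, height, factor):
        row = []
        for left in range(-dx, width, factor):
            ink = 0
            for y in range(top, top + factor):
                for x in range(left, left + factor):
                    # Pixels outside the bitmap count as white paper.
                    if 0 <= y < height and 0 <= x < width:
                        ink += bitmap[y][x]
            # A low-res pixel is black if enough of its area was inked.
            row.append(1 if ink / (factor * factor) >= threshold else 0)
        out.append(row)
    return out


# Hypothetical "ideal" glyph: a vertical stroke 2 pixels wide on an 8x8 grid.
GLYPH = [[1 if x in (3, 4) else 0 for x in range(8)] for _ in range(8)]

if __name__ == "__main__":
    for dx in range(4):
        print("grid offset", dx)
        for row in downsample(GLYPH, 4, dx=dx):
            print("".join("#" if p else "." for p in row))
```

With the grid at offset 0 the stroke straddles two low-res pixels and drops out entirely, while at offset 2 it survives as a one-pixel-wide stroke. In this model the degradation is consistent for a given alignment, but it looks random across a page because each glyph lands on the grid differently.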

