Re: better to train with low-quality or high-quality scans?

Dmitri Silaev Thu, 08 Mar 2012 00:21:38 -0800

It looks like your second post underwent moderation later than
mine, but was written earlier...


Anyway, on speckles. When training, Tesseract also performs
segmentation (blocks, lines, words, characters), just as it does
during recognition. Part of segmentation is the culling of noise
CCs (connected components), but it is an obscure process: at some
point rejected CCs can be reverted to good CCs and vice versa. As a
result, some character CCs may end up close to noise CCs so that
their bounding boxes overlap. When this happens, Tesseract can either
report an error in the box file or treat the speckle as part of the
character's shape, in which case the character is trained incorrectly.

So the best approach is to clean up the image before passing it to
Tesseract. You can use ImageMagick or whatever tool you like.
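
To illustrate the kind of cleanup meant here, below is a minimal
sketch in plain Python that erases small connected components from a
binary image. It is not what Tesseract or ImageMagick does internally;
the function name remove_speckles, the list-of-lists image format and
the min_size threshold are all assumptions made for the example.

```python
from collections import deque


def remove_speckles(image, min_size=3):
    """Return a copy of a binary image with small components erased.

    image: list of rows, each a list of 0 (background) or 1 (ink).
    min_size: components with fewer pixels than this count as noise.
    Both the format and the threshold are illustrative assumptions.
    """
    height = len(image)
    width = len(image[0]) if height else 0
    cleaned = [row[:] for row in image]
    seen = [[False] * width for _ in range(height)]

    for y in range(height):
        for x in range(width):
            if image[y][x] != 1 or seen[y][x]:
                continue
            # Breadth-first search collects one 8-connected component.
            component = []
            queue = deque([(y, x)])
            seen[y][x] = True
            while queue:
                cy, cx = queue.popleft()
                component.append((cy, cx))
                for dy in (-1, 0, 1):
                    for dx in (-1, 0, 1):
                        ny, nx = cy + dy, cx + dx
                        if (0 <= ny < height and 0 <= nx < width
                                and image[ny][nx] == 1
                                and not seen[ny][nx]):
                            seen[ny][nx] = True
                            queue.append((ny, nx))
            # Erase the component only if it is smaller than the threshold.
            if len(component) < min_size:
                for cy, cx in component:
                    cleaned[cy][cx] = 0
    return cleaned


page = [
    [1, 0, 0, 0, 0],   # the lone pixel at top left is a speckle
    [0, 0, 1, 1, 0],
    [0, 0, 1, 1, 0],
    [0, 0, 0, 0, 1],   # touches the blob diagonally, so it is kept
]
cleaned_page = remove_speckles(page, min_size=3)
```

Note that a pure size threshold can also delete legitimate small marks
such as periods or the dots on "i", so the threshold has to be chosen
with the scan resolution and font size in mind.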

Warm regards,
Dmitri Silaev
www.CustomOCR.com



On Wed, Mar 7, 2012 at 9:11 PM, Falke <[email protected]> wrote:
>
>
> On Mar 5, 10:20 am, William Tozier <[email protected]> wrote:
>> On Mar 4, 2012, at 1:02 PM, Falke wrote:
>>
>> > any thoughts on the matter?
>>
>> For several months on a back burner until Summer, I've had a research 
>> project in which I've intentionally down-sampled high-resolution pages 
>> several different ways. So for example a 700dpi 8-bit grayscale original is 
>> dropped to 1-bit 300 dpi, but only after tiny resizing, rotation, 
>> noise-adding and other "degradations" of the original.
>>
>> The question I'm trying to address is: How can OCR of several low-resolution 
>> scans of the same page be combined to produce improved accuracy?
>>
>> The more interesting variant I'm also looking into is: Can a low-resolution 
>> scan be "improved" by creating variants whose OCR results are then combined 
>> in a similar way?
>>
>> The very very early results (working by hand in ABBYY and ImageMagick) are 
>> yes to both, but I haven't got anything general worked out yet. If you're 
>> feeling technical, my own inspiration comes from work with stochastic 
>> resonance (http://en.wikipedia.org/wiki/Stochastic_resonance), dithering 
>> methods (http://en.wikipedia.org/wiki/Dither), and super-resolution methods 
>> (http://en.wikipedia.org/wiki/Super-resolution).
>>
>
> Hands-down cool stuff (I've heard of super-resolution used to get more
> detail from surveillance videos, years ago -- the multi-frame way) ...
> I guess it's somewhat similar to my idea of re-assembling the ideal
> paragon from multiple degraded variants.
>
> Also, it seems to me that recognition was improved when I blew up the
> image in GIMP, then applied thresholding.  Perhaps GIMP uses
> dithering in its scaling-up algorithm(?)
>
>> I'm not aware of work in OCR using these methods, but for a number of 
>> reasons I'm intentionally avoiding the technical literature until I get a 
>> prototype working. Could be there's something in tesseract or one of the 
>> proprietary systems that uses these approaches.
>
> But back to my original question:  does anyone know if it is best to
> train with perfect samples?  How much noise is allowed in the samples
> -- random specks, and stuff.  Is it highly recommended to clean up all
> the specks, at least? (aside from the degradation issue).  Does
> tesseract have any noise cleaning routine in assembling the training
> data?  As in: if  you have 20 boxes of the same thing, and a couple of
> them have noise specks, would tesseract know that those specks were
> noise and purge them?  The latter is really secondary to my
> original question...
>
> thanks
>
> --
> You received this message because you are subscribed to the Google
> Groups "tesseract-ocr" group.
> To post to this group, send email to [email protected]
> To unsubscribe from this group, send email to
> [email protected]
> For more options, visit this group at
> http://groups.google.com/group/tesseract-ocr?hl=en

