On Mar 5, 10:20 am, William Tozier <[email protected]> wrote:
> On Mar 4, 2012, at 1:02 PM, Falke wrote:
>
> > any thoughts on the matter?
>
> For several months I've had a research project on a back burner (until
> summer) in which I intentionally down-sample high-resolution pages in
> several different ways. So, for example, a 700 dpi 8-bit grayscale
> original is dropped to 1-bit 300 dpi, but only after tiny resizing,
> rotation, noise-adding and other "degradations" of the original.
>
> The question I'm trying to address is: how can the OCR of several
> low-resolution scans of the same page be combined to produce improved
> accuracy?
>
> The more interesting variant I'm also looking into is: can a single
> low-resolution scan be "improved" by creating variants whose OCR results
> are then combined in a similar way?
>
> The very, very early results (working by hand in ABBYY and ImageMagick)
> are yes to both, but I haven't got anything general worked out yet. If
> you're feeling technical, my own inspiration comes from work on
> stochastic resonance
> (http://en.wikipedia.org/wiki/Stochastic_resonance), dithering methods
> (http://en.wikipedia.org/wiki/Dither), and super-resolution methods
> (http://en.wikipedia.org/wiki/Super-resolution).
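For what it's worth, the simplest way I can picture combining OCR output from several scans of the same page is per-character majority voting. A toy sketch in Python (assuming the outputs are already aligned to the same length, which real OCR output rarely is -- you'd need a sequence aligner first; this is just the voting idea, not anyone's actual pipeline):

```python
from collections import Counter

def vote(ocr_outputs):
    """Combine equal-length OCR strings by per-position majority vote.

    Real engine outputs differ in length (insertions/deletions), so a
    dynamic-programming alignment step would be needed before voting.
    """
    return "".join(
        Counter(chars).most_common(1)[0][0]
        for chars in zip(*ocr_outputs)
    )

# Three noisy readings of the same word; the errors fall in
# different positions, so the vote recovers the correct text.
scans = ["tesseract", "tesseroct", "tcsseract"]
print(vote(scans))  # -> tesseract
```

The key assumption is that different scans (or degraded variants) make errors in different places, so the errors get outvoted.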
Hands-down cool stuff. (I've heard of super-resolution being used, years
ago, to recover more detail from surveillance video -- the multi-frame
approach.) I guess it's somewhat similar to my idea of re-assembling an
ideal paragon from multiple degraded variants. Also, it seems to me that
recognition improved when I blew the image up in GIMP and then applied
thresholding. Perhaps GIMP uses dithering in its scaling-up algorithm?

> I'm not aware of work in OCR using these methods, but for a number of
> reasons I'm intentionally avoiding the technical literature until I get
> a prototype working. Could be there's something in tesseract or one of
> the proprietary systems that uses these approaches.

But back to my original question: does anyone know if it is best to train
with perfect samples? How much noise is allowed in the samples -- random
specks and the like? Is it highly recommended to clean up all the specks,
at least (aside from the degradation issue)? Does tesseract have any
noise-cleaning routine in assembling the training data? That is: if you
have 20 boxes of the same character, and a couple of them have noise
specks, would tesseract recognize those specks as noise and purge them?

But the latter is really secondary to my original question... thanks
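On cleaning up specks before training: one common pre-pass is removing isolated foreground pixels from the binarized image, which kills single-pixel specks while leaving connected strokes intact. A minimal pure-Python illustration of that idea (this is NOT tesseract's internal routine -- just a sketch of what a despeckle step does):

```python
def despeckle(img):
    """Flip isolated foreground pixels (1 = ink, 0 = paper) to background.

    A one-pixel speck has no 8-connected ink neighbour, so it is removed;
    pixels that belong to a stroke keep at least one neighbour and stay.
    Illustration only -- not what tesseract does internally.
    """
    h, w = len(img), len(img[0])
    out = [row[:] for row in img]
    for y in range(h):
        for x in range(w):
            if img[y][x] == 1:
                neighbours = sum(
                    img[ny][nx]
                    for ny in range(max(0, y - 1), min(h, y + 2))
                    for nx in range(max(0, x - 1), min(w, x + 2))
                    if (ny, nx) != (y, x)
                )
                if neighbours == 0:
                    out[y][x] = 0
    return out

# A vertical stroke plus one lone speck near the right edge.
page = [
    [0, 1, 0, 0],
    [0, 1, 0, 0],
    [0, 1, 0, 1],  # the 1 at the right edge is an isolated speck
    [0, 1, 0, 0],
]
cleaned = despeckle(page)  # stroke survives, speck is gone
```

Whether you should do this to training samples is exactly the open question above; it only answers what "cleaning up the specks" would mechanically look like.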

