On Mar 4, 2012, at 1:02 PM, Falke wrote:

> any thoughts on the matter?

For several months I've had a research project on the back burner (until summer) in which I've intentionally down-sampled high-resolution pages in several different ways. For example, a 700 dpi 8-bit grayscale original is dropped to 1-bit 300 dpi, but only after slight resizing, rotation, noise-adding, and other 
"degradations" of the original.
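In case it helps to see the idea concretely, here's a minimal sketch of that degradation step in plain Python, using a list-of-lists grayscale image as a stand-in for real page data. (My actual pipeline is ImageMagick by hand; all function names here are hypothetical, and the parameters are just placeholders.)

```python
import random

def add_noise(img, sigma=10.0, seed=0):
    """Add Gaussian noise to an 8-bit grayscale image (list of rows)."""
    rng = random.Random(seed)
    return [[min(255, max(0, int(px + rng.gauss(0, sigma)))) for px in row]
            for row in img]

def downsample(img, factor=2):
    """Average non-overlapping factor x factor blocks (crude resolution drop)."""
    return [[sum(img[y * factor + dy][x * factor + dx]
                 for dy in range(factor) for dx in range(factor)) // factor ** 2
             for x in range(len(img[0]) // factor)]
            for y in range(len(img) // factor)]

def threshold(img, cutoff=128):
    """Drop to 1-bit: 0 (black) or 1 (white)."""
    return [[1 if px >= cutoff else 0 for px in row] for row in img]

# One degraded variant: noise first, then downsample, then binarize.
page = [[200] * 8 for _ in range(8)]   # stand-in for a high-res scan region
variant = threshold(downsample(add_noise(page, seed=1)))
```

The point is only the ordering: the noise goes in *before* the resolution drop, which is what makes each variant's quantization errors differ.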

The question I'm trying to address is: how can OCR of several low-resolution 
scans of the same page be combined to produce improved accuracy?

The more interesting variant I'm also looking into is: Can a low-resolution 
scan be "improved" by creating variants whose OCR results are then combined in 
a similar way?
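The simplest combination rule I've tried by hand is per-position plurality voting over the OCR output strings. A hedged sketch (this assumes the readings are already aligned and equal-length; real OCR outputs would need sequence alignment first, which this ignores):

```python
from collections import Counter

def majority_vote(readings):
    """Combine equal-length, pre-aligned OCR readings by plurality vote
    at each character position."""
    assert len({len(r) for r in readings}) == 1, "readings must be aligned"
    return "".join(Counter(chars).most_common(1)[0][0]
                   for chars in zip(*readings))

# Three noisy readings of the same text; each error is outvoted 2-to-1.
combined = majority_vote(["cat page", "cot page", "cat paqe"])  # → "cat page"
```

With only a true plurality at each position this is easy; the interesting cases are ties and misaligned insertions/deletions, which I haven't worked out yet.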

The very early results (working by hand in ABBYY and ImageMagick) are yes 
to both, but I haven't got anything general worked out yet. If you're feeling 
technical, my own inspiration comes from work with stochastic resonance 
(http://en.wikipedia.org/wiki/Stochastic_resonance), dithering methods 
(http://en.wikipedia.org/wiki/Dither), and super-resolution methods 
(http://en.wikipedia.org/wiki/Super-resolution).

I'm not aware of work in OCR using these methods, but for a number of reasons 
I'm intentionally avoiding the technical literature until I get a prototype 
working. Could be there's something in tesseract or one of the proprietary 
systems that uses these approaches.

Bill Tozier
Vague Innovation, LLC
[email protected]



-- 
You received this message because you are subscribed to the Google
Groups "tesseract-ocr" group.
To post to this group, send email to [email protected]
To unsubscribe from this group, send email to
[email protected]
For more options, visit this group at
http://groups.google.com/group/tesseract-ocr?hl=en
