On Mar 4, 2012, at 1:02 PM, Falke wrote:
> any thoughts on the matter?
For several months I've had a research project on a back burner (until summer) in which I intentionally down-sample high-resolution pages several different ways. For example, a 700 dpi 8-bit grayscale original is dropped to 1-bit 300 dpi, but only after tiny resizings, rotations, noise-adding, and other "degradations" of the original.

The question I'm trying to address is: how can OCR of several low-resolution scans of the same page be combined to produce improved accuracy? The more interesting variant I'm also looking into is: can a low-resolution scan be "improved" by creating variants whose OCR results are then combined in a similar way?

The very, very early results (working by hand in ABBYY and ImageMagick) are yes to both, but I haven't got anything general worked out yet.

If you're feeling technical, my own inspiration comes from work with stochastic resonance (http://en.wikipedia.org/wiki/Stochastic_resonance), dithering methods (http://en.wikipedia.org/wiki/Dither), and super-resolution methods (http://en.wikipedia.org/wiki/Super-resolution). I'm not aware of work in OCR using these methods, but for a number of reasons I'm intentionally avoiding the technical literature until I get a prototype working. It could be that there's something in Tesseract or one of the proprietary systems that uses these approaches.

Bill Tozier
Vague Innovation, LLC
[email protected]

-- 
You received this message because you are subscribed to the Google Groups "tesseract-ocr" group.
To post to this group, send email to [email protected]
To unsubscribe from this group, send email to [email protected]
For more options, visit this group at http://groups.google.com/group/tesseract-ocr?hl=en
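P.S. For the curious: the "combine the OCR results" step could be sketched roughly like the toy Python below. This is purely illustrative and not my actual pipeline; in particular it assumes the several readings are already character-aligned (equal length), which real scans would not be without an alignment step first.

```python
from collections import Counter

def combine_ocr(transcriptions):
    """Majority-vote combination of several OCR readings of one line.

    Illustrative sketch only: assumes the readings are already
    character-aligned; a real system would need alignment (e.g.
    dynamic programming) before voting.
    """
    longest = max(len(t) for t in transcriptions)
    padded = [t.ljust(longest) for t in transcriptions]
    voted = []
    for chars in zip(*padded):
        # Keep the most common character at this position.
        winner, _ = Counter(chars).most_common(1)[0]
        voted.append(winner)
    return "".join(voted).rstrip()

# Three noisy readings of the same text, each with a different error:
readings = ["stochastic res0nance",
            "stochast1c resonance",
            "stochastic resonanoe"]
print(combine_ocr(readings))  # -> stochastic resonance
```

Since each degraded variant tends to make its errors in different places, per-position voting recovers the original so long as no single position is misread by a majority of the variants.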

