On Mar 5, 10:20 am, William Tozier <[email protected]> wrote:
> On Mar 4, 2012, at 1:02 PM, Falke wrote:
>
> > any thoughts on the matter?
>
> For several months I've had a research project on a back burner (until
> summer) in which I intentionally down-sample high-resolution pages in
> several different ways. So, for example, a 700 dpi 8-bit grayscale
> original is dropped to 1-bit 300 dpi, but only after tiny resizing,
> rotation, noise-adding and other "degradations" of the original.
>
> The question I'm trying to address is: how can the OCR of several
> low-resolution scans of the same page be combined to produce improved
> accuracy?
>
> The more interesting variant I'm also looking into is: can a single
> low-resolution scan be "improved" by creating variants whose OCR results
> are then combined in a similar way?
>
> The very, very early results (working by hand in ABBYY and ImageMagick)
> are yes to both, but I haven't got anything general worked out yet. If
> you're feeling technical, my own inspiration comes from work on
> stochastic resonance
> (http://en.wikipedia.org/wiki/Stochastic_resonance), dithering methods
> (http://en.wikipedia.org/wiki/Dither), and super-resolution methods
> (http://en.wikipedia.org/wiki/Super-resolution).
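For what it's worth, the simplest way I can picture combining OCR output from several scans of the same page is per-character majority voting. A toy sketch in Python (assuming the outputs are already aligned to the same length, which real OCR output rarely is -- you'd need a sequence aligner first; this is just the voting idea, not anyone's actual pipeline):

```python
from collections import Counter

def vote(ocr_outputs):
    """Combine equal-length OCR strings by per-position majority vote.

    Real engine outputs differ in length (insertions/deletions), so a
    dynamic-programming alignment step would be needed before voting.
    """
    return "".join(
        Counter(chars).most_common(1)[0][0]
        for chars in zip(*ocr_outputs)
    )

# Three noisy readings of the same word; the errors fall in
# different positions, so the vote recovers the correct text.
scans = ["tesseract", "tesseroct", "tcsseract"]
print(vote(scans))  # -> tesseract
```

The key assumption is that different scans (or degraded variants) make errors in different places, so the errors get outvoted.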
Hands-down cool stuff. (I've heard of super-resolution being used, years
ago, to recover more detail from surveillance video -- the multi-frame
approach.) I guess it's somewhat similar to my idea of re-assembling an
ideal paragon from multiple degraded variants. Also, it seems to me that
recognition improved when I blew the image up in GIMP and then applied
thresholding. Perhaps GIMP uses dithering in its scaling-up algorithm?

> I'm not aware of work in OCR using these methods, but for a number of
> reasons I'm intentionally avoiding the technical literature until I get
> a prototype working. Could be there's something in tesseract or one of
> the proprietary systems that uses these approaches.

But back to my original question: does anyone know if it is best to train
with perfect samples? How much noise is allowed in the samples -- random
specks and the like? Is it highly recommended to clean up all the specks,
at least (aside from the degradation issue)? Does tesseract have any
noise-cleaning routine in assembling the training data? That is: if you
have 20 boxes of the same character, and a couple of them have noise
specks, would tesseract recognize those specks as noise and purge them?

But the latter is really secondary to my original question... thanks
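On cleaning up specks before training: one common pre-pass is removing isolated foreground pixels from the binarized image, which kills single-pixel specks while leaving connected strokes intact. A minimal pure-Python illustration of that idea (this is NOT tesseract's internal routine -- just a sketch of what a despeckle step does):

```python
def despeckle(img):
    """Flip isolated foreground pixels (1 = ink, 0 = paper) to background.

    A one-pixel speck has no 8-connected ink neighbour, so it is removed;
    pixels that belong to a stroke keep at least one neighbour and stay.
    Illustration only -- not what tesseract does internally.
    """
    h, w = len(img), len(img[0])
    out = [row[:] for row in img]
    for y in range(h):
        for x in range(w):
            if img[y][x] == 1:
                neighbours = sum(
                    img[ny][nx]
                    for ny in range(max(0, y - 1), min(h, y + 2))
                    for nx in range(max(0, x - 1), min(w, x + 2))
                    if (ny, nx) != (y, x)
                )
                if neighbours == 0:
                    out[y][x] = 0
    return out

# A vertical stroke plus one lone speck near the right edge.
page = [
    [0, 1, 0, 0],
    [0, 1, 0, 0],
    [0, 1, 0, 1],  # the 1 at the right edge is an isolated speck
    [0, 1, 0, 0],
]
cleaned = despeckle(page)  # stroke survives, speck is gone
```

Whether you should do this to training samples is exactly the open question above; it only answers what "cleaning up the specks" would mechanically look like.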

