Tesseract can improve its accuracy if it can scan the document multiple times. The only way to achieve this benefit is to include the repeated scans as pages of a single multi-page TIFF. You can then identify individual instances of "hard" words by parsing the confidence values in the output.
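As a rough illustration of the confidence-parsing step, here is a minimal Python sketch. It assumes you have already run OCR on each page of the multi-page TIFF and parsed every page into a list of (word, confidence) pairs, that confidences are on a 0 to 100 scale, and that words line up by position across pages. The function name `merge_passes` and the `min_conf` threshold are hypothetical, chosen only for this example; they are not part of Tesseract.

```python
from collections import defaultdict


def merge_passes(passes, min_conf=60.0):
    """Combine several OCR passes over the same text.

    passes:   list of passes; each pass is a list of (word, confidence)
              pairs, assumed to be aligned by word position.
    min_conf: hypothetical threshold (0-100 scale assumed) below which
              a word is reported as "hard".

    Returns (merged_words, hard_words).
    """
    merged = []
    hard = []
    # zip() stops at the shortest pass, so trailing extra words are ignored.
    for position, readings in enumerate(zip(*passes)):
        # Sum the confidence each candidate spelling received across passes.
        totals = defaultdict(float)
        for word, conf in readings:
            totals[word] += conf
        # Keep the spelling with the largest summed confidence.
        best_word = max(totals, key=totals.get)
        # Summed confidence of the winner divided by the number of passes.
        mean_conf = totals[best_word] / len(readings)
        merged.append(best_word)
        # A word is "hard" if the passes disagree or the support is weak.
        if len(totals) > 1 or mean_conf < min_conf:
            hard.append((position, best_word, dict(totals)))
    return merged, hard


# Made-up sample data: three passes over the same two words.
sample_passes = [
    [("STOP", 91.0), ("AHEAD", 48.0)],
    [("STOP", 88.0), ("AHFAD", 35.0)],
    [("STOP", 93.0), ("AHEAD", 52.0)],
]
merged_words, hard_words = merge_passes(sample_passes)
print(" ".join(merged_words))
for position, word, candidates in hard_words:
    print(position, word, candidates)
```

Summing confidence per spelling is just one simple voting rule; a plain majority vote or a per-character comparison may suit fuzzy block capitals better.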
On Wed, Oct 24, 2012 at 9:37 PM, Phlip <[email protected]> wrote:
> Tesseractors:
>
> We are using Tesseract for an outside-of-the-box situation - not
> scanning neatly typed documents.
>
> Our situation is a fuzzy, low-contrast picture. But - even when I use
> many image enhancements, such as leveling the colors, blurring them,
> improving the contrast, shrinking the image, etc, I still get the same
> situation.
>
> One scan will OCR correctly into text, and the next will contain
> garbage. Specifically, even the tiniest difference in image
> enhancement, such as bumping the contrast from 49% to 51%, can cause
> this effect. It's as if tesseract is sensitive to one pixel's
> difference.
>
> I'm aware this is a FAQ, and I have read all the traffic I can find on
> it. Maybe, for example, if I could declare a required font size, then
> tesseract would engage on the first correct letter, instead of the
> first stray pixel, and get the scan right more often.
>
> (Yes, we could dive into the learning system, and learn us a fuzzy
> block-capitals font. But the next input object could possibly use a
> slightly different font, so we'd be back to square-one!)
>
> So, how to get a more stable, reproducible scan?
>
> --
> Phlip
> http://c2.com/cgi/wiki?ZeekLand
>
> --
> You received this message because you are subscribed to the Google
> Groups "tesseract-ocr" group.
> To post to this group, send email to [email protected]
> To unsubscribe from this group, send email to
> [email protected]
> For more options, visit this group at
> http://groups.google.com/group/tesseract-ocr?hl=en

