I have a page image that's mostly an illustration, with one line of text above the (rectangular) illustration and a three-line italicized caption below it. The page is skewed about 1 degree clockwise (nothing unusual, or so I thought).
The top sentence gets recognized and the illustration is skipped, but the caption below the illustration is ALSO skipped. However, when I deskew the TIFF with "convert -deskew 40%", everything gets recognized (with some "routine" glyph-level misrecognition): tesseract does attempt to recognize the italicized caption below the illustration.

What does one make of this? Do we have to deskew the image outside of tesseract before running it? How much skew (in degrees) does tesseract tolerate?

Among other things, this post is a "heads-up!" because the failure is silent: partially missing text may not be noticed (by humans) until late in the project, or even after the "end product" is in use.
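For anyone who wants to script the workaround, here is a minimal Python sketch of the two steps: deskew with ImageMagick first, then hand the result to tesseract. It only wraps the "convert -deskew 40%" call mentioned above and the basic "tesseract IMAGE OUTPUTBASE" invocation. The helper names and the "_deskewed" file naming are my own assumptions, and it assumes both programs are on PATH. It does not say anything about how much skew tesseract tolerates on its own.

```python
import shutil
import subprocess
from pathlib import Path


def deskew_command(src, dst, threshold="40%"):
    """Build the ImageMagick call from the post: convert SRC -deskew 40% DST."""
    return ["convert", str(src), "-deskew", threshold, str(dst)]


def ocr_command(image, output_base):
    """Build the basic tesseract call; tesseract writes OUTPUT_BASE.txt."""
    return ["tesseract", str(image), str(output_base)]


def deskew_then_ocr(src, workdir="."):
    """Deskew SRC into WORKDIR, run tesseract on the copy, return the text path.

    Hypothetical helper: the "_deskewed" naming is only an illustration.
    """
    src = Path(src)
    deskewed = Path(workdir) / f"{src.stem}_deskewed{src.suffix}"
    output_base = Path(workdir) / src.stem

    for cmd in (deskew_command(src, deskewed), ocr_command(deskewed, output_base)):
        # Fail loudly if a tool is missing, rather than silently skipping a step.
        if shutil.which(cmd[0]) is None:
            raise FileNotFoundError(f"{cmd[0]} not found on PATH")
        subprocess.run(cmd, check=True)

    # Append ".txt" by hand so a dot inside the base name is not treated as a suffix.
    return Path(str(output_base) + ".txt")
```

Calling deskew_then_ocr("page.tif") on a hypothetical page.tif would write page_deskewed.tif and page.txt in the current directory. Given that the failure is silent, it may be worth comparing the text from the original and the deskewed image on a few sample pages before trusting either.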

