Hello,

we are trying to use Tesseract to recognize text in real-world images. We have a good text finder and a good binarization, and we feed Tesseract the already binarized image, but the binarized image still sometimes contains dirt. Tesseract seems to be quite "trigger happy" in such situations.

The text contains only uppercase letters and digits, all of the same height. Nevertheless, if there is dirt before or after the text or between the letters, Tesseract tries hard to fit a letter to it. Even small spots that are clearly of a very different size than the letters are recognized as "I", "J", "S" or other letters.

We have tried training the period "." and the asterisk "*" with some of the dirt, which helped somewhat. However, it is hard to get a representative training set for dirt.

So here are our questions:

1. Is there a way to make Tesseract less trigger happy in general, so that it disregards anything that does not look like any known character at all?

2. We have also tried binarizing with different thresholds and feeding each variant to Tesseract so that we could pick the result with the highest confidence. However, the confidence consistently increases with the number of characters, so the results with dirt consistently outperform those without. Is there a better way to combine results from different runs on the same image?

3. Are there other settings that might influence the result (any of the page segmentation modes, for example, or anything else)?

4. We have shied away from using PSM_SINGLE_CHAR and splitting the input into separate characters, as we were hoping the baseline algorithm would be able to figure out that some "characters" are actually far too small. What is your opinion?
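To make the problem in question 2 concrete, here is a minimal sketch of the selection step only. It assumes each run has already been reduced to a list of (character, confidence) pairs; the run names, the helper functions and all numbers are hypothetical and do not come from Tesseract. It also assumes the overall score behaves roughly like a sum over characters, which is only our reading of what we observe. A per-character mean is shown alongside purely for comparison, not as a known fix.

```python
from statistics import mean

# Hypothetical per-character results for two binarization thresholds.
# Each entry is (character, confidence in percent); the values are made up.
runs = {
    # Lower threshold: a speck of dirt in front of the text is read as "I".
    "threshold_100": [("I", 60.0), ("A", 90.0), ("B", 92.0), ("7", 88.0)],
    # Higher threshold: the dirt is gone and only the real text remains.
    "threshold_140": [("A", 90.0), ("B", 92.0), ("7", 88.0)],
}


def text_of(chars):
    """Join the recognized characters of one run into a string."""
    return "".join(ch for ch, _ in chars)


def total_confidence(chars):
    """Sum of per-character confidences; grows with every extra character."""
    return sum(conf for _, conf in chars)


def mean_confidence(chars):
    """Average per-character confidence; 0.0 for an empty result."""
    return mean([conf for _, conf in chars]) if chars else 0.0


def pick(runs, score):
    """Return the name of the run with the highest score."""
    return max(runs, key=lambda name: score(runs[name]))


by_total = pick(runs, total_confidence)
by_mean = pick(runs, mean_confidence)
print("total picks:", by_total, text_of(runs[by_total]))
print("mean picks: ", by_mean, text_of(runs[by_mean]))
```

With these made-up numbers the summed score prefers the run that contains the spurious "I", while the mean prefers the clean run. A low-confidence extra character pulls the mean down, but dirt that happens to be recognized with high confidence would still slip through.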
Thanks in advance for any help,
Marcus

