Marcus, I am interested in these problems as well. Have you received any help on this or made any progress that you can share?
Thanks,
M

On Wednesday, January 18, 2012 11:00:16 PM UTC-8, Speedy wrote:
>
> Hello,
> we are trying to use Tesseract to recognize text in real-world images.
> We have a good text finder and a good binarization and feed Tesseract
> the already binarized image, but it still happens that the binarized
> image contains some dirt.
> It seems that Tesseract is quite "trigger happy" in such situations.
> The text contains only upper case letters and digits, which are all of
> the same height. Nevertheless, if there is some sort of dirt before or
> after the text or in between the letters, Tesseract desperately tries
> to fit any letter in there. Even small spots that are clearly of a
> very different size than the letters are recognized as "I", "J", "S"
> or other letters. We have tried to train the period "." and the
> asterisk "*" with some of the dirt, which did help somewhat. However,
> it is hard to get a good representative training set for dirt.
> So here are our questions:
> 1. Is there a way to make Tesseract less trigger happy in general? To
> make it disregard stuff that does not look like any known character at
> all?
> 2. We have also tried to binarize with different thresholds and feed
> each variant to Tesseract so we could pick the result with the highest
> confidence. However, the confidence consistently increases with the
> number of characters and thus, the results with dirt consistently
> outperform those without. Is there a better way to combine results
> from different runs on the same image?
> 3. Are there other settings that might influence the result (any of
> the page segmentation modes for example, or anything else)?
> 4. We have shied away from using the PSM_SINGLE_CHAR and splitting the
> input into separate characters as we were hoping the baseline
> algorithm would be able to figure out that some "characters" are
> actually way too small. What is your opinion?
>
> Thanks in advance for any help,
> Marcus
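
Since the quoted message says all characters share one height and the dirt is clearly of a different size, one possible workaround is to drop the odd-sized blobs before the image ever reaches Tesseract. The snippet below is only a rough sketch of that idea, not something from Tesseract itself: the `filter_by_height` helper, the `(x, y, width, height)` box format, and the 30% tolerance are all assumptions made for illustration, and the boxes are presumed to come from whatever connected-component step the text finder already has.

```python
from statistics import median


def filter_by_height(boxes, tolerance=0.3):
    """Keep boxes whose height is within `tolerance` (a fraction) of the median height.

    `boxes` is a list of (x, y, width, height) tuples, one per connected
    component. Both the tuple layout and the default tolerance are
    illustrative assumptions.
    """
    if not boxes:
        return []
    # The median height stands in for the common letter height.
    reference = median(box[3] for box in boxes)
    return [box for box in boxes if abs(box[3] - reference) <= tolerance * reference]


# Hypothetical components: three letter-sized blobs and two small specks.
boxes = [
    (2, 30, 3, 4),     # speck
    (10, 5, 20, 40),   # letter
    (35, 5, 22, 41),   # letter
    (60, 6, 21, 39),   # letter
    (90, 20, 2, 3),    # speck
]
kept = filter_by_height(boxes)
print(kept)
```

The median only reflects the letter height while letters outnumber the specks, so on very dirty crops a known or expected character height would probably be a safer reference. Blobs that fail the check would then be erased from the binarized image before it is passed to Tesseract.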

