Hello,
we are trying to use Tesseract to recognize text in real-world images.
We have a good text finder and a good binarization and feed Tesseract
the already binarized image, but it still happens that the binarized
image contains some dirt.
It seems that Tesseract is quite "trigger-happy" in such situations.
The text contains only upper case letters and digits which are all of
the same height. Nevertheless, if there is some sort of dirt before or
after the text or in between the letters, Tesseract desperately tries
to fit any letter in there. Even small spots that are clearly of a
very different size than the letters are recognized as "I", "J", "S"
or other letters. We have tried to train the period "." and the
asterisk "*" with some of the dirt, which did help somewhat. However,
it is hard to get a good representative training set for dirt.
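
To illustrate the kind of size check we have in mind, here is a minimal
Python sketch. It is not part of Tesseract; remove_small_blobs,
text_height and min_ratio are our own hypothetical names. It assumes the
binarized image is a list of rows of 0/1 values (1 = ink) and that the
text finder can supply an approximate letter height. It drops every
connected component whose bounding box is much shorter than the letters.

```python
from collections import deque


def remove_small_blobs(image, text_height, min_ratio=0.5):
    """Return a copy of a binary image (rows of 0/1, 1 = ink) without
    connected components shorter than min_ratio * text_height.

    text_height and min_ratio are illustrative assumptions, not
    Tesseract parameters.
    """
    rows = len(image)
    cols = len(image[0]) if image else 0
    cleaned = [row[:] for row in image]
    seen = [[False] * cols for _ in range(rows)]

    for r in range(rows):
        for c in range(cols):
            if image[r][c] != 1 or seen[r][c]:
                continue
            # Breadth-first flood fill over 8-connected ink pixels.
            queue = deque([(r, c)])
            seen[r][c] = True
            pixels = []
            while queue:
                y, x = queue.popleft()
                pixels.append((y, x))
                for dy in (-1, 0, 1):
                    for dx in (-1, 0, 1):
                        ny, nx = y + dy, x + dx
                        if (0 <= ny < rows and 0 <= nx < cols
                                and image[ny][nx] == 1
                                and not seen[ny][nx]):
                            seen[ny][nx] = True
                            queue.append((ny, nx))
            # Bounding-box height of this component.
            height = max(p[0] for p in pixels) - min(p[0] for p in pixels) + 1
            if height < min_ratio * text_height:
                # Too short to be a letter: erase it from the copy.
                for y, x in pixels:
                    cleaned[y][x] = 0
    return cleaned
```

A pre-filter like this only catches dirt that is clearly smaller than
the letters; dirt of similar height would still reach Tesseract.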
So here are our questions:
1. Is there a way to make Tesseract less trigger-happy in general? To
make it disregard stuff that does not look like any known character at
all?
2. We have also tried to binarize with different thresholds and feed
each variant to Tesseract so we could pick the result with the highest
confidence. However, the confidence consistently increases with the
number of characters and thus, the results with dirt consistently
outperform those without. Is there a better way to combine results
from different runs on the same image?
3. Are there other settings that might influence the result (any of
the page segmentation modes for example, or anything else)?
4. We have shied away from using PSM_SINGLE_CHAR and splitting the
input into separate characters as we were hoping the baseline
algorithm would be able to figure out that some "characters" are
actually way too small. What is your opinion?
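
To make question 2 concrete, here is a minimal Python sketch of one
alternative we are considering: ranking the runs by the mean
per-character confidence instead of a total that grows with the number
of characters. It assumes that a confidence value per recognized
character can be obtained for each run; mean_confidence, pick_best_run
and the sample values are hypothetical and only for illustration.

```python
def mean_confidence(char_confidences):
    """Average per-character confidence; 0.0 for an empty result."""
    if not char_confidences:
        return 0.0
    return sum(char_confidences) / len(char_confidences)


def pick_best_run(runs):
    """Pick the run with the highest mean per-character confidence.

    runs is a list of (text, confidences) tuples, one per binarization
    threshold, where confidences holds one value per character.
    """
    return max(runs, key=lambda run: mean_confidence(run[1]))


# Made-up example: the second run has two extra low-confidence
# characters caused by dirt before and after the text.
runs = [
    ("AB12", [90, 92, 88, 91]),
    ("IAB12J", [40, 90, 92, 88, 91, 35]),
]
best_text, _ = pick_best_run(runs)
```

With a summed score the second run would win because it has more
characters; with the mean, the low-confidence "I" and "J" pull its
score down and the clean run is preferred.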

Thanks in advance for any help,
Marcus
