Marcus,

I am interested in these problems as well. Have you received any help on 
this or made any progress that you can share?

Thanks,
M

On Wednesday, January 18, 2012 11:00:16 PM UTC-8, Speedy wrote:
>
> Hello, 
> we are trying to use Tesseract to recognize text in real-world images. 
> We have a good text finder and a good binarization and feed Tesseract 
> the already binarized image, but it still happens that the binarized 
> image contains some dirt. 
> It seems that Tesseract is quite "trigger happy" in such situations. 
> The text contains only upper case letters and digits which are all of 
> the same height. Nevertheless, if there is some sort of dirt before or 
> after the text or in between the letters, Tesseract desperately tries 
> to fit any letter in there. Even small spots that are clearly of a 
> very different size than the letters are recognized as "I", "J", "S" 
> or other letters. We have tried to train the period "." and the 
> asterisk "*" with some of the dirt, which did help somewhat. However, 
> it is hard to get a good representative training set for dirt. 
> So here are our questions: 
> 1. Is there a way to make Tesseract less trigger happy in general? To 
> make it disregard stuff that does not look like any known character at 
> all? 
> 2. We have also tried to binarize with different thresholds and feed 
> each variant to Tesseract so we could pick the result with the highest 
> confidence. However, the confidence consistently increases with the 
> number of characters and thus, the results with dirt consistently 
> outperform those without. Is there a better way to combine results 
> from different runs on the same image? 
> 3. Are there other settings that might influence the result (any of 
> the page segmentation modes for example, or anything else)? 
> 4. We have shied away from using PSM_SINGLE_CHAR and splitting the 
> input into separate characters as we were hoping the baseline 
> algorithm would be able to figure out that some "characters" are 
> actually way too small. What is your opinion? 
>
> Thanks in advance for any help, 
> Marcus
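
For questions 1 and 2 above, one possible approach (a sketch only, not
something confirmed on this list) is to post-filter the symbols Tesseract
returns rather than change the engine: drop any symbol much shorter than
the typical one, then rank the threshold variants by mean per-symbol
confidence so that longer strings get no built-in bonus. The example
assumes each run has already been reduced to (character, confidence,
height) tuples, for instance from per-symbol confidences and bounding
boxes; the function names, the tuple layout and the 0.6 ratio are all
hypothetical.

```python
from statistics import mean, median

# Each symbol is a (character, confidence, height_in_pixels) tuple.
# This layout is an assumption made for illustration only.


def drop_undersized(symbols, min_ratio=0.6):
    """Discard symbols much shorter than the median symbol height.

    Assumes real characters outnumber dirt blobs, so the median height
    reflects the true text height. min_ratio is a guess to be tuned.
    """
    if not symbols:
        return []
    typical = median(height for _, _, height in symbols)
    return [s for s in symbols if s[2] >= min_ratio * typical]


def score(symbols):
    """Mean per-symbol confidence, so extra symbols earn no bonus."""
    return mean(conf for _, conf, _ in symbols) if symbols else 0.0


def best_run(runs):
    """Size-filter every run, then keep the one with the best mean."""
    cleaned = [drop_undersized(run) for run in runs]
    return max(cleaned, key=score)


def text_of(symbols):
    """Join the characters of a run into a string."""
    return "".join(ch for ch, _, _ in symbols)


if __name__ == "__main__":
    # Two made-up binarization variants of the same image.
    dirty = [("I", 40, 8), ("A", 85, 30), ("B", 80, 30),
             ("7", 84, 31), ("J", 35, 10)]
    clean = [("A", 90, 30), ("B", 88, 30), ("7", 92, 31)]
    print(text_of(best_run([dirty, clean])))
```

If dirt blobs outnumber real characters in a run, the median height is
no longer reliable; since all characters are known to have the same
height, a fixed expected height from the text finder may work better.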
