I was indeed in PSM_SINGLE_LINE mode, because my text is already segmented 
into lines. Changing to PSM_AUTO does help with the I-I issue, but the 
overall quality is still better with PSM_SINGLE_LINE: with PSM_AUTO I start 
getting all kinds of punctuation and other errors. I also tried disabling 
chopping, which led to disastrous results, because my glyphs are not 
guaranteed to be separated from one another.
I am still perplexed, though, as to how Tesseract ends up preferring 
I-Iercules over Hercules when Hercules is a dictionary word and the 
I-I -> H ambig rule is in place.
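
For reference, here is a small Python sketch of how I understand the 
unicharambigs line layout, so the rules can be double-checked before 
recombining. The helper name format_ambig is hypothetical and is not part of 
Tesseract. It assumes the v1 layout from the training documentation 
(tab-separated fields, space-separated characters within a field, and a 
trailing type of 0 for an optional substitution or 1 for a mandatory one), 
and it assumes every character in the string is a single unichar.

```python
def format_ambig(source, target, mandatory=False):
    """Build one unicharambigs line (assumed v1 layout, see note above).

    source, target: strings in which every character is treated as one
    unichar (an assumption; multi-character unichars are not handled).
    mandatory: False writes type 0, True writes type 1.
    """
    fields = [
        str(len(source)),          # number of unichars in the source
        " ".join(source),          # source unichars, space-separated
        str(len(target)),          # number of unichars in the target
        " ".join(target),          # target unichars, space-separated
        "1" if mandatory else "0", # substitution type
    ]
    return "\t".join(fields)


if __name__ == "__main__":
    # The two rules discussed in this thread, written as type 0 rules.
    print(format_ambig("6", "G"))
    print(format_ambig("I-I", "H"))
    # The same I-I -> H rule written as a mandatory (type 1) rule.
    print(format_ambig("I-I", "H", mandatory=True))
```

If my reading of the type field is right, a type 0 rule is only a hint to the 
search rather than a forced replacement, which might be part of why 
I-Iercules can still win.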

On Thursday, November 6, 2014 7:48:00 PM UTC-5, [email protected] wrote:
>
> Hello,
> I'm using tesseract 3.02 on Windows 7 and I started with the 
> eng.traineddata that was distributed with 3.02.
> Tesseract keeps misreading some symbols, specifically 6 instead of G, I-I 
> instead of H and a few others, so I'm getting 6od instead of God, 
> I-Iercules instead of Hercules and so on. I was hoping that using the 
> dictionary would help with this so I wouldn't have to retrain, because 
> after all it's just these few symbols, but nothing seems to help. So far 
> I've tried:
>
> - Cranking up the language_model_penalty_non_dict_word and 
>   language_model_penalty_non_freq_dict_word values in the config file
> - Adding "load_system_dawg T" and "load_freq_dawg T" to the config file 
>   (even though it's supposed to do that by default)
> - Adding the 6->G rule to unicharambigs (as "1 6 1 G 0") and recombining. 
>   The I-I -> H rule was already there.
> - Adding the words God and Hercules to the frequent word list and 
>   recombining (eng.freq-dawg).
> - Emptying both the word list (eng.word-dawg) and frequent word list 
>   (eng.freq-dawg) and putting just these two words in and recombining, just 
>   to see if it would make a difference. It didn't.
>
> Nothing I've done so far has helped, but it seems to me that the point of 
> using the dictionary is to deal with exactly this type of a situation, so I 
> feel like I must be missing something. Have I maybe missed a configuration 
> step?
>
> Thanks
>
>
