Hello,
I'm using tesseract 3.02 on Windows 7 and I started with the 
eng.traineddata that was distributed with 3.02.
Tesseract keeps misreading some symbols, specifically 6 instead of G, I-I 
instead of H and a few others, so I'm getting 6od instead of God, 
I-Iercules instead of Hercules and so on. I was hoping that using the 
dictionary would help with this so I wouldn't have to retrain, because 
after all it's just these few symbols, but nothing seems to help. So far 
I've tried:

- Cranking up the language_model_penalty_non_dict_word and 
  language_model_penalty_non_freq_dict_word values in the config file.
- Adding "load_system_dawg T" and "load_freq_dawg T" to the config file 
  (even though it's supposed to do that by default).
- Adding the 6->G rule to unicharambigs (as "1 6 1 G 0") and recombining. 
  The I-I -> H rule was already there.
- Adding the words God and Hercules to the frequent word list 
  (eng.freq-dawg) and recombining.
- Emptying both the word list (eng.word-dawg) and the frequent word list 
  (eng.freq-dawg), putting just these two words in and recombining, just 
  to see if it would make a difference. It didn't.
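
For illustration, a config file with the settings described above could look 
roughly like the sketch below. The two penalty values are assumed 
placeholders, not the values actually tried; only the parameter names and the 
two load_* lines come from the list above.

```
# Sketch of a Tesseract config file: one "name value" pair per line.
# The penalty values are illustrative assumptions only.
language_model_penalty_non_dict_word 0.5
language_model_penalty_non_freq_dict_word 0.3
load_system_dawg T
load_freq_dawg T
```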

Nothing I've done so far has helped, but it seems to me that the point of 
using the dictionary is to deal with exactly this type of situation, so I 
feel like I must be missing something. Have I missed a configuration step?

Thanks
