I was indeed in PSM_SINGLE_LINE mode, because my text is already segmented into lines. Switching to PSM_AUTO does help with the I-I issue, but the overall quality is still better with PSM_SINGLE_LINE: with PSM_AUTO I start getting all kinds of punctuation and other errors. I also tried disabling chopping, with disastrous results, since my glyphs are not guaranteed not to touch. I am still perplexed, though, about how tesseract ends up preferring I-Iercules over Hercules when Hercules is a dictionary word and the I-I -> H ambig rule is in place...
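For illustration only, the behavior one might expect from an ambig rule combined with a dictionary can be approximated as a post-processing pass over the OCR output. This is a minimal sketch and not how tesseract applies unicharambigs internally. The `AMBIGS` table, the `WORDS` set, and all function names are hypothetical stand-ins: the two substitutions and the two words are taken from this thread, and a real word list would be needed in practice.

```python
import re

# Hypothetical ambiguity table: (misread sequence, intended character).
# These mirror the two rules mentioned in the thread.
AMBIGS = [("I-I", "H"), ("6", "G")]

# Hypothetical stand-in for a dictionary; a real word list would go here.
WORDS = {"god", "hercules"}


def variants(token):
    """Return the token plus every string reachable by applying
    ambiguity substitutions at any combination of positions."""
    seen = {token}
    queue = [token]
    while queue:
        current = queue.pop()
        for wrong, right in AMBIGS:
            start = current.find(wrong)
            while start != -1:
                candidate = current[:start] + right + current[start + len(wrong):]
                if candidate not in seen:
                    seen.add(candidate)
                    queue.append(candidate)
                start = current.find(wrong, start + 1)
    return seen


def correct_token(token):
    """Replace a token only if exactly one variant is a dictionary word."""
    if token.lower() in WORDS:
        return token
    matches = sorted(v for v in variants(token) if v.lower() in WORDS)
    return matches[0] if len(matches) == 1 else token


def correct_line(line):
    """Apply correct_token to each run of letters, digits, and hyphens."""
    return re.sub(r"[A-Za-z0-9-]+", lambda m: correct_token(m.group(0)), line)


print(correct_line("6od and I-Iercules"))
```

Because a token is rewritten only when exactly one variant is a dictionary word, strings such as numbers that happen to contain a 6 are left untouched.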

On Thursday, November 6, 2014 7:48:00 PM UTC-5, [email protected] wrote:
>
> Hello,
> I'm using tesseract 3.02 on Windows 7 and I started with the eng.traineddata that was distributed with 3.02.
> Tesseract keeps misreading some symbols, specifically 6 instead of G, I-I instead of H and a few others, so I'm getting 6od instead of God, I-Iercules instead of Hercules and so on. I was hoping that using the dictionary would help with this so I wouldn't have to retrain, because after all it's just these few symbols, but nothing seems to help. So far I've tried:
>
> Cranking up the language_model_penalty_non_dict_word and language_model_penalty_non_freq_dict_word values in the config file
> Adding "load_system_dawg T" and "load_freq_dawg T" to the config file (even though it's supposed to do that by default)
> Adding the 6->G rule to unicharambigs (as "1 6 1 G 0") and recombining. The I-I -> H rule was already there.
> Adding the words God and Hercules to the frequent word list and recombining (eng.freq-dawg).
> Emptying both the word list (eng.word-dawg) and frequent word list (eng.freq-dawg) and putting just these two words in and recombining, just to see if it would make a difference. It didn't.
>
> Nothing I've done so far has helped, but it seems to me that the point of using the dictionary is to deal with exactly this type of a situation, so I feel like I must be missing something. Have I maybe missed a configuration step?
>
> Thanks

