p.s.

On Saturday, June 28, 2014 12:39:21 AM UTC-4, [email protected] wrote:
>
> 3) Attempted to increase the strength of dictionary matches as discussed
> on the FAQ (
> https://code.google.com/p/tesseract-ocr/wiki/FAQ#How_to_increase_the_trust_in/strength_of_the_dictionary?),
> both via API calls to setVariable and via a configuration file (tess-two
> uses tesseract 3.0.3):
>
> language_model_penalty_non_freq_dict_word 1
> language_model_penalty_non_dict_word 1
>
> However, I still occasionally get words that are three characters long and
> not in the dictionary, e.g. "C9" will be recognized as "129". When this
> happens it wreaks havoc with the base 16 decoding, as there are an odd
> number of hex digits. Since I can include additional error correction
> data, I'd be fine with dictionary words being hallucinated, but having
> three characters returned causes a problem.
>
> This makes me wonder if I am properly following the instructions to
> increase the strength of dictionary matches. In this case, I'd be happy to
> constrain results to strictly only dictionary words.
>
Since these are doubles, you might want to try 0.9 (or even 0.5) to make sure that you're not running into some type of boundary condition. I haven't played with them myself, so I'm not sure how they're handled internally.

Tom
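For the configuration-file route, that experiment would look roughly like the fragment below. It is only a sketch: the two parameter names are the ones quoted above, and 0.9 is an untested guess rather than a known-good setting. It assumes the usual one "name value" pair per line config format.

```
language_model_penalty_non_freq_dict_word 0.9
language_model_penalty_non_dict_word 0.9
```

If 0.9 behaves no differently from 1, swapping in 0.5 is the next thing to try. The same values can be passed as strings through setVariable.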

