Hi, I generated images in the Sanskrit 2003 font using text2image with its default configuration. I trained Tesseract on my own box files and compared the results with and without the dictionary DAWG.
Interestingly, using the dictionary DAWG increases word-level accuracy, but for certain words it gives a wrong word where the output without the dictionary was correct. For example, using the internal state debugger, I found that for an image of अब्ज I get अज (Rating=66.5974, Conf=-4.88797) with the dictionary, and अब्ज (Rating=33.2893, Conf=-2.93596) without it. Note that the non-dictionary word has the better rating and confidence (lower rating, confidence closer to zero). Clearly, Tesseract stops at the point in the dictionary where it finds अज and does not go on to try अब्ज. I saw the same behaviour with other such examples.

What I want to do is the following: I want Tesseract to give me the output with the best rating among the dictionary-based and non-dictionary-based recognition results, and I want this process to be automated for the whole book.

Any help in this regard will be deeply appreciated.

Thanks in advance,
Rohit
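
P.S. To make the selection step concrete, here is a minimal Python sketch of what I have in mind. It assumes that both passes (with and without the dictionary) split the page into the same words in the same order, and that a lower rating is better, as in the example above. The names `WordResult` and `pick_best` are hypothetical and only for illustration; how the per-word ratings are extracted from Tesseract is not shown here.

```python
from typing import List, NamedTuple


class WordResult(NamedTuple):
    """One recognized word with its scores (hypothetical container)."""
    text: str
    rating: float      # lower is better
    confidence: float  # closer to zero is better


def pick_best(with_dict: List[WordResult],
              without_dict: List[WordResult]) -> List[WordResult]:
    """Per word, keep whichever pass produced the lower rating.

    Assumes the two lists are aligned word for word. On a tie the
    dictionary result is kept.
    """
    if len(with_dict) != len(without_dict):
        raise ValueError("the two passes must contain the same number of words")
    return [d if d.rating <= n.rating else n
            for d, n in zip(with_dict, without_dict)]


# The example from this post: one word, recognized in both passes.
with_dict = [WordResult("अज", 66.5974, -4.88797)]
without_dict = [WordResult("अब्ज", 33.2893, -2.93596)]

best = pick_best(with_dict, without_dict)
print(best[0].text)
```

For the example word this keeps the non-dictionary result, because its rating is lower. For a whole book the same function would be applied page by page, provided the word segmentation of the two passes really does match.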

