[tesseract-ocr] URGENT HELP NEEDED: False recognition due to Dictionary usage in Sanskrit

rohit saluja Fri, 24 Jun 2016 16:41:37 -0700

Hi,

I generated images in the Sanskrit 2003 font using text2image with its 
default configuration.
I trained Tesseract using my own box files and compared the results with 
and without the dictionary DAWG.


Interestingly, using the dictionary DAWG increases the word-level accuracy, 
but for certain words it gives wrong output, even though those words were 
recognized correctly when the dictionary was not used.

Ex: Using the internal state debugger, I found that if I give an image of 
अब्ज, I get
अज (R=66.5974, C=-4.88797) as output when I use the dictionary, and 
अब्ज (Rating=33.2893, Conf=-2.93596) when I do not use the dictionary.
Note that the non-dictionary word has the better rating and confidence.

Clearly, Tesseract stops at the point in the dictionary where it finds अज 
and does not go on to try अब्ज. (I tried other such examples as well.)

What I want to do is the following:

I want Tesseract to give me whichever output has the better rating, the 
non-dictionary-based recognition or the dictionary-based recognition. I want 
this process to be automated for the whole book. Any help in this regard 
will be deeply appreciated.
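
The selection described above could be sketched roughly as follows. This is 
only an illustration, not Tesseract's API: WordResult and pick_best are 
hypothetical names, the per-word results are assumed to have already been 
extracted from the two runs, both runs are assumed to segment the page into 
the same words in the same order, and a lower rating is assumed to be better 
(as in the example above).

```python
from dataclasses import dataclass
from typing import List


@dataclass
class WordResult:
    """One recognized word from a single OCR run (hypothetical container)."""
    text: str
    rating: float      # assumption: lower is better
    confidence: float  # assumption: closer to 0 is better


def pick_best(with_dict: List[WordResult],
              without_dict: List[WordResult]) -> List[WordResult]:
    """For each word position, keep the result with the lower rating.

    Assumes both runs produced the same number of words in the same order;
    real output may need an alignment step first.
    """
    if len(with_dict) != len(without_dict):
        raise ValueError("runs produced different word counts; align them first")
    # On a tie, the dictionary-based result is kept.
    return [d if d.rating <= n.rating else n
            for d, n in zip(with_dict, without_dict)]


# The single-word example from this message.
with_dict = [WordResult("अज", 66.5974, -4.88797)]
without_dict = [WordResult("अब्ज", 33.2893, -2.93596)]
best = pick_best(with_dict, without_dict)
print(best[0].text)
```

Running this over every page of the book would give the automated choice; 
how the per-word ratings are read out of each run is left open here.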

Thanks in advance
Rohit

