Yes, I can pre-process each individual image to make it work, but 
unfortunately I've been unable to come up with a consistent pre-processing 
method that would work in general. I've been trying for a while now.
I've known that retraining is an option from the beginning but I'm 
concerned that it may fix some problems and introduce others. The default 
eng.traineddata works pretty well except that every once in a while a 
character is misread.
I've just downloaded and tried VietOCR 4 beta, and while it does get this 
one right, it regrettably still misses quite a few others.

What I really need is a dictionary lookup for every non-word or garbage 
word Tesseract finds, one that would return the best dictionary match. I'm 
thinking about writing my own, but that would be absurd if Tesseract is 
supposed to already contain this functionality. I understand from Ray's 
explanation here 
<https://groups.google.com/forum/#!searchin/tesseract-ocr/dictionary/tesseract-ocr/VJXE40iksnI/tr-_9O4F5OcJ>
that the correct character choice is not ranked high enough to be 
considered for a dictionary match, and that would make sense if I didn't 
have an ambigs rule for it. But if I have an explicit unicharambigs rule 
that says to consider replacing this character with another when looking 
for a dictionary match, I don't see how Tesseract still ends up preferring 
a non-word over a dictionary match.
I keep thinking I must be missing some obscure config setting. I've already 
tried tweaking a whole bunch of them from this list 
<http://www.sk-spell.sk.cx/tesseract-ocr-parameters-in-302-version> but to 
no avail.
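
For reference, here is a minimal sketch of the kind of post-OCR lookup I 
mean, done outside Tesseract in Python. It only uses the standard-library 
difflib module. The function names, the tiny word list and the 0.6 cutoff 
are assumptions for illustration, and this is not a claim about how 
Tesseract's own dictionary handling works.

```python
import difflib


def best_dictionary_match(word, dictionary, cutoff=0.6):
    """Return the closest dictionary word, or the word unchanged.

    `dictionary` is any collection of lowercase words (hypothetical here;
    in practice it would be loaded from a word list file).
    """
    lowered = word.lower()
    if lowered in dictionary:
        # Already a known word, leave the OCR output alone.
        return word
    # get_close_matches ranks candidates by difflib's similarity ratio
    # and drops anything scoring below `cutoff`.
    matches = difflib.get_close_matches(lowered, dictionary, n=1, cutoff=cutoff)
    return matches[0] if matches else word


def correct_text(text, dictionary):
    """Apply the lookup to every whitespace-separated token.

    Punctuation is not stripped, so "word," is treated as a non-word.
    """
    return " ".join(best_dictionary_match(tok, dictionary) for tok in text.split())


if __name__ == "__main__":
    words = {"hello", "world", "tesseract"}
    print(correct_text("he1lo world", words))
```

The obvious weakness is that a genuinely out-of-dictionary word (a name, a 
part number) can get "corrected" into something wrong, so the cutoff would 
need tuning against real output.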
