Re: mis-decoding a single line of text

patrickq Tue, 27 Jul 2010 12:59:12 -0700

I get "HAX 6 5-5,-" with Tesseract 3.0.

What I find remarkable is that half the folks on this forum would love
to disable the word recognition (i.e. the dictionary), the other half
would like to enable it, and absolutely no one knows how to enable or
disable the dictionary, nor can anyone say for sure whether it's
actually enabled by default. I include myself among the clueless: we
have scanned thousands of business cards and still have no idea
whatsoever what the hell is going on with that elusive dictionary.

I gather from Jimmy's recent answer that the dictionary is a single
plain-text file, one word per line, called eng.user-words and placed in
the tessdata folder, but we await final confirmation. (Any support for
regular expressions there? For example, to say that [\\d]*th is a
common word.) Is it enough that the file exists? Does removing the file
disable the dictionary?
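
For what it's worth, a word list in that shape is easy to generate. Below
is a minimal sketch in Python. It assumes the one-word-per-line format
described above is right and that patterns are NOT supported, so something
like "digits followed by th" has to be expanded into concrete words (here,
proper English ordinals). The file name eng.user-words comes from Jimmy's
answer; the function names, the sample words and the range of 31 are made
up for illustration, and whether Tesseract actually picks the file up is
still unconfirmed.

```python
from pathlib import Path


def ordinal(n: int) -> str:
    """Return n with its English ordinal suffix, e.g. 1st, 2nd, 11th."""
    if 10 <= n % 100 <= 20:
        suffix = "th"  # 11th, 12th, 13th and the rest of the teens
    else:
        suffix = {1: "st", 2: "nd", 3: "rd"}.get(n % 10, "th")
    return f"{n}{suffix}"


def build_word_list(extra_words, max_ordinal=31):
    """Combine custom words with expanded ordinals, keeping first-seen order."""
    candidates = list(extra_words)
    candidates += [ordinal(n) for n in range(1, max_ordinal + 1)]
    seen = set()
    unique = []
    for word in candidates:
        word = word.strip()
        if word and word not in seen:  # skip blanks and duplicates
            seen.add(word)
            unique.append(word)
    return unique


def write_word_list(path, words):
    """Write one word per line (the format assumed in this thread)."""
    Path(path).write_text("\n".join(words) + "\n", encoding="utf-8")


if __name__ == "__main__":
    # Hypothetical sample entries; replace with words from your own cards.
    words = build_word_list(["MAX665", "Tesseract"])
    write_word_list("eng.user-words", words)
```

The resulting file would then be copied into the tessdata folder. This
only covers producing the list; it does not answer whether the file's
presence alone switches the dictionary on.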

Clearly many have used the dictionary, but sadly it appears that these
knowledgeable people deserted this forum once they got the answers
they needed. If you see one of these gentlemen (or ladies, yes) roaming
the streets, please admonish them for not staying subscribed to forum
messages and giving back by helping others!

On Jul 27, 2:45 pm, khoshteep <[email protected]> wrote:
> hi everyone,
>
> I am trying to decode a single line of text that is a bit noisy. Link
> to uploaded image is attached. The text is "MAX665," but what I'm
> getting back is "THAI 6 8-51-".
>
> http://tesseract-ocr.googlegroups.com/web/row1.bmp?gda=iO7ypToAAABJaC...
>
> I'm using version 2.04 and default eng language.  I have looked at the
> thresholded image and it looks pretty good and similar to the source
> image.
>
> recog_all_words() in control.cpp tries to decode each word. Inside
> classify_word_pass1 raw_choice for the first word is "TMAX" before
> chopping. But after improve_by_chopping() and word_associator() it is
> changed to "T|4A1". And best_choice string for the word is "MAJ".
>
> After classify_word_pass2() raw_choice is "THAKX" and best_choice is
> "THAI". And for the final string best_choice is used.
>
> It seems like Tesseract is designed for word recognition and not
> character recognition. If there is a sequence of characters that does
> not make up a meaningful word, it messes up. I'm trying to figure out
> some magic variables, if there are any, to disable the word
> recognition part and do pure OCR. If anyone can give me some pointers
> I'd appreciate it.
>
> I'm Khoshteep.

