Re: mis-decoding a single line of text

Jimmy O'Regan Tue, 27 Jul 2010 14:01:20 -0700

On 27 July 2010 20:59, patrickq <[email protected]> wrote:
> I get HAX 6 5-5,- with Tesseract 3.0
>
> What I find remarkable is that half the folks on this forum would love
> to disable the word recognition (i.e. dictionary), the other half
> would like to enable it - and absolutely no one knows how to enable/
> disable the dictionary nor can say for sure if it's actually enabled
> or not by default. I am included in the group of the clueless - we
> have scanned thousands of business cards and still have no idea
> whatsoever what the hell is going on with that elusive dictionary.
>


You won't normally see the difference in business cards, because there
isn't enough text to work with, and the text that's there diverges
enough from 'normal' text that the dictionaries will have relatively
low coverage.

If you want to do something practical to visualise it, generate an
image with two lines of text:
Footon Q. Barlish
Product Manager

I'll bet line 2 comes out ok, and line 1 doesn't. That'll be the
dictionary at work.

> I gather from Jimmy's recent answer that the dictionary is contained
> in a single file of type text, one word per line, in a file called
> eng.user-words

That's the /user dictionary/.

Tesseract uses a number of dictionaries. The two main dictionaries in
the language data contain the most frequent words, and 'the rest'. In
addition, Tesseract compiles a document dictionary, consisting of the
words found in blobs (but not in the dictionaries) that have high
character confidences.
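
As a rough illustration of that layering, here is a toy model in Python.
It is not Tesseract's implementation: the word lists, the function names
and the 0.9 confidence threshold are all made up for the example.

```python
# Toy model only: these word lists, names and the 0.9 threshold are
# illustrative assumptions, not Tesseract internals.
FREQUENT_WORDS = {"product", "manager"}
OTHER_WORDS = {"catalogue"}


def build_document_dictionary(recognised, threshold=0.9):
    """Collect out-of-dictionary words whose characters all score >= threshold.

    `recognised` is a list of (word, per-character confidences) pairs.
    """
    document_words = set()
    for word, confidences in recognised:
        key = word.lower()
        if key in FREQUENT_WORDS or key in OTHER_WORDS:
            continue  # already covered by the language data
        if confidences and min(confidences) >= threshold:
            document_words.add(key)
    return document_words


def lookup(word, document_words):
    """Report which of the three dictionaries, if any, contains the word."""
    key = word.lower()
    if key in FREQUENT_WORDS:
        return "frequent"
    if key in OTHER_WORDS:
        return "other"
    if key in document_words:
        return "document"
    return None


recognised = [
    ("Product", [0.99, 0.98, 0.97, 0.99, 0.98, 0.99, 0.97]),
    ("Barlish", [0.95, 0.96, 0.94, 0.97, 0.95, 0.96, 0.93]),
    ("Footon", [0.95, 0.60, 0.91, 0.92, 0.55, 0.93]),
]
doc_words = build_document_dictionary(recognised)
```

In this sketch a confidently recognised name ends up in the document
dictionary, while a name with a couple of shaky characters gets no
dictionary support at all.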

> (any support for regular expressions there? for example
> to say that [\\d]*th is a common word)

To do that properly would be a pain in the ass, because it would
basically have to be a Tesseract-specific implementation of regular
expressions. Even for '\d', you either have to convert back and forth,
or define '\d' to be the set of unichars marked as numeric; to be
really useful, it would have to be aware of ambiguities too, such as
the fact that 'i' or 'l' could just as easily be '1'.
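
To make the ambiguity point concrete, here is a minimal sketch using
Python's re module rather than anything inside Tesseract. The look-alike
table, the function names and the pattern '(\d+)th' are assumptions for
the example; nothing here is taken from Tesseract's own ambiguity data.

```python
import re

# Assumed look-alike table for illustration: glyphs an OCR engine might
# confuse with a digit. Not taken from Tesseract's ambiguity data.
DIGIT_LOOKALIKES = {"i": "1", "l": "1", "I": "1", "O": "0", "o": "0"}


def digit_class():
    """Character class matching real digits plus the look-alikes."""
    return "[0-9" + "".join(re.escape(c) for c in DIGIT_LOOKALIKES) + "]"


def ambiguity_aware(pattern):
    """Widen every '\\d' in the pattern to the look-alike class.

    Only handles '\\d' used outside square brackets.
    """
    return pattern.replace(r"\d", digit_class())


def as_digits(text):
    """Map each look-alike glyph back to the digit it resembles."""
    return "".join(DIGIT_LOOKALIKES.get(c, c) for c in text)


ORDINAL = re.compile(ambiguity_aware(r"(\d+)th"))


def read_ordinal(token):
    """Return the token as a digit-based ordinal, or None if it is not one."""
    match = ORDINAL.fullmatch(token)
    if match is None:
        return None
    return as_digits(match.group(1)) + "th"
```

Here read_ordinal("l0th") gives "10th", but so would a real word such as
"loth", which is exactly why a proper implementation would need more
than a widened character class.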

> placed in the Tessdata folder
> but we await final confirmation. Is it enough that the file exists?
> Does removing the file disable the dictionary?
>
> Clearly many have used the dictionary but sadly it appears that these
> knowledgeable people deserted this forum once they got the answers
> they need - if you see one of these gentlemen (or ladies, yes) roaming
> the streets, please admonish them for not staying subscribed to forum
> messages to give back in helping others!
>
> On Jul 27, 2:45 pm, khoshteep <[email protected]> wrote:
>> hi everyone,
>>
>> I am trying to decode a single line of text that is a bit noisy. Link
>> to uploaded image is attached. The text is "MAX665," but what I'm
>> getting back is "THAI 6 8-51-".
>>
>> http://tesseract-ocr.googlegroups.com/web/row1.bmp?gda=iO7ypToAAABJaC...
>>
>> I'm using version 2.04 and default eng language.  I have looked at the
>> thresholded image and it looks pretty good and similar to the source
>> image.
>>
>> recog_all_words() in control.cpp tries to decode each word. Inside
>> classify_word_pass1 raw_choice for the first word is "TMAX" before
>> chopping. But after improve_by_chopping() and word_associator() it is
>> changed to "T|4A1". And best_choice string for the word is "MAJ".
>>
>> After classify_word_pass2() raw_choice is "THAKX" and best_choice is
>> "THAI". And for the final string best_choice is used.
>>
>> It seems like Tesseract is designed for word recognition and not
>> character recognition. If there is a sequence of characters that does
>> not make up a meaningful word, it messes up. I'm trying to figure out
>> some magic variables, if there are any, to disable the word
>> recognition part and do pure OCR. If anyone can give me some pointers
>> I'd appreciate it.
>>
>> I'm Khoshteep.
>
> --
> You received this message because you are subscribed to the Google Groups 
> "tesseract-ocr" group.
> To post to this group, send email to [email protected].
> To unsubscribe from this group, send email to 
> [email protected].
> For more options, visit this group at 
> http://groups.google.com/group/tesseract-ocr?hl=en.
>
>



-- 
<Leftmost> jimregan, that's because deep inside you, you are evil.
<Leftmost> Also not-so-deep inside you.
