The dictionary is used along with a list of character combinations
considered to be ambiguous. This list is part of the training set. For
example, it includes an entry saying that the sequence "rn" can be
mistaken for the letter 'm'. Each entry in that list also indicates
whether the replacement should always be made or not. So in your case
you would need to rebuild the training set to include an entry
stipulating that 'D' and 'd' can be mistaken for one another (of
course in this case it makes no sense, since the letters do NOT look
similar). It's unfortunate that Tesseract doesn't support providing
the ambig file separately from the training set - but perhaps people
more knowledgeable than me can explain how the ambig file can easily
be extracted, edited and then put back? I only know how to extract it,
not how to replace it ...

One word of caution: you need to set a Tess variable to tell it the
name of your dictionary. Another word of caution: I gave the
dictionary a try on about 20 words with real ambiguities, after making
sure the dictionary file was being taken into account - and only one
case was "caught". As a result we are not using the dictionary at all
in ScanBizCards. Instead we have our own code looking for certain
patterns and matching against our internal word list - essentially
doing what Tesseract should have done, except that we don't have the
benefit of knowing what the next possible character choice was -
shame!

Patrick

On May 11, 2:07 am, Parmeet <[email protected]> wrote:
> Hi all,
>
> I am trying to figure out in what sense tesseract-ocr refers to
> dictionary words. I made an image of the word "Difficult" written as
> "diflicult" and gave it to the OCR. It gives me "diflicult" as the
> output, which is the same as the image, and not "difficult", which is
> what I would expect if it had referred to the dictionary.
>
> So now my question is: if I had given it an image of "Difficult"
> and for some reason it couldn't identify it correctly, how does
> tesseract decide whether it has to refer to the dictionary or not? My
> first guess is that it uses some kind of confidence threshold
> on detected characters, though I am not sure. Please provide
> some info on this, as it is important for my project.
>
> Thanks and Regards
> Parmeet

