Re: When Tesseract refers to dictionary

Parmeet Wed, 11 May 2011 22:26:08 -0700

Hello Patrick,

Thanks for your feedback. I would really like to know how to set the
Tesseract variable that names my dictionary, and what format the
dictionary file should use. I have many words that are not proper
English words (for instance, company names), so I really need to know
the exact procedure for including my own dictionary alongside the
original dictionary files and the other files already in place.


I look forward to your reply.

Thanks,
Parmeet


On May 11, 5:32 pm, patrickq <[email protected]> wrote:
> The dictionary is used along with a list of character combinations
> considered to be ambiguous. This is a list that is part of the
> training set. For example, it includes an entry that says that the
> sequence "rn" can be mistaken for the letter 'm'. For each entry in
> that list there is an indication whether to always replace or not. So
> in your case you would need to rebuild the training set to include an
> entry stipulating that 'D' and 'd' can be mistaken for one another (of
> course in this case it makes no sense, the letters are NOT similar
> looking). It's unfortunate that Tesseract doesn't support providing
> the ambig file separately from the training set - but perhaps people
> more knowledgeable than me can explain how the ambig file can easily be
> extracted, edited and then replaced? I only know how to extract it,
> not the replacing part ...
>
> One word of caution: you need to set a Tess variable to tell it what's
> the name of your dictionary. Another word of caution: I gave the
> dictionary a try on about 20 words with real ambiguities and making
> sure the dictionary file was taken into account - and only one case
> "caught". As a result we are not using the dictionary at all in
> ScanBizCards. Instead we have our own code looking for certain
> patterns and matching against our internal word list - essentially
> doing what Tesseract should have done, except that we don't have the
> benefit of knowing what was the next possible character choice -
> shame!
>
> Patrick
>
> On May 11, 2:07 am, Parmeet <[email protected]> wrote:
>
> > Hi all,
>
> > I am trying to figure out when tesseract-ocr refers to its
> > dictionary words. I made an image of the word "Difficult" spelled as
> > "diflicult" and gave it to the OCR. It gives me "diflicult" as the
> > output, which is the same as the image, and not "difficult", which is
> > what I would expect if it had referred to the dictionary.
>
> > So now my question is: if I had given it an image of "Difficult"
> > itself and for some reason it couldn't identify it correctly, how does
> > Tesseract decide whether it has to refer to the dictionary or not? My
> > first guess is that it uses some kind of confidence threshold
> > on the detected characters, though I am not sure. Please provide
> > some information on this, as it is important for my project.
>
> > Thanks and Regards
> > Parmeet
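
As a rough illustration of the workaround Patrick describes above (trying
known ambiguous substitutions on the OCR output and matching the result
against an internal word list), here is a minimal Python sketch. It is not
ScanBizCards' code and not what Tesseract does internally; the AMBIGUITIES
table, the function names and the sample word list are all assumptions made
up for this example.

```python
# Hypothetical ambiguity table: a sequence the OCR may output, mapped to
# the sequences it may really stand for. Entries are illustrative only.
AMBIGUITIES = {
    "rn": ["m"],   # "rn" can be mistaken for the letter 'm'
    "fl": ["ff"],  # assumed entry covering "diflicult" vs "difficult"
}


def candidates(word, ambiguities):
    """Yield variants of `word` with exactly one ambiguous sequence replaced."""
    for seq, alternatives in ambiguities.items():
        start = word.find(seq)
        while start != -1:
            for alt in alternatives:
                yield word[:start] + alt + word[start + len(seq):]
            start = word.find(seq, start + 1)


def correct(word, dictionary, ambiguities=AMBIGUITIES):
    """Return `word` if it is in the dictionary, otherwise the first
    single-substitution variant that is, otherwise `word` unchanged."""
    if word.lower() in dictionary:
        return word
    for variant in candidates(word, ambiguities):
        if variant.lower() in dictionary:
            return variant
    return word


# Sample word list (an assumption); a real one would be loaded from a file.
words = {"difficult", "company"}

print(correct("diflicult", words))  # difficult
print(correct("cornpany", words))   # company
print(correct("Tesseract", words))  # Tesseract (no match, left unchanged)
```

Unlike the OCR engine itself, a post-processing step like this has no
access to the per-character alternatives or confidences, which is the
limitation Patrick points out.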

-- 
You received this message because you are subscribed to the Google
Groups "tesseract-ocr" group.
To post to this group, send email to [email protected]
To unsubscribe from this group, send email to
[email protected]
For more options, visit this group at
http://groups.google.com/group/tesseract-ocr?hl=en
