Re: mis-decoding a single line of text

patrickq Tue, 27 Jul 2010 13:55:27 -0700

I assume you are referring to 
http://code.google.com/p/tesseract-ocr/wiki/TrainingTesseract
?
It's helpful, thanks, and I should have checked what's there first.


My understanding is that:
- one dictionary file (eng.word-dawg) is included as part of building
the training data, and it separates frequent words from infrequent
ones
- there is no guideline explaining what counts as "frequent", nor how
the two sets interact. Do frequent words get picked whenever the
recognized text is two letters away (whereas infrequent words trigger
only when the text is one letter away)? Unclear.
- there is no mention of the advantages of the eng.word-dawg method
versus eng.user-words, but I guess eng.user-words is the only option
for anyone who is not building their own training data?

I'd like to give it a try (using eng.user-words). The one question I
still have is how adding a word to the dictionary affects the results.
Does it auto-correct when replacing low-scoring letters yields a
dictionary match? How many corrections are allowed per word? Anyone
with answers, please share and I volunteer to add them to the doc. It's
a wiki after all, so why do I get the sense that a thick layer of dust
covers the doc :-)?
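
For anyone who wants to try the same thing, here is a minimal sketch of
writing the word list. It assumes, as gathered from this thread and not
yet confirmed, that eng.user-words is a plain-text file with one word
per line in the tessdata folder. The function name, the example words
and the tessdata path are placeholders of mine, not anything taken from
Tesseract itself.

```python
from pathlib import Path


def write_user_words(words, tessdata_dir, lang="eng"):
    """Write words, one per line, to <tessdata_dir>/<lang>.user-words.

    Assumption (unconfirmed in this thread): Tesseract reads this file
    as plain text with one word per line.
    """
    seen = set()
    cleaned = []
    for word in words:
        word = word.strip()
        # Skip blank entries and duplicates, keeping first-seen order.
        if word and word not in seen:
            seen.add(word)
            cleaned.append(word)

    path = Path(tessdata_dir) / f"{lang}.user-words"
    path.write_text("".join(w + "\n" for w in cleaned), encoding="utf-8")
    return path


if __name__ == "__main__":
    # Hypothetical tessdata location; adjust to your own install.
    print(write_user_words(["Inc", "Ltd", "GmbH"], "."))
```

This only creates the file. Whether its mere presence changes the
recognition output is exactly the open question above.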

Patrick

On Jul 27, 4:20 pm, Eugene Reimer <[email protected]> wrote:
> A quick glance at the documentation will tell you that "the dictionary"
> lives in several DAWG files, as well as in that user-words file.
>
> patrickq wrote, On 2010-07-27 14:59:
>
> > I get HAX 6 5-5,- with Tesseract 3.0
>
> > What I find remarkable is that half the folks on this forum would love
> > to disable the word recognition (i.e. dictionary), the other half
> > would like to enable it - and absolutely no one knows how to enable/
> > disable the dictionary nor can say for sure if it's actually enabled
> > or not by default. I am included in the group of the clueless - we
> > have scanned thousands of business cards and still have no idea
> > whatsoever what the hell is going on with that elusive dictionary.
>
> > I gather from Jimmy's recent answer that the dictionary is contained
> > in a single file of type text, one word per line, in a file called
> > eng.user-words (any support for regular expressions there? for example
> > to say that [\\d]*th is a common word) placed in the Tessdata folder
> > but we await final confirmation. Is it enough that the file exists?
> > Does removing the file disable the dictionary?
>
> > Clearly many have used the dictionary but sadly it appears that these
> > knowledgeable people deserted this forum once they got the answers
> > they need - if you see one of these gentlemen (or ladies, yes) roaming
> > the streets, please admonish them for not staying subscribed to forum
> > messages to give back in helping others!

