Hello Traun,

I am also interested in using tesseract to recognize only words from a 
fixed list, but I'm sorry, I don't have an answer to your question.

I am thinking about using tesseract to recognize data on scanned forms 
<https://groups.google.com/forum/?fromgroups=#!topic/tesseract-ocr/vvnIBl7V3Q8>.
Is it necessary to completely retrain tesseract using the custom dictionary 
a user provides? Or is it possible to override the default behaviour using 
eng.user-words? 
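
If the dictionary turns out to be only a hint, one fallback would be to 
filter the OCR output afterwards. Below is a minimal sketch in Python of 
that idea; it is a post-processing step, not a tesseract feature, and the 
function name keep_known_words and the sample data are only illustrative 
assumptions.

```python
import re


def keep_known_words(ocr_text, allowed_words):
    """Return the words in ocr_text that appear in allowed_words.

    Matching is case-insensitive; anything that is not in the list
    (including misrecognized words) is dropped.
    """
    allowed = {word.lower() for word in allowed_words}
    # Split the OCR text into runs of ASCII letters; punctuation is ignored.
    tokens = re.findall(r"[A-Za-z]+", ocr_text)
    return [token for token in tokens if token.lower() in allowed]


# Hypothetical usage with the word list and the first sentence from the thread.
wordlist = ["local", "variables", "variable", "name", "names"]
ocr_output = (
    "You can ereate local variables for the pipelines within the "
    'template by prefixing the variable name with a "$" Sign.'
)
print(keep_known_words(ocr_output, wordlist))
```

This only discards unknown words; it does not make tesseract prefer the 
listed words during recognition, so it does not replace a working custom 
dictionary.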

Chris

On Sunday, 20 July 2014 09:27:46 UTC+2, Traun Leyden wrote:
>
>
> I followed the instructions in the FAQ entry "How do I provide my own 
> dictionary" (Tesseract 3) 
> <https://code.google.com/p/tesseract-ocr/wiki/FAQ#How_do_I_provide_my_own_dictionary?>
> to create a custom dictionary.
>
> In my custom dictionary, I only have the following words:
>
> local
> variables
> variable
> name
> names
>
> When I ran tesseract against this test image <http://bit.ly/ocrimage>, 
> the output was:
>
>> You can ereate local variables for the pipelines within the template by
>> prefixing the variable name with a “$" Sign. Variable names have to be
>> eomposed of alphanumeric characters and the underseore. In the example
>> below I have used a few variations that work for variable names.
>
>
> I was expecting the output to contain _only_ words from the custom 
> dictionary (e.g., "local", "variable", etc.).
>
> Am I misunderstanding how custom dictionaries are supposed to work? Are 
> the words in a custom dictionary merely a "hint" rather than a constraint 
> on what words can be emitted in the OCR output?
>
> Here are the steps I used to regenerate a new eng.traineddata file:
>
> $ combine_tessdata -u tessdata/eng.traineddata /tmp/eng.
> $ wordlist2dawg eng.wordlist eng.word-dawg eng.unicharset
>   (where eng.wordlist contains the word list mentioned above: "local", 
>   "variables", etc.)
> $ combine_tessdata /tmp/eng.
> $ mv eng.traineddata ~/tmp/tessdata/eng.traineddata
>
> And here is how I called tesseract
>
> $ wget http://bit.ly/ocrimage
> $ tesseract --tessdata-dir /tmp ocrimage ocrimage 
>
> I'm using the latest subversion trunk version, built via this dockerfile: 
> <https://github.com/tleyden/docker/blob/master/tesseract-training/Dockerfile>
>
