Hello Traun, I am also interested in using tesseract to recognize words from a selected list. But sorry I don't have an answer to your question.
I am thinking about using tesseract to recognize data on scanned forms
<https://groups.google.com/forum/?fromgroups=#!topic/tesseract-ocr/vvnIBl7V3Q8>.

Is it necessary to completely retrain tesseract using the custom dictionary a user provides? Or is it possible to override the default behaviour using eng.user-words?

Chris

On Sunday, 20 July 2014 09:27:46 UTC+2, Traun Leyden wrote:
> I followed the FAQ - How do I provide my own dictionary -- Tesseract 3
> <https://code.google.com/p/tesseract-ocr/wiki/FAQ#How_do_I_provide_my_own_dictionary?>
> instructions to create a custom dictionary.
>
> In my custom dictionary, I only have the following words:
>
> local
> variables
> variable
> name
> names
>
> When I ran tesseract against this test image <http://bit.ly/ocrimage>,
> the output was:
>
>> You can ereate local variables for the pipelines within the template by
>> prefixing the variable name with a “$" Sign. Variable names have to be
>> eomposed of alphanumeric characters and the underseore. In the example
>> below I have used a few variations that work for variable names.
>
> and I was expecting it to _only_ have words from the custom dictionary.
> (eg, "local", "variable", etc..)
>
> Am I misunderstanding how custom dictionaries are supposed to work? Are
> the words in a custom dictionary merely a "hint" rather than a constraint
> on what words can be emitted in the ocr output?
>
> Here are the steps I used to regenerate a new eng.traineddata file:
>
> $ combine_tessdata -u tessdata/eng.traineddata /tmp/eng.
> $ wordlist2dawg eng.wordlist eng.word-dawg eng.unicharset
>   (where eng.wordlist contains word list mentioned above with "local",
>   "variables", etc)
> $ combine_tessdata /tmp/eng.
> $ mv eng.traineddata ~/tmp/tessdata/eng.traineddata
>
> And here is how I called tesseract
>
> $ wget http://bit.ly/ocrimage
> $ tesseract --tessdata-dir /tmp ocrimage ocrimage
>
> I'm using the latest subversion trunk version, built via this dockerfile
> <https://github.com/tleyden/docker/blob/master/tesseract-training/Dockerfile>.
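
Whether the dictionary turns out to be a hint or a constraint, one possible workaround for getting only listed words is to filter the OCR text after the fact. The sketch below is a post-processing step and not a tesseract feature: `filter_to_wordlist` is a hypothetical helper name, and it assumes the word list is a plain text file with one word per line, like the eng.wordlist described above. It cannot repair misread words such as "ereate"; it only drops them.

```shell
#!/bin/sh
# Hypothetical helper (not part of tesseract): keep only those tokens of
# the OCR output that also appear in a word list, one word per line.
# Usage: filter_to_wordlist WORDLIST < ocr_output.txt
filter_to_wordlist() {
    # tr turns every run of non-letters into a single newline, so each
    # token ends up on its own line.
    # grep then keeps whole-line (-x), fixed-string (-F), case-insensitive
    # (-i) matches against the patterns read from the word list (-f).
    tr -cs '[:alpha:]' '\n' | grep -i -x -F -f "$1"
}

# Example call, assuming the tesseract command above wrote ocrimage.txt:
#   filter_to_wordlist eng.wordlist < ocrimage.txt
```

The matched words are printed one per line in the order they occur, with their original capitalisation. Matching is exact apart from case, so "names" is kept only because it is in the list itself.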

