Tesseract has a whitelist for glyphs: it restricts which individual characters can be emitted, not which words.
Much of Tesseract is "hints" and possibilities. 2014 is not 1997 ... it would be nice if we understood how best to train it. I have built a tool to train Tesseract, but it doesn't seem to improve my default results much.

http://font.mturk.patent-rank.com/frames/popup/popup_training-tess.html

On Sunday, July 20, 2014 3:27:46 AM UTC-4, Traun Leyden wrote:
>
> I followed the FAQ - How do I provide my own dictionary -- Tesseract 3
> <https://code.google.com/p/tesseract-ocr/wiki/FAQ#How_do_I_provide_my_own_dictionary?>
> instructions to create a custom dictionary.
>
> In my custom dictionary, I only have the following words:
>
> local
> variables
> variable
> name
> names
>
> When I ran tesseract against this test image <http://bit.ly/ocrimage>,
> the output was:
>
>> You can ereate local variables for the pipelines within the template by
>> prefixing the variable name with a “$" Sign. Variable names have to be
>> eomposed of alphanumeric characters and the underseore. In the example
>> below I have used a few variations that work for variable names.
>
> and I was expecting it to _only_ have words from the custom dictionary
> (e.g., "local", "variable", etc.).
>
> Am I misunderstanding how custom dictionaries are supposed to work? Are
> the words in a custom dictionary merely a "hint" rather than a constraint
> on what words can be emitted in the OCR output?
>
> Here are the steps I used to regenerate a new eng.traineddata file:
>
> $ combine_tessdata -u tessdata/eng.traineddata /tmp/eng.
> $ wordlist2dawg eng.wordlist eng.word-dawg eng.unicharset
>   (where eng.wordlist contains the word list mentioned above with
>   "local", "variables", etc.)
> $ combine_tessdata /tmp/eng.
> $ mv eng.traineddata ~/tmp/tessdata/eng.traineddata
>
> And here is how I called tesseract:
>
> $ wget http://bit.ly/ocrimage
> $ tesseract --tessdata-dir /tmp ocrimage ocrimage
>
> I'm using the latest subversion trunk version, built via this dockerfile
> <https://github.com/tleyden/docker/blob/master/tesseract-training/Dockerfile>.
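
If the goal really is output that contains only words from the custom dictionary, one possible workaround is to filter the recognized text after Tesseract has run. This is only a sketch of that idea and is not something Tesseract does itself: the function name `keep_dictionary_words` and the constant `ALLOWED_WORDS` are made up for illustration, and the word list is simply copied from the quoted message.

```python
import re

# Assumption: the word list from the custom dictionary in the quoted message.
ALLOWED_WORDS = {"local", "variables", "variable", "name", "names"}


def keep_dictionary_words(ocr_text, allowed=ALLOWED_WORDS):
    """Return the alphabetic tokens of ocr_text that are in `allowed`.

    Matching is case-insensitive; the original casing of each token is kept.
    """
    # Split the OCR output into runs of ASCII letters, dropping punctuation.
    tokens = re.findall(r"[A-Za-z]+", ocr_text)
    return [token for token in tokens if token.lower() in allowed]


if __name__ == "__main__":
    # Part of the OCR output quoted above, misrecognitions included.
    sample = (
        "You can ereate local variables for the pipelines within the template by "
        "prefixing the variable name with a $ Sign. Variable names have to be "
        "eomposed of alphanumeric characters and the underseore."
    )
    print(" ".join(keep_dictionary_words(sample)))
```

A filter like this throws away everything outside the list, including misrecognized words such as "ereate", so it only makes sense when the dictionary is meant as a hard constraint rather than a hint.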

