I have the same question. Any answers? I tried to make tesseract match the words in my own customized user-words file, but it returned the same result. I cannot see any effect from the user-words and user-patterns files.
On Tuesday, 3 June 2014 03:54:24 UTC-7, Christopher Smeenk wrote:
>
> I am attempting to use tesseract to read data from a scanned high school transcript. The forms contain a number of fields (student name, gender, address) and corresponding values (characters, words or numbers).
>
> I wish to confirm that I can control the behaviour of tesseract using the eng.user-patterns and eng.user-words files as described in the man page <http://tesseract-ocr.googlecode.com/svn-history/r725/trunk/doc/tesseract.1.html> and the file trie.h. I created a test image for this purpose (attached).
>
> First, some info about my system:
> cs@pleco:/data/OCR/tesseract/tests$ tesseract-3.03 -v
> tesseract 3.03
> leptonica-1.70
> libjpeg 8b : libpng 1.2.46 : zlib 1.2.3.4
>
> Here is the result of running tesseract on the test image with no config file:
> cs@pleco:/data/OCR/tesseract/tests$ tesseract-3.03 testImages/test-002.png thetext -psm 3
> Tesseract Open Source OCR Engine v3.03 with Leptonica
> cs@pleco:/data/OCR/tesseract/tests$ cat thetext.txt
> Na me: Roosevelt, Fra nklin
>
> Age: 102
>
> Name: Harper, Stephen
> Age: 58
>
> Name: Hawk, Tony
> Age: 34
>
> Nane: Shakespeare, Bill
> Age: 432
>
> Next I create the config file and the user-patterns and user-words files:
> cs@pleco:/data/OCR/tesseract$ cat builds/tesseract-3.03/share/tessdata/eng.test-words
> Name
> Age
> Roosevelt
> Franklin
> Harper
> Stephen
> Hawk
> Tony
> Shakespeare
>
> cs@pleco:/data/OCR/tesseract$ cat builds/tesseract-3.03/share/tessdata/eng.test-patterns
> Nam\c*
>
> cs@pleco:/data/OCR/tesseract$ cat builds/tesseract-3.03/share/tessdata/configs/bazaar_test
> load_system_dawg 0
> load_freq_dawg 0
> user_words_suffix test-words
> user_patterns_suffix test-patterns
>
> Now here is the output when the config file is used:
> cs@pleco:/data/OCR/tesseract/tests$ tesseract-3.03 testImages/test-002.png thetext -psm 3 bazaar_test
> Tesseract Open Source OCR Engine v3.03 with Leptonica
> cs@pleco:/data/OCR/tesseract/tests$ cat thetext.txt
> Na me: Roosevelt, Fra nklin
>
> Age: 102
>
> Name: Harper, Stephen
> Age: 58
>
> Name: Hawk, Tony
> Age: 34
>
> Nane: Shakespeare, Bill
> Age: 432
>
> This is exactly the same as before! It appears the files eng.test-patterns and eng.test-words have no effect on tesseract.
>
> However, I can modify the config file to force tesseract to use only lower case letters:
> cs@pleco:/data/OCR/tesseract$ cat builds/tesseract-3.03/share/tessdata/configs/bazaar_test
> tessedit_char_whitelist abcdefghijklmnopqrtsuvwxyz
>
> The modified config file does affect the output:
> cs@pleco:/data/OCR/tesseract/tests$ tesseract-3.03 testImages/test-002.png thetext -psm 3 bazaar_test
> Tesseract Open Source OCR Engine v3.03 with Leptonica
> cs@pleco:/data/OCR/tesseract/tests$ cat thetext.txt
> we mei koosevelt lira nklin
>
> gei loz
>
> wamei rlarpen stephen
> gei sa
>
> wamei rlawk mny
> gei em
>
> wanei shakespeara sill
> gei wz
>
> So in this case the config file works.
>
> What other steps can I take to confirm tesseract is using the user-patterns files? Is it necessary to train tesseract before applying user-patterns?
>
> Thanks for reading,
> Chris
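One way to make the "exactly the same as before" comparison less error-prone is to diff the saved outputs mechanically rather than by eye. The following is only a minimal sketch in Python using the standard difflib module. The helper name changed_lines and the sample strings (shortened copies of the outputs quoted above) are illustrative assumptions, not part of tesseract. It only shows whether two runs produced different text, and says nothing about why a config had no effect.

```python
import difflib


def changed_lines(baseline: str, candidate: str) -> list[str]:
    """Return the lines that differ between two OCR outputs.

    Lines only in `baseline` are prefixed with "- ", lines only in
    `candidate` with "+ ". An empty list means the outputs are identical.
    """
    diff = difflib.ndiff(baseline.splitlines(), candidate.splitlines())
    return [line for line in diff if line.startswith(("- ", "+ "))]


# Shortened sample outputs copied from the runs quoted above (illustrative only).
BASELINE = "Na me: Roosevelt, Fra nklin\nAge: 102\n"
WITH_USER_WORDS = "Na me: Roosevelt, Fra nklin\nAge: 102\n"
WITH_WHITELIST = "we mei koosevelt lira nklin\ngei loz\n"

if __name__ == "__main__":
    print("user-words run changed:", changed_lines(BASELINE, WITH_USER_WORDS))
    print("whitelist run changed:", changed_lines(BASELINE, WITH_WHITELIST))
```

In practice the two strings would be read from the thetext.txt files written by each run, saved under different names so the second run does not overwrite the first.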

