I am attempting to use tesseract to read data from a scanned high school transcript. The form contains a number of fields (student name, gender, address) and corresponding values (characters, words or numbers).
I wish to confirm that I can control the behaviour of tesseract using the eng.user-patterns and eng.user-words files as described in the man page <http://tesseract-ocr.googlecode.com/svn-history/r725/trunk/doc/tesseract.1.html> and the file trie.h. I created a test image for this purpose (attached).

First, some info about my system:

    cs@pleco:/data/OCR/tesseract/tests$ tesseract-3.03 -v
    tesseract 3.03
     leptonica-1.70
      libjpeg 8b : libpng 1.2.46 : zlib 1.2.3.4

Here is the result of running tesseract on the test image with no config file:

    cs@pleco:/data/OCR/tesseract/tests$ tesseract-3.03 testImages/test-002.png thetext -psm 3
    Tesseract Open Source OCR Engine v3.03 with Leptonica
    cs@pleco:/data/OCR/tesseract/tests$ cat thetext.txt
    Na me: Roosevelt, Fra nklin Age: 102 Name: Harper, Stephen Age: 58 Name: Hawk, Tony Age: 34 Nane: Shakespeare, Bill Age: 432

Next I created the config file and the user-patterns and user-words files:

    cs@pleco:/data/OCR/tesseract$ cat builds/tesseract-3.03/share/tessdata/eng.test-words
    Name
    Age
    Roosevelt
    Franklin
    Harper
    Stephen
    Hawk
    Tony
    Shakespeare

    cs@pleco:/data/OCR/tesseract$ cat builds/tesseract-3.03/share/tessdata/eng.test-patterns
    Nam\c*

    cs@pleco:/data/OCR/tesseract$ cat builds/tesseract-3.03/share/tessdata/configs/bazaar_test
    load_system_dawg 0
    load_freq_dawg 0
    user_words_suffix test-words
    user_patterns_suffix test-patterns

Here is the output when the config file is used:

    cs@pleco:/data/OCR/tesseract/tests$ tesseract-3.03 testImages/test-002.png thetext -psm 3 bazaar_test
    Tesseract Open Source OCR Engine v3.03 with Leptonica
    cs@pleco:/data/OCR/tesseract/tests$ cat thetext.txt
    Na me: Roosevelt, Fra nklin Age: 102 Name: Harper, Stephen Age: 58 Name: Hawk, Tony Age: 34 Nane: Shakespeare, Bill Age: 432

This is exactly the same as before! It appears that the files eng.test-patterns and eng.test-words have no effect on tesseract.
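Rather than comparing the two outputs by eye, the runs could be written to separate output bases and compared mechanically. This is only a rough sketch: the helper name compare_runs and the output bases plain and withcfg are made up for illustration, and the tesseract invocations in the comments simply reuse the commands from this post.

```shell
#!/bin/sh
# Hypothetical helper: report whether two OCR output files are byte-identical,
# and show a unified diff when they are not.
compare_runs() {
    if cmp -s "$1" "$2"; then
        echo "identical"
    else
        echo "different"
        # diff exits non-zero when files differ; do not treat that as a failure
        diff -u "$1" "$2" || true
    fi
}

# Assumed usage from the tests directory (plain/withcfg are made-up names):
#   tesseract-3.03 testImages/test-002.png plain   -psm 3
#   tesseract-3.03 testImages/test-002.png withcfg -psm 3 bazaar_test
#   compare_runs plain.txt withcfg.txt
```

Using separate output bases also avoids the second run overwriting thetext.txt from the first.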
However, I can modify the config file to force tesseract to use only lower case letters:

    cs@pleco:/data/OCR/tesseract$ cat builds/tesseract-3.03/share/tessdata/configs/bazaar_test
    tessedit_char_whitelist abcdefghijklmnopqrtsuvwxyz

The modified config file does affect the output:

    cs@pleco:/data/OCR/tesseract/tests$ tesseract-3.03 testImages/test-002.png thetext -psm 3 bazaar_test
    Tesseract Open Source OCR Engine v3.03 with Leptonica
    cs@pleco:/data/OCR/tesseract/tests$ cat thetext.txt
    we mei koosevelt lira nklin gei loz wamei rlarpen stephen gei sa wamei rlawk mny gei em wanei shakespeara sill gei wz

So in this case the config file works. What other steps can I take to confirm that tesseract is using the user-patterns files? Is it necessary to train tesseract before applying user-patterns?

Thanks for reading,
Chris

