I would like to use tesseract to read data from a scanned high school transcript. The form contains a bunch of fields (student name, gender, address) and corresponding values (characters, words or numbers).
I understand the way to do this is to use config files augmented with user data (see the man page <http://tesseract-ocr.googlecode.com/svn/trunk/doc/tesseract.1.html>; patterns are explained in more detail in the file /path/to/tesseract-ocr/dict/trie.h). However, when I try to set my own eng.user-words or eng.user-patterns, tesseract returns a *Segmentation Fault*.

First, here is a test image I am using to check the pattern matching: (attached file test-002.png)

Here is some info about my install:

    cs@pleco:/data/OCR/tesseract/tests$ lsb_release -a
    No LSB modules are available.
    Distributor ID: Ubuntu
    Description:    Ubuntu 12.04.4 LTS
    Release:        12.04
    Codename:       precise

    cs@pleco:/data/OCR/tesseract/tests$ tesseract -v
    tesseract 3.02.02
     leptonica-1.69
      libjpeg 6b : libpng 1.2.46 : libtiff 3.9.5 : zlib 1.2.3.4

Here is a good run, showing the output:

    cs@pleco:/data/OCR/tesseract/tests$ tesseract testImages/test-002.png thetext -psm 3
    Tesseract Open Source OCR Engine v3.02.02 with Leptonica

    cs@pleco:/data/OCR/tesseract/tests$ cat thetext.txt
    Na me: Roosevelt, Fra nklin
    Age: 102
    Name: Harper, Stephen
    Age: 58
    Name: Hawk, Tony
    Age: 34
    Nane: Shakespeare, Bill
    Age: 432

Here are the config file and the user words and patterns files:

    cs@pleco:/usr/share/tesseract-ocr/tessdata$ cat configs/bazaar_test
    load_system_dawg F
    load_freq_dawg F
    user_words_suffix test-words
    user_patterns_suffix test-patterns

    cs@pleco:/usr/share/tesseract-ocr/tessdata$ cat eng.test-patterns
    Name: \A\c*, \A\c*
    Age: \d*

    cs@pleco:/usr/share/tesseract-ocr/tessdata$ cat eng.test-words
    Name:
    Age:
    Roosevelt
    Franklin
    Harper
    Stephen
    Hawk
    Tony
    Shakespeare

And here is the result when running tesseract with the config file:

    cs@pleco:/data/OCR/tesseract/tests$ tesseract testImages/test-002.png thetext -psm 3 bazaar_test
    Tesseract Open Source OCR Engine v3.02.02 with Leptonica
    Segmentation fault

What am I doing wrong? Thanks for reading!
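In case it helps anyone reproduce this, here is a minimal Python sketch that writes the three files shown above and builds the same command line. The function names (write_test_files, build_command, run_tesseract) are hypothetical helpers made up for illustration, and the one-entry-per-line layout of the words and patterns files is an assumption. The file contents are copied unchanged from this post, so the sketch reproduces the setup that crashes; it is not a fix.

```python
import signal
import subprocess
from pathlib import Path

# Contents copied from the post. One entry per line is an assumption.
CONFIG_LINES = [
    "load_system_dawg F",
    "load_freq_dawg F",
    "user_words_suffix test-words",
    "user_patterns_suffix test-patterns",
]
PATTERN_LINES = [r"Name: \A\c*, \A\c*", r"Age: \d*"]
WORD_LINES = ["Name:", "Age:", "Roosevelt", "Franklin", "Harper",
              "Stephen", "Hawk", "Tony", "Shakespeare"]


def write_test_files(tessdata_dir, lang="eng", config_name="bazaar_test"):
    """Write the config, user-words and user-patterns files under tessdata_dir."""
    tessdata = Path(tessdata_dir)
    (tessdata / "configs").mkdir(parents=True, exist_ok=True)
    paths = {
        "config": tessdata / "configs" / config_name,
        "words": tessdata / f"{lang}.test-words",
        "patterns": tessdata / f"{lang}.test-patterns",
    }
    paths["config"].write_text("\n".join(CONFIG_LINES) + "\n")
    paths["words"].write_text("\n".join(WORD_LINES) + "\n")
    paths["patterns"].write_text("\n".join(PATTERN_LINES) + "\n")
    return paths


def build_command(image, output_base, config_name="bazaar_test"):
    """Return the same command line used in the post (tesseract 3.02 syntax)."""
    return ["tesseract", str(image), str(output_base), "-psm", "3", config_name]


def run_tesseract(image, output_base, config_name="bazaar_test"):
    """Run tesseract and report whether it was killed by SIGSEGV (POSIX only)."""
    result = subprocess.run(build_command(image, output_base, config_name),
                            capture_output=True, text=True)
    # On POSIX, subprocess reports death by signal N as return code -N.
    return result.returncode, result.returncode == -signal.SIGSEGV
```

To use it against a real install, call write_test_files("/usr/share/tesseract-ocr/tessdata") (this needs write permission there) and then run_tesseract("testImages/test-002.png", "thetext").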
Chris

