I would like to use tesseract to read data from a scanned high school 
transcript. The form contains a bunch of fields (student name, gender, 
address) and corresponding values (characters, words or numbers).

As I understand it, the way to do this is with config files augmented with 
user data (see the man page 
<http://tesseract-ocr.googlecode.com/svn/trunk/doc/tesseract.1.html>; 
patterns are explained in more detail in the file 
/path/to/tesseract-ocr/dict/trie.h).

However, when I try to set my own eng.user-words or eng.user-patterns, 
tesseract returns a *Segmentation Fault*.

First, here is a test image I am using to check the pattern matching: 
(attached file test-002.png)

Here is some info about my install:
cs@pleco:/data/OCR/tesseract/tests$ lsb_release -a
No LSB modules are available.
Distributor ID: Ubuntu
Description: Ubuntu 12.04.4 LTS
Release: 12.04
Codename: precise


cs@pleco:/data/OCR/tesseract/tests$ tesseract -v
tesseract 3.02.02
 leptonica-1.69
  libjpeg 6b : libpng 1.2.46 : libtiff 3.9.5 : zlib 1.2.3.4


Here is a good run (no config file), showing the output:
cs@pleco:/data/OCR/tesseract/tests$ tesseract testImages/test-002.png thetext -psm 3
Tesseract Open Source OCR Engine v3.02.02 with Leptonica
cs@pleco:/data/OCR/tesseract/tests$ cat thetext.txt 
Na me: Roosevelt, Fra nklin


Age: 102


Name: Harper, Stephen
Age: 58


Name: Hawk, Tony
Age: 34


Nane: Shakespeare, Bill
Age: 432


Here are the config file and the user-patterns and user-words files:
cs@pleco:/usr/share/tesseract-ocr/tessdata$ cat configs/bazaar_test 
load_system_dawg F
load_freq_dawg F
user_words_suffix test-words
user_patterns_suffix test-patterns


cs@pleco:/usr/share/tesseract-ocr/tessdata$ cat eng.test-patterns 
Name: \A\c*, \A\c*
Age: \d*
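
To make the intent of those two patterns concrete, here is a rough Python 
sketch of what I expect them to accept. It is only my reading of the 
escapes (\A as one upper-case letter, \c as any letter, \d as a digit, and 
* as "repeat the previous item"), which may not be how tesseract actually 
parses them. The regexes and the function name matches_intended_pattern are 
made up for illustration and are not part of tesseract.

```python
import re

# Assumption: "\A\c*" means one upper-case letter followed by zero or more
# letters, and "\d*" means zero or more digits. This is my interpretation of
# the pattern syntax, not something verified against the tesseract source.
NAME_RE = re.compile(r"Name: [A-Z][A-Za-z]*, [A-Z][A-Za-z]*")
AGE_RE = re.compile(r"Age: [0-9]*")


def matches_intended_pattern(line):
    """Return True if the whole line fits one of the two intended patterns."""
    return bool(NAME_RE.fullmatch(line) or AGE_RE.fullmatch(line))


if __name__ == "__main__":
    # Lines taken from the OCR output of the run without a config file.
    ocr_lines = [
        "Na me: Roosevelt, Fra nklin",
        "Age: 102",
        "Name: Harper, Stephen",
        "Nane: Shakespeare, Bill",
    ]
    for line in ocr_lines:
        print(matches_intended_pattern(line), repr(line))
```

Under that reading, the misrecognized lines ("Na me: ..." and "Nane: ...") 
are the ones that fail, and those are the lines I hoped the user patterns 
would help correct.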


cs@pleco:/usr/share/tesseract-ocr/tessdata$ cat eng.test-words 
Name:
Age:
Roosevelt
Franklin
Harper
Stephen
Hawk
Tony
Shakespeare


And here is the result when running tesseract with the config file:
cs@pleco:/data/OCR/tesseract/tests$ tesseract testImages/test-002.png thetext -psm 3 bazaar_test
Tesseract Open Source OCR Engine v3.02.02 with Leptonica
Segmentation fault



What am I doing wrong? Thanks for reading!

Chris
