I am attempting to use tesseract to read data from a scanned high 
school transcript. The form contains a number of fields (student name, 
gender, address) and corresponding values (characters, words or numbers).

I wish to confirm that I can control the behaviour of tesseract using the 
eng.user-patterns and eng.user-words files as described in the man page 
<http://tesseract-ocr.googlecode.com/svn-history/r725/trunk/doc/tesseract.1.html>
and the file trie.h. I created a test image for this purpose (attached).

First, some info about my system:
cs@pleco:/data/OCR/tesseract/tests$ tesseract-3.03 -v
tesseract 3.03
 leptonica-1.70
  libjpeg 8b : libpng 1.2.46 : zlib 1.2.3.4



Here is the result of running tesseract on the test image with no config 
file:
cs@pleco:/data/OCR/tesseract/tests$ tesseract-3.03 testImages/test-002.png thetext -psm 3
Tesseract Open Source OCR Engine v3.03 with Leptonica
cs@pleco:/data/OCR/tesseract/tests$ cat thetext.txt 
Na me: Roosevelt, Fra nklin


Age: 102


Name: Harper, Stephen
Age: 58


Name: Hawk, Tony
Age: 34


Nane: Shakespeare, Bill
Age: 432



Next I created the config file and the user-patterns and user-words files:
cs@pleco:/data/OCR/tesseract$ cat builds/tesseract-3.03/share/tessdata/eng.test-words
Name
Age
Roosevelt
Franklin
Harper
Stephen
Hawk
Tony
Shakespeare


cs@pleco:/data/OCR/tesseract$ cat builds/tesseract-3.03/share/tessdata/eng.test-patterns
Nam\c*


cs@pleco:/data/OCR/tesseract$ cat builds/tesseract-3.03/share/tessdata/configs/bazaar_test
load_system_dawg 0
load_freq_dawg 0
user_words_suffix test-words
user_patterns_suffix test-patterns
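
One simple thing to rule out is a path problem. The following is a minimal 
Python sketch, not part of tesseract, that reads a config file and checks 
whether the files named by the *_suffix variables exist. It assumes the files 
are looked up as <tessdata>/<lang>.<suffix>, which matches the layout shown 
above; the function names suffix_files and demo are hypothetical, and the 
demo builds a throwaway directory rather than touching a real install.

```python
import os
import tempfile


def suffix_files(config_path, tessdata_dir, lang="eng"):
    """Map each *_suffix variable in the config to (expected path, exists?)."""
    found = {}
    with open(config_path) as cfg:
        for line in cfg:
            parts = line.split()
            # Config lines have the form "<variable> <value>".
            if len(parts) == 2 and parts[0].endswith("_suffix"):
                # Assumption: the file is looked up as <tessdata>/<lang>.<suffix>.
                path = os.path.join(tessdata_dir, lang + "." + parts[1])
                found[parts[0]] = (path, os.path.isfile(path))
    return found


def demo():
    """Recreate the layout from this post in a temporary directory."""
    with tempfile.TemporaryDirectory() as tessdata:
        os.mkdir(os.path.join(tessdata, "configs"))
        config = os.path.join(tessdata, "configs", "bazaar_test")
        with open(config, "w") as f:
            f.write("load_system_dawg 0\nload_freq_dawg 0\n"
                    "user_words_suffix test-words\n"
                    "user_patterns_suffix test-patterns\n")
        # Only the word list is created here, so the pattern file is
        # reported as missing.
        with open(os.path.join(tessdata, "eng.test-words"), "w") as f:
            f.write("Name\nAge\n")
        result = suffix_files(config, tessdata)
        return {name: exists for name, (_, exists) in result.items()}


if __name__ == "__main__":
    print(demo())
```

Pointing suffix_files at the real bazaar_test and tessdata directory would 
show whether both files are where this lookup rule expects them.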



Now here is the output when the config file is used:
cs@pleco:/data/OCR/tesseract/tests$ tesseract-3.03 testImages/test-002.png thetext -psm 3 bazaar_test
Tesseract Open Source OCR Engine v3.03 with Leptonica
cs@pleco:/data/OCR/tesseract/tests$ cat thetext.txt 
Na me: Roosevelt, Fra nklin


Age: 102


Name: Harper, Stephen
Age: 58


Name: Hawk, Tony
Age: 34


Nane: Shakespeare, Bill
Age: 432


This is exactly the same as before! It appears the files eng.test-patterns 
and eng.test-words have no effect on tesseract. 
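
Rather than comparing the two outputs by eye, the comparison can be done 
mechanically. This is a minimal sketch using Python's standard difflib 
module; the function name diff_outputs and the labels "no-config" and 
"bazaar_test" are chosen for illustration only, and the sample strings are 
the first lines of the output above.

```python
import difflib


def diff_outputs(before, after):
    """Return unified-diff lines between two OCR outputs (empty if identical)."""
    return list(difflib.unified_diff(
        before.splitlines(), after.splitlines(),
        fromfile="no-config", tofile="bazaar_test", lineterm=""))


if __name__ == "__main__":
    # In practice these strings would be read from the two thetext.txt files.
    baseline = "Na me: Roosevelt, Fra nklin\nAge: 102\n"
    with_config = "Na me: Roosevelt, Fra nklin\nAge: 102\n"
    changes = diff_outputs(baseline, with_config)
    print("identical" if not changes else "\n".join(changes))
```

An empty diff only shows that the recognised text did not change; it does not 
by itself show whether the files were loaded.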



However, I can modify the config file to force tesseract to use only 
lower-case letters:
cs@pleco:/data/OCR/tesseract$ cat builds/tesseract-3.03/share/tessdata/configs/bazaar_test
tessedit_char_whitelist abcdefghijklmnopqrtsuvwxyz


The modified config file does affect the output:
cs@pleco:/data/OCR/tesseract/tests$ tesseract-3.03 testImages/test-002.png thetext -psm 3 bazaar_test
Tesseract Open Source OCR Engine v3.03 with Leptonica
cs@pleco:/data/OCR/tesseract/tests$ cat thetext.txt
we mei koosevelt lira nklin


 gei loz


wamei rlarpen stephen
 gei sa


wamei rlawk mny
 gei em


wanei shakespeara sill
 gei wz



So in this case the config file works.
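
The same conclusion can be checked programmatically. Below is a minimal 
sketch that lists any non-whitespace characters in an output that fall 
outside the whitelist; the helper name outside_whitelist is hypothetical, and 
the whitelist string is copied from the config above.

```python
# Whitelist copied verbatim from the bazaar_test config.
WHITELIST = set("abcdefghijklmnopqrtsuvwxyz")


def outside_whitelist(text, allowed=WHITELIST):
    """Return the sorted non-whitespace characters of text that are not allowed."""
    return sorted({ch for ch in text if not ch.isspace() and ch not in allowed})


if __name__ == "__main__":
    restricted = ("we mei koosevelt lira nklin\n gei loz\n"
                  "wamei rlarpen stephen\n gei sa\n")
    unrestricted = "Name: Harper, Stephen\nAge: 58\n"
    # An empty list means every character came from the whitelist.
    print(outside_whitelist(restricted))
    print(outside_whitelist(unrestricted))
```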

What other steps can I take to confirm that tesseract is using the 
user-patterns and user-words files? Is it necessary to train tesseract before 
applying user-patterns?

Thanks for reading,
Chris
