I have the same question. Are there any answers?
I tried to get tesseract to match the words in my own customized
user-words file, but it returned the same result.
I cannot see any effect from the user-words and user-patterns files.
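
A rough way to see where a word list could have helped is to list the OCR tokens that do not appear in the user-words file. The sketch below uses the word list and the OCR output from the quoted message; the helper names (`load_words`, `unknown_tokens`) are made up for illustration and are not part of tesseract. It only inspects the text after the fact and says nothing about whether tesseract actually loaded the files.

```python
import re

# Word list and OCR output copied from the quoted message below.
USER_WORDS = """Name
Age
Roosevelt
Franklin
Harper
Stephen
Hawk
Tony
Shakespeare
"""

OCR_TEXT = """Na me: Roosevelt, Fra nklin
Age: 102
Name: Harper, Stephen
Age: 58
Name: Hawk, Tony
Age: 34
Nane: Shakespeare, Bill
Age: 432
"""


def load_words(text):
    """Parse a user-words style file: one word per line, blank lines ignored."""
    return {line.strip() for line in text.splitlines() if line.strip()}


def unknown_tokens(ocr_text, words):
    """Return the alphabetic OCR tokens that are not in the word list, in order."""
    tokens = re.findall(r"[A-Za-z]+", ocr_text)
    return [token for token in tokens if token not in words]


print(unknown_tokens(OCR_TEXT, load_words(USER_WORDS)))
```

Tokens such as "Nane" or the split "Fra nklin" show up in the result, which are the spots where one would expect a working word list to change the output.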

On Tuesday, 3 June 2014 03:54:24 UTC-7, Christopher Smeenk wrote:
>
> I am attempting to use tesseract to read data from a scanned high 
> school transcript. The form contains a number of fields (student name, 
> gender, address) and corresponding values (characters, words or numbers).
>
> I wish to confirm that I can control the behaviour of tesseract using the 
> eng.user-patterns and eng.user-words files as described in the man page 
> <http://tesseract-ocr.googlecode.com/svn-history/r725/trunk/doc/tesseract.1.html>
> and the file trie.h. I created a test image for this purpose (attached).
>
> First some info about my system
> cs@pleco:/data/OCR/tesseract/tests$ tesseract-3.03 -v
> tesseract 3.03
>  leptonica-1.70
>   libjpeg 8b : libpng 1.2.46 : zlib 1.2.3.4
>
>
>
> Here is the result of running tesseract on the test image with no 
> config file:
> cs@pleco:/data/OCR/tesseract/tests$ tesseract-3.03 testImages/test-002.png thetext -psm 3
> Tesseract Open Source OCR Engine v3.03 with Leptonica
> cs@pleco:/data/OCR/tesseract/tests$ cat thetext.txt 
> Na me: Roosevelt, Fra nklin
>
>
> Age: 102
>
>
> Name: Harper, Stephen
> Age: 58
>
>
> Name: Hawk, Tony
> Age: 34
>
>
> Nane: Shakespeare, Bill
> Age: 432
>
>
>
> Next I create the config file and the user-patterns and user-words files
> cs@pleco:/data/OCR/tesseract$ cat builds/tesseract-3.03/share/tessdata/eng.test-words
> Name
> Age
> Roosevelt
> Franklin
> Harper
> Stephen
> Hawk
> Tony
> Shakespeare
>
>
> cs@pleco:/data/OCR/tesseract$ cat builds/tesseract-3.03/share/tessdata/eng.test-patterns
> Nam\c*
>
>
> cs@pleco:/data/OCR/tesseract$ cat builds/tesseract-3.03/share/tessdata/configs/bazaar_test
> load_system_dawg 0
> load_freq_dawg 0
> user_words_suffix test-words
> user_patterns_suffix test-patterns
>
>
>
> Now here is the output when the config files are used
> cs@pleco:/data/OCR/tesseract/tests$ tesseract-3.03 testImages/test-002.png thetext -psm 3 bazaar_test
> Tesseract Open Source OCR Engine v3.03 with Leptonica
> cs@pleco:/data/OCR/tesseract/tests$ cat thetext.txt 
> Na me: Roosevelt, Fra nklin
>
>
> Age: 102
>
>
> Name: Harper, Stephen
> Age: 58
>
>
> Name: Hawk, Tony
> Age: 34
>
>
> Nane: Shakespeare, Bill
> Age: 432
>
>
> This is exactly the same as before! It appears the files eng.test-patterns 
> and eng.test-words have no effect on tesseract. 
>
>
>
> However, I can modify the config file to force tesseract to use only lower 
> case letters
> cs@pleco:/data/OCR/tesseract$ cat builds/tesseract-3.03/share/tessdata/configs/bazaar_test
> tessedit_char_whitelist abcdefghijklmnopqrtsuvwxyz
>
>
> The modified config file does affect the output
> cs@pleco:/data/OCR/tesseract/tests$ tesseract-3.03 testImages/test-002.png thetext -psm 3 bazaar_test
> Tesseract Open Source OCR Engine v3.03 with Leptonica
> cs@pleco:/data/OCR/tesseract/tests$ cat thetext.txt
> we mei koosevelt lira nklin
>
>
>  gei loz
>
>
> wamei rlarpen stephen
>  gei sa
>
>
> wamei rlawk mny
>  gei em
>
>
> wanei shakespeara sill
>  gei wz
>
>
>
> So in this case the config file works.
>
> What other steps can I take to confirm tesseract is using the user-pattern 
> files? Is it necessary to train tesseract before applying user-patterns?
>
> Thanks for reading,
> Chris
>
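
On the question of how to confirm whether a config has any effect: the manual "run twice and eyeball the output" comparison in the quoted message can be automated. The following is only a sketch under stated assumptions. It reuses the exact command line shown above (`tesseract-3.03 <image> <outbase> -psm 3 [config]`), and the function names (`build_command`, `run_ocr`, `diff_outputs`, `compare`) are hypothetical helpers, not part of tesseract. It can show that two runs differ, but it cannot tell you why a config was ignored.

```python
import difflib
import subprocess
from pathlib import Path


def build_command(image, out_base, config=None, binary="tesseract-3.03"):
    """Build the command line used in the quoted message (-psm 3, optional config)."""
    cmd = [binary, str(image), str(out_base), "-psm", "3"]
    if config:
        cmd.append(config)
    return cmd


def run_ocr(image, out_base, config=None):
    """Run tesseract and return the text it wrote to <out_base>.txt."""
    subprocess.run(build_command(image, out_base, config), check=True)
    return Path(f"{out_base}.txt").read_text(encoding="utf-8")


def diff_outputs(before, after):
    """Return unified-diff lines; an empty list means the two outputs are identical."""
    return list(difflib.unified_diff(
        before.splitlines(), after.splitlines(),
        fromfile="no-config", tofile="with-config", lineterm=""))


def compare(image, config):
    """Run OCR with and without the config and print what changed, if anything."""
    plain = run_ocr(image, "out_plain")
    tuned = run_ocr(image, "out_tuned", config=config)
    changes = diff_outputs(plain, tuned)
    print("\n".join(changes) if changes else "Outputs are identical.")
```

Calling `compare("testImages/test-002.png", "bazaar_test")` from the tests directory would reproduce the experiment above. With the user-words config the expected result, going by the outputs in the message, is "Outputs are identical.", while the whitelist config would produce a non-empty diff.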
