Hey Everyone,

This is my first post :-) Thanks for working on and maintaining this 
excellent tool!

I'm trying to refine the accuracy of the results we're getting back from 
Tesseract and seem to have encountered a lack of documentation around the 
user-patterns file.

My belief is that I should be generating this file much like the dawg files 
and user-words files, and referencing it in my config like this: 

user_patterns_suffix user-patterns


*At the moment I'm trying to accomplish three things:*
1. Ensure that any text string starting with "www." is expected to contain 
some text and then end with ".com".
2. Ensure that phone numbers are recognized. The actual text being 
transcribed is something like "(123) 123-1234". My assumption is that I 
could tell Tesseract to expect a pair of parentheses containing three digits, 
a space, three digits, a dash and then four digits. The real issue I'm seeing 
is that it's not aware that this pattern should only contain digits, and it 
confuses things like the letter D with the digit 0.
3. Inform Tesseract that I'm expecting a lot of prices, for example 
"$1.12", and that everything after the $ should be digits or periods only.
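
To make the three targets above concrete, here is a minimal Python sketch that 
expresses them as regular expressions for checking OCR output after the fact. 
This is not the user-patterns file format (that format is exactly what I'm 
asking about); the names URL_RE, PHONE_RE, PRICE_RE and classify are made up 
for illustration only.

```python
import re

# Hypothetical post-OCR checks mirroring the three patterns described above.
# These are ordinary Python regexes, NOT Tesseract user-patterns syntax.
URL_RE = re.compile(r"www\.\S+\.com")               # "www.", some text, ".com"
PHONE_RE = re.compile(r"\([0-9]{3}\) [0-9]{3}-[0-9]{4}")  # "(123) 123-1234"
PRICE_RE = re.compile(r"\$[0-9.]+")                 # "$" then digits/periods only


def classify(token):
    """Return the name of the first pattern the whole token matches, else None."""
    checks = (("url", URL_RE), ("phone", PHONE_RE), ("price", PRICE_RE))
    for name, pattern in checks:
        if pattern.fullmatch(token):
            return name
    return None
```

A token such as "(123) 123-12D4" fails the phone check because of the letter D, 
which is the kind of misread I would like Tesseract itself to avoid.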

*So my questions are:*
Is there anyone who can tell me about the format of the user-patterns file 
and provide examples of their working user-patterns file / help me 
understand how to solve my pattern challenges? Also, if there is anything 
else I need to do, other than reference this file in the config and include 
it in the same folder as my training data, that would be great to learn 
about.

*What I've done so far:*
I've created a pretty decent training set for my font (around 4000 boxes) 
and a fairly complete dictionary file. I also defined the ambigchars to 
improve some of the simple 'find and replace' type scenarios, although I 
don't think I'm using this as it was intended, as all my '0' type cases seem 
to do nothing. These things combined have had great results (actually the 
dictionary has done the most for me), but I'm really trying to get to the 
next level by giving it some intelligence around the kinds of patterns it 
should expect to find. I had some issues with the Tesseract 3.02 training 
tools, so I checked out the source for v3.03 and compiled it, which resolved 
the issue I had.

Thanks for your help!

-- 
You received this message because you are subscribed to the Google
Groups "tesseract-ocr" group.
To post to this group, send email to [email protected]
To unsubscribe from this group, send email to
[email protected]
For more options, visit this group at
http://groups.google.com/group/tesseract-ocr?hl=en
