Hey everyone, this is my first post :-) Thanks for working on and maintaining this excellent tool!
I'm trying to improve the accuracy of the results we're getting back from Tesseract, and I seem to have run into a lack of documentation around the user-patterns file. My understanding is that I should generate this file much like the dawg files and user-words files, and reference it in my config like this:

    user_patterns_suffix user-patterns

*At the moment I'm trying to accomplish three things:*

1. Ensure that any text string starting with "www." is followed by some text and then a ".com" at the end.
2. Ensure that phone numbers are recognized. The actual text being transcribed looks like "(123) 123-1234". My assumption is that I could tell Tesseract to expect a pair of parentheses containing three digits, a space, three digits, a dash and then four digits. The real issue is that Tesseract isn't aware that this pattern should contain only digits, so it confuses characters such as the letter D with the digit 0.
3. Inform Tesseract that I'm expecting a lot of prices, for example "$1.12", and that everything after the $ should be digits or periods only.

*So my questions are:*

Can anyone tell me about the format of the user-patterns file, share an example of a working user-patterns file, or help me understand how to solve my pattern challenges? Also, if there is anything else I need to do, other than referencing this file in the config and placing it in the same folder as my training data, that would be great to learn about.

*What I've done so far:*

I've created a pretty decent training set for my font (around 4000 boxes) and a fairly complete dictionary file. I also defined the ambigchars to handle some of the simple 'find and replace' type scenarios, although I don't think I'm using this as intended, since all my '0' type cases seem to do nothing.
These things combined have produced great results (the dictionary has actually done the most for me), but I'm really trying to get to the next level by giving Tesseract some intelligence about the kinds of patterns it should expect to find. I had some issues with the Tesseract 3.02 training tools, so I checked out the source for v3.03 and compiled it, which resolved the issue.

Thanks for your help!
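P.S. To make the three cases concrete, here is a rough sketch of the file I imagine I need, written as a small Python script that generates it. It rests on assumptions I have not been able to confirm: that the escape syntax is the one described in the comments of trie.h in the Tesseract source (\d for a digit, \n for a letter or digit, \* to repeat the previous item), that patterns are matched one word at a time (so I split the phone number at the space), and that the file should be named after the language code plus the suffix from my config. The "eng" language code and the helper names are just placeholders.

```python
from pathlib import Path

# Assumed escape syntax, taken from comments in Tesseract's trie.h (unverified):
#   \d  a digit
#   \n  a letter or a digit
#   \*  repeat the previous character or class any number of times
PATTERNS = [
    r"www.\n\*.com",     # goal 1: "www." + some text + ".com"
    r"(\d\d\d)",         # goal 2: area code in parentheses, e.g. "(123)"
    r"\d\d\d-\d\d\d\d",  # goal 2: rest of the number, e.g. "123-1234"
    r"$\d\*.\d\d",       # goal 3: prices such as "$1.12"
]


def render_patterns(patterns=PATTERNS):
    """Return the file body: one pattern per line, newline-terminated."""
    return "\n".join(patterns) + "\n"


def write_user_patterns(directory, lang="eng", suffix="user-patterns"):
    """Write <lang>.<suffix> into the given directory and return its path.

    'eng' is a placeholder language code; the suffix must match the value
    given to user_patterns_suffix in the config file.
    """
    path = Path(directory) / f"{lang}.{suffix}"
    path.write_text(render_patterns(), encoding="utf-8")
    return path


if __name__ == "__main__":
    print(render_patterns(), end="")
```

If patterns can in fact span a space, the two phone-number lines could presumably be merged into one, but I have no way to tell from the documentation I have found.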

