Hi Andrew,

Welcome! Sorry to have been a bit slow to reply.

> I'm trying to refine the accuracy of the results we're getting back from
> Tesseract and seem to have encountered a lack of documentation around the
> user-patterns file.

Yes, it certainly is an area where more documentation is needed. I'll try to find the time to dig through the code and what documentation there is, and get back to you on it soon. In the meantime I'll answer some of your other questions and thoughts.

The main thing I was thinking when reading your email is that you can use the number-dawg for some of these tasks. Going through your list:

> 2. Ensure that phone numbers are recognized. The actual text being transcribed
> is something like "(123) 123-1234". My assumption is that i could tell
> Tesseract expect two brackets containing 3 numbers, a space, three numbers, a
> dash and then 4 numbers. The real issue i'm getting is that its not aware that
> this pattern should only contain numbers, and it confuses things like the
> character D for the letter 0
> 3. Inform tesseract that I'm expecting a lot of prices, for example "$1.12",
> and that everything after the $ should be decimals or periods only

Take a look at the eng.number-dawg. You can get the wordlist it uses by running the following:

  $ combine_tessdata -u eng.traineddata eng.
  $ dawg2wordlist eng.unicharset eng.number-dawg eng.number-wordlist

As described in the combine_tessdata manpage, each digit is represented by a space. Both of these rules should be really easy to put into the number-dawg, something like:

  ( ) -
  $ .
  $ .
  $ .

Note they're untested, and I haven't used the number-dawg myself, but that looks like it ought to work to me.

> 1. Ensure that any text strings starting with "www." expect some text and then
> a ".com" at the end.

The punc-dawg may be enough for this. Maybe something like this:

  www.
  .com

> I also defined the ambigchars to improve some
> of the simple 'find and replace' type scenarios, although i dont think i'm
> using this as it was intended as all my '0' type cases seem to do nothing.

Yes, the '0' type cases don't make a large difference. Arguably they should make a bit more. I wonder if there's a config variable to control that...

Anyway, I agree, someone should document the user-patterns stuff. I'll try to do so if I get time, but if anyone wants to look sooner, or offer their own experiences with it, do go ahead!

Nick

--
You received this message because you are subscribed to the Google Groups "tesseract-ocr" group.
To post to this group, send email to [email protected]
To unsubscribe from this group, send email to [email protected]
For more options, visit this group at http://groups.google.com/group/tesseract-ocr?hl=en
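
The number-dawg convention mentioned above (each digit stored as a space) can be illustrated with a small shell sketch. The helper name `to_number_pattern` is hypothetical and not part of Tesseract. The sketch also assumes that the dawg holds whitespace-delimited tokens, so "(123) 123-1234" is fed in as two tokens; that assumption is untested.

```shell
#!/bin/sh
# Hypothetical helper (not a Tesseract tool): turn a sample token into a
# number-dawg style pattern by replacing every ASCII digit with one space.
to_number_pattern() {
    printf '%s\n' "$1" | sed 's/[0-9]/ /g'
}

# Sample tokens taken from the thread. The phone number is split at its
# literal space, on the (untested) assumption that entries are single tokens.
for token in '(123)' '123-1234' '$1.12'; do
    to_number_pattern "$token"
done
```

The printed lines could then be appended to the eng.number-wordlist produced by dawg2wordlist. Turning the edited wordlist back into a dawg would presumably go through wordlist2dawg and combine_tessdata -o, but check the manpages for the exact arguments in your Tesseract version before relying on that.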

