Hey Sam, did you ever get this working well enough?
I'm using a user-patterns file containing the following:

    (\d\d\d) \d\d\d-\d\d\d\d
    www.\n\*.ca\n\*
    www.\n\*.com\n\*
    CHANGE DUE $\d\*.\d\d

My hope is to detect phone numbers in the format "(123) 123-1234", website addresses ending in .ca or .com with www in front, and a "change due" field.

I'm unsure whether I'm approaching this right, so I'd love to hear about your experiences.

On Tuesday, September 3, 2013 6:59:56 AM UTC-4, sam vara wrote:
>
> Thanks for the reply. A couple of clarifications:
>
> 1. "Tesseract will not bother loading the system dictionary nor the
> dictionary of frequent words and will load and use the eng.user-words":
> does this mean I have to define all possible words that my application
> might encounter?
>
> 2. I want to use many regular expression patterns for various fields in my
> app. Should I define one pattern per line? If so, which one will it
> pick up for which field?
>
> On Thursday, August 29, 2013 11:54:29 PM UTC-4, shree wrote:
>>
>> For details regarding bazaar pattern, see section regarding config files
>> in
>>
>> http://tesseract-ocr.googlecode.com/svn/trunk/doc/tesseract.1.html
>>
>>> Now, if you pass the word *bazaar* as a trailing command line parameter
>>> to Tesseract, Tesseract will not bother loading the system dictionary nor
>>> the dictionary of frequent words and will load and use the eng.user-words
>>> and eng.user-patterns files you provided. The former is a simple word list,
>>> one per line. The format of the latter is documented in dict/trie.h on
>>> read_pattern_list().
>>
>> See link below for details of the patterns
>>
>> http://code.google.com/p/tesseract-ocr/source/browse/trunk/dict/trie.h?r=714
>>
>>> // The pattern list file should contain one pattern per line in UTF-8
>>> // format.
>>> //
>>> // Each pattern can contain any non-whitespace characters, however only the
>>> // patterns that contain characters from the unicharset of the corresponding
>>> // language will be useful.
>>> // The only meta character is '\'. To be used in a pattern as an ordinary
>>> // string it should be escaped with '\' (e.g. string "C:\Documents" should
>>> // be written in the patterns file as "C:\\Documents").
>>> // This function supports a very limited regular expression syntax. One can
>>> // express a character, a certain character class and a number of times the
>>> // entity should be repeated in the pattern.
>>> //
>>> // To denote a character class use one of:
>>> // \c - unichar for which UNICHARSET::get_isalpha() is true (character)
>>> // \d - unichar for which UNICHARSET::get_isdigit() is true
>>> // \n - unichar for which UNICHARSET::get_isdigit() and
>>> //      UNICHARSET::isalpha() are true
>>> // \p - unichar for which UNICHARSET::get_ispunct() is true
>>> // \a - unichar for which UNICHARSET::get_islower() is true
>>> // \A - unichar for which UNICHARSET::get_isupper() is true
>>> //
>>> // \* could be specified after each character or pattern to indicate that
>>> // the character/pattern can be repeated any number of times before the next
>>> // character/pattern occurs.
>>> //
>>> // Examples:
>>> // 1-8\d\d-GOOG-411 will be expanded to strings:
>>> // 1-800-GOOG-411, 1-801-GOOG-411, ... 1-899-GOOG-411.
>>> //
>>> // http://www.\n\*.com will be expanded to strings like:
>>> // http://www.a.com http://www.a123.com ... http://www.ABCDefgHIJKLMNop.com
>>> //
>>> // Note: In choosing which patterns to include please be aware of the fact
>>> // providing very generic patterns will make tesseract run slower.
>>> // For example \n\* at the beginning of the pattern will make Tesseract
>>> // consider all the combinations of proposed character choices for each
>>> // of the segmentations, which will be unacceptably slow.
>>> // Because of potential problems with speed that could be difficult to
>>> // identify, each user pattern has to have at least kSaneNumConcreteChars
>>> // concrete characters from the unicharset at the beginning.
>>
>> On Fri, Aug 30, 2013 at 9:10 AM, Quan Nguyen <[email protected]> wrote:
>>
>>> Try bazaar pattern matching and see if you will have better results.
>>>
>>> http://tesseract-ocr.googlecode.com/svn/trunk/doc/tesseract.1.html
>>>
>>> On Thursday, August 29, 2013 3:33:28 AM UTC-5, sam vara wrote:
>>>>
>>>> this is my first OCR project. I am trying to feed an image that is
>>>> [email protected] i.e an email field. I have a charset restriction
>>>> defined which is alphanumeric (A thru Z and a thru z and @_). When
>>>> tesseract processes this image it outputs 'G' for the @ symbol and _ for
>>>> '.'. I get back xyzG gmail_com. What is the way to solve this ? Should i
>>>> define a more restrictive char set?
>>>>
>>>> Thanks
>>>>
>>> --
>>> You received this message because you are subscribed to the Google
>>> Groups "tesseract-ocr" group.
>>> To post to this group, send email to [email protected]
>>> To unsubscribe from this group, send email to
>>> [email protected]
>>> For more options, visit this group at
>>> http://groups.google.com/group/tesseract-ocr?hl=en
>>>
>>> ---
>>> You received this message because you are subscribed to the Google
>>> Groups "tesseract-ocr" group.
>>> To unsubscribe from this group and stop receiving emails from it, send
>>> an email to [email protected].
>>> For more options, visit https://groups.google.com/groups/opt_out.
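In case it helps anyone reading this thread later: one rough way to sanity-check user patterns before handing them to Tesseract is to translate them into ordinary regular expressions and try them against sample strings. The sketch below is plain Python and is not part of Tesseract. `pattern_to_regex` and the `CLASSES` table are hypothetical names of my own, and the character classes are ASCII approximations of what the quoted trie.h comment describes (Tesseract itself consults the language's unicharset, so real behaviour may differ). Note also that the trie.h comment says a pattern can contain any non-whitespace characters, so the patterns in this thread that contain spaces may not behave as hoped; I have not verified what Tesseract does with them.

```python
import re
import string

# ASCII stand-ins for the classes listed in the trie.h comment.
# Assumption: these only approximate the unicharset-based checks.
CLASSES = {
    "c": "[A-Za-z]",                                # alphabetic
    "d": "[0-9]",                                   # digit
    "n": "[A-Za-z0-9]",                             # digit or alphabetic
    "p": "[" + re.escape(string.punctuation) + "]",  # punctuation
    "a": "[a-z]",                                   # lowercase
    "A": "[A-Z]",                                   # uppercase
}


def pattern_to_regex(pattern):
    """Translate one user-patterns line into a compiled Python regex.

    Hypothetical helper for checking patterns by hand; it does not
    reproduce Tesseract's own matching.
    """
    parts = []
    i = 0
    while i < len(pattern):
        ch = pattern[i]
        if ch == "\\" and i + 1 < len(pattern):
            nxt = pattern[i + 1]
            if nxt == "*":
                # \* repeats the preceding character or class.
                if not parts:
                    raise ValueError("\\* must follow a character or class")
                parts[-1] += "*"
            elif nxt in CLASSES:
                parts.append(CLASSES[nxt])
            else:
                # Any other escaped character (e.g. \\) is a literal.
                parts.append(re.escape(nxt))
            i += 2
        else:
            parts.append(re.escape(ch))
            i += 1
    return re.compile("".join(parts))


if __name__ == "__main__":
    samples = {
        r"(\d\d\d) \d\d\d-\d\d\d\d": "(123) 123-1234",
        r"www.\n\*.com\n\*": "www.example.com",
        r"CHANGE DUE $\d\*.\d\d": "CHANGE DUE $12.50",
    }
    for pattern, text in samples.items():
        matched = pattern_to_regex(pattern).fullmatch(text) is not None
        print(pattern, "matches", repr(text), ":", matched)
```

This only tells you whether a pattern describes the strings you think it does. It says nothing about whether Tesseract will accept the pattern (for example, the kSaneNumConcreteChars requirement mentioned in the quoted comment) or how much it will influence recognition.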

