Thanks for the reply. A couple of clarifications:

1. "Tesseract will not bother loading the system dictionary nor the dictionary of frequent words and will load and use the eng.user-words" -- does this mean I have to define every word that my application might encounter?
2. I want to use many regular expression patterns for various fields in my app. Should I define one pattern per line? If so, which pattern will it pick up for which field?

On Thursday, August 29, 2013 11:54:29 PM UTC-4, shree wrote:
> For details regarding bazaar pattern, see section regarding config files in
> http://tesseract-ocr.googlecode.com/svn/trunk/doc/tesseract.1.html
>
>> Now, if you pass the word *bazaar* as a trailing command line parameter
>> to Tesseract, Tesseract will not bother loading the system dictionary nor
>> the dictionary of frequent words and will load and use the eng.user-words
>> and eng.user-patterns files you provided. The former is a simple word list,
>> one per line. The format of the latter is documented in dict/trie.h on
>> read_pattern_list().
>
> See link below for details of the patterns
> http://code.google.com/p/tesseract-ocr/source/browse/trunk/dict/trie.h?r=714
>
>> // The pattern list file should contain one pattern per line in UTF-8 format.
>> //
>> // Each pattern can contain any non-whitespace characters, however only the
>> // patterns that contain characters from the unicharset of the corresponding
>> // language will be useful.
>> // The only meta character is '\'. To be used in a pattern as an ordinary
>> // string it should be escaped with '\' (e.g. string "C:\Documents" should
>> // be written in the patterns file as "C:\\Documents").
>> // This function supports a very limited regular expression syntax. One can
>> // express a character, a certain character class and a number of times the
>> // entity should be repeated in the pattern.
>> //
>> // To denote a character class use one of:
>> // \c - unichar for which UNICHARSET::get_isalpha() is true (character)
>> // \d - unichar for which UNICHARSET::get_isdigit() is true
>> // \n - unichar for which UNICHARSET::get_isdigit() and
>> //      UNICHARSET::isalpha() are true
>> // \p - unichar for which UNICHARSET::get_ispunct() is true
>> // \a - unichar for which UNICHARSET::get_islower() is true
>> // \A - unichar for which UNICHARSET::get_isupper() is true
>> //
>> // \* could be specified after each character or pattern to indicate that
>> // the character/pattern can be repeated any number of times before the next
>> // character/pattern occurs.
>> //
>> // Examples:
>> // 1-8\d\d-GOOG-411 will be expanded to strings:
>> // 1-800-GOOG-411, 1-801-GOOG-411, ... 1-899-GOOG-411.
>> //
>> // http://www.\n\*.com will be expanded to strings like:
>> // http://www.a.com http://www.a123.com ... http://www.ABCDefgHIJKLMNop.com
>> //
>> // Note: In choosing which patterns to include please be aware of the fact
>> // providing very generic patterns will make tesseract run slower.
>> // For example \n\* at the beginning of the pattern will make Tesseract
>> // consider all the combinations of proposed character choices for each
>> // of the segmentations, which will be unacceptably slow.
>> // Because of potential problems with speed that could be difficult to
>> // identify, each user pattern has to have at least kSaneNumConcreteChars
>> // concrete characters from the unicharset at the beginning.
>
> On Fri, Aug 30, 2013 at 9:10 AM, Quan Nguyen <[email protected]> wrote:
>
>> Try bazaar pattern matching and see if you will have better results.
>>
>> http://tesseract-ocr.googlecode.com/svn/trunk/doc/tesseract.1.html
>>
>> On Thursday, August 29, 2013 3:33:28 AM UTC-5, sam vara wrote:
>>>
>>> this is my first OCR project. I am trying to feed an image that is
>>> [email protected] i.e an email field.
>>> I have a charset restriction defined
>>> which is alphanumeric (A thru Z and a thru z and @_). When tesseract
>>> processes this image it outputs 'G' for the @ symbol and _ for '.'. I get
>>> back xyzG gmail_com. What is the way to solve this? Should i define a more
>>> restrictive char set?
>>>
>>> Thanks
>>>
>> --
>> --
>> You received this message because you are subscribed to the Google
>> Groups "tesseract-ocr" group.
>> To post to this group, send email to [email protected]
>> To unsubscribe from this group, send email to
>> [email protected]
>> For more options, visit this group at
>> http://groups.google.com/group/tesseract-ocr?hl=en
>>
>> ---
>> You received this message because you are subscribed to the Google Groups
>> "tesseract-ocr" group.
>> To unsubscribe from this group and stop receiving emails from it, send an
>> email to [email protected].
>> For more options, visit https://groups.google.com/groups/opt_out.
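
For anyone who wants to test their reading of the pattern syntax quoted from trie.h, here is a rough Python sketch that turns one user-patterns line into a Python regex. It is only an illustration and not how Tesseract matches patterns internally. The names `pattern_to_regex` and `CLASSES` are made up for this example, the character classes are ASCII approximations of the UNICHARSET properties, `\n` is read as "letter or digit" (which is what the quoted URL example suggests), and `\*` is assumed to mean "one or more", since the shortest expansion shown in the comment is http://www.a.com. The sketch also does not enforce the kSaneNumConcreteChars rule.

```python
import re
import string

# ASCII stand-ins for the UNICHARSET classes listed in trie.h.
# Assumption: real Tesseract consults the language's unicharset instead.
CLASSES = {
    "c": "[A-Za-z]",                                  # \c: alphabetic
    "d": "[0-9]",                                     # \d: digit
    "n": "[A-Za-z0-9]",                               # \n: read here as letter or digit
    "p": "[" + re.escape(string.punctuation) + "]",   # \p: punctuation
    "a": "[a-z]",                                     # \a: lowercase
    "A": "[A-Z]",                                     # \A: uppercase
}


def pattern_to_regex(pattern: str) -> re.Pattern:
    """Translate one user-patterns line into a compiled regex (hypothetical helper)."""
    atoms = []  # each entry is [regex_fragment, is_repeated]
    i = 0
    while i < len(pattern):
        ch = pattern[i]
        if ch != "\\":
            atoms.append([re.escape(ch), False])      # ordinary literal character
            i += 1
            continue
        if i + 1 >= len(pattern):
            raise ValueError("dangling backslash at end of pattern")
        nxt = pattern[i + 1]
        if nxt == "*":
            if not atoms:
                raise ValueError("\\* needs a preceding character or class")
            atoms[-1][1] = True                       # assumed meaning: one or more
        elif nxt == "\\":
            atoms.append([re.escape("\\"), False])    # escaped literal backslash
        elif nxt in CLASSES:
            atoms.append([CLASSES[nxt], False])
        else:
            raise ValueError(f"unknown escape \\{nxt}")
        i += 2
    return re.compile("".join(a + ("+" if rep else "") for a, rep in atoms))


if __name__ == "__main__":
    # The two example patterns from the quoted trie.h comment.
    phone = pattern_to_regex(r"1-8\d\d-GOOG-411")
    url = pattern_to_regex(r"http://www.\n\*.com")
    for candidate in ["1-800-GOOG-411", "1-900-GOOG-411"]:
        print(candidate, bool(phone.fullmatch(candidate)))
    for candidate in ["http://www.a123.com", "http://www..com"]:
        print(candidate, bool(url.fullmatch(candidate)))
```

Running a few candidate strings through this before writing them into eng.user-patterns is a cheap way to see whether a pattern is as narrow as intended. Whether Tesseract accepts the same pattern still has to be checked against Tesseract itself.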

