https://code.google.com/p/tesseract-ocr/source/browse/dict/trie.h
// Inserts the list of patterns from the given file into the Trie.
// The pattern list file should contain one pattern per line in UTF-8 format.
//
// Each pattern can contain any non-whitespace characters, however only the
// patterns that contain characters from the unicharset of the corresponding
// language will be useful.
// The only meta character is '\'. To be used in a pattern as an ordinary
// string it should be escaped with '\' (e.g. string "C:\Documents" should
// be written in the patterns file as "C:\\Documents").
// This function supports a very limited regular expression syntax. One can
// express a character, a certain character class and a number of times the
// entity should be repeated in the pattern.
//
// To denote a character class use one of:
// \c - unichar for which UNICHARSET::get_isalpha() is true (character)
// \d - unichar for which UNICHARSET::get_isdigit() is true
// \n - unichar for which UNICHARSET::get_isdigit() and
//      UNICHARSET::isalpha() are true
// \p - unichar for which UNICHARSET::get_ispunct() is true
// \a - unichar for which UNICHARSET::get_islower() is true
// \A - unichar for which UNICHARSET::get_isupper() is true
//
// \* could be specified after each character or pattern to indicate that
// the character/pattern can be repeated any number of times before the next
// character/pattern occurs.
//
// Examples:
// 1-8\d\d-GOOG-411 will be expanded to strings:
// 1-800-GOOG-411, 1-801-GOOG-411, ... 1-899-GOOG-411.
//
// http://www.\n\*.com will be expanded to strings like:
// http://www.a.com http://www.a123.com ... http://www.ABCDefgHIJKLMNop.com
//
// Note: In choosing which patterns to include please be aware of the fact
// providing very generic patterns will make tesseract run slower.
// For example \n\* at the beginning of the pattern will make Tesseract
// consider all the combinations of proposed character choices for each
// of the segmentations, which will be unacceptably slow.
// Because of potential problems with speed that could be difficult to
// identify, each user pattern has to have at least kSaneNumConcreteChars
// concrete characters from the unicharset at the beginning.
bool read_pattern_list(const char *filename, const UNICHARSET &unicharset);

// Initializes the values of *_pattern_ unichar ids.
// This function should be called before calling read_pattern_list().
void initialize_patterns(UNICHARSET *unicharset);

// Fills in the given unichar id vector with the unichar ids that represent
// the patterns of the character classes of the given unichar_id.
void unichar_id_to_patterns(UNICHAR_ID unichar_id,
                            const UNICHARSET &unicharset,
                            GenericVector<UNICHAR_ID> *vec) const;

ShreeDevi
____________________________________________________________
भजन - कीर्तन - आरती @ http://bhajans.ramparivar.com

On Wed, Nov 12, 2014 at 9:57 PM, Steven Norris <ste...@fortyau.com> wrote:

> That may work then. Is there any documentation on patterns that you know
> of? Syntax, format, anything? I'm not sure how to go about formatting my
> patterns.
>
> On Wed, Nov 12, 2014 at 10:12 AM, ShreeDevi Kumar <shreesh...@gmail.com>
> wrote:
>
>> bazaar is nothing but a config file which sets values for a set of config
>> variables, please see
>>
>> https://code.google.com/p/tesseract-ocr/source/browse/tessdata/configs/bazaar
>>
>> So, if patterns are helpful, you can set that up as a config.
>>
>> ShreeDevi
>>
>> On Wed, Nov 12, 2014 at 9:09 PM, Steven Norris <ste...@fortyau.com>
>> wrote:
>>
>>> In a way. I can set values for keys that would appear in a config file,
>>> like the below:
>>>
>>> [tesseract setVariableValue:@"0123456789"
>>> forKey:@"tessedit_char_whitelist"];
>>>
>>> On Wed, Nov 12, 2014 at 12:30 AM, ShreeDevi Kumar <shreesh...@gmail.com>
>>> wrote:
>>>
>>>> Are you able to pass a configuration variable with the iOS CocoaPod?
>>>>
>>>> *-c configvar=value*
>>>>
>>>> Set value for control parameter. Multiple -c arguments are allowed.
>>>>
>>>> *configfile*
>>>>
>>>> The name of a config to use. A config is a plaintext file which
>>>> contains a list of variables and their values, one per line, with a space
>>>> separating variable from value.
>>>>
>>>> ShreeDevi
>>>>
>>>> On Wed, Nov 12, 2014 at 10:33 AM, Steven Norris <ste...@fortyau.com>
>>>> wrote:
>>>>
>>>>> I did see that. Unfortunately I cannot use bazaar, as the final
>>>>> version of what I'm using will be using an iOS CocoaPod that does not
>>>>> support the bazaar functionality of Tesseract.
>>>>>
>>>>> On Tue, Nov 11, 2014 at 8:51 PM, ShreeDevi Kumar <shreesh...@gmail.com>
>>>>> wrote:
>>>>>
>>>>>> On Wed, Nov 12, 2014 at 2:13 AM, <ste...@fortyau.com> wrote:
>>>>>>
>>>>>>> The user-patterns looks helpful, but I can't find any documentation
>>>>>>> on formatting or how it works. Is there documentation on this somewhere?
>>>>>>
>>>>>> Did you see the man page? I had also sent a link to a related
>>>>>> discussion in the past. Search the archives for other tips.
>>>>>>
>>>>>> https://tesseract-ocr.googlecode.com/svn/trunk/doc/tesseract.1.html
>>>>>> says
>>>>>> "if you pass the word *bazaar* as a trailing command line parameter
>>>>>> to Tesseract, Tesseract will not bother loading the system dictionary nor
>>>>>> the dictionary of frequent words and will load and use the eng.user-words
>>>>>> and eng.user-patterns files you provided. The former is a simple word
>>>>>> list, one per line. The format of the latter is documented in
>>>>>> dict/trie.h on read_pattern_list()."
>>>>>>
>>>>>> https://code.google.com/p/tesseract-ocr/source/browse/dict/trie.h
>>>>>> see lines 199-232
>>>>>>
>>>>>>> On Tuesday, November 11, 2014 10:50:57 AM UTC-6, ste...@fortyau.com
>>>>>>> wrote:
>>>>>>>>
>>>>>>>> I am working on getting Tesseract to recognize VINs for an
>>>>>>>> application I am developing. I have a clean VIN image (as a
>>>>>>>> workaround, converted to black text on a white background). I have
>>>>>>>> traineddata using the fonts Courier, HelveticaNeue, LatoBold,
>>>>>>>> LatoLight, OpenSans, and RobotoSlab as a first attempt. I've also
>>>>>>>> limited the unicharset to A-Z (except I and O) and 0-9.
>>>>>>>>
>>>>>>>> The result is not very good. It returns far more characters than
>>>>>>>> the 17 actually present. Is there a way to limit tesseract to only
>>>>>>>> detecting a 17 character word in one line? I'd also like to have
>>>>>>>> tesseract prefer, but not require, the last 5 characters to be
>>>>>>>> digits. There are a few other preferences that may help too, but I
>>>>>>>> want to start with these. I'm not sure how to go about setting up
>>>>>>>> those preferences.
>>>>>>>>
>>>>>>>> Also, any suggestions past these on being able to clean up the OCR
>>>>>>>> to read more correctly would be helpful. I can't post full data and
>>>>>>>> image here (they're VINs; I'd need permission to do so), but I can
>>>>>>>> say that in one instance WM is coming back as 6W6M and that the
>>>>>>>> digits 67258 are coming back as 572S5 in another.
>>>>>>>>
>>>>>>>> Any guidance would be appreciated. I'll provide whatever
>>>>>>>> information I can.
>>>>>>>>
>>>>>>>> Thanks!
>>>>>>>>
>>>>>>> --
>>>>>>> You received this message because you are subscribed to the Google
>>>>>>> Groups "tesseract-ocr" group.
>>>>>>> To unsubscribe from this group and stop receiving emails from it,
>>>>>>> send an email to tesseract-ocr+unsubscr...@googlegroups.com.
>>>>>>> To post to this group, send email to tesseract-ocr@googlegroups.com.
>>>>>>> Visit this group at http://groups.google.com/group/tesseract-ocr.
>>>>>>> To view this discussion on the web visit
>>>>>>> https://groups.google.com/d/msgid/tesseract-ocr/065a4b64-bcba-4d02-bc81-461d9ae11655%40googlegroups.com.
>>>>>>> For more options, visit https://groups.google.com/d/optout.
>>>>>
>>>>> --
>>>>> *Steven T. Norris*
>>>>> *Software Engineer - Forty AU*
>>>>>
>>>>> *p: (615)997-0836*
>>>>> *e: ste...@fortyau.com*
>>>>> *w: http://www.linkedin.com/in/steventnorris*

--
You received this message because you are subscribed to the Google Groups "tesseract-ocr" group.
To unsubscribe from this group and stop receiving emails from it, send an email to tesseract-ocr+unsubscr...@googlegroups.com.
To post to this group, send email to tesseract-ocr@googlegroups.com.
Visit this group at http://groups.google.com/group/tesseract-ocr.
To view this discussion on the web visit https://groups.google.com/d/msgid/tesseract-ocr/CAG2NduXV0kBj2dJ6b0V95znt%2BV9OPm%2BMnfsK_2rx6j2sNStu%2Bg%40mail.gmail.com.
For more options, visit https://groups.google.com/d/optout.
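
For anyone trying out the pattern syntax from the trie.h comment quoted at the top of this message, here is a rough standalone sketch that checks whether a string would be covered by a pattern line. It is not Tesseract's code (Tesseract loads the patterns into a trie) and it rests on several assumptions of mine: the function names (ParsePattern, MatchFrom, PatternMatches) are made up for illustration, the <cctype> ASCII tests stand in for the UNICHARSET lookups, \n is read as "digit or letter" because the a123 example only makes sense that way, and \* is read as "one or more" because the listed expansions never show an empty repeat.

```cpp
#include <cassert>
#include <cctype>
#include <cstddef>
#include <string>
#include <vector>

// One parsed pattern element: a literal character or a character class,
// optionally marked as repeatable by a trailing \*.
struct Element {
  char cls;      // 0 for a literal, otherwise one of c, d, n, p, a, A
  char literal;  // used only when cls == 0
  bool repeat;   // true if the element was followed by \*
};

// Parses one pattern line. Returns false on a trailing backslash, an
// unknown escape, or a \* that has nothing to repeat.
bool ParsePattern(const std::string &pattern, std::vector<Element> *out) {
  out->clear();
  for (std::size_t i = 0; i < pattern.size(); ++i) {
    if (pattern[i] != '\\') {
      out->push_back({0, pattern[i], false});
      continue;
    }
    if (++i == pattern.size()) return false;  // trailing backslash
    const char e = pattern[i];
    if (e == '*') {
      if (out->empty() || out->back().repeat) return false;
      out->back().repeat = true;
    } else if (e == '\\') {
      out->push_back({0, '\\', false});  // escaped literal backslash
    } else if (std::string("cdnpaA").find(e) != std::string::npos) {
      out->push_back({e, 0, false});
    } else {
      return false;  // unknown escape
    }
  }
  return true;
}

// ASCII stand-ins for the UNICHARSET property checks (an assumption).
bool MatchesElement(const Element &el, char ch) {
  const unsigned char u = static_cast<unsigned char>(ch);
  switch (el.cls) {
    case 0:   return ch == el.literal;
    case 'c': return std::isalpha(u) != 0;
    case 'd': return std::isdigit(u) != 0;
    case 'n': return std::isalnum(u) != 0;  // assumed: digit OR letter
    case 'p': return std::ispunct(u) != 0;
    case 'a': return std::islower(u) != 0;
    case 'A': return std::isupper(u) != 0;
  }
  return false;
}

// Simple backtracking match of text[ti..] against elems[ei..].
// A repeatable element must consume at least one character (assumed).
bool MatchFrom(const std::vector<Element> &elems, std::size_t ei,
               const std::string &text, std::size_t ti) {
  if (ei == elems.size()) return ti == text.size();
  if (ti == text.size() || !MatchesElement(elems[ei], text[ti])) return false;
  if (MatchFrom(elems, ei + 1, text, ti + 1)) return true;
  return elems[ei].repeat && MatchFrom(elems, ei, text, ti + 1);
}

bool PatternMatches(const std::string &pattern, const std::string &text) {
  std::vector<Element> elems;
  return ParsePattern(pattern, &elems) && MatchFrom(elems, 0, text, 0);
}
```

With this sketch, PatternMatches("1-8\\d\\d-GOOG-411", "1-800-GOOG-411") accepts the first example from the comment (the doubled backslashes are only C++ string escaping; the patterns file itself would contain 1-8\d\d-GOOG-411). For the VIN case in this thread, one candidate line would be twelve \n followed by five \d, which forces 17 characters with digits in the last five positions. Whether Tesseract accepts a pattern with no concrete characters at the start depends on kSaneNumConcreteChars, which I have not checked, and a pattern file requires those digits rather than merely preferring them.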