https://code.google.com/p/tesseract-ocr/source/browse/dict/trie.h

  // Inserts the list of patterns from the given file into the Trie.
  // The pattern list file should contain one pattern per line in UTF-8 format.
  //
  // Each pattern can contain any non-whitespace characters, however only the
  // patterns that contain characters from the unicharset of the corresponding
  // language will be useful.
  // The only meta character is '\'. To be used in a pattern as an ordinary
  // string it should be escaped with '\' (e.g. string "C:\Documents" should
  // be written in the patterns file as "C:\\Documents").
  // This function supports a very limited regular expression syntax. One can
  // express a character, a certain character class and a number of times the
  // entity should be repeated in the pattern.
  //
  // To denote a character class use one of:
  // \c - unichar for which UNICHARSET::get_isalpha() is true (character)
  // \d - unichar for which UNICHARSET::get_isdigit() is true
  // \n - unichar for which UNICHARSET::get_isdigit() or
  //      UNICHARSET::get_isalpha() is true
  // \p - unichar for which UNICHARSET::get_ispunct() is true
  // \a - unichar for which UNICHARSET::get_islower() is true
  // \A - unichar for which UNICHARSET::get_isupper() is true
  //
  // \* could be specified after each character or pattern to indicate that
  // the character/pattern can be repeated any number of times before the next
  // character/pattern occurs.
  //
  // Examples:
  // 1-8\d\d-GOOG-411 will be expanded to strings:
  // 1-800-GOOG-411, 1-801-GOOG-411, ... 1-899-GOOG-411.
  //
  // http://www.\n\*.com will be expanded to strings like:
  // http://www.a.com http://www.a123.com ... http://www.ABCDefgHIJKLMNop.com
  //
  // Note: In choosing which patterns to include please be aware of the fact
  // that providing very generic patterns will make Tesseract run slower.
  // For example \n\* at the beginning of the pattern will make Tesseract
  // consider all the combinations of proposed character choices for each
  // of the segmentations, which will be unacceptably slow.
  // Because of potential problems with speed that could be difficult to
  // identify, each user pattern has to have at least kSaneNumConcreteChars
  // concrete characters from the unicharset at the beginning.
  bool read_pattern_list(const char *filename, const UNICHARSET &unicharset);

  // Initializes the values of *_pattern_ unichar ids.
  // This function should be called before calling read_pattern_list().
  void initialize_patterns(UNICHARSET *unicharset);

  // Fills in the given unichar id vector with the unichar ids that represent
  // the patterns of the character classes of the given unichar_id.
  void unichar_id_to_patterns(UNICHAR_ID unichar_id,
                              const UNICHARSET &unicharset,
                              GenericVector<UNICHAR_ID> *vec) const;
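
To make the syntax in the comment above a bit more concrete, here is a rough sketch that rewrites a pattern line as a std::regex and checks candidate strings against it. This is not how Tesseract does it (Tesseract inserts the patterns into a Trie), and PatternToRegex and Matches are hypothetical names of my own. The ASCII character classes only approximate the UNICHARSET properties, and reading \* as "zero or more" is an assumption, since the comment only says "any number of times".

```cpp
#include <cassert>
#include <regex>
#include <string>

// Hypothetical helper, NOT part of Tesseract: rewrites one user-pattern line
// as an ECMAScript regular expression. ASCII classes stand in for the
// UNICHARSET properties, which is only an approximation.
std::string PatternToRegex(const std::string &pattern) {
  const std::string regex_meta = ".^$|()[]{}*+?\\";
  std::string out;
  // Appends one character so that the regex matches it literally.
  auto append_literal = [&](char ch) {
    if (regex_meta.find(ch) != std::string::npos) out += '\\';
    out += ch;
  };
  for (std::size_t i = 0; i < pattern.size(); ++i) {
    if (pattern[i] != '\\' || i + 1 == pattern.size()) {
      append_literal(pattern[i]);  // ordinary character (or a stray final '\')
      continue;
    }
    switch (pattern[++i]) {
      case 'c': out += "[A-Za-z]"; break;     // alphabetic
      case 'd': out += "[0-9]"; break;        // digit
      case 'n': out += "[A-Za-z0-9]"; break;  // digit or alphabetic
      case 'p': out += "[[:punct:]]"; break;  // punctuation
      case 'a': out += "[a-z]"; break;        // lower case
      case 'A': out += "[A-Z]"; break;        // upper case
      // Assumption: "any number of times" is read as zero or more.
      case '*': out += '*'; break;
      default: append_literal(pattern[i]); break;  // e.g. "\\" -> literal '\'
    }
  }
  return out;
}

// Returns true if text is one of the strings the pattern would expand to,
// under the assumptions stated above.
bool Matches(const std::string &pattern, const std::string &text) {
  return std::regex_match(text, std::regex(PatternToRegex(pattern)));
}
```

With this sketch, Matches("1-8\\d\\d-GOOG-411", "1-800-GOOG-411") is true and Matches("1-8\\d\\d-GOOG-411", "1-900-GOOG-411") is false. The doubled backslashes are only C++ string escaping; the patterns file itself would contain 1-8\d\d-GOOG-411.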


ShreeDevi
____________________________________________________________
भजन - कीर्तन - आरती @ http://bhajans.ramparivar.com

On Wed, Nov 12, 2014 at 9:57 PM, Steven Norris <ste...@fortyau.com> wrote:

> That may work then. Is there any documentation on patterns that you know
> of? Syntax, format, anything? I'm not sure how to go about formatting my
> patterns.
>
>
> On Wed, Nov 12, 2014 at 10:12 AM, ShreeDevi Kumar <shreesh...@gmail.com>
> wrote:
>
>> bazaar is nothing but a config file that sets values for a set of config
>> variables. Please see
>>
>>
>> https://code.google.com/p/tesseract-ocr/source/browse/tessdata/configs/bazaar
>>
>> So, if patterns are helpful, you can use that as a config.
>>
>> ShreeDevi
>>
>> On Wed, Nov 12, 2014 at 9:09 PM, Steven Norris <ste...@fortyau.com>
>> wrote:
>>
>>> In a way. I can set values for keys that would appear in a config file,
>>> like the below:
>>>
>>> [tesseract setVariableValue:@"0123456789" forKey:@"tessedit_char_whitelist"];
>>>
>>>
>>> On Wed, Nov 12, 2014 at 12:30 AM, ShreeDevi Kumar <shreesh...@gmail.com>
>>> wrote:
>>>
>>>> Are you able to pass a configuration variable with the iOS CocoaPod?
>>>>
>>>> *-c configvar=value*
>>>>
>>>> Set value for control parameter. Multiple -c arguments are allowed.
>>>>
>>>>
>>>> *configfile*
>>>>
>>>> The name of a config to use. A config is a plaintext file which
>>>> contains a list of variables and their values, one per line, with a space
>>>> separating variable from value.
>>>>
>>>> ShreeDevi
>>>>
>>>> On Wed, Nov 12, 2014 at 10:33 AM, Steven Norris <ste...@fortyau.com>
>>>> wrote:
>>>>
>>>>> I did see that. Unfortunately I cannot use bazaar, as the final
>>>>> version of what I'm building will use an iOS CocoaPod that does not
>>>>> support the bazaar functionality of Tesseract.
>>>>>
>>>>> On Tue, Nov 11, 2014 at 8:51 PM, ShreeDevi Kumar <shreesh...@gmail.com> wrote:
>>>>>
>>>>>> On Wed, Nov 12, 2014 at 2:13 AM, <ste...@fortyau.com> wrote:
>>>>>>
>>>>>>>
>>>>>>>
>>>>>>> The user-patterns file looks helpful, but I can't find any documentation
>>>>>>> on its format or how it works. Is there documentation on this somewhere?
>>>>>>>
>>>>>>
>>>>>>
>>>>>> Did you see the man page? I had also sent a link to a related
>>>>>> discussion in the past. Search the archives for other tips.
>>>>>>
>>>>>> https://tesseract-ocr.googlecode.com/svn/trunk/doc/tesseract.1.html
>>>>>> says
>>>>>> "if you pass the word *bazaar* as a trailing command line parameter
>>>>>> to Tesseract, Tesseract will not bother loading the system dictionary nor
>>>>>> the dictionary of frequent words and will load and use the eng.user-words
>>>>>> and eng.user-patterns files you provided. The former is a simple word
>>>>>> list, one per line. The format of the latter is documented in dict/trie.h
>>>>>> on read_pattern_list()."
>>>>>>
>>>>>> https://code.google.com/p/tesseract-ocr/source/browse/dict/trie.h
>>>>>> see lines 199-232
>>>>>>
>>>>>>>
>>>>>>>
>>>>>>> On Tuesday, November 11, 2014 10:50:57 AM UTC-6, ste...@fortyau.com
>>>>>>> wrote:
>>>>>>>>
>>>>>>>> I am working on getting Tesseract to recognize VINs for an
>>>>>>>> application I am developing. I have a clean VIN image (worked around to
>>>>>>>> be black text on a white background). I have traineddata built from the
>>>>>>>> fonts Courier, HelveticaNeue, LatoBold, LatoLight, OpenSans, and
>>>>>>>> RobotoSlab as a first attempt. I've also limited the unicharset to 0-9
>>>>>>>> and A-Z except I and O.
>>>>>>>>
>>>>>>>> The result is not very good. It returns far more characters than the
>>>>>>>> 17 that are present. Is there a way to limit Tesseract to detecting
>>>>>>>> only a single 17-character word on one line? I'd also like to have
>>>>>>>> Tesseract prefer, but not require, the last 5 characters to be digits.
>>>>>>>> There are a few other preferences that may help too, but I want to
>>>>>>>> start with these. I'm not sure how to go about setting up those
>>>>>>>> preferences.
>>>>>>>>
>>>>>>>> Also, any further suggestions for cleaning up the OCR so that it reads
>>>>>>>> more accurately would be helpful. I can't post full data and images
>>>>>>>> here (they're VINs, and I'd need permission to do so), but I can say
>>>>>>>> that in one instance WM is coming back as 6W6M and that the digits
>>>>>>>> 67258 are coming back as 572S5 in another.
>>>>>>>>
>>>>>>>> Any guidance would be appreciated. I'll provide whatever
>>>>>>>> information I can.
>>>>>>>>
>>>>>>>> Thanks!
>>>>>>>>
>>>>>>>  --
>>>>>>> You received this message because you are subscribed to the Google
>>>>>>> Groups "tesseract-ocr" group.
>>>>>>> To unsubscribe from this group and stop receiving emails from it,
>>>>>>> send an email to tesseract-ocr+unsubscr...@googlegroups.com.
>>>>>>> To post to this group, send email to tesseract-ocr@googlegroups.com.
>>>>>>> Visit this group at http://groups.google.com/group/tesseract-ocr.
>>>>>>> To view this discussion on the web visit
>>>>>>> https://groups.google.com/d/msgid/tesseract-ocr/065a4b64-bcba-4d02-bc81-461d9ae11655%40googlegroups.com
>>>>>>> <https://groups.google.com/d/msgid/tesseract-ocr/065a4b64-bcba-4d02-bc81-461d9ae11655%40googlegroups.com?utm_medium=email&utm_source=footer>
>>>>>>> .
>>>>>>>
>>>>>>> For more options, visit https://groups.google.com/d/optout.
>>>>>>>
>>>>>>
>>>>>>
>>>>>
>>>>>
>>>>>
>>>>> --
>>>>> *Steven T. Norris*
>>>>> *Software Engineer - Forty AU*
>>>>>
>>>>> *p: (615)997-0836 <%28615%29997-0836>*
>>>>> *e: s <steventnor...@gmail.com>te...@fortyau.com <te...@fortyau.com>*
>>>>> *w: http://www.linkedin.com/in/steventnorris
>>>>> <http://www.linkedin.com/in/steventnorris>*
>>>>>
>>>>>
>>>>
>>>>
>>>
>>>
>>>
>>>
>>
>>
>
>
>
>

