Re: OCR char restriction

Shree Devi Kumar Thu, 29 Aug 2013 20:56:09 -0700

For details on bazaar pattern matching, see the section on config files in
the Tesseract manual page:

http://tesseract-ocr.googlecode.com/svn/trunk/doc/tesseract.1.html

> Now, if you pass the word *bazaar* as a trailing command line parameter to
> Tesseract, Tesseract will not bother loading the system dictionary nor the
> dictionary of frequent words and will load and use the eng.user-words and
> eng.user-patterns files you provided. The former is a simple word list, one
> per line. The format of the latter is documented in dict/trie.h on
> read_pattern_list().
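
To make the quoted setup concrete, here is a minimal Python sketch (mine, not from the Tesseract docs). It writes the two user files and builds a command line with bazaar as the trailing parameter. The helper names write_user_files and bazaar_command, the file names field.png and out, and the sample word list are all hypothetical. It also assumes the user files belong in your tessdata directory and that a config file named bazaar is installed there, so please check both against your own installation.

```python
from pathlib import Path


def write_user_files(tessdata_dir, words, patterns, lang="eng"):
    """Write <lang>.user-words and <lang>.user-patterns, one entry per line."""
    tessdata = Path(tessdata_dir)
    tessdata.mkdir(parents=True, exist_ok=True)
    words_path = tessdata / f"{lang}.user-words"
    patterns_path = tessdata / f"{lang}.user-patterns"
    # Both files are plain UTF-8 text with one word or pattern per line.
    words_path.write_text("".join(w + "\n" for w in words), encoding="utf-8")
    patterns_path.write_text("".join(p + "\n" for p in patterns), encoding="utf-8")
    return words_path, patterns_path


def bazaar_command(image, output_base, lang="eng"):
    """Build the argument list; 'bazaar' goes last, as the quoted text says."""
    return ["tesseract", image, output_base, "-l", lang, "bazaar"]


if __name__ == "__main__":
    # Hypothetical inputs: adjust the directory, words and image to your setup.
    write_user_files("tessdata", ["gmail", "com"], [r"1-8\d\d-GOOG-411"])
    print(" ".join(bazaar_command("field.png", "out")))
```

The command is only printed here, not executed, so you can inspect it before running Tesseract yourself.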


See the link below for details of the pattern syntax:

http://code.google.com/p/tesseract-ocr/source/browse/trunk/dict/trie.h?r=714

>   // The pattern list file should contain one pattern per line in UTF-8 format.
>   //
>   // Each pattern can contain any non-whitespace characters, however only the
>   // patterns that contain characters from the unicharset of the corresponding
>   // language will be useful.
>   // The only meta character is '\'. To be used in a pattern as an ordinary
>   // string it should be escaped with '\' (e.g. string "C:\Documents" should
>   // be written in the patterns file as "C:\\Documents").
>   // This function supports a very limited regular expression syntax. One can
>   // express a character, a certain character class and a number of times the
>   // entity should be repeated in the pattern.
>   //
>   // To denote a character class use one of:
>   // \c - unichar for which UNICHARSET::get_isalpha() is true (character)
>   // \d - unichar for which UNICHARSET::get_isdigit() is true
>   // \n - unichar for which UNICHARSET::get_isdigit() and
>   //      UNICHARSET::isalpha() are true
>   // \p - unichar for which UNICHARSET::get_ispunct() is true
>   // \a - unichar for which UNICHARSET::get_islower() is true
>   // \A - unichar for which UNICHARSET::get_isupper() is true
>   //
>   // \* could be specified after each character or pattern to indicate that
>   // the character/pattern can be repeated any number of times before the next
>   // character/pattern occurs.
>   //
>   // Examples:
>   // 1-8\d\d-GOOG-411 will be expanded to strings:
>   // 1-800-GOOG-411, 1-801-GOOG-411, ... 1-899-GOOG-411.
>   //
>   // http://www.\n\*.com will be expanded to strings like:
>   // http://www.a.com http://www.a123.com ... http://www.ABCDefgHIJKLMNop.com
>   //
>   // Note: In choosing which patterns to include please be aware of the fact
>   // providing very generic patterns will make tesseract run slower.
>   // For example \n\* at the beginning of the pattern will make Tesseract
>   // consider all the combinations of proposed character choices for each
>   // of the segmentations, which will be unacceptably slow.
>   // Because of potential problems with speed that could be difficult to
>   // identify, each user pattern has to have at least kSaneNumConcreteChars
>   // concrete characters from the unicharset at the beginning.
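
If you want to check what a pattern would accept before handing it to Tesseract, the syntax above can be approximated with an ordinary regular expression. The Python sketch below is only an illustration and is not how Tesseract implements it. The function name pattern_to_regex is hypothetical, the character classes are ASCII stand-ins for the UNICHARSET lookups, \n is read as "letter or digit" (which is what the http example suggests), and \* is treated as "one or more", which is my assumption since the comment does not say whether zero repetitions are allowed. It also does not enforce the kSaneNumConcreteChars rule.

```python
import re
import string

# ASCII approximations of the UNICHARSET classes named in the trie.h comment.
CLASSES = {
    "c": "[A-Za-z]",
    "d": "[0-9]",
    "n": "[A-Za-z0-9]",  # assumption: letter or digit
    "p": "[" + re.escape(string.punctuation) + "]",
    "a": "[a-z]",
    "A": "[A-Z]",
}


def pattern_to_regex(pattern):
    """Translate one user-patterns line into a compiled Python regex."""
    parts = []
    i = 0
    while i < len(pattern):
        ch = pattern[i]
        if ch != "\\":
            # Ordinary character: match it literally.
            parts.append(re.escape(ch))
            i += 1
            continue
        if i + 1 >= len(pattern):
            raise ValueError("dangling backslash at end of pattern")
        code = pattern[i + 1]
        if code == "\\":
            # Escaped backslash stands for a literal backslash.
            parts.append(re.escape("\\"))
        elif code == "*":
            if not parts:
                raise ValueError("repetition with nothing to repeat")
            # Assumption: the previous item repeats one or more times.
            parts[-1] += "+"
        elif code in CLASSES:
            parts.append(CLASSES[code])
        else:
            raise ValueError("unknown escape: \\" + code)
        i += 2
    return re.compile("".join(parts))


if __name__ == "__main__":
    rx = pattern_to_regex(r"1-8\d\d-GOOG-411")
    print(bool(rx.fullmatch("1-800-GOOG-411")))
```

Use fullmatch() rather than search() so that the whole string has to fit the pattern, which is closer to how a dictionary entry behaves.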
On Fri, Aug 30, 2013 at 9:10 AM, Quan Nguyen <[email protected]> wrote:

> Try bazaar pattern matching and see if you get better results.
>
> http://tesseract-ocr.googlecode.com/svn/trunk/doc/tesseract.1.html
>
> On Thursday, August 29, 2013 3:33:28 AM UTC-5, sam vara wrote:
>>
>> This is my first OCR project. I am trying to feed in an image that
>> contains [email protected], i.e. an email field. I have a charset restriction
>> defined, which is alphanumeric (A through Z, a through z, and @_). When
>> Tesseract processes this image, it outputs 'G' for the @ symbol and '_' for
>> '.'. I get back xyzG gmail_com. How can I solve this? Should I define a more
>> restrictive char set?
>>
>> Thanks
>>
>  --
> --
> You received this message because you are subscribed to the Google
> Groups "tesseract-ocr" group.
> To post to this group, send email to [email protected]
> To unsubscribe from this group, send email to
> [email protected]
> For more options, visit this group at
> http://groups.google.com/group/tesseract-ocr?hl=en
>
> ---
> You received this message because you are subscribed to the Google Groups
> "tesseract-ocr" group.
> To unsubscribe from this group and stop receiving emails from it, send an
> email to [email protected].
> For more options, visit https://groups.google.com/groups/opt_out.
>

