Re: OCR char restriction

Andrew McGrath Mon, 20 Jan 2014 12:40:52 -0800

Hey Sam,

Did you ever get this working sufficiently?


I'm using a user-pattern file containing the following:
(\d\d\d) \d\d\d-\d\d\d\d
www.\n\*.ca\n\*
www.\n\*.com\n\*
CHANGE DUE $\d\*.\d\d

My hope is to detect phone numbers in the format "(123) 123-1234", 
website addresses ending in .ca or .com with www at the front, and a 
change-due field.
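
To make it easier to see what strings a given pattern line is meant to cover, here is a minimal Python sketch that translates the syntax described in the trie.h comment quoted below into an ordinary regex. It is only an approximation for experimenting outside Tesseract: the function name, the ASCII character classes, and the choice to reject whitespace outright are assumptions, and Tesseract itself consults the language's unicharset rather than anything like this.

```python
import re

# Character classes as described in the trie.h comment quoted below.
# Mapping them onto ASCII ranges is an assumption for illustration only.
CLASS_TO_REGEX = {
    "c": "[A-Za-z]",
    "d": "[0-9]",
    "n": "[A-Za-z0-9]",
    "p": r"[!-/:-@\[-`{-~]",
    "a": "[a-z]",
    "A": "[A-Z]",
}


def user_pattern_to_regex(pattern):
    """Approximate one user-pattern line as a compiled Python regex."""
    parts = []
    i = 0
    while i < len(pattern):
        ch = pattern[i]
        if ch.isspace():
            # The comment says patterns contain non-whitespace characters.
            raise ValueError("whitespace in pattern %r" % pattern)
        if ch != "\\":
            # Everything except '\' is literal, including '.' and '$'.
            parts.append(re.escape(ch))
            i += 1
            continue
        if i + 1 >= len(pattern):
            raise ValueError("dangling backslash in %r" % pattern)
        nxt = pattern[i + 1]
        if nxt == "*":
            if not parts:
                raise ValueError("\\* has nothing to repeat in %r" % pattern)
            parts[-1] += "*"  # repeat the previous character or class
        elif nxt == "\\":
            parts.append(re.escape("\\"))  # escaped literal backslash
        elif nxt in CLASS_TO_REGEX:
            parts.append(CLASS_TO_REGEX[nxt])
        else:
            raise ValueError("unknown escape \\%s in %r" % (nxt, pattern))
        i += 2
    return re.compile("".join(parts))


com = user_pattern_to_regex(r"www.\n\*.com\n\*")
print(bool(com.fullmatch("www.example.com")))  # True
print(bool(com.fullmatch("www.example.ca")))   # False
```

Note that the trailing `\n\*` in the .ca and .com patterns also admits extra letters or digits after the suffix, so in this approximation "www.example.cab" would match the .ca pattern.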

I'm unsure whether I'm approaching this right, so I'd love to hear about 
your experiences.
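
Before handing the file to Tesseract, it may be worth linting it against the rules in the trie.h comment quoted below. This Python sketch flags whitespace (the comment says a pattern may contain any non-whitespace characters) and counts the literal characters before the first backslash. The function names, the `min_concrete_chars` parameter and the value 2 used here, and the file names are all assumptions for illustration; the actual value of kSaneNumConcreteChars is not stated in this thread. The command simply follows the man page wording about passing bazaar as a trailing parameter.

```python
def lint_user_patterns(lines, min_concrete_chars=0):
    """Return (line_number, message) pairs for lines that look problematic."""
    problems = []
    for number, raw in enumerate(lines, start=1):
        pattern = raw.rstrip("\r\n")
        if not pattern:
            continue
        if any(ch.isspace() for ch in pattern):
            problems.append((number, "contains whitespace"))
        # Literal characters before the first backslash escape.
        concrete = len(pattern.split("\\", 1)[0])
        if concrete < min_concrete_chars:
            problems.append((number, "too few leading concrete characters"))
    return problems


def bazaar_command(image_path, output_base):
    """Build a Tesseract command line with bazaar as the trailing config name."""
    return ["tesseract", image_path, output_base, "bazaar"]


patterns = [
    r"(\d\d\d) \d\d\d-\d\d\d\d",
    r"www.\n\*.ca\n\*",
    r"www.\n\*.com\n\*",
    r"CHANGE DUE $\d\*.\d\d",
]

# min_concrete_chars=2 is a hypothetical threshold, not Tesseract's value.
for number, message in lint_user_patterns(patterns, min_concrete_chars=2):
    print("line %d: %s" % (number, message))

# receipt.png and out are placeholder names.
print(" ".join(bazaar_command("receipt.png", "out")))
```

If whitespace really is rejected, the phone and CHANGE DUE lines would need to be reworked; that is a reading of the comment, not something verified against Tesseract.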

On Tuesday, September 3, 2013 6:59:56 AM UTC-4, sam vara wrote:
>
> Thanks for the reply. A couple of clarifications:
>
> 1. "Tesseract will not bother loading the system dictionary nor the 
> dictionary of frequent words and will load and use the eng.user-words". 
> Does this mean I have to define all possible words that my application 
> might encounter?
>
> 2. I want to use many regular expression patterns for various fields in my 
> app. Should I define one pattern per line? If so, which one will it pick 
> up for which field?
>
> On Thursday, August 29, 2013 11:54:29 PM UTC-4, shree wrote:
>>
>> For details on the bazaar pattern, see the section on config files in
>>
>> http://tesseract-ocr.googlecode.com/svn/trunk/doc/tesseract.1.html
>>
>>> Now, if you pass the word *bazaar* as a trailing command line parameter 
>>> to Tesseract, Tesseract will not bother loading the system dictionary nor 
>>> the dictionary of frequent words and will load and use the eng.user-words 
>>> and eng.user-patterns files you provided. The former is a simple word list, 
>>> one per line. The format of the latter is documented in dict/trie.h on 
>>> read_pattern_list().
>>
>>
>> See the link below for details of the patterns:
>>
>>
>> http://code.google.com/p/tesseract-ocr/source/browse/trunk/dict/trie.h?r=714
>>
>>>   // The pattern list file should contain one pattern per line in UTF-8 format.
>>>   //
>>>   // Each pattern can contain any non-whitespace characters, however only the
>>>   // patterns that contain characters from the unicharset of the corresponding
>>>   // language will be useful.
>>>   // The only meta character is '\'. To be used in a pattern as an ordinary
>>>   // string it should be escaped with '\' (e.g. string "C:\Documents" should
>>>   // be written in the patterns file as "C:\\Documents").
>>>   // This function supports a very limited regular expression syntax. One can
>>>   // express a character, a certain character class and a number of times the
>>>   // entity should be repeated in the pattern.
>>>   //
>>>   // To denote a character class use one of:
>>>   // \c - unichar for which UNICHARSET::get_isalpha() is true (character)
>>>   // \d - unichar for which UNICHARSET::get_isdigit() is true
>>>   // \n - unichar for which UNICHARSET::get_isdigit() and
>>>   //      UNICHARSET::isalpha() are true
>>>   // \p - unichar for which UNICHARSET::get_ispunct() is true
>>>   // \a - unichar for which UNICHARSET::get_islower() is true
>>>   // \A - unichar for which UNICHARSET::get_isupper() is true
>>>   //
>>>   // \* could be specified after each character or pattern to indicate that
>>>   // the character/pattern can be repeated any number of times before the next
>>>   // character/pattern occurs.
>>>   //
>>>   // Examples:
>>>   // 1-8\d\d-GOOG-411 will be expanded to strings:
>>>   // 1-800-GOOG-411, 1-801-GOOG-411, ... 1-899-GOOG-411.
>>>   //
>>>   // http://www.\n\*.com will be expanded to strings like:
>>>   // http://www.a.com http://www.a123.com ... http://www.ABCDefgHIJKLMNop.com
>>>   //
>>>   // Note: In choosing which patterns to include please be aware of the fact
>>>   // providing very generic patterns will make tesseract run slower.
>>>   // For example \n\* at the beginning of the pattern will make Tesseract
>>>   // consider all the combinations of proposed character choices for each
>>>   // of the segmentations, which will be unacceptably slow.
>>>   // Because of potential problems with speed that could be difficult to
>>>   // identify, each user pattern has to have at least kSaneNumConcreteChars
>>>   // concrete characters from the unicharset at the beginning.
>>>
>>>
>> On Fri, Aug 30, 2013 at 9:10 AM, Quan Nguyen <[email protected]> wrote:
>>
>>> Try bazaar pattern matching and see if you get better results.
>>>
>>> http://tesseract-ocr.googlecode.com/svn/trunk/doc/tesseract.1.html
>>>
>>> On Thursday, August 29, 2013 3:33:28 AM UTC-5, sam vara wrote:
>>>>
>>>> This is my first OCR project. I am trying to feed in an image that reads 
>>>> [email protected], i.e. an email field. I have a charset restriction 
>>>> defined, which is alphanumeric (A through Z, a through z, and @_). When 
>>>> Tesseract processes this image, it outputs 'G' for the @ symbol and _ for 
>>>> '.'. I get back xyzG gmail_com. What is the way to solve this? Should I 
>>>> define a more restrictive char set?
>>>>
>>>> Thanks
>>>>
>>>  -- 
>>> -- 
>>> You received this message because you are subscribed to the Google
>>> Groups "tesseract-ocr" group.
>>> To post to this group, send email to [email protected]
>>> To unsubscribe from this group, send email to
>>> [email protected]
>>> For more options, visit this group at
>>> http://groups.google.com/group/tesseract-ocr?hl=en
>>>  
>>> --- 
>>> You received this message because you are subscribed to the Google 
>>> Groups "tesseract-ocr" group.
>>> To unsubscribe from this group and stop receiving emails from it, send 
>>> an email to [email protected].
>>> For more options, visit https://groups.google.com/groups/opt_out.
>>>
>>
>>

