Re: word review

Thilanka Kaushalya Sat, 20 Mar 2010 22:25:40 -0700

Hi Joe and Moffette,


         I'm recognising the data from a handwritten form. The scenario is
extracting the letters one by one and sending each letter to Tesseract
separately, so the recognition is done letter-wise. That means I can't use
the dictionary file for word review in that case.
****************
How do I provide my own dictionary?

Easy: Replace tessdata/eng.user-words with your own word list, in the
same format - UTF8 text, one word per line.

More difficult, but better for a large dictionary: Replace
tessdata/eng.word-dawg with one created from your own word list, using
wordlist2dawg. See the TrainingTesseract wiki page for details.

****************
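
To make the "easy" option in the FAQ answer quoted above concrete, here is a minimal Python sketch that writes a word list as UTF-8 text with one word per line. The function name write_user_words and the sample words are hypothetical, and stripping, de-duplicating and sorting the words is my own choice, not something the FAQ asks for.

```python
import os
import tempfile


def write_user_words(words, path):
    """Write `words` to `path` as UTF-8 text, one word per line."""
    # Strip whitespace, drop empty entries, remove duplicates, sort.
    cleaned = sorted({w.strip() for w in words if w.strip()})
    # newline="\n" keeps plain LF line endings on every platform.
    with open(path, "w", encoding="utf-8", newline="\n") as handle:
        for word in cleaned:
            handle.write(word + "\n")
    return cleaned


# Demo in a temporary directory so nothing in tessdata/ is overwritten.
with tempfile.TemporaryDirectory() as tmp:
    target = os.path.join(tmp, "eng.user-words")
    print(write_user_words(["Colombo", "Kandy", "Galle", "Kandy"], target))
```

The resulting file would then be copied over tessdata/eng.user-words, as the FAQ describes.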
          Does Tesseract output only the words included in the
above-mentioned dictionaries? If so, can I send the set of recognised
letters to Tesseract again as an image, so that it is reviewed against a
defined domain of predefined words?

          Or else, can you give some instructions on a method to do this?
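
One possible method, sketched here only as an illustration and outside Tesseract itself: join the letters recognised one by one into a string and snap that string to the closest entry in the predefined word list. The sketch uses Python's standard difflib module. The function name correct_word, the sample vocabulary and the 0.6 cutoff are all assumptions, and the cutoff in particular would need tuning on real form data.

```python
import difflib


def correct_word(letters, vocabulary, cutoff=0.6):
    """Join per-letter OCR results and snap them to the closest known word.

    `letters` is the sequence of single characters returned for one word;
    `vocabulary` is the list of words that may legally appear in the field.
    """
    candidate = "".join(letters).lower()
    if candidate in vocabulary:
        return candidate  # already a valid word, nothing to correct
    # get_close_matches ranks vocabulary entries by similarity ratio and
    # drops anything scoring below `cutoff`.
    matches = difflib.get_close_matches(candidate, vocabulary, n=1, cutoff=cutoff)
    # Fall back to the raw string so nothing is silently replaced.
    return matches[0] if matches else candidate


# Hypothetical vocabulary for a "city" field on the form.
cities = ["colombo", "kandy", "galle"]
# "l" misread as "1" in an otherwise correct word.
print(correct_word(["c", "o", "1", "o", "m", "b", "o"], cities))
```

A string that is not close to any vocabulary word is returned unchanged, so such cases can still be flagged for manual review.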

Thanks and regards,
Thilanka.

On Thu, Mar 11, 2010 at 9:29 PM, Thilanka Kaushalya
<[email protected]> wrote:

>
> Hi Joe and Moffette,
>
> Thanks for the tips you provided. Those are very helpful for me. These
> days I'm testing your instructions. Thanks again.
>
> Regards,
> Thilanka
>
>>
>>
>>
>>   Topic: word review <http://groups.google.com/group/tesseract-ocr/t/4e723fa1766b7167>
>>
>>    Joe K <[email protected]> Mar 08 11:02AM -0800 
>>
>>    Hey Thilanka,
>>
>>    I ran into a similar problem when I only needed it to look at
>>    hexadecimal values. What I ended up doing was creating a separate
>>    "language" that only contained the specified characters. So you could
>>    create a language of numbers and a language of letters and use
>>    tesseract to read each part of your image using the appropriate
>>    language.
>>
>>    The web address below shows you how to train tesseract for a specific
>>    language. Hope this helps.
>>
>>    http://code.google.com/p/tesseract-ocr/wiki/TrainingTesseract
>>
>>    Moffette <[email protected]> Mar 08 12:26PM -0800 
>>
>>    Hi,
>>
>>    An easier way to deal with numbers only or letters only is to use this
>>    from the FAQ (http://code.google.com/p/tesseract-ocr/wiki/FAQ):
>>
>>    
>> ----------------------------------------------------------------------------------------------------------------------------
>>    How do I recognize only digits?
>>
>>    In 2.03 and above:
>>
>>    Use
>>
>>    TessBaseAPI::SetVariable("tessedit_char_whitelist", "0123456789");
>>
>>    BEFORE calling an Init function or put this in a text file called
>>    tessdata/configs/digits:
>>
>>    tessedit_char_whitelist 0123456789
>>
>>    and then your command line becomes:
>>
>>    tesseract image.tif outputbase nobatch digits
>>
>>    Warning: Until the old and new config variables get merged, you must
>>    have the nobatch parameter too.
>>
>>    
>> ----------------------------------------------------------------------------------------------------------------------------
>>
>>    For the second part: "I'm willing to review the recognised letters
>>    with the possible words so we can improve the accuracy"
>>
>>    If you are using a 2.0X version you could use eng.user-words (a
>>    user dictionary), as suggested in the FAQ
>>    (http://code.google.com/p/tesseract-ocr/wiki/FAQ):
>>
>>
>>    
>> ----------------------------------------------------------------------------------------------------------------------------
>>    How do I provide my own dictionary?
>>
>>    Easy: Replace tessdata/eng.user-words with your own word list, in the
>>    same format - UTF8 text, one word per line.
>>
>>    More difficult, but better for a large dictionary: Replace
>>    tessdata/eng.word-dawg with one created from your own word list, using
>>    wordlist2dawg. See the TrainingTesseract wiki page for details.
>>
>>    
>> ----------------------------------------------------------------------------------------------------------------------------
>>
>> --
> http://coders-view.blogspot.com/
> http://thilankagekawuluwa.blogspot.com/
> http://twitter.com/thilanka_k
>

