Hi Joe and Moffette,
I'm recognising data from a handwritten form. My approach is to extract
the letters one by one and send each letter to Tesseract separately, so
the recognition is done letter by letter. That means I can't use the
dictionary file to review whole words in this case.
****************
*How do I provide my own dictionary?*
*Easy: Replace tessdata/eng.user-words with your own word list, in the
same format - UTF8 text, one word per line.
More difficult, but better for a large dictionary: Replace
tessdata/eng.word-dawg with one created from your own word list, using
wordlist2dawg. See the TrainingTesseract wiki page for details.*
***********************
Does Tesseract output only words that are included in the dictionary
files mentioned above? If so, can I send the set of recognised letters
back to Tesseract as an image, so that it reviews them against the
defined domain of predefined words?
Or else, can you give me some instructions on a method for doing this?
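To make the question concrete, this is roughly the kind of review step I
have in mind, done outside Tesseract after the letters are recognised. It
is only a sketch in Python: the word list, the function name review_word
and the 0.6 similarity cutoff are all my own placeholders, and it uses the
standard difflib module rather than anything from Tesseract.

```python
import difflib

# Hypothetical domain of allowed words. In practice this could be the same
# list that goes into tessdata/eng.user-words (one word per line).
DOMAIN_WORDS = ["invoice", "receipt", "payment", "balance"]


def review_word(letters, domain=DOMAIN_WORDS, cutoff=0.6):
    """Join per-letter OCR results and snap them to the closest domain word.

    Returns the most similar word from `domain`, or the raw joined string
    if no word reaches the similarity `cutoff` (an assumed threshold).
    """
    raw = "".join(letters).lower()
    matches = difflib.get_close_matches(raw, domain, n=1, cutoff=cutoff)
    return matches[0] if matches else raw


# One misread letter ("m" instead of "n") is corrected to a domain word.
print(review_word(["p", "a", "y", "m", "e", "m", "t"]))  # payment
# Nothing in the domain is close, so the raw string is returned unchanged.
print(review_word(list("xqzzk")))  # xqzzk
```

Something like this would keep the letter-by-letter recognition as it is
and only add a word-level check afterwards, but I don't know whether
Tesseract can already do the same thing internally.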
Thanks and regards,
Thilanka.
On Thu, Mar 11, 2010 at 9:29 PM, Thilanka Kaushalya
<[email protected]> wrote:
>
> Hi Joe and Moffette,
>
> Thanks for the tips you provided. They are very helpful for
> me. I'm currently testing your instructions. Thanks again.
>
> Regards,
> Thilanka
>
>>
>>
>>
>> Topic: word review
>> <http://groups.google.com/group/tesseract-ocr/t/4e723fa1766b7167>
>>
>> Joe K <[email protected]> Mar 08 11:02AM -0800
>>
>> Hey Thilanka,
>>
>> I ran into a similar problem when I only needed it to look at
>> hexadecimal values. What I ended up doing was creating a separate
>> "language" that only contained the specified characters. So you could
>> create a language of numbers and a language of letters, and use
>> tesseract to read each part of your image using the appropriate
>> language.
>>
>> The web address below shows you how to train tesseract for a specific
>> language. Hope this helps.
>>
>> http://code.google.com/p/tesseract-ocr/wiki/TrainingTesseract
>>
>> Moffette <[email protected]> Mar 08 12:26PM -0800
>>
>> Hi,
>>
>> An easier way to deal with numbers only or letters only is to use this
>> from the FAQ (http://code.google.com/p/tesseract-ocr/wiki/FAQ):
>>
>>
>> ----------------------------------------------------------------------------------------------------------------------------
>> How do I recognize only digits?
>>
>> In 2.03 and above:
>>
>> Use
>>
>> TessBaseAPI::SetVariable("tessedit_char_whitelist", "0123456789");
>>
>> BEFORE calling an Init function or put this in a text file called
>> tessdata/configs/digits:
>>
>> tessedit_char_whitelist 0123456789
>>
>> and then your command line becomes:
>>
>> tesseract image.tif outputbase nobatch digits
>>
>> Warning: Until the old and new config variables get merged, you must
>> have the nobatch parameter too.
>>
>>
>> ----------------------------------------------------------------------------------------------------------------------------
>>
>> For the second part: "I'm willing to review the recognised letters with
>> the possible words so we can improve the accuracy"
>>
>> If you are using a 2.0X version, you could use eng.user-words (a
>> user dictionary), as suggested in the FAQ
>> (http://code.google.com/p/tesseract-ocr/wiki/FAQ):
>>
>>
>>
>> ----------------------------------------------------------------------------------------------------------------------------
>> How do I provide my own dictionary?
>>
>> Easy: Replace tessdata/eng.user-words with your own word list, in the
>> same format - UTF8 text, one word per line.
>>
>> More difficult, but better for a large dictionary: Replace
>> tessdata/eng.word-dawg with one created from your own word list, using
>> wordlist2dawg. See the TrainingTesseract wiki page for details.
>>
>>
>> ----------------------------------------------------------------------------------------------------------------------------
>>
>> --
> http://coders-view.blogspot.com/
> http://thilankagekawuluwa.blogspot.com/
> http://twitter.com/thilanka_k
>