Re: Tesseract Training

Sriranga(78yrsold) Sun, 23 Jan 2011 18:39:33 -0800

Sochenda,
I am really happy that it at least works for you now. I could not understand
your point "improve the classification according to the error". Will you
please explain a little? In any case, please report back the percentage
accuracy of the output text. We must thank Dmitry for his valuable guidance.
Wish you Good Luck,
-sriranga(78yrs)
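
For the accuracy percentage requested above, one common measure is character accuracy: one minus the edit distance between the OCR output and the ground-truth text, divided by the ground-truth length. The following is only a minimal sketch under that assumption; the function names are illustrative and are not part of Tesseract. It counts Unicode code points, so a consonant plus a dependent vowel counts as two characters.

```python
def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance: insertions, deletions and substitutions cost 1."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # delete ca
                            curr[j - 1] + 1,      # insert cb
                            prev[j - 1] + cost))  # substitute or match
        prev = curr
    return prev[-1]


def character_accuracy(ground_truth: str, ocr_output: str) -> float:
    """Accuracy in percent relative to the ground-truth length, floored at 0."""
    if not ground_truth:
        return 100.0 if not ocr_output else 0.0
    errors = edit_distance(ground_truth, ocr_output)
    return max(0.0, 100.0 * (1 - errors / len(ground_truth)))
```

To use it, read the hand-typed ground-truth text and the Tesseract output text into two strings and pass them to character_accuracy.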


On Mon, Jan 24, 2011 at 7:52 AM, KHEM Sochenda <[email protected]> wrote:

> thanks Sriranga,
>
> Here is my box file after editing. I am going to test the recognition and
> improve the classification according to the error.
>
> Best Regards,
> Sochenda
>
>
> On Sat, Jan 22, 2011 at 2:48 PM, Sriranga(78yrsold) <
> [email protected]> wrote:
>
>>
>>
>> ---------- Forwarded message ----------
>> From: Sriranga(78yrsold) <[email protected]>
>> Date: Fri, Jan 21, 2011 at 12:33 PM
>> Subject: Re: Tesseract Training
>> To: KHEM Sochenda <[email protected]>
>>
>>
>> Chenda,
>> It is better to type the characters themselves (in your language's script)
>> in the box file rather than codes, because your characters will then be
>> found in the unicharset file. I don't know whether your keyboard can type
>> your language, but if it can, it is better to type them.
>>
>>
>> On Fri, Jan 21, 2011 at 11:41 AM, Sriranga(78yrsold) <
>> [email protected]> wrote:
>>
>>> Chenda,
>>> By guesswork I have edited the box file using another tool, owler.exe
>>> (which is for English only), attached herewith. The advantage of the
>>> attached owler.exe is that you can type a character or its hexadecimal
>>> code by pressing Tab.
>>> A consonant or an independent vowel may have a *single box*, but a
>>> consonant/independent vowel + *dependent vowel* must also have a single box.
>>> (The said owler tool is not suitable for Kannada, and as such I am not
>>> using it.)
>>> Output from the same tif file that was used for training should naturally
>>> be displayed correctly. Output from a tif other than the one used for
>>> training will naturally have misspellings, which can be corrected by
>>> post-processor software. The same problem occurred for Kannada also. I hope
>>> you will succeed in generating the trained data file, since your script is
>>> no more complex than Kannada script.
>>> After receipt of the corrected box file, I shall generate the trained data
>>> file.
>>>
>>> With Best Wishes,
>>> -sriranga(78yrs)
>>>
>>>
>>>
>>> On Fri, Jan 21, 2011 at 7:49 AM, KHEM Sochenda <[email protected]> wrote:
>>>
>>>> Dear Dmitry and Sriranga,
>>>>
>>>> Here are my training results. When I tested recognition on the same image
>>>> that was used for training, the result was perfect. When I tried the
>>>> attached test image, there seem to be problems recognizing the characters.
>>>>
>>>> Please tell me your thoughts about this.
>>>>
>>>> Best Regards,
>>>>
>>>> Sochenda
>>>>
>>>>
>>>> On Thu, Jan 20, 2011 at 11:47 PM, KHEM Sochenda <[email protected]
>>>> > wrote:
>>>>
>>>>>
>>>>> Dear Sriranga,
>>>>>
>>>>> Here is my training box file. Editing a box file is really tedious. I just
>>>>> found some glyphs that I haven't assigned codes to yet, but it is
>>>>> difficult to find them in the box editor you gave me, and also with
>>>>> pytesseracttrainer.py, as it is too slow.
>>>>>
>>>>> Best Regards,
>>>>>
>>>>> Sochenda
>>>>>
>>>>> On Thu, Jan 20, 2011 at 4:49 PM, Sriranga(78yrsold) <
>>>>> [email protected]> wrote:
>>>>>
>>>>>> box file for editing
>>>>>>
>>>>>>
>>>>>>
>>>>>> On Thu, Jan 20, 2011 at 2:46 PM, KHEM Sochenda <
>>>>>> [email protected]> wrote:
>>>>>>
>>>>>>> Dear Dmitry and Sriranga,
>>>>>>>
>>>>>>> But, Sriranga, I guess your computer cannot render KH language well.
>>>>>>> I will send you an image instead ok?
>>>>>>>
>>>>>>> Best Regards,
>>>>>>> Sochenda
>>>>>>>
>>>>>>>
>>>>>>> On Thu, Jan 20, 2011 at 4:08 PM, Sriranga(78yrsold) <
>>>>>>> [email protected]> wrote:
>>>>>>>
>>>>>>>> Attached is a zip file containing the owler exe file. Before unzipping,
>>>>>>>> please delete the word {"OM"} first, and then unzip.
>>>>>>>> With the help of owler, edit the box file according to your requirements.
>>>>>>>> After editing the box file, please forward it to me
>>>>>>>> for generating the traineddata file; or, if you are able to
>>>>>>>> generate the traineddata file, you can do it yourself - no problem.
>>>>>>>> With best of luck,
>>>>>>>> -sriranga(78yrs)
>>>>>>>> Dear Dmitry,
>>>>>>>> Sorry, I could not post in the forum because of the attached files.
>>>>>>>> Hence I am forwarding a copy to you.
>>>>>>>>
>>>>>>>> On Thu, Jan 20, 2011 at 2:22 PM, Sriranga(78yrsold) <
>>>>>>>> [email protected]> wrote:
>>>>>>>>
>>>>>>>>> Sochenda
>>>>>>>>> Please find attached the box file with its khtext.png file for editing.
>>>>>>>>> I am sending khtext.tif and the owler tool to you separately for your
>>>>>>>>> editing, since I do not know the Khmer language and am unable to type
>>>>>>>>> it on the keyboard. After editing the box file, please return it to me
>>>>>>>>> for further processing.
>>>>>>>>>
>>>>>>>>> With best of Luck,
>>>>>>>>> -sriranga(78yrs)
>>>>>>>>>
>>>>>>>>> 2011/1/20 KHEM Sochenda <[email protected]>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>> Dear Dmitry and Sriranga,
>>>>>>>>>>
>>>>>>>>>> I am so confused now. :(
>>>>>>>>>>
>>>>>>>>>> Maybe I should apply for internship with tesseract, but I am so
>>>>>>>>>> engaged with my project here.
>>>>>>>>>>
>>>>>>>>>> Please find the attachment as KHtext in unicode for training
>>>>>>>>>> sample.
>>>>>>>>>>
>>>>>>>>>> Best Regards,
>>>>>>>>>>
>>>>>>>>>> Sochenda
>>>>>>>>>>
>>>>>>>>>> 2011/1/19 Sriranga(78yrsold) <[email protected]>
>>>>>>>>>>
>>>>>>>>>> Sochenda,
>>>>>>>>>>> *The output lines, viz. 0ccb 8, 0cd5 8,  20c88, appear in
>>>>>>>>>>> vowel1.txt. So we have to convert the Unicode numbers to Kannada
>>>>>>>>>>> characters (script) with the help of a post-processor.*
>>>>>>>>>>> -Regards,
>>>>>>>>>>> -sriranga(78yrs)
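
The conversion from Unicode numbers to characters mentioned above can be sketched as follows. This is only an illustration under an assumption: the OCR output is taken to be whitespace-separated tokens, and every token that is exactly four hexadecimal digits is treated as a code point. The function name decode_hex_codes is hypothetical and not part of Tesseract. Any other token is kept unchanged.

```python
import re

# Assumption: a code is written as exactly four hexadecimal digits, e.g. "0ccb".
FOUR_HEX_DIGITS = re.compile(r"[0-9a-fA-F]{4}")


def decode_hex_codes(text: str) -> str:
    """Turn tokens such as '0c95 0ccb' into the characters they encode."""
    pieces = []
    for token in text.split():
        if FOUR_HEX_DIGITS.fullmatch(token):
            pieces.append(chr(int(token, 16)))  # hex code point -> character
        else:
            pieces.append(token)                # not a code, keep as-is
    # Join without spaces so a dependent vowel attaches to its consonant.
    return "".join(pieces)
```

For example, decode_hex_codes("0c95 0ccb") gives KANNADA LETTER KA followed by KANNADA VOWEL SIGN OO. Note that an ordinary four-letter word made only of the letters a-f, or a four-digit number, would also be decoded, so real output may need a stricter rule.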
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>> On Wed, Jan 19, 2011 at 4:04 PM, Sriranga(78yrsold) <
>>>>>>>>>>> [email protected]> wrote:
>>>>>>>>>>>
>>>>>>>>>>>> Sochenda,
>>>>>>>>>>>> Please see the inline reply below.
>>>>>>>>>>>>
>>>>>>>>>>>> On Wed, Jan 19, 2011 at 12:58 PM, KHEM Sochenda <
>>>>>>>>>>>> [email protected]> wrote:
>>>>>>>>>>>>
>>>>>>>>>>>>> Dear Dmitry and Sriranga,
>>>>>>>>>>>>>
>>>>>>>>>>>>> Thank you very much for your help. The reason my output file
>>>>>>>>>>>>> is empty is that I assigned my own personal IDs to the glyphs,
>>>>>>>>>>>>> isn't it?
>>>>>>>>>>>>>
>>>>>>>>>>>>> Dear Dmitry,
>>>>>>>>>>>>> Please see the attached image. Should the glyph in the red box be
>>>>>>>>>>>>> assigned to a single Unicode character, or separated as in the
>>>>>>>>>>>>> image? This glyph is composed of two other glyphs: one can be
>>>>>>>>>>>>> represented by a Unicode character, and the other is a part of a
>>>>>>>>>>>>> vowel.
>>>>>>>>>>>>>
>>>>>>>>>>>>> Dear Sriranga,
>>>>>>>>>>>>>
>>>>>>>>>>>>> Do the first several lines in your unicharset file represent
>>>>>>>>>>>>> characters, or are they just Unicode codes that represent no
>>>>>>>>>>>>> character?
>>>>>>>>>>>>> *These lines, viz. 0ccb 8, 0cd5 8,  20c88 , 30ce0, are Unicode
>>>>>>>>>>>>> numbers instead of Kannada characters, to show you. Usually
>>>>>>>>>>>>> I use characters (script) instead of Unicode numbers for
>>>>>>>>>>>>> training purposes. I am using tesseract 3.01 alpha (r-529).*
>>>>>>>>>>>>> Khmer font is also attached. Thanks, but I am unable to type it.
>>>>>>>>>>>>> However, it appeared in CharacterMap.
>>>>>>>>>>>>>
>>>>>>>>>>>>   On receipt of your alphabet list I shall generate the data files
>>>>>>>>>>>> and forward them to you.
>>>>>>>>>>>>
>>>>>>>>>>>>>
>>>>>>>>>>>>> Best Regards,
>>>>>>>>>>>>> Sochenda
>>>>>>>>>>>>>
>>>>>>>>>>>>>
>>>>>>>>>>>>>
>>>>>>>>>>>>> On Tue, Jan 18, 2011 at 8:27 PM, Dmitry Silaev <
>>>>>>>>>>>>> [email protected]> wrote:
>>>>>>>>>>>>>
>>>>>>>>>>>>>> Dear Sochenda,
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> In addition to what Sriranga said, I'd remind you that you should
>>>>>>>>>>>>>> do a lot of manual work:
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> In pyTesseractTrainer, check that no bounding boxes intersect
>>>>>>>>>>>>>> glyphs; if one does, correct its BB coordinates manually.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> In cases of BB overlap you should space out the participating
>>>>>>>>>>>>>> glyphs in the training image (see the attached picture for
>>>>>>>>>>>>>> examples).
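
The BB overlap case above can be located automatically before any manual correction. The sketch below only compares boxes with each other; checking whether a box cuts through a glyph would need the image as well. It assumes the usual Tesseract 3.x box-file line layout of symbol, left, bottom, right, top, and an optional page number, with the origin at the bottom-left corner. The names Box, parse_box_lines and overlapping_pairs are illustrative.

```python
from typing import Iterable, List, NamedTuple, Tuple


class Box(NamedTuple):
    symbol: str
    left: int
    bottom: int
    right: int
    top: int


def parse_box_lines(lines: Iterable[str]) -> List[Box]:
    """Parse lines of the assumed form '<symbol> <left> <bottom> <right> <top> [page]'."""
    boxes = []
    for line in lines:
        parts = line.split()
        if len(parts) < 5:
            continue  # skip blank or malformed lines
        left, bottom, right, top = (int(p) for p in parts[1:5])
        boxes.append(Box(parts[0], left, bottom, right, top))
    return boxes


def overlapping_pairs(boxes: List[Box]) -> List[Tuple[Box, Box]]:
    """Return every pair of boxes whose rectangles share some area."""
    pairs = []
    for i, a in enumerate(boxes):
        for b in boxes[i + 1:]:
            if (a.left < b.right and b.left < a.right
                    and a.bottom < b.top and b.bottom < a.top):
                pairs.append((a, b))
    return pairs
```

Each reported pair is a candidate for spacing out the glyphs in the training image or for merging into one box.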
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> You should use manual spacing if the participating glyphs are
>>>>>>>>>>>>>> dependent characters (in your language, vowels) and the number of
>>>>>>>>>>>>>> possible combinations is practically uncountable. Then you would
>>>>>>>>>>>>>> assign every glyph its own code. Tess would consider these glyphs
>>>>>>>>>>>>>> as separate characters, and you should post-process the resulting
>>>>>>>>>>>>>> code sequence to obtain a well-formed dependent Unicode pair (or
>>>>>>>>>>>>>> triplet).
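
One possible form of the post-processing described above is reordering. The sketch assumes that the recognizer emits glyphs in visual left-to-right order, so a Khmer vowel sign drawn to the left of its consonant arrives before it, whereas Unicode stores it after the consonant. Only the three vowel signs U+17C1 to U+17C3 are handled here; split vowels and subscript consonants would need more rules. The names are hypothetical.

```python
# Khmer vowel signs E, AE, AI are drawn to the left of the consonant.
PRE_BASE_VOWELS = {"\u17c1", "\u17c2", "\u17c3"}


def is_khmer_consonant(ch: str) -> bool:
    """True for KHMER LETTER KA (U+1780) through KHMER LETTER QA (U+17A2)."""
    return "\u1780" <= ch <= "\u17a2"


def visual_to_logical(text: str) -> str:
    """Move each left-side vowel sign behind the consonant that follows it."""
    chars = list(text)
    i = 0
    while i < len(chars) - 1:
        if chars[i] in PRE_BASE_VOWELS and is_khmer_consonant(chars[i + 1]):
            chars[i], chars[i + 1] = chars[i + 1], chars[i]
            i += 2  # skip past the pair that was just fixed
        else:
            i += 1
    return "".join(chars)
```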
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> If there can be only a few such combinations, you can merge
>>>>>>>>>>>>>> these BBs into one to encompass all the required glyphs and assign
>>>>>>>>>>>>>> a single code to the entire glyph combination. Then during
>>>>>>>>>>>>>> post-processing you'll need to replace this single code with a
>>>>>>>>>>>>>> predefined dependent Unicode pair.
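
The single-code replacement described above amounts to a lookup table. In this sketch the placeholder codes are taken from the Unicode Private Use Area (U+E000 onward), and the two table entries are made up for illustration; the real table would list whichever merged glyph combinations were actually used in training.

```python
# Hypothetical table: one placeholder code per merged glyph combination.
COMBINATION_TABLE = {
    "\ue000": "\u1780\u17b6",  # placeholder -> KHMER LETTER KA + VOWEL SIGN AA
    "\ue001": "\u1780\u17b7",  # placeholder -> KHMER LETTER KA + VOWEL SIGN I
}


def expand_placeholders(text: str, table=COMBINATION_TABLE) -> str:
    """Replace every placeholder character with its predefined Unicode pair."""
    return text.translate(str.maketrans(table))
```

Characters that are not in the table pass through unchanged, so the function can be run over the whole OCR output.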
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> Hope I've managed to express myself clearly.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> Warm regards,
>>>>>>>>>>>>>> Dmitry Silaev
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>  --
>>>>>>>>>>>>>> You received this message because you are subscribed to the
>>>>>>>>>>>>>> Google Groups "tesseract-ocr" group.
>>>>>>>>>>>>>> To post to this group, send email to
>>>>>>>>>>>>>> [email protected].
>>>>>>>>>>>>>> To unsubscribe from this group, send email to
>>>>>>>>>>>>>> [email protected]<tesseract-ocr%[email protected]>
>>>>>>>>>>>>>> .
>>>>>>>>>>>>>> For more options, visit this group at
>>>>>>>>>>>>>> http://groups.google.com/group/tesseract-ocr?hl=en.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>
>>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>
>>>>>>>>
>>>>>>>
>>>>>>
>>>>>
>>>>
>>>
>>
>>
>
>

