Re: Tesseract Training

KHEM Sochenda Sun, 23 Jan 2011 19:04:00 -0800

Dear Sriranga,

I mean that I will test whether it works well with what I have classified so
far, or whether I have to adapt something more.


I have one more question. I put some entries in the unicharambigs file;
however, it seems that Tesseract ignores what I put in the file, and the
output is just the same as with no entries in unicharambigs. Please see the
attachments: the test file and the unicharambigs file.
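
One thing that can be checked before anything else is the layout of the file
itself. Below is a minimal sketch of such a check, assuming the version-1
layout from the Tesseract 3.0x training notes: a first line `v1`, then five
tab-separated fields per line (source length, space-separated source unichars,
target length, space-separated target unichars, and a type flag of 0 or 1).
The function name `check_unicharambigs` is hypothetical and not part of
Tesseract, and a clean result does not prove that the entries will be used.

```python
def check_unicharambigs(text):
    """Return (line_number, problem) pairs for a v1-style unicharambigs text."""
    problems = []
    lines = text.splitlines()
    # Assumed layout: the first line is the version identifier "v1".
    if not lines or lines[0].strip() != "v1":
        problems.append((1, "first line is not 'v1'"))
    for number, line in enumerate(lines[1:], start=2):
        if not line.strip():
            continue
        fields = line.split("\t")
        if len(fields) != 5:
            problems.append(
                (number, "expected 5 tab-separated fields, found %d" % len(fields))
            )
            continue
        src_count, src, dst_count, dst, kind = fields
        # Each declared length must match the number of space-separated unichars.
        for label, count, chars in (("source", src_count, src),
                                    ("target", dst_count, dst)):
            if not count.isdigit() or int(count) != len(chars.split(" ")):
                problems.append((number, label + " count does not match its unichars"))
        if kind.strip() not in ("0", "1"):
            problems.append((number, "type flag must be 0 or 1"))
    if text and not text.endswith("\n"):
        problems.append((len(lines), "file does not end with a newline"))
    return problems
```

Reading the file as UTF-8 and passing its contents to the function returns an
empty list when every line matches the assumed layout; a line that uses spaces
where tabs are expected is reported with its line number.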

Of course, I thank you and Dmitry very much for your fruitful comments on
these issues.

Best Regards,

Sochenda



On Mon, Jan 24, 2011 at 9:39 AM, Sriranga(78yrsold) <[email protected]> wrote:

> Sochenda,
> I am really happy that at least it works for you now. I could not understand
> your point "improve the classification according to the error". Could you
> please explain a little? Anyway, please report back with the percentage of
> accuracy in the output text. We must thank Dmitry for his valuable
> guidance.
> Wish you Good Luck,
> -sriranga(78yrs)
>
>
> On Mon, Jan 24, 2011 at 7:52 AM, KHEM Sochenda <[email protected]>wrote:
>
>> thanks Sriranga,
>>
>> Here is my box file after editing. I am going to test the recognition and
>> improve the classification according to the error.
>>
>> Best Regards,
>> Sochenda
>>
>>
>> On Sat, Jan 22, 2011 at 2:48 PM, Sriranga(78yrsold) <
>> [email protected]> wrote:
>>
>>>
>>>
>>> ---------- Forwarded message ----------
>>> From: Sriranga(78yrsold) <[email protected]>
>>> Date: Fri, Jan 21, 2011 at 12:33 PM
>>> Subject: Re: Tesseract Training
>>> To: KHEM Sochenda <[email protected]>
>>>
>>>
>>> Chenda,
>>> It is better to type the character itself (in your language's script) than
>>> its code in the box file, because the characters you type will then appear
>>> in the unicharset file. I don't know whether your keyboard can type your
>>> language, but if it can, it is better to type the characters.
>>>
>>>
>>> On Fri, Jan 21, 2011 at 11:41 AM, Sriranga(78yrsold) <
>>> [email protected]> wrote:
>>>
>>>> Chenda,
>>>> By guesswork I have edited the box file using another tool, owler.exe
>>>> (which is for English only), attached herewith. The advantage of the
>>>> attached owler.exe is that you can type a character or its hexadecimal
>>>> code by pressing Tab.
>>>> A consonant or an independent vowel may have a *single box*, but a
>>>> consonant or independent vowel combined with a *dependent vowel* must
>>>> share a single box.
>>>> (The owler tool is not suitable for Kannada, so I am not using it.)
>>>> Output from the same tif file that was used for training should naturally
>>>> be displayed correctly. A tif other than the one used for training will
>>>> naturally have misspellings, which can be corrected by post-processing
>>>> software. The same problem occurred for Kannada too. I hope you will
>>>> succeed in generating the trained data file, since it is no more complex
>>>> than the Kannada script.
>>>> After receiving the corrected box file, I shall generate the trained data
>>>> file.
>>>>
>>>> With Best Wishes,
>>>> -sriranga(78yrs)
>>>>
>>>>
>>>>
>>>> On Fri, Jan 21, 2011 at 7:49 AM, KHEM Sochenda 
>>>> <[email protected]>wrote:
>>>>
>>>>> Dear Dmitry and Sriranga,
>>>>>
>>>>> Here are the results of my training. When I ran recognition on the same
>>>>> image that was used for training, the result was perfect. When I tried
>>>>> the attached test image, there seemed to be problems recognizing the
>>>>> characters.
>>>>>
>>>>> Please tell me your thoughts about this.
>>>>>
>>>>> Best Regards,
>>>>>
>>>>> Sochenda
>>>>>
>>>>>
>>>>> On Thu, Jan 20, 2011 at 11:47 PM, KHEM Sochenda <
>>>>> [email protected]> wrote:
>>>>>
>>>>>>
>>>>>> Dear Sriranga,
>>>>>>
>>>>>> Here is my training box file. Editing a box file is really tedious. I
>>>>>> just found some glyphs that I haven't assigned codes to yet, but it is
>>>>>> difficult to find them in the box editor you gave me, and also with
>>>>>> pytesseracttrainer.py, as it is too slow.
>>>>>>
>>>>>> Best Regards,
>>>>>>
>>>>>> Sochenda
>>>>>>
>>>>>> On Thu, Jan 20, 2011 at 4:49 PM, Sriranga(78yrsold) <
>>>>>> [email protected]> wrote:
>>>>>>
>>>>>>> **box file for editing
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>> On Thu, Jan 20, 2011 at 2:46 PM, KHEM Sochenda <
>>>>>>> [email protected]> wrote:
>>>>>>>
>>>>>>>> Dear Dmitry and Sriranga,
>>>>>>>>
>>>>>>>> But, Sriranga, I guess your computer cannot render KH language well.
>>>>>>>> I will send you an image instead ok?
>>>>>>>>
>>>>>>>> Best Regards,
>>>>>>>> Sochenda
>>>>>>>>
>>>>>>>>
>>>>>>>> On Thu, Jan 20, 2011 at 4:08 PM, Sriranga(78yrsold) <
>>>>>>>> [email protected]> wrote:
>>>>>>>>
>>>>>>>>> Attached is a zip file containing the owler exe file. Before
>>>>>>>>> unzipping, please delete the word {"OM" } first and then unzip.
>>>>>>>>> With the help of owler, you can edit the box file according to your
>>>>>>>>> requirements. After editing the box file, please forward it to me
>>>>>>>>> so that I can generate the traineddata file, or if you are able to
>>>>>>>>> generate the traineddata file yourself, you can do so - no problem.
>>>>>>>>> With best of Luck,
>>>>>>>>> -sriranga(78yrs)
>>>>>>>>> Dear dmitry,
>>>>>>>>> Sorry, I could not post in the forum due to the attached files.
>>>>>>>>> Hence I am sending a copy to you.
>>>>>>>>>
>>>>>>>>> On Thu, Jan 20, 2011 at 2:22 PM, Sriranga(78yrsold) <
>>>>>>>>> [email protected]> wrote:
>>>>>>>>>
>>>>>>>>>> Sochenda,
>>>>>>>>>> Please find attached the box file with its khtext.png file for
>>>>>>>>>> editing. I am sending khtext.tif and the owler tool to you
>>>>>>>>>> separately for your editing, since I don't know the Khmer language
>>>>>>>>>> and am unable to type it on the keyboard. After editing the box
>>>>>>>>>> file, please return it to me for further processing.
>>>>>>>>>>
>>>>>>>>>> With best of Luck,
>>>>>>>>>> -sriranga(78yrs)
>>>>>>>>>>
>>>>>>>>>> 2011/1/20 KHEM Sochenda <[email protected]>
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>> Dear Dmitry and Sriranga,
>>>>>>>>>>>
>>>>>>>>>>> I am so confused now. :(
>>>>>>>>>>>
>>>>>>>>>>> Maybe I should apply for internship with tesseract, but I am so
>>>>>>>>>>> engaged with my project here.
>>>>>>>>>>>
>>>>>>>>>>> Please find the attachment as KHtext in unicode for training
>>>>>>>>>>> sample.
>>>>>>>>>>>
>>>>>>>>>>> Best Regards,
>>>>>>>>>>>
>>>>>>>>>>> Sochenda
>>>>>>>>>>>
>>>>>>>>>>> 2011/1/19 Sriranga(78yrsold) <[email protected]>
>>>>>>>>>>>
>>>>>>>>>>>> Sochenda,
>>>>>>>>>>>> *The lines viz. 0ccb 8, 0cd5 8, 20c88 appear in the output
>>>>>>>>>>>> vowel1.txt. So we have to convert the Unicode numbers to Kannada
>>>>>>>>>>>> characters (script) with the help of a post-processor.*
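
A minimal sketch of such a post-processing step, assuming the recognised text
contains each code as a stand-alone four-digit hexadecimal token from the
Kannada block (U+0C80 to U+0CFF). The function name `codes_to_characters` is
hypothetical; it is not part of Tesseract. Tokens that are fused with other
digits, such as 20c88, are deliberately left untouched.

```python
import re


def codes_to_characters(text):
    """Replace stand-alone Kannada-block hex tokens with the characters they name."""
    def replace(match):
        # Interpret the matched token as a hexadecimal code point.
        return chr(int(match.group(0), 16))

    # Kannada occupies U+0C80..U+0CFF, so the tokens look like 0c80..0cff.
    return re.sub(r"\b0[cC][89a-fA-F][0-9a-fA-F]\b", replace, text)
```
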
>>>>>>>>>>>> -Regards,
>>>>>>>>>>>> -sriranga(78yrs)
>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>> On Wed, Jan 19, 2011 at 4:04 PM, Sriranga(78yrsold) <
>>>>>>>>>>>> [email protected]> wrote:
>>>>>>>>>>>>
>>>>>>>>>>>>> Sochenda,
>>>>>>>>>>>>> please see my inline replies below.
>>>>>>>>>>>>>
>>>>>>>>>>>>> On Wed, Jan 19, 2011 at 12:58 PM, KHEM Sochenda <
>>>>>>>>>>>>> [email protected]> wrote:
>>>>>>>>>>>>>
>>>>>>>>>>>>>> Dear Dmitry and Sriranga,
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> Thank you very much for your help. Is my output file empty
>>>>>>>>>>>>>> because I assigned my own IDs to the glyphs?
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> Dear Dmitry,
>>>>>>>>>>>>>> Please see the attached image: should the glyph in the red box
>>>>>>>>>>>>>> be assigned a single Unicode character, or separated as in the
>>>>>>>>>>>>>> image? This glyph is composed of two other glyphs: one can be
>>>>>>>>>>>>>> represented by a Unicode character, and the other is part of a
>>>>>>>>>>>>>> vowel.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> Dear Sriranga,
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> Do the first several lines in your unicharset file represent
>>>>>>>>>>>>>> characters, or are they just Unicode code points that
>>>>>>>>>>>>>> represent no character?
>>>>>>>>>>>>>> *These lines viz. 0ccb 8, 0cd5 8, 20c88, 30ce0 are Unicode
>>>>>>>>>>>>>> numbers instead of Kannada characters, to show you. Usually
>>>>>>>>>>>>>> I use characters (script) instead of Unicode numbers for
>>>>>>>>>>>>>> training purposes. I am using tesseract 3.01 alpha (r-529).*
>>>>>>>>>>>>>> Khmer font is also attached. Thanks, but I am unable to type
>>>>>>>>>>>>>> with it. However, it does appear in Character Map.
>>>>>>>>>>>>>   On receipt of your alphabet list I shall generate the data
>>>>>>>>>>>>> files and forward them to you.
>>>>>>>>>>>>>
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> Best Regards,
>>>>>>>>>>>>>> Sochenda
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> On Tue, Jan 18, 2011 at 8:27 PM, Dmitry Silaev <
>>>>>>>>>>>>>> [email protected]> wrote:
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> Dear Sochenda,
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> In addition to what Sriranga said, I'd remind you that you
>>>>>>>>>>>>>>> should do a lot of manual work:
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> In pyTesseractTrainer check that no bounding boxes intersect
>>>>>>>>>>>>>>> glyphs; if any does, correct its BB coordinates manually.
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> In cases of BB overlap you should space out the participating
>>>>>>>>>>>>>>> glyphs in the training image (see the attached picture for
>>>>>>>>>>>>>>> examples).
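
The overlap check can also be scripted. A minimal sketch, assuming each box
file line has the form `symbol left bottom right top [page]`, as in Tesseract
3.0x box files; `parse_box_lines` and `overlapping_pairs` are hypothetical
helper names. It only reports boxes that overlap one another; spotting a box
that cuts through a glyph still needs a look at the image.

```python
from itertools import combinations


def parse_box_lines(text):
    """Parse box file text into (symbol, left, bottom, right, top, page) tuples."""
    boxes = []
    for line in text.splitlines():
        parts = line.split()
        if len(parts) < 5:
            continue  # skip blank or malformed lines
        symbol = parts[0]
        left, bottom, right, top = (int(value) for value in parts[1:5])
        page = int(parts[5]) if len(parts) > 5 else 0
        boxes.append((symbol, left, bottom, right, top, page))
    return boxes


def overlapping_pairs(boxes):
    """Return the symbol pairs whose rectangles overlap on the same page."""
    pairs = []
    for a, b in combinations(boxes, 2):
        same_page = a[5] == b[5]
        overlap_x = a[1] < b[3] and b[1] < a[3]
        overlap_y = a[2] < b[4] and b[2] < a[4]
        if same_page and overlap_x and overlap_y:
            pairs.append((a[0], b[0]))
    return pairs
```

Each reported pair is a candidate for spacing out in the training image or for
merging into one box.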
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> You should use manual spacing if the participating glyphs are
>>>>>>>>>>>>>>> dependent characters (in your language, vowels) and the number
>>>>>>>>>>>>>>> of possible combinations is practically uncountable. Then you
>>>>>>>>>>>>>>> would assign every glyph its own code. Tess would consider
>>>>>>>>>>>>>>> these glyphs as separate characters, and you should
>>>>>>>>>>>>>>> post-process the resulting code sequence to obtain a
>>>>>>>>>>>>>>> well-formed dependent Unicode pair (or triplet).
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> If there can be only a few such combinations, you can merge
>>>>>>>>>>>>>>> these BBs into one to encompass all the required glyphs and
>>>>>>>>>>>>>>> assign a single code to the entire glyph combination. Then
>>>>>>>>>>>>>>> during post-processing you'll need to replace this single
>>>>>>>>>>>>>>> code with a predefined dependent Unicode pair.
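
Both post-processing variants can be sketched as below. The assumptions are
not from this thread: merged boxes were given placeholder codes from the
Private Use Area (here U+E000), the recogniser emits glyphs in left-to-right
visual order, and only the Khmer vowel signs written to the left of the
consonant (U+17C1 to U+17C3) need reordering. Subscript consonants and
multi-part vowels are ignored, and all names are hypothetical.

```python
import re

# Hypothetical placeholder codes assigned to merged boxes during training,
# mapped to the Unicode sequence each one stands for.
COMPOSITE_MAP = {
    "\ue000": "\u1780\u17b6",  # KHMER LETTER KA + KHMER VOWEL SIGN AA
}

# Khmer vowel signs drawn to the left of the consonant but stored after it.
PREPOSED_VOWELS = "\u17c1\u17c2\u17c3"  # VOWEL SIGN E, AE, AI
KHMER_CONSONANT = "[\u1780-\u17a2]"     # KA .. QA


def postprocess(text):
    """Expand placeholder codes, then put left-side vowels after their consonant."""
    for placeholder, sequence in COMPOSITE_MAP.items():
        text = text.replace(placeholder, sequence)
    pattern = "([" + PREPOSED_VOWELS + "])(" + KHMER_CONSONANT + ")"
    # Visual order is vowel then consonant; Unicode order is consonant then vowel.
    return re.sub(pattern, r"\2\1", text)
```
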
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> Hope I've managed to express myself clearly.
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> Warm regards,
>>>>>>>>>>>>>>> Dmitry Silaev
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>  --
>>>>>>>>>>>>>>> You received this message because you are subscribed to the
>>>>>>>>>>>>>>> Google Groups "tesseract-ocr" group.
>>>>>>>>>>>>>>> To post to this group, send email to
>>>>>>>>>>>>>>> [email protected].
>>>>>>>>>>>>>>> To unsubscribe from this group, send email to
>>>>>>>>>>>>>>> [email protected]<tesseract-ocr%[email protected]>
>>>>>>>>>>>>>>> .
>>>>>>>>>>>>>>> For more options, visit this group at
>>>>>>>>>>>>>>> http://groups.google.com/group/tesseract-ocr?hl=en.
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>
>>>>>>>>>>>>>
>>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>
>>>>>>>>
>>>>>>>
>>>>>>
>>>>>
>>>>
>>>
>>>
>>
>>
>
>


khm.unicharambigs
Description: Binary data

<<attachment: kexeKe1.tif>>
