Re: Tesseract Training

KHEM Sochenda Sun, 23 Jan 2011 21:57:24 -0800

Dear Sriranga,

Yes, I use the owler tool to edit the box file.
េ is a dependent vowel. In the unicharambigs file, I am just trying to swap
the order whenever េ and ក occur together.
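
The same reordering could also be done as a post-processing step on the recognized text rather than in unicharambigs. Below is a minimal Python sketch, assuming the OCR output lists a left-drawn dependent vowel (េ U+17C1, ែ U+17C2, ៃ U+17C3) before its consonant, while Unicode text stores the consonant first. The function name is hypothetical, and subscript consonant clusters and split vowels are deliberately not handled.

```python
import re

# Dependent vowels that are drawn to the LEFT of the consonant (assumed subset).
PREBASE_VOWELS = "\u17c1\u17c2\u17c3"
# Khmer consonants occupy U+1780 (ក) through U+17A2 (អ).
CONSONANT_RANGE = "\u1780-\u17a2"

# Matches a left-drawn vowel immediately followed by a consonant (visual order).
_VISUAL_ORDER = re.compile(f"([{PREBASE_VOWELS}])([{CONSONANT_RANGE}])")


def reorder_prebase_vowels(text):
    """Swap 'vowel + consonant' (visual order) into 'consonant + vowel' (Unicode order)."""
    return _VISUAL_ORDER.sub(r"\2\1", text)


if __name__ == "__main__":
    visual = "\u17c1\u1780"  # េ followed by ក, as read left to right
    print(reorder_prebase_vowels(visual) == "\u1780\u17c1")
```

Text that is already in consonant-first order is left unchanged, so the function can be applied to the whole output file.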


Best Regards,
Sochenda

On Mon, Jan 24, 2011 at 12:42 PM, Sriranga(78yrsold) <
[email protected]> wrote:

> Dear Sochenda,
> Thanks for the update.
> 1) I am curious to know whether you were able to edit the box file in the
> owler tool, or whether you edited it manually.
> 2) unicharambigs file: I am not able to create a unicharambigs file for
> Kannada, even when following the latest instructions in the wiki.
> In this connection, I have posted the problem I faced under issue No. 433,
> which is still awaiting a solution. A copy of om.unicharambigs is attached
> for your information. I could not understand where I made a mistake.
> I tested េ ក: it does not merge with the consonant, so it appears that េ is
> an independent vowel and not a dependent vowel. As such, your file appears
> to be in order and I feel it should work. However, please change v1 to v12
> as per the wiki instructions and try again.
> With Best of Luck,
> -sriranga(78yrsold)
>
>
>
> On Mon, Jan 24, 2011 at 8:33 AM, KHEM Sochenda <[email protected]> wrote:
>
>> Dear Sriranga,
>>
>> I mean I will test and check whether it works well with what I have
>> classified so far, or whether I have to adapt something more.
>>
>> I have one more question. I put some entries in the unicharambigs file;
>> however, it seems that tess ignores what I have put in the file, and the
>> output is just the same as with no entries in unicharambigs. Please see the
>> attached test file and unicharambigs file.
>>
>> Of course, I thank you, and Dmitry for his fruitful comments on these
>> issues.
>>
>> Best Regards,
>>
>> Sochenda
>>
>>
>>
>>
>> On Mon, Jan 24, 2011 at 9:39 AM, Sriranga(78yrsold) <
>> [email protected]> wrote:
>>
>>> Sochenda,
>>> I am really happy that at least it works for you now. I could not
>>> understand your point "improve the classification according to the error".
>>> Will you please explain a little? Anyway, please report back with the
>>> percentage accuracy of the output text. We must thank Dmitry for his
>>> valuable guidance.
>>> Wish you Good Luck,
>>> -sriranga(78yrs)
>>>
>>>
>>> On Mon, Jan 24, 2011 at 7:52 AM, KHEM Sochenda <[email protected]> wrote:
>>>
>>>> Thanks Sriranga,
>>>>
>>>> Here is my box file after editing. I am going to test the recognition
>>>> and improve the classification according to the errors.
>>>>
>>>> Best Regards,
>>>> Sochenda
>>>>
>>>>
>>>> On Sat, Jan 22, 2011 at 2:48 PM, Sriranga(78yrsold) <
>>>> [email protected]> wrote:
>>>>
>>>>>
>>>>>
>>>>> ---------- Forwarded message ----------
>>>>> From: Sriranga(78yrsold) <[email protected]>
>>>>> Date: Fri, Jan 21, 2011 at 12:33 PM
>>>>> Subject: Re: Tesseract Training
>>>>> To: KHEM Sochenda <[email protected]>
>>>>>
>>>>>
>>>>> Chenda,
>>>>> It is better to type the characters (in your language's script) in the
>>>>> box file rather than their codes, because your characters will then
>>>>> appear in the unicharset file. I don't know whether your keyboard is able
>>>>> to type your language, but if so, it is better to type them.
>>>>>
>>>>>
>>>>> On Fri, Jan 21, 2011 at 11:41 AM, Sriranga(78yrsold) <
>>>>> [email protected]> wrote:
>>>>>
>>>>>> Chenda,
>>>>>> By guesswork I have edited the box file using another tool, owler.exe
>>>>>> (which is for English only), attached herewith. The advantage of the
>>>>>> attached owler.exe is that you can type a character or its hexadecimal
>>>>>> code by pressing Tab. A consonant or an independent vowel may each have
>>>>>> a *single box*, but a consonant or independent vowel plus a *dependent
>>>>>> vowel* must together have a single box. (The said owler tool is not
>>>>>> suitable for Kannada and as such I am not using it.)
>>>>>> Output from the same tif file that was used for training should
>>>>>> naturally be displayed correctly. A tif other than the one used for
>>>>>> training will naturally produce misspellings, which can be corrected by
>>>>>> post-processing software. The same problem occurred for Kannada also. I
>>>>>> hope you will succeed in generating the trained data file, since there
>>>>>> is no script more complex than Kannada.
>>>>>> After receipt of the corrected box file, I shall generate the trained
>>>>>> data file.
>>>>>>
>>>>>> With Best Wishes,
>>>>>> -sriranga(78yrs)
>>>>>>
>>>>>>
>>>>>>
>>>>>> On Fri, Jan 21, 2011 at 7:49 AM, KHEM Sochenda <
>>>>>> [email protected]> wrote:
>>>>>>
>>>>>>> Dear Dmitry and Sriranga,
>>>>>>>
>>>>>>> Here are my training results. When I ran recognition on the same image
>>>>>>> that was used for training, the result was perfect. When I tried the
>>>>>>> attached test image, there seem to be problems recognizing the
>>>>>>> characters.
>>>>>>>
>>>>>>> Please tell me your thoughts about this.
>>>>>>>
>>>>>>> Best Regards,
>>>>>>>
>>>>>>> Sochenda
>>>>>>>
>>>>>>>
>>>>>>> On Thu, Jan 20, 2011 at 11:47 PM, KHEM Sochenda <
>>>>>>> [email protected]> wrote:
>>>>>>>
>>>>>>>>
>>>>>>>> Dear Sriranga,
>>>>>>>>
>>>>>>>> Here is my training box file. Editing the box file is really tedious.
>>>>>>>> I just found some glyphs that I have not put codes for yet, but it is
>>>>>>>> difficult to find them in the box editor you gave me, and also in
>>>>>>>> pytesseracttrainer.py, as it is too slow.
>>>>>>>>
>>>>>>>> Best Regards,
>>>>>>>>
>>>>>>>> Sochenda
>>>>>>>>
>>>>>>>> On Thu, Jan 20, 2011 at 4:49 PM, Sriranga(78yrsold) <
>>>>>>>> [email protected]> wrote:
>>>>>>>>
>>>>>>>>> **box file for editing
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> On Thu, Jan 20, 2011 at 2:46 PM, KHEM Sochenda <
>>>>>>>>> [email protected]> wrote:
>>>>>>>>>
>>>>>>>>>> Dear Dmitry and Sriranga,
>>>>>>>>>>
>>>>>>>>>> But, Sriranga, I guess your computer cannot render KH language
>>>>>>>>>> well. I will send you an image instead ok?
>>>>>>>>>>
>>>>>>>>>> Best Regards,
>>>>>>>>>> Sochenda
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> On Thu, Jan 20, 2011 at 4:08 PM, Sriranga(78yrsold) <
>>>>>>>>>> [email protected]> wrote:
>>>>>>>>>>
>>>>>>>>>>> Attached is a zip file containing the owler exe file. Before
>>>>>>>>>>> unzipping, please delete the word {"OM"} first and then unzip.
>>>>>>>>>>> With the help of owler, you can edit the box file according to
>>>>>>>>>>> your requirements. After editing the box file, please forward it
>>>>>>>>>>> to me for generating the traineddata file; or, if you are able to
>>>>>>>>>>> generate the traineddata file, you can do it yourself - no problem.
>>>>>>>>>>> With best of Luck,
>>>>>>>>>>> -sriranga(78yrs)
>>>>>>>>>>> Dear Dmitry,
>>>>>>>>>>> Sorry, I could not post in the forum due to the attached files.
>>>>>>>>>>> Hence I am sending a copy to you.
>>>>>>>>>>>
>>>>>>>>>>> On Thu, Jan 20, 2011 at 2:22 PM, Sriranga(78yrsold) <
>>>>>>>>>>> [email protected]> wrote:
>>>>>>>>>>>
>>>>>>>>>>>> Sochenda,
>>>>>>>>>>>> Please find attached the box file with its khtext.png file for
>>>>>>>>>>>> editing. I am sending khtext.tif and the owler tool to you
>>>>>>>>>>>> separately for your editing purposes, since I don't know the
>>>>>>>>>>>> Khmer language and am unable to type it on the keyboard. After
>>>>>>>>>>>> editing the box file, please return it to me for further
>>>>>>>>>>>> processing.
>>>>>>>>>>>>
>>>>>>>>>>>> With best of Luck,
>>>>>>>>>>>> -sriranga(78yrs)
>>>>>>>>>>>>
>>>>>>>>>>>> 2011/1/20 KHEM Sochenda <[email protected]>
>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>>> Dear Dmitry and Sriranga,
>>>>>>>>>>>>>
>>>>>>>>>>>>> I am so confused now. :(
>>>>>>>>>>>>>
>>>>>>>>>>>>> Maybe I should apply for internship with tesseract, but I am so
>>>>>>>>>>>>> engaged with my project here.
>>>>>>>>>>>>>
>>>>>>>>>>>>> Please find the attachment as KHtext in unicode for training
>>>>>>>>>>>>> sample.
>>>>>>>>>>>>>
>>>>>>>>>>>>> Best Regards,
>>>>>>>>>>>>>
>>>>>>>>>>>>> Sochenda
>>>>>>>>>>>>>
>>>>>>>>>>>>> 2011/1/19 Sriranga(78yrsold) <[email protected]>
>>>>>>>>>>>>>
>>>>>>>>>>>>> Sochenda,
>>>>>>>>>>>>>> The output of *lines viz. 0ccb 8, 0cd5 8, 20c88 appeared in
>>>>>>>>>>>>>> vowel1.txt. So we have to convert the Unicode numbers to Kannada
>>>>>>>>>>>>>> characters (script) with the help of a post-processor.*
>>>>>>>>>>>>>> -Regards,
>>>>>>>>>>>>>> -sriranga(78yrs)
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> On Wed, Jan 19, 2011 at 4:04 PM, Sriranga(78yrsold) <
>>>>>>>>>>>>>> [email protected]> wrote:
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> Sochenda,
>>>>>>>>>>>>>>> Please see my inline replies below.
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> On Wed, Jan 19, 2011 at 12:58 PM, KHEM Sochenda <
>>>>>>>>>>>>>>> [email protected]> wrote:
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> Dear Dmitry and Sriranga,
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> Thank you very much for your help. The reason why my output
>>>>>>>>>>>>>>>> file is empty is that I assigned my own IDs to the glyphs,
>>>>>>>>>>>>>>>> isn't it?
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> Dear Dmitry,
>>>>>>>>>>>>>>>> Please see the attached image. Should the glyph in the red box
>>>>>>>>>>>>>>>> be assigned to one Unicode character, or separated as in the
>>>>>>>>>>>>>>>> image? This glyph is composed of two other glyphs: one can be
>>>>>>>>>>>>>>>> represented by a Unicode character, and the other is a part of
>>>>>>>>>>>>>>>> a vowel.
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> Dear Sriranga,
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> Do the first several lines in your unicharset file represent
>>>>>>>>>>>>>>>> characters, or are they just Unicode code points that
>>>>>>>>>>>>>>>> represent no character? *These lines, viz. 0ccb 8, 0cd5 8,
>>>>>>>>>>>>>>>> 20c88, 30ce0, are Unicode numbers instead of Kannada
>>>>>>>>>>>>>>>> characters, to show you. Usually I use characters (script)
>>>>>>>>>>>>>>>> instead of Unicode numbers for training purposes. I am using
>>>>>>>>>>>>>>>> tesseract 3.01 alpha (r-529).*
>>>>>>>>>>>>>>>> Khmer font is also attached. Thanks, but I am unable to type
>>>>>>>>>>>>>>>> it. However, it appears in CharacterMap.
>>>>>>>>>>>>>>> On receipt of your alphabet list I shall generate the data
>>>>>>>>>>>>>>> files and forward them to you.
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> Best Regards,
>>>>>>>>>>>>>>>> Sochenda
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> On Tue, Jan 18, 2011 at 8:27 PM, Dmitry Silaev <
>>>>>>>>>>>>>>>> [email protected]> wrote:
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> Dear Sochenda,
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> In addition to what Sriranga said I'd remind that you
>>>>>>>>>>>>>>>>> should do a lot of manual work:
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> In pyTesseractTrainer check that no bounding boxes intersect
>>>>>>>>>>>>>>>>> glyphs; if some do, correct their BB coordinates manually.
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> In cases of BB overlap you should space out the participating
>>>>>>>>>>>>>>>>> glyphs in the training image (see the attached picture for
>>>>>>>>>>>>>>>>> examples).
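
As an aid to the manual check described above, overlapping bounding boxes can be listed automatically before opening the editor. The Python sketch below assumes the usual box-file line layout `symbol left bottom right top [page]` with integer pixel coordinates; the names `Box`, `parse_box_line` and `find_overlaps` are hypothetical. It only detects boxes that overlap each other. Whether a box cuts through a glyph still has to be checked against the image.

```python
from itertools import combinations
from typing import NamedTuple


class Box(NamedTuple):
    symbol: str
    left: int
    bottom: int
    right: int
    top: int


def parse_box_line(line):
    """Parse one box-file line: 'symbol left bottom right top [page]'."""
    parts = line.split()
    left, bottom, right, top = (int(value) for value in parts[1:5])
    return Box(parts[0], left, bottom, right, top)


def overlaps(a, b):
    """True if the two rectangles share interior area (touching edges do not count)."""
    return (a.left < b.right and b.left < a.right
            and a.bottom < b.top and b.bottom < a.top)


def find_overlaps(lines):
    """Return every pair of boxes that overlap, in file order."""
    boxes = [parse_box_line(line) for line in lines if line.strip()]
    return [(a, b) for a, b in combinations(boxes, 2) if overlaps(a, b)]


if __name__ == "__main__":
    sample = ["a 10 10 20 20 0", "b 18 12 30 22 0", "c 40 10 50 20 0"]
    for first, second in find_overlaps(sample):
        print(first.symbol, "overlaps", second.symbol)
```

The pairwise comparison is quadratic in the number of boxes, which should be acceptable for a single training page.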
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> You should use manual spacing if participating glyphs are
>>>>>>>>>>>>>>>>> dependent characters (in your language - vowels) and the
>>>>>>>>>>>>>>>>> number of possible combinations is practically uncountable.
>>>>>>>>>>>>>>>>> Then you would assign every glyph its own code. Tess would
>>>>>>>>>>>>>>>>> consider these glyphs as separate characters and you should
>>>>>>>>>>>>>>>>> post-process the resulting code sequence to obtain a
>>>>>>>>>>>>>>>>> well-formed dependent Unicode pair (or triplet).
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> If there can be only few such combinations - you can merge
>>>>>>>>>>>>>>>>> these BBs into one to encompass all the required glyphs and
>>>>>>>>>>>>>>>>> assign a single code to the entire glyph combination. Then
>>>>>>>>>>>>>>>>> during the post-processing you'll need to replace this single
>>>>>>>>>>>>>>>>> code with a predefined dependent Unicode pair.
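
The replacement step for merged combinations can be sketched as a simple lookup table. In the Python sketch below, the private-use code point U+E000 and the table contents are purely hypothetical placeholders for whatever single codes were assigned to merged boxes during training, and `expand_combinations` is an assumed name.

```python
# Hypothetical table: single code used in training -> predefined Unicode sequence.
COMBINATION_TABLE = {
    "\ue000": "\u1780\u17c1",  # placeholder code standing for ក followed by េ
}


def expand_combinations(text, table=COMBINATION_TABLE):
    """Replace each single placeholder code with its predefined Unicode sequence."""
    return "".join(table.get(char, char) for char in text)


if __name__ == "__main__":
    recognized = "\ue000\u1780"
    print(expand_combinations(recognized) == "\u1780\u17c1\u1780")
```

Characters that are not in the table pass through unchanged, so the same function can be run over the whole recognized text.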
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> Hope I've managed to express myself clearly.
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> Warm regards,
>>>>>>>>>>>>>>>>> Dmitry Silaev
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>  --
>>>>>>>>>>>>>>>>> You received this message because you are subscribed to the
>>>>>>>>>>>>>>>>> Google Groups "tesseract-ocr" group.
>>>>>>>>>>>>>>>>> To post to this group, send email to
>>>>>>>>>>>>>>>>> [email protected].
>>>>>>>>>>>>>>>>> To unsubscribe from this group, send email to
>>>>>>>>>>>>>>>>> [email protected]<tesseract-ocr%[email protected]>
>>>>>>>>>>>>>>>>> .
>>>>>>>>>>>>>>>>> For more options, visit this group at
>>>>>>>>>>>>>>>>> http://groups.google.com/group/tesseract-ocr?hl=en.
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>