Re: Tesseract Training

Sriranga(78yrsold) Sun, 23 Jan 2011 21:42:25 -0800

Dear Sochenda,
Thanks for the update.
1) I am curious to know whether you were able to edit the box file in the
owler tool, or whether you edited it manually.
2) unicharambigs file: I am not able to create a unicharambigs file for
Kannada, even following the latest instructions in the wiki.
In this connection, I have posted the problem I faced under issue No. 433,
which is still pending a solution. A copy of om.unicharambigs is attached
for your information. I could not understand where I made a mistake.
I tested េ ក: it does not merge with the consonant, so it appears that េ is
an independent vowel and not a dependent vowel. As such, your file appears
to be in order and I feel it should work. However, please change v1 to v12
as per the wiki instructions and try again.
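A quick way to rule out layout mistakes in a unicharambigs file is to check it mechanically. The following is only a minimal sketch in Python, not part of Tesseract. It assumes the layout described in the training wiki for 3.0x: a version line first, then one entry per line with five tab-separated fields (number of source unichars, the source unichars separated by spaces, number of replacement unichars, the replacement unichars, and a 0/1 type flag). The function name check_unicharambigs is made up for this example.

```python
def check_unicharambigs(text):
    """Return (line_number, problem) pairs for lines that do not fit the assumed layout."""
    problems = []
    for number, line in enumerate(text.splitlines(), start=1):
        # Assumption: the first line is a version marker such as "v1".
        if number == 1 and line.startswith("v"):
            continue
        if not line.strip():
            continue
        fields = line.split("\t")
        if len(fields) != 5:
            problems.append(
                (number, "expected 5 tab-separated fields, found %d" % len(fields))
            )
            continue
        src_count, src, dst_count, dst, kind = fields
        if not (src_count.isdigit() and dst_count.isdigit()):
            problems.append((number, "the two counts must be whole numbers"))
            continue
        # Each count must match the number of space-separated unichars beside it.
        if int(src_count) != len(src.split(" ")):
            problems.append((number, "source count does not match source unichars"))
        if int(dst_count) != len(dst.split(" ")):
            problems.append((number, "replacement count does not match replacement unichars"))
        if kind.strip() not in ("0", "1"):
            problems.append((number, "type flag must be 0 or 1"))
    return problems


# Illustrative input only; the second entry deliberately has a wrong count.
sample = "v1\n1\tm\t2\tr n\t0\n3\ti i\t1\tm\t0\n"
for number, problem in check_unicharambigs(sample):
    print("line %d: %s" % (number, problem))
```

A file that passes this check can still be ignored for other reasons, for example if its unichars are missing from the unicharset, so a clean result only rules out the layout.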
With Best of Luck,
-sriranga(78yrsold)



On Mon, Jan 24, 2011 at 8:33 AM, KHEM Sochenda <[email protected]>wrote:

> Dear Sriranga,
>
> I mean I will test and check whether it works well with what I have
> classified so far, or whether I have to adapt something more.
>
> I have one more question. I put some entries in the unicharambigs file;
> however, it seems that tess does not care what I have put in the file, and
> the output is just the same as with no entries in the unicharambigs. Please
> see the attached test file and unicharambigs.
>
> Of course, I thank you and Dmitry very much for the fruitful comments on
> these issues.
>
> Best Regards,
>
> Sochenda
>
>
>
>
> On Mon, Jan 24, 2011 at 9:39 AM, Sriranga(78yrsold) <
> [email protected]> wrote:
>
>> Sochenda,
>> I am really happy that at least it works for you now. I could not
>> understand your point "improve the classification according to the error".
>> Will you please explain a little? In any case, please report back with the
>> percentage of accuracy in the output text. We must thank Dmitry for his
>> valuable guidance.
>> Wish you Good Luck,
>> -sriranga(78yrs)
>>
>>
>> On Mon, Jan 24, 2011 at 7:52 AM, KHEM Sochenda <[email protected]>wrote:
>>
>>> Thanks Sriranga,
>>>
>>> Here is my box file after editing. I am going to test the recognition and
>>> improve the classification according to the errors.
>>>
>>> Best Regards,
>>> Sochenda
>>>
>>>
>>> On Sat, Jan 22, 2011 at 2:48 PM, Sriranga(78yrsold) <
>>> [email protected]> wrote:
>>>
>>>>
>>>>
>>>> ---------- Forwarded message ----------
>>>> From: Sriranga(78yrsold) <[email protected]>
>>>> Date: Fri, Jan 21, 2011 at 12:33 PM
>>>> Subject: Re: Tesseract Training
>>>> To: KHEM Sochenda <[email protected]>
>>>>
>>>>
>>>> Chenda,
>>>> It is better to type the character (in your language's script) than its
>>>> code in the box file, because your characters will then be found in the
>>>> unicharset file. I don't know whether your keyboard can type your
>>>> language; if it can, it is better to type.
>>>>
>>>>
>>>> On Fri, Jan 21, 2011 at 11:41 AM, Sriranga(78yrsold) <
>>>> [email protected]> wrote:
>>>>
>>>>> Chenda,
>>>>> I edited the box file by guesswork, using another tool, owler.exe
>>>>> (English only), attached herewith. The advantage of the attached
>>>>> owler.exe is that you can type a character or its hexadecimal code by
>>>>> pressing Tab. A consonant or an independent vowel may have a *single
>>>>> box*, but a consonant or independent vowel + *dependent vowel* must
>>>>> also share a single box. (The owler tool is not suitable for Kannada,
>>>>> so I do not use it.)
>>>>> Output from the same tif file that was used for training should
>>>>> naturally be displayed correctly. A tif other than the one used for
>>>>> training will naturally produce misspellings, which can be corrected by
>>>>> post-processor software. The same problem occurred for Kannada as well.
>>>>> I hope you will succeed in generating the trained data file, since no
>>>>> script is more complex than Kannada.
>>>>> On receipt of the corrected box file, I shall generate the trained
>>>>> data file.
>>>>>
>>>>> With Best Wishes,
>>>>> -sriranga(78yrs)
>>>>>
>>>>>
>>>>>
>>>>> On Fri, Jan 21, 2011 at 7:49 AM, KHEM Sochenda <[email protected]
>>>>> > wrote:
>>>>>
>>>>>> Dear Dmitry and Sriranga,
>>>>>>
>>>>>> Here are the results of my training. When I ran recognition on the
>>>>>> same image that was used for training, the result was perfect. When I
>>>>>> tried the attached test image, there seemed to be problems recognizing
>>>>>> the characters.
>>>>>>
>>>>>> Please tell me your thoughts about this.
>>>>>>
>>>>>> Best Regards,
>>>>>>
>>>>>> Sochenda
>>>>>>
>>>>>>
>>>>>> On Thu, Jan 20, 2011 at 11:47 PM, KHEM Sochenda <
>>>>>> [email protected]> wrote:
>>>>>>
>>>>>>>
>>>>>>> Dear Sriranga,
>>>>>>>
>>>>>>> Here is my training box file. Editing a box file is really tedious.
>>>>>>> I just found some glyphs that I have not assigned codes to yet, but
>>>>>>> it is difficult to find them in the box editor you gave me, and also
>>>>>>> with pytesseracttrainer.py, as it is too slow.
>>>>>>>
>>>>>>> Best Regards,
>>>>>>>
>>>>>>> Sochenda
>>>>>>>
>>>>>>> On Thu, Jan 20, 2011 at 4:49 PM, Sriranga(78yrsold) <
>>>>>>> [email protected]> wrote:
>>>>>>>
>>>>>>>> **box file for editing
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>> On Thu, Jan 20, 2011 at 2:46 PM, KHEM Sochenda <
>>>>>>>> [email protected]> wrote:
>>>>>>>>
>>>>>>>>> Dear Dmitry and Sriranga,
>>>>>>>>>
>>>>>>>>> But, Sriranga, I guess your computer cannot render KH language
>>>>>>>>> well. I will send you an image instead ok?
>>>>>>>>>
>>>>>>>>> Best Regards,
>>>>>>>>> Sochenda
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> On Thu, Jan 20, 2011 at 4:08 PM, Sriranga(78yrsold) <
>>>>>>>>> [email protected]> wrote:
>>>>>>>>>
>>>>>>>>>> Attached is a zip file containing the owler exe file. Before
>>>>>>>>>> unzipping, please first delete the word {"OM"}, then unzip.
>>>>>>>>>> With the help of owler, edit the box file according to your
>>>>>>>>>> requirements. Once the box file is edited, please forward it to
>>>>>>>>>> me so that I can generate the traineddata file; or, if you are
>>>>>>>>>> able to generate the traineddata file, you can do it yourself,
>>>>>>>>>> no problem.
>>>>>>>>>> With best of Luck,
>>>>>>>>>> -sriranga(78yrs)
>>>>>>>>>> Dear Dmitry,
>>>>>>>>>> Sorry, I could not post to the forum because of the attached
>>>>>>>>>> files. Hence I am sending a copy to you.
>>>>>>>>>>
>>>>>>>>>> On Thu, Jan 20, 2011 at 2:22 PM, Sriranga(78yrsold) <
>>>>>>>>>> [email protected]> wrote:
>>>>>>>>>>
>>>>>>>>>>> Sochenda
>>>>>>>>>>> Please find attached the box file with its khtext.png for
>>>>>>>>>>> editing. I am sending khtext.tif and the owler tool to you
>>>>>>>>>>> separately for your editing, since I do not know the Khmer
>>>>>>>>>>> language and cannot type it on the keyboard. After editing
>>>>>>>>>>> the box file, please return it to me for further processing.
>>>>>>>>>>>
>>>>>>>>>>> With best of Luck,
>>>>>>>>>>> -sriranga(78yrs)
>>>>>>>>>>>
>>>>>>>>>>> 2011/1/20 KHEM Sochenda <[email protected]>
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>>> Dear Dmitry and Sriranga,
>>>>>>>>>>>>
>>>>>>>>>>>> I am so confused now. :(
>>>>>>>>>>>>
>>>>>>>>>>>> Maybe I should apply for internship with tesseract, but I am so
>>>>>>>>>>>> engaged with my project here.
>>>>>>>>>>>>
>>>>>>>>>>>> Please find attached KHtext in Unicode as a training sample.
>>>>>>>>>>>>
>>>>>>>>>>>> Best Regards,
>>>>>>>>>>>>
>>>>>>>>>>>> Sochenda
>>>>>>>>>>>>
>>>>>>>>>>>> 2011/1/19 Sriranga(78yrsold) <[email protected]>
>>>>>>>>>>>>
>>>>>>>>>>>> Sochenda,
>>>>>>>>>>>>> *The output lines, viz. 0ccb 8, 0cd5 8, 20c88, appeared in
>>>>>>>>>>>>> vowel1.txt, so we have to convert the Unicode numbers to
>>>>>>>>>>>>> Kannada characters (script) with the help of a post-processor.*
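The conversion from Unicode numbers to characters can be sketched in a few lines. This is only an illustration in Python (the thread does not name a particular post-processor), and the helper names hex_to_char and convert_codes are hypothetical. Each hexadecimal number is read as a Unicode code point and turned into the character it names.

```python
def hex_to_char(code):
    """Turn a hexadecimal code point such as '0ccb' into the character it names."""
    return chr(int(code, 16))


def convert_codes(text):
    """Replace every whitespace-separated hex code in text with its character."""
    return "".join(hex_to_char(token) for token in text.split())


# Illustrative input: 0c95 is KANNADA LETTER KA, 0ccb is KANNADA VOWEL SIGN OO.
print(convert_codes("0c95 0ccb"))
```

The sketch assumes the input holds nothing but hex codes; real output that mixes codes with ordinary text would need a rule for telling the two apart.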
>>>>>>>>>>>>> -Regards,
>>>>>>>>>>>>> -sriranga(78yrs)
>>>>>>>>>>>>>
>>>>>>>>>>>>>
>>>>>>>>>>>>> On Wed, Jan 19, 2011 at 4:04 PM, Sriranga(78yrsold) <
>>>>>>>>>>>>> [email protected]> wrote:
>>>>>>>>>>>>>
>>>>>>>>>>>>>> Sochenda,
>>>>>>>>>>>>>> Please see my inline replies below.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> On Wed, Jan 19, 2011 at 12:58 PM, KHEM Sochenda <
>>>>>>>>>>>>>> [email protected]> wrote:
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> Dear Dmitry and Sriranga,
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> Thank you very much for your help. The reason my output
>>>>>>>>>>>>>>> file is empty is that I assigned my personal IDs to the
>>>>>>>>>>>>>>> glyphs, isn't it?
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> Dear Dmitry,
>>>>>>>>>>>>>>> Please see the attached image. Should the glyph in the red
>>>>>>>>>>>>>>> box be assigned to a single Unicode character, or separated
>>>>>>>>>>>>>>> as in the image? This glyph is composed of two other
>>>>>>>>>>>>>>> glyphs: one can be represented by a Unicode character, and
>>>>>>>>>>>>>>> the other is part of a vowel.
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> Dear Sriranga,
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> Do the first several lines in your unicharset file
>>>>>>>>>>>>>>> represent characters, or are they just arbitrary Unicode
>>>>>>>>>>>>>>> code points that represent no character?
>>>>>>>>>>>>>>> *These lines, viz. 0ccb 8, 0cd5 8, 20c88, 30ce0, are Unicode
>>>>>>>>>>>>>>> numbers rather than Kannada characters, to show you. I
>>>>>>>>>>>>>>> usually use characters (script) instead of Unicode numbers
>>>>>>>>>>>>>>> for training. I am using tesseract 3.01 alpha (r-529).*
>>>>>>>>>>>>>>> Khmer font is also attached.
>>>>>>>>>>>>>>> Thanks, but I am unable to type it. However, it appears in
>>>>>>>>>>>>>>> Character Map.
>>>>>>>>>>>>>>   On receipt of your alphabet list I shall generate the
>>>>>>>>>>>>>> data files and forward them to you.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> Best Regards,
>>>>>>>>>>>>>>> Sochenda
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> On Tue, Jan 18, 2011 at 8:27 PM, Dmitry Silaev <
>>>>>>>>>>>>>>> [email protected]> wrote:
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> Dear Sochenda,
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> In addition to what Sriranga said, I'd remind you that you
>>>>>>>>>>>>>>>> have to do a lot of manual work:
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> In pyTesseractTrainer, check that no bounding boxes
>>>>>>>>>>>>>>>> intersect glyphs; if one does, correct its BB coordinates
>>>>>>>>>>>>>>>> manually.
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> In cases of BB overlap, you should space out the
>>>>>>>>>>>>>>>> participating glyphs in the training image (see the
>>>>>>>>>>>>>>>> attached picture for examples).
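Overlapping boxes can also be listed from the box file itself before opening an editor. What follows is a rough sketch in Python, not a Tesseract tool. It assumes each box line reads "label left bottom right top page" with integer coordinates and an optional page number, and the function names are invented for this example. It only compares boxes with each other; checking whether a box cuts through a glyph would need the image as well.

```python
from itertools import combinations


def parse_box_line(line):
    """Split one box line into (label, left, bottom, right, top, page)."""
    parts = line.split()
    label = parts[0]
    left, bottom, right, top = (int(p) for p in parts[1:5])
    # Assumption: the page number is optional and defaults to 0.
    page = int(parts[5]) if len(parts) > 5 else 0
    return label, left, bottom, right, top, page


def overlapping_boxes(box_text):
    """Return pairs of labels whose rectangles overlap on the same page."""
    boxes = [parse_box_line(l) for l in box_text.splitlines() if l.strip()]
    hits = []
    for a, b in combinations(boxes, 2):
        same_page = a[5] == b[5]
        overlap_x = a[1] < b[3] and b[1] < a[3]
        overlap_y = a[2] < b[4] and b[2] < a[4]
        if same_page and overlap_x and overlap_y:
            hits.append((a[0], b[0]))
    return hits


# Illustrative data: boxes "a" and "b" share some area, "c" stands apart.
sample = "a 10 10 30 40 0\nb 25 12 50 38 0\nc 60 10 80 40 0\n"
print(overlapping_boxes(sample))
```

Comparing every pair is slow for very large box files, but it keeps the sketch short and is adequate for a single training page.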
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> You should use manual spacing if the participating glyphs
>>>>>>>>>>>>>>>> are dependent characters (in your language, vowels) and
>>>>>>>>>>>>>>>> the number of possible combinations is practically
>>>>>>>>>>>>>>>> uncountable. Then you would assign every glyph its own
>>>>>>>>>>>>>>>> code. Tess would consider these glyphs as separate
>>>>>>>>>>>>>>>> characters, and you should post-process the resulting code
>>>>>>>>>>>>>>>> sequence to obtain a well-formed dependent Unicode pair
>>>>>>>>>>>>>>>> (or triplet).
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> If there can be only a few such combinations, you can
>>>>>>>>>>>>>>>> merge these BBs into one to encompass all the required
>>>>>>>>>>>>>>>> glyphs and assign a single code to the entire glyph
>>>>>>>>>>>>>>>> combination. Then during the post-processing you'll need
>>>>>>>>>>>>>>>> to replace this single code with a predefined dependent
>>>>>>>>>>>>>>>> Unicode pair.
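The replacement step for merged boxes might look like the sketch below. It is an illustration in Python under stated assumptions, not something Tesseract provides: the table MERGED_GLYPHS, the use of private-use code points as the single codes, and the function name expand_merged_glyphs are all hypothetical, and the two Khmer pairs are examples only.

```python
# Hypothetical table: each placeholder code stands for one merged box.
# Private-use code points are an assumption of this sketch.
MERGED_GLYPHS = {
    "\ue000": "\u1780\u17c1",  # KHMER LETTER KA + KHMER VOWEL SIGN E
    "\ue001": "\u1780\u17b6",  # KHMER LETTER KA + KHMER VOWEL SIGN AA
}


def expand_merged_glyphs(text, table=MERGED_GLYPHS):
    """Replace each placeholder code in text with its predefined Unicode sequence."""
    # Characters that are not in the table pass through unchanged.
    return "".join(table.get(ch, ch) for ch in text)


# Illustrative recognizer output: placeholder, plain consonant, placeholder.
recognized = "\ue000\u1780\ue001"
print(expand_merged_glyphs(recognized))
```

Keeping the pairs in one table means a new merged combination only needs a new entry, with no change to the replacement code.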
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> Hope I've managed to express myself clearly.
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> Warm regards,
>>>>>>>>>>>>>>>> Dmitry Silaev
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>  --
>>>>>>>>>>>>>>>> You received this message because you are subscribed to the
>>>>>>>>>>>>>>>> Google Groups "tesseract-ocr" group.
>>>>>>>>>>>>>>>> To post to this group, send email to
>>>>>>>>>>>>>>>> [email protected].
>>>>>>>>>>>>>>>> To unsubscribe from this group, send email to
>>>>>>>>>>>>>>>> [email protected]<tesseract-ocr%[email protected]>
>>>>>>>>>>>>>>>> .
>>>>>>>>>>>>>>>> For more options, visit this group at
>>>>>>>>>>>>>>>> http://groups.google.com/group/tesseract-ocr?hl=en.
>>>>>>>>>>>>>>>>

om.unicharambigs
Description: Binary data
