Dear Sriranga,

Yes, I use the owler tool to edit the box file. េ is a dependent vowel. In the unicharambigs file, I just tried to swap the order of េ and ក wherever they occur together.
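
For reference, a minimal sketch of how such a swap entry might be written. It assumes the tab-separated unicharambigs layout described in the Tesseract 3.0x training notes (source count, space-separated source unichars, target count, space-separated target unichars, and a 1/0 flag for mandatory/optional replacement) and the "v1" header mentioned in this thread. The helper name `ambig_entry` is hypothetical, and nothing here guarantees that Tesseract will actually apply the entry.

```python
def ambig_entry(source, target, mandatory=True):
    """Build one tab-separated unicharambigs line (assumed v1 layout)."""
    fields = [
        str(len(source)),           # number of unichars in the source sequence
        " ".join(source),           # source unichars, space-separated
        str(len(target)),           # number of unichars in the replacement
        " ".join(target),           # replacement unichars, space-separated
        "1" if mandatory else "0",  # 1 = always replace, 0 = optional
    ]
    return "\t".join(fields)


KHMER_VOWEL_E = "\u17c1"  # េ, drawn to the left of its consonant
KHMER_KA = "\u1780"       # ក

# Header line as used in this thread, followed by the single swap entry.
lines = ["v1", ambig_entry([KHMER_VOWEL_E, KHMER_KA], [KHMER_KA, KHMER_VOWEL_E])]

if __name__ == "__main__":
    print("\n".join(lines))
```

Both unichars must also be present in the unicharset for an entry like this to be meaningful.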

Best Regards,
Sochenda

On Mon, Jan 24, 2011 at 12:42 PM, Sriranga(78yrsold) <[email protected]> wrote:

> Dear Sochenda,
> Thanks for updating me.
> 1) I am curious to know whether you were able to edit in the owler tool or
> edited the box file manually.
> 2) unicharambigs file: I am not able to create a unicharambigs file for
> Kannada even when following the latest instructions in the wiki.
> In this connection, I have posted the problem I faced under issue No. 433,
> which is still pending a solution. A copy of om.unicharambigs is attached
> for your information. I could not understand where I made a mistake.
> I tested េ ក: it does not merge with the consonant, so it appears that េ is
> an independent vowel and not a dependent vowel. As such, your file appears
> to be in order and I feel it should work. However, please change v1 to v12
> as per the wiki instructions and try again.
> With Best of Luck,
> -sriranga(78yrsold)
>
> On Mon, Jan 24, 2011 at 8:33 AM, KHEM Sochenda <[email protected]> wrote:
>
>> Dear Sriranga,
>>
>> I mean I will test and check whether it works well with what I classify
>> now, or whether I have to adapt something more.
>>
>> I have one more question. I put some entries in the unicharambigs file;
>> however, it seems that tess doesn't care what I have put in the file, and
>> the output is just the same as with no entries in the unicharambigs file.
>> Please see the attachment for the test file and unicharambigs.
>>
>> Of course, I thank you and Dmitry so much for his fruitful comments on
>> this issue.
>>
>> Best Regards,
>>
>> Sochenda
>>
>> On Mon, Jan 24, 2011 at 9:39 AM, Sriranga(78yrsold)
>> <[email protected]> wrote:
>>
>>> Sochenda,
>>> I am really happy that at least it works for you now. I could not
>>> understand your point "improve the classification according to the error".
>>> Will you please explain a little bit? Anyway, please give feedback with
>>> the percentage of accuracy in the output text. We must thank Dmitry for
>>> his valuable guidance.
>>> Wish you good luck,
>>> -sriranga(78yrs)
>>>
>>> On Mon, Jan 24, 2011 at 7:52 AM, KHEM Sochenda
>>> <[email protected]> wrote:
>>>
>>>> Thanks Sriranga,
>>>>
>>>> Here is my box file after editing. I am going to test the recognition
>>>> and improve the classification according to the errors.
>>>>
>>>> Best Regards,
>>>> Sochenda
>>>>
>>>> On Sat, Jan 22, 2011 at 2:48 PM, Sriranga(78yrsold)
>>>> <[email protected]> wrote:
>>>>
>>>>> ---------- Forwarded message ----------
>>>>> From: Sriranga(78yrsold) <[email protected]>
>>>>> Date: Fri, Jan 21, 2011 at 12:33 PM
>>>>> Subject: Re: Tesseract Training
>>>>> To: KHEM Sochenda <[email protected]>
>>>>>
>>>>> Chenda,
>>>>> It is better to type the characters (in your language's script) than
>>>>> codes in the box file, because your characters will then be found in
>>>>> the unicharset file. I don't know whether your keyboard is able to type
>>>>> your language; if so, it is better to type.
>>>>>
>>>>> On Fri, Jan 21, 2011 at 11:41 AM, Sriranga(78yrsold)
>>>>> <[email protected]> wrote:
>>>>>
>>>>>> Chenda,
>>>>>> By guesswork I have edited the box file using another tool,
>>>>>> owler.exe (which is for English only), attached herewith. The
>>>>>> advantage of the attached owler.exe is that you can type a character
>>>>>> or hexadecimal code by pressing Tab. A consonant or independent vowel
>>>>>> may have a *single box*, but a consonant/independent vowel +
>>>>>> *dependent vowel* must also have a single box.
>>>>>> (The said owler tool is not suitable for Kannada, and as such I am
>>>>>> not using it.)
>>>>>> If the output is produced from the same tif file used for training, it
>>>>>> should naturally be displayed correctly. A tif other than the one used
>>>>>> for training will naturally have misspellings, which can be corrected
>>>>>> by post-processor software. The same problem occurred for Kannada also.
>>>>>> I hope you will succeed in generating the trained data file, since
>>>>>> your script is no more complex than the Kannada script.
>>>>>> After receipt of the corrected box file, I shall generate the trained
>>>>>> data file.
>>>>>>
>>>>>> With Best Wishes,
>>>>>> -sriranga(78yrs)
>>>>>>
>>>>>> On Fri, Jan 21, 2011 at 7:49 AM, KHEM Sochenda
>>>>>> <[email protected]> wrote:
>>>>>>
>>>>>>> Dear Dmitry and Sriranga,
>>>>>>>
>>>>>>> Here are my training results. When I ran recognition on the same
>>>>>>> image used for training, the result was perfect. When I tried the
>>>>>>> attached test image, there seem to be problems recognizing the
>>>>>>> characters.
>>>>>>>
>>>>>>> Please tell me your thoughts about this.
>>>>>>>
>>>>>>> Best Regards,
>>>>>>>
>>>>>>> Sochenda
>>>>>>>
>>>>>>> On Thu, Jan 20, 2011 at 11:47 PM, KHEM Sochenda
>>>>>>> <[email protected]> wrote:
>>>>>>>
>>>>>>>> Dear Sriranga,
>>>>>>>>
>>>>>>>> Here is my training box file. Editing a box file is really tedious.
>>>>>>>> I just found some glyphs that I haven't assigned codes to yet, but
>>>>>>>> it is difficult to find them in the box editor you gave me, and also
>>>>>>>> with pytesseracttrainer.py, as it is too slow.
>>>>>>>>
>>>>>>>> Best Regards,
>>>>>>>>
>>>>>>>> Sochenda
>>>>>>>>
>>>>>>>> On Thu, Jan 20, 2011 at 4:49 PM, Sriranga(78yrsold)
>>>>>>>> <[email protected]> wrote:
>>>>>>>>
>>>>>>>>> box file for editing
>>>>>>>>>
>>>>>>>>> On Thu, Jan 20, 2011 at 2:46 PM, KHEM Sochenda
>>>>>>>>> <[email protected]> wrote:
>>>>>>>>>
>>>>>>>>>> Dear Dmitry and Sriranga,
>>>>>>>>>>
>>>>>>>>>> But, Sriranga, I guess your computer cannot render the KH language
>>>>>>>>>> well. I will send you an image instead, ok?
>>>>>>>>>>
>>>>>>>>>> Best Regards,
>>>>>>>>>> Sochenda
>>>>>>>>>>
>>>>>>>>>> On Thu, Jan 20, 2011 at 4:08 PM, Sriranga(78yrsold)
>>>>>>>>>> <[email protected]> wrote:
>>>>>>>>>>
>>>>>>>>>>> Attached is a zip file containing the exe file of owler. Before
>>>>>>>>>>> unzipping, please delete the word {"OM"} first and then unzip.
>>>>>>>>>>> With the help of owler, you can edit the box file according to
>>>>>>>>>>> your requirements. After duly editing the box file, please
>>>>>>>>>>> forward it to me for generating the traineddata file, or if you
>>>>>>>>>>> are able to generate the traineddata file you can do it yourself,
>>>>>>>>>>> no problem.
>>>>>>>>>>> With best of Luck,
>>>>>>>>>>> -sriranga(78yrs)
>>>>>>>>>>> Dear Dmitry,
>>>>>>>>>>> Sorry, I could not post in the forum due to the attached files.
>>>>>>>>>>> Hence I am endorsing a copy to you.
>>>>>>>>>>>
>>>>>>>>>>> On Thu, Jan 20, 2011 at 2:22 PM, Sriranga(78yrsold)
>>>>>>>>>>> <[email protected]> wrote:
>>>>>>>>>>>
>>>>>>>>>>>> Sochenda,
>>>>>>>>>>>> Please find attached the box file with its khtext.png file for
>>>>>>>>>>>> editing. I am sending khtext.tif and the owler tool to you
>>>>>>>>>>>> separately for your editing purposes, since I don't know the
>>>>>>>>>>>> Khmer language and am unable to type it on the keyboard. After
>>>>>>>>>>>> editing the box file, please return it to me for further
>>>>>>>>>>>> processing.
>>>>>>>>>>>>
>>>>>>>>>>>> With best of Luck,
>>>>>>>>>>>> -sriranga(78yrs)
>>>>>>>>>>>>
>>>>>>>>>>>> 2011/1/20 KHEM Sochenda <[email protected]>
>>>>>>>>>>>>
>>>>>>>>>>>>> Dear Dmitry and Sriranga,
>>>>>>>>>>>>>
>>>>>>>>>>>>> I am so confused now. :(
>>>>>>>>>>>>>
>>>>>>>>>>>>> Maybe I should apply for an internship with tesseract, but I am
>>>>>>>>>>>>> so engaged with my project here.
>>>>>>>>>>>>>
>>>>>>>>>>>>> Please find attached the KHtext in Unicode as a training
>>>>>>>>>>>>> sample.
>>>>>>>>>>>>>
>>>>>>>>>>>>> Best Regards,
>>>>>>>>>>>>>
>>>>>>>>>>>>> Sochenda
>>>>>>>>>>>>>
>>>>>>>>>>>>> 2011/1/19 Sriranga(78yrsold) <[email protected]>
>>>>>>>>>>>>>
>>>>>>>>>>>>>> Sochenda,
>>>>>>>>>>>>>> *The lines, viz. 0ccb 8, 0cd5 8, 20c88, appear in the output
>>>>>>>>>>>>>> vowel1.txt. So we have to convert the Unicode numbers to
>>>>>>>>>>>>>> Kannada characters (script) with the help of a post-processor.*
>>>>>>>>>>>>>> -Regards,
>>>>>>>>>>>>>> -sriranga(78yrs)
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> On Wed, Jan 19, 2011 at 4:04 PM, Sriranga(78yrsold)
>>>>>>>>>>>>>> <[email protected]> wrote:
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> Sochenda,
>>>>>>>>>>>>>>> Please see the inline replies below.
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> On Wed, Jan 19, 2011 at 12:58 PM, KHEM Sochenda
>>>>>>>>>>>>>>> <[email protected]> wrote:
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> Dear Dmitry and Sriranga,
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> Thank you very much for your help. The reason why my output
>>>>>>>>>>>>>>>> file is empty is because I assigned my personal IDs to the
>>>>>>>>>>>>>>>> glyphs, isn't it?
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> Dear Dmitry,
>>>>>>>>>>>>>>>> Please see the attached image. Should the glyph in the red
>>>>>>>>>>>>>>>> box be assigned to one Unicode character or separated as in
>>>>>>>>>>>>>>>> the image? This glyph is composed of two other glyphs: one
>>>>>>>>>>>>>>>> can be represented by a Unicode character, and the other is
>>>>>>>>>>>>>>>> a part of a vowel.
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> Dear Sriranga,
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> Do the first several lines in your unicharset file
>>>>>>>>>>>>>>>> represent characters, or are they just Unicode code points
>>>>>>>>>>>>>>>> that represent no character?
>>>>>>>>>>>>>>>> *These lines, viz. 0ccb 8, 0cd5 8, 20c88 , 30ce0, are
>>>>>>>>>>>>>>>> Unicode numbers instead of Kannada characters, to show you.
>>>>>>>>>>>>>>>> Usually I use characters (script) instead of Unicode
>>>>>>>>>>>>>>>> numbers for training purposes. I am using tesseract 3.01
>>>>>>>>>>>>>>>> alpha (r-529).*
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> Khmer font is also attached.
>>>>>>>>>>>>>>>> *Thanks, but I am unable to type with it. However, it
>>>>>>>>>>>>>>>> appears in Character Map.*
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> On receipt of your alphabet list I shall generate the
>>>>>>>>>>>>>>> data files and forward them to you.
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> Best Regards,
>>>>>>>>>>>>>>>> Sochenda
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> On Tue, Jan 18, 2011 at 8:27 PM, Dmitry Silaev
>>>>>>>>>>>>>>>> <[email protected]> wrote:
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> Dear Sochenda,
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> In addition to what Sriranga said, I'd remind you that you
>>>>>>>>>>>>>>>>> should do a lot of manual work:
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> In pyTesseractTrainer, check that no bounding boxes
>>>>>>>>>>>>>>>>> intersect glyphs; if any do, correct their BB coordinates
>>>>>>>>>>>>>>>>> manually.
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> In cases of BB overlap you should space out the
>>>>>>>>>>>>>>>>> participating glyphs in the training image (see the
>>>>>>>>>>>>>>>>> attached picture for examples).
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> You should use manual spacing if the participating glyphs
>>>>>>>>>>>>>>>>> are dependent characters (in your language, vowels) and the
>>>>>>>>>>>>>>>>> number of possible combinations is practically uncountable.
>>>>>>>>>>>>>>>>> Then you would assign every glyph its own code. Tess would
>>>>>>>>>>>>>>>>> consider these glyphs as separate characters, and you
>>>>>>>>>>>>>>>>> should post-process the resulting code sequence to obtain a
>>>>>>>>>>>>>>>>> well-formed dependent Unicode pair (or triplet).
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> If there can be only a few such combinations, you can merge
>>>>>>>>>>>>>>>>> these BBs into one to encompass all the required glyphs and
>>>>>>>>>>>>>>>>> assign a single code to the entire glyph combination. Then
>>>>>>>>>>>>>>>>> during the post-processing you'll need to replace this
>>>>>>>>>>>>>>>>> single code with a predefined dependent Unicode pair.
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> Hope I've managed to express myself clearly.
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> Warm regards,
>>>>>>>>>>>>>>>>> Dmitry Silaev
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> --
>>>>>>>>>>>>>>>>> You received this message because you are subscribed to the
>>>>>>>>>>>>>>>>> Google Groups "tesseract-ocr" group.
>>>>>>>>>>>>>>>>> To post to this group, send email to [email protected].
>>>>>>>>>>>>>>>>> To unsubscribe from this group, send email to
>>>>>>>>>>>>>>>>> [email protected].
>>>>>>>>>>>>>>>>> For more options, visit this group at
>>>>>>>>>>>>>>>>> http://groups.google.com/group/tesseract-ocr?hl=en.

