Dear Sochenda,

Thanks for the update.

1) I am curious to know whether you were able to edit the box file in the owler tool, or whether you edited it manually.

2) unicharambigs file: I am not able to create a unicharambigs file for Kannada, even when following the latest instructions in the wiki. In this connection, I have posted the problem I faced under issue No. 433, which is still pending a solution. A copy of om.unicharambigs is attached for your information. I could not understand where I made a mistake.
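For anyone hitting the same unicharambigs trouble, here is a minimal sketch of a sanity check, assuming the "v1" layout as I understand it from the training wiki: a first line containing `v1`, then one entry per line with five tab-separated fields (source count, space-separated source unichars, target count, space-separated target unichars, and a 0/1 type flag). The function name `check_unicharambigs` is made up for illustration, and this is not a diagnosis of issue 433; it only flags lines that do not match that layout, for example entries typed with spaces where tabs are expected.

```python
def check_unicharambigs(text):
    """Return (line_number, problem) pairs for a v1-style unicharambigs file.

    Assumed layout: line 1 is 'v1'; every other non-empty line has five
    TAB-separated fields: source count, source unichars (space-separated),
    target count, target unichars (space-separated), type flag (0 or 1).
    """
    problems = []
    lines = text.splitlines()
    if not lines or lines[0].strip() != "v1":
        problems.append((1, "first line should be the version marker 'v1'"))
    for number, line in enumerate(lines[1:], start=2):
        if not line.strip():
            continue  # ignore blank lines
        fields = line.split("\t")
        if len(fields) != 5:
            problems.append(
                (number, "expected 5 tab-separated fields, found %d" % len(fields)))
            continue
        src_count, src, dst_count, dst, kind = fields
        if not (src_count.strip().isdigit() and dst_count.strip().isdigit()):
            problems.append((number, "the two count fields must be integers"))
            continue
        if len(src.split()) != int(src_count):
            problems.append((number, "source count does not match source unichars"))
        if len(dst.split()) != int(dst_count):
            problems.append((number, "target count does not match target unichars"))
        if kind.strip() not in ("0", "1"):
            problems.append((number, "type flag should be 0 or 1"))
    return problems
```

To use it, read the file as UTF-8 text and pass the contents to `check_unicharambigs`; an empty list means every line matched the assumed layout, which says nothing about whether the entries themselves are sensible for the language.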
On Mon, Jan 24, 2011 at 8:33 AM, KHEM Sochenda <[email protected]> wrote:

> Dear Sriranga,
>
> I mean I will test and check whether it works well with what I classify now, or whether I have to adapt something more.
>
> I have one more question. I put some entries in the unicharambigs file; however, it seems that tess doesn't care what I have put in the file, and the output is just the same as with no entries in the unicharambigs. Please see the attachment for the test file and the unicharambigs.
>
> Of course, I thank you and Dmitry so much for his fruitful comments on this issue.
>
> Best Regards,
>
> Sochenda
>
> On Mon, Jan 24, 2011 at 9:39 AM, Sriranga(78yrsold) <[email protected]> wrote:
>
>> Sochenda,
>> I am really happy that at least it works for you now. I could not understand your point "improve the classification according to the error". Will you please explain a little bit? Anyway, please send feedback with the percentage of accuracy in the output text. We must thank Dmitry for his valuable guidance.
>> Wish you Good Luck,
>> -sriranga(78yrs)
>>
>> On Mon, Jan 24, 2011 at 7:52 AM, KHEM Sochenda <[email protected]> wrote:
>>
>>> thanks Sriranga,
>>>
>>> Here is my box file after editing. I am going to test the recognition and improve the classification according to the error.
>>>
>>> Best Regards,
>>> Sochenda
>>>
>>> On Sat, Jan 22, 2011 at 2:48 PM, Sriranga(78yrsold) <[email protected]> wrote:
>>>
>>>> ---------- Forwarded message ----------
>>>> From: Sriranga(78yrsold) <[email protected]>
>>>> Date: Fri, Jan 21, 2011 at 12:33 PM
>>>> Subject: Re: Tesseract Training
>>>> To: KHEM Sochenda <[email protected]>
>>>>
>>>> Chenda,
>>>> It is better to type the character (in your language's script) than the code in the box file, because your characters will be found in the unicharset file. I don't know whether your keyboard is able to type your language; if so, it is better to type.
>>>>
>>>> On Fri, Jan 21, 2011 at 11:41 AM, Sriranga(78yrsold) <[email protected]> wrote:
>>>>
>>>>> Chenda,
>>>>> By guesswork I have edited the box file using another tool, owler.exe (which is for English only), attached herewith. The advantage of the attached owler.exe is that you can type a character or its hexadecimal code by pressing Tab. A consonant or an independent vowel may have a *single box*, but a consonant/independent vowel + *dependent vowel* must also have a single box. (The said owler tool is not suitable for Kannada, and as such I am not using it.)
>>>>> If the output is produced from the same tif file that was used for training, it should naturally be displayed correctly. If a tif other than the one used for training is used, the output will naturally have misspellings, which can be corrected by post-processor software. The same problem occurred for Kannada also. I hope you will succeed in generating the trained data file, since there is no script more complex than Kannada.
>>>>> After receipt of the corrected box file, I shall generate the trained data file.
>>>>>
>>>>> With Best Wishes,
>>>>> -sriranga(78yrs)
>>>>>
>>>>> On Fri, Jan 21, 2011 at 7:49 AM, KHEM Sochenda <[email protected]> wrote:
>>>>>
>>>>>> Dear Dmitry and Sriranga,
>>>>>>
>>>>>> Here is the result of my training. When I tried recognition with the same image used for training as the test, the result was perfect. When I tried with the attached test image, there seem to be problems recognizing the characters.
>>>>>>
>>>>>> Please tell me your thoughts about this.
>>>>>>
>>>>>> Best Regards,
>>>>>>
>>>>>> Sochenda
>>>>>>
>>>>>> On Thu, Jan 20, 2011 at 11:47 PM, KHEM Sochenda <[email protected]> wrote:
>>>>>>
>>>>>>> Dear Sriranga,
>>>>>>>
>>>>>>> Here is my train box. It is really tedious editing the box file.
>>>>>>> I just found some glyphs that I haven't put the code in for yet, but it is difficult to find them in the box editor you gave, and also with pytesseracttrainer.py, as it is too slow.
>>>>>>>
>>>>>>> Best Regards,
>>>>>>>
>>>>>>> Sochenda
>>>>>>>
>>>>>>> On Thu, Jan 20, 2011 at 4:49 PM, Sriranga(78yrsold) <[email protected]> wrote:
>>>>>>>
>>>>>>>> box file for editing
>>>>>>>>
>>>>>>>> On Thu, Jan 20, 2011 at 2:46 PM, KHEM Sochenda <[email protected]> wrote:
>>>>>>>>
>>>>>>>>> Dear Dmitry and Sriranga,
>>>>>>>>>
>>>>>>>>> But, Sriranga, I guess your computer cannot render the KH language well. I will send you an image instead, ok?
>>>>>>>>>
>>>>>>>>> Best Regards,
>>>>>>>>> Sochenda
>>>>>>>>>
>>>>>>>>> On Thu, Jan 20, 2011 at 4:08 PM, Sriranga(78yrsold) <[email protected]> wrote:
>>>>>>>>>
>>>>>>>>>> Attached is a zip file containing the exe file of owler. Before unzipping, please delete the word {"OM"} first and then unzip.
>>>>>>>>>> With the help of owler, you can edit the box file according to your requirements.
>>>>>>>>>> After duly editing the box file, please forward it to me for generating the traineddata file, or if you are able to generate the traineddata file, you can do it yourself - no problem.
>>>>>>>>>> With best of Luck,
>>>>>>>>>> -sriranga(78yrs)
>>>>>>>>>> Dear Dmitry,
>>>>>>>>>> Sorry, I could not post in the forum due to the attached files. Hence I am endorsing a copy to you.
>>>>>>>>>>
>>>>>>>>>> On Thu, Jan 20, 2011 at 2:22 PM, Sriranga(78yrsold) <[email protected]> wrote:
>>>>>>>>>>
>>>>>>>>>>> Sochenda,
>>>>>>>>>>> Please find attached the box file with its khtext.png file for editing. I am sending khtext.tif and the owler tool separately to you for your editing purposes, since I neither know the Khmer language nor am able to type it on the keyboard.
>>>>>>>>>>> After editing the box file, please return it to me for further processing.
>>>>>>>>>>>
>>>>>>>>>>> With best of Luck,
>>>>>>>>>>> -sriranga(78yrs)
>>>>>>>>>>>
>>>>>>>>>>> 2011/1/20 KHEM Sochenda <[email protected]>
>>>>>>>>>>>
>>>>>>>>>>>> Dear Dmitry and Sriranga,
>>>>>>>>>>>>
>>>>>>>>>>>> I am so confused now. :(
>>>>>>>>>>>>
>>>>>>>>>>>> Maybe I should apply for an internship with tesseract, but I am so engaged with my project here.
>>>>>>>>>>>>
>>>>>>>>>>>> Please find the attachment, KHtext in Unicode, as a training sample.
>>>>>>>>>>>>
>>>>>>>>>>>> Best Regards,
>>>>>>>>>>>>
>>>>>>>>>>>> Sochenda
>>>>>>>>>>>>
>>>>>>>>>>>> 2011/1/19 Sriranga(78yrsold) <[email protected]>
>>>>>>>>>>>>
>>>>>>>>>>>>> Sochenda,
>>>>>>>>>>>>> *The lines viz. 0ccb 8, 0cd5 8, 20c88 appeared in the output vowel1.txt. So we have to convert the Unicode numbers to Kannada characters (script) with the help of a post-processor.*
>>>>>>>>>>>>> -Regards,
>>>>>>>>>>>>> -sriranga(78yrs)
>>>>>>>>>>>>>
>>>>>>>>>>>>> On Wed, Jan 19, 2011 at 4:04 PM, Sriranga(78yrsold) <[email protected]> wrote:
>>>>>>>>>>>>>
>>>>>>>>>>>>>> Sochenda,
>>>>>>>>>>>>>> Please see the inline reply below.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> On Wed, Jan 19, 2011 at 12:58 PM, KHEM Sochenda <[email protected]> wrote:
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> Dear Dmitry and Sriranga,
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> Thank you very much for your help. The reason why my output file is empty is that I put my personal IDs on the glyphs, isn't it?
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> Dear Dmitry,
>>>>>>>>>>>>>>> Please see the attached image: should the glyph in the red box be assigned to one Unicode character, or separated as in the image?
>>>>>>>>>>>>>>> This glyph is composed of two other glyphs: one can be represented by a Unicode character, and the other is a part of a vowel.
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> Dear Sriranga,
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> Do the first several lines in your unicharset file represent characters, or are they just Unicode numbers that represent no character? *These lines viz. 0ccb 8, 0cd5 8, 20c88, 30ce0 are Unicode numbers instead of Kannada characters, to show you.* *Usually I use characters (script) instead of Unicode numbers for training purposes. I am using tesseract 3.01 alpha (r-529).*
>>>>>>>>>>>>>>> Khmer font is also attached. Thanks, but I am unable to type. However, it appeared in CharacterMap.
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>> On receipt of your alphabet list I shall generate the data files and forward them to you.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> Best Regards,
>>>>>>>>>>>>>>> Sochenda
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> On Tue, Jan 18, 2011 at 8:27 PM, Dmitry Silaev <[email protected]> wrote:
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> Dear Sochenda,
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> In addition to what Sriranga said, I'd remind you that you should do a lot of manual work:
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> In pyTesseractTrainer, check that no bounding boxes intersect glyphs; if one does, correct its BB coordinates manually.
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> In cases of BB overlap you should space out the participating glyphs in the training image (see the attached picture for examples).
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> You should use manual spacing if the participating glyphs are dependent characters (in your language, vowels) and the number of possible combinations is practically uncountable. Then you would assign every glyph its own code. Tess would consider these glyphs as separate characters, and you should post-process the resulting code sequence to obtain a well-formed dependent Unicode pair (or triplet).
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> If there can be only a few such combinations, you can merge these BBs into one to encompass all the required glyphs and assign a single code to the entire glyph combination. Then during the post-processing you'll need to replace this single code with a predefined dependent Unicode pair.
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> Hope I've managed to express myself clearly.
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> Warm regards,
>>>>>>>>>>>>>>>> Dmitry Silaev
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> --
>>>>>>>>>>>>>>>> You received this message because you are subscribed to the Google Groups "tesseract-ocr" group.
>>>>>>>>>>>>>>>> To post to this group, send email to [email protected].
>>>>>>>>>>>>>>>> To unsubscribe from this group, send email to [email protected].
>>>>>>>>>>>>>>>> For more options, visit this group at http://groups.google.com/group/tesseract-ocr?hl=en.
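The post-processing step that Dmitry describes in the quoted thread (replacing a single code assigned to a merged box with a predefined Unicode pair) and the conversion Sriranga mentions (turning hex Unicode numbers such as 0ccb into actual characters) could be sketched roughly as below. This is only an illustration under stated assumptions: the function names, the private-use placeholder U+E000, and the table entry are all hypothetical and are not taken from anyone's training data.

```python
# Hypothetical table: each merged box was trained under one placeholder code
# (a private-use code point here); the value is the Unicode sequence that the
# placeholder stands for. The entry below is purely illustrative.
MERGED_CODES = {
    "\ue000": "\u1780\u17b6",  # KHMER LETTER KA + KHMER VOWEL SIGN AA
}


def expand_merged_codes(text, table=MERGED_CODES):
    """Replace every placeholder code in OCR output with its Unicode sequence."""
    for code, sequence in table.items():
        text = text.replace(code, sequence)
    return text


def hex_tokens_to_text(tokens):
    """Turn whitespace-separated hex code points (e.g. '0c95 0ccb') into text."""
    return "".join(chr(int(token, 16)) for token in tokens.split())
```

For example, `hex_tokens_to_text("0c95 0ccb")` yields KANNADA LETTER KA followed by KANNADA VOWEL SIGN OO. A real table would need one entry per merged glyph combination used in the box file.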

