Dear Sriranga,

I mean that I will test and check whether it works well with what I classify now, or whether I have to adapt something more.
I have one more question. I put some entries in the unicharambigs file; however, it seems that Tesseract ignores what I have put in the file, and the output is just the same as with no entries in unicharambigs. Please see the attachments: the test file and the unicharambigs file.

Of course, I thank you and Dmitry so much for your fruitful comments on this issue.

Best Regards,
Sochenda

On Mon, Jan 24, 2011 at 9:39 AM, Sriranga(78yrsold) <[email protected]> wrote:

> Sochenda,
> I am really happy that at least it works for you now. I could not understand
> your point "improve the classification according to the error". Will you
> please explain a little bit? Anyway, please give feedback with the percentage
> of accuracy in the output text. We must thank Dmitry for his valuable
> guidance.
> Wish you Good Luck,
> -sriranga(78yrs)
>
>
> On Mon, Jan 24, 2011 at 7:52 AM, KHEM Sochenda <[email protected]> wrote:
>
>> thanks Sriranga,
>>
>> Here is my box file after editing. I am going to test the recognition and
>> improve the classification according to the error.
>>
>> Best Regards,
>> Sochenda
>>
>>
>> On Sat, Jan 22, 2011 at 2:48 PM, Sriranga(78yrsold) <[email protected]> wrote:
>>
>>>
>>>
>>> ---------- Forwarded message ----------
>>> From: Sriranga(78yrsold) <[email protected]>
>>> Date: Fri, Jan 21, 2011 at 12:33 PM
>>> Subject: Re: Tesseract Training
>>> To: KHEM Sochenda <[email protected]>
>>>
>>>
>>> Chenda,
>>> It is better to type the character (your lang script) than the code in the
>>> box file, because your characters will be found in the unicharset file. I
>>> don't know whether your keyboard is able to type your lang; if so, it is
>>> better to type.
>>>
>>>
>>> On Fri, Jan 21, 2011 at 11:41 AM, Sriranga(78yrsold) <[email protected]> wrote:
>>>
>>>> Chenda,
>>>> By guess method I have edited the box file using another tool, owler.exe
>>>> (which is for English only), attached herewith. The advantage of the attached
>>>> owler.exe is that you can type a character/hexadecimal code by pressing tab.
>>>> A consonant or an independent vowel may have a *single box*, but a
>>>> consonant/independent vowel + *dependent vowel* must have a single box.
>>>> (The said owler box is not suitable for Kannada and as such I am not using it.)
>>>> Output from the same tif file (the one used for training) should naturally
>>>> be displayed correctly. A tif other than the one used for training will
>>>> naturally have misspellings, which can be corrected by post-processor
>>>> software. The same problem occurred for Kannada also. I hope you will
>>>> succeed in generating the trained data file, since it is no more complex
>>>> than Kannada script.
>>>> After receipt of the corrected box file, I shall generate the trained data
>>>> file.
>>>>
>>>> With Best Wishes,
>>>> -sriranga(78yrs)
>>>>
>>>>
>>>>
>>>> On Fri, Jan 21, 2011 at 7:49 AM, KHEM Sochenda <[email protected]> wrote:
>>>>
>>>>> Dear Dmitry and Sriranga,
>>>>>
>>>>> Here are my training results. When I ran recognition on the same image
>>>>> used for training, the result was perfect. When I tried the attached test
>>>>> image, there seemed to be problems recognizing the characters.
>>>>>
>>>>> Please tell me your thoughts about this.
>>>>>
>>>>> Best Regards,
>>>>>
>>>>> Sochenda
>>>>>
>>>>>
>>>>> On Thu, Jan 20, 2011 at 11:47 PM, KHEM Sochenda <[email protected]> wrote:
>>>>>
>>>>>>
>>>>>> Dear Sriranga,
>>>>>>
>>>>>> Here is my train box. Editing the box file is really tedious. I just
>>>>>> found some glyphs I haven't put the code for yet, but it is difficult to
>>>>>> find them either in the box editor you gave or with
>>>>>> pytesseracttrainer.py, as it is too slow.
>>>>>>
>>>>>> Best Regards,
>>>>>>
>>>>>> Sochenda
>>>>>>
>>>>>> On Thu, Jan 20, 2011 at 4:49 PM, Sriranga(78yrsold) <[email protected]> wrote:
>>>>>>
>>>>>>> **box file for editing
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>> On Thu, Jan 20, 2011 at 2:46 PM, KHEM Sochenda <[email protected]> wrote:
>>>>>>>
>>>>>>>> Dear Dmitry and Sriranga,
>>>>>>>>
>>>>>>>> But, Sriranga, I guess your computer cannot render the KH language well.
>>>>>>>> I will send you an image instead, ok?
>>>>>>>>
>>>>>>>> Best Regards,
>>>>>>>> Sochenda
>>>>>>>>
>>>>>>>>
>>>>>>>> On Thu, Jan 20, 2011 at 4:08 PM, Sriranga(78yrsold) <[email protected]> wrote:
>>>>>>>>
>>>>>>>>> Attached is a zip file containing the exe file of owler. Before
>>>>>>>>> unzipping, please delete the word {"OM" } first and then unzip.
>>>>>>>>> With the help of owler, edit the box file according to your requirement.
>>>>>>>>> After duly editing the box file, please forward it to me
>>>>>>>>> for generating the traineddata file, or if you are able to
>>>>>>>>> generate the traineddata file you can do it yourself - no problem.
>>>>>>>>> With best of Luck,
>>>>>>>>> -sriranga(78yrs)
>>>>>>>>> Dear dmitry,
>>>>>>>>> Sorry, I could not post in the forum due to the attached files. Hence I
>>>>>>>>> am endorsing a copy to you.
>>>>>>>>>
>>>>>>>>> On Thu, Jan 20, 2011 at 2:22 PM, Sriranga(78yrsold) <[email protected]> wrote:
>>>>>>>>>
>>>>>>>>>> Sochenda
>>>>>>>>>> Please find attached the box file with its khtext.png file for editing.
>>>>>>>>>> I am sending khtext.tif and the owler tool separately to you for
>>>>>>>>>> your editing purpose, since I don't know the Khmer lang and am unable
>>>>>>>>>> to type it on the keyboard. After editing the box file, return it to
>>>>>>>>>> me for further processing.
>>>>>>>>>>
>>>>>>>>>> With best of Luck,
>>>>>>>>>> -sriranga(78yrs)
>>>>>>>>>>
>>>>>>>>>> 2011/1/20 KHEM Sochenda <[email protected]>
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>> Dear Dmitry and Sriranga,
>>>>>>>>>>>
>>>>>>>>>>> I am so confused now. :(
>>>>>>>>>>>
>>>>>>>>>>> Maybe I should apply for an internship with tesseract, but I am so
>>>>>>>>>>> engaged with my project here.
>>>>>>>>>>>
>>>>>>>>>>> Please find attached the KHtext in unicode as a training sample.
>>>>>>>>>>>
>>>>>>>>>>> Best Regards,
>>>>>>>>>>>
>>>>>>>>>>> Sochenda
>>>>>>>>>>>
>>>>>>>>>>> 2011/1/19 Sriranga(78yrsold) <[email protected]>
>>>>>>>>>>>
>>>>>>>>>>>> Sochenda,
>>>>>>>>>>>> output of *lines viz. 0ccb 8, 0cd5 8, 20c88 appeared in
>>>>>>>>>>>> vowel1.txt. So we have to convert unicode numbers to Kannada
>>>>>>>>>>>> characters (script) with the help of a post-processor.*
>>>>>>>>>>>> -Regards,
>>>>>>>>>>>> -sriranga(78yrs)
>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>> On Wed, Jan 19, 2011 at 4:04 PM, Sriranga(78yrsold) <[email protected]> wrote:
>>>>>>>>>>>>
>>>>>>>>>>>>> Sochenda,
>>>>>>>>>>>>> please see inline reply below.
>>>>>>>>>>>>>
>>>>>>>>>>>>> On Wed, Jan 19, 2011 at 12:58 PM, KHEM Sochenda <[email protected]> wrote:
>>>>>>>>>>>>>
>>>>>>>>>>>>>> Dear Dmitry and Sriranga,
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> Thank you very much for your help. The reason why my output
>>>>>>>>>>>>>> file is empty is that I put my own IDs on the glyphs, isn't it?
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> Dear Dmitry,
>>>>>>>>>>>>>> Please see the attached image. Shall the image in the red box be
>>>>>>>>>>>>>> assigned to one Unicode character, or separated as in the image?
>>>>>>>>>>>>>> This glyph is composed of two other glyphs: one can be
>>>>>>>>>>>>>> represented by a Unicode character, and the other is a part of
>>>>>>>>>>>>>> a vowel.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> Dear Sriranga,
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> Do the first several lines in your unicharset file represent
>>>>>>>>>>>>>> characters, or are they just unicode values that represent no
>>>>>>>>>>>>>> character?
>>>>>>>>>>>>>> *These lines viz. 0ccb 8, 0cd5 8, 20c88 , 30ce0 are unicode
>>>>>>>>>>>>>> numbers instead of characters* *of Kannada* *to show you*.
>>>>>>>>>>>>>> *Usually I am using characters (script) instead of unicode
>>>>>>>>>>>>>> numbers for training purposes. I am using tesseract 3.01
>>>>>>>>>>>>>> alpha (r-529).*
>>>>>>>>>>>>>> Khmer font is also attached. *Thanks, but unable to type.
>>>>>>>>>>>>>> However it appeared in CharacterMap.*
>>>>>>>>>>>>>>
>>>>>>>>>>>>> On receipt of your alphabet list I shall generate the data files
>>>>>>>>>>>>> and forward them to you.
>>>>>>>>>>>>>
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> Best Regards,
>>>>>>>>>>>>>> Sochenda
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> On Tue, Jan 18, 2011 at 8:27 PM, Dmitry Silaev <[email protected]> wrote:
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> Dear Sochenda,
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> In addition to what Sriranga said, I'd remind you that you
>>>>>>>>>>>>>>> should do a lot of manual work:
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> In pyTesseractTrainer check that no bounding boxes intersect
>>>>>>>>>>>>>>> glyphs; if one does, correct its BB coordinates manually.
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> In cases of BB overlap you should space out the participating
>>>>>>>>>>>>>>> glyphs in the training image (see the attached picture for
>>>>>>>>>>>>>>> examples).
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> You should use manual spacing if the participating glyphs are
>>>>>>>>>>>>>>> dependent characters (in your language, vowels) and the number
>>>>>>>>>>>>>>> of possible combinations is practically uncountable. Then you
>>>>>>>>>>>>>>> would assign every glyph its own code.
>>>>>>>>>>>>>>> Tess would consider these glyphs as separate characters, and
>>>>>>>>>>>>>>> you should post-process the resulting code sequence to obtain
>>>>>>>>>>>>>>> a well-formed dependent Unicode pair (or triplet).
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> If there can be only a few such combinations, you can merge
>>>>>>>>>>>>>>> these BBs into one to encompass all the required glyphs and
>>>>>>>>>>>>>>> assign a single code to the entire glyph combination. Then
>>>>>>>>>>>>>>> during the post-processing you'll need to replace this single
>>>>>>>>>>>>>>> code with a predefined dependent Unicode pair.
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> Hope I've managed to express myself clearly.
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> Warm regards,
>>>>>>>>>>>>>>> Dmitry Silaev
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> --
>>>>>>>>>>>>>>> You received this message because you are subscribed to the
>>>>>>>>>>>>>>> Google Groups "tesseract-ocr" group.
>>>>>>>>>>>>>>> To post to this group, send email to [email protected].
>>>>>>>>>>>>>>> To unsubscribe from this group, send email to [email protected].
>>>>>>>>>>>>>>> For more options, visit this group at
>>>>>>>>>>>>>>> http://groups.google.com/group/tesseract-ocr?hl=en.
>>>>>>>>>>>>>>>
khm.unicharambigs
Description: Binary data
<<attachment: kexeKe1.tif>>
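
The post-processing step Dmitry describes in the quoted message (replacing the single code assigned to a merged glyph box with a predefined Unicode pair) and the hex-code conversion Sriranga mentions could be sketched in Python roughly as follows. This is only an illustration: the placeholder code point U+E000, the table `MERGED_GLYPHS` and both function names are assumptions and come neither from the thread nor from Tesseract itself.

```python
# Hypothetical mapping: one placeholder code per merged glyph box -> the
# Unicode sequence it stands for. U+E000 (private use area) is an assumed
# placeholder chosen for this sketch, not something Tesseract defines.
MERGED_GLYPHS = {
    "\uE000": "\u1780\u17B6",  # KHMER LETTER KA + KHMER VOWEL SIGN AA
}


def expand_merged_glyphs(text, table=MERGED_GLYPHS):
    """Replace each placeholder character with its predefined sequence."""
    return "".join(table.get(ch, ch) for ch in text)


def hex_code_to_char(code):
    """Turn a hex code such as '0c88' into the character it names."""
    return chr(int(code, 16))
```

Running `expand_merged_glyphs` over the recognized text leaves ordinary characters untouched and expands only the placeholders listed in the table, so the table can grow one merged box at a time. `hex_code_to_char("0ccb")` gives the character at U+0CCB, which is the kind of conversion needed when the box file holds codes rather than typed script.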

