I know how to do it in Tesseract; the image is just to show you how the glyphs should be boxed.
I can send you the box file generated by Tesseract anyway.

Regards,
Sochenda

On Mon, Jan 17, 2011 at 1:41 PM, Sriranga(78yrsold) <[email protected]> wrote:

> As per the wiki instructions, the command line has to be used to generate
> the box file, as follows:
> tesseract <lang.fontname.number.tif> <lang.fontname.number> batch.nochop makebox
>
> On Mon, Jan 17, 2011 at 11:55 AM, KHEM Sochenda <[email protected]> wrote:
>
>> In the image, I've done it manually.
>>
>> On Mon, Jan 17, 2011 at 12:16 PM, Sriranga(78yrsold) <[email protected]> wrote:
>>
>>> Which tool did you use to create the boxes? Please also upload the box
>>> file generated by you.
>>>
>>> On Mon, Jan 17, 2011 at 9:31 AM, KHEM Sochenda <[email protected]> wrote:
>>>
>>>> Dear Dmitry,
>>>>
>>>> Thank you again for a very quick response.
>>>>
>>>> I am going to train Tesseract for the Khmer language, in which many
>>>> ligatures are in the same situation as "fi" in some Latin fonts.
>>>> The attachment shows an example of a one-line Khmer sentence;
>>>> please count the boxes from left to right. You can see that some glyphs are
>>>> above others. The first glyph is formed of two Unicode characters, while
>>>> the third glyph and the fifth glyph together form one Unicode character. This
>>>> is the reason why I wish to give each glyph its own ID and then do
>>>> post-processing afterward.
>>>>
>>>> Regarding two glyphs that overlap each other, as in the case
>>>> of the 7th and 8th glyphs: how will Tesseract segment these glyphs?
>>>> How should I give the positions of the boxes?
>>>>
>>>> Thank you very much in advance for your response.
>>>>
>>>> Best Regards,
>>>>
>>>> Sochenda
>>>>
>>>> On Sun, Jan 16, 2011 at 3:48 PM, Dmitry Silaev <[email protected]> wrote:
>>>>
>>>>> Dear Sochenda,
>>>>>
>>>>> I'm not sure what the ultimate goal of your code assignment is, but a
>>>>> formal answer to your question is "Yes".
>>>>> You can assign "k001" or "k002" to
>>>>> a bounding box in a .box file. Moreover, you can assign any UTF-8 encoded
>>>>> character sequence. In Tess version 3.0x (current) the only restriction is a
>>>>> 24-byte limit on the entire character sequence length. This also allows you to
>>>>> use not only an abstract code like "k001" but also a meaningful character
>>>>> sequence from your real language (e.g. the well-known "fi" ligature in some
>>>>> Latin fonts), which then relieves you from using pre- and
>>>>> post-processing.
>>>>>
>>>>> If you still prefer using abstract codes, then pre-/post-processing can
>>>>> be done without tinkering with Tess's code. Since both training and
>>>>> recognition generate output files, you can develop a couple
>>>>> of file-processing command-line utilities, which can then be used along with
>>>>> calls to the Tesseract executable within shell scripts (or .bat files on
>>>>> Windows).
>>>>>
>>>>> For further details you should definitely study thoroughly the
>>>>> "TrainingTesseract3" and "ReadMe" (section "Installation Notes - Tesseract
>>>>> 3.00") documents (
>>>>> http://code.google.com/p/tesseract-ocr/wiki/TrainingTesseract3 and
>>>>> http://code.google.com/p/tesseract-ocr/wiki/ReadMe). These are not
>>>>> easily searchable documents, but they contain all the info you might
>>>>> need.
>>>>>
>>>>> Warm regards,
>>>>> Dmitry Silaev
>>>>>
>>>>> On Sun, Jan 16, 2011 at 10:42 AM, KHEM Sochenda <[email protected]> wrote:
>>>>>
>>>>>> Dear Dmitry,
>>>>>>
>>>>>> Thank you very much for a comprehensive explanation.
>>>>>> To go straight to the point: does it sound OK to assign a code like
>>>>>> 'k001' or 'k002' to the glyphs obtained from Tesseract segmentation?
>>>>>>
>>>>>> For post-processing that touches the Tesseract code, could you please
>>>>>> point out which files I should modify?
>>>>>> Please advise me whether the latest
>>>>>> version of Tesseract will do fine.
>>>>>>
>>>>>> Thank you very much in advance for your time and response.
>>>>>>
>>>>>> Best Regards,
>>>>>>
>>>>>> Sochenda
>>>>>>
>>>>>> On Sat, Jan 15, 2011 at 3:05 AM, Dmitry Silaev <[email protected]> wrote:
>>>>>>
>>>>>>> Chenda,
>>>>>>>
>>>>>>> In fact, Tesseract doesn't care whether you train on a real
>>>>>>> language's letter or which language that letter belongs to. Put simply,
>>>>>>> Tess only saves the mapping from feature sets obtained during training to
>>>>>>> Unicode ids. This implies that during training you can assign virtually any
>>>>>>> character code to virtually any glyph (to be exact, to a connected component
>>>>>>> or to a set of connected components).
>>>>>>>
>>>>>>> If your language's script comprises a reasonable number of joint
>>>>>>> character combinations, then while training you can assign every such
>>>>>>> combination a predefined Unicode id (some restrictions apply). Later, when
>>>>>>> running recognition, you should do some post-processing to decode your
>>>>>>> predefined ids into the real language's character sequences.
>>>>>>>
>>>>>>> For good results, all this requires you to develop a training file
>>>>>>> pre-processor (mapping: language char combinations -> provisional ids) and a
>>>>>>> recognition result post-processor (mapping: provisional ids -> language char
>>>>>>> sequences). I'm not sure, but this may also require correcting the character
>>>>>>> property bit masks in the unicharset file (I don't know exactly how this
>>>>>>> information is used by Tess, as I don't need it in my project).
>>>>>>>
>>>>>>> Warm regards,
>>>>>>> Dmitry Silaev
>>>>>>>
>>>>>>> On Fri, Jan 14, 2011 at 10:25 AM, KHEM Sochenda <[email protected]> wrote:
>>>>>>>
>>>>>>>> Dear Tesseract Team,
>>>>>>>>
>>>>>>>> In the step of training a new language, we have to assign a Unicode value to
>>>>>>>> each box.
>>>>>>>> I would like to know how to handle a shape that is composed of several
>>>>>>>> Unicode characters.
>>>>>>>> Is there any way to assign only an id to each box in Tesseract?
>>>>>>>>
>>>>>>>> Thank you very much in advance for your response.
>>>>>>>>
>>>>>>>> Best Regards,
>>>>>>>> Chenda
>>>>>>>>
>>>>>>>> --
>>>>>>>> You received this message because you are subscribed to the Google
>>>>>>>> Groups "tesseract-ocr" group.
>>>>>>>> To post to this group, send email to [email protected].
>>>>>>>> To unsubscribe from this group, send email to
>>>>>>>> [email protected].
>>>>>>>> For more options, visit this group at
>>>>>>>> http://groups.google.com/group/tesseract-ocr?hl=en.
--
You received this message because you are subscribed to the Google Groups "tesseract-ocr" group.
To post to this group, send email to [email protected].
To unsubscribe from this group, send email to [email protected].
For more options, visit this group at http://groups.google.com/group/tesseract-ocr?hl=en.
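
For reference, the pre-/post-processing that Dmitry describes in the quoted thread could be sketched roughly as follows. This is only an illustration in Python, not something taken from Tesseract itself. The codes "k001" and "k002" and the 24-byte limit come from the thread; the Khmer sequences mapped to them, all function names, and the assumption that each .box line starts with the label followed by a space are hypothetical.

```python
import re

# Hypothetical table: provisional id -> real Khmer character sequence.
# The ids appear in the thread; the sequences are made-up examples.
ID_TO_TEXT = {
    "k001": "\u1780\u17d2\u1781",  # KA + COENG + KHA (illustrative only)
    "k002": "\u179f\u17d2\u179a",  # SA + COENG + RO (illustrative only)
}
TEXT_TO_ID = {text: code for code, text in ID_TO_TEXT.items()}

# Limit quoted in the thread for Tesseract 3.0x box labels.
MAX_LABEL_BYTES = 24


def label_fits(label: str) -> bool:
    """Return True if the label is non-empty and within the UTF-8 byte limit."""
    return 0 < len(label.encode("utf-8")) <= MAX_LABEL_BYTES


def relabel_box_line(line: str) -> str:
    """Pre-processor step: swap a known character sequence for its id.

    Assumes the label is the first space-separated field of the line;
    everything after the first space is passed through unchanged.
    """
    label, sep, rest = line.partition(" ")
    return TEXT_TO_ID.get(label, label) + sep + rest


# Longest ids first, so that a longer id is never shadowed by a shorter one.
_ID_PATTERN = re.compile(
    "|".join(re.escape(code) for code in sorted(ID_TO_TEXT, key=len, reverse=True))
)


def decode_ids(text: str) -> str:
    """Post-processor step: replace every provisional id in recognized text."""
    return _ID_PATTERN.sub(lambda match: ID_TO_TEXT[match.group(0)], text)
```

Something like relabel_box_line would be applied to every line of the .box file before training, and decode_ids to the text file that recognition produces, both from a shell script around the tesseract calls. Note that plain ASCII ids such as "k001" could collide with real text in the recognized output, so a real setup would probably need ids that cannot occur otherwise.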

