Dear Dmitry,

Thank you very much for the comprehensive explanation. To get straight to the point: would it be acceptable to assign a code such as 'k001' or 'k002' to each glyph obtained from Tesseract's segmentation?
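
To make the idea concrete, here is a minimal Python sketch of the mapping I have in mind. The function names (build_tables, decode) and the placeholder combinations "ab" and "cd" are only my assumptions for illustration. None of this is Tesseract code, and I do not yet know whether Tesseract accepts multi-character codes like these in a box file.

```python
import re

# Matches provisional codes of the assumed form k001 .. k999.
CODE_RE = re.compile(r"k\d{3}")


def build_tables(combinations):
    """Give each distinct character combination a provisional code (k001, k002, ...).

    Returns two dicts: combination -> code (for preparing training files)
    and code -> combination (for post-processing recognition output).
    """
    unique = sorted(set(combinations))
    if len(unique) > 999:
        raise ValueError("the three-digit code scheme only covers 999 combinations")
    to_code = {combo: f"k{i:03d}" for i, combo in enumerate(unique, start=1)}
    to_text = {code: combo for combo, code in to_code.items()}
    return to_code, to_text


def decode(ocr_text, to_text):
    """Replace every known provisional code in the OCR output with its real sequence.

    Codes that are not in the table are left unchanged.
    """
    return CODE_RE.sub(lambda m: to_text.get(m.group(0), m.group(0)), ocr_text)


if __name__ == "__main__":
    # "ab" and "cd" stand in for real joint character combinations.
    to_code, to_text = build_tables(["ab", "cd", "ab"])
    print(to_code)
    print(decode("k002 k001", to_text))
```

One weakness of this scheme is that genuine text which happens to look like a code (for example a literal "k001" in the document) would also be rewritten, so the codes would need to be chosen so that they cannot occur in real text.
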
For post-processing inside the Tesseract code itself, could you please point out which files I should modify? Please also advise whether the latest version of Tesseract will do.

Thank you very much in advance for your time and response.

Best Regards,
Sochenda

On Sat, Jan 15, 2011 at 3:05 AM, Dmitry Silaev <[email protected]> wrote:

> Chenda,
>
> In fact Tesseract doesn't care whether you train it on a real language's
> letter or which language that letter belongs to. Put simply, Tess
> only saves the mapping of feature sets obtained from training to Unicode
> ids. This implies that during training you can assign virtually any
> character code to virtually any glyph (to be exact, to a connected component
> or to a set of connected components).
>
> If your language's script is composed of a reasonable number of joint
> character combinations, then while training you can assign every such
> combination a predefined Unicode id (some restrictions apply). Later, when
> running recognition, you should do some post-processing to decode your
> predefined ids into the real language's character sequences.
>
> For good results all this requires you to develop a training file
> pre-processor (mapping: language char combinations -> provisional ids) and a
> recognition result post-processor (mapping: provisional ids -> language char
> sequences). I'm not sure, but this may also require correcting the character
> property bit masks in the unicharset file (I don't know exactly how this
> information is used by Tess, as I don't need it in my project).
>
> Warm regards,
> Dmitry Silaev
>
> On Fri, Jan 14, 2011 at 10:25 AM, KHEM Sochenda <[email protected]> wrote:
>
>> Dear Tesseract Team,
>>
>> In the step for training a new language, we have to assign a Unicode value
>> to each box.
>> I would like to know what to do when a shape is composed of several Unicode
>> characters.
>> Is there any way to assign only an id to each box in Tesseract?
>>
>> Thank you very much in advance for your response.
>>
>> Best Regards,
>> Chenda
>>
>> --
>> You received this message because you are subscribed to the Google Groups
>> "tesseract-ocr" group.
>> To post to this group, send email to [email protected].
>> To unsubscribe from this group, send email to
>> [email protected]<tesseract-ocr%[email protected]>.
>> For more options, visit this group at
>> http://groups.google.com/group/tesseract-ocr?hl=en.

