Chenda,

In fact, Tesseract doesn't care whether you are training it on a letter from a real language, or which language that letter belongs to. Simply put, Tess only stores a mapping from the feature sets obtained during training to Unicode ids. This means that during training you can assign virtually any character code to virtually any glyph (to be exact, to a connected component or to a set of connected components).
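To make the point about arbitrary labels concrete, here is a minimal sketch that swaps the label on one box-file entry while leaving the coordinates alone. It assumes the box line layout `<symbol> <left> <bottom> <right> <top> <page>`; older box files without the page field would need a different field count. The helper name `relabel_box_line` and the code point U+E000 are purely illustrative and not something Tesseract prescribes.

```python
def relabel_box_line(line: str, new_label: str) -> str:
    """Return a box-file line with its label replaced by new_label.

    Assumed layout: '<symbol> <left> <bottom> <right> <top> <page>'.
    The last five fields are treated as integers; everything before
    them is the old label, which is discarded.
    """
    parts = line.split()
    if len(parts) < 6:
        raise ValueError(f"expected a label plus 5 numeric fields: {line!r}")
    coords = parts[-5:]
    for field in coords:
        int(field)  # raises ValueError if a coordinate is not an integer
    return " ".join([new_label] + coords)


# Illustrative use: give the box an arbitrary single-code-point label.
relabelled = relabel_box_line("\u1780 120 300 160 350 0", "\ue000")
```

The glyph inside the box is unchanged; only the code that training will associate with its features is different.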
If your script consists of a reasonable number of joined character combinations, then during training you can assign each such combination a predefined Unicode id (some restrictions apply). Later, when running recognition, you need to do some post-processing to decode your predefined ids back into the real language's character sequences. For good results, all this requires you to develop a training-file pre-processor (mapping: language char combinations -> provisional ids) and a recognition-result post-processor (mapping: provisional ids -> language char sequences).

I'm not sure, but this may also require correcting the character property bit masks in the unicharset file (I don't know exactly how Tess uses this information, as I don't need it in my project).

Warm regards,
Dmitry Silaev

On Fri, Jan 14, 2011 at 10:25 AM, KHEM Sochenda <khemsoche...@gmail.com> wrote:
> Dear Tesseract Team,
>
> In the step of training a new language, we have to assign a Unicode value to each box.
> I would like to know what to do if a shape is composed of several Unicode characters.
> Is there any way to assign only an id to each box in Tesseract?
>
> Thank you very much in advance for your response.
>
> Best Regards,
> Chenda

--
You received this message because you are subscribed to the Google Groups "tesseract-ocr" group.
To post to this group, send email to tesseract-ocr@googlegroups.com.
To unsubscribe from this group, send email to tesseract-ocr+unsubscr...@googlegroups.com.
For more options, visit this group at http://groups.google.com/group/tesseract-ocr?hl=en.
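The pre-processor and post-processor mappings described in the reply above could be sketched roughly as follows. This is only an illustration under stated assumptions: provisional ids are drawn from the Unicode Private Use Area (U+E000 to U+F8FF), which is a choice made here and not something the reply specifies, and it has not been verified against Tesseract itself (the "some restrictions apply" caveat may well cover it). The names `build_codebook`, `encode`, and `decode` are hypothetical, and the sample clusters are arbitrary data.

```python
PUA_START = 0xE000  # first code point of the BMP Private Use Area
PUA_END = 0xF8FF    # last code point of the BMP Private Use Area


def build_codebook(clusters):
    """Map each distinct non-empty cluster to one provisional code point."""
    unique = sorted({c for c in clusters if c})
    if len(unique) > PUA_END - PUA_START + 1:
        raise ValueError("too many clusters for the BMP Private Use Area")
    return {cluster: chr(PUA_START + i) for i, cluster in enumerate(unique)}


def encode(text, codebook):
    """Pre-processor: replace known clusters with provisional ids.

    Longer clusters are tried first so that a cluster is not shadowed
    by a shorter one that happens to be its prefix.
    """
    by_length = sorted(codebook, key=len, reverse=True)
    out = []
    i = 0
    while i < len(text):
        for cluster in by_length:
            if text.startswith(cluster, i):
                out.append(codebook[cluster])
                i += len(cluster)
                break
        else:
            out.append(text[i])  # not part of any cluster: keep as-is
            i += 1
    return "".join(out)


def decode(text, codebook):
    """Post-processor: expand provisional ids back into their clusters."""
    reverse = {pid: cluster for cluster, pid in codebook.items()}
    return "".join(reverse.get(ch, ch) for ch in text)


# Illustrative use with two arbitrary three-code-point clusters.
codebook = build_codebook(["\u1780\u17d2\u179a", "\u179f\u17d2\u178f"])
training_text = encode("\u1780\u17d2\u179a\u17b6", codebook)
restored = decode(training_text, codebook)
```

The same codebook has to be used for both directions, so in practice it would be saved alongside the trained data. Characters that are not part of any cluster pass through unchanged in both `encode` and `decode`.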