Dear Sochenda, I've checked the Unicode table range you sent, and now I see what the problem is. I agree that in such an "algorithmic" writing system (as opposed to simpler "positional" systems like, say, Roman or Cyrillic) pre- and post-processing stages are inevitable.
I'd suggest making special training images, either hand-crafted or generated. In these images you would properly space out all the joint character combinations, as well as the character components that can make up Khmer characters. Then you would edit the resulting box files to assign codes according to your own coding system. Repeat this process as many times as needed to reach a sample count of 15-20 for every glyph.

At the recognition stage, if training was done properly, overlapping bounding boxes are not a problem for Tesseract. My experience shows that it is very inventive in character segmentation, even when bounding boxes overlap, so I expect you will have no serious difficulties with partially overhanging or underlying glyphs.

Your post-processor should be able to "decode" the recognition output, using an algorithmic approach, into valid Unicode characters. You can also use Khmer bigram or trigram statistics for error correction. You may want to play around with Tesseract's dictionary facility, but I doubt it would be helpful in your case.

Dmitry
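
P.S. To make the "decoding" step a bit more concrete, here is a rough Python sketch of what such a post-processor could look like. Everything in it is an assumption made for illustration: the single-letter codes, the CODE_TABLE mapping and the decode function are hypothetical and not part of Tesseract, and a real table would have to cover your whole coding system. It only shows the general idea: the recognizer emits codes in visual (left-to-right) order, and the post-processor regroups them into the order Unicode expects, i.e. base consonant, then subscripts (COENG + consonant), then the dependent vowel, even when the vowel is drawn to the left of the consonant.

```python
# Hypothetical coding system: one ASCII code per trained glyph component.
# Each code maps to (kind, Unicode text). A real table would be much larger.
CODE_TABLE = {
    "k": ("base", "\u1780"),         # KHMER LETTER KA
    "m": ("base", "\u1798"),         # KHMER LETTER MO
    "K": ("sub", "\u17D2\u1780"),    # subscript KA = COENG + KA
    "e": ("pre_vowel", "\u17C1"),    # VOWEL SIGN E, drawn left of the base
    "a": ("post_vowel", "\u17B6"),   # VOWEL SIGN AA, drawn right of the base
}


def decode(codes):
    """Turn visually ordered codes into logically ordered Unicode text."""
    out = []
    pending = []    # left-side vowels seen before their base consonant
    current = None  # cluster being built: base, subscripts, vowels

    def flush():
        # Emit the finished cluster as base + subscripts + vowels.
        nonlocal current
        if current is not None:
            out.append(current["base"]
                       + "".join(current["subs"])
                       + "".join(current["vowels"]))
            current = None

    for code in codes:
        # Unknown codes (spaces, punctuation, ...) are passed through as-is.
        kind, text = CODE_TABLE.get(code, ("other", code))
        if kind == "pre_vowel":
            flush()
            pending.append(text)
        elif kind == "base":
            flush()
            current = {"base": text, "subs": [], "vowels": pending}
            pending = []
        elif kind == "sub" and current is not None:
            current["subs"].append(text)
        elif kind == "post_vowel" and current is not None:
            current["vowels"].append(text)
        else:
            # Stray component or unknown code: keep it rather than lose it.
            flush()
            out.extend(pending)
            pending = []
            out.append(text)

    flush()
    out.extend(pending)
    return "".join(out)


if __name__ == "__main__":
    # "e" is seen first but must be stored after its consonant.
    print(decode("ekma"))
```

With this toy table, decode("ek") moves the left-side vowel behind its consonant, and decode("kKa") yields KA, COENG + KA, AA in that order. Bigram or trigram correction would then run on the decoded text as a separate pass.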

