Hi Anton,

Thank you for your tool. I have waited a long time for something like this to come along. But it has an issue: in many Indian languages, Unicode code points combine to form glyphs, e.g.

ra + i = ri
ర + ి = రి
\u0C30 + \u0C3F = \u0C30\u0C3F (one glyph)

Similarly, ra + ā = rā and ra + ī = rī.

When letters come individually, say just \u0C30 (i.e. without the next character being a combining character, also known as a Unicode mark), they are entered properly in the box file. When they come in complex forms, I just see \00 for the character code in the box file.

An example is worth a thousand words. Consider the Telugu text below, where ర (ra) is a standalone consonant (with an inherent 'a' sound), whereas rā and ri are formed by combining 'ra' with the dependent vowel signs for 'ā' and 'i':

రరారరి
ra rā ra ri

It is correctly classified as four boxes. But the box file looks like this:

ర 4 747 11 757 0
\00 13 747 24 754 0
ర 25 747 32 757 0
\00 34 747 41 757 0

While what is expected is:

ర 4 747 11 757 0
రా 13 747 24 754 0
ర 25 747 32 757 0
రి 34 747 41 757 0

I have been reading through the code, and I find that in boxbuilder.cpp it might be wise to use QString instead of QChar in line 35:

std::map<uint,QChar> glyphIndicesToChars;

Similarly, line 39 needs to be modified to fetch a bunch of characters rather than a single one.

Side notes:

0) Should the code in lines 32 to 39 of the above file not be outside the loop?
1) I cannot type Unicode characters directly into the text box using ibus on Ubuntu. It would be good to have that working. It is possible, as CowBoxer runs on Qt and we do it there.
2) Can you save the box file as a UTF-8 text file with a BOM?

Thanks a lot for your tool. I have been looking for one such tool for a long time. Please do fix this to make it more Unicode friendly.

Going forward I expect more problems, but we will look at them when we get there. For example, in Kannada, Telugu, Malayalam etc., rvi is written as ర్వి, the first symbol being 'ri' and the next being vowelless 'v'.

- rākēśvara
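To make the QString suggestion more concrete, here is a minimal sketch of the grouping I have in mind, written in plain C++ without Qt so it can be tried on its own. It is not taken from boxbuilder.cpp: the names isTeluguMark and splitIntoClusters are hypothetical, and the mark test is an assumption that only covers the Telugu block. Inside the tool itself, something like QChar::isMark() would probably be the better check.

```cpp
#include <string>
#include <vector>

// Assumption: only Telugu combining marks are recognised here. A general
// solution would look at the Unicode general category (Mn/Mc/Me) instead.
bool isTeluguMark(char32_t cp) {
    return (cp >= 0x0C00 && cp <= 0x0C04)   // candrabindu, anusvara, visarga
        || cp == 0x0C3C                     // nukta
        || (cp >= 0x0C3E && cp <= 0x0C4D)   // dependent vowel signs and virama
                                            // (range also spans two unassigned code points)
        || cp == 0x0C55 || cp == 0x0C56     // length marks
        || cp == 0x0C62 || cp == 0x0C63;    // vowel signs vocalic l and ll
}

// Split text into clusters: each base character followed by any marks
// attached to it. Each cluster would be the text of one box, which is why
// the map value needs to be a string rather than a single character.
std::vector<std::u32string> splitIntoClusters(const std::u32string& text) {
    std::vector<std::u32string> clusters;
    for (char32_t cp : text) {
        if (!clusters.empty() && isTeluguMark(cp)) {
            clusters.back() += cp;                      // attach mark to previous base
        } else {
            clusters.push_back(std::u32string(1, cp));  // start a new cluster
        }
    }
    return clusters;
}
```

Calling splitIntoClusters on the code points of రరారరి (U+0C30, U+0C30 U+0C3E, U+0C30, U+0C30 U+0C3F) gives four clusters, one per box in the example. This sketch does not handle conjuncts such as ర్వి, where the virama joins two consonants; that case would need more than base-plus-marks grouping.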

