Hi Anton

Thank you for your tool. I have waited a long time for something like this to
come along. But it has an issue. In many Indian languages, Unicode code points
combine to form a single glyph.

For example:
ra + i = ri 
ర + ి = రి
\u0C30 + \u0C3F = \u0C30\u0C3F (one glyph)

similarly, 
ra + ā = rā
ra + ī = rī 
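
To make the grouping concrete, here is a minimal sketch in plain C++ (not code from the tool). The names isTeluguMark and splitIntoClusters are hypothetical, and the range check is a deliberate simplification; a Qt program could ask QChar::isMark() instead.

```cpp
#include <string>
#include <vector>

// Simplified assumption: treat U+0C3E..U+0C56 (Telugu dependent vowel signs,
// virama and length marks) as combining marks. A real tool should consult the
// Unicode character database rather than a hard-coded range.
bool isTeluguMark(char32_t c) {
    return c >= 0x0C3E && c <= 0x0C56;
}

// Split text into clusters: each base character plus the marks that follow it.
std::vector<std::u32string> splitIntoClusters(const std::u32string& text) {
    std::vector<std::u32string> clusters;
    for (char32_t c : text) {
        if (!clusters.empty() && isTeluguMark(c)) {
            clusters.back().push_back(c);              // attach mark to its base
        } else {
            clusters.push_back(std::u32string(1, c));  // start a new cluster
        }
    }
    return clusters;
}
```

With this, the five code points U+0C30 U+0C30 U+0C3E U+0C30 U+0C3F would come out as three clusters: ra, rā and ri.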

When a letter appears on its own, say just \u0C30 (i.e. the next character is
not a combining character, also known as a Unicode mark), it is written
correctly to the box file. When it appears in a combined form, I just see \00
as the character code in the box file.

An example is worth a thousand words.
Consider the Telugu text below, where ర (ra) is a standalone consonant (with an
inherent 'a' sound), whereas rā and ri are formed by combining 'ra' with the
dependent vowel signs for 'ā' and 'i'.

రరారరి 
ra rā ra ri 

This is correctly segmented into four boxes.

But the box file looks like this:
ర 4 747 11 757 0
\00 13 747 24 754 0
ర 25 747 32 757 0
\00 34 747 41 757 0

whereas what I expect is:
ర 4 747 11 757 0
రా 13 747 24 754 0
ర 25 747 32 757 0
రి 34 747 41 757 0

I have been reading through the code, and in boxbuilder.cpp it might be wise
to use QString instead of QChar on line
35:             std::map<uint,QChar> glyphIndicesToChars;
Similarly, line 39 would need to be modified to fetch a sequence of characters
rather than a single one.
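
A rough sketch of the idea, under assumptions: I use std::u32string as a stand-in for QString so the snippet compiles without Qt, and rememberGlyph and textForGlyph are hypothetical helpers, not functions from boxbuilder.cpp. Only the map name is taken from the existing code.

```cpp
#include <map>
#include <string>

// The mapped value is a whole string, so one glyph can carry a base character
// together with its combining marks. std::u32string stands in for QString.
using GlyphMap = std::map<unsigned, std::u32string>;

// Hypothetical helper: remember the full character sequence for a glyph index.
void rememberGlyph(GlyphMap& glyphIndicesToChars, unsigned glyphIndex,
                   const std::u32string& chars) {
    glyphIndicesToChars[glyphIndex] = chars;
}

// Hypothetical helper: look up the text for a glyph index.
// Returns an empty string when the glyph index is unknown.
std::u32string textForGlyph(const GlyphMap& glyphIndicesToChars,
                            unsigned glyphIndex) {
    auto it = glyphIndicesToChars.find(glyphIndex);
    return it == glyphIndicesToChars.end() ? std::u32string() : it->second;
}
```

The point is only the value type: with a single QChar there is nowhere to store the second code point of a sequence such as \u0C30\u0C3F.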

A few side notes:

0) Shouldn't the code on lines 32 to 39 of the above file be outside the
loop?

1) I cannot type Unicode characters directly into the text box using IBus
on Ubuntu. It would be good to have that working. It should be possible, as
CowBoxer also runs on Qt and we can do it there.

2) Can you save the box file as a UTF-8 text file with a BOM?
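
For side note 2, this is roughly what I have in mind, as a sketch in plain C++ rather than Qt. BoxEntry, appendUtf8 and formatBoxFile are hypothetical names of mine; the field order (symbol, left, bottom, right, top, page) follows the box lines shown above.

```cpp
#include <string>
#include <vector>

// Append one code point to a UTF-8 byte string (input is not validated).
void appendUtf8(std::string& out, char32_t c) {
    if (c < 0x80) {
        out += static_cast<char>(c);
    } else if (c < 0x800) {
        out += static_cast<char>(0xC0 | (c >> 6));
        out += static_cast<char>(0x80 | (c & 0x3F));
    } else if (c < 0x10000) {
        out += static_cast<char>(0xE0 | (c >> 12));
        out += static_cast<char>(0x80 | ((c >> 6) & 0x3F));
        out += static_cast<char>(0x80 | (c & 0x3F));
    } else {
        out += static_cast<char>(0xF0 | (c >> 18));
        out += static_cast<char>(0x80 | ((c >> 12) & 0x3F));
        out += static_cast<char>(0x80 | ((c >> 6) & 0x3F));
        out += static_cast<char>(0x80 | (c & 0x3F));
    }
}

// Hypothetical record for one box line; text may hold several code points.
struct BoxEntry {
    std::u32string text;
    int left, bottom, right, top, page;
};

// Build the file contents: UTF-8 BOM first, then one line per box.
std::string formatBoxFile(const std::vector<BoxEntry>& entries) {
    std::string out = "\xEF\xBB\xBF";  // UTF-8 byte order mark
    for (const BoxEntry& e : entries) {
        for (char32_t c : e.text) appendUtf8(out, c);
        out += ' ';
        out += std::to_string(e.left) + ' ' + std::to_string(e.bottom) + ' ';
        out += std::to_string(e.right) + ' ' + std::to_string(e.top) + ' ';
        out += std::to_string(e.page);
        out += '\n';
    }
    return out;
}
```

The returned string can then be written to disk in binary mode so the bytes are not altered on the way out.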

Thanks a lot for your tool. I have been looking for such a tool for a
long time. Please do fix this to make it more Unicode-friendly.

Going forward I expect more problems, but we can look at them when we get
there.
For example, in Kannada, Telugu, Malayalam, etc., rvi is written as ర్వి: the
first symbol being 'ri' and the next being a vowelless 'v'.
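
In code point terms ర్వి is \u0C30 \u0C4D \u0C35 \u0C3F (ra, virama, va, i), so the grouping also has to continue across the virama. A minimal sketch of that rule, in plain C++ and for Telugu only; the names and the hard-coded ranges are my own simplifications, not anything from the tool:

```cpp
#include <string>
#include <vector>

const char32_t kTeluguVirama = 0x0C4D;

// Simplified assumption: U+0C3E..U+0C56 covers the Telugu dependent vowel
// signs, the virama and the length marks.
bool isDependentSign(char32_t c) {
    return c >= 0x0C3E && c <= 0x0C56;
}

// Simplified assumption: U+0C15..U+0C39 covers the Telugu consonants.
bool isTeluguConsonant(char32_t c) {
    return c >= 0x0C15 && c <= 0x0C39;
}

// Group code points so that a consonant following a virama stays in the same
// unit as the consonant before it, together with any dependent signs.
std::vector<std::u32string> splitIntoAksharas(const std::u32string& text) {
    std::vector<std::u32string> units;
    for (char32_t c : text) {
        bool joins = !units.empty() &&
            (isDependentSign(c) ||
             (isTeluguConsonant(c) && units.back().back() == kTeluguVirama));
        if (joins) {
            units.back().push_back(c);              // extend the current unit
        } else {
            units.push_back(std::u32string(1, c));  // start a new unit
        }
    }
    return units;
}
```

Whether one box should hold the whole conjunct or one box per visible symbol is a separate question; this only shows how the code points could be kept together.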

- rākēśvara 



-- 
-- 
You received this message because you are subscribed to the Google
Groups "tesseract-ocr" group.
To post to this group, send email to [email protected]
To unsubscribe from this group, send email to
[email protected]
For more options, visit this group at
http://groups.google.com/group/tesseract-ocr?hl=en
