The bug is in the following piece of code (slightly modified from the 
original):

        QRawFont rawFont = glyphRun.rawFont();
        QVector<quint32> blockTextIndices = rawFont.glyphIndexesForString(blockText_);
        std::map<uint,QChar> glyphIndicesToChars;
        for(int i = 0; i < blockTextIndices.size(); ++i)
            glyphIndicesToChars[blockTextIndices[i] ] = blockText_.at(i);

glyphIndexesForString (or glyphIndexesForChars) does not return the glyph 
indices of the shaped text. As far as I can tell, it looks up each character 
on its own. 

As I said, రా (ra + ā) is one glyph and it has its own index. But what we get 
instead are the indices for ra and ā separately! 
When we access the map with glyphIndicesToChars[*ixIt] (where ixIt points to 
the glyph for rā, రా), the corresponding entry is not there and we get a null 
value back. That is the bug. It coincides with glyphsNum != 
blockText.length() being true.
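
To make the failure concrete, here is a minimal sketch in plain C++ (no Qt). It is not the tool's code. The glyph indices 10, 11 and 57 are made up, and perCharGlyphs(), buildTable() and lookup() are hypothetical helpers that only imitate a per-character lookup and the map from the snippet above.

```cpp
#include <cassert>
#include <cstddef>
#include <map>
#include <string>
#include <vector>

// Hypothetical glyph indices, invented for illustration (not from a real font).
constexpr unsigned kGlyphRa   = 10;  // glyph for U+0C30 (ra)
constexpr unsigned kGlyphAa   = 11;  // glyph for U+0C3E (vowel sign aa) on its own
constexpr unsigned kGlyphRaAa = 57;  // single shaped glyph for ra + aa

// Stand-in for a per-character lookup: one glyph index per code point,
// no shaping. It only knows the two code points used in this example.
std::vector<unsigned> perCharGlyphs(const std::u32string& text) {
    std::vector<unsigned> out;
    for (char32_t c : text)
        out.push_back(c == U'\u0C30' ? kGlyphRa : kGlyphAa);
    return out;
}

// Builds the same kind of glyph-index-to-character table as the snippet above.
std::map<unsigned, char32_t> buildTable(const std::u32string& text) {
    std::map<unsigned, char32_t> table;
    const std::vector<unsigned> glyphs = perCharGlyphs(text);
    assert(glyphs.size() == text.size());
    for (std::size_t i = 0; i < glyphs.size(); ++i)
        table[glyphs[i]] = text[i];
    return table;
}

// operator[] on a missing key inserts and returns a value-initialized entry,
// which for a character type is the null character.
char32_t lookup(std::map<unsigned, char32_t>& table, unsigned glyph) {
    return table[glyph];
}
```

Looking up the shaped glyph (57 here) finds nothing, because the table only ever saw the indices for ra and ā. I suspect that is where the \00 in the box file comes from.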


I am not good with fonts, so I cannot fix it myself. (I would really like to 
get better at it.) 
Hope that helps. Looking forward to a fix soon.
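
One possible direction, in line with the QString idea from my earlier mail, is to store a whole cluster (base character plus the marks attached to it) per glyph instead of a single character. Below is a rough sketch of just the grouping step in plain C++ (no Qt). isTeluguMark() and splitClusters() are hypothetical names, and the mark ranges are a simplification that I am assuming covers the dependent vowel signs, the virama and the length marks. It does not handle conjuncts, and how to pair the clusters with the glyphs of the shaped run is left open.

```cpp
#include <cassert>
#include <string>
#include <vector>

// Simplified assumption: these Telugu code points attach to the preceding
// base character (dependent vowel signs, virama, length marks).
bool isTeluguMark(char32_t c) {
    return (c >= 0x0C3E && c <= 0x0C4D) || (c >= 0x0C55 && c <= 0x0C56);
}

// Splits text into clusters: a base character followed by any marks.
// A mark at the very start of the text becomes a cluster of its own.
std::vector<std::u32string> splitClusters(const std::u32string& text) {
    std::vector<std::u32string> clusters;
    for (char32_t c : text) {
        if (!clusters.empty() && isTeluguMark(c))
            clusters.back().push_back(c);                // extend current cluster
        else
            clusters.push_back(std::u32string(1, c));    // start a new cluster
    }
    return clusters;
}
```

For the sample text రరారరి this gives four clusters, one per box, so a map from glyph index to cluster string would have something to return for రా and రి.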

rākēśvara 


On Tuesday, February 5, 2013 2:58:29 PM UTC+5:30, raakeshvara rao wrote:
>
> Hi Anton
>
> Thank you for your tool. I have waited long for such a thing to come along.
> But it has an issue. In many Indian languages, Unicode code points combine to 
> form glyphs.
>
> e.g:- 
> ra + i = ri 
> ర + ి = రి
> \u0C30 + \u0C3F = \u0C30\u0C3F (one glyph)
>
> similarly, 
> ra + ā = rā
> ra + ī = rī 
>
> When letters come individually, say just \u0C30 (i.e. without the next 
> character being a combining character, also known as a Unicode mark), they 
> are entered properly in the box file. When they come in complex forms, I 
> just see \00 for the character code in the box file. 
>
> An example is worth a thousand words. 
> Consider the Telugu text below, where ర (ra) is a standalone consonant (with 
> an inherent 'a' sound), whereas rā and ri are formed by combining 'ra' with 
> the dependent vowel signs for 'ā' and 'i':
>
> రరారరి 
> ra rā ra ri 
>
> It is correctly classified as four boxes. 
>
> But the box file looks like this:
> ర 4 747 11 757 0
> \00 13 747 24 754 0
> ర 25 747 32 757 0
> \00 34 747 41 757 0
>
> While what is expected is 
> ర 4 747 11 757 0
> రా 13 747 24 754 0
> ర 25 747 32 757 0
> రి 34 747 41 757 0
>
> I have been reading through the code, and I find that in the file 
> boxbuilder.cpp it might be wise to use QString instead of QChar in line 
> 35:             std::map<uint,QChar> glyphIndicesToChars;
> Similarly, line 39 needs to be modified to fetch a group of characters. 
>
> A few side notes: 
>
> 0) Should not the code in lines 32 to 39 in the above file be outside the 
> loop? 
>
> 1) I cannot type Unicode characters directly into the textbox using ibus 
> on Ubuntu. It would be good to have that working. It should be possible, as 
> CowBoxer runs on Qt and we do it there. 
>
> 2) Can you save the box file as a UTF-8 text file with a BOM? 
>
> Thanks a lot for your tool. I have been looking for one such tool for a 
> long time. Please do fix this to make it more Unicode-friendly. 
>
> Going forward I expect more problems. But we will look at them when we get 
> there. 
> For example, in Kannada, Telugu, Malayalam, etc., rvi is written as ర్వి, the 
> first symbol being 'ri' and the next being a vowelless 'v'.
>
> - rākēśvara 
>
>
>
>

-- 
-- 
You received this message because you are subscribed to the Google
Groups "tesseract-ocr" group.
To post to this group, send email to [email protected]
To unsubscribe from this group, send email to
[email protected]
For more options, visit this group at
http://groups.google.com/group/tesseract-ocr?hl=en
