The bug is in the following piece of code (slightly modified from the
original):

    QRawFont rawFont = glyphRun.rawFont();
    QVector<quint32> blockTextIndices =
        rawFont.glyphIndexesForString(blockText_);
    std::map<uint,QChar> glyphIndicesToChars;
    for(int i = 0; i < blockTextIndices.size(); ++i)
        glyphIndicesToChars[blockTextIndices[i]] = blockText_.at(i);

glyphIndexesForString (or glyphIndexesForChars) does not give the glyph
indices this code needs: it returns one index per code point rather than the
indices of the combined glyphs.
As I said, రా (ra + ā) is one glyph and it has its own index. But instead what
we get are the indices for ra and ā separately!
When we access the map glyphIndicesToChars[*ixIt] (where ixIt points to the
glyph for rā రా), the corresponding entry is not there and we get a null value
returned. That is the bug. It also coincides with glyphsNum !=
blockText.length() being true.
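
To make the mismatch concrete, here is a rough, Qt-free sketch of the direction a fix could take: group the text into clusters (a base character plus the dependent signs that follow it), so that each map entry could hold a whole string instead of a single QChar. This is not the tool's code. The names isTeluguDependentSign and splitIntoClusters are made up for illustration, the range check is a simplification that only covers the Telugu signs U+0C3E to U+0C56, and conjuncts are not handled.

```cpp
#include <string>
#include <vector>

// Hypothetical helper: true for Telugu dependent vowel signs and the virama.
// Simplified range check (U+0C3E..U+0C56); a real implementation would
// consult the Unicode character category instead.
bool isTeluguDependentSign(char32_t cp) {
    return cp >= 0x0C3E && cp <= 0x0C56;
}

// Hypothetical helper: split text into clusters, each one a base character
// followed by any dependent signs attached to it.
std::vector<std::u32string> splitIntoClusters(const std::u32string& text) {
    std::vector<std::u32string> clusters;
    for (char32_t cp : text) {
        if (!clusters.empty() && isTeluguDependentSign(cp)) {
            // A dependent sign belongs to the cluster that is already open.
            clusters.back().push_back(cp);
        } else {
            // Anything else starts a new cluster.
            clusters.push_back(std::u32string(1, cp));
        }
    }
    return clusters;
}
```

For the sample text రరారరి (six code points) this gives four clusters, ర, రా, ర, రి, one per box. How each cluster is then paired with the glyph index the font actually produces is left open here.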
I am not good at fonts, so I cannot fix it myself. (I would really like to get
better at it.)
Hope that helps. Looking forward to a fix soon.
rākēśvara
On Tuesday, February 5, 2013 2:58:29 PM UTC+5:30, raakeshvara rao wrote:
>
> Hi Anton
>
> Thank you for your tool. I have waited long for such a thing to come along.
> But it has an issue. In many Indian languages, Unicode code points combine to
> form glyphs.
>
> e.g.:
> ra + i = ri
> ర + ి = రి
> \u0C30 + \u0C3F = \u0C30\u0C3F (one glyph)
>
> similarly,
> ra + ā = rā
> ra + ī = rī
>
> When letters come individually, say just \u0C30 (i.e. without the next
> character being a combining character, also known as a Unicode mark), they
> are entered properly in the box file. When they come in complex forms, I
> just see \00 for the character code in the box file.
>
> An example is worth a thousand words.
> Consider the Telugu text below, where ర (ra) is a standalone consonant (with
> an inherent 'a' sound), whereas rā and ri are formed by combining 'ra' with
> the dependent vowel signs for 'ā' and 'i'.
>
> రరారరి
> ra rā ra ri
>
> This is correctly segmented into four boxes.
>
> But the box file looks like this:
> ర 4 747 11 757 0
> \00 13 747 24 754 0
> ర 25 747 32 757 0
> \00 34 747 41 757 0
>
> While what is expected is:
> ర 4 747 11 757 0
> రా 13 747 24 754 0
> ర 25 747 32 757 0
> రి 34 747 41 757 0
>
> I have been reading through the code, and in the file boxbuilder.cpp it
> might be wise to use QString instead of QChar in line 35:
> std::map<uint,QChar> glyphIndicesToChars;
> Similarly, line 39 needs to be modified to fetch a run of characters.
>
> As side notes:
>
> 0) Should the code in lines 32 to 39 of the above file not be outside the
> loop?
>
> 1) I cannot type Unicode characters directly into the text box using IBus
> on Ubuntu. It would be good to have that working. It is possible, as
> CowBoxer runs on Qt and we do it there.
>
> 2) Can you save the box file as a UTF-8 text file with a BOM?
>
> Thanks a lot for your tool. I have been looking for such a tool for a
> long time. Please do fix this to make it more Unicode-friendly.
>
> Going forward I expect more problems, but we will look at them when we get
> there.
> For example, in Kannada, Telugu, Malayalam etc., rvi is written as ర్వి, the
> first symbol being 'ri' and the next being vowelless 'v'.
>
> - rākēśvara
>
--
You received this message because you are subscribed to the Google
Groups "tesseract-ocr" group.
To post to this group, send email to [email protected]
To unsubscribe from this group, send email to
[email protected]
For more options, visit this group at
http://groups.google.com/group/tesseract-ocr?hl=en