I got in touch with the developer. He has a lot to do, but I sent a
donation and he looked at the issue (I exchanged a few emails with
him). Here is his final response so far:

On Mon, Sep 13, 2010 at 10:28, Rene Rebe <[email protected]> wrote:

Dear Martin,

the problem is that the latest cuneiform version completely changed the
way the bounding box information is written, actually in a way that
makes no sense to me. Before, each glyph had a bounding box, which is
exactly what we need to write a proper PDF. Now they have a bounding box
per line (which we do not need at all) and then an additional array of x
start positions. However, this can easily get out of sync with regard to
multi-byte UTF-8 sequences, and also with regard to whitespace. It would
also be particularly ugly to adapt the hocr2pdf HTML parser to cope with
these x position spans written out after the actual text. I doubt this is
valid hOCR, and even if it is, it makes no sense to first write out the
<span> with the text, and then another <span> just for the x
coordinates. And for proper font size estimation we need the real
y-height of the single glyphs in any case (information not present in
the new format).
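
To illustrate the sync problem, here is a minimal Python sketch. It is not
taken from cuneiform or hocr2pdf: the sample text, the x values, and both
helper functions are made up for illustration, and it assumes the array
carries one x start position per character, including the space.

```python
# Hypothetical sample: one recognized line plus one x start position
# per character (assumed layout, not the actual cuneiform output).
text = "Größe 5"
x_starts = [10, 22, 31, 40, 49, 58, 70]


def pair_by_chars(text, x_starts):
    """Pair each Unicode character with its x start position."""
    return list(zip(text, x_starts))


def pair_by_bytes(text, x_starts):
    """Naive pairing that indexes UTF-8 bytes, as a byte-oriented parser might."""
    data = text.encode("utf-8")
    return [(data[i:i + 1], x) for i, x in enumerate(x_starts) if i < len(data)]


# 7 characters, but 9 bytes: "ö" and "ß" take two bytes each in UTF-8.
print(len(text), len(text.encode("utf-8")))

# Character-wise, "e" starts at x=49; byte-wise it is paired with x=70.
print(pair_by_chars(text, x_starts)[4])
print(pair_by_bytes(text, x_starts)[6])
```

With a bounding box per glyph, the text and its geometry travel together, so
this kind of drift cannot happen.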

I suggest reverting the changes that mangled the hOCR annotation in
cuneiform. That would be approximately these:

revno: 415
committer: julien <[email protected]>
branch nick: cuneiform-linux
timestamp: Wed 2009-10-07 10:10:13 +0200
message:
 moved some tags around, now follows html spec and hocr spec. fixed russian 
comments that were destroyed during encoding
------------------------------------------------------------
revno: 414
committer: julien <[email protected]>
branch nick: cuneiform-linux
timestamp: Fri 2009-10-02 21:48:45 +0200
message:
 separated ocr_line and character bboxes. now follows the hocr standard using 
the ocr_cinfo tag for char bboxes
------------------------------------------------------------
revno: 413
author: Dmitry Polevoy
committer: julien <[email protected]>
branch nick: cuneiform-linux
timestamp: Thu 2009-10-01 17:07:51 +0200
message:
 hocr format now supports ocr_line. Replaced
cuneiform_src/Kern/rout/src/html.cpp to the patch submitted in the cuneiform
mailing list the 24th of February by Dmitry Polevoy. Changed %d to %l in a
few sprintf statements in html.cpp

-- 
Font size not correct in merged sandvich PDF
https://bugs.launchpad.net/bugs/623438