I got in touch with the developer. He has a lot to do, but I sent a donation and he looked at the issue (I exchanged a few emails with him). Here is his final response so far:
On Mon, Sep 13, 2010 at 10:28, Rene Rebe <[email protected]> wrote:

Dear Martin,

the problem is that the latest cuneiform version completely changed the way the bounding box information is written. Actually in a way that makes no sense to me. Before, each glyph had a bounding box, which is exactly what we need to write a proper PDF. Now they have a bounding box per line (which we do not need at all) and then an additional array of x start positions. However, this can easily get out of sync in regard to multi-byte UTF-8 sequences, and also in regard to whitespace.

It would also be particularly ugly to adapt the hocr2pdf HTML parser to cope with these x position spans written out after the actual text. I doubt this is valid hOCR, and even if it is, it makes no sense to first write out the <span> with the text, and then another <span> just for the x coordinates. And for proper font size estimation we need the real y-height of the single glyphs in any case (information not present in the new format).

I suggest reverting the change that mangled the hOCR annotation in cuneiform, ... That would approximately be these:

revno: 415
committer: julien <[email protected]>
branch nick: cuneiform-linux
timestamp: Wed 2009-10-07 10:10:13 +0200
message:
  moved some tags around, now follows html spec and hocr spec.
  fixed russian comments that were destroyed during encoding
------------------------------------------------------------
revno: 414
committer: julien <[email protected]>
branch nick: cuneiform-linux
timestamp: Fri 2009-10-02 21:48:45 +0200
message:
  separated ocr_line and character bboxes. now follows the hocr standard using the ocr_cinfo tag for char bboxes
------------------------------------------------------------
revno: 413
author: Dmitry Polevoy
committer: julien <[email protected]>
branch nick: cuneiform-linux
timestamp: Thu 2009-10-01 17:07:51 +0200
message:
  hocr format now supports ocr_line.
  Replaced cuneiform_src/Kern/rout/src/html.cpp to the patch submitted in the cuneiform mailing list the 24th of February by Dmitry Polevoy.
  Changed %d to %l in a few sprintf statements in html.cpp

--
Font size not correct in merged sandvich PDF
https://bugs.launchpad.net/bugs/623438
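
To illustrate the out-of-sync problem mentioned in the reply, here is a minimal Python sketch. It is not code from cuneiform or hocr2pdf. The function names, the sample word and the x positions are all made up for illustration, and it assumes the new format carries one x start position per character. A reader that walks the text byte by byte runs out of positions before the end of the line as soon as a multi-byte UTF-8 character appears, while a reader that walks code points stays aligned.

```python
def pair_by_code_point(text, x_starts):
    """Pair each character (code point) with its x start position."""
    if len(text) != len(x_starts):
        raise ValueError("text and x position array are out of sync")
    return list(zip(text, x_starts))


def pair_by_byte(text, x_starts):
    """Naive pairing that walks the UTF-8 bytes instead of characters.

    zip() stops at the shorter input, so trailing bytes silently lose
    their position when the text contains multi-byte sequences.
    """
    return list(zip(text.encode("utf-8"), x_starts))


# Hypothetical OCR line: 5 characters, but 7 bytes in UTF-8.
line = "Größe"
x_starts = [10, 22, 30, 41, 49]  # made-up pixel positions, one per character

print(len(line), len(line.encode("utf-8")))
print(pair_by_code_point(line, x_starts))
print(pair_by_byte(line, x_starts))
```

In the byte-wise pairing the last x position ends up attached to the first byte of "ß" and the final "e" gets no position at all. Whitespace that is present in the text but missing from the position array would trip the length check in the same way.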
