Let me first summarize the cuneiform-specific issues and proposed changes
from Martin Wildam's conversation with Rene Rebe.

1) Revisions 413 to 415 completely changed the way bounding box info is 
written: there is now one bbox per line plus an additional array of x start 
positions, and the y height needed for proper font size estimation is missing
2) The per-character bboxes can easily get out of sync with the text, both for 
multi-byte UTF-8 sequences and for whitespace
3) Rene doubts that writing out x positions after the actual text is valid 
hOCR output
4) Rene argues that it makes no sense to first write out a <span> with the 
text and then another <span> for just the x coordinates
(Let me know if any other cuneiform-specific issues were mentioned.)

Item 2 is a real issue. I will create a separate bug for it, as it is internal
to cuneiform.

In my view, items 1, 3 and 4 are not errors in cuneiform but a matter of 
interpreting the hOCR spec.
Official hOCR spec: https://docs.google.com/View?docid=dfxcv4vc_67g844kf
I believe cuneiform does the right thing with ocr_line, ocr_cinfo and 
x_bboxes. More details below.

Perhaps Rene could look here for help with parsing the hOCR output from 
cuneiform:
http://bazaar.launchpad.net/~hocr-parsers/hocr-parsers/main/files

Unless a violation of the hOCR spec regarding this topic is found, I
think this bug should be closed.

Details
1) Incorrect: the y height is available.
Output from rev 412:
        <span title="bbox 363 1253 382 1279">B</span>
        <span title="bbox 383 1254 407 1281">Y</span>
        <span title="bbox 409 1255 431 1283">G</span>
        <span title="bbox 434 1256 458 1284">G</span>
        <span title="bbox 460 1258 485 1285">N</span>
        <span title="bbox 486 1260 511 1286">A</span>
        <span title="bbox 514 1261 538 1287">D</span>
        <span title="bbox 541 1260 560 1289">E</span>
        <span title="bbox 561 1261 581 1289">R</span>
Output in cuneiform 1.0 (or after rev415):
        <span class='ocr_line' id='line_18' title="bbox 363 1253 581 1289">
                <b>BYGGNADER </b>
                <span class='ocr_cinfo' title="x_bboxes 363 1253 382 1279 383 
1254 407 1281 409 1255 431 1283 434 1256 458 1284 460 1258 485 1285 486 1260 
511 1286 514 1261 538 1287 541 1260 560 1289 561 1261 581 1289 -1 -1 -1 -1 ">
        </span>

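It is an incorrect assumption that the x_bboxes are only x positions. The 
values are the same per-character boxes that rev 412 wrote as individual 
spans, four numbers per character, so the y extent of each character is still 
there. The official specification for the hOCR format can be found here: 
https://docs.google.com/View?docid=dfxcv4vc_67g844kf
My understanding is that the above is the correct way to write hOCR output. 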
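
To illustrate, here is a minimal Python sketch (not part of cuneiform) that 
splits the x_bboxes value quoted above into groups of four and derives the 
character heights. The function names parse_x_bboxes and char_heights are made 
up for this example, and treating "-1 -1 -1 -1" as a "no box" placeholder is my 
assumption based on the output above.

```python
def parse_x_bboxes(title):
    """Split an 'x_bboxes ...' title value into (x0, y0, x1, y1) tuples."""
    name, *fields = title.split()
    if name != "x_bboxes":
        raise ValueError("not an x_bboxes property: %r" % name)
    if len(fields) % 4:
        raise ValueError("expected a multiple of four coordinates")
    numbers = [int(f) for f in fields]
    return [tuple(numbers[i:i + 4]) for i in range(0, len(numbers), 4)]


def char_heights(boxes):
    """Height (y1 - y0) of every box that is not the -1 placeholder."""
    placeholder = (-1, -1, -1, -1)  # assumption: means "no box available"
    return [box[3] - box[1] for box in boxes if box != placeholder]


# The title attribute of the ocr_cinfo span for line_18, unwrapped.
TITLE = ("x_bboxes 363 1253 382 1279 383 1254 407 1281 409 1255 431 1283 "
         "434 1256 458 1284 460 1258 485 1285 486 1260 511 1286 "
         "514 1261 538 1287 541 1260 560 1289 561 1261 581 1289 -1 -1 -1 -1 ")

boxes = parse_x_bboxes(TITLE)
heights = char_heights(boxes)
print(len(boxes), heights)
```

The heights obtained this way (or the height of the ocr_line bbox itself) 
could then feed whatever font size estimate the consumer prefers.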

2) I do not understand the comment that "it can easily get out of sync": there 
is exactly one bbox per character on the line.
However, I confirm that there is an issue with whitespace and control 
characters: they are counted among the characters on the line, but their 
bounding boxes are not correct. I will open this as a separate bug. It needs 
to be checked whether this should be special-cased in the hOCR output or 
whether it is an issue upstream in cuneiform (not providing a bounding box for 
whitespace, and producing control characters in the recognized text).
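
As a rough sketch of what "one bbox per character" means for a consumer, the 
Python below pairs the decoded text with the boxes by character rather than by 
byte, which is where multi-byte UTF-8 sequences would otherwise cause a 
mismatch. The helper name pair_chars_with_boxes is hypothetical, and the three 
boxes are the last three entries of the line_18 output quoted above.

```python
def pair_chars_with_boxes(text, boxes):
    """Pair each character (code point, not byte) with its bounding box."""
    if len(text) != len(boxes):
        raise ValueError("%d characters but %d boxes" % (len(text), len(boxes)))
    return list(zip(text, boxes))


# Tail of line_18: "E", "R" and the trailing space.
tail_boxes = [(541, 1260, 560, 1289), (561, 1261, 581, 1289), (-1, -1, -1, -1)]
pairs = pair_chars_with_boxes("ER ", tail_boxes)

# Characters that have no usable box (here: the trailing space).
unboxed = [ch for ch, box in pairs if box == (-1, -1, -1, -1)]

# Counting bytes instead of characters gives a different number as soon as
# the text is not plain ASCII: "\u00c5" (A with ring) is two bytes in UTF-8.
sample = "\u00c5R"
char_count = len(sample)
byte_count = len(sample.encode("utf-8"))
print(pairs, unboxed, char_count, byte_count)
```

A consumer that indexes the boxes by byte offset would drift by one box for 
every extra byte, while the trailing space shows the whitespace case from the 
separate bug.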

3 and 4)
I find the specification somewhat difficult to interpret at times, but it is 
my understanding that the character bbox info goes inside the ocr_line 
element. Whether it goes before or after the text is irrelevant. E.g.
        <span class='ocr_line' id='line_18' title="bbox 363 1253 581 1289">
                <b>BYGGNADER </b>
                <span class='ocr_cinfo' title="x_bboxes 363 1253 382 1279 383 
1254 407 1281 409 1255 431 1283 434 1256 458 1284 460 1258 485 1285 486 1260 
511 1286 514 1261 538 1287 541 1260 560 1289 561 1261 581 1289 -1 -1 -1 -1 ">
        </span>
and
        <span class='ocr_line' id='line_18' title="bbox 363 1253 581 1289">
        <span class='ocr_cinfo' title="x_bboxes 363 1253 382 1279 383 1254 407 
1281 409 1255 431 1283 434 1256 458 1284 460 1258 485 1285 486 1260 511 1286 
514 1261 538 1287 541 1260 560 1289 561 1261 581 1289 -1 -1 -1 -1 ">
                <b>BYGGNADER </b>
        </span>
are equally correct. What matters is that the bboxes are associated with the 
correct line.
So unless it can be shown that the output violates the hOCR spec, I would not 
change it in cuneiform.
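
For what it is worth, a consumer does not need to care about the order. The 
following is only a sketch using Python's standard html.parser module, not 
the code from the hocr-parsers project. The class name LineCollector and the 
collect helper are made up, and the x_bboxes value is shortened to a single box 
to keep the example small.

```python
from html.parser import HTMLParser


class LineCollector(HTMLParser):
    """Collect text and x_bboxes per ocr_line, regardless of their order."""

    def __init__(self):
        super().__init__()
        self.lines = []

    def handle_starttag(self, tag, attrs):
        if tag != "span":
            return
        attrs = dict(attrs)
        classes = (attrs.get("class") or "").split()
        if "ocr_line" in classes:
            # A new line starts; everything that follows belongs to it.
            self.lines.append({"id": attrs.get("id"), "text": "",
                               "x_bboxes": None})
        elif "ocr_cinfo" in classes and self.lines:
            self.lines[-1]["x_bboxes"] = attrs.get("title")

    def handle_data(self, data):
        # Skip the indentation between tags, keep the recognized text as is.
        if self.lines and data.strip():
            self.lines[-1]["text"] += data


def collect(markup):
    parser = LineCollector()
    parser.feed(markup)
    parser.close()
    return parser.lines


TEXT_FIRST = (
    "<span class='ocr_line' id='line_18' title=\"bbox 363 1253 581 1289\">\n"
    "  <b>BYGGNADER </b>\n"
    "  <span class='ocr_cinfo' title=\"x_bboxes 363 1253 382 1279\"></span>\n"
    "</span>\n")

CINFO_FIRST = (
    "<span class='ocr_line' id='line_18' title=\"bbox 363 1253 581 1289\">\n"
    "  <span class='ocr_cinfo' title=\"x_bboxes 363 1253 382 1279\"></span>\n"
    "  <b>BYGGNADER </b>\n"
    "</span>\n")

print(collect(TEXT_FIRST) == collect(CINFO_FIRST))
```

Both variants yield the same record for line_18, because the association is 
made through the enclosing ocr_line span and not through the position of the 
ocr_cinfo span.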

-- 
Font size not correct in merged sandvich PDF
https://bugs.launchpad.net/bugs/623438
