Vincent Hennebert wrote:
> Hi,
> 
<snip/>
>> There's another side-effect to tagged PDF: It allows for better text
>> extraction from the document. PDF even describes ways to make
>> round-trips from XML -> PDF -> XML -> PDF if certain conditions were met.
>> However, we don't do that.
> 
> Speaking of that, the current code doesn’t insert empty elements (like
> <fo:block/>) into the structure tree. The corresponding StructElem
> object /is/ created, but is not linked to its parent. Actually it’s
> present in the PDF without being referred to by any other object.
> I think this is inconsistent, and actually wrong since that would cause
> a loss of information possibly needed by a round-trip transformation.
> I’m going to change that.

I mean, /at some point/ I’m going to change that...

This is easier said than done. Take the following example:
<fo:block>
  Before the empty block.
  <fo:block/>
  After the empty block.
</fo:block>

What basically happens currently is that two text drawing requests are
made to the PDF renderer. The renderer creates the appropriate PDF
stream and registers the pieces of text as children of the structure
element corresponding to the outer block. But nothing happens regarding
the inner empty block, since obviously there’s nothing to do.

The structure element for the inner empty block can’t be added to the
outer block’s children at creation time, otherwise the logical order
wouldn’t be followed.

From the quick look I had, this is a fundamental limitation of the
current approach. There’s no way to know at which place an empty element
must be inserted into the children list of its parent.
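
The ordering problem can be illustrated with a small sketch. It is written
in Python rather than FOP's Java, and every name in it (StructElem,
build_then_render, the "kids" list) is hypothetical and only mimics the
two phases described above; it is not FOP's actual code.

```python
class StructElem:
    """Hypothetical stand-in for a PDF structure element."""

    def __init__(self, name):
        self.name = name
        self.kids = []  # children in the order they were registered


def build_then_render(link_empty_at_creation):
    # Phase 1: structure elements are created before anything is drawn.
    outer = StructElem("outer block")
    inner = StructElem("inner empty block")
    if link_empty_at_creation:
        # Naive fix: attach the empty element to its parent right away.
        outer.kids.append(inner)

    # Phase 2: rendering. Only text triggers a drawing request, and each
    # request registers its text as a child of the outer block.
    for text in ["Before the empty block.", "After the empty block."]:
        outer.kids.append(text)

    # Flatten to plain strings so the resulting order is easy to inspect.
    return [k.name if isinstance(k, StructElem) else k for k in outer.kids]


current = build_then_render(link_empty_at_creation=False)
naive = build_then_render(link_empty_at_creation=True)
```

Here "current" contains only the two pieces of text, so the empty block is
lost, while "naive" lists the empty block first, ahead of the text that
precedes it in the document.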

Probably the only way to solve this issue is to integrate the handling
of the logical structure into the whole processing chain, passing the
suitable information from the FO tree to the layout engine to the area
tree to the renderer. This is probably something that should have been
done from the beginning, but it is anything but trivial.
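
One possible shape for this, as a rough sketch only: the FO tree stage
reserves a slot for every child in document order, and each drawing
request carries its slot index down to the renderer. The sketch is in
Python, not FOP's Java, and the functions build_structure and render, as
well as the "<empty block>" marker, are assumptions made for illustration.

```python
def build_structure(fo_children):
    """FO tree stage: reserve one slot per child, in document order.

    fo_children is a list where a string stands for a run of text and
    None stands for an empty element such as <fo:block/>.
    """
    kids = []
    requests = []
    for index, child in enumerate(fo_children):
        if child is None:
            # Empty element: nothing will ever be drawn for it, so it is
            # placed in the structure now, at its correct position.
            kids.append("<empty block>")
        else:
            # Text: leave a placeholder and remember which slot it fills.
            kids.append(None)
            requests.append((index, child))
    return kids, requests


def render(kids, requests):
    """Renderer stage: each drawing request fills its reserved slot."""
    for index, text in requests:
        kids[index] = text
    return kids


kids, requests = build_structure(
    ["Before the empty block.", None, "After the empty block."]
)
result = render(kids, requests)
```

With the slot index travelling alongside each request, "result" keeps the
empty block between the two pieces of text. The hard part in practice is
threading that index through the layout engine and the area tree, which
this sketch skips entirely.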

Vincent
