Re: Position, Leaf/NonLeafPosition, wrapping positions

2007-03-10 Thread Simon Pepping
On Fri, Mar 09, 2007 at 09:46:11PM +0100, Andreas L Delmelle wrote:
 On Mar 9, 2007, at 18:35, Vincent Hennebert wrote:
 That's it --for now :)
 
 Those notes deserve their wiki page, to not get lost in the mailing  
 list
 archives. I'll create one as soon as I have time. The documentation  
 part
 of the website might also need some cleaning up, BTW.
 
 Simon also once published a description of 'FOP at work' on his  
 homepage, following and explaining the call stack of FOP while  
 processing a document. This was, however, quite a while ago, dating  
 back from before the implementation of the Knuth algorithm. Maybe it  
 is still available and parts of it can be used to add to such a Wiki  
 or to the development documentation...

The docbook code is available in the FOP sources,
src/documentation/content/xdocs/DnI. On my web site I have the XHTML
and PDF versions. The document is however rather old.

Later I have added a few notes to the Wiki, see
FOPImplementationNotes. It does contain a section on property lists.

Regards, Simon

-- 
Simon Pepping
home page: http://www.leverkruid.eu


Re: Position, Leaf/NonLeafPosition, wrapping positions

2007-03-09 Thread Andreas L Delmelle

On Mar 8, 2007, at 20:15, Andreas L Delmelle wrote:

Hi Vincent,


On Mar 8, 2007, at 18:29, Vincent Hennebert wrote:


snip /

Now, does that picture represent the current code?


For the largest part, yes, I think so.
The only difference is the addArea() part, which currently goes top- 
down. It is not the Position that notifies its LM that the areas  
must be added, but the LM that iterates over the Positions and adds  
their corresponding areas.


Just felt like elaborating a bit, maybe putting the whole thing in  
perspective, and offering a glimpse on a few of the current pain- 
points, so here's a rough sketch of what happens inside FOP when a FO  
document gets processed, with some pointers into the code. Maybe it  
saves you and other interested subscribers a bit of time when trying  
to get the bigger picture. I know it took me long enough... ;)


The XSL (identity) transform results in the event
FOTreeBuilder.MainFOHandler.startElement(
  http://www.w3.org/1999/XSL/Format;,
  root,
  fo:root, attributes);

This event occurs for every FO node, starting with the fo:root.
Basically, if you follow that method's logic, what is done is no more  
than:

- create a FONode of the right type
- create a PropertyList from the Attributes
- bind the PropertyList to the FONode
(= transfer the applicable properties for that particular node-type
   to instance members of the FONode; the PropertyList itself is
   only stored by the FOTreeBuilder to use as parent PropertyList
   for the FONode's childrens' PropertyLists)

At the end of startElement(), the processed FONode itself is added to  
the list of childNodes of the MainFOHandler's currentFObj, if all  
basic validation passed. Subsequently, the node's startOfNode()  
method is called, which in most cases is no more than a mapping to:


FOEventHandler.startXXX()

[Note that, IIRC, in the current Trunk code, FONode.startOfNode() is  
called /before/ the child has been added to the parent. I think it  
makes more sense to reverse this order... My local sandbox reflects  
the above description. In the current overall design, however, this  
makes only little difference: see further on.]


When a node has no children, or all of its children have been  
processed, we receive a MainFOHandler.endElement(), where  
FONode.endOfNode() is called, which maps to:


FOEventHandler.endXXX()

The FOEventHandler that is used for area-tree based rendering, is of  
course the AreaTreeHandler, which currently offers implementations  
only for startPageSequence() and endPageSequence(). The latter is  
where the fun starts (read: the layout-loop)


What this boils down to, is the first pain-point:
The whole FO tree for an fo:page-sequence has to be built /before/  
anything layout-related is triggered. In most cases relatively  
unimportant, since the scenarios where the content absolutely / 
cannot/ be split into multiple logical sequences beforehand are rare.  
However, in some of those cases --tables that would span hundreds of  
pages-- the amount of heap space required to build the corresponding  
tree is simply too much for any JVM to bear. In environments where  
such a document is one amongst other concurrently processed documents  
--heap space is shared by different threads--, it will lead to an  
OOMError *long* before the end of the page-sequence is reached, and  
since the heap is shared between threads, either the error will also  
occur there or those other threads will be left with very limited  
heap space to work with, resulting in a much slower processing.


OK, if you then follow the trail starting at  
AreaTreeHandler.endPageSequence(), we almost immediately arrive at  
another pain-point, to which the problem of changing available ipd is  
related:


PageSequenceLM.activateLayout()
...
- AbstractBreaker.doLayout()
   ...
   - PageBreaker.getNextBlockList()
  (first) page provided
  - AbstractBreaker.getNextBlockList()
 ...
 - PageBreaker.getNextKnuthElements()
- FlowLayoutManager.getNextKnuthElements()

Now, as I recall from a few debug sessions in that area, the latter  
method-name is actually a bit misleading, as it gets called only once  
per fo:flow, and triggers (line-)layout for all descendants of the  
flow, based on a LayoutContext that has the available ipd of the  
region-body of the first page in the sequence. The page-break  
computation is started only after the entire fo:flow has been broken  
into lines. The PageBreaker operates completely upon the single  
element list returned by the main flow, and that element list has  
been constructed under the (possibly false) assumption that the ipd  
would be the same for all pages.


This design is also partly related to the above (memory consumption  
of the fo tree), precisely because the LM creation loop is currently  
tied in to the getNextKnuthElements() loop.
Each LM uses getChildLM() to obtain a reference to its 'next' child  
LM, and 

Re: Position, Leaf/NonLeafPosition, wrapping positions

2007-03-09 Thread Vincent Hennebert
Andreas,

First, many thanks for your explanations!
Some comments below.

Andreas L Delmelle a écrit :
 On Mar 8, 2007, at 20:15, Andreas L Delmelle wrote:
 
 Hi Vincent,
 
 On Mar 8, 2007, at 18:29, Vincent Hennebert wrote:

 snip /

 Now, does that picture represent the current code?

 For the largest part, yes, I think so.
 The only difference is the addArea() part, which currently goes
 top-down. It is not the Position that notifies its LM that the areas
 must be added, but the LM that iterates over the Positions and adds
 their corresponding areas.
 
 Just felt like elaborating a bit, maybe putting the whole thing in
 perspective, and offering a glimpse on a few of the current pain-points,
 so here's a rough sketch of what happens inside FOP when a FO document
 gets processed, with some pointers into the code. Maybe it saves you and
 other interested subscribers a bit of time when trying to get the bigger
 picture. I know it took me long enough... ;)
 
 The XSL (identity) transform results in the event
 FOTreeBuilder.MainFOHandler.startElement(
   http://www.w3.org/1999/XSL/Format;,
   root,
   fo:root, attributes);
 
 This event occurs for every FO node, starting with the fo:root.
 Basically, if you follow that method's logic, what is done is no more than:
 - create a FONode of the right type
 - create a PropertyList from the Attributes
 - bind the PropertyList to the FONode
 (= transfer the applicable properties for that particular node-type
to instance members of the FONode; the PropertyList itself is
only stored by the FOTreeBuilder to use as parent PropertyList
for the FONode's childrens' PropertyLists)

When you say transfer the applicable properties, you mean that
inheritance is also handled here? That is, from all the specified +
inherited properties, pick up the ones which apply?

 
 At the end of startElement(), the processed FONode itself is added to
 the list of childNodes of the MainFOHandler's currentFObj, if all basic
 validation passed. Subsequently, the node's startOfNode() method is
 called, which in most cases is no more than a mapping to:
 
 FOEventHandler.startXXX()
 
 [Note that, IIRC, in the current Trunk code, FONode.startOfNode() is
 called /before/ the child has been added to the parent. I think it makes
 more sense to reverse this order... My local sandbox reflects the above
 description. In the current overall design, however, this makes only
 little difference: see further on.]

My understanding of this part is still limited, but it also seems to
make more sense to me.


 When a node has no children, or all of its children have been processed,
 we receive a MainFOHandler.endElement(), where FONode.endOfNode() is
 called, which maps to:
 
 FOEventHandler.endXXX()
 
 The FOEventHandler that is used for area-tree based rendering, is of
 course the AreaTreeHandler, which currently offers implementations only
 for startPageSequence() and endPageSequence(). The latter is where the
 fun starts (read: the layout-loop)
 
 What this boils down to, is the first pain-point:
 The whole FO tree for an fo:page-sequence has to be built /before/
 anything layout-related is triggered. In most cases relatively
 unimportant, since the scenarios where the content absolutely /cannot/
 be split into multiple logical sequences beforehand are rare. However,
 in some of those cases --tables that would span hundreds of pages-- the
 amount of heap space required to build the corresponding tree is simply
 too much for any JVM to bear. In environments where such a document is
 one amongst other concurrently processed documents --heap space is
 shared by different threads--, it will lead to an OOMError *long* before
 the end of the page-sequence is reached, and since the heap is shared
 between threads, either the error will also occur there or those other
 threads will be left with very limited heap space to work with,
 resulting in a much slower processing.

Some thoughts related to this:
It would anyway be best to start the layout process as soon as possible;
ideally there would be multiple, chained threads for the multiple tasks:
FO tree generation, Knuth elements generation, breaking, area tree
generation, rendering, etc. They would act like Unix pipes, in a
producer/consumer model where each thread would be fed by the thread it
depends on, and would itself feed the subsequent thread.
Questions are: does that make sense, when does a thread know it can
start its work, can we clearly separate the several processes, oh well
all those thread synchronizing issues, etc.
But that might give some real performance boost on multi-processor machines.


 OK, if you then follow the trail starting at
 AreaTreeHandler.endPageSequence(), we almost immediately arrive at
 another pain-point, to which the problem of changing available ipd is
 related:
 
 PageSequenceLM.activateLayout()
 ...
 - AbstractBreaker.doLayout()
...
- PageBreaker.getNextBlockList()
   (first) page 

Re: Position, Leaf/NonLeafPosition, wrapping positions

2007-03-09 Thread Andreas L Delmelle

On Mar 9, 2007, at 18:35, Vincent Hennebert wrote:

snip /
  [Me:]

- bind the PropertyList to the FONode
(= transfer the applicable properties for that particular node- 
type

   to instance members of the FONode; the PropertyList itself is
   only stored by the FOTreeBuilder to use as parent PropertyList
   for the FONode's childrens' PropertyLists)


When you say transfer the applicable properties, you mean that
inheritance is also handled here? That is, from all the specified +
inherited properties, pick up the ones which apply?


Correct. Roughly each applicable property(-bundle) corresponds to an  
instance member (where the 'bundle' refers to the Common*Properties)


Property inheritance is handled entirely by the PropertyList.
The very first step is taken in FONode.processNode(), where the  
attributes are added to the list. This takes care of the explicitly  
specified properties.
In FObj.bind(), which is called right after that, each call to  
PropertyList.get() triggers the property resolution mechanism, roughly:

try specified value
try getting the implied value from a specified shorthand property if  
applicable

try inheritance if applicable
and if all this fails, fall back to the initial value defined in the  
Recommendation.


The only exception are descendant nodes of fo:marker, for which all  
this happens only during layout when the marker is actually  
retrieved. Here, a different type of PropertyList is used, which only  
stores the explicitly specified value as a string, and does not  
convert it to a Property yet. (see: Marker.MarkerPropertyList)
When the MainFOHandler receives a startElement() event for an  
fo:marker, the context is switched, so that the descendant nodes know  
the bind() step should be skipped at that point.
This is done because the properties of marker-descendants must be  
resolved as if they had been specified as descendants of the static- 
content where the corresponding retrieve-marker is defined. For  
marker-descendants, inheritance actually means inheritance from  
the ancestry of the retrieve-marker. Therefore, the RetrieveMarker  
keeps its PropertyList alive, so that it can serve as a parent  
PropertyList for the marker-descendants later on, when cloning the  
subtree.



snip /
Some thoughts related to this:
It would anyway be best to start the layout process as soon as  
possible;
ideally there would be multiple, chained threads for the multiple  
tasks:

FO tree generation, Knuth elements generation, breaking, area tree
generation, rendering, etc. They would act like Unix pipes, in a
producer/consumer model where each thread would be fed by the  
thread it

depends on, and would itself feed the subsequent thread.
Questions are: does that make sense, when does a thread know it can
start its work, can we clearly separate the several processes, oh well
all those thread synchronizing issues, etc.
But that might give some real performance boost on multi-processor  
machines.


That's roughly the idea a few of us are dreaming of implementing, I  
think... if they had the time. :)
If you compare the codebase to FOP 0.20.5, you'll notice that a large  
part of the redesign precisely consisted of separating the different  
processes, mainly FO tree generation and layout. In FOP 0.20.5, there  
is very much layout-related code scattered in the FObjs.


OTOH, before starting to use separate threads, I think some other  
refactoring can/needs to be done first, if only to make it easier  
afterwards to implement the threads. I was already playing with the  
idea to move the LM-construction and some of the initialization to a  
separate LMInitThread, but then bumped into the mentioned problem.  
The list of child-nodes cannot be extended after the LM has been  
constructed. The fo:page-sequence, or at least one of its fo:flows  
must be completely parsed before instantiating a FlowLayoutManager.
[An attempt to work around this by dropping the ArrayList entirely,  
and using an iterator over a virtual list without this particular  
limitation can be found in Bugzilla 41656. Patch not applied yet,  
since I haven't done any extensive testing of its effects. I only  
know that the junit tests pass.]


The idea in the long run was to move towards a separation of concerns  
inside the layoutengine, which could then later on make room for  
something like a ListProducerThread, and a BreakerThread ?


In the worst case, at first without multi-threading, FOP would use  
the exact same amount of heap as it does now, only distributed a bit  
differently. More of it sooner than it does now, but in the end not  
exceeding the current state, where we ultimately end up with a whole  
FO tree for a page-sequence on the one hand /and/ a corresponding  
tree of LayoutManagers. Not to mention all the Lists and ListElements  
that are generated in between...
If you don't give the breaking-algorithm a chance to reset from time  
to time, memory consumption shoots 

Re: Position, Leaf/NonLeafPosition, wrapping positions

2007-03-08 Thread Vincent Hennebert
Guys,

thanks for your inputs. That strengthens a bit the picture I have in
mind. I'm wondering if the following makes any sense:

We would have a list of page-level Knuth elements; some of those
elements would be wrapper around line-level Knuth elements. Almost every
element would contain a Position pointing to the LayoutManager that
generated the element. Special elements created to achieve various
effects (say, text-centering), would contain null Positions.

The breaking algorithm would break the list of Knuth elements into
chunks representing lines/pages. From that we would iterate over the
non-null Positions, and for each one call
position.getLM().addArea(position). That is, the position would notify
its LM that its corresponding areas must be added.

For page-level elements wrapping line-level elements, we would also
iterate over the wrapped inline Positions.


Now, does that picture represent the current code? If not, what is
missing/wrong, apart from the many details relevant to each particular
LM? Does that make sense?

Thanks,
Vincent


Jeremias Maerki a écrit :
 On 07.03.2007 15:28:37 Andreas L Delmelle wrote:
 On Mar 7, 2007, at 09:07, Vincent Hennebert wrote:

 Hi Vincent,

 I have some questions regarding the handling of Position elements. I'm
 not familiar with that part of the code yet, and as there is little or
 no javadoc for those it's a bit difficult to guess their purposes just
 by looking at the code.
 I think you're not the only one with this problem... :/
 Note that there is also an illustrious resetPosition() method, which  
 currently seems to be used nowhere. All references to it seem to be  
 circular. BlockLayoutManager.resetPosition() is only called by  
 AbstractLayoutManager.reset(), which in turn is only called by one of  
 the other LM's resetPosition(), etc.
 
 That's a left-over from the Keiron's and Karen's first approach before
 we introduced Knuth's algorithm. The resetting was used to go back to a
 certain position after deciding line/page breaks. When I merged in
 Luca's Knuth code I mostly removed most of this resetting because I
 thought we wouldn't need it anymore. Turns out I could have been wrong 
 (changing available IPD topic). But maybe it's also good I didn't
 remove everything so we still know that there was such a mechanism. At
 some point we obviously have to resurrect it (if we need it again) or to
 remove it for good.
 
 The closest understanding that I have of those Positions is still a  
 bit limited, I'm afraid, so I'm hoping others chime in to correct my  
 mistakes or offer further clarification. (Just noticed that Luca  
 already did; will share my thoughts in any case, to see if I may have  
 misinterpreted something)
 
 snip/
 
 That is all: as far as I get it...
 
 Oh, I think you pretty much got it.
 
 
 Jeremias Maerki
 


Re: Position, Leaf/NonLeafPosition, wrapping positions

2007-03-08 Thread Jeremias Maerki
Yes, that's pretty much my view of the whole thing.

On 08.03.2007 18:29:07 Vincent Hennebert wrote:
 Guys,
 
 thanks for your inputs. That strengthens a bit the picture I have in
 mind. I'm wondering if the following makes any sense:
 
 We would have a list of page-level Knuth elements; some of those
 elements would be wrapper around line-level Knuth elements. Almost every
 element would contain a Position pointing to the LayoutManager that
 generated the element. Special elements created to achieve various
 effects (say, text-centering), would contain null Positions.
 
 The breaking algorithm would break the list of Knuth elements into
 chunks representing lines/pages. From that we would iterate over the
 non-null Positions, and for each one call
 position.getLM().addArea(position). That is, the position would notify
 its LM that its corresponding areas must be added.
 
 For page-level elements wrapping line-level elements, we would also
 iterate over the wrapped inline Positions.
 
 
 Now, does that picture represent the current code? If not, what is
 missing/wrong, apart from the many details relevant to each particular
 LM? Does that make sense?
snip/

Jeremias Maerki



Re: Position, Leaf/NonLeafPosition, wrapping positions

2007-03-08 Thread Andreas L Delmelle

On Mar 8, 2007, at 18:29, Vincent Hennebert wrote:


thanks for your inputs. That strengthens a bit the picture I have in
mind. I'm wondering if the following makes any sense:

We would have a list of page-level Knuth elements; some of those
elements would be wrapper around line-level Knuth elements. Almost  
every

element would contain a Position pointing to the LayoutManager that
generated the element. Special elements created to achieve various
effects (say, text-centering), would contain null Positions.

The breaking algorithm would break the list of Knuth elements into
chunks representing lines/pages. From that we would iterate over the
non-null Positions, and for each one call
position.getLM().addArea(position). That is, the position would notify
its LM that its corresponding areas must be added.

For page-level elements wrapping line-level elements, we would also
iterate over the wrapped inline Positions.


Now, does that picture represent the current code?


For the largest part, yes, I think so.
The only difference is the addArea() part, which currently goes top- 
down. It is not the Position that notifies its LM that the areas must  
be added, but the LM that iterates over the Positions and adds their  
corresponding areas.


Cheers,

Andreas




Position, Leaf/NonLeafPosition, wrapping positions

2007-03-07 Thread Vincent Hennebert
Hi,

I have some questions regarding the handling of Position elements. I'm
not familiar with that part of the code yet, and as there is little or
no javadoc for those it's a bit difficult to guess their purposes just
by looking at the code.
What's the purpose of a LeafPosition? Of a NonLeafPosition? What's the
purpose of the wrapPositionElements method in
BlockStackingLayoutManager? Subsidiary question: why would one sometimes
force the wrapping, sometimes not?

Hope someone can answer those questions...

Thanks,
Vincent


Re: Position, Leaf/NonLeafPosition, wrapping positions

2007-03-07 Thread Andreas L Delmelle

On Mar 7, 2007, at 09:07, Vincent Hennebert wrote:

Hi Vincent,


I have some questions regarding the handling of Position elements. I'm
not familiar with that part of the code yet, and as there is little or
no javadoc for those it's a bit difficult to guess their purposes just
by looking at the code.


I think you're not the only one with this problem... :/
Note that there is also an illustrious resetPosition() method, which  
currently seems to be used nowhere. All references to it seem to be  
circular. BlockLayoutManager.resetPosition() is only called by  
AbstractLayoutManager.reset(), which in turn is only called by one of  
the other LM's resetPosition(), etc.


The closest understanding that I have of those Positions is still a  
bit limited, I'm afraid, so I'm hoping others chime in to correct my  
mistakes or offer further clarification. (Just noticed that Luca  
already did; will share my thoughts in any case, to see if I may have  
misinterpreted something)


Anyway, AFAIU, a Position can be viewed as a kind of a pointer to a  
position within the FObj.

The significance is roughly:
- when the initial element list is constructed, each generated  
ListElement receives a Position
- after the breaking is done, and the areas get added, the Position  
is used to get to the 'part' of the FO that corresponds to that  
particular ListElement (and the right AreaInfos)


Auxiliary elements get null as a Position, since they generally do  
not correspond to anything in the FO source. They are only added to  
'steer' the breaking-algorithm in a certain direction.


Each break possibility can be said to correspond to such a Position.  
In the addAreas() phase, IIC, the set of chosen break-points will be  
presented as a List of Positions which are then iterated over (see:  
PositionIterator, which is passed to addAreas())


Perhaps this is easiest to follow by means of an example:
When the TextLM creates the base element list, each Position it  
generates and associates with one of its ListElements --a  
LeafPosition in this case-- will correspond roughly to an index in  
the character array... Actually, that's an index into the List of  
AreaInfos, and those ultimately point to an index in the char array.  
The AreaInfos correspond to words or fragments of hyphenated words.
The line-layout algorithm relies solely on the generated  
ListElements, and does not deal with the Positions in the content  
directly, nor with the AreaInfos.
Preserved linefeeds, for example, are 'translated' into separate  
KnuthInlineSequences by the TextLM. All the LineLayoutManager gets to  
see (collectInlineKnuthElements()) is a one-element sequence  
containing nothing but a penalty (forced break), which it interprets  
as a signal to end the current Paragraph, adding only an auxiliary  
glue to generate an empty line if the Paragraph contains nothing but  
the forced break.
Come to think of it, maybe this is a bad example, precisely because  
no Position corresponds to the linefeed. :/
There is no AreaInfo, the index is simply skipped. (see:  
TextLM.getNextKnuthElements() line 709)


Anyway, as far as I get the picture, the whole breaking-algorithm  
operates entirely upon the ListElements, and ultimately translates  
the set of chosen break-possibilities (= positions in the element  
lists) into a List of Positions (associated to the ListElements), and  
this latter list serves as base list for the PositionIterator that is  
passed to addAreas().



What's the purpose of a LeafPosition? Of a NonLeafPosition?


IIC, then the key difference between them is that a LeafPosition will  
have no sub-positions. They only have a position-index within a  
NonLeafPosition. If that clarifies anything...?



What's the purpose of the wrapPositionElements method in
BlockStackingLayoutManager?
Subsidiary question: why would one sometimes
force the wrapping, sometimes not?


Judging from the code, in general this only has effects if the  
calling LM is not the LM that is associated with a given ListElement.  
The only exception seems to be ListItemLayoutManager, which forces  
wrapping regardless of the base LM.


The purpose seems to be to create auxiliary Positions, such that an  
element associated with a LeafPosition from one LM's point of view,  
can be transformed into one that is associated with a NonLeafPosition  
without losing the original Positions' info.


That is all: as far as I get it...


Cheers,

Andreas



Re: Position, Leaf/NonLeafPosition, wrapping positions

2007-03-07 Thread Jeremias Maerki

On 07.03.2007 15:28:37 Andreas L Delmelle wrote:
 On Mar 7, 2007, at 09:07, Vincent Hennebert wrote:
 
 Hi Vincent,
 
  I have some questions regarding the handling of Position elements. I'm
  not familiar with that part of the code yet, and as there is little or
  no javadoc for those it's a bit difficult to guess their purposes just
  by looking at the code.
 
 I think you're not the only one with this problem... :/
 Note that there is also an illustrious resetPosition() method, which  
 currently seems to be used nowhere. All references to it seem to be  
 circular. BlockLayoutManager.resetPosition() is only called by  
 AbstractLayoutManager.reset(), which in turn is only called by one of  
 the other LM's resetPosition(), etc.

That's a left-over from the Keiron's and Karen's first approach before
we introduced Knuth's algorithm. The resetting was used to go back to a
certain position after deciding line/page breaks. When I merged in
Luca's Knuth code I mostly removed most of this resetting because I
thought we wouldn't need it anymore. Turns out I could have been wrong 
(changing available IPD topic). But maybe it's also good I didn't
remove everything so we still know that there was such a mechanism. At
some point we obviously have to resurrect it (if we need it again) or to
remove it for good.

 The closest understanding that I have of those Positions is still a  
 bit limited, I'm afraid, so I'm hoping others chime in to correct my  
 mistakes or offer further clarification. (Just noticed that Luca  
 already did; will share my thoughts in any case, to see if I may have  
 misinterpreted something)

snip/

 That is all: as far as I get it...

Oh, I think you pretty much got it.


Jeremias Maerki