Re: Position, Leaf/NonLeafPosition, wrapping positions
On Fri, Mar 09, 2007 at 09:46:11PM +0100, Andreas L Delmelle wrote: On Mar 9, 2007, at 18:35, Vincent Hennebert wrote: That's it --for now :) Those notes deserve their wiki page, to not get lost in the mailing list archives. I'll create one as soon as I have time. The documentation part of the website might also need some cleaning up, BTW. Simon also once published a description of 'FOP at work' on his homepage, following and explaining the call stack of FOP while processing a document. This was, however, quite a while ago, dating back from before the implementation of the Knuth algorithm. Maybe it is still available and parts of it can be used to add to such a Wiki or to the development documentation... The docbook code is available in the FOP sources, src/documentation/content/xdocs/DnI. On my web site I have the XHTML and PDF versions. The document is however rather old. Later I have added a few notes to the Wiki, see FOPImplementationNotes. It does contain a section on property lists. Regards, Simon -- Simon Pepping home page: http://www.leverkruid.eu
Re: Position, Leaf/NonLeafPosition, wrapping positions
On Mar 8, 2007, at 20:15, Andreas L Delmelle wrote: Hi Vincent, On Mar 8, 2007, at 18:29, Vincent Hennebert wrote: snip / Now, does that picture represent the current code? For the largest part, yes, I think so. The only difference is the addArea() part, which currently goes top- down. It is not the Position that notifies its LM that the areas must be added, but the LM that iterates over the Positions and adds their corresponding areas. Just felt like elaborating a bit, maybe putting the whole thing in perspective, and offering a glimpse on a few of the current pain- points, so here's a rough sketch of what happens inside FOP when a FO document gets processed, with some pointers into the code. Maybe it saves you and other interested subscribers a bit of time when trying to get the bigger picture. I know it took me long enough... ;) The XSL (identity) transform results in the event FOTreeBuilder.MainFOHandler.startElement( http://www.w3.org/1999/XSL/Format;, root, fo:root, attributes); This event occurs for every FO node, starting with the fo:root. Basically, if you follow that method's logic, what is done is no more than: - create a FONode of the right type - create a PropertyList from the Attributes - bind the PropertyList to the FONode (= transfer the applicable properties for that particular node-type to instance members of the FONode; the PropertyList itself is only stored by the FOTreeBuilder to use as parent PropertyList for the FONode's childrens' PropertyLists) At the end of startElement(), the processed FONode itself is added to the list of childNodes of the MainFOHandler's currentFObj, if all basic validation passed. Subsequently, the node's startOfNode() method is called, which in most cases is no more than a mapping to: FOEventHandler.startXXX() [Note that, IIRC, in the current Trunk code, FONode.startOfNode() is called /before/ the child has been added to the parent. I think it makes more sense to reverse this order... My local sandbox reflects the above description. In the current overall design, however, this makes only little difference: see further on.] When a node has no children, or all of its children have been processed, we receive a MainFOHandler.endElement(), where FONode.endOfNode() is called, which maps to: FOEventHandler.endXXX() The FOEventHandler that is used for area-tree based rendering, is of course the AreaTreeHandler, which currently offers implementations only for startPageSequence() and endPageSequence(). The latter is where the fun starts (read: the layout-loop) What this boils down to, is the first pain-point: The whole FO tree for an fo:page-sequence has to be built /before/ anything layout-related is triggered. In most cases relatively unimportant, since the scenarios where the content absolutely / cannot/ be split into multiple logical sequences beforehand are rare. However, in some of those cases --tables that would span hundreds of pages-- the amount of heap space required to build the corresponding tree is simply too much for any JVM to bear. In environments where such a document is one amongst other concurrently processed documents --heap space is shared by different threads--, it will lead to an OOMError *long* before the end of the page-sequence is reached, and since the heap is shared between threads, either the error will also occur there or those other threads will be left with very limited heap space to work with, resulting in a much slower processing. OK, if you then follow the trail starting at AreaTreeHandler.endPageSequence(), we almost immediately arrive at another pain-point, to which the problem of changing available ipd is related: PageSequenceLM.activateLayout() ... - AbstractBreaker.doLayout() ... - PageBreaker.getNextBlockList() (first) page provided - AbstractBreaker.getNextBlockList() ... - PageBreaker.getNextKnuthElements() - FlowLayoutManager.getNextKnuthElements() Now, as I recall from a few debug sessions in that area, the latter method-name is actually a bit misleading, as it gets called only once per fo:flow, and triggers (line-)layout for all descendants of the flow, based on a LayoutContext that has the available ipd of the region-body of the first page in the sequence. The page-break computation is started only after the entire fo:flow has been broken into lines. The PageBreaker operates completely upon the single element list returned by the main flow, and that element list has been constructed under the (possibly false) assumption that the ipd would be the same for all pages. This design is also partly related to the above (memory consumption of the fo tree), precisely because the LM creation loop is currently tied in to the getNextKnuthElements() loop. Each LM uses getChildLM() to obtain a reference to its 'next' child LM, and
Re: Position, Leaf/NonLeafPosition, wrapping positions
Andreas, First, many thanks for your explanations! Some comments below. Andreas L Delmelle a écrit : On Mar 8, 2007, at 20:15, Andreas L Delmelle wrote: Hi Vincent, On Mar 8, 2007, at 18:29, Vincent Hennebert wrote: snip / Now, does that picture represent the current code? For the largest part, yes, I think so. The only difference is the addArea() part, which currently goes top-down. It is not the Position that notifies its LM that the areas must be added, but the LM that iterates over the Positions and adds their corresponding areas. Just felt like elaborating a bit, maybe putting the whole thing in perspective, and offering a glimpse on a few of the current pain-points, so here's a rough sketch of what happens inside FOP when a FO document gets processed, with some pointers into the code. Maybe it saves you and other interested subscribers a bit of time when trying to get the bigger picture. I know it took me long enough... ;) The XSL (identity) transform results in the event FOTreeBuilder.MainFOHandler.startElement( http://www.w3.org/1999/XSL/Format;, root, fo:root, attributes); This event occurs for every FO node, starting with the fo:root. Basically, if you follow that method's logic, what is done is no more than: - create a FONode of the right type - create a PropertyList from the Attributes - bind the PropertyList to the FONode (= transfer the applicable properties for that particular node-type to instance members of the FONode; the PropertyList itself is only stored by the FOTreeBuilder to use as parent PropertyList for the FONode's childrens' PropertyLists) When you say transfer the applicable properties, you mean that inheritance is also handled here? That is, from all the specified + inherited properties, pick up the ones which apply? At the end of startElement(), the processed FONode itself is added to the list of childNodes of the MainFOHandler's currentFObj, if all basic validation passed. Subsequently, the node's startOfNode() method is called, which in most cases is no more than a mapping to: FOEventHandler.startXXX() [Note that, IIRC, in the current Trunk code, FONode.startOfNode() is called /before/ the child has been added to the parent. I think it makes more sense to reverse this order... My local sandbox reflects the above description. In the current overall design, however, this makes only little difference: see further on.] My understanding of this part is still limited, but it also seems to make more sense to me. When a node has no children, or all of its children have been processed, we receive a MainFOHandler.endElement(), where FONode.endOfNode() is called, which maps to: FOEventHandler.endXXX() The FOEventHandler that is used for area-tree based rendering, is of course the AreaTreeHandler, which currently offers implementations only for startPageSequence() and endPageSequence(). The latter is where the fun starts (read: the layout-loop) What this boils down to, is the first pain-point: The whole FO tree for an fo:page-sequence has to be built /before/ anything layout-related is triggered. In most cases relatively unimportant, since the scenarios where the content absolutely /cannot/ be split into multiple logical sequences beforehand are rare. However, in some of those cases --tables that would span hundreds of pages-- the amount of heap space required to build the corresponding tree is simply too much for any JVM to bear. In environments where such a document is one amongst other concurrently processed documents --heap space is shared by different threads--, it will lead to an OOMError *long* before the end of the page-sequence is reached, and since the heap is shared between threads, either the error will also occur there or those other threads will be left with very limited heap space to work with, resulting in a much slower processing. Some thoughts related to this: It would anyway be best to start the layout process as soon as possible; ideally there would be multiple, chained threads for the multiple tasks: FO tree generation, Knuth elements generation, breaking, area tree generation, rendering, etc. They would act like Unix pipes, in a producer/consumer model where each thread would be fed by the thread it depends on, and would itself feed the subsequent thread. Questions are: does that make sense, when does a thread know it can start its work, can we clearly separate the several processes, oh well all those thread synchronizing issues, etc. But that might give some real performance boost on multi-processor machines. OK, if you then follow the trail starting at AreaTreeHandler.endPageSequence(), we almost immediately arrive at another pain-point, to which the problem of changing available ipd is related: PageSequenceLM.activateLayout() ... - AbstractBreaker.doLayout() ... - PageBreaker.getNextBlockList() (first) page
Re: Position, Leaf/NonLeafPosition, wrapping positions
On Mar 9, 2007, at 18:35, Vincent Hennebert wrote: snip / [Me:] - bind the PropertyList to the FONode (= transfer the applicable properties for that particular node- type to instance members of the FONode; the PropertyList itself is only stored by the FOTreeBuilder to use as parent PropertyList for the FONode's childrens' PropertyLists) When you say transfer the applicable properties, you mean that inheritance is also handled here? That is, from all the specified + inherited properties, pick up the ones which apply? Correct. Roughly each applicable property(-bundle) corresponds to an instance member (where the 'bundle' refers to the Common*Properties) Property inheritance is handled entirely by the PropertyList. The very first step is taken in FONode.processNode(), where the attributes are added to the list. This takes care of the explicitly specified properties. In FObj.bind(), which is called right after that, each call to PropertyList.get() triggers the property resolution mechanism, roughly: try specified value try getting the implied value from a specified shorthand property if applicable try inheritance if applicable and if all this fails, fall back to the initial value defined in the Recommendation. The only exception are descendant nodes of fo:marker, for which all this happens only during layout when the marker is actually retrieved. Here, a different type of PropertyList is used, which only stores the explicitly specified value as a string, and does not convert it to a Property yet. (see: Marker.MarkerPropertyList) When the MainFOHandler receives a startElement() event for an fo:marker, the context is switched, so that the descendant nodes know the bind() step should be skipped at that point. This is done because the properties of marker-descendants must be resolved as if they had been specified as descendants of the static- content where the corresponding retrieve-marker is defined. For marker-descendants, inheritance actually means inheritance from the ancestry of the retrieve-marker. Therefore, the RetrieveMarker keeps its PropertyList alive, so that it can serve as a parent PropertyList for the marker-descendants later on, when cloning the subtree. snip / Some thoughts related to this: It would anyway be best to start the layout process as soon as possible; ideally there would be multiple, chained threads for the multiple tasks: FO tree generation, Knuth elements generation, breaking, area tree generation, rendering, etc. They would act like Unix pipes, in a producer/consumer model where each thread would be fed by the thread it depends on, and would itself feed the subsequent thread. Questions are: does that make sense, when does a thread know it can start its work, can we clearly separate the several processes, oh well all those thread synchronizing issues, etc. But that might give some real performance boost on multi-processor machines. That's roughly the idea a few of us are dreaming of implementing, I think... if they had the time. :) If you compare the codebase to FOP 0.20.5, you'll notice that a large part of the redesign precisely consisted of separating the different processes, mainly FO tree generation and layout. In FOP 0.20.5, there is very much layout-related code scattered in the FObjs. OTOH, before starting to use separate threads, I think some other refactoring can/needs to be done first, if only to make it easier afterwards to implement the threads. I was already playing with the idea to move the LM-construction and some of the initialization to a separate LMInitThread, but then bumped into the mentioned problem. The list of child-nodes cannot be extended after the LM has been constructed. The fo:page-sequence, or at least one of its fo:flows must be completely parsed before instantiating a FlowLayoutManager. [An attempt to work around this by dropping the ArrayList entirely, and using an iterator over a virtual list without this particular limitation can be found in Bugzilla 41656. Patch not applied yet, since I haven't done any extensive testing of its effects. I only know that the junit tests pass.] The idea in the long run was to move towards a separation of concerns inside the layoutengine, which could then later on make room for something like a ListProducerThread, and a BreakerThread ? In the worst case, at first without multi-threading, FOP would use the exact same amount of heap as it does now, only distributed a bit differently. More of it sooner than it does now, but in the end not exceeding the current state, where we ultimately end up with a whole FO tree for a page-sequence on the one hand /and/ a corresponding tree of LayoutManagers. Not to mention all the Lists and ListElements that are generated in between... If you don't give the breaking-algorithm a chance to reset from time to time, memory consumption shoots
Re: Position, Leaf/NonLeafPosition, wrapping positions
Guys, thanks for your inputs. That strengthens a bit the picture I have in mind. I'm wondering if the following makes any sense: We would have a list of page-level Knuth elements; some of those elements would be wrapper around line-level Knuth elements. Almost every element would contain a Position pointing to the LayoutManager that generated the element. Special elements created to achieve various effects (say, text-centering), would contain null Positions. The breaking algorithm would break the list of Knuth elements into chunks representing lines/pages. From that we would iterate over the non-null Positions, and for each one call position.getLM().addArea(position). That is, the position would notify its LM that its corresponding areas must be added. For page-level elements wrapping line-level elements, we would also iterate over the wrapped inline Positions. Now, does that picture represent the current code? If not, what is missing/wrong, apart from the many details relevant to each particular LM? Does that make sense? Thanks, Vincent Jeremias Maerki a écrit : On 07.03.2007 15:28:37 Andreas L Delmelle wrote: On Mar 7, 2007, at 09:07, Vincent Hennebert wrote: Hi Vincent, I have some questions regarding the handling of Position elements. I'm not familiar with that part of the code yet, and as there is little or no javadoc for those it's a bit difficult to guess their purposes just by looking at the code. I think you're not the only one with this problem... :/ Note that there is also an illustrious resetPosition() method, which currently seems to be used nowhere. All references to it seem to be circular. BlockLayoutManager.resetPosition() is only called by AbstractLayoutManager.reset(), which in turn is only called by one of the other LM's resetPosition(), etc. That's a left-over from the Keiron's and Karen's first approach before we introduced Knuth's algorithm. The resetting was used to go back to a certain position after deciding line/page breaks. When I merged in Luca's Knuth code I mostly removed most of this resetting because I thought we wouldn't need it anymore. Turns out I could have been wrong (changing available IPD topic). But maybe it's also good I didn't remove everything so we still know that there was such a mechanism. At some point we obviously have to resurrect it (if we need it again) or to remove it for good. The closest understanding that I have of those Positions is still a bit limited, I'm afraid, so I'm hoping others chime in to correct my mistakes or offer further clarification. (Just noticed that Luca already did; will share my thoughts in any case, to see if I may have misinterpreted something) snip/ That is all: as far as I get it... Oh, I think you pretty much got it. Jeremias Maerki
Re: Position, Leaf/NonLeafPosition, wrapping positions
Yes, that's pretty much my view of the whole thing. On 08.03.2007 18:29:07 Vincent Hennebert wrote: Guys, thanks for your inputs. That strengthens a bit the picture I have in mind. I'm wondering if the following makes any sense: We would have a list of page-level Knuth elements; some of those elements would be wrapper around line-level Knuth elements. Almost every element would contain a Position pointing to the LayoutManager that generated the element. Special elements created to achieve various effects (say, text-centering), would contain null Positions. The breaking algorithm would break the list of Knuth elements into chunks representing lines/pages. From that we would iterate over the non-null Positions, and for each one call position.getLM().addArea(position). That is, the position would notify its LM that its corresponding areas must be added. For page-level elements wrapping line-level elements, we would also iterate over the wrapped inline Positions. Now, does that picture represent the current code? If not, what is missing/wrong, apart from the many details relevant to each particular LM? Does that make sense? snip/ Jeremias Maerki
Re: Position, Leaf/NonLeafPosition, wrapping positions
On Mar 8, 2007, at 18:29, Vincent Hennebert wrote: thanks for your inputs. That strengthens a bit the picture I have in mind. I'm wondering if the following makes any sense: We would have a list of page-level Knuth elements; some of those elements would be wrapper around line-level Knuth elements. Almost every element would contain a Position pointing to the LayoutManager that generated the element. Special elements created to achieve various effects (say, text-centering), would contain null Positions. The breaking algorithm would break the list of Knuth elements into chunks representing lines/pages. From that we would iterate over the non-null Positions, and for each one call position.getLM().addArea(position). That is, the position would notify its LM that its corresponding areas must be added. For page-level elements wrapping line-level elements, we would also iterate over the wrapped inline Positions. Now, does that picture represent the current code? For the largest part, yes, I think so. The only difference is the addArea() part, which currently goes top- down. It is not the Position that notifies its LM that the areas must be added, but the LM that iterates over the Positions and adds their corresponding areas. Cheers, Andreas
Position, Leaf/NonLeafPosition, wrapping positions
Hi, I have some questions regarding the handling of Position elements. I'm not familiar with that part of the code yet, and as there is little or no javadoc for those it's a bit difficult to guess their purposes just by looking at the code. What's the purpose of a LeafPosition? Of a NonLeafPosition? What's the purpose of the wrapPositionElements method in BlockStackingLayoutManager? Subsidiary question: why would one sometimes force the wrapping, sometimes not? Hope someone can answer those questions... Thanks, Vincent
Re: Position, Leaf/NonLeafPosition, wrapping positions
On Mar 7, 2007, at 09:07, Vincent Hennebert wrote: Hi Vincent, I have some questions regarding the handling of Position elements. I'm not familiar with that part of the code yet, and as there is little or no javadoc for those it's a bit difficult to guess their purposes just by looking at the code. I think you're not the only one with this problem... :/ Note that there is also an illustrious resetPosition() method, which currently seems to be used nowhere. All references to it seem to be circular. BlockLayoutManager.resetPosition() is only called by AbstractLayoutManager.reset(), which in turn is only called by one of the other LM's resetPosition(), etc. The closest understanding that I have of those Positions is still a bit limited, I'm afraid, so I'm hoping others chime in to correct my mistakes or offer further clarification. (Just noticed that Luca already did; will share my thoughts in any case, to see if I may have misinterpreted something) Anyway, AFAIU, a Position can be viewed as a kind of a pointer to a position within the FObj. The significance is roughly: - when the initial element list is constructed, each generated ListElement receives a Position - after the breaking is done, and the areas get added, the Position is used to get to the 'part' of the FO that corresponds to that particular ListElement (and the right AreaInfos) Auxiliary elements get null as a Position, since they generally do not correspond to anything in the FO source. They are only added to 'steer' the breaking-algorithm in a certain direction. Each break possibility can be said to correspond to such a Position. In the addAreas() phase, IIC, the set of chosen break-points will be presented as a List of Positions which are then iterated over (see: PositionIterator, which is passed to addAreas()) Perhaps this is easiest to follow by means of an example: When the TextLM creates the base element list, each Position it generates and associates with one of its ListElements --a LeafPosition in this case-- will correspond roughly to an index in the character array... Actually, that's an index into the List of AreaInfos, and those ultimately point to an index in the char array. The AreaInfos correspond to words or fragments of hyphenated words. The line-layout algorithm relies solely on the generated ListElements, and does not deal with the Positions in the content directly, nor with the AreaInfos. Preserved linefeeds, for example, are 'translated' into separate KnuthInlineSequences by the TextLM. All the LineLayoutManager gets to see (collectInlineKnuthElements()) is a one-element sequence containing nothing but a penalty (forced break), which it interprets as a signal to end the current Paragraph, adding only an auxiliary glue to generate an empty line if the Paragraph contains nothing but the forced break. Come to think of it, maybe this is a bad example, precisely because no Position corresponds to the linefeed. :/ There is no AreaInfo, the index is simply skipped. (see: TextLM.getNextKnuthElements() line 709) Anyway, as far as I get the picture, the whole breaking-algorithm operates entirely upon the ListElements, and ultimately translates the set of chosen break-possibilities (= positions in the element lists) into a List of Positions (associated to the ListElements), and this latter list serves as base list for the PositionIterator that is passed to addAreas(). What's the purpose of a LeafPosition? Of a NonLeafPosition? IIC, then the key difference between them is that a LeafPosition will have no sub-positions. They only have a position-index within a NonLeafPosition. If that clarifies anything...? What's the purpose of the wrapPositionElements method in BlockStackingLayoutManager? Subsidiary question: why would one sometimes force the wrapping, sometimes not? Judging from the code, in general this only has effects if the calling LM is not the LM that is associated with a given ListElement. The only exception seems to be ListItemLayoutManager, which forces wrapping regardless of the base LM. The purpose seems to be to create auxiliary Positions, such that an element associated with a LeafPosition from one LM's point of view, can be transformed into one that is associated with a NonLeafPosition without losing the original Positions' info. That is all: as far as I get it... Cheers, Andreas
Re: Position, Leaf/NonLeafPosition, wrapping positions
On 07.03.2007 15:28:37 Andreas L Delmelle wrote: On Mar 7, 2007, at 09:07, Vincent Hennebert wrote: Hi Vincent, I have some questions regarding the handling of Position elements. I'm not familiar with that part of the code yet, and as there is little or no javadoc for those it's a bit difficult to guess their purposes just by looking at the code. I think you're not the only one with this problem... :/ Note that there is also an illustrious resetPosition() method, which currently seems to be used nowhere. All references to it seem to be circular. BlockLayoutManager.resetPosition() is only called by AbstractLayoutManager.reset(), which in turn is only called by one of the other LM's resetPosition(), etc. That's a left-over from the Keiron's and Karen's first approach before we introduced Knuth's algorithm. The resetting was used to go back to a certain position after deciding line/page breaks. When I merged in Luca's Knuth code I mostly removed most of this resetting because I thought we wouldn't need it anymore. Turns out I could have been wrong (changing available IPD topic). But maybe it's also good I didn't remove everything so we still know that there was such a mechanism. At some point we obviously have to resurrect it (if we need it again) or to remove it for good. The closest understanding that I have of those Positions is still a bit limited, I'm afraid, so I'm hoping others chime in to correct my mistakes or offer further clarification. (Just noticed that Luca already did; will share my thoughts in any case, to see if I may have misinterpreted something) snip/ That is all: as far as I get it... Oh, I think you pretty much got it. Jeremias Maerki