Re: White space handling Wiki page
On Tue, Nov 08, 2005 at 11:19:15AM +0800, Manuel Mall wrote: On Tue, 8 Nov 2005 04:40 am, Simon Pepping wrote: Step 2. Refinement: white-space-collapse Issue 1. The spec intentionally addresses only XML white space, because only such white space is manipulated by editors to obtain pretty printing. Point taken, although I have no experience with non western editors. Do they all use 0x20 for 'pretty printing'? The XML spec does not allow one to use other characters than XML white space for pretty printing, at least not in element content. It would result in an invalid XML file because PCDATA would be present where the DTD or schema does not allow it. That is even true for non-breaking-space, U+A0. ul liSome text./li /ul is valid XHTML, but ul #xA0;#xA0;#xA0;#xA0;liSome text./li /ul is not. In PCDATA it is slightly different. p This is some content. We wrap the lines at a narrow width. /p Formally these data are different from the case when the text of the paragraph were written in one line: spaces have been converted to linefeeds, and sequences of spaces have been inserted. The XML parser reports all linefeeds and spaces as character data to the application. But almost all applications treat the two cases as equivalent, certainly when the data are considered as textual data. It is exactly this convention that the FO spec tries to formalize. fo:block This is some content. We wrap the lines at a narrow width. /fo:block _is_ equivalent to the case when the text of the block were written in one line, due to the line-feed-treatment and white-space-collapse properties (at default values). Such a convention is not usually applied to non-XML-whitespace characters, and the FO spec shows no intention to do so. A side effect is that 'This is some content' is equivalent to 'This is some content', but that is not the case with any other character, even if that is considered as white space in some script. 
Example 2 = The space in fo:block.fo:block is suppressed because it is at the start of the block. Interesting - I agree that this is the intention but you don't find that sentence in the spec. In 1.1 this is covered by the deleting spaces at the beginning of a line under white-space-treatment / line building. Again the discussion is probably academic - we all agree what the expected outcome is. If we can derive that outcome from the spec or not is a very interesting discussion but won't change what we will do. This is covered under the notion that the start and end of an fo:block are equivalent to line breaks. And fo:blockfo:block does not generate an empty line. fo:block starts a new line, but that is not equivalent to a linefeed. When at the start of the nested fo:block there is no content in the line yet, it starts the same line. A similar thing happens in the case of /fo:block#x0A;/fo:block, which was discussed in an email thread. I assume you mean the discussion under linefeed-treatment=preserve. I am still confused about that because /fo:block#x0A;#x0A;/fo:block will generate one linefeed or should this create also none? Yes, I am referring to that discussion, and I quoted it wrong. The case is: #x0A;/fo:block. The linefeed creates a linebreak, /fo:block does not add another one since the line has already been ended. /fo:block#x0A;/fo:block should create one empty line, and /fo:block#x0A;#x0A;/fo:block two empty lines, I suppose. Nowhere in the spec is a conversion of tabs and CRs to spaces specified. Under 7.15.8 it says: preserve Specifies that any character flow object whose character is classified, before any linefeed-treatment handling is considered, as white space in XML, except for U+000A (linefeed) characters, shall be converted during the refinement process into a character flow object whose Unicode code point is U+0020 (space). But they removed it in 7.16.8 in the 1.1 draft. Regards, Simon -- Simon Pepping home page: http://www.leverkruid.nl
Re: White space handling Wiki page
On Oct 31, 2005, at 22:18, Andreas L Delmelle wrote: On Oct 27, 2005, at 06:29, Manuel Mall wrote: Actually something like: fo:block background-color=yellowword1fo:character character=#10;/fo:character character= /word2fo:character character= /word3fo:character character=#10;//fo:block currently causes an exception! The problem can be solved by a slight modification to OneCharIterator: * add a constructor with Character parameter (and member) * add a remove() implementation which makes Character's parent remove it from its list of child nodes Tested locally (very quickly), and seems to work nicely. If I get the chance to commit it in the next few days, I'll do so myself, but if you want to have a go, it's a pretty easy fix (adds up to about 10-15 LOC incl. javadocs :-)) Oops, been too quick. From an UnsupportedOperationException to a ConcurrentModificationException... The trick seems to be to introduce a small boolean 'discard' switch to the Character object, flip this upon calling OCIter.remove(), and have the Block/Inline later remove any of its characters marked as discardable, but do this (of course) only after the RecursiveCharIterator has finished --to avoid the childNodes list from being altered while it's being iterated over... Other option: store a list of the discardable space fo:characters at Block or Inline level, instead of marking the Character itself as such... A bit more than 15 LOC, but still quite doable. Cheers, Andreas
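The 'discard' switch described above amounts to a mark-then-sweep pattern. A minimal sketch, assuming a plain list of character nodes: CharNode, markRedundantSpaces and sweep are hypothetical names for illustration, not FOP classes or methods.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical sketch, not FOP code: flag nodes while iterating, remove them afterwards. */
final class DeferredRemovalSketch {

    /** Stand-in for an fo:character child; 'discard' plays the role of the boolean switch. */
    static final class CharNode {
        final char value;
        boolean discard;

        CharNode(char value) {
            this.value = value;
        }
    }

    static List<CharNode> fromString(String text) {
        List<CharNode> nodes = new ArrayList<>();
        for (char c : text.toCharArray()) {
            nodes.add(new CharNode(c));
        }
        return nodes;
    }

    /** Pass 1: walk the children and only flag redundant spaces; the list is not modified. */
    static void markRedundantSpaces(List<CharNode> children) {
        boolean previousWasSpace = false;
        for (CharNode node : children) {
            boolean isSpace = node.value == ' ';
            if (isSpace && previousWasSpace) {
                node.discard = true;
            }
            previousWasSpace = isSpace;
        }
    }

    /** Pass 2: once iteration has finished, drop every flagged node in one go. */
    static void sweep(List<CharNode> children) {
        children.removeIf(node -> node.discard);
    }

    static String text(List<CharNode> children) {
        StringBuilder sb = new StringBuilder();
        for (CharNode node : children) {
            sb.append(node.value);
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        List<CharNode> children = fromString("word1   word2");
        markRedundantSpaces(children);
        sweep(children);
        System.out.println(text(children)); // prints "word1 word2"
    }
}
```

Because the list is only modified after the for-each loop has ended, no ConcurrentModificationException can arise. The alternative mentioned above (a separate list of discardable characters kept at Block or Inline level) would replace the flag with a collection filled in pass 1.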
Re: White space handling Wiki page
On Nov 1, 2005, at 10:04, Manuel Mall wrote: I am sure it is doable - but is it worth it at this stage? Possibly after a better understanding of the white-space handling issues that whole current system needs revision? One problem with the current char iterator is that it iterates over inline boundaries which causes white space to be collapsed across those which according to the clarification of the WG is incorrect. IMO to implement the refinement step of the white space handling (which currently happens in the flow.Block object) we need an iterator which goes through all characters but indicates fo boundaries (not including fo:characters) so we can do: a) linefeed treatment across all characters; b) white space collapse across each consecutive section of implicit/explicit fo:characters, i.e. delimited by the start/end of fo's; c 1) white-space-treatment from the start of the fo:block to the first non white-space character; The iterator must also be able to either operate backwards or be able to be reset to a particular position (last non white space character) so we can do: c 2) white-space-treatment from the end of the fo:block backwards to the first non white-space character It must also support character deletions and character substitutions. Does that make sense? Very much. Precisely with that in mind, I've also been contemplating moving part of the whitespace-handling to inline-level. This would keep the nested inlines separated from the Block's own direct FOText descendants (and at the same time, in combination with the modification I already described, this would provide us with an opportunity to remove fo:characters from within the nested inlines -- which would become quite a pain if this removal is deferred to block- level) So the RecursiveCharIterator should only create Iterators over regular FOText or fo:characters that are direct descendants of the Block/Inline. 
FOText of nested FObjs should be left alone, since the whitespace will already be collapsed. IOW, it should stop being -- recursive? Currently, whitespace handling is triggered from the moment a Block encounters a child node that isn't FOText nor generates inline areas. At the basis this seems OK, the only difference I'd propose is that inlines do their own whitespace handling, so that *if* whitespace needs to be collapsed across fo boundaries --maybe there are cases?--, the block-level only needs to look at the first and last characters in an inline's text. Cheers, Andreas
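The a) / b) / c) steps quoted above can be sketched on plain strings. This is only an illustration under stated assumptions: each string stands for one run of text delimited by FO boundaries, linefeed treatment (step a) has already turned linefeeds into spaces, and every name here is hypothetical rather than FOP API.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

/** Hypothetical sketch: collapse per run, then trim from both ends of the block. */
final class BoundaryAwareRefinementSketch {

    /** Step b): collapse each sequence of spaces inside ONE run to a single space. */
    static String collapseSpaces(String run) {
        StringBuilder out = new StringBuilder(run.length());
        boolean previousWasSpace = false;
        for (int i = 0; i < run.length(); i++) {
            char c = run.charAt(i);
            if (c != ' ' || !previousWasSpace) {
                out.append(c);
            }
            previousWasSpace = c == ' ';
        }
        return out.toString();
    }

    static String dropLeadingSpaces(String s) {
        int start = 0;
        while (start < s.length() && s.charAt(start) == ' ') {
            start++;
        }
        return s.substring(start);
    }

    static String dropTrailingSpaces(String s) {
        int end = s.length();
        while (end > 0 && s.charAt(end - 1) == ' ') {
            end--;
        }
        return s.substring(0, end);
    }

    /** Runs are the text sections of one block, in document order. */
    static List<String> refine(List<String> runs) {
        List<String> out = new ArrayList<>();
        for (String run : runs) {
            out.add(collapseSpaces(run)); // never collapses across a boundary
        }
        // Step c 1): forward from the block start to the first non-space character.
        for (int i = 0; i < out.size(); i++) {
            out.set(i, dropLeadingSpaces(out.get(i)));
            if (!out.get(i).isEmpty()) {
                break;
            }
        }
        // Step c 2): backward from the block end to the last non-space character.
        for (int i = out.size() - 1; i >= 0; i--) {
            out.set(i, dropTrailingSpaces(out.get(i)));
            if (!out.get(i).isEmpty()) {
                break;
            }
        }
        return out;
    }

    public static void main(String[] args) {
        List<String> runs = Arrays.asList("   A   ", "   Text   ", "   B   ");
        System.out.println(String.join("|", refine(runs))); // prints "A | Text | B"
    }
}
```

With the runs "   ", "   Text   ", "   " (the example without A and B elsewhere in this thread) the same code leaves only "Text", because the trimming in c 1) and c 2) continues across the empty outer runs. Whether a border should stop that trimming is exactly the fence question under discussion, so this sketch takes no position on it beyond following the steps as listed.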
Re: White space handling Wiki page
On Oct 27, 2005, at 06:29, Manuel Mall wrote:

Manuel, Some more on this example:

Actually something like: <fo:block background-color="yellow">word1<fo:character character="&#10;"/><fo:character character=" "/>word2<fo:character character=" "/>word3<fo:character character="&#10;"/></fo:block> currently causes an exception!

I think I see the problem (don't know if you've seen it that way): fop.fo.flow.Character overrides FONode.charIterator(), which returns a OneCharIterator over its own character. The spaces surrounding the linefeed get recognized as discardable whitespace, so remove() is called from within the RecursiveCharIterator for the surrounding block. The superclass then throws an UnsupportedOperationException, because OneCharIterator doesn't have an implementation for remove(). This may be the reason the example didn't work properly.

The problem can be solved by a slight modification to OneCharIterator:
* add a constructor with Character parameter (and member)
* add a remove() implementation which makes Character's parent remove it from its list of child nodes

Tested locally (very quickly), and seems to work nicely. If I get the chance to commit it in the next few days, I'll do so myself, but if you want to have a go, it's a pretty easy fix (adds up to about 10-15 LOC incl. javadocs :-))

Cheers, Andreas
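The two-point modification could look roughly like the following. It is a sketch only: ParentNode, CharNode and SingleCharIterator are hypothetical stand-ins built on java.util.Iterator, not the actual FOP Character, FONode or OneCharIterator classes, whose signatures are not shown in this thread.

```java
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;
import java.util.NoSuchElementException;

/** Hypothetical sketch of a one-element iterator whose remove() detaches its owner. */
final class SingleCharIteratorSketch {

    /** Stand-in for the parent FO holding a list of child nodes. */
    static final class ParentNode {
        final List<CharNode> children = new ArrayList<>();

        CharNode addChar(char value) {
            CharNode child = new CharNode(this, value);
            children.add(child);
            return child;
        }
    }

    /** Stand-in for an fo:character node that knows its parent. */
    static final class CharNode {
        final ParentNode parent;
        final char value;

        CharNode(ParentNode parent, char value) {
            this.parent = parent;
            this.value = value;
        }

        Iterator<Character> charIterator() {
            return new SingleCharIterator(this);
        }
    }

    /** Iterates over exactly one character and keeps a reference to the owning node. */
    static final class SingleCharIterator implements Iterator<Character> {
        private final CharNode owner;
        private boolean returned;
        private boolean removed;

        SingleCharIterator(CharNode owner) {
            this.owner = owner;
        }

        @Override
        public boolean hasNext() {
            return !returned;
        }

        @Override
        public Character next() {
            if (returned) {
                throw new NoSuchElementException();
            }
            returned = true;
            return owner.value;
        }

        /** Instead of throwing UnsupportedOperationException, ask the parent to drop the node. */
        @Override
        public void remove() {
            if (!returned || removed) {
                throw new IllegalStateException();
            }
            owner.parent.children.remove(owner);
            removed = true;
        }
    }

    public static void main(String[] args) {
        ParentNode block = new ParentNode();
        CharNode space = block.addChar(' ');
        Iterator<Character> it = space.charIterator();
        it.next();
        it.remove();
        System.out.println(block.children.size()); // prints 0
    }
}
```

One caveat that a follow-up in this thread raises: if remove() is called while an outer iterator is still walking the parent's child list, the direct removal can end in a ConcurrentModificationException, so the removal may have to be deferred.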
Re: White space handling Wiki page
If it's any consolation in this discussion, I wrote a little HTML test to see how browsers deal with these white space (and some line height) issues. To mimic the XSL-FO situation I used only div and span. You can see the results here: http://people.apache.org/~manuel/fop/test5.html I viewed this page with IE 6, Firefox and Opera under Windows, and with Firefox and Konqueror under Linux. There are rendering differences, some subtle and some quite significant, between all 5 browser variants, although IE stands out as differing most from the rest. If they cannot sort out white space handling (and line height) in HTML (billions of users, designers, spec writers, reviewers, ...) how should we have a chance :-)? Manuel
Re: White space handling Wiki page
Andreas, excellent - I think there is now lots of convergence and common understanding between your and my interpretations. A bit more inline. On Fri, 28 Oct 2005 04:58 am, Andreas L Delmelle wrote: On Oct 25, 2005, at 10:57, Manuel Mall wrote: When FOP is collapsing (b) or removing (c) white space are there any fences we need to observe. For example a border/padding between two spaces, e.g. (spaces represented by a .): fo:block...fo:inline border=..Text .../fo:inline.../fo:block There are 4 sequences of 3 spaces each. What would we expect the final outcome to be (assuming it fits on one line): a) all removed: [border]Text[border] b) only first and last removed: [border].Text.[border] c) first, 2nd and last removed: [border]Text.[border] d) ??? To me b) makes sense. However, a) is the HTML way and c) seems what RenderX and AntennaHouse are doing. What do we want to do? Having read that 1.1 definition more closely now, I'd say a). Somehow it begins to fall into place... I fully agree with your analysis of the spec here, that is saying option a) is the most likely answer that can be derived from the wording in the spec. Is it the intended / most sensible answer? A bit more below. snip/ And what about this: fo:block...A...fo:inline border=..Text .../fo:inline...B.../fo:block a) all removed: A[border]Text[border]B b) only first and last removed: A.[border].Text.[border].B c) only first and last removed and others collapsed across the borders: A.[border]Text.[border]B d) ??? a) is most likely wrong, b) looks OK, c) is the HTML way. Same thinking here, b) seems to be the way to go. We agree but did you notice the difference it would make in visual appearance if the inline just happens to be at the beginning / end of the line if we follow option a) from the first example? 
That is if we have a line break after the A you would get: [border]Text.[border].B If we have a line break before the B you would get: A.[border].Text[border].B That is depending on where the linebreaks are there would be a space or not between the border and the word 'Text'. It is these 'strange' or 'unsymmetric' outcomes which made me think that a border should possibly act like a fence with respect to whitespace removal (option b) in the first example). I wonder... what if: 1. as much as possible of the whitespace handling is done in the FO parsing stage (before any LayoutManager is created) Yes, isn't that what the refinement stage is about (partly)? 2. after linefeed-treatment is handled, all remaining whitespace characters are converted internally into fo:characters This is precisely what the definition of fo:character seems to prescribe for all characters, but that may be overkill (?) Yes, logically - practically within FOP no because creating separate FOs and possibly areas for each character is most likely prohibitive in terms of memory consumption and processing. But logically FOP should behave as if that is what is happening. Especially if we want to implement Unicode compliant line breaking, bidi, etc. This needs to be done on a per paragraph basis and not on a per 'text section' basis as is now. That is analysis where a line break opportunity is must go across inline boundaries, include fo:characters, etc.. As such, all those whitespace characters would get a default suppress- at-line-break of auto, meaning: for the plain old space --U+0020-- suppress, and retain for all the others. So, in case of linefeed- treatment=preserve: fo:block l-t=p#x0A;/fo:block is the same as fo:block l-t=pfo:character character=#x0A; suppress-at-line-break=retain ...//fo:block Which should IIC, in terms of layout, create something like a penalty of -INFINITE (= effect should be a forced line-break), but the effect of surrounding feasible breaks should be taken into account. 
In case one is wondering: with default white-space-treatment and preserved linefeeds, this means that if a linefeed glyph-area immediately follows another line-break (start-block)...? Empty line or not? The glyph-area is not deleted, and it should be the last area of the line-area subset it occurs in, so I'm inclined to say: yes, empty line. Agree Anyway, this would definitely mean something in terms of treating whitespace consistently and uniformly, whether the stylesheet author used explicit fo:characters or not. At the very least the treatment between characters and fo:characters should be normalized in *some* way vis-a-vis the layout-engine. Agree - It probably means when we operate on this stuff we need to use character iterators which go through 'plain text', nested inline, fo:characters, etc. transparently. The other way around is certainly impossible, since we'd lose the original fo:character's property info which is used during layout. In between lies an idea of a temporary whitespace map, into which both types of whitespace chars are
Re: White space handling Wiki page
Manuel Mall wrote: Side note: FOP doesn't quite do the same internally, i.e. a character explicitly specified using fo:character.../ is handled separately from 'plain text'. If someone would write a style sheet which does a transform of every character into a fo:character / object and would feed the output to FOP the formatting results would be lets say VERY DISAPPOINTING. Actually something like: fo:block background-color=yellowword1fo:character character= /fo:character character= /word2fo:character character= /word3fo:character character= //fo:block currently causes an exception! This is a problem of the whitespace-related code, but anyway the CharacterLM always creates a sequence of element corresponding to a non-space character, so the only feasible breaks recognized by the algorithm would be the hyphenation points inside the words ... I think that just as TextArea and Character both extend an AbstractTextArea, TextLM and CharLM should have a common super class holding the createElementsFor*() methods. It would not be necessary to add a SpaceArea or a WordArea child to a Character area, anyway (but we could decide to do it anyway just for analogy). Regards Luca
Re: White space handling Wiki page
On Fri, 28 Oct 2005 11:08 pm, Luca Furini wrote:

Manuel Mall wrote: Side note: FOP doesn't quite do the same internally, i.e. a character explicitly specified using <fo:character.../> is handled separately from 'plain text'. If someone would write a style sheet which does a transform of every character into a <fo:character /> object and would feed the output to FOP the formatting results would be let's say VERY DISAPPOINTING. Actually something like: <fo:block background-color="yellow">word1<fo:character character=" "/><fo:character character=" "/>word2<fo:character character=" "/>word3<fo:character character=" "/></fo:block> currently causes an exception!

This is a problem of the whitespace-related code, but anyway the CharacterLM always creates a sequence of elements corresponding to a non-space character, so the only feasible breaks recognized by the algorithm would be the hyphenation points inside the words ...

I think that just as TextArea and Character both extend an AbstractTextArea, TextLM and CharLM should have a common super class holding the createElementsFor*() methods. It would not be necessary to add a SpaceArea or a WordArea child to a Character area, anyway (but we could decide to do it anyway just for analogy).

Yes I agree but it is IMO a bit more complicated. The Unicode line breaking algorithm does require more than one character to make decisions. Simple example: No break after/before an opening/closing punctuation, e.g. (, [, ], ) etc. So in a sequence like ( HELP ) neither the space following ( nor the space preceding ) would be a legal break opportunity. If someone would write: ...(<fo:inline font-weight="bold"> HELP </fo:inline>)... then even if the current getNextKnuth functions would implement the Unicode algorithm we still would create a break opportunity for the spaces, because the fo snippet above would generate 3 calls to getNextKnuth: 3 different LMs are created, one for '...(', one for ' HELP ', and the last for ')...', and each does its analysis limited to its own piece of text. Therefore having the createElementsFor*() methods centralised solves only part of the problem.

Regards Luca

Cheers Manuel
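The ( HELP ) example can be made concrete with a toy analysis. This is a deliberately simplified sketch, not the Unicode line breaking algorithm and not FOP's getNextKnuth logic: it applies only the single rule named above (no break after an opening or before a closing punctuation character), and all names are hypothetical.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

/** Hypothetical sketch: the same rule applied per text segment and per whole paragraph. */
final class LineBreakContextSketch {

    static boolean isOpening(char c) {
        return c == '(' || c == '[';
    }

    static boolean isClosing(char c) {
        return c == ')' || c == ']';
    }

    /** A space is a break opportunity unless it follows an opening or precedes a closing char. */
    static List<Integer> breakOpportunities(String text) {
        List<Integer> result = new ArrayList<>();
        for (int i = 0; i < text.length(); i++) {
            if (text.charAt(i) != ' ') {
                continue;
            }
            boolean afterOpening = i > 0 && isOpening(text.charAt(i - 1));
            boolean beforeClosing = i + 1 < text.length() && isClosing(text.charAt(i + 1));
            if (!afterOpening && !beforeClosing) {
                result.add(i);
            }
        }
        return result;
    }

    /** Each segment is analysed alone, as when every layout manager sees only its own text. */
    static List<Integer> perSegment(List<String> segments) {
        List<Integer> result = new ArrayList<>();
        int offset = 0;
        for (String segment : segments) {
            for (int index : breakOpportunities(segment)) {
                result.add(offset + index);
            }
            offset += segment.length();
        }
        return result;
    }

    /** The segments are joined first, so neighbouring characters across boundaries are visible. */
    static List<Integer> wholeParagraph(List<String> segments) {
        return breakOpportunities(String.join("", segments));
    }

    public static void main(String[] args) {
        List<String> segments = Arrays.asList("(", " HELP ", ")");
        System.out.println(perSegment(segments));     // prints [1, 6]
        System.out.println(wholeParagraph(segments)); // prints []
    }
}
```

The per-segment run reports both spaces as break opportunities because the segment ' HELP ' cannot see the parentheses in its neighbours. The paragraph-level run sees them and reports none.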
Re: White space handling Wiki page
On Oct 28, 2005, at 12:28, Manuel Mall wrote: On Fri, 28 Oct 2005 04:58 am, Andreas L Delmelle wrote: (second example) Same thinking here, b) seems to be the way to go. We agree but did you notice the difference it would make in visual appearance if the inline just happens to be at the beginning / end of the line if we follow option a) from the first example? That is if we have a line break after the A you would get: [border]Text.[border].B If we have a line break before the B you would get: A.[border].Text[border].B Errm, typo? I'd delete the space before 'B' as well, so fully: A.[border].Text[border] B That is depending on where the linebreaks are there would be a space or not between the border and the word 'Text'. It is these 'strange' or 'unsymmetric' outcomes which made me think that a border should possibly act like a fence with respect to whitespace removal (option b) in the first example). In a certain way, yes. The white-spaces before 'A' and after 'B' can be literally removed from the stream (so that they don't have a corresponding glyph-area; why create one if you already know it's going to be deleted much further on?), while the other space- sequences can at most be collapsed to one space. These single spaces will automatically have glyph-areas which will or will not be deleted, depending on whether a line-break precedes/follows. So I agree, but I don't think the borders need to be explicitly tracked/checked for this, as they coincide with the boundaries of the inline anyway. The effect of the border acting as a fence should more be seen as a consequence, following naturally from the process of whitespace handling. It's the element borders --but in XML markup terms, not the presence of border properties-- that act as 'fences' (quoted since the term is not really applicable at that level). My main point was the difference between blocks and inlines in this respect. For instance, the following different possibilities: 1) fo:block A ... 2) fo:block A ... 
3) fo:block #x0A;A ... 4) fo:block#x09;#x0D;#x0A;#x20; A ... would all be treated during layout as if they were fo:blockA ... (supposing default values for all related properties) The space between 'A' and '...' always remains --whether the '...' refers to content or markup for a nested inline-- but the spaces between the start-block markup and the character 'A' are all dropped (=implicit line-break immediately preceding). Analogous for XML whitespace between the last non-whitespace char and the end-block markup. For inlines, this becomes nearly the opposite: only if the current inline FO is the first child-node to its parent (first inline in a block, no preceding characters), then we could cheat and throw away any whitespace between start-inline and the first non-whitespace, but as a general ROT, those white-spaces can at most be collapsed to a single space, since they could end up in the middle of a line area. Generally, inserting spaces, tabs or linefeeds as the first/last characters of a fo:block should make no difference, but for a fo:inline this would always result in an extra space in the output if it ends up in the middle of a line. Talking nested blocks: fo:block A fo:block B ... is the same as fo:blockAfo:blockB ... or fo:blockA fo:blockB The above doesn't hold for nested inlines, hence: beware of indent=yes in XSLT. In case of deeply nested inlines, this could result in the number of spaces in the output increasing with the depth of the fo:inline in the source document. :-) snip / 2. after linefeed-treatment is handled, all remaining whitespace characters are converted internally into fo:characters This is precisely what the definition of fo:character seems to prescribe for all characters, but that may be overkill (?) Yes, logically - practically within FOP no because creating separate FOs and possibly areas for each character is most likely prohibitive in terms of memory consumption and processing. 
But logically FOP should behave as if that is what is happening. Especially if we want to implement Unicode compliant line breaking, bidi, etc. This needs to be done on a per paragraph basis and not on a per 'text section' basis as is now. That is analysis where a line break opportunity is must go across inline boundaries, include fo:characters, etc.. Not necessarily separate FOs, but the same type of LayoutManager would probably be more in the right direction. CharLM (or subclass?) should be able to operate on either an attached fo:character or a simple char instance variable; instantiated either from an explicit fo:character object, or by the TextLM responsible for the larger context from a Unicode whitespace character it encounters (instead of creating the elements for whitespace itself, the TextLM instantiates a CharLM to delegate?) At the same time, the TextLM's operating context for line-breaking
Re: White space handling Wiki page
On Wed, 26 Oct 2005 06:22 am, Andreas L Delmelle wrote: On Oct 25, 2005, at 10:57, Manuel Mall wrote: /snip No, it talks about 'character flow objects', which makes me wonder... Are all characters to be considered 'character flow objects' or only those that were specified using fo:character? Not that it would make a big difference, I think. See bottom of page 3 (PDF version) and top of page 4 of the spec. There it talks about 'objectifying' the XML elements and attributes which includes converting characters into character FO's. From then on the spec always means the value of the character property of a fo:character object when talking about characters and their values. So the answer to your above question is: YES - all characters are 'character flow objects'. Side note: FOP doesn't quite do the same internally, i.e. a character explicitly specified using fo:character.../ is handled separately from 'plain text'. If someone would write a style sheet which does a transform of every character into a fo:character / object and would feed the output to FOP the formatting results would be lets say VERY DISAPPOINTING. Actually something like: fo:block background-color=yellowword1fo:character character=#10;/fo:character character= /word2fo:character character= /word3fo:character character=#10;//fo:block currently causes an exception! Cheers, Andreas Cheers Manuel
Re: White space handling Wiki page
On Wed, 26 Oct 2005 06:22 am, Andreas L Delmelle wrote: On Oct 25, 2005, at 10:57, Manuel Mall wrote: snip/ The right order in which the related properties should be dealt with seems to be: 1. white-space-treatment (property refinement) 2. linefeed-treatment (property refinement) 3. white-space-collapse (layout/area tree construction) 4. suppress-at-line-break (layout/area tree construction) We are very close here in our mutual opinions - if you look at my revised algorithm on the Wiki page it is nearly the same as your 4 steps above. THAT'S GOOD !!! And what do they say: Great minds think alike :-) Cheers, Andreas Cheers Manuel
Re: White space handling Wiki page
Hi, I haven't got any technical comments to the issues raised on the Wiki page. Is this 'too hard' or 'too boring' or 'too messy' or what? The problem is not going away. We currently don't do it right in some parts (that is established) but I don't know overall what is right or wrong. Maybe if I ask for comments on an issue by issue basis we get somewhere?

Quick background: In the default case (which seems to be the most complicated) white space handling consists of 3 things -
a) Replace any white space that is not a space char with a space char = easy.
b) Collapse any sequence of spaces to a single space.
c) Remove any spaces at the beginning and end of lines.

First issue for b) and c) (and it may have different answers for b) and c)): In other places the spec has a concept of a fence as a boundary across which certain operations do not apply, e.g. space resolution. When FOP is collapsing (b) or removing (c) white space, are there any fences we need to observe? For example a border/padding between two spaces, e.g. (spaces represented by a .):

fo:block...fo:inline border=..Text .../fo:inline.../fo:block

There are 4 sequences of 3 spaces each. What would we expect the final outcome to be (assuming it fits on one line):
a) all removed: [border]Text[border]
b) only first and last removed: [border].Text.[border]
c) first, 2nd and last removed: [border]Text.[border]
d) ???

To me b) makes sense. However, a) is the HTML way and c) seems what RenderX and AntennaHouse are doing. What do we want to do?

And what about this:

fo:block...A...fo:inline border=..Text .../fo:inline...B.../fo:block

a) all removed: A[border]Text[border]B
b) only first and last removed: A.[border].Text.[border].B
c) only first and last removed and others collapsed across the borders: A.[border]Text.[border]B
d) ???

a) is most likely wrong, b) looks OK, c) is the HTML way.

Manuel
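For reference, the three default steps from the quick background can be sketched on a plain string. The sketch assumes that the whole string ends up on a single line and that no inline boundaries or fences are involved, so it says nothing about the open questions above; the class and method names are hypothetical.

```java
/** Hypothetical sketch of the default case: replace, collapse, trim. */
final class DefaultWhiteSpaceSketch {

    /** XML white space: space, tab, carriage return, linefeed. */
    static boolean isXmlWhiteSpace(char c) {
        return c == ' ' || c == '\t' || c == '\r' || c == '\n';
    }

    /** Steps a) and b): treat every XML white space char as U+0020, then collapse runs. */
    static String replaceAndCollapse(String text) {
        StringBuilder out = new StringBuilder(text.length());
        boolean previousWasSpace = false;
        for (int i = 0; i < text.length(); i++) {
            char c = text.charAt(i);
            if (isXmlWhiteSpace(c)) {
                if (!previousWasSpace) {
                    out.append(' ');
                }
                previousWasSpace = true;
            } else {
                out.append(c);
                previousWasSpace = false;
            }
        }
        return out.toString();
    }

    /** Step c): remove spaces at the beginning and end of one line. */
    static String trimLine(String line) {
        int start = 0;
        int end = line.length();
        while (start < end && line.charAt(start) == ' ') {
            start++;
        }
        while (end > start && line.charAt(end - 1) == ' ') {
            end--;
        }
        return line.substring(start, end);
    }

    public static void main(String[] args) {
        String raw = "  This is\n   some\tcontent.  ";
        // prints "[This is some content.]"
        System.out.println("[" + trimLine(replaceAndCollapse(raw)) + "]");
    }
}
```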
Re: White space handling Wiki page
Hi Manuel, On Tue, Oct 25, 2005 at 04:57:41PM +0800, Manuel Mall wrote: Hi, I haven't got any technical comments to the issues raised on the Wiki page. Is this 'too hard' or 'too boring' or 'too messy' or what? The problem is not going away. We currently don't do it right in some parts (that is established) but I don't know overall what is right or wrong. May be if I ask for comments on an issue by issue basis we get somewhere? You and Jeremias both turning in so much work is too much to follow. I really would like to comment on your analysis, but I do not right now have time for it. Perhaps in one of the next weeks. Simon -- Simon Pepping home page: http://www.leverkruid.nl
Re: White space handling Wiki page
Andreas, firstly a great thanks for looking at this. I am not going to comment on your comments right now but there is probably an important clarification required: All my interpretations of the spec with respect to white space handling are based on the 1.1WD not the 1.0 spec. The WG has already confirmed that the description of white space handling in 1.0 is flawed and has rewritten that part in 1.1. As I mentioned before I believe their intention was not to change the behaviour they wanted to achieve between 1.0 and 1.1 but to document it more correctly. Therefore I think it is correct and advisable to refer to 1.1 in this case. Could you please review your comments under this aspect as I believe that would clarify why I refer to line breaking vs linefeed, glyph areas vs fo's, etc. below. Which are some of the items you questioned. Manuel On Wed, 26 Oct 2005 06:22 am, Andreas L Delmelle wrote: On Oct 25, 2005, at 10:57, Manuel Mall wrote: Hi Manuel, I haven't got any technical comments to the issues raised on the Wiki page. Is this 'too hard' or 'too boring' or 'too messy' or what? All three, and more :-P Nah, seriously now, trying to comment in on the thread from last week: On Oct 19, 2005, at 03:45, Manuel Mall wrote: On Wed, 19 Oct 2005 05:44 am, Jeremias Maerki wrote: You place the white-space-treatment after the white-space-collapse but I think it is clear that the latter comes last (during area tree construction =after line breaking vs. during line-building and inline-building =before line-breaking). Yes I agree that this is a critical interpretation issue and I expected that part of the algorithm to be controversial. The problem is that in the description of value true for the white-space-collapse property it clearly refers to the fo tree and fo:character siblings in the tree. That was further clarified by an e-mail on the xsl editor list http://lists.w3.org/Archives/Public/xsl-editors/2002OctDec/0004. 
Once we have done line-building the fo tree (at fo:character level) is largely gone, we have now glyph areas which could have been merged, ligatures combined, etc.. That means referring at this stage back to fo:character siblings in the fo tree seems lets say unusual. Correction: the FO tree isn't 'gone' until layout for the page- sequence is finished. Only at that time are all the FObjs released. It may seem unusual to refer back to fo:character siblings, but that doesn't mean it's wrong. The corresponding FObj is still available to us at that point... The fact that we are dealing with glyph areas and not fo:character elements in line-building / area tree construction is further emphasised by the description of the handling for the white-space-treatment property. It is all defined in terms of glyph areas not fo:characters. Errm... I seem to be missing something, or aren't you talking about the XSL-FO Rec here? I can't find any reference to the term 'glyph' or 'glyph area' in the description of white-space-treatment. Many references to 'character flow object' and 'XML whitespace' though. (XSL 1.0) The passage about line-building (4.7.2) indeed talks about glyph areas, but only seems to mention suppress-at-line-break (a property which, incidentally, only applies to fo:character objects). Besides, the question that needs to be posed when handling white- space-collapse is: Should this whitespace character generate an area or not? The description in the Rec (XSL 1.0 -- 7.15.12) IMO clearly indicates that no layout information is needed to answer the above question, as it depends solely upon the character itself, the preceding and the following character. Further to this it doesn't make sense to me to collapse white space after line breaking as is implied by your interpretation because the amount of white space does contribute to the line breaking decisions. If we remove white space after line breaking we would IMO get sub optimal line breaks. 
In summary I think white space must be collapsed before or at least during line breaking but not after. 100% agreed. I wouldn't place it after line breaking either. It seems like a decision that can (and should?) be made *during* layout: Should this whitespace character generate an element or not? If it generates an element, it will also --if appropriate-- trigger the generation of an area. Another related issue is the description of collapsing white space around a linefeed in the spec under white-space-collapse. The spec is very specific and refers to U+000A (linefeed) fo:character siblings in the fo tree. No, it talks about 'character flow objects', which makes me wonder... Are all characters to be considered 'character flow objects' or only those that were specified using fo:character? Not that it would make a big difference, I think. This is obviously very different to removing white space around a line break generated
Re: White space handling Wiki page
I was close to temporarily throwing in the proverbial towel with respect to whitespace handling. However, an offline IM chat with Jeremias and writing the response to Stephen's post encouraged me to take a different approach. Instead of trying to understand whitespace handling on the basis of a detailed analysis of the spec, I looked at it from the perspective: What outcome did the spec writers most likely want to achieve? Of course the result is just a bunch of (educated?) guesses on my part. But based on my guesses I have posted a revised algorithm on the Wiki which I hope: a) Deals sensibly with unresolved/unclear issues b) Gives consistent results in the generated output c) Moves most of the work into refinement and only leaves whitespace handling around formatter generated breaks to layout d) Is understandable by others However, I do admit it is at this stage a home-cooked approach and certainly requires close scrutiny by others before any implementation into the FOP code base is attempted. Thanks Manuel
Re: White space handling Wiki page
Manuel Mall wrote: Manuel, I was close to temporarily throwing in the proverbial towel with respect to whitespace handling. However, an offline IM chat with Jeremias and writing the response to Stephen's post encouraged me to take a different approach. Instead of trying to understand whitespace handling on the basis of a detailed analysis of the spec, I looked at it from the perspective: What outcome did the spec writers most likely want to achieve? Of course the result is just a bunch of (educated?) guesses on my part. Indeed. This is the approach I agree with and recommend as far as the spec is concerned. As you've found out, the spec is often ambiguous. I think the best approach to implement any XSL-FO feature is: 1) work out what you think the XSL-FO WG intended. When doing this, I think it is important not to dwell too long on every single sentence - just get a gut feel of what the WG intended. 2) work through some use cases. 3) and add a pinch of common sense. But based on my guesses I have posted a revised algorithm on the Wiki which I hope: a) Deals sensibly with unresolved/unclear issues b) Gives consistent results in the generated output c) Moves most of the work into refinement and only leaves whitespace handling around formatter generated breaks to layout d) Is understandable by others However, I do admit it is at this stage a home-cooked approach and certainly requires close scrutiny by others before any implementation into the FOP code base is attempted. Of course this is just my 2 cents worth and may not be considered a good idea by others. Chris
Re: White space handling Wiki page
On 19.10.2005 03:45:33 Manuel Mall wrote: On Wed, 19 Oct 2005 05:44 am, Jeremias Maerki wrote: I've started to comment on the individual issues you listed and only when I got to the examples I realized there must be something wrong. You place the white-space-treatment after the white-space-collapse but I think it is clear that the latter comes last (during area tree construction = after line breaking vs. during line-building and inline-building = before line-breaking). That's why you run into problems explaining why there is no line generated by the white space between the two starting block elements. Maybe clearing this up might clear up some of the other issues. Jeremias, yes I agree that this is a critical interpretation issue and I expected that part of the algorithm to be controversial. The problem is that in the description of value true for the white-space-collapse property it clearly refers to the fo tree and fo:character siblings in the tree. That was further clarified by an e-mail on the xsl editor list http://lists.w3.org/Archives/Public/xsl-editors/2002OctDec/0004. That document contains a further indicator that white-space-treatment needs to be handled before white-space-collapse. Item 3 makes the intent clear even though the wording has been changed in the spec. OTOH the usage-context-of-suppress-at-line-break property that is mentioned in that document but never really surfaced diminishes the credibility (or importance) of the document a little. Once we have done line-building the fo tree (at fo:character level) is largely gone, we have now glyph areas which could have been merged, ligatures combined, etc. That means referring at this stage back to fo:character siblings in the fo tree seems, let's say, unusual. The fact that we are dealing with glyph areas and not fo:character elements in line-building / area tree construction is further emphasised by the description of the handling for the white-space-treatment property. 
It is all defined in terms of glyph areas not fo:characters. Further to this it doesn't make sense to me to collapse white space after line breaking as is implied by your interpretation because the amount of white space does contribute to the line breaking decisions. If we remove white space after line breaking we would IMO get suboptimal line breaks. In summary I think white space must be collapsed before or at least during line breaking but not after. Point. That was my bad. Still, I get the impression that *-treatment are both handled before white-space-collapse. Another related issue is the description of collapsing white space around a linefeed in the spec under white-space-collapse. The spec is very specific and refers to U+000A (linefeed) fo:character siblings in the fo tree. This is obviously very different to removing white space around a line break generated during line building. Also in the default case by the time we get to white-space-collapse handling all linefeeds would have been replaced by spaces during refinement. ...or hard breaks or discarded entirely. Which leads to the question: do they really mean that in the spec or did they really mean to remove white space around a line break? That's my impression. It's two properties that actually remove space around line breaks, but in two different stages. But then again that is really dealt with by the white-space-treatment property in much more detail. But why then do we need the duplication of white-space-collapse removing white space around a linefeed character and white-space-treatment removing white space (not quite actually - it removes characters with the suppress-at-line-break property being true) around line breaks? Because linefeed-treatment may generate spaces which are not yet handled by white-space-treatment but later picked up by white-space-collapse. At least that is my take. Anyway, this is like fishing in the dark. I have big trouble (again) understanding the spec. 
Obviously, you found a lot of little details that don't really resolve well. All we can do here is make guesses. This is really frustrating. Jeremias Maerki
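To make the interaction discussed above concrete, here is a minimal Python sketch of linefeed-treatment followed by white-space-collapse. It is only an illustration of one reading of the thread, not code from FOP and not a quotation of the Rec: the function names `apply_linefeed_treatment` and `collapse_white_space` are hypothetical, and the collapse rule is simplified to "drop an XML white space character (other than a linefeed) if the preceding character is XML white space or the following character is a linefeed".

```python
# Hypothetical sketch; names and the simplified rule are assumptions.
XML_WS = " \t\r\n"  # the four XML white space characters


def apply_linefeed_treatment(text, treatment="treat-as-space"):
    """Refinement step: rewrite U+000A according to linefeed-treatment."""
    if treatment == "ignore":
        return text.replace("\n", "")
    if treatment == "preserve":
        return text
    if treatment == "treat-as-space":
        return text.replace("\n", " ")
    if treatment == "treat-as-zero-width-space":
        return text.replace("\n", "\u200b")
    raise ValueError(f"unknown linefeed-treatment: {treatment}")


def collapse_white_space(text):
    """Simplified white-space-collapse="true": the decision for each
    character looks only at its immediate neighbours in the input."""
    kept = []
    for i, ch in enumerate(text):
        if ch in XML_WS and ch != "\n":
            prev_is_ws = i > 0 and text[i - 1] in XML_WS
            next_is_lf = i + 1 < len(text) and text[i + 1] == "\n"
            if prev_is_ws or next_is_lf:
                continue  # this character would not generate an area
        kept.append(ch)
    return "".join(kept)


if __name__ == "__main__":
    source = "a \n  b"
    for value in ("treat-as-space", "preserve", "ignore"):
        refined = apply_linefeed_treatment(source, value)
        print(value, repr(collapse_white_space(refined)))
```

Under this reading, the spaces produced by `treat-as-space` are picked up by the collapse step, while with `preserve` the linefeed survives and the white space on both sides of it disappears. Whether that matches what the spec writers intended is exactly the open question in this thread.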
Re: White space handling Wiki page
On Wed, 19 Oct 2005 03:33 pm, Jeremias Maerki wrote: On 19.10.2005 03:45:33 Manuel Mall wrote: On Wed, 19 Oct 2005 05:44 am, Jeremias Maerki wrote: <snip/> Still, I get the impression that *-treatment are both handled before white-space-collapse. Yes, linefeed-treatment is definitely a refinement activity (and pretty simple) and handled before white-space-collapse. But white-space-treatment is not refinement and this is where we differ (I think). white-space-treatment clearly depends on the line breaks generated, that is, you cannot do white-space-treatment without having the line breaks. But we can only generate appropriate line breaks if we have collapsed the white space before. To me this means, together with the fact that white-space-collapse is defined on FOs and white-space-treatment on glyph areas, that white-space-collapse precedes white-space-treatment. <snip/> Because linefeed-treatment may generate spaces which are not yet handled by white-space-treatment but later picked up by white-space-collapse. At least that is my take. See above. Anyway, this is like fishing in the dark. I have big trouble (again) understanding the spec. Obviously, you found a lot of little details that don't really resolve well. All we can do here is make guesses. This is really frustrating. I couldn't agree more Jeremias Maerki Manuel
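The ordering argument above (collapse first, then break lines, then trim at the breaks) can be illustrated with a toy example. This is a rough sketch under stated assumptions: `break_lines` is a deliberately naive greedy breaker that counts one unit of width per character, `trim_line_edges` is a crude stand-in for white space removal at formatter-generated breaks, and none of these names come from FOP or the spec.

```python
import re

# Hypothetical sketch; FOP's real line breaker works differently.
XML_WS_RUN = re.compile(r"[ \t\r\n]+")


def collapse(text):
    """Replace each run of XML white space by a single space."""
    return XML_WS_RUN.sub(" ", text)


def break_lines(text, width):
    """Naive greedy breaker: white space runs occupy width like any text."""
    lines, current = [], ""
    for token in re.findall(r"\S+|\s+", text):
        if current and len(current) + len(token) > width:
            lines.append(current)
            current = ""
        current += token
    if current:
        lines.append(current)
    return lines


def trim_line_edges(lines):
    """Stand-in for dropping white space at the start and end of a line."""
    return [ln.strip(" ") for ln in lines if ln.strip(" ")]


def collapse_first(text, width):
    return trim_line_edges(break_lines(collapse(text), width))


def break_first(text, width):
    return [collapse(ln) for ln in trim_line_edges(break_lines(text, width))]


if __name__ == "__main__":
    text = "aaa     bbb ccc"
    print(collapse_first(text, 8))  # ['aaa bbb', 'ccc']
    print(break_first(text, 8))     # ['aaa', 'bbb ccc']
```

With a width of 8, collapsing first lets "aaa bbb" share a line, whereas breaking first wastes the first line on white space that is later discarded. That is the "amount of white space contributes to the line breaking decisions" point in miniature.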
Re: White space handling Wiki page
On Thu, 20 Oct 2005 08:15 am, Stephen Denne wrote: I have not read the spec regarding these attributes recently, but I was wondering whether the treatment of whitespace in the fo file defaults to the normal whitespace treatment in xml files if no special white-space treatment attributes are specified. If I understand the XML spec correctly, whitespace handling is a job for the application, not the XML parser. That is, any character in an XML document outside of markup must be passed to the application by the parser. The only exception is #xD (carriage return), which is dropped if followed by #xA and otherwise replaced with a #xA. The exception to that is #xD written as a character reference (&#xD;), in which case it is given to the application. What I am trying to say is I don't think there is something like normal whitespace treatment in xml files. Each application has its own rules. In a typical FOP deployment scenario we have two applications involved: the XSLT processor and FOP itself. I haven't studied what XSLT says about whitespace transformations but I would assume whitespace in element content is passed through unchanged unless special transformations are specified in the stylesheet. So in the end it still boils down to: What does the XSL-FO spec say a conformant processor should do with whitespace in the input? The answer to that question is: I don't know because I don't understand some aspects of that in the spec. However, your question was more specific, like: Under the default settings is whitespace in the input significant? Again I cannot answer that conclusively because the spec, even when just looking at the default case, confuses me. But here is my gut feel on how it may be intended: 1. If we have only whitespace and nothing else between markup it should be ignored, that is <fo:...>&#x20;<fo:...> is the same as <fo:...><fo:...> and so is </fo:...>&#x20;</fo:...> and </fo:...>&#x20;<fo:...> Note: There may be an exception to this rule related to <fo:character/> objects. 2. 
If we have whitespace embedded within text (= surrounded by non whitespace characters) it collapses to a single space. 3. Any whitespace at the beginning or end of a line (this is line in the generated output not line in the input!) is dropped. If this interpretation is correct it means in some cases whitespace disappears completely, in others it reduces to a single space (which a possibly specified text justification can change in width in the output). I had an xsl stylesheet producing fo, that resulted in different rendered output (from FOP 0.20.5) if the fo xml document was pretty printed or not. This caused me a problem, as I was changing from running xml through the xslt engine, and serializing the result to a file (which was adding the pretty-printing) then supplying this fo file to FOP. The new process of sending the sax events from the xslt engine directly to FOP's content handler did not get any extra (supposedly meaningless) whitespace added, yet produced different rendering results. The workaround for that particular stylesheet was to add only a single <xsl:text> </xsl:text> to generate one space between two elements, resulting in a sax characters event, and restoring the desired rendering behaviour. (Note: the desire at the time was to get what was previously being produced, irrespective of any fo spec conformance.) I can't find the stylesheet now, but I thought that the location of that single space should have been meaningless as far as xml was concerned, and I'm now wondering whether the XSL-FO spec contradicts what I thought was normal whitespace treatment, when no whitespace related attributes are mentioned. (My thought at the time was that it was just a 0.20.5 bug.) From memory the fo was something like ...</fo:block><fo:table... vs ...</fo:block> <fo:table... I would say under default whitespace handling this should not make a difference to the produced output. Stephen Denne. HTH Manuel
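The three gut-feel rules above can be tried out on a small FO fragment. The following Python sketch is only an approximation under loud assumptions: it handles block-level text only (inline elements such as fo:inline would need the boundary spaces kept), it approximates rule 3 by stripping at the start and end of each text node, and the helper names `effective_text` and `block_texts` are made up for this illustration. Only the four XML white space characters are touched, so U+00A0 is left alone.

```python
import re
import xml.etree.ElementTree as ET

# Hypothetical sketch of the three rules; not FOP's algorithm.
XML_WS_RUN = re.compile(r"[ \t\r\n]+")


def effective_text(raw):
    """Rule 2: collapse runs of XML white space to one space.
    Rule 3 (approximated): drop white space at the text edges."""
    return XML_WS_RUN.sub(" ", raw or "").strip(" ")


def block_texts(fo_source):
    """Rule 1: text nodes holding nothing but white space vanish."""
    root = ET.fromstring(fo_source)
    return [t for t in map(effective_text, root.itertext()) if t]


COMPACT = (
    '<fo:block xmlns:fo="http://www.w3.org/1999/XSL/Format">'
    "<fo:block>Some text.</fo:block><fo:block>More text.</fo:block>"
    "</fo:block>"
)

PRETTY = """<fo:block xmlns:fo="http://www.w3.org/1999/XSL/Format">
  <fo:block>
    Some
    text.
  </fo:block>
  <fo:block>More   text.</fo:block>
</fo:block>"""

if __name__ == "__main__":
    print(block_texts(COMPACT))  # ['Some text.', 'More text.']
    print(block_texts(PRETTY))   # ['Some text.', 'More text.']
```

If the interpretation in the rules is right, the pretty-printed and the compact document should be equivalent in this sense, which is also what Stephen expected from his stylesheet output.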
White space handling Wiki page
I have started a white space handling Wiki page (http://wiki.apache.org/xmlgraphics-fop/LineLayout/WhitespaceHandling). As with many other areas of the spec it seems to raise more questions than it provides answers. I really would appreciate any comments, different views, clarifications, ... on what has been written so far. Cheers Manuel
Re: White space handling Wiki page
I've started to comment on the individual issues you listed and only when I got to the examples I realized there must be something wrong. You place the white-space-treatment after the white-space-collapse but I think it is clear that the latter comes last (during area tree construction = after line breaking vs. during line-building and inline-building = before line-breaking). That's why you run into problems explaining why there is no line generated by the white space between the two starting block elements. Maybe clearing this up might clear up some of the other issues. On 18.10.2005 11:14:09 Manuel Mall wrote: I have started a white space handling Wiki page (http://wiki.apache.org/xmlgraphics-fop/LineLayout/WhitespaceHandling). As with many other areas of the spec it seems to raise more questions than it provides answers. I really would appreciate any comments, different views, clarifications, ... on what has been written so far. Cheers Manuel Jeremias Maerki
Re: White space handling Wiki page
On Wed, 19 Oct 2005 05:44 am, Jeremias Maerki wrote: I've started to comment on the individual issues you listed and only when I got to the examples I realized there must be something wrong. You place the white-space-treatment after the white-space-collapse but I think it is clear that the latter comes last (during area tree construction = after line breaking vs. during line-building and inline-building = before line-breaking). That's why you run into problems explaining why there is no line generated by the white space between the two starting block elements. Maybe clearing this up might clear up some of the other issues. Jeremias, yes I agree that this is a critical interpretation issue and I expected that part of the algorithm to be controversial. The problem is that in the description of value true for the white-space-collapse property it clearly refers to the fo tree and fo:character siblings in the tree. That was further clarified by an e-mail on the xsl editor list http://lists.w3.org/Archives/Public/xsl-editors/2002OctDec/0004. Once we have done line-building the fo tree (at fo:character level) is largely gone, we have now glyph areas which could have been merged, ligatures combined, etc. That means referring at this stage back to fo:character siblings in the fo tree seems, let's say, unusual. The fact that we are dealing with glyph areas and not fo:character elements in line-building / area tree construction is further emphasised by the description of the handling for the white-space-treatment property. It is all defined in terms of glyph areas not fo:characters. Further to this it doesn't make sense to me to collapse white space after line breaking as is implied by your interpretation because the amount of white space does contribute to the line breaking decisions. If we remove white space after line breaking we would IMO get suboptimal line breaks. In summary I think white space must be collapsed before or at least during line breaking but not after. 
Another related issue is the description of collapsing white space around a linefeed in the spec under white-space-collapse. The spec is very specific and refers to U+000A (linefeed) fo:character siblings in the fo tree. This is obviously very different to removing white space around a line break generated during line building. Also in the default case by the time we get to white-space-collapse handling all linefeeds would have been replaced by spaces during refinement. Which leads to the question: do they really mean that in the spec or did they really mean to remove white space around a line break? But then again that is really dealt with by the white-space-treatment property in much more detail. But why then do we need the duplication of white-space-collapse removing white space around a linefeed character and white-space-treatment removing white space (not quite actually - it removes characters with the suppress-at-line-break property being true) around line breaks? <snip/> Manuel Jeremias Maerki Manuel