Re: White space handling Wiki page

2005-11-09 Thread Simon Pepping
On Tue, Nov 08, 2005 at 11:19:15AM +0800, Manuel Mall wrote:
 On Tue, 8 Nov 2005 04:40 am, Simon Pepping wrote:
 
  Step 2. Refinement: white-space-collapse
  
 
  Issue 1. The spec intentionally addresses only XML white space,
  because only such white space is manipulated by editors to obtain
  pretty printing.
 
 Point taken, although I have no experience with non western editors. Do 
 they all use 0x20 for 'pretty printing'?

The XML spec does not allow one to use other characters than XML white
space for pretty printing, at least not in element content. It would
result in an invalid XML file because PCDATA would be present where
the DTD or schema does not allow it. That is even true for
non-breaking-space, U+00A0.

   <ul>
   <li>Some text.</li>
   </ul>

is valid XHTML, but

   <ul>
&#xA0;&#xA0;&#xA0;&#xA0;<li>Some text.</li>
   </ul>

is not.

In PCDATA it is slightly different.

   <p>
  This is some content.
  We wrap the lines at
  a narrow width.
   </p>

Formally these data are different from the case when the text of the
paragraph were written in one line: spaces have been converted to
linefeeds, and sequences of spaces have been inserted. The XML parser
reports all linefeeds and spaces as character data to the
application. But almost all applications treat the two cases as
equivalent, certainly when the data are considered as textual data. It
is exactly this convention that the FO spec tries to formalize.

   <fo:block>
  This is some content.
  We wrap the lines at
  a narrow width.
   </fo:block>

_is_ equivalent to the case when the text of the block were written in
one line, due to the line-feed-treatment and white-space-collapse
properties (at default values).

Such a convention is not usually applied to non-XML-whitespace
characters, and the FO spec shows no intention to do so.

A side effect is that 'This is some content' is equivalent to
'This  is   some  content', but that is not the case with any
other character, even if that is considered as white space in
some script.
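
To illustrate the convention described above, here is a minimal Python sketch (not FOP code). It assumes the default case in which linefeeds are treated as spaces and runs of spaces collapse to one. Mapping tab and CR to a space as well is an assumption of the sketch, and `normalize_xml_whitespace` is a hypothetical name. Removal of spaces at the start and end of lines is left out.

```python
import re

# XML white space: space, tab, carriage return, linefeed.
XML_WS = " \t\r\n"


def normalize_xml_whitespace(text: str) -> str:
    """Turn every XML white space character into a space, then collapse runs."""
    as_spaces = text.translate({ord(c): " " for c in XML_WS})
    return re.sub(r" {2,}", " ", as_spaces)


wrapped = "\n  This is some content.\n  We wrap the lines at\n  a narrow width.\n"
one_line = " This is some content. We wrap the lines at a narrow width. "

# The wrapped and the one-line form normalize to the same string.
print(normalize_xml_whitespace(wrapped) == normalize_xml_whitespace(one_line))

# U+00A0 is not XML white space, so it is neither converted nor collapsed.
print(normalize_xml_whitespace("This\u00a0\u00a0is") == "This\u00a0\u00a0is")
```

Both comparisons hold only because the sketch restricts itself to the four XML white space characters, which is the point made above.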

  Example 2
  =
 
  The space in <fo:block>.<fo:block> is suppressed because it is at
  the start of the block.
 Interesting - I agree that this is the intention but you don't find that 
 sentence in the spec. In 1.1 this is covered by the deleting spaces at 
 the beginning of a line under white-space-treatment / line building. 
 Again the discussion is probably academic - we all agree what the 
 expected outcome is. If we can derive that outcome from the spec or not 
 is a very interesting discussion but won't change what we will do.

This is covered under the notion that the start and end of an fo:block are
equivalent to line breaks.

  And <fo:block><fo:block> does not generate
  an empty line. <fo:block> starts a new line, but that is not
  equivalent to a linefeed. When at the start of the nested <fo:block>
  there is no content in the line yet, it starts the same line. A
  similar thing happens in the case of </fo:block>&#x0A;</fo:block>,
  which was discussed in an email thread.
 I assume you mean the discussion under linefeed-treatment=preserve. I
 am still confused about that because
 </fo:block>&#x0A;&#x0A;</fo:block>
 will generate one linefeed or should this create also none?

Yes, I am referring to that discussion, and I quoted it
wrong. The case is: &#x0A;</fo:block>. The linefeed creates a
linebreak, </fo:block> does not add another one since the line has
already been ended. </fo:block>&#x0A;</fo:block> should create one
empty line, and </fo:block>&#x0A;&#x0A;</fo:block> two empty lines, I
suppose.

  Nowhere in the spec is a conversion of tabs and CRs to spaces
  specified.
 Under 7.15.8 it says:
 
 preserve
 
 Specifies that any character flow object whose character is 
 classified, before any linefeed-treatment handling is considered, as 
 white space in XML, except for U+000A (linefeed) characters, shall be 
 converted during the refinement process into a character flow object 
 whose Unicode code point is U+0020 (space).

But they removed it in 7.16.8 in the 1.1 draft.

Regards, Simon

-- 
Simon Pepping
home page: http://www.leverkruid.nl



Re: White space handling Wiki page

2005-11-01 Thread Andreas L Delmelle

On Oct 31, 2005, at 22:18, Andreas L Delmelle wrote:


On Oct 27, 2005, at 06:29, Manuel Mall wrote:

Actually something like:
<fo:block background-color="yellow">word1<fo:character
character="&#10;"/><fo:character character=" "/>word2<fo:character
character=" "/>word3<fo:character character="&#10;"/></fo:block>
currently causes an exception!




The problem can be solved by a slight modification to OneCharIterator:
* add a constructor with Character parameter (and member)
* add a remove() implementation which makes Character's parent  
remove it from its list of child nodes


Tested locally (very quickly), and seems to work nicely. If I get  
the chance to commit it in the next few days, I'll do so myself,  
but if you want to have a go, it's a pretty easy fix (adds up to  
about 10-15 LOC incl. javadocs :-))


Oops, been too quick. From an UnsupportedOperationException to a  
ConcurrentModificationException...
The trick seems to be to introduce a small boolean 'discard' switch  
to the Character object, flip this upon calling OCIter.remove(), and  
have the Block/Inline later remove any of its characters marked as  
discardable, but do this (of course) only after the  
RecursiveCharIterator has finished --to avoid the childNodes list  
from being altered while it's being iterated over...


Other option: store a list of the discardable space fo:characters at  
Block or Inline level, instead of marking the Character itself as  
such...


A bit more than 15 LOC, but still quite doable.
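
The 'discard' switch idea can be sketched as a mark-then-sweep pass. This is a Python analogue and not FOP's Java code: `CharNode`, `Block`, `mark_discardable_spaces` and `sweep` are hypothetical names, and the rule used to decide which spaces are discardable (a space that follows a linefeed or another space) is only an assumption chosen to keep the example short.

```python
class CharNode:
    """Hypothetical stand-in for an fo:character node."""

    def __init__(self, char):
        self.char = char
        self.discard = False  # flipped instead of removing the node right away


class Block:
    """Hypothetical stand-in for a Block/Inline that owns child nodes."""

    def __init__(self, chars):
        self.children = [CharNode(c) for c in chars]

    def mark_discardable_spaces(self):
        # Pass 1: walk the list without mutating it; only set flags.
        prev = None
        for node in self.children:
            if node.char == " " and prev in (" ", "\n"):
                node.discard = True
            prev = node.char

    def sweep(self):
        # Pass 2: runs after the iteration has finished, so the list can change.
        self.children = [n for n in self.children if not n.discard]

    def text(self):
        return "".join(n.char for n in self.children)


block = Block("a\n  b c")
block.mark_discardable_spaces()
block.sweep()
print(repr(block.text()))
```

The alternative mentioned above, keeping a separate list of discardable nodes at Block or Inline level, would replace the `discard` flag with a collection filled in pass 1 and consumed in pass 2.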

Cheers,

Andreas



Re: White space handling Wiki page

2005-11-01 Thread Andreas L Delmelle

On Nov 1, 2005, at 10:04, Manuel Mall wrote:



I am sure it is doable - but is it worth it at this stage? Possibly
after a better understanding of the white-space handling issues that
whole current system needs revision? One problem with the current char
iterator is that it iterates over inline boundaries which causes white
space to be collapsed across those which according to the clarification
of the WG is incorrect. IMO to implement the refinement step of the
white space handling (which currently happens in the flow.Block object)
we need an iterator which goes through all characters but indicates fo
boundaries (not including fo:characters) so we can do:
a) linefeed treatment across all characters;
b) white space collapse across each consecutive section of
implicit/explicit fo:characters, i.e. delimited by the start/end of
fo's;
c 1) white-space-treatment from the start of the fo:block to the first
non white-space character;
The iterator must also be able to either operate backwards or be able to
be reset to a particular position (last non white space character) so
we can do:
c 2)  white-space-treatment from the end of the fo:block backwards to
the first non white-space character

It must also support character deletions and character substitutions.

Does that make sense?
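
A rough Python sketch of such a boundary-aware iterator, covering only step b) of the list above. It is not FOP code: `BOUNDARY`, `chars_with_boundaries` and `collapse_within_sections` are hypothetical names, and each string in `segments` stands for the text between two fo boundaries.

```python
BOUNDARY = object()  # sentinel marking the start/end of a formatting object


def chars_with_boundaries(segments):
    """Yield every character in document order, with BOUNDARY between segments."""
    for i, segment in enumerate(segments):
        if i > 0:
            yield BOUNDARY
        yield from segment


def collapse_within_sections(segments):
    """Collapse runs of spaces, restarting the run at each boundary."""
    out, current, prev_space = [], [], False
    for item in chars_with_boundaries(segments):
        if item is BOUNDARY:
            out.append("".join(current))
            current, prev_space = [], False
            continue
        if item == " " and prev_space:
            continue  # drop the second and later spaces of a run
        current.append(item)
        prev_space = item == " "
    out.append("".join(current))
    return out


print(collapse_within_sections(["A  ", "  Text  ", "  B"]))
```

Because the run is reset at every boundary, a space on each side of an inline start or end survives, which is the behaviour asked for above. Backward iteration, deletion and substitution are not covered by this sketch.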


Very much. Precisely with that in mind, I've also been contemplating  
moving part of the whitespace-handling to inline-level. This would  
keep the nested inlines separated from the Block's own direct FOText  
descendants (and at the same time, in combination with the  
modification I already described, this would provide us with an  
opportunity to remove fo:characters from within the nested inlines -- 
which would become quite a pain if this removal is deferred to block- 
level)


So the RecursiveCharIterator should only create Iterators over  
regular FOText or fo:characters that are direct descendants of the  
Block/Inline. FOText of nested FObjs should be left alone, since the  
whitespace will already be collapsed. IOW, it should stop being -- 
recursive?


Currently, whitespace handling is triggered from the moment a Block  
encounters a child node that isn't FOText nor generates inline areas.  
At the basis this seems OK, the only difference I'd propose is that  
inlines do their own whitespace handling, so that *if* whitespace  
needs to be collapsed across fo boundaries --maybe there are  
cases?--, the block-level only needs to look at the first and last  
characters in an inline's text.



Cheers,

Andreas



Re: White space handling Wiki page

2005-10-31 Thread Andreas L Delmelle

On Oct 27, 2005, at 06:29, Manuel Mall wrote:

Manuel,

Some more on this example:


Actually something like:
<fo:block background-color="yellow">word1<fo:character
character="&#10;"/><fo:character character=" "/>word2<fo:character
character=" "/>word3<fo:character character="&#10;"/></fo:block>
currently causes an exception!


I think I see the problem (don't know if you've seen it that way):

fop.fo.flow.Character overrides FONode.charIterator(), which returns  
a OneCharIterator over its own character. If remove() is called from  
within a RecursiveCharIterator for the surrounding block, the spaces  
surrounding the linefeed get recognized as discardable whitespace.  
The superclass throws an UnsupportedOperationException, because  
OneCharIterator doesn't have an implementation for remove(). This may  
be the reason the example didn't work properly.


The problem can be solved by a slight modification to OneCharIterator:
* add a constructor with Character parameter (and member)
* add a remove() implementation which makes Character's parent remove  
it from its list of child nodes


Tested locally (very quickly), and seems to work nicely. If I get the  
chance to commit it in the next few days, I'll do so myself, but if  
you want to have a go, it's a pretty easy fix (adds up to about 10-15  
LOC incl. javadocs :-))
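
For readers who want to see the shape of that change, here is a small Python analogue (FOP itself is Java, and this is not its code). `Node` and `Parent` are hypothetical stand-ins for the Character object and its parent; only the idea of handing the node to the iterator and letting remove() detach it from the parent comes from the description above.

```python
class Node:
    """Hypothetical stand-in for an fo:character object."""

    def __init__(self, char, parent):
        self.char = char
        self.parent = parent


class Parent:
    """Hypothetical stand-in for the node that owns the child list."""

    def __init__(self, chars):
        self.children = [Node(c, self) for c in chars]


class OneCharIterator:
    """Iterates over exactly one character and can remove its owning node."""

    def __init__(self, node):
        self.node = node
        self.done = False

    def __iter__(self):
        return self

    def __next__(self):
        if self.done:
            raise StopIteration
        self.done = True
        return self.node.char

    def remove(self):
        # Ask the parent to drop the node from its list of child nodes.
        self.node.parent.children.remove(self.node)


parent = Parent("a b")
it = OneCharIterator(parent.children[1])
next(it)      # the single space
it.remove()   # detaches the space from the parent
print("".join(n.char for n in parent.children))
```

The sketch calls remove() while nothing else is iterating over the parent's child list; removing during such an iteration is a separate problem.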



Cheers,

Andreas



Re: White space handling Wiki page

2005-10-29 Thread Manuel Mall
If it's any consolation in this discussion I wrote a little HTML test to
see how browsers deal with these white space (and some line height)
issues. To mimic the XSL-FO situation I used only <div> and <span>. You
can see the results here:
http://people.apache.org/~manuel/fop/test5.html

I viewed this page with IE 6, Firefox and Opera under Windows and
Firefox and Konqueror under Linux. There are differences in rendering,
some subtle, some quite significant, between all 5 browser variants,
although IE stands out in doing it most differently from the rest. If
they cannot sort out white space handling (and line height) in HTML
(billions of users, designers, spec writers, reviewers, ...) how should
we have a chance :-)?

Manuel


Re: White space handling Wiki page

2005-10-28 Thread Manuel Mall
Andreas,

excellent - I think there is now lots of convergence and common 
understanding between your and my interpretations.

A bit more inline.

On Fri, 28 Oct 2005 04:58 am, Andreas L Delmelle wrote:
 On Oct 25, 2005, at 10:57, Manuel Mall wrote:
  When FOP is collapsing (b) or removing (c) white space are there
  any fences we need to observe. For example a border/padding between
  two spaces, e.g. (spaces represented by a .):
  <fo:block>...<fo:inline
  border="...">...Text...</fo:inline>...</fo:block>
  There are 4 sequences of 3 spaces each. What would we expect the
  final outcome to be (assuming it fits on one line):
  a) all removed: [border]Text[border]
  b) only first and last removed: [border].Text.[border]
  c) first, 2nd and last removed: [border]Text.[border]
  d) ???
 
  To me b) makes sense. However, a) is the HTML way and c) seems what
  RenderX and AntennaHouse are doing. What do we want to do?

 Having read that 1.1 definition more closely now, I'd say a). Somehow
 it begins to fall into place...

I fully agree with your analysis of the spec here, that is saying option 
a) is the most likely answer that can be derived from the wording in 
the spec. Is it the intended / most sensible answer? A bit more below.

<snip/>
  And what about this:
  <fo:block>...A...<fo:inline
  border="...">...Text...</fo:inline>...B...</fo:block>
 
  a) all removed: A[border]Text[border]B
  b) only first and last removed: A.[border].Text.[border].B
  c) only first and last removed and others collapsed across the
  borders:
  A.[border]Text.[border]B
  d) ???
 
  a) is most likely wrong, b) looks OK, c) is the HTML way.

 Same thinking here, b) seems to be the way to go.


We agree but did you notice the difference it would make in visual 
appearance if the inline just happens to be at the beginning / end of 
the line if we follow option a) from the first example? That is if we 
have a line break after the A you would get:
[border]Text.[border].B
If we have a line break before the B you would get:
A.[border].Text[border].B
That is depending on where the linebreaks are there would be a space or 
not between the border and the word 'Text'. It is these 'strange' or 
'unsymmetric' outcomes which made me think that a border should 
possibly act like a fence with respect to whitespace removal (option b) 
in the first example).

 I wonder... what if:
 1. as much as possible of the whitespace handling is done in the FO
 parsing stage (before any LayoutManager is created)
Yes, isn't that what the refinement stage is about (partly)?
 2. after linefeed-treatment is handled, all remaining whitespace
 characters are converted internally into fo:characters

 This is precisely what the definition of fo:character seems to
 prescribe for all characters, but that may be overkill (?)
Yes, logically - practically within FOP no because creating separate FOs
and possibly areas for each character is most likely prohibitive in
terms of memory consumption and processing. But logically FOP should
behave as if that is what is happening. Especially if we want to
implement Unicode compliant line breaking, bidi, etc. This needs to be
done on a per paragraph basis and not on a per 'text section' basis as
is now. That is, the analysis of where a line break opportunity is must
go across inline boundaries, include fo:characters, etc.


 As such, all those whitespace characters would get a default
 suppress-at-line-break of auto, meaning: for the plain old space
 --U+0020-- suppress, and retain for all the others. So, in case
 of linefeed-treatment=preserve:

 <fo:block l-t="p">&#x0A;</fo:block>

 is the same as

 <fo:block l-t="p"><fo:character character="&#x0A;"
 suppress-at-line-break="retain" .../></fo:block>

 Which should IIC, in terms of layout, create something like a penalty
 of -INFINITE (= effect should be a forced line-break), but the effect
 of surrounding feasible breaks should be taken into account.
 In case one is wondering: with default white-space-treatment and
 preserved linefeeds, this means that if a linefeed glyph-area
 immediately follows another line-break (start-block)...? Empty line
 or not? The glyph-area is not deleted, and it should be the last area
 of the line-area subset it occurs in, so I'm inclined to say: yes,
 empty line.
Agree


 Anyway, this would definitely mean something in terms of treating
 whitespace consistently and uniformly, whether the stylesheet author
 used explicit fo:characters or not. At the very least the treatment
 between characters and fo:characters should be normalized in *some*
 way vis-a-vis the layout-engine.
Agree - It probably means when we operate on this stuff we need to use 
character iterators which go through 'plain text', nested inline, 
fo:characters, etc. transparently.

 The other way around is certainly impossible, since we'd lose the
 original fo:character's property info which is used during layout. In
 between lies an idea of a temporary whitespace map, into which both
 types of whitespace chars are 

Re: White space handling Wiki page

2005-10-28 Thread Luca Furini

Manuel Mall wrote:

Side note: FOP doesn't quite do the same internally, i.e. a character
explicitly specified using <fo:character .../> is handled separately from
'plain text'. If someone would write a style sheet which does a
transform of every character into a <fo:character /> object and would
feed the output to FOP the formatting results would be let's say VERY
DISAPPOINTING. Actually something like:
<fo:block background-color="yellow">word1<fo:character
character="&#10;"/><fo:character character=" "/>word2<fo:character
character=" "/>word3<fo:character character="&#10;"/></fo:block>
currently causes an exception!


This is a problem of the whitespace-related code, but anyway the
CharacterLM always creates a sequence of elements corresponding to a
non-space character, so the only feasible breaks recognized by the
algorithm would be the hyphenation points inside the words ...


I think that just as TextArea and Character both extend an 
AbstractTextArea, TextLM and CharLM should have a common super class 
holding the createElementsFor*() methods. It would not be necessary to add 
a SpaceArea or a WordArea child to a Character area, anyway (but we could 
decide to do it anyway just for analogy).


Regards
Luca




Re: White space handling Wiki page

2005-10-28 Thread Manuel Mall
On Fri, 28 Oct 2005 11:08 pm, Luca Furini wrote:
 Manuel Mall wrote:
  Side note: FOP doesn't quite do the same internally, i.e. a
  character explicitly specified using <fo:character .../> is handled
  separately from 'plain text'. If someone would write a style sheet
  which does a transform of every character into a <fo:character />
  object and would feed the output to FOP the formatting results
  would be let's say VERY DISAPPOINTING. Actually something like:
  <fo:block background-color="yellow">word1<fo:character
  character="&#10;"/><fo:character character=" "/>word2<fo:character
  character=" "/>word3<fo:character character="&#10;"/></fo:block>
  currently causes an exception!

 This is a problem of the whitespace-related code, but anyway the
 CharacterLM always creates a sequence of elements corresponding to a
 non-space character, so the only feasible breaks recognized by the
 algorithm would be the hyphenation points inside the words ...

 I think that just as TextArea and Character both extend an
 AbstractTextArea, TextLM and CharLM should have a common super class
 holding the createElementsFor*() methods. It would not be necessary
 to add a SpaceArea or a WordArea child to a Character area, anyway
 (but we could decide to do it anyway just for analogy).

Yes I agree but it is IMO a bit more complicated. The Unicode line
breaking algorithm does require more than one character to make
decisions. Simple example: No break after/before an opening/closing
punctuation, e.g. "(", "[", "]", ")" etc. So in a sequence like "( HELP )"
neither the space following "(" nor the space preceding ")" would
be a legal break opportunity. If someone would write:
...(<fo:inline font-weight="bold"> HELP </fo:inline>)...
then even if the current getNextKnuth functions would implement the
Unicode algorithm we still would create a break opportunity for the
spaces because the fo snippet above would generate 3 calls to
getNextKnuth because 3 different LMs are created: one for '...(', one
for ' HELP ', and the last for ')...' and each do the analysis just
limited to their piece of text. Therefore having the
createElementsFor*() methods centralised solves only part of the
problem.
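
The effect can be shown with a toy model in Python. This is not the Unicode line breaking algorithm and not FOP code; `space_breaks` is a hypothetical function that implements only the two rules from the example (no break after opening punctuation, no break before closing punctuation, in both cases even across spaces) and treats every other space as a break opportunity.

```python
OPENING = "([{"
CLOSING = ")]}"


def space_breaks(text):
    """Return the indexes of spaces that count as break opportunities."""
    breaks = []
    for i, ch in enumerate(text):
        if ch != " ":
            continue
        before = text[:i].rstrip(" ")
        after = text[i + 1:].lstrip(" ")
        if before and before[-1] in OPENING:
            continue  # no break after opening punctuation
        if after and after[0] in CLOSING:
            continue  # no break before closing punctuation
        breaks.append(i)
    return breaks


pieces = ["...(", " HELP ", ")..."]

# Each piece analysed on its own, as three separate LMs would do it.
per_piece = [space_breaks(p) for p in pieces]

# The same text analysed as one unit across the inline boundaries.
whole = space_breaks("".join(pieces))

print(per_piece)
print(whole)
```

Analysed in isolation, the ' HELP ' piece reports both of its spaces as break opportunities because it cannot see the parentheses in the neighbouring pieces. Analysed as a whole, neither space qualifies.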
 Regards
  Luca
Cheers
Manuel


Re: White space handling Wiki page

2005-10-28 Thread Andreas L Delmelle

On Oct 28, 2005, at 12:28, Manuel Mall wrote:


On Fri, 28 Oct 2005 04:58 am, Andreas L Delmelle wrote:


(second example)



Same thinking here, b) seems to be the way to go.




We agree but did you notice the difference it would make in visual
appearance if the inline just happens to be at the beginning / end of
the line if we follow option a) from the first example? That is if we
have a line break after the A you would get:
[border]Text.[border].B
If we have a line break before the B you would get:
A.[border].Text[border].B


Errm, typo? I'd delete the space before 'B' as well, so fully:
A.[border].Text[border]
B

That is depending on where the linebreaks are there would be a space or
not between the border and the word 'Text'. It is these 'strange' or
'unsymmetric' outcomes which made me think that a border should
possibly act like a fence with respect to whitespace removal (option b)
in the first example).


In a certain way, yes. The white-spaces before 'A' and after 'B' can  
be literally removed from the stream (so that they don't have a  
corresponding glyph-area; why create one if you already know it's  
going to be deleted much further on?), while the other space- 
sequences can at most be collapsed to one space. These single spaces  
will automatically have glyph-areas which will or will not be  
deleted, depending on whether a line-break precedes/follows.


So I agree, but I don't think the borders need to be explicitly
tracked/checked for this, as they coincide with the boundaries of
the inline anyway. The effect of the border acting as a fence should
rather be seen as a consequence, following naturally from the process
of whitespace handling. It's the element borders --but in XML markup
terms, not the presence of border properties-- that act as
'fences' (quoted since the term is not really applicable at that level).


My main point was the difference between blocks and inlines in this  
respect.

For instance, the following different possibilities:

1) <fo:block> A ...
2) <fo:block>   A ...
3) <fo:block> &#x0A;A ...
4) <fo:block>&#x09;&#x0D;&#x0A;&#x20; A ...

would all be treated during layout as if they were
<fo:block>A ...

(supposing default values for all related properties)

The space between 'A' and '...' always remains --whether the '...'  
refers to content or markup for a nested inline-- but the spaces  
between the start-block markup and the character 'A' are all dropped  
(=implicit line-break immediately preceding).


Analogous for XML whitespace between the last non-whitespace char and  
the end-block markup.


For inlines, this becomes nearly the opposite: only if the current  
inline FO is the first child-node to its parent (first inline in a  
block, no preceding characters), then we could cheat and throw away  
any whitespace between start-inline and the first non-whitespace, but  
as a general ROT, those white-spaces can at most be collapsed to a  
single space, since they could end up in the middle of a line area.


Generally, inserting spaces, tabs or linefeeds as the first/last  
characters of a fo:block should make no difference, but for a  
fo:inline this would always result in an extra space in the output if  
it ends up in the middle of a line.
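
The contrast between block edges and inline content can be written down in a few lines of Python. This is a sketch under the stated assumption of default property values, not FOP code, and `trim_block_edges` and `collapse_inline` are hypothetical names. The four strings in `variants` are the text content of the four fo:block cases listed above.

```python
import re

XML_WS = " \t\r\n"


def trim_block_edges(text):
    """Block start/end: XML white space next to the block markup is dropped."""
    return text.strip(XML_WS)


def collapse_inline(text):
    """Inline content: a run of XML white space shrinks to one space, never to none."""
    return re.sub(r"[ \t\r\n]+", " ", text)


variants = [" A ...", "   A ...", " \nA ...", "\t\r\n  A ..."]
print(all(trim_block_edges(v) == "A ..." for v in variants))

# The same leading white space inside an inline leaves one space behind.
print(repr(collapse_inline(" \n  A ")))
```

The special case mentioned above, an inline that is the first child of its block, is not modelled here.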


Talking nested blocks:

<fo:block> A <fo:block> B ...

is the same as

<fo:block>A<fo:block>B ...
or
<fo:block>A
  <fo:block>B

The above doesn't hold for nested inlines, hence: beware of  
indent=yes in XSLT. In case of deeply nested inlines, this could  
result in the number of spaces in the output increasing with the  
depth of the fo:inline in the source document. :-)


<snip />

2. after linefeed-treatment is handled, all remaining whitespace
characters are converted internally into fo:characters

This is precisely what the definition of fo:character seems to
prescribe for all characters, but that may be overkill (?)

Yes, logically - practically within FOP no because creating separate FOs
and possibly areas for each character is most likely prohibitive in
terms of memory consumption and processing.
But logically FOP should behave as if that is what is happening.
Especially if we want to implement Unicode compliant line breaking,
bidi, etc. This needs to be done on a per paragraph basis and not
on a per 'text section' basis as is now. That is, the analysis of
where a line break opportunity is must go across inline boundaries,
include fo:characters, etc.


Not necessarily separate FOs, but the same type of LayoutManager  
would probably be more in the right direction. CharLM (or subclass?)  
should be able to operate on either an attached fo:character or a  
simple char instance variable; instantiated either from an explicit  
fo:character object, or by the TextLM responsible for the larger  
context from a Unicode whitespace character it encounters (instead of  
creating the elements for whitespace itself, the TextLM instantiates  
a CharLM to delegate?)
At the same time, the TextLM's operating context for line-breaking  

Re: White space handling Wiki page

2005-10-26 Thread Manuel Mall
On Wed, 26 Oct 2005 06:22 am, Andreas L Delmelle wrote:
 On Oct 25, 2005, at 10:57, Manuel Mall wrote:
</snip>
 No, it talks about 'character flow objects', which makes me wonder...
 Are all characters to be considered 'character flow objects' or only
 those that were specified using fo:character? Not that it would make
 a big difference, I think.

See bottom of page 3 (PDF version) and top of page 4 of the spec. There 
it talks about 'objectifying' the XML elements and attributes which 
includes converting characters into character FO's. From then on the 
spec always means the value of the character property of a 
fo:character object when talking about characters and their values. 
So the answer to your above question is: YES - all characters are 
'character flow objects'.

Side note: FOP doesn't quite do the same internally, i.e. a character
explicitly specified using <fo:character .../> is handled separately
from 'plain text'. If someone would write a style sheet which does a
transform of every character into a <fo:character /> object and would
feed the output to FOP the formatting results would be let's say VERY
DISAPPOINTING. Actually something like:
<fo:block background-color="yellow">word1<fo:character
character="&#10;"/><fo:character character=" "/>word2<fo:character
character=" "/>word3<fo:character character="&#10;"/></fo:block>
currently causes an exception!

 Cheers,

 Andreas
Cheers

Manuel


Re: White space handling Wiki page

2005-10-26 Thread Manuel Mall
On Wed, 26 Oct 2005 06:22 am, Andreas L Delmelle wrote:
 On Oct 25, 2005, at 10:57, Manuel Mall wrote:

<snip/>
 The right order in which the related properties should be dealt with
 seems to be:
 1. white-space-treatment (property refinement)
 2. linefeed-treatment (property refinement)
 3. white-space-collapse (layout/area tree construction)
 4. suppress-at-line-break (layout/area tree construction)

We are very close here in our mutual opinions - if you look at my 
revised algorithm on the Wiki page it is nearly the same as your 4 
steps above. THAT'S GOOD !!!

And what do they say: Great minds think alike :-)

 Cheers,

 Andreas

Cheers

Manuel


Re: White space handling Wiki page

2005-10-25 Thread Manuel Mall
Hi,

I haven't got any technical comments to the issues raised on the Wiki 
page. Is this 'too hard' or 'too boring' or 'too messy' or what? The 
problem is not going away. We currently don't do it right in some parts 
(that is established) but I don't know overall what is right or wrong. 
Maybe if I ask for comments on an issue by issue basis we get
somewhere?

Quick background: In the default case (which seems to be the most 
complicated) white space handling consists of 3 things -

a) Replace any white space that is not a space char with a space char
=> easy.

b) Collapse any sequence of spaces to a single space.

c) Remove any spaces at the beginning and end of lines.

First issue for b) and c) (and it may have different answers for b) and 
c)):

In other places the spec has a concept of a fence as a boundary across 
which certain operations do not apply, e.g. space resolution.

When FOP is collapsing (b) or removing (c) white space are there any 
fences we need to observe. For example a border/padding between two 
spaces, e.g. (spaces represented by a .):
<fo:block>...<fo:inline
border="...">...Text...</fo:inline>...</fo:block>
There are 4 sequences of 3 spaces each. What would we expect the final 
outcome to be (assuming it fits on one line):
a) all removed: [border]Text[border]
b) only first and last removed: [border].Text.[border]
c) first, 2nd and last removed: [border]Text.[border]
d) ???

To me b) makes sense. However, a) is the HTML way and c) seems what 
RenderX and AntennaHouse are doing. What do we want to do?

And what about this:
<fo:block>...A...<fo:inline
border="...">...Text...</fo:inline>...B...</fo:block>

a) all removed: A[border]Text[border]B
b) only first and last removed: A.[border].Text.[border].B
c) only first and last removed and others collapsed across the borders: 
A.[border]Text.[border]B
d) ???

a) is most likely wrong, b) looks OK, c) is the HTML way.
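
To make the three candidate outcomes concrete, here is a Python sketch that reproduces the listed results for both examples. It is one possible reading of options a), b) and c), not something derived from the spec and not FOP code. Each example is modelled as three text segments (block text before the inline, inline text, block text after the inline), and all function names are hypothetical.

```python
import re


def collapse(s):
    return re.sub(r" +", " ", s)


def option_a(segs):
    # All white space next to any boundary is removed.
    return [collapse(s).strip(" ") for s in segs]


def option_b(segs):
    # Collapse inside each segment; strip only at block start and block end.
    out = [collapse(s) for s in segs]
    out[0] = out[0].lstrip(" ")
    out[-1] = out[-1].rstrip(" ")
    return out


def option_c(segs):
    # Collapse across boundaries: a space is dropped if the previous kept
    # character was a space, even if that one lies in an earlier segment.
    out, prev_space = [], True  # True drops spaces at the block start
    for s in segs:
        kept = []
        for ch in s:
            if ch == " " and prev_space:
                continue
            kept.append(ch)
            prev_space = ch == " "
        out.append("".join(kept))
    out[-1] = out[-1].rstrip(" ")
    return out


def render(segs):
    # Show the result in the notation used above: '.' for a space.
    return "[border]".join(segs).replace(" ", ".")


ex1 = ["   ", "   Text   ", "   "]
ex2 = ["   A   ", "   Text   ", "   B   "]
for option in (option_a, option_b, option_c):
    print(option.__name__, render(option(ex1)), render(option(ex2)))
```

Option b) is the only one of the three in which the inline boundary acts as a fence for collapsing.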

Manuel


Re: White space handling Wiki page

2005-10-25 Thread Simon Pepping
Hi Manuel,

On Tue, Oct 25, 2005 at 04:57:41PM +0800, Manuel Mall wrote:
 Hi,
 
 I haven't got any technical comments to the issues raised on the Wiki 
 page. Is this 'too hard' or 'too boring' or 'too messy' or what? The 
 problem is not going away. We currently don't do it right in some parts 
 (that is established) but I don't know overall what is right or wrong. 
 May be if I ask for comments on an issue by issue basis we get 
 somewhere?

You and Jeremias both turning in so much work is too much to follow. I
really would like to comment on your analysis, but I do not right now
have time for it. Perhaps in one of the next weeks.

Simon

-- 
Simon Pepping
home page: http://www.leverkruid.nl



Re: White space handling Wiki page

2005-10-25 Thread Manuel Mall
Andreas,

firstly a great thanks for looking at this.

I am not going to comment on your comments right now but there is 
probably an important clarification required: All my interpretations of 
the spec with respect to white space handling are based on the 1.1WD 
not the 1.0 spec. The WG has already confirmed that the description of 
white space handling in 1.0 is flawed and has rewritten that part in 
1.1. As I mentioned before I believe their intention was not to change 
the behaviour they wanted to achieve between 1.0 and 1.1 but to 
document it more correctly. Therefore I think it is correct and 
advisable to refer to 1.1 in this case. Could you please review your
comments under this aspect as I believe that would clarify why I refer
to line breaking vs linefeed, glyph areas vs fo's, etc. below, which
are some of the items you questioned.

Manuel

On Wed, 26 Oct 2005 06:22 am, Andreas L Delmelle wrote:
 On Oct 25, 2005, at 10:57, Manuel Mall wrote:

 Hi Manuel,

  I haven't got any technical comments to the issues raised on the
  Wiki page. Is this 'too hard' or 'too boring' or 'too messy' or
  what?

 All three, and more :-P
 Nah, seriously now, trying to comment in on the thread from last
 week:

 On Oct 19, 2005, at 03:45, Manuel Mall wrote:
  On Wed, 19 Oct 2005 05:44 am, Jeremias Maerki wrote:
  You place the white-space-treatment after the white-space-collapse
  but I think it is clear that the latter comes last (during area
  tree construction => after line breaking vs. during line-building
  and inline-building => before line-breaking).
 
  Yes I agree that this is a critical interpretation issue and I
  expected
  that part of the algorithm to be controversial. The problem is that
  in the description of value true for the white-space-collapse
  property it clearly refers to the fo tree and fo:character siblings
  in the tree.
  That was further clarified by an e-mail on the xsl editor list
  http://lists.w3.org/Archives/Public/xsl-editors/2002OctDec/0004.
  Once we have done line-building the fo tree (at fo:character level)
  is largely gone, we have now glyph areas which could have been
  merged, ligatures combined, etc.. That means referring at this
  stage back to fo:character siblings in the fo tree seems lets say
  unusual.

 Correction: the FO tree isn't 'gone' until layout for the page-
 sequence is finished. Only at that time are all the FObjs released.
 It may seem unusual to refer back to fo:character siblings, but that
 doesn't mean it's wrong. The corresponding FObj is still available to
 us at that point...

  The fact that we are dealing with glyph areas and not fo:character
  elements in line-building / area tree construction is further
  emphasised by the description of the handling for the
  white-space-treatment property. It is all defined in terms of
  glyph areas not fo:characters.

 Errm... I seem to be missing something, or aren't you talking about
 the XSL-FO Rec here?
 I can't find any reference to the term 'glyph' or 'glyph area' in the
 description of white-space-treatment. Many references to 'character
 flow object' and 'XML whitespace' though. (XSL 1.0) The passage about
 line-building (4.7.2) indeed talks about glyph areas, but only seems
 to mention suppress-at-line-break (a property which, incidentally,
 only applies to fo:character objects).

 Besides, the question that needs to be posed when handling white-
 space-collapse is: Should this whitespace character generate an area
 or not?
 The description in the Rec (XSL 1.0 -- 7.15.12) IMO clearly indicates
 that no layout information is needed to answer the above question, as
 it depends solely upon the character itself, the preceding and the
 following character.
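
 The claim that the decision is purely local can be illustrated with a
 small predicate. This is only a sketch in Python, not FOP code: the names
 generates_area and collapse are hypothetical, and the rule is a paraphrase
 of the white-space-collapse behaviour discussed in this thread, not the
 wording of the Rec.

```python
XML_WHITESPACE = {" ", "\t", "\r", "\n"}
LINEFEED = "\n"


def generates_area(prev, ch, nxt):
    """Decide from a character and its two neighbours whether it survives.

    prev / nxt are None at the start / end of the character sequence.
    Assumed rule (a paraphrase, not the spec text): a whitespace character
    other than linefeed is dropped if it follows whitespace (including a
    linefeed) or immediately precedes a linefeed.
    """
    if ch not in XML_WHITESPACE or ch == LINEFEED:
        return True  # non-whitespace and linefeeds are not collapsed here
    if prev is not None and prev in XML_WHITESPACE:
        return False  # follows another whitespace character or a linefeed
    if nxt == LINEFEED:
        return False  # immediately precedes a linefeed
    return True


def collapse(text):
    """Apply the predicate to every character; no layout information is used."""
    chars = list(text)
    kept = []
    for i, ch in enumerate(chars):
        prev = chars[i - 1] if i > 0 else None
        nxt = chars[i + 1] if i + 1 < len(chars) else None
        if generates_area(prev, ch, nxt):
            kept.append(ch)
    return "".join(kept)
```

 Note that the predicate looks at the original neighbours, not at the
 already-filtered output, so it can be evaluated per character without
 knowing anything about line breaks.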

  Further to this it doesn't make sense to me to collapse white space
  after line breaking as is implied by your interpretation because
  the amount of white space does contribute to the line breaking
  decisions. If we remove white space after line breaking we would
  IMO get sub optimal line breaks. In summary I think white space
  must be collapsed before or at least during line breaking but not
  after.

 100% agreed. I wouldn't place it after line breaking either. It seems
 like a decision that can (and should?) be made *during* layout:
 Should this whitespace character generate an element or not?
 If it generates an element, it will also --if appropriate-- trigger
 the generation of an area.

  Another related issue is the description of collapsing white space
  around a linefeed in the spec under white-space-collapse. The spec
  is very specific and refers to U+000A (linefeed) fo:character
  siblings in the fo tree.

 No, it talks about 'character flow objects', which makes me wonder...
 Are all characters to be considered 'character flow objects' or only
 those that were specified using fo:character? Not that it would make
 a big difference, I think.

  This is obviously very different to removing white space around a
  line break generated 

Re: White space handling Wiki page

2005-10-20 Thread Manuel Mall
I was close to throwing temporarily the proverbial towel into the ring 
with respect to whitespace handling. However an offline IM chat with 
Jeremias and writing the response to Stephen's post encouraged me to 
take a different approach.

Instead of trying to understand whitespace handling on the basis of a 
detailed analysis of the spec I looked at it from the perspective: What 
outcome did the spec writers most likely want to achieve? Of course 
the result is just a bunch of (educated?) guesses on my part.

But based on my guesses I have posted a revised algorithm on the Wiki 
which I hope:
a) Deals sensibly with unresolved/unclear issues
b) Gives consistent results in the generated output
c) Moves most of the work into refinement and only leaves whitespace 
handling around formatter generated breaks to layout
d) Is understandable by others

However, I do admit it is at this stage a home cooked approach and 
certainly requires close scrutiny by others before any implementation 
into the FOP code base is attempted.

Thanks

Manuel


Re: White space handling Wiki page

2005-10-20 Thread Chris Bowditch

Manuel Mall wrote:

Manuel,

I was close to throwing temporarily the proverbial towel into the ring 
with respect to whitespace handling. However an offline IM chat with 
Jeremias and writing the response to Stephen's post encouraged me to 
take a different approach.


Instead of trying to understand whitespace handling on the basis of a 
detailed analysis of the spec I looked at it from the perspective: What 
outcome did the spec writers most likely want to achieve? Of course 
the result is just a bunch of (educated?) guesses on my part.


Indeed. This is the approach I agree with and recommend as far as the 
spec is concerned. As you've found out the spec is often ambiguous. I 
think the best approach to implement any XSL-FO feature is:


1) work out what you think the XSL-FO WG intended. When doing this, I 
think it is important not to dwell too long on every single sentence - 
just get a gut feel of what the WG intended.

2) work through some use cases.
3) and add a pinch of common sense.



But based on my guesses I have posted a revised algorithm on the Wiki 
which I hope:

a) Deals sensibly with unresolved/unclear issues
b) Gives consistent results in the generated output
c) Moves most of the work into refinement and only leaves whitespace 
handling around formatter generated breaks to layout

d) Is understandable by others

However, I do admit it is at this stage a home cooked approach and 
certainly requires close scrutiny by others before any implementation 
into the FOP code base is attempted.


Of course this is just my 2 cents worth and may not be considered a good 
idea by others.


Chris




Re: White space handling Wiki page

2005-10-19 Thread Jeremias Maerki

On 19.10.2005 03:45:33 Manuel Mall wrote:
 On Wed, 19 Oct 2005 05:44 am, Jeremias Maerki wrote:
  I've started to comment on the individual issues you listed and only
  when I got to the examples I realized there must be something wrong.
  You place the white-space-treatment after the white-space-collapse
  but I think it is clear that the latter comes last (during area tree
  construction => after line breaking vs. during line-building and
  inline-building => before line-breaking). That's why you run into
  problems explaining why there is no line generated by the white space
  between the two starting block elements. Maybe clearing this up might
  clear up some of the other issues.
 
 Jeremias,
 
 yes I agree that this is a critical interpretation issue and I expected 
 that part of the algorithm to be controversial. The problem is that in 
 the description of value true for the white-space-collapse property 
 it clearly refers to the fo tree and fo:character siblings in the tree. 
 That was further clarified by an e-mail on the xsl editor list  
 http://lists.w3.org/Archives/Public/xsl-editors/2002OctDec/0004.

That document contains a further indicator that white-space-treatment
needs to be handled before white-space-collapse. Item 3 makes the intent
clear even though the wording has been changed in the spec. OTOH, the
usage-context-of-suppress-at-line-break property mentioned in that
document never really surfaced, which diminishes the credibility (or
importance) of the document a little.

 Once 
 we have done line-building the fo tree (at fo:character level) is 
 largely gone, we have now glyph areas which could have been merged, 
 ligatures combined, etc. That means referring at this stage back to 
 fo:character siblings in the fo tree seems, let's say, unusual. The fact 
 that we are dealing with glyph areas and not fo:character elements in 
 line-building / area tree construction is further emphasised by the 
 description of the handling for the white-space-treatment property. It 
 is all defined in terms of glyph areas not fo:characters.
 
 Further to this it doesn't make sense to me to collapse white space 
 after line breaking as is implied by your interpretation because the 
 amount of white space does contribute to the line breaking decisions. 
 If we remove white space after line breaking we would IMO get sub 
 optimal line breaks. In summary I think white space must be collapsed 
 before or at least during line breaking but not after.

Point. That was my bad. Still, I get the impression that *-treatment are
both handled before white-space-collapse.

 Another related issue is the description of collapsing white space 
 around a linefeed in the spec under white-space-collapse. The spec is 
 very specific and refers to U+000A (linefeed) fo:character siblings in 
 the fo tree. This is obviously very different to removing white space 
 around a line break generated during line building. Also in the default 
 case by the time we get to white-space-collapse handling all linefeeds 
 would have been replaced by spaces during refinement.

...or hard breaks or discarded entirely.

 Which leads to 
 the question: do they really mean that in the spec, or did they really 
 mean to remove white space around a line break?

That's my impression. It's two properties that actually remove space
around line breaks, but in two different stages.

 But then again that is 
 really dealt with by the white-space-treatment property in much more 
 detail. But why then do we need the duplication of white-space-collapse 
 removing white space around a linefeed character and 
 white-space-treatment removing white space (not quite actually - it 
 removes characters with the suppress-at-line-break property being true) 
 around line breaks?

Because line-feed-treatment may generate spaces which are not yet
handled by white-space-treatment but later picked up by
white-space-collapse. At least that is my take.

Anyway, this is like fishing in the dark. I have big trouble (again)
understanding the spec. Obviously, you found a lot of little details
that don't really resolve well. All we can do here is make guesses. This
is really frustrating.

Jeremias Maerki



Re: White space handling Wiki page

2005-10-19 Thread Manuel Mall
On Wed, 19 Oct 2005 03:33 pm, Jeremias Maerki wrote:
 On 19.10.2005 03:45:33 Manuel Mall wrote:
  On Wed, 19 Oct 2005 05:44 am, Jeremias Maerki wrote:
<snip/>

 Still, I get the impression that *-treatment
 are both handled before white-space-collapse.


Yes, linefeed-treatment is definitely a refinement activity (and pretty 
simple) and handled before white-space-collapse. But 
white-space-treatment is not refinement and this is where we differ (I 
think). white-space-treatment clearly depends on the line breaks 
generated, that is you cannot do white-space-treatment without having 
the line breaks. But we can only generate appropriate line breaks if we 
have collapsed the white space before. To me this means, together with 
the fact that white-space-collapse is defined on fo's and 
white-space-treatment on glyph areas, that white-space-collapse 
precedes white-space-treatment.
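
The ordering argued for above can be sketched as a small pipeline. This is
a toy model in Python, not FOP code: all function names are hypothetical,
linefeed_treatment only models the "linefeeds become spaces" case mentioned
in this thread, and break_lines is a naive greedy breaker standing in for
the real line breaking algorithm.

```python
import re


def linefeed_treatment(text):
    # Refinement step: model only the case where linefeeds become spaces.
    return text.replace("\n", " ")


def white_space_collapse(text):
    # Refinement step: runs of spaces/tabs collapse to one space,
    # before any line breaking takes place.
    return re.sub(r"[ \t]+", " ", text)


def break_lines(text, width):
    # Naive greedy breaker (hypothetical stand-in for the layout engine).
    # A space is a break opportunity and stays on the line it ends.
    lines, current = [], ""
    for token in re.findall(r"\S+ ?| ", text):
        if current and len((current + token).rstrip()) > width:
            lines.append(current)
            current = token
        else:
            current += token
    if current:
        lines.append(current)
    return lines


def white_space_treatment(lines):
    # Needs the generated line breaks: drop spaces at line start and end.
    return [line.strip(" ") for line in lines]


def layout(text, width):
    refined = white_space_collapse(linefeed_treatment(text))
    return white_space_treatment(break_lines(refined, width))
```

The point of the sketch is only the call order in layout(): the last step
takes the list of lines as input, so it cannot run before line breaking,
while the collapse step feeds the breaker.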

<snip/>
 Because line-feed-treatment may generate spaces which are not yet
 handled by white-space-treatment but later picked up by
 white-space-collapse. At least that is my take.

See above.


 Anyway, this is like fishing in the dark. I have big trouble (again)
 understanding the spec. Obviously, you found a lot of little details
 that don't really resolve well. All we can do here is make guesses.
 This is really frustrating.


I couldn't agree more.

 Jeremias Maerki

Manuel


Re: White space handling Wiki page

2005-10-19 Thread Manuel Mall
On Thu, 20 Oct 2005 08:15 am, Stephen Denne wrote:
 I have not read the spec regarding these attributes recently, but I
 was wondering whether the treatment of whitespace in the fo file
 defaults to the normal whitespace treatment in xml files if no
 special white-space treatment attributes are specified.

If I understand the XML spec correctly, whitespace handling is a job for 
the application, not the XML parser. That is, any character in an XML 
document outside of markup must be passed to the application by the 
parser. The only exception is #xD (carriage return), which is dropped if 
followed by #xA and otherwise replaced with a #xA. The exception to 
that is a #xD written as a character reference (&#xD;), in which case it 
is given to the application unchanged.

What I am trying to say is I don't think there is something like normal 
whitespace treatment in xml files. Each application has its own rules.
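
The parser behaviour described above can be checked with a few lines of
Python. This is just an illustration using the standard library's
xml.etree.ElementTree (backed by expat), not anything FOP does; the helper
name text_seen_by_application is made up for this example.

```python
import xml.etree.ElementTree as ET


def text_seen_by_application(xml_source):
    """Return the character data the parser reports for the root element."""
    return ET.fromstring(xml_source).text


# Spaces and linefeeds are handed over as-is; only literal carriage
# returns are normalised by the parser.
literal = text_seen_by_application("<p>a\r\n  b\rc</p>")

# A carriage return written as a character reference is not normalised.
referenced = text_seen_by_application("<p>a&#xD;b</p>")
```

Everything else, such as collapsing the two spaces before "b", is left to
the application.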

In a typical FOP deployment scenario we have two applications involved. 
The XSLT processor and FOP itself. I haven't studied what XSLT says 
about whitespace transformations but I would assume whitespace in 
element content is passed through unchanged unless special 
transformations are specified in the stylesheet.

So in the end it still boils down to: What does the XSL-FO spec say a 
conformant processor should do with whitespace in the input?

The answer to that question is: I don't know because I don't understand 
some aspects of that in the spec.

However your question was more specific like: Under the default settings 
is whitespace in the input significant?

Again I cannot answer that conclusively because the spec even when just 
looking at the default case confuses me. But here is my gut feel on how 
it may be intended:

1. If we have only whitespace and nothing else between markup it should 
be ignored, that is <fo:...>&#x20;<fo:...> is the same as 
<fo:...><fo:...>, and so are </fo:...>&#x20;</fo:...> and 
</fo:...>&#x20;<fo:...>. Note: There may be an exception to this rule 
related to <fo:character /> objects.

2. If we have whitespace embedded within text (= surrounded by non 
whitespace characters) it collapses to a single space.

3. Any whitespace at the beginning or end of a line (this is line in the 
generated output not line in the input!) is dropped.

If this interpretation is correct it means in some cases whitespace 
disappears completely in others it reduces to a single space (which a 
possibly specified text justification can change in width in the 
output).
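
Taken literally, the three rules above could be sketched like this. It is
a Python illustration of my reading only, with hypothetical function names;
it ignores the possible fo:character exception and says nothing about what
the spec actually requires.

```python
import re

XML_WS = " \t\r\n"


def apply_rules_1_and_2(chunks):
    """chunks: runs of character data as reported between pieces of markup."""
    result = []
    for chunk in chunks:
        if chunk.strip(XML_WS) == "":
            continue  # rule 1: whitespace-only text between markup is ignored
        # Rule 2: whitespace runs collapse to a single space. Leading and
        # trailing runs are also reduced to one space here; rule 3 removes
        # them later if they end up at a line edge.
        result.append(re.sub(r"[ \t\r\n]+", " ", chunk))
    return result


def apply_rule_3(output_lines):
    """Rule 3: whitespace at the start or end of a generated line is dropped."""
    return [line.strip(XML_WS) for line in output_lines]
```

Rules 1 and 2 only need the input text, whereas rule 3 operates on lines of
the generated output, which is why it cannot be done in the same pass.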

 I had an xsl stylesheet producing fo, that resulted in different
 rendered output (from FOP 0.20.5) if the fo xml document was pretty
 printed or not. This caused me a problem, as I was changing from
 running xml through the xslt engine, and serializing the result to a
 file (which was adding the pretty-printing) then supplying this fo
 file to FOP. The new process of sending the sax events from the xslt
 engine directly to FOPs content handler did not get any extra
 (supposedly meaningless) whitespace added, yet produced different
 rendering results.

 The workaround for that particular stylesheet was to add only a
 single <xsl:text> </xsl:text> to generate one space between two
 elements, resulting in a sax characters event, and restoring the
 desired rendering behaviour. (Note: the desire at the time was to get
 what was previously being produced, irrespective of any fo spec
 conformance.)

 I can't find the stylesheet now, but I thought that the location of
 that single space should have been meaningless as far as xml was
 concerned, and I'm now wondering whether the xsl:fo spec contradicts
 what I thought was normal whitespace treatment, when no whitespace
 related attributes are mentioned. (My thought at the time was that it
 was just a 0.20.5 bug.)

 From memory the fo was something like

 ...</fo:block><fo:table...

 vs

 ...</fo:block> <fo:table...

I would say under default whitespace handling this should not make a 
difference to the produced output.

 Stephen Denne.

HTH

Manuel


White space handling Wiki page

2005-10-18 Thread Manuel Mall
I have started a white space handling Wiki page
(http://wiki.apache.org/xmlgraphics-fop/LineLayout/WhitespaceHandling).

As with many other areas of the spec it seems to raise more questions than
it answers. I really would appreciate any comments, different views,
clarifications, ... on what has been written so far.

Cheers

Manuel


Re: White space handling Wiki page

2005-10-18 Thread Jeremias Maerki
I've started to comment on the individual issues you listed and only
when I got to the examples I realized there must be something wrong. You
place the white-space-treatment after the white-space-collapse but I
think it is clear that the latter comes last (during area tree construction
=> after line breaking vs. during line-building and inline-building
=> before line-breaking). That's why you run into problems explaining why
there is no line generated by the white space between the two starting
block elements. Maybe clearing this up might clear up some of the other
issues.


On 18.10.2005 11:14:09 Manuel Mall wrote:
 I have started a white space handling Wiki page
 (http://wiki.apache.org/xmlgraphics-fop/LineLayout/WhitespaceHandling).
 
 As with many other areas of the spec it seems to raise more questions than
 it answers. I really would appreciate any comments, different views,
 clarifications, ... on what has been written so far.
 
 Cheers
 
 Manuel



Jeremias Maerki



Re: White space handling Wiki page

2005-10-18 Thread Manuel Mall
On Wed, 19 Oct 2005 05:44 am, Jeremias Maerki wrote:
 I've started to comment on the individual issues you listed and only
 when I got to the examples I realized there must be something wrong.
 You place the white-space-treatment after the white-space-collapse
 but I think it is clear that the latter comes last (during area tree
 construction => after line breaking vs. during line-building and
 inline-building => before line-breaking). That's why you run into
 problems explaining why there is no line generated by the white space
 between the two starting block elements. Maybe clearing this up might
 clear up some of the other issues.

Jeremias,

yes I agree that this is a critical interpretation issue and I expected 
that part of the algorithm to be controversial. The problem is that in 
the description of value true for the white-space-collapse property 
it clearly refers to the fo tree and fo:character siblings in the tree. 
That was further clarified by an e-mail on the xsl editor list  
http://lists.w3.org/Archives/Public/xsl-editors/2002OctDec/0004. Once 
we have done line-building the fo tree (at fo:character level) is 
largely gone, we have now glyph areas which could have been merged, 
ligatures combined, etc. That means referring at this stage back to 
fo:character siblings in the fo tree seems, let's say, unusual. The fact 
that we are dealing with glyph areas and not fo:character elements in 
line-building / area tree construction is further emphasised by the 
description of the handling for the white-space-treatment property. It 
is all defined in terms of glyph areas not fo:characters.

Further to this it doesn't make sense to me to collapse white space 
after line breaking as is implied by your interpretation because the 
amount of white space does contribute to the line breaking decisions. 
If we remove white space after line breaking we would IMO get sub 
optimal line breaks. In summary I think white space must be collapsed 
before or at least during line breaking but not after.
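
A toy example of why the order matters for line breaking. This is a Python
sketch with a naive greedy breaker; the function names are hypothetical and
it has nothing to do with FOP's actual line breaking code. Every space
counts toward the line width until it is removed.

```python
import re


def greedy_break(text, width):
    """Greedy breaker: each run of spaces is a break opportunity and
    counts toward the width of the line it sits in."""
    lines, current = [], ""
    for word, spaces in re.findall(r"(\S+)( *)", text):
        if current and len(current + word) > width:
            lines.append(current.rstrip(" "))
            current = ""
        current += word + spaces
    if current:
        lines.append(current.rstrip(" "))
    return lines


def collapse(text):
    return re.sub(r" +", " ", text)


def collapse_then_break(text, width):
    return greedy_break(collapse(text), width)


def break_then_collapse(text, width):
    return [collapse(line) for line in greedy_break(text, width)]
```

For the input "aa    bb cc" and a width of 8, collapsing first lets the
whole text fit on one line, while breaking first produces two lines, the
first of which is left underfull once the extra spaces are removed.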

Another related issue is the description of collapsing white space 
around a linefeed in the spec under white-space-collapse. The spec is 
very specific and refers to U+000A (linefeed) fo:character siblings in 
the fo tree. This is obviously very different to removing white space 
around a line break generated during line building. Also in the default 
case by the time we get to white-space-collapse handling all linefeeds 
would have been replaced by spaces during refinement. Which leads to 
the question: do they really mean that in the spec, or did they really 
mean to remove white space around a line break? But then again that is 
really dealt with by the white-space-treatment property in much more 
detail. But why then do we need the duplication of white-space-collapse 
removing white space around a linefeed character and 
white-space-treatment removing white space (not quite actually - it 
removes characters with the suppress-at-line-break property being true) 
around line breaks?

<snip/>
 Manuel

 Jeremias Maerki

Manuel