Re: Leading/trailing space removal in LineLM

2005-11-08 Thread Manuel Mall
On Sat, 5 Nov 2005 12:05 am, Luca Furini wrote:
 Manuel Mall wrote:
  Here are some of the combinations I have identified:
 
  1. Non breaking / non elastic space = probably just a normal
  character, i.e. part of a word.
 
  2. Non breaking / elastic space - eg. U+00A0 Non breaking space
  = Must prevent break
  = Must handle text-align
 
  3. Break / non elastic - eg. U+200B ZWSP, any other break between
  two characters not involving adding or removing space/characters =
  Must handle border/padding
  = Must handle text-align
 
  4. Break / non elastic / remove if not break - eg. U+00AD Soft
  hyphen = Must remove if not at break
  = Must handle border/padding
  = Must handle text-align
 
  5. Break / non elastic / add character if break - eg. hyphenation
  = Must add space for hyphen if at break
  = Must handle border/padding
  = Must handle text-align
 
  6. Breaking / elastic / non removable - eg. U+3000 Ideographic
  space = Must handle border/padding
  = Must handle text-align
  Question: XSL-FO does not define U+3000 as removable white space,
  but would it be removed at a line break under common CJK
  typesetting conventions?
 
  7. Breaking / elastic / removable - eg. U+0020 Space
  = Can occur in runs which must be wholly removed
  = Must handle border/padding
  = Must handle text-align
 
  Any combinations I have missed, e.g. is there a break / non
  elastic / remove at break case?
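
The combinations above can be written down as a small lookup table. This is only a Python sketch for discussion: the class name, the flag names and the function are hypothetical and are not FOP code; only the code points named in the list are included.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SpaceClass:
    breakable: bool            # may a line break occur at this character?
    elastic: bool              # may it stretch or shrink for text-align?
    removable_at_break: bool   # dropped when a break is taken here?
    removed_if_no_break: bool = False  # soft hyphen: vanishes unless broken at


# Only the code points named in the list above; flag names are illustrative.
CLASSES = {
    "\u00a0": SpaceClass(False, True, False),   # case 2: no-break space
    "\u200b": SpaceClass(True, False, False),   # case 3: zero width space
    "\u00ad": SpaceClass(True, False, False, removed_if_no_break=True),  # case 4
    "\u3000": SpaceClass(True, True, False),    # case 6: ideographic space
    "\u0020": SpaceClass(True, True, True),     # case 7: ordinary space
}


def classify(ch):
    """Return the SpaceClass for ch, or None for ordinary characters (case 1)."""
    return CLASSES.get(ch)
```

Case 5 (hyphenation) has no code point of its own, so it does not appear in the table.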


I moved all this to a Wiki page with the actual Knuth sequences 
(http://wiki.apache.org/xmlgraphics-fop/LineBreaking). Please review / 
check!

 Maybe the fixed width spaces?

Yes - maybe.

 Anyway, it seems an exhaustive analysis of the problem!

 Just a few comments / thoughts:

 - non breaking, non elastic: the simple solution would be to handle
 these characters as normal letters, so the text before_after
 (where _ is zwnbsp) would create a single AreaInfo object in the
 TextLM; but this would create problems during hyphenation, as
 non-letter characters in the middle of a word ATM prevent
 hyphenation
I think word breaking, i.e. determining the word boundaries for the
purpose of hyphenation, and line breaking are not 100% coupled. There
are actually different Unicode documents describing each. Therefore
down the track I don't see treating these as normal characters for the
purpose of line breaking as being a problem, as the word breaking would
be done maybe in parallel but logically separately. We also most likely
want Knuth box elements covering the largest possible extent of
consecutive characters because a) it saves resources and b) as the
widths of Knuth elements are the basis of determining what fits on a
line, if we ever look into kerning the calculations would need to be
done on a per Knuth box element basis.


 - soft hyphen: at the moment it is not properly handled, but it won't
 be difficult to fix the implementation; it could create the same
 elements used for a hyphenation point, but the penalty could have a
 negative value (as probably users would use it to suggest a desired
 line break); note that a word with a soft hyphen in its middle would
 not be hyphenated, unless we ignore this character when collecting
 word fragments

I thought we simply delete the soft-hyphen character and generate a 
normal break with hyphen Knuth sequence at that point.
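
A rough Python sketch of that idea, assuming a fixed width per character and made-up values for the hyphen width and penalty. The names Box, Penalty and word_elements are hypothetical and not taken from FOP.

```python
from dataclasses import dataclass

SOFT_HYPHEN = "\u00ad"


@dataclass(frozen=True)
class Box:
    text: str
    width: int


@dataclass(frozen=True)
class Penalty:
    width: int      # added to the line only if the break is taken here
    value: int      # cost of breaking here
    flagged: bool   # True for hyphenated breaks


def word_elements(word, char_width=600, hyphen_width=333, hyphen_penalty=50):
    """One box per fragment, one flagged penalty at each soft hyphen.

    The soft hyphen itself is deleted: it never ends up inside a box.
    All widths and the penalty value are placeholder assumptions.
    """
    elements = []
    fragments = [f for f in word.split(SOFT_HYPHEN) if f]
    for i, fragment in enumerate(fragments):
        if i > 0:
            elements.append(Penalty(hyphen_width, hyphen_penalty, True))
        elements.append(Box(fragment, char_width * len(fragment)))
    return elements
```

Luca's suggestion of a negative penalty would only change the hyphen_penalty argument; the shape of the sequence stays the same.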


 Regards
  Luca

Manuel


Re: Leading/trailing space removal in LineLM

2005-11-08 Thread Manuel Mall
On Wed, 9 Nov 2005 12:47 am, Andreas L Delmelle wrote:
 On Nov 4, 2005, at 17:05, Luca Furini wrote:

 Hi Manuel / Luca,

  Manuel Mall wrote:
  Here are some of the combinations I have identified:

 snip /

  6. Breaking / elastic / non removable - eg. U+3000 Ideographic
  space = Must handle border/padding
 = Must handle text-align
 Question: XSL-FO does not define U+3000 as removable white space,
  but would it be removed at a line break under common CJK
  typesetting conventions?

 I think so. That's precisely what the definition for the auto value
 of suppress-at-line-break warns about. Does this mean that the use of
 a fo:character is mandated if the user wants it removed? Yes, IMO.

 Unless the editors can be persuaded to make U+3000 an exception to
 the default retain, like common spaces (U+0020), compliance means
 treating this character maybe a bit counter-intuitively.

  7. Breaking / elastic / removable - eg. U+0020 Space
 = Can occur in runs which must be wholly removed
 = Must handle border/padding
 = Must handle text-align
  Any combinations I have missed, e.g. is there a break / non
  elastic / remove at break case?
 
  Maybe the fixed width spaces?

 More generally: any fixed-width character, added through a
 fo:character, implying a feasible/favorable break before or after,
 and having suppress-at-line-break=suppress.

 I could put:

 <fo:character character="a" suppress-at-line-break="suppress" />

 in a document, surrounded by non-collapsible whitespace, and the
 formatter may decide to break before/after and drop the 'a'.

 Fixed-width spaces could be viewed as a subset. If they aren't added
 via a fo:character, they would belong to category 'break - non-
 elastic - non-removable'. (speaking strictly XSL-FO)

Andreas, I tend to disagree with the basic sentiment expressed here. If we 
accept Simon's notion that white space handling in XSL-FO is about 
dealing with spaces and linefeeds introduced by editors or humans for 
XML readability purposes then dealing with typographic conventions of 
particular scripts has nothing to do with the rules of white space 
handling. XSL-FO in quite a few places mentions user agent flexibility 
when it comes to dealing with script / language / country specific 
items. If we can, as Joerg suggests, replace a base letter followed by 
a combining diacritical mark with a matching combined glyph, why can't 
we replace an ideographic space followed by a line break with simply 
the line break? The point being I am not suggesting to remove the 
ideographic space under the XSL-FO white space rules but under 'script 
specific typographic conventions'. And I believe there is nothing in 
the spec which prohibits this - quite the opposite actually - the spec 
IMO encourages 'intelligent' handling of 'local customs'. Of course I 
don't know what the CJK typographic conventions are so this is all a 
bit hypothetical.

 Cheers,

 Andreas
Regards

Manuel


Re: Leading/trailing space removal in LineLM

2005-11-08 Thread Manuel Mall
On Wed, 9 Nov 2005 12:32 pm, Andreas L Delmelle wrote:
 On Nov 9, 2005, at 02:09, Manuel Mall wrote:

 Manuel,

snip/
 We're (again) more in agreement than we realize, I think... Although,
 now you got me wondering what you think is my 'basic sentiment' :-)

After reading your post = yes we are once again in agreement.

And don't worry about the sentiment stuff - just a minor  
distraction :-).

snip/
 Well, I can't find the exact reference (may be one of the earlier
 posts in this thread), but I seem to remember that the ideographic
 space can only shrink, not expand. Following that, I would say that
 there is little difference between suppressing a character, and
 shrinking it to zero width. Maybe, since it needs to be shrinkable
 anyway, it could be treated along that line?
UAX#14 I think is the reference.

snip/

 Cheers,

 Andreas

Manuel


Re: Leading/trailing space removal in LineLM

2005-11-04 Thread J.Pietschmann

Luca Furini wrote:
note that a word with a soft hyphen in its middle would not be 
hyphenated, unless we ignore this character when collecting word fragments


Well, in order to prepare for hyphenation, other characters
like joiners have to be removed too. We should probably also
use Unicode normalization.
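
A small Python sketch of such a preparation step, using the standard unicodedata module. The set of characters to drop and the function name are assumptions for illustration only.

```python
import unicodedata

# Characters dropped before pattern matching; the exact set is an assumption:
# soft hyphen, ZWNJ, ZWJ, word joiner, zero width no-break space.
IGNORABLE = {"\u00ad", "\u200c", "\u200d", "\u2060", "\ufeff"}


def hyphenation_key(word):
    """Return word in NFC form with soft hyphens and joiners removed."""
    stripped = "".join(ch for ch in word if ch not in IGNORABLE)
    return unicodedata.normalize("NFC", stripped)
```

A real implementation would also have to map hyphenation points found in the cleaned string back to positions in the original text, which this sketch does not attempt.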

J.Pietschmann




Re: Leading/trailing space removal in LineLM

2005-11-03 Thread Manuel Mall
On Wed, 2 Nov 2005 11:58 pm, Luca Furini wrote:
 Manuel Mall wrote:
  Luca wrote a longer response to this but my mail reader doesn't
  like the character set (is that topical or what?).

 Sorry, it looks really horrible ... still don't know what went wrong,
 but I won't do it again! :-)

  Anyway, at the end Luca asks the question about the UAX#14 line
  breaking algorithm and its handling of spaces. My answer to that is:
  a) Yes, UAX#14 always breaks at the end of a sequence of spaces
  b) But it also says that it assumes any trailing spaces in a line
  are being removed
  This conflicts with XSL-FO, which can force spaces to be retained;
  therefore adjustments to the algorithm are necessary to cater for
  that. One possible adjustment is simply changing what is given to
  the algorithm as indicated above, i.e. <sp> becomes
  <zwsp><nbsp><zwsp>.

 Ok, so back to your previous message:
  2. Removal of white space: This is the current behaviour but it
  works only for a single space and not for a sequence of spaces.
  Actually because the algorithm removes leading glues/penalties it
  is mainly a problem for trailing white space. I am not sure how to
  best tackle this. What comes to mind is:
 
  a) Do the same as for leading glues/penalties at the end of the
  line. However I am not sure how tricky it would be to determine the
  boundary because any 'blocking boxes' (see 1. above) are only
  placed before but not after elements. This option suffers from the
  problem that it will not remove leading/trailing white space across
  inline boundaries with border/padding as these generate zero width
  boxes to block removal of the glue elements for the border/padding.
 
  b) Do not generate individual Knuth sequences for each white space
  character but instead collect all consecutive white space and
  create one glue-penalty sequence for it. Again I am uncertain of
  the consequences of doing that. To do that correctly we would need
  to collect white space across inline boundaries. This firstly
  breaks the current getNextKnuth approach which assumes each LM can
  generate its sequences without knowledge of its neighbours. It
  would also break the current area info structures as a single Knuth
  element could now refer to text snippets from different LMs.

 I'm not sure I follow you in all the details of white space handling
 and here we have borders too ... :-)

 I like b) most: after all, this is somewhat similar to the space
 resolution, as we have interactions between spaces coming from
 different nodes, and it's difficult to have each LM decide on its
 own. And I think we could find a way to keep the 1-1 relationship
 between AreaInfo objects and Positions.

 I have tried to play with the elements, and here are a few results: I
 hope they can help!

 At the moment, the sequence for a single space with borders and
 padding is:

 1  glue w=endBP
 2  penalty w=0
 3  glue w=(spaceIPD - endBP - startBP)
 4  box w=0
 5  infinite penalty
 6  glue w=startBP

 total width = spaceIPD
 if break at #2 = endBP / startBP

 If we have two (or more) spaces, we could use the sequence:

 1  glue w=endBP
 2  penalty w=0
 3  glue w=(- endBP - startBP)
 4  glue w=spaceIPD1
 5  glue w=spaceIPD2
 6  box w=0
 7  infinite penalty
 8  glue w=startBP

 total width = spaceIPD1 + spaceIPD2
 if break at #2 = endBP / startBP
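
The two totals quoted above can be checked with a few lines of Python. This is only a sketch under a simplified Knuth model (after a break, glue and penalties are discarded up to the first box); the names Element, space_run, total_width and split_at are hypothetical, and the widths in the usage note are arbitrary.

```python
from dataclasses import dataclass

INFINITE = float("inf")


@dataclass(frozen=True)
class Element:
    kind: str           # "box", "glue" or "penalty"
    width: int
    value: float = 0.0  # penalty value; unused for boxes and glue


def space_run(space_widths, end_bp, start_bp):
    """Build the sequence for a run of spaces between two inline borders."""
    seq = [
        Element("glue", end_bp),
        Element("penalty", 0, 0),
        Element("glue", -end_bp - start_bp),
    ]
    seq += [Element("glue", w) for w in space_widths]
    seq += [
        Element("box", 0),
        Element("penalty", 0, INFINITE),
        Element("glue", start_bp),
    ]
    return seq


def total_width(seq):
    """Width of the run when no break is taken inside it."""
    return sum(e.width for e in seq)


def split_at(seq, index):
    """Widths left at the line end and at the next line start for a break at index."""
    before = sum(e.width for e in seq[: index + 1])
    rest = seq[index + 1:]
    # Glue and penalties after a break are discarded up to the first box.
    while rest and rest[0].kind != "box":
        rest = rest[1:]
    return before, sum(e.width for e in rest)
```

With space_run([3000, 2500], end_bp=500, start_bp=700) the total width is 5500, and a break at the penalty (element #2, list index 1) leaves 500 at the end of the line and 700 at the start of the next one.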

 Glues #4 and #5 have a Position pointing to different AreaInfo
 objects (from different LMs). This should solve (?) the case of
 ignore-if-surrounding.

Excellent, because ignore-if-surrounding is the only case we have to 
consider. For formatter generated line breaks this is the same as 
ignore-if-after... and ignore-if-before... because we control the 
position of the line break we can logically position it such that for 
the before and after cases we can remove the spaces. Therefore IMO we 
don't need any other Knuth sequences.

However, as these are integrated sequences we still have to carry info 
about this between LMs. This is for further study and suggestions are 
welcome.


 If white-space-treatment is ignore-if-after, and we have two
 consecutive spaces we could use the sequence:

 1  glue w=endBP
 2  penalty w=0
 3  glue w=(spaceIPD - endBP)
 4  penalty w=0
 5  glue w=(spaceIPD - startBP)
 6  box w=0
 7  infinite penalty
 8  glue w=startBP

 total width = 2 * spaceIPD
 if break at #2 = endBP / startBP
 if break at #4 = endBP + spaceIPD / startBP

 With three or more consecutive spaces:
 1  glue w=endBP
 2  penalty w=0
 3  glue w=(spaceIPD - endBP)
 4  penalty w=0
 5  glue w=spaceIPD
 6  penalty w=0
 7  glue w=(spaceIPD - startBP)
 8  box w=0
 9  infinite penalty
 10 glue w=startBP

 total width = 3 * spaceIPD
 if break at #2 = endBP / startBP
 if break at #4 = endBP + spaceIPD / startBP
 if break at #6 = endBP + 2 * spaceIPD / startBP

 I did not find a sequence for ignore-if-before yet ...

 Regards
 Luca

Cheers

Manuel

PS: I finally feel there is real progress made in this white space 
handling stuff :-)


Re: Leading/trailing space removal in LineLM

2005-11-03 Thread J.Pietschmann

Manuel Mall wrote:
Hmm, to me it appears that UNICODE and XSL-FO have slightly different 
models when it comes to white space in the context of line breaking 
which is causing the discussion here.


I don't think so. The overlap between UAX14 and XSLFO is that both
mandate a line break for each LF which survived the character
level refinement stage.
UAX14 is all about where an application might place a line break, and
where it shouldn't. The note that space at the end of a line is
usually discarded is just a note. There is absolutely nothing
in the record on how sequences of spaces should be handled.
XSLFO on the other hand doesn't specify any mechanism for finding line
breaking opportunities. It just says that a LF which is treated as a LF
should cause a line break, and leaves finding other positions to the
implementation.
As an example, let's take the following FO snippet (spaces denoted by
underlines for visibility):
 <fo:block>A_nice_word.</fo:block>
Provided all properties are at their default value, a processor
which produces the following layout

A
nice
word.

may claim conformance to both UAX14 and XSLFO 1.0.
If it produces

A ni
ce word.

it may claim conformance to XSLFO but not UAX14
If it produces

A_nice_
word.

it may claim conformance to UAX14 but not XSLFO because of the trailing
space in the first line.

If we want to 'marry' UNICODE linebreaking with XSL-FO white space 
handling we have this interaction to consider.


I still think that finding line break opportunities and handling white
space are different things, and can be handled nearly independently.
Note that white space removal around line breaks happens after a break
opportunity has actually been promoted to a real line break.

J.Pietschmann


Re: Leading/trailing space removal in LineLM

2005-11-03 Thread Andreas L Delmelle

On Nov 3, 2005, at 08:53, Manuel Mall wrote:


On Thu, 3 Nov 2005 06:03 am, J.Pietschmann wrote:


Computing line breaking opportunities and discarding whitespace at
the end (or beginning) of a line are different matters. If whitespace
has to be retained, trailing spaces after a non-space string may
simply mean the previous line breaking opportunity has to be used,
because otherwise the string including the trailing spaces will
overflow the line area. The trailing whitespace may also influence
text justification.



Hmm, to me it appears that UNICODE and XSL-FO have slightly different
models when it comes to white space in the context of line breaking
which is causing the discussion here. In UNICODE everything is based
simply on the properties of the codepoint in question and its
neighbour. In XSL-FO one can change the behaviour of a codepoint by
setting those white space related XSL-FO properties.


Hmm, apart from suppress-at-line-break (which is a more general  
property, not specific wrt whitespace), the whitespace-related  
properties only deal with XML whitespace (which is obviously not the  
same as Unicode whitespace, but a very small subset thereof).


During refinement, all whitespace other than U+0020, U+0009, U+000D  
and U+000A is left alone.
At that stage, it's only these four codepoints' behavior that can be  
influenced/changed by the three properties: white-space-treatment,  
linefeed-treatment and white-space-collapse.
This means that a sequence of nbsp-space-zwsp-space-nbsp should  
arrive in layout untouched (never collapsed).
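
As a rough Python illustration of that point: the sketch below collapses runs of the four XML white space characters only, and deliberately ignores how linefeed-treatment and white-space-treatment interact with collapsing. The function name is hypothetical.

```python
import re

# The four XML white space characters: space, tab, carriage return, linefeed.
XML_WS_RUN = re.compile("[\u0020\u0009\u000d\u000a]+")


def collapse_xml_whitespace(text):
    """Collapse each run of XML white space to one U+0020; leave all else alone."""
    return XML_WS_RUN.sub(" ", text)
```

A sequence such as nbsp-space-zwsp-space-nbsp comes out unchanged, because no two XML white space characters are adjacent in it.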


As Joerg points out, discarding whitespace at line-breaks and  
computing those line-breaks are two different issues.
If I get the intention correctly, we shouldn't be following Unicode  
UAX#14 wherever it mentions white-space-removal/-retaining around  
eventual breaks except for non-XML-whitespace (as we implement a  
Recommendation that, at least from our POV, supersedes what Unicode  
says about this). We're using UAX#14 only to determine the feasible/ 
most desirable (non-)breaks.


If UAX#14 always breaks at the end of a sequence of spaces, then this  
tells us only that doing so would use the most desirable break- 
opportunity. If anything, it seems to make the job less complicated,  
because this means that we will practically never have to consider  
cases of whitespace following a line-break, no? Only in case of  
explicit linefeed-treatment=preserve...
Correct me if I'm wrong, but such a space sequence would correspond  
to a Knuth element sequence with the break-before penalty gradually  
increasing and the break-after penalty decreasing for each  
consecutive space, such that, when the decision has to be made where  
to break, the break-after for the last space will be chosen if  
possible. A break before a space is feasible, but not preferable to  
breaking after it, breaking after the first space should be marked  
less preferable than breaking after the last one.
What happens with immediately preceding XML whitespace (or explicit  
fo:characters with overrides for default suppress-at-line-break), is  
then again determined by the white-space-treatment of the containing  
block. In this respect, the default rules are pretty simple: all  
glyph areas (or non-areas, which could still be relevant to possible  
FO extensions) for whitespace characters are retained, except regular  
spaces, or fo:characters with explicit suppress.


That is not a concept within UNICODE. If you want to retain white space
in UNICODE you use a different codepoint. If you want to retain a space
in XSL-FO you could use a different codepoint but more likely you set an
XSL-FO property if you want this applied widely in your document.

If we want to 'marry' UNICODE linebreaking with XSL-FO white space
handling we have this interaction to consider. One possible solution
would be to replace spaces (U+0020) by different codepoints which
resemble the behaviour modification imposed by any XSL-FO white space
handling properties in effect.


Not really 'replace' but 'treat-as-if' (generate a Knuth sequence  
analogous to codepoint ...)




But I am not sure if this can be done in
all cases. Otherwise we may have to modify the UNICODE line breaking
algorithm to cater for the XSL-FO white space specialities.


Hm. Same as Joerg, I also see these two operate at different levels.  
Don't know if I'm seeing this correctly, but the line breaking  
algorithm operates at the level where we have to decide the value of  
the penalties for breaking before or after a particular type of  
whitespace (or non-whitespace), whereas the white-space treatment  
refers to line-building, so what happens/has to happen in case the  
break is 'elected'.


Cheers,

Andreas


Re: Leading/trailing space removal in LineLM

2005-11-03 Thread Manuel Mall
On Fri, 4 Nov 2005 06:22 am, Andreas L Delmelle wrote:
 On Nov 3, 2005, at 08:53, Manuel Mall wrote:
  On Thu, 3 Nov 2005 06:03 am, J.Pietschmann wrote:

  But I am not sure if this can be done in
  all cases. Otherwise we may have to modify the UNICODE line
  breaking algorithm to cater for the XSL-FO white space
  specialities.

 Hm. Same as Joerg, I also see these two operate at different levels.
 Don't know if I'm seeing this correctly, but the line breaking
 algorithm operates at the level where we have to decide the value of
 the penalties for breaking before or after a particular type of
 whitespace (or non-whitespace), whereas the white-space treatment
 refers to line-building, so what happens/has to happen in case the
 break is 'elected'.

Andreas, Joerg, 
I don't think we actually disagree on anything here really. We are 
probably falling into the 'e-mail communications hole' where we talk 
about the same thing just using slightly different words / descriptions 
and in the end going in circles. I am convinced that if we sat around the
table we would sort this out in a few minutes and could then go on to more
fundamental issues like: German beer is really superior to Belgian
brews and of course prove that fact by extensive testing -:).

 Cheers,

 Andreas

Cheers

Manuel


Re: Leading/trailing space removal in LineLM

2005-11-02 Thread Luca Furini

Manuel Mall wrote:

So we end up with only two cases to consider: preserve white space and 
remove white space around a line break created by the Knuth algorithm.


1. Preserve white space: IMO in this case the space itself is actually 
not a break opportunity but there are now two break opportunities: one 
before the space and one after the space. That is, a sequence like 
'abc&#x20;def' is more like 'abc&#x200b;&#xa0;&#x200b;def' or in a more 
readable notation 'abc<zwsp><nbsp><zwsp>def'. That is, our normal space 
becomes a non-breakable space flanked by zero-width spaces which 
represent the break opportunities. If this is correct the Knuth 
elements would look like:

 glue w=0
 box w=0
 pen +INFINITE
 glue w=space
 pen
 glue w=0
Is this sequence correct? The first and last glue represent the zwsp 
and are break opportunities. The box prevents the removal of the space 
if a break is created before the space. The penalty prevents the space 
from being considered as a break opportunity.
Of course as usual these sequences are further complicated in the 
absence of justification and in the presence of border/padding.


I like your idea of expanding a preserved space into zwsps and nbsp; 
this allows us to forget alignments and borders / padding as we just have 
to insert the appropriate elements for the non breaking space.


The sequence is very good, as it has a couple of interesting properties:

- it interacts with the surrounding elements through just a single glue element

- if there are two (or more) consecutive, non-collapsed spaces the 
sequence has just 3 feasible breaks, not 4
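
The count of feasible breaks can be checked with a short Python sketch. It uses the usual Knuth rules (a break is legal at a finite penalty, or at a glue immediately preceded by a box) and assumes that the bare "pen" in the sequence is a finite penalty of value 0. That assumption, and the names El, preserved_space and legal_breaks, are mine and not taken from FOP.

```python
from dataclasses import dataclass

INFINITE = float("inf")


@dataclass(frozen=True)
class El:
    kind: str          # "box", "glue" or "penalty"
    width: int = 0
    value: float = 0   # penalty value; ignored for boxes and glue


def preserved_space(space_width):
    """The six-element sequence proposed for one preserved space."""
    return [
        El("glue"),                      # zwsp: break before the space
        El("box"),                       # keeps the space at a line start
        El("penalty", value=INFINITE),
        El("glue", space_width),         # the space itself
        El("penalty", value=0),          # bare "pen": assumed finite
        El("glue"),                      # zwsp after the space
    ]


def legal_breaks(elements):
    """Indices where a break is legal under the standard Knuth rules."""
    breaks = []
    for i, el in enumerate(elements):
        if el.kind == "penalty" and el.value < INFINITE:
            breaks.append(i)
        elif el.kind == "glue" and i > 0 and elements[i - 1].kind == "box":
            breaks.append(i)
    return breaks
```

Under these assumptions a single preserved space between two word boxes gives two legal breaks (before and after the space), and two consecutive preserved spaces give three. The break after a space falls on the finite penalty rather than on the trailing zero-width glue.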


However, I have a doubt: reading the Unicode document about line breaking, 
it seems to me that, regardless of the quantity of consecutive spaces, 
there is only *one* feasible break, after the last one (Unicode Standard 
Annex #14, section 2 Definitions, in particular the definition of 
direct break and indirect break)


--- begin quoted text ---

Direct Break - a line break opportunity exists between two adjacent 
characters of the given line breaking classes. This is indicated in the 
rules below as B ÷ A, where B is the character class of the character 
before and A is the character class of the character after the break. If 
they are separated by one or more space characters, a break opportunity 
also exists after the last space. In the pair table, the optional space 
characters are not shown.


Indirect Break - a line break opportunity exists between two characters of 
the given line breaking classes only if they are separated by one or more 
spaces. In this case, a break opportunity exists after the last space. No 
break opportunity exists if the characters are immediately adjacent. This 
is indicated in the pair table below as B % A, where B is the character 
class of the character before and A is the character class of the 
character after the break. Even though space characters are not shown in 
the pair table, an indirect break can only occur if one or more spaces 
follow B. In the notation of the rules in Section 6, Line Breaking 
Algorithm this would be represented as two rules: B × A and B SP+ ÷ A.


--- end quoted text ---

I still have not read the document from top to bottom, and I could have 
misunderstood even the sections I read :-), but I think this point must be 
clarified before we continue.


Regards
Luca



Re: Leading/trailing space removal in LineLM

2005-11-02 Thread Manuel Mall
On Wed, 2 Nov 2005 01:59 pm, Manuel Mall wrote:
 On Wed, 2 Nov 2005 04:18 am, Simon Pepping wrote:
  On Tue, Nov 01, 2005 at 11:40:42PM +0800, Manuel Mall wrote:
   This is probably a question for Luca or Simon.

 snip/

  Glue and penalty items are removed at the start of a line. This is
  part of the Knuth algorithm. It does not touch the matter of
  white-space-collapse. If there is whitespace that may not be
  removed/collapsed at the start of the line, it must be protected by
  a preceding zero-width box. I.o.w., the value of
  white-space-collapse needs to be taken into account at the phase of
  getNextKnuthElements.
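
A minimal Python sketch of that rule, with elements modelled as plain (kind, width) tuples; the function name and the widths are hypothetical.

```python
def strip_line_start(elements):
    """Drop leading glue and penalties; keep everything from the first box on.

    elements is a list of (kind, width) tuples, kind being "box", "glue"
    or "penalty". This models only the discarding step at a line start.
    """
    for i, (kind, _width) in enumerate(elements):
        if kind == "box":
            return elements[i:]
    return []
```

For [("glue", 300), ("box", 1800)] the glue is dropped, whereas [("box", 0), ("glue", 300), ("box", 1800)] is returned unchanged: the zero-width box protects the glue that follows it.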

 Fair enough - I need some help with the Knuth elements then.

 During getNextKnuth we need to only consider white-space-treatment as
 white-space-collapse can be handled completely during refinement,
 that is consecutive sequences of white space are either collapsed or
 not during refinement.

 We also can limit white-space-treatment during getNextKnuth to any
 line breaks generated by the line breaking algorithm (Knuth
 algorithm). white-space-treatment around hard line breaks (linefeeds,
 start/end of a block) are handled during refinement.

 We can also limit white-space-treatment during getNextKnuth to the
 values preserve vs ignore-if-...; other values are handled during
 refinement. We can also treat the three different ignore-if-...
 values, that is the values: ignore-if-before-linefeed,
 ignore-if-after-linefeed, ignore-if-surrounding-linefeed, as just one
 case: 'delete all white space around a formatter generated break'.

 So we end up with only two cases to consider: preserve white space
 and remove white space around a line break created by the Knuth
 algorithm.
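
The reduction to two cases can be summarised in a few lines of Python. The function name and the two case labels are hypothetical; the property values are those of white-space-treatment in XSL-FO.

```python
def layout_case(white_space_treatment):
    """Map a white-space-treatment value to the case seen by the line breaker.

    Returns "preserve" or "remove-around-break", or None for a value that
    is assumed to have been dealt with completely during refinement.
    """
    if white_space_treatment == "preserve":
        return "preserve"
    if white_space_treatment in (
        "ignore-if-before-linefeed",
        "ignore-if-after-linefeed",
        "ignore-if-surrounding-linefeed",
    ):
        return "remove-around-break"
    if white_space_treatment == "ignore":
        return None  # nothing left for the line breaker to do
    raise ValueError(f"unknown white-space-treatment: {white_space_treatment!r}")
```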

 1. Preserve white space: IMO in this case the space itself is
 actually not a break opportunity but there are now two break
 opportunities: one before the space and one after the space. That is,
 a sequence like 'abc&#x20;def' is more like
 'abc&#x200b;&#xa0;&#x200b;def' or in a more readable notation
 'abc<zwsp><nbsp><zwsp>def'. That is, our normal space becomes a
 non-breakable space flanked by zero-width spaces which represent the
 break opportunities. If this is correct the Knuth elements would look
 like:
 glue w=0
 box w=0
 pen +INFINITE
 glue w=space
 pen
 glue w=0
 Is this sequence correct? The first and last glue represent the
 zwsp and are break opportunities. The box prevents the removal of
 the space if a break is created before the space. The penalty
 prevents the space from being considered as a break opportunity.
 Of course as usual these sequences are further complicated in the
 absence of justification and in the presence of border/padding.

 2. Removal of white space: This is the current behaviour but it works
 only for a single space and not for a sequence of spaces. Actually
 because the algorithm removes leading glues/penalties it is mainly a
 problem for trailing white space. I am not sure how to best tackle
 this. What comes to mind is:

 a) Do the same as for leading glues/penalties at the end of the line.
 However I am not sure how tricky it would be to determine the
 boundary because any 'blocking boxes' (see 1. above) are only placed
 before but not after elements. This option suffers from the problem
 that it will not remove leading/trailing white space across inline
 boundaries with border/padding as these generate zero width boxes to
 block removal of the glue elements for the border/padding.

 b) Do not generate individual Knuth sequences for each white space
 character but instead collect all consecutive white space and create
 one glue-penalty sequence for it. Again I am uncertain of the
 consequences of doing that. To do that correctly we would need to
 collect white space across inline boundaries. This firstly breaks the
 current getNextKnuth approach which assumes each LM can generate its
 sequences without knowledge of its neighbours. It would also break
 the current area info structures as a single Knuth element could now
 refer to text snippets from different LMs.

 Comments please.

  Simon

 Thanks

Luca wrote a longer response to this but my mail reader doesn't like the 
character set (is that topical or what?). Anyway, at the end Luca asks 
the question about the UAX#14 line breaking algorithm and its handling 
of spaces. My answer to that is:
a) Yes, UAX#14 always breaks at the end of a sequence of spaces
b) But it also says that it assumes any trailing spaces in a line are 
being removed
This conflicts with XSL-FO, which can force spaces to be retained; 
therefore adjustments to the algorithm are necessary to cater for that. 
One possible adjustment is simply changing what is given to the 
algorithm as indicated above, i.e. <sp> becomes <zwsp><nbsp><zwsp>.

Manuel

 Manuel

In case other people have the same problem with Luca's post here is the 
content:
 Start Luca's e-mail +
I like your idea of expanding a preserved space into zwsps and nbsp;
this allows us to forget alignments and 

Re: Leading/trailing space removal in LineLM

2005-11-02 Thread Luca Furini

Manuel Mall wrote:

Luca wrote a longer response to this but my mail reader doesn't like the 
character set (is that topical or what?).


Sorry, it looks really horrible ... still don't know what went wrong, but 
I won't do it again! :-)


Anyway, at the end Luca asks the question about the UAX#14 line breaking 
algorithm and its handling of spaces. My answer to that is:

a) Yes, UAX#14 always breaks at the end of a sequence of spaces
b) But it also says that it assumes any trailing spaces in a line are 
being removed
This conflicts with XSL-FO, which can force spaces to be retained; 
therefore adjustments to the algorithm are necessary to cater for that. 
One possible adjustment is simply changing what is given to the 
algorithm as indicated above, i.e. <sp> becomes <zwsp><nbsp><zwsp>.


Ok, so back to your previous message:


2. Removal of white space: This is the current behaviour but it works
only for a single space and not for a sequence of spaces. Actually
because the algorithm removes leading glues/penalties it is mainly a
problem for trailing white space. I am not sure how to best tackle
this. What comes to mind is:

a) Do the same as for leading glues/penalties at the end of the line.
However I am not sure how tricky it would be to determine the boundary
because any 'blocking boxes' (see 1. above) are only placed before but
not after elements. This option suffers from the problem that it will
not remove leading/trailing white space across inline boundaries with
border/padding as these generate zero width boxes to block removal of
the glue elements for the border/padding.



b) Do not generate individual Knuth sequences for each white space
character but instead collect all consecutive white space and create
one glue-penalty sequence for it. Again I am uncertain of the
consequences of doing that. To do that correctly we would need to
collect white space across inline boundaries. This firstly breaks the
current getNextKnuth approach which assumes each LM can generate its
sequences without knowledge of its neighbours. It would also break the
current area info structures as a single Knuth element could now refer
to text snippets from different LMs.


I'm not sure I follow you in all the details of white space handling and 
here we have borders too ... :-)


I like b) most: after all, this is somewhat similar to the space 
resolution, as we have interactions between spaces coming from different 
nodes, and it's difficult to have each LM decide on its own. And I think 
we could find a way to keep the 1-1 relationship between AreaInfo objects 
and Positions.


I have tried to play with the elements, and here are a few results: I hope 
they can help!


At the moment, the sequence for a single space with borders and padding 
is:


1  glue w=endBP
2  penalty w=0
3  glue w=(spaceIPD - endBP - startBP)
4  box w=0
5  infinite penalty
6  glue w=startBP

total width = spaceIPD
if break at #2 = endBP / startBP

If we have two (or more) spaces, we could use the sequence:

1  glue w=endBP
2  penalty w=0
3  glue w=(- endBP - startBP)
4  glue w=spaceIPD1
5  glue w=spaceIPD2
6  box w=0
7  infinite penalty
8  glue w=startBP

total width = spaceIPD1 + spaceIPD2
if break at #2 = endBP / startBP

Glues #4 and #5 have a Position pointing to different AreaInfo objects 
(from different LMs). This should solve (?) the case of 
ignore-if-surrounding.


If white-space-treatment is ignore-if-after, and we have two consecutive 
spaces we could use the sequence:


1  glue w=endBP
2  penalty w=0
3  glue w=(spaceIPD - endBP)
4  penalty w=0
5  glue w=(spaceIPD - startBP)
6  box w=0
7  infinite penalty
8  glue w=startBP

total width = 2 * spaceIPD
if break at #2 = endBP / startBP
if break at #4 = endBP + spaceIPD / startBP

With three or more consecutive spaces:

1  glue w=endBP
2  penalty w=0
3  glue w=(spaceIPD - endBP)
4  penalty w=0
5  glue w=spaceIPD
6  penalty w=0
7  glue w=(spaceIPD - startBP)
8  box w=0
9  infinite penalty
10 glue w=startBP

total width = 3 * spaceIPD
if break at #2 = endBP / startBP
if break at #4 = endBP + spaceIPD / startBP
if break at #6 = endBP + 2 * spaceIPD / startBP

I did not find a sequence for ignore-if-before yet ...

Regards
   Luca


Re: Leading/trailing space removal in LineLM

2005-11-02 Thread Simon Pepping
On Wed, Nov 02, 2005 at 04:58:09PM +0100, Luca Furini wrote:
 Manuel Mall wrote:
 
 Luca wrote a longer response to this but my mail reader doesn't like the 
 character set (is that topical or what?).
 
 Sorry, it looks really horrible ... still don't know what went wrong, but 
 I won't do it again! :-)

It is in the quoted-printable format, probably due to non-ascii
or non-latin-1 characters in it, the TR14 symbols. 

Simon

-- 
Simon Pepping
home page: http://www.leverkruid.nl



Re: Leading/trailing space removal in LineLM

2005-11-02 Thread J.Pietschmann

Manuel Mall wrote:

a) Yes UAX#14 always breaks at the end of a sequence of spaces
b) But it also says that it assumes any trailing spaces in a line are 
being removed
This conflicts with XSL-FO which can force spaces being retained 
therefore adjustments to the algorithm are necessary to cater for that. 


Computing line breaking opportunities and discarding whitespace at the
end (or beginning) of a line are different matters. If whitespace has
to be retained, trailing spaces after a non-space string may simply mean
the previous line breaking opportunity has to be used, because otherwise
the string including the trailing spaces will overflow the line area.
The trailing whitespace may also influence text justification.

J.Pietschmann



Re: Leading/trailing space removal in LineLM

2005-11-02 Thread Manuel Mall
On Thu, 3 Nov 2005 06:03 am, J.Pietschmann wrote:
 Manuel Mall wrote:
  a) Yes UAX#14 always breaks at the end of a sequence of spaces
  b) But it also says that it assumes any trailing spaces in a line
  are being removed
  This conflicts with XSL-FO which can force spaces being retained
  therefore adjustments to the algorithm are necessary to cater for
  that.

 Computing line breaking opportunities and discarding whitespace at
 the end (or beginning) of a line are different matters. If whitespace
 has to be retained, trailing spaces after a non-space string may
 simply mean the previous line breaking opportunity has to be used,
 because otherwise the string including the trailing spaces will
 overflow the line area. The trailing whitespace may also influence
 text justification.

Hmm, to me it appears that UNICODE and XSL-FO have slightly different 
models when it comes to white space in the context of line breaking 
which is causing the discussion here. In UNICODE everything is based 
simply on the properties of the codepoint in question and its 
neighbour. In XSL-FO one can change the behaviour of a codepoint by 
setting those white space related XSL-FO properties. That is not a 
concept within UNICODE. If you want to retain white space in UNICODE 
you use a different codepoint. If you want to retain a space in XSL-FO 
you could use a different codepoint but more likely you set an XSL-FO 
property if you want this applied widely in your document.

If we want to 'marry' UNICODE linebreaking with XSL-FO white space 
handling we have this interaction to consider. One possible solution 
would be to replace spaces (U+0020) by different codepoints which 
resemble the behaviour modification imposed by any XSL-FO white space 
handling properties in effect. But I am not sure if this can be done in 
all cases. Otherwise we may have to modify the UNICODE line breaking 
algorithm to cater for the XSL-FO white space specialities.
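
As a rough illustration of the codepoint replacement idea, the substitution mentioned earlier in the thread (a retained space handed to the algorithm as ZWSP, NBSP, ZWSP) could look like the sketch below. The class and method names are hypothetical, and it only covers the case where every U+0020 is to be retained; it does not attempt the other white space property combinations.

```java
// Illustration only; protectSpaces is a made-up helper, not part of FOP.
final class PreservedSpaceSketch {

    static final String ZWSP = "\u200B"; // zero width space: break opportunity, no width
    static final String NBSP = "\u00A0"; // no-break space: keeps its width, no break

    // Replaces each U+0020 by ZWSP NBSP ZWSP so that the space can no longer
    // be discarded at a line end, while a break stays possible on either side.
    static String protectSpaces(String text) {
        return text.replace(" ", ZWSP + NBSP + ZWSP);
    }

    public static void main(String[] args) {
        String out = protectSpaces("a  b");
        // Print the code points so the invisible characters can be inspected.
        out.codePoints().forEach(cp -> System.out.printf("U+%04X%n", cp));
    }
}
```
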

 J.Pietschmann

Manuel


Re: Leading/trailing space removal in LineLM

2005-11-01 Thread Simon Pepping
On Tue, Nov 01, 2005 at 11:40:42PM +0800, Manuel Mall wrote:
 This is probably a question for Luca or Simon.
 
 In LineLM we have this code:
 // ignore KnuthGlue and KnuthPenalty objects
 // at the beginning of the line
 seqIterator = seq.listIterator(iStartElement);
 tempElement = (KnuthElement) seqIterator.next();
 while (!tempElement.isBox() && seqIterator.hasNext()) {
 tempElement = (KnuthElement) seqIterator.next();
 iStartElement++;
 }
 What is the background to this? This seems to interfere with certain 
 combinations of white-space-collapse=false and 
 white-space-treatment=preserve/ignore-if-before-linefeed. I think 
 there is similar code to remove trailing stuff with similar 
 interference.

Glue and penalty items are removed at the start of a line. This is
part of the Knuth algorithm. It does not touch the matter of
white-space-collapse. If there is whitespace that may not be
removed/collapsed at the start of the line, it must be protected by a
preceding zero-width box. I.o.w., the value of white-space-collapse
needs to be taken into account at the phase of getNextKnuthElements.
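
A small sketch of that effect, using a simplified list of element kinds instead of real FOP classes (LeadingSkipSketch and firstBoxIndex are hypothetical names): leading glue and penalty items are skipped until the first box, so a zero-width box placed in front of a glue makes that glue survive at the line start.

```java
import java.util.Arrays;
import java.util.List;

// Illustration only: a simplified stand-in for the element list, not FOP code.
final class LeadingSkipSketch {

    enum Kind { BOX, GLUE, PENALTY }

    // Index of the first box at or after 'start'; seq.size() if there is none.
    // Everything before that index is what gets dropped at the line start.
    static int firstBoxIndex(List<Kind> seq, int start) {
        int i = start;
        while (i < seq.size() && seq.get(i) != Kind.BOX) {
            i++;
        }
        return i;
    }

    public static void main(String[] args) {
        // Unprotected: the glue and penalty are skipped, the line starts at the box.
        List<Kind> unprotected = Arrays.asList(Kind.GLUE, Kind.PENALTY, Kind.BOX);
        // Guarded: a zero-width box comes first, so the glue after it is kept.
        List<Kind> guarded = Arrays.asList(Kind.BOX, Kind.GLUE, Kind.PENALTY, Kind.BOX);
        System.out.println(firstBoxIndex(unprotected, 0));
        System.out.println(firstBoxIndex(guarded, 0));
    }
}
```

Unlike the loop quoted above, this version returns the list size when no box is found instead of stopping on the last element; that is a choice made for the sketch, not a statement about the LineLM code.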

Simon

-- 
Simon Pepping
home page: http://www.leverkruid.nl