Re: Leading/trailing space removal in LineLM
On Sat, 5 Nov 2005 12:05 am, Luca Furini wrote:

Manuel Mall wrote:

Here are some of the combinations I have identified:

1. Non breaking / non elastic space
   = probably just a normal character, i.e. part of a word.
2. Non breaking / elastic space - e.g. U+00A0 Non breaking space
   = Must prevent break
   = Must handle text-align
3. Break / non elastic - e.g. U+200B ZWSP, any other break between two characters not involving adding or removing space/characters
   = Must handle border/padding
   = Must handle text-align
4. Break / non elastic / remove if not break - e.g. U+00AD Soft hyphen
   = Must remove if not at break
   = Must handle border/padding
   = Must handle text-align
5. Break / non elastic / add character if break - e.g. hyphenation
   = Must add space for hyphen if at break
   = Must handle border/padding
   = Must handle text-align
6. Breaking / elastic / non removable - e.g. U+3000 Ideographic space
   = Must handle border/padding
   = Must handle text-align
   Question: XSL-FO does not define U+3000 as removable white space, but would it be removed at a line break under common CJK typesetting conventions?
7. Breaking / elastic / removable - e.g. U+0020 Space
   = Can occur in runs which must be wholly removed
   = Must handle border/padding
   = Must handle text-align

Any combinations I have missed, e.g. is there a break / non elastic / remove at break case?

I moved all this to a Wiki page with the actual Knuth sequences (http://wiki.apache.org/xmlgraphics-fop/LineBreaking). Please review / check!

Maybe the fixed width spaces?

Yes - maybe.

Anyway, it seems an exhaustive analysis of the problem! Just a few comments / thoughts:

- non breaking, non elastic: the simple solution would be to handle these characters as normal letters, so the text before_after (where _ is zwnbsp) would create a single AreaInfo object in the TextLM; but this would create problems during hyphenation, as non-letter characters in the middle of a word ATM prevent hyphenation

I think word breaking, i.e. determining the word boundaries for the purpose of hyphenation, and line breaking are not 100% coupled. There are actually different Unicode documents describing each. Therefore, down the track, I don't see treating these as normal characters for the purpose of line breaking as being a problem, as the word breaking would be done maybe in parallel but logically separate. We also most likely want Knuth box elements covering the largest possible extent of consecutive characters because a) it saves resources and b) as the widths of Knuth elements are the basis of determining what fits on a line, if we ever look into kerning the calculations would need to be done on a per Knuth box element basis.

- soft hyphen: at the moment it is not properly handled, but it won't be difficult to fix the implementation; it could create the same elements used for a hyphenation point, but the penalty could have a negative value (as probably users would use it to suggest a desired line break); note that a word with a soft hyphen in its middle would not be hyphenated, unless we ignore this character when collecting word fragments

I thought we simply delete the soft-hyphen character and generate a normal break with hyphen Knuth sequence at that point.

Regards
Luca

Manuel
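The combinations listed in this message can be written down as a small lookup table. The following is only a sketch in Python (not FOP code): the class name `SpaceClass`, its three flags and the function `classify` are hypothetical names chosen for illustration, and only the code points actually named in the list are included.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SpaceClass:
    """Hypothetical record for one entry of the list above."""
    breaking: bool   # may a line break occur at this character?
    elastic: bool    # may its width be adjusted for text-align?
    removable: bool  # is it dropped when a break is taken here?
    note: str = ""


# Only the code points named in the thread; everything else is "ordinary".
CLASSES = {
    "\u00A0": SpaceClass(False, True, False, "no-break space (case 2)"),
    "\u200B": SpaceClass(True, False, False, "zero width space (case 3)"),
    "\u00AD": SpaceClass(True, False, False,
                         "soft hyphen (case 4): removed if NOT at a break"),
    "\u3000": SpaceClass(True, True, False, "ideographic space (case 6)"),
    "\u0020": SpaceClass(True, True, True, "space (case 7)"),
}


def classify(ch):
    """Return the SpaceClass for ch, or None for an ordinary character."""
    return CLASSES.get(ch)
```

A table like this says nothing about borders, padding or the Knuth elements themselves; it only records the three yes/no properties the list is organised around.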
Re: Leading/trailing space removal in LineLM
On Wed, 9 Nov 2005 12:47 am, Andreas L Delmelle wrote: On Nov 4, 2005, at 17:05, Luca Furini wrote: Hi Manuel / Luca, Manuel Mall wrote: Here are some of the combinations I have identified: snip / 6. Breaking / elastic / non removable - eg. U+3000 Ideographic space = Must handle border/padding = Must handle text-align Question: XSL-FO does not define U+3000 as removable white space but would under common CJK typesetting conventions this be removed at a line break? I think so. That's precisely what the definition for the auto value of suppress-at-line-break warns about. Does this mean that the use of a fo:character is mandated if the user wants it removed? Yes, IMO. Unless the editors can be persuaded to make U+3000 an exception to the default retain, like common spaces (U+0020), compliance means treating this character maybe a bit counter-intuitively. 7. Breaking / elastic / removable - eg. U+0020 Space = Can occur in runs which must be wholly removed = Must handle border/padding = Must handle text-align Any combinations I have missed, e.g. is there a break / non elastic / remove at break case? Maybe the fixed width spaces? More generally: any fixed-width character, added through a fo:character, implying a feasible/favorable break before or after, and having suppress-at-line-break=suppress. I could put: fo:character character=a suppress-at-line-break=suppress / in a document, surrounded by non-collapsible whitespace, and the formatter may decide to break before/after and drop the 'a'. Fixed-width spaces could be viewed as a subset. If they aren't added via a fo:character, they would belong to category 'break - non- elastic - non-removable'. (speaking strictly XSL-FO) Andreas, I tend to disagree with the basic sentiment express here. 
If we accept Simon's notion that white space handling in XSL-FO is about dealing with spaces and linefeeds introduced by editors or humans for XML readability purposes then dealing with typographic conventions of particular scripts has nothing to do with the rules of white space handling. XSL-FO in quite a few places mentions user agent flexibility when it comes to dealing with script / language / country specific items. If we can, as Joerg suggests, replace a base letter followed by a combining diacritical mark with a matching combined glyph, why can't we replace an ideographic space followed by a line break with simply the line break? The point being I am not suggesting to remove the ideographic space under the XSL-FO white space rules but under 'script specific typographic conventions'. And I believe there is nothing in the spec which prohibits this - quite the opposite actually - the spec IMO encourages 'intelligent' handling of 'local customs'. Of course I don't know what the CJK typographic conventions are so this is all a bit hypothetical. Cheers, Andreas Regards Manuel
Re: Leading/trailing space removal in LineLM
On Wed, 9 Nov 2005 12:32 pm, Andreas L Delmelle wrote:

On Nov 9, 2005, at 02:09, Manuel Mall wrote:

Manuel,

<snip/>

We're (again) more in agreement than we realize, I think... Although, now you got me wondering what you think is my 'basic sentiment' :-)

After reading your post = yes, we are once again in agreement. And don't worry about the sentiment stuff - just a minor distraction :-).

<snip/>

Well, I can't find the exact reference (maybe one of the earlier posts in this thread), but I seem to remember that the ideographic space can only shrink, not expand. Following that, I would say that there is little difference between suppressing a character and shrinking it to zero width. Maybe, since it needs to be shrinkable anyway, it could be treated along that line?

UAX#14 I think is the reference.

<snip/>

Cheers,
Andreas

Manuel
Re: Leading/trailing space removal in LineLM
Luca Furini wrote:

note that a word with a soft hyphen in its middle would not be hyphenated, unless we ignore this character when collecting word fragments

Well, in order to prepare for hyphenation, other characters like joiners have to be removed too. We should probably also use Unicode normalization.

J.Pietschmann
Re: Leading/trailing space removal in LineLM
On Wed, 2 Nov 2005 11:58 pm, Luca Furini wrote: Manuel Mall wrote: Luca wrote a longer response to this but my mail reader doesn't like the character set (is that topical or what?). Sorry, it looks really horrible ... still don't know what went wrong, but I won't do it again! :-) Any way at end Luca ask the question about the UAX#14 line breaking algorithm and its handling of spaces. My answer to that is: a) Yes UAX#14 always breaks at the of a sequence of spaces b) But is also says that it assumes any trailing spaces in a line are being removed This conflicts with XSL-FO which can force spaces being retained therefore adjustments to the algorithm are necessary to cater for that. One possible adjustment is simply changing what is given to the algorithm as indicated above, ie sp becomes zwspnbspzwsp. Ok, so back to your previous message: 2. Removal of white space: This is the current behaviour but it works only for a single space and not for a sequence of spaces. Actually because the algorithm removes leading glues/penalties it is mainly a problem for trailing white space. I am not sure how to best tackle this. What comes to mind is: a) Do the same as for leading glues/penalties at the end of the line. However I am not sure how tricky it would be to determine the boundary because any 'blocking boxes' (see 1. above) are only placed before but not after elements. This options suffers from the problem that it will not remove leading/trailing white space across inline boundaries with border/padding as these generate zero width boxes to block removal of the glue elements for the border/padding. b) Do not generate individual Knuth sequences for each white space character but instead collect all consecutive white space and create one glue-penalty sequence for it. Again I am uncertain of the consequences of doing that. To do that correctly we would need to collect white space across inline boundaries. 
This firstly breaks the current getNextKnuth approach which assumes each LM can generate its sequences without knowledge of its neighbours. It would also break the current area info structures as a single Knuth element could now refer to text snippets from different LMs. I'm not sure I follow you in all the details of white space handling and here we have borders too ... :-) I like b) most: after all, this is somewhat similar to the space resolution, as we have interactions between spaces coming from different nodes, and it's difficult to have each LM decide on its own. And I think we could find a way to keep the 1-1 relationship between AreaInfo objects and Positions. I have tried to play with the elements, and here are a few results: I hope they can help! At the moments, the sequence for a single space with borders and padding is: 1 glue w=endBP 2 penalty w=0 3 glue w=(spaceIPD - endBP - startBP) 4 box w=0 5 infinite penalty 6 glue w=startBP total width = spaceIPD if break at #2 = endBP / startBP If we have two (or more) spaces, we could use the sequence: 1 glue w=endBP 2 penalty w=0 3 glue w=(- endBP - startBP) 4 glue w=spaceIPD1 5 glue w=spaceIPD2 6 box w=0 7 infinite penalty 8 glue w=startBP total width = spaceIPD1 + spaceIPD2 if break at #2 = endBP / startBP Glues #4 and #5 have a Position pointing to different AreaInfo objects (from different LMs). This should solve (?) the case of ignore-if-surrounding. Excellent, because ignore-if-surrounding is the only case we have to consider. For formatter generated line breaks this is the same as ignore-if-after... and ignore-if-before... because we control the position of the line break we can logically position it such that for the before and after cases we can remove the spaces. Therefore IMO we don't need any other Knuth sequences. However, as these are integrated sequences we still have to carry info about this between LMs. This is for further study and suggestions are welcome. 
If white-space-treatment is ignore-if-after, and we have two consecutive spaces we could use the sequence: 1 glue w=endBP 2 penalty w=0 3 glue w=(spaceIPD - endBP) 4 penalty w=0 5 glue w=(spaceIPD - startBP) 6 box w=0 7 infinite penalty 8 glue w=startBP total width = 2 * spaceIPD if break at #2 = endBP / startBP if break at #4 = endBP + spaceIPD / startBP With three or more consecutive spaces: 1 glue w=endBP 2 penalty w=0 3 glue w=(spaceIPD - endBP) 4 penalty w=0 5 glue w=spaceIPD 6 penalty w=0 7 glue w=(spaceIPD - startBP) 8 box w=0 9 infinite penalty 10 glue w=startBP total width = 3 * spaceIPD if break at #2 = endBP / startBP if break at #4 = endBP + spaceIPD / startBP if break at #6 = endBP + 2 * spaceIPD / startBP I did not find a sequence for ignore-if-before yet ... Regards Luca Cheers Manuel PS: I finally feel there is real progress made in this white space handling stuff :-)
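The "total width" figures quoted for Luca's sequences can be checked mechanically. The sketch below is plain Python, not FOP code: the `Element` class and the function names are hypothetical, and the penalty value of the "penalty w=0" elements is assumed to be 0 because the thread only gives their width.

```python
from dataclasses import dataclass

INFINITE = float("inf")


@dataclass(frozen=True)
class Element:
    kind: str      # "glue", "penalty" or "box"
    w: int = 0     # natural width
    p: float = 0   # penalty value (assumed 0 where the thread gives none)


def single_space(space_ipd, start_bp, end_bp):
    """The six-element sequence for one space with border/padding."""
    return [
        Element("glue", end_bp),
        Element("penalty"),
        Element("glue", space_ipd - end_bp - start_bp),
        Element("box"),
        Element("penalty", p=INFINITE),
        Element("glue", start_bp),
    ]


def ignore_if_after(n, space_ipd, start_bp, end_bp):
    """The ignore-if-after sequence for n >= 2 consecutive spaces."""
    if n < 2:
        raise ValueError("needs at least two spaces")
    seq = [Element("glue", end_bp), Element("penalty"),
           Element("glue", space_ipd - end_bp)]
    for _ in range(n - 2):  # one penalty/glue pair per "middle" space
        seq += [Element("penalty"), Element("glue", space_ipd)]
    seq += [Element("penalty"), Element("glue", space_ipd - start_bp),
            Element("box"), Element("penalty", p=INFINITE),
            Element("glue", start_bp)]
    return seq


def natural_width(seq):
    """Width of the sequence when no break is taken inside it."""
    return sum(e.w for e in seq)
```

With any choice of spaceIPD, startBP and endBP, `natural_width` of the unbroken sequences comes out as spaceIPD for the single space and n * spaceIPD for n spaces, which is the invariant the messages rely on. The sketch does not model stretch/shrink or what happens to the elements once a break is chosen.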
Re: Leading/trailing space removal in LineLM
Manuel Mall wrote:

Hmm, to me it appears that UNICODE and XSL-FO have slightly different models when it comes to white space in the context of line breaking which is causing the discussion here.

I don't think so. The overlap between UAX14 and XSLFO is that both mandate a line break for each LF which survived the character level refinement stage. UAX14 is all about where an application might place a line break, and where it shouldn't. The remark that space at the end of a line is usually discarded is just a remark. There is absolutely nothing in the record on how sequences of spaces should be handled. XSLFO on the other hand doesn't specify any mechanism for finding line breaking opportunities. It just says that a LF which is treated as a LF should cause a line break, and leaves finding other positions to the implementation.

As an example, let's take the following FO snippet (spaces denoted by underscores for visibility):

<fo:block>A_nice_word.</fo:block>

Provided all properties are at their default value, a processor which produces the following layout

A nice word.

may claim conformance to both UAX14 and XSLFO 1.0. If it produces

A ni ce word.

it may claim conformance to XSLFO but not UAX14. If it produces

A_nice_ word.

it may claim conformance to UAX14 but not XSLFO, because of the trailing space in the first line.

If we want to 'marry' UNICODE linebreaking with XSL-FO white space handling we have this interaction to consider.

I still think that finding line break opportunities and handling white space are different things, and can be handled nearly independently. Note that white space removal around line breaks happens after a break opportunity has actually been promoted to a real line break.

J.Pietschmann
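The closing point of this message, that white space is removed only after a break opportunity has become a real break, can be illustrated with a two-step sketch. This is plain Python for illustration only; `build_lines` is a hypothetical helper, the break offsets are assumed to come from some separate line breaking step, and only U+0020 is treated as removable.

```python
def build_lines(text, breaks):
    """Split text at already-chosen break offsets, then drop the spaces
    that end up directly before or after each break."""
    # Step 1: cut the text at the offsets the line breaker decided on.
    pieces, start = [], 0
    for pos in list(breaks) + [len(text)]:
        pieces.append(text[start:pos])
        start = pos
    # Step 2: only now remove white space around the real breaks.
    lines = []
    for i, piece in enumerate(pieces):
        if i > 0:
            piece = piece.lstrip(" ")   # space following a break
        if i < len(pieces) - 1:
            piece = piece.rstrip(" ")   # space preceding a break
        lines.append(piece)
    return lines
```

For the snippet above, a break chosen after the second space (offset 7) and a break chosen before it (offset 6) both yield the lines "A nice" and "word.", which is why the two concerns can be handled nearly independently.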
Re: Leading/trailing space removal in LineLM
On Nov 3, 2005, at 08:53, Manuel Mall wrote: On Thu, 3 Nov 2005 06:03 am, J.Pietschmann wrote: Computing line breaking opportunities and discarding whitespace at the end (or beginning) of a line are different matters. If whitespace has to be retained, trailing spaces after a non-space string may simply mean the previous line breaking opportunity has to be used, because otherwise the string including the trailing spaces will overflow the line area. The trailing whitespace may also influence text justification. Hmm, to me it appears that UNICODE and XSL-FO have slightly different models when it comes to white space in the context of line breaking which is causing the discussion here. In UNICODE everything is based simply on the properties of the codepoint in question and its neighbour. In XSL-FO one can change the behaviour of a codepoint by setting those white space related XSL-FO properties. Hmm, apart from suppress-at-line-break (which is a more general property, not specific wrt whitespace), the whitespace-related properties only deal with XML whitespace (which is obviously not the same as Unicode whitespace, but a very small subset thereof). During refinement, all whitespace other than U+0020, U+0009, U+000D and U+000A is left alone. At that stage, it's only these four codepoints' behavior that can be influenced/changed by the three properties: white-space-treatment, linefeed-treatment and white-space-collapse. This means that a sequence of nbsp-space-zwsp-space-nbsp should arrive in layout untouched (never collapsed). As Joerg points out, discarding whitespace at line-breaks and computing those line-breaks are two different issues. If I get the intention correctly, we shouldn't be following Unicode UAX#14 wherever it mentions white-space-removal/-retaining around eventual breaks except for non-XML-whitespace (as we implement a Recommendation that, at least from our POV, supersedes what Unicode says about this). 
We're using UAX#14 only to determine the feasible/ most desirable (non-)breaks. If UAX#14 always breaks at the end of a sequence of spaces, then this tells us only that doing so would use the most desirable break- opportunity. If anything, it seems to make the job less complicated, because this means that we will practically never have to consider cases of whitespace following a line-break, no? Only in case of explicit linefeed-treatment=preserve... Correct me if I'm wrong, but such a space sequence would correspond to a Knuth element sequence with the break-before penalty gradually increasing and the break-after penalty decreasing for each consecutive space, such that, when the decision has to be made where to break, the break-after for the last space will be chosen if possible. A break before a space is feasible, but not preferable to breaking after it, breaking after the first space should be marked less preferable than breaking after the last one. What happens with immediately preceding XML whitespace (or explicit fo:characters with overrides for default suppress-at-line-break), is then again determined by the white-space-treatment of the containing block. In this respect, the default rules are pretty simple: all glyph areas (or non-areas, which could still be relevant to possible FO extensions) for whitespace characters are retained, except regular spaces, or fo:characters with explicit suppress. That is not a concept within UNICODE. If you want to retain white space in UNICODE you use a different codepoint. If you want to retain a space in XSL-FO you could use a different codepoint but more likely you set a XSL-FO property if you want this applied widely in your document. If we want to 'marry' UNICODE linebreaking with XSL-FO white space handling we have this interaction to consider. 
One possible solution would be to replace spaces (U+0020) by different codepoints which resemble the behaviour modification imposed by any XSL-FO white space handling properties in effect. Not really 'replace' but 'treat-as-if' (generate a Knuth sequence analogous to codepoint ...) But I am not sure if this can be done in all cases. Otherwise we may have to modify the UNICODE line breaking algorithm to cater for the XSL-FO white space specialities. Hm. Same as Joerg, I also see these two operate at different levels. Don't know if I'm seeing this correctly, but the line breaking algorithm operates at the level where we have to decide the value of the penalties for breaking before or after a particular type of whitespace (or non-whitespace), whereas the white-space treatment refers to line-building, so what happens/has to happen in case the break is 'elected'. Cheers, Andreas
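The "treat-as-if" idea for preserved spaces (a space behaving like zwsp + nbsp + zwsp, as proposed earlier in the thread) can be shown at the string level. This is a sketch only: as noted above, the actual proposal is to generate analogous Knuth sequences rather than to rewrite the text, and `expand_preserved_spaces` is a hypothetical name.

```python
ZWSP = "\u200B"  # zero width space: a break opportunity with no width
NBSP = "\u00A0"  # no-break space: keeps its width, offers no break


def expand_preserved_spaces(text):
    """Model each preserved U+0020 as ZWSP + NBSP + ZWSP, so that the
    break opportunities sit before and after the space, not on it."""
    return text.replace(" ", ZWSP + NBSP + ZWSP)
```

Applied to 'abc def' this gives 'abc', ZWSP, NBSP, ZWSP, 'def', i.e. the 'abczwspnbspzwspdef' form used in the discussion.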
Re: Leading/trailing space removal in LineLM
On Fri, 4 Nov 2005 06:22 am, Andreas L Delmelle wrote:

On Nov 3, 2005, at 08:53, Manuel Mall wrote:

On Thu, 3 Nov 2005 06:03 am, J.Pietschmann wrote:

But I am not sure if this can be done in all cases. Otherwise we may have to modify the UNICODE line breaking algorithm to cater for the XSL-FO white space specialities.

Hm. Same as Joerg, I also see these two operate at different levels. Don't know if I'm seeing this correctly, but the line breaking algorithm operates at the level where we have to decide the value of the penalties for breaking before or after a particular type of whitespace (or non-whitespace), whereas the white-space treatment refers to line-building, so what happens/has to happen in case the break is 'elected'.

Andreas, Joerg,

I don't think we actually disagree on anything here really. We are probably falling into the 'e-mail communications hole' where we talk about the same thing just using slightly different words / descriptions and in the end going in circles.

I am convinced that if we sat around a table we would sort this out in a few minutes and could then go on to more fundamental issues like: German beer is really superior to Belgian brews, and of course prove that fact by extensive testing -:).

Cheers,
Andreas

Cheers
Manuel
Re: Leading/trailing space removal in LineLM
Manuel Mall wrote: So we end up with only two cases to consider: preserve white space and remove white space around a line break created by the Knuth algorithm. 1. Preserve white space: IMO in this case the space itself is actually not a break opportunity but there are now two break opportunities: one before the space and one after the space. That is a sequence like 'abc#x20;def' is more like 'abc#x200b;#xa0;#x200b;def' or in a more readable notation 'abczwspnbspzwspdef'. That is our normal space becomes a non-breakable space flanked by zero-width spaces which represent the break opportunities. If this is correct the Knuth elements would look like: glue w=0 box w=0 pen +INFINITE glue w=space pen glue w=0 Is this sequence correct? The first and last glue represent the zwsp and are break opportunities. The box prevents the removal of the space if a break is created before the space. The penalty prevents the space to be considered as a break opportunity. Of course as usual these sequences are further complicated in the absence of justification and in the presence of border/padding. I like your idea of expanding a preserved space into zwsps and nbsp; this allows us to forget alignments and borders / padding as we just have to insert the appropriate elements for the non breaking space. 
The sequence is very good, as it has a couple of interesting properties:

- it interacts with the surrounding elements through just a single glue element
- if there are two (or more) consecutive, non-collapsed spaces the sequence has just 3 feasible breaks, not 4

However, I have a doubt: reading the Unicode document about line breaking, it seems to me that, regardless of the quantity of consecutive spaces, there is only *one* feasible break, after the last one (Unicode Standard Annex #14, section 2 Definitions, in particular the definition of direct break and indirect break)

--- begin quoted text ---

Direct Break - a line break opportunity exists between two adjacent characters of the given line breaking classes. This is indicated in the rules below as B ÷ A, where B is the character class of the character before and A is the character class of the character after the break. If they are separated by one or more space characters, a break opportunity also exists after the last space. In the pair table, the optional space characters are not shown.

Indirect Break - a line break opportunity exists between two characters of the given line breaking classes only if they are separated by one or more spaces. In this case, a break opportunity exists after the last space. No break opportunity exists if the characters are immediately adjacent. This is indicated in the pair table below as B % A, where B is the character class of the character before and A is the character class of the character after the break. Even though space characters are not shown in the pair table, an indirect break can only occur if one or more spaces follow B. In the notation of the rules in Section 6, Line Breaking Algorithm this would be represented as two rules: B × A and B SP+ ÷ A.

--- end quoted text ---

I still have not read the document from top to bottom, and I could have misunderstood even the sections I read :-), but I think this point must be clarified before we continue.

Regards
Luca
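The reading of UAX#14 given in this message, one break opportunity per run of spaces located after the last space, can be sketched as follows. This is a deliberately narrow Python illustration: it looks only at U+0020 and ignores every other line breaking class and the pair table, and `breaks_after_space_runs` is a hypothetical name.

```python
def breaks_after_space_runs(text):
    """Return the offsets at which a line may start, taking exactly one
    opportunity per run of spaces: directly after its last space."""
    opportunities = []
    for i in range(1, len(text)):
        # A run of spaces ends where a space is followed by a non-space.
        if text[i - 1] == " " and text[i] != " ":
            opportunities.append(i)
    return opportunities
```

However many spaces separate two words, the function reports a single offset for that gap, which is the contrast with the three feasible breaks of the zwsp/nbsp/zwsp expansion discussed above.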
Re: Leading/trailing space removal in LineLM
On Wed, 2 Nov 2005 01:59 pm, Manuel Mall wrote: On Wed, 2 Nov 2005 04:18 am, Simon Pepping wrote: On Tue, Nov 01, 2005 at 11:40:42PM +0800, Manuel Mall wrote: This is probably a question for Luca or Simon. snip/ Glue and penalty items are removed at the start of a line. This is part of the Knuth algorithm. It does not touch the matter of white-space-collapse. If there is whitespace that may not be removed/collapsed at the start of the line, it must be protected by a preceding zero-width box. I.o.w., the value of white-space-collapse needs to be taken into account at the phase of getNextKnuthElements. Fair enough - I need some help with the Knuth elements then. During getNextKnuth we need to only consider white-space-treatment as white-space-collapse can be handled completely during refinement, that is consecutive sequences of white space are either collapsed or not during refinement. We also can limit white-space-treatment during getNextKnuth to any line breaks generated by the line breaking algorithm (Knuth algorithm). white-space-treatment around hard line breaks (linefeeds, start/end of a block) are handled during refinement. We can also limit white-space-treatment during getNextKnuth to the values preserve vs ignore-if Other values are handled during refinement. We also can treat the three different ignore-if... values, that is the values: ignore-if-before-linefeed, ignore-if-after-linefeed, ignore-if-surrounding-linefeed, as just one case: 'delete all white space around a formatter generated break'. So we end up with only two cases to consider: preserve white space and remove white space around a line break created by the Knuth algorithm. 1. Preserve white space: IMO in this case the space itself is actually not a break opportunity but there are now two break opportunities: one before the space and one after the space. That is a sequence like 'abc#x20;def' is more like 'abc#x200b;#xa0;#x200b;def' or in a more readable notation 'abczwspnbspzwspdef'. 
That is our normal space becomes a non-breakable space flanked by zero-width spaces which represent the break opportunities. If this is correct the Knuth elements would look like: glue w=0 box w=0 pen +INFINITE glue w=space pen glue w=0 Is this sequence correct? The first and last glue represent the zwsp and are break opportunities. The box prevents the removal of the space if a break is created before the space. The penalty prevents the space to be considered as a break opportunity. Of course as usual these sequences are further complicated in the absence of justification and in the presence of border/padding. 2. Removal of white space: This is the current behaviour but it works only for a single space and not for a sequence of spaces. Actually because the algorithm removes leading glues/penalties it is mainly a problem for trailing white space. I am not sure how to best tackle this. What comes to mind is: a) Do the same as for leading glues/penalties at the end of the line. However I am not sure how tricky it would be to determine the boundary because any 'blocking boxes' (see 1. above) are only placed before but not after elements. This options suffers from the problem that it will not remove leading/trailing white space across inline boundaries with border/padding as these generate zero width boxes to block removal of the glue elements for the border/padding. b) Do not generate individual Knuth sequences for each white space character but instead collect all consecutive white space and create one glue-penalty sequence for it. Again I am uncertain of the consequences of doing that. To do that correctly we would need to collect white space across inline boundaries. This firstly breaks the current getNextKnuth approach which assumes each LM can generate its sequences without knowledge of its neighbours. It would also break the current area info structures as a single Knuth element could now refer to text snippets from different LMs. Comments please. 
Simon Thanks Luca wrote a longer response to this but my mail reader doesn't like the character set (is that topical or what?). Any way at end end Luca ask the question about the UAX#14 line breaking algorithm and its handling of spaces. My answer to that is: a) Yes UAX#14 always breaks at the of a sequence of spaces b) But is also says that it assumes any trailing spaces in a line are being removed This conflicts with XSL-FO which can force spaces being retained therefore adjustments to the algorithm are necessary to cater for that. One possible adjustment is simply changing what is given to the algorithm as indicated above, ie sp becomes zwspnbspzwsp. Manuel Manuel In case other people have the same problem with Luca's post here is the content: Start Luca's e-mail + I like your idea of expanding a preserved space into zwsps and nbsp; this allows us to forget alignments and
Re: Leading/trailing space removal in LineLM
Manuel Mall wrote: Luca wrote a longer response to this but my mail reader doesn't like the character set (is that topical or what?). Sorry, it looks really horrible ... still don't know what went wrong, but I won't do it again! :-) Any way at end Luca ask the question about the UAX#14 line breaking algorithm and its handling of spaces. My answer to that is: a) Yes UAX#14 always breaks at the of a sequence of spaces b) But is also says that it assumes any trailing spaces in a line are being removed This conflicts with XSL-FO which can force spaces being retained therefore adjustments to the algorithm are necessary to cater for that. One possible adjustment is simply changing what is given to the algorithm as indicated above, ie sp becomes zwspnbspzwsp. Ok, so back to your previous message: 2. Removal of white space: This is the current behaviour but it works only for a single space and not for a sequence of spaces. Actually because the algorithm removes leading glues/penalties it is mainly a problem for trailing white space. I am not sure how to best tackle this. What comes to mind is: a) Do the same as for leading glues/penalties at the end of the line. However I am not sure how tricky it would be to determine the boundary because any 'blocking boxes' (see 1. above) are only placed before but not after elements. This options suffers from the problem that it will not remove leading/trailing white space across inline boundaries with border/padding as these generate zero width boxes to block removal of the glue elements for the border/padding. b) Do not generate individual Knuth sequences for each white space character but instead collect all consecutive white space and create one glue-penalty sequence for it. Again I am uncertain of the consequences of doing that. To do that correctly we would need to collect white space across inline boundaries. 
This firstly breaks the current getNextKnuth approach which assumes each LM can generate its sequences without knowledge of its neighbours. It would also break the current area info structures as a single Knuth element could now refer to text snippets from different LMs. I'm not sure I follow you in all the details of white space handling and here we have borders too ... :-) I like b) most: after all, this is somewhat similar to the space resolution, as we have interactions between spaces coming from different nodes, and it's difficult to have each LM decide on its own. And I think we could find a way to keep the 1-1 relationship between AreaInfo objects and Positions. I have tried to play with the elements, and here are a few results: I hope they can help! At the moments, the sequence for a single space with borders and padding is: 1 glue w=endBP 2 penalty w=0 3 glue w=(spaceIPD - endBP - startBP) 4 box w=0 5 infinite penalty 6 glue w=startBP total width = spaceIPD if break at #2 = endBP / startBP If we have two (or more) spaces, we could use the sequence: 1 glue w=endBP 2 penalty w=0 3 glue w=(- endBP - startBP) 4 glue w=spaceIPD1 5 glue w=spaceIPD2 6 box w=0 7 infinite penalty 8 glue w=startBP total width = spaceIPD1 + spaceIPD2 if break at #2 = endBP / startBP Glues #4 and #5 have a Position pointing to different AreaInfo objects (from different LMs). This should solve (?) the case of ignore-if-surrounding. 
If white-space-treatment is ignore-if-after, and we have two consecutive spaces we could use the sequence: 1 glue w=endBP 2 penalty w=0 3 glue w=(spaceIPD - endBP) 4 penalty w=0 5 glue w=(spaceIPD - startBP) 6 box w=0 7 infinite penalty 8 glue w=startBP total width = 2 * spaceIPD if break at #2 = endBP / startBP if break at #4 = endBP + spaceIPD / startBP With three or more consecutive spaces: 1 glue w=endBP 2 penalty w=0 3 glue w=(spaceIPD - endBP) 4 penalty w=0 5 glue w=spaceIPD 6 penalty w=0 7 glue w=(spaceIPD - startBP) 8 box w=0 9 infinite penalty 10 glue w=startBP total width = 3 * spaceIPD if break at #2 = endBP / startBP if break at #4 = endBP + spaceIPD / startBP if break at #6 = endBP + 2 * spaceIPD / startBP I did not find a sequence for ignore-if-before yet ... Regards Luca
Re: Leading/trailing space removal in LineLM
On Wed, Nov 02, 2005 at 04:58:09PM +0100, Luca Furini wrote:

Manuel Mall wrote:

Luca wrote a longer response to this but my mail reader doesn't like the character set (is that topical or what?).

Sorry, it looks really horrible ... still don't know what went wrong, but I won't do it again! :-)

It is in the quoted-printable format, probably due to non-ascii or non-latin-1 characters in it, the TR14 symbols.

Simon
--
Simon Pepping
home page: http://www.leverkruid.nl
Re: Leading/trailing space removal in LineLM
Manuel Mall wrote:

a) Yes UAX#14 always breaks at the end of a sequence of spaces
b) But it also says that it assumes any trailing spaces in a line are being removed

This conflicts with XSL-FO, which can force spaces to be retained; therefore adjustments to the algorithm are necessary to cater for that.

Computing line breaking opportunities and discarding whitespace at the end (or beginning) of a line are different matters. If whitespace has to be retained, trailing spaces after a non-space string may simply mean the previous line breaking opportunity has to be used, because otherwise the string including the trailing spaces will overflow the line area. The trailing whitespace may also influence text justification.

J.Pietschmann
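The point about retained trailing spaces can be illustrated with a small Java sketch. It rests on loud simplifications that are not in the thread: every character is one unit wide, only U+0020 counts as a space, and a break opportunity is taken to be the position just after a run of spaces. The class `TrailingSpaceSketch` and its methods are hypothetical names. The break opportunities are computed the same way in both modes; only the measurement of the candidate line changes.

```java
/** Hypothetical illustration; not FOP code and not a full UAX#14 implementation. */
final class TrailingSpaceSketch {

    /** Width of a line in character cells; trailing spaces count only if retained. */
    static int measure(String line, boolean retainTrailingSpaces) {
        int end = line.length();
        if (!retainTrailingSpaces) {
            while (end > 0 && line.charAt(end - 1) == ' ') {
                end--;
            }
        }
        return end;
    }

    /**
     * Returns the index at which the first line ends: the last break
     * opportunity (just after a run of spaces) whose line still fits,
     * or -1 if no opportunity fits. The end of the text is not treated
     * as an opportunity in this sketch.
     */
    static int firstBreak(String text, int lineWidth, boolean retainTrailingSpaces) {
        int best = -1;
        for (int i = 1; i < text.length(); i++) {
            boolean opportunity = text.charAt(i - 1) == ' ' && text.charAt(i) != ' ';
            if (opportunity
                    && measure(text.substring(0, i), retainTrailingSpaces) <= lineWidth) {
                best = i;
            }
        }
        return best;
    }

    public static void main(String[] args) {
        String text = "ab cd   ef";
        // Spaces discarded: "ab cd" measures 5 and fits, so the break is after the space run.
        System.out.println(firstBreak(text, 5, false));
        // Spaces retained: "ab cd   " measures 8 and overflows, so the earlier opportunity is used.
        System.out.println(firstBreak(text, 5, true));
    }
}
```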
Re: Leading/trailing space removal in LineLM
On Thu, 3 Nov 2005 06:03 am, J.Pietschmann wrote:

Manuel Mall wrote:

a) Yes UAX#14 always breaks at the end of a sequence of spaces
b) But it also says that it assumes any trailing spaces in a line are being removed

This conflicts with XSL-FO, which can force spaces to be retained; therefore adjustments to the algorithm are necessary to cater for that.

Computing line breaking opportunities and discarding whitespace at the end (or beginning) of a line are different matters. If whitespace has to be retained, trailing spaces after a non-space string may simply mean the previous line breaking opportunity has to be used, because otherwise the string including the trailing spaces will overflow the line area. The trailing whitespace may also influence text justification.

Hmm, to me it appears that UNICODE and XSL-FO have slightly different models when it comes to white space in the context of line breaking, which is causing the discussion here. In UNICODE everything is based simply on the properties of the codepoint in question and its neighbour. In XSL-FO one can change the behaviour of a codepoint by setting those white space related XSL-FO properties. That is not a concept within UNICODE. If you want to retain white space in UNICODE you use a different codepoint. If you want to retain a space in XSL-FO you could use a different codepoint, but more likely you set an XSL-FO property if you want this applied widely in your document.

If we want to 'marry' UNICODE linebreaking with XSL-FO white space handling we have this interaction to consider. One possible solution would be to replace spaces (U+0020) by different codepoints which resemble the behaviour modification imposed by any XSL-FO white space handling properties in effect. But I am not sure if this can be done in all cases. Otherwise we may have to modify the UNICODE line breaking algorithm to cater for the XSL-FO white space specialities.

J.Pietschmann

Manuel
Re: Leading/trailing space removal in LineLM
On Tue, Nov 01, 2005 at 11:40:42PM +0800, Manuel Mall wrote:

This is probably a question for Luca or Simon. In LineLM we have this code:

    // ignore KnuthGlue and KnuthPenalty objects
    // at the beginning of the line
    seqIterator = seq.listIterator(iStartElement);
    tempElement = (KnuthElement) seqIterator.next();
    while (!tempElement.isBox() && seqIterator.hasNext()) {
        tempElement = (KnuthElement) seqIterator.next();
        iStartElement++;
    }

What is the background to this? This seems to interfere with certain combinations of white-space-collapse=false and white-space-treatment=preserve/ignore-if-before-linefeed. I think there is similar code to remove trailing stuff with similar interference.

Glue and penalty items are removed at the start of a line. This is part of the Knuth algorithm. It does not touch the matter of white-space-collapse. If there is whitespace that may not be removed/collapsed at the start of the line, it must be protected by a preceding zero-width box. I.o.w., the value of white-space-collapse needs to be taken into account at the phase of getNextKnuthElements.

Simon
--
Simon Pepping
home page: http://www.leverkruid.nl
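To make the effect of the zero-width box concrete, here is a minimal Java sketch that mirrors the quoted loop on a simplified element type. `LeadingGlueSketch`, `Elem`, `skipLeading` and `lineWidth` are hypothetical names for illustration and do not exist in FOP. The sketch assumes a non-empty sequence and models natural widths only.

```java
import java.util.Arrays;
import java.util.List;
import java.util.ListIterator;

/** Hypothetical stand-ins; these are not FOP's KnuthElement classes. */
final class LeadingGlueSketch {

    enum Kind { BOX, GLUE, PENALTY }

    static final class Elem {
        final Kind kind;
        final int width;

        Elem(Kind kind, int width) {
            this.kind = kind;
            this.width = width;
        }

        boolean isBox() {
            return kind == Kind.BOX;
        }
    }

    /**
     * Same shape as the quoted loop: advance from start until a box is
     * reached or the list runs out, and return the index reached.
     */
    static int skipLeading(List<Elem> seq, int start) {
        ListIterator<Elem> it = seq.listIterator(start);
        Elem e = it.next();
        while (!e.isBox() && it.hasNext()) {
            e = it.next();
            start++;
        }
        return start;
    }

    /** Natural width of the line once leading glue and penalties are dropped. */
    static int lineWidth(List<Elem> seq) {
        int sum = 0;
        for (int i = skipLeading(seq, 0); i < seq.size(); i++) {
            sum += seq.get(i).width;
        }
        return sum;
    }

    public static void main(String[] args) {
        // A 10-unit space followed by a 30-unit word at the start of a line.
        List<Elem> unguarded = Arrays.asList(
                new Elem(Kind.GLUE, 10), new Elem(Kind.BOX, 30));
        // The same content with a zero-width box placed in front of the space.
        List<Elem> guarded = Arrays.asList(
                new Elem(Kind.BOX, 0), new Elem(Kind.GLUE, 10), new Elem(Kind.BOX, 30));
        System.out.println(lineWidth(unguarded)); // the leading space is dropped
        System.out.println(lineWidth(guarded));   // the leading space survives
    }
}
```

In this sketch the decision to emit the guarding box belongs to whatever builds the sequence, which matches the remark that white-space-collapse has to be considered when the elements are generated.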