Re: DO NOT REPLY [Bug 38507] - Non-breaking space in PDF title output
Manuel Mall wrote:

> 1. The suppress-at-line-break property can be applied to all characters. I would take the position at the moment that explicit specification of the suppress-at-line-break property is not supported and we worry about it at a later stage. I would certainly argue against just supporting it in the context of nbsp.

Ok, it's better to take one step at a time!

> 2. When we discussed UAX#14 line breaking on this list last year, Joerg pointed out that he had a table-driven implementation for it. At the time I took a look, liked it, updated it for compliance with the latest UAX#14 spec, and then shelved it for integration into FOP. That is, when we move determining line break opportunities to the LineLM level (which we discussed extensively before), we get UAX#14 line breaking as part of it by integrating Joerg's implementation. As a consequence I recommend against putting any UAX#14-specific stuff at the lower levels (e.g. TextLM) now in the context of fixing the nbsp problem. It will disappear anyway and IMO is therefore not worth the effort.

Ok, so for the moment I'll avoid considering interaction between spaces, and just fix the character-by-character element creation, which is ready and should be enough to handle the most common situations. This also solves another bug concerning a nbsp being removed when starting a line.

I'll make the commit in a few minutes.

Regards
Luca
Re: DO NOT REPLY [Bug 38507] - Non-breaking space in PDF title output
On Monday 06 February 2006 18:44, Luca Furini wrote:
> Manuel Mall wrote:
>
<snip/>
>> 1. Justified text: pen INF + elastic glue
>> 2. All other justification modes: either just a box of the width of the space or pen INF + fixed width glue.
>
> I think in both cases (justified / unjustified text) we could use either a sequence with only glues and penalties, or a sequence with boxes too. For the justified text, it could be:
>
>   box w=0 + pen INF + elastic glue
>
> The choice of the sequence (completely suppressible / with boxes too) depends on the suppress-at-line-break property, whose default value is auto, meaning that only the normal U+0020 space is suppressed at a break.
>
> However, things are not so simple, and maybe we cannot just check the local value of the property. I see a couple of potentially problematic situations.
<snip/>

Luca,

IMO nbsp (and any other Unicode special spaces) are outside the scope of XSL-FO whitespace handling. XSL-FO refers to whitespace as defined in XML. In XML only #x20, #x9, #xA, and #xD are considered whitespace. Therefore nbsp does not need to be considered when looking at white-space-treatment and white-space-collapse. Would that approach remove the complications you mentioned?

> If nbsps must be suppressed, should an empty line be created or not?
> WDYT?
>
> Regards
> Luca

Cheers
Manuel
Re: DO NOT REPLY [Bug 38507] - Non-breaking space in PDF title output
Manuel Mall wrote:

> IMO nbsp (and any other Unicode special spaces) are outside the scope of XSL-FO whitespace handling. XSL-FO refers to whitespace as defined in XML. In XML only #x20, #x9, #xA, and #xD are considered whitespace. Therefore nbsp does not need to be considered when looking at white-space-treatment and white-space-collapse. Would that approach remove the complications you mentioned?

Thanks for the clarification, Manuel!

This solves the first supposed problem (interaction between nbsp and pretty-printing spaces), but the second one is still open: what happens if we have

  someContent<nbsp><space>otherContent ?

*IF* (and I'm not at all sure about this) there can be a break, then both spaces should be discarded: in order to implement the correct behaviour for this almost hypothetical situation, we would need to create elements for both spaces as a whole (and they could belong to different LMs), otherwise the algorithm would not be able to ignore the nbsp during line breaking.

Anyway, I think this is quite an unlikely combination of entities and properties :-) ; as I see you are already working on something else, for the moment I will prepare a patch for the most common situations.

Regards
Luca
Re: DO NOT REPLY [Bug 38507] - Non-breaking space in PDF title output
On Monday 06 February 2006 22:35, Luca Furini wrote:
> Manuel Mall wrote:
>
>> IMO nbsp (and any other Unicode special spaces) are outside the scope of XSL-FO whitespace handling. XSL-FO refers to whitespace as defined in XML. In XML only #x20, #x9, #xA, and #xD are considered whitespace. Therefore nbsp does not need to be considered when looking at white-space-treatment and white-space-collapse. Would that approach remove the complications you mentioned?
>
> Thanks for the clarification, Manuel!
>
> This solves the first supposed problem (interaction between nbsp and pretty-printing spaces), but the second one is still open: what happens if we have
>
>   someContent<nbsp><space>otherContent ?
>
> *IF* (and I'm not at all sure about this) there can be a break, then both spaces should be discarded:

IMO yes, there can be a break, and no, only the space needs to be removed. Again, the argument is that nbsp is not whitespace as per the XSL-FO definition and need not be removed. What makes you think that both the nbsp and the space need to be removed around a FOP-generated linebreak?

> in order to implement the correct behaviour for this almost hypothetical situation, we would need to create elements for both spaces as a whole (and they could belong to different LMs), otherwise the algorithm would not be able to ignore the nbsp during line breaking.
>
> Anyway, I think this is quite an unlikely combination of entities and properties :-) ; as I see you are already working on something else, for the moment I will prepare a patch for the most common situations.
>
> Regards
> Luca

Manuel
Re: DO NOT REPLY [Bug 38507] - Non-breaking space in PDF title output
Manuel Mall wrote:

>> This solves the first supposed problem (interaction between nbsp and pretty-printing spaces), but the second one is still open: what happens if we have
>>
>>   someContent<nbsp><space>otherContent ?
>>
>> *IF* (and I'm not at all sure about this) there can be a break, then both spaces should be discarded:
>
> IMO yes, there can be a break, and no, only the space needs to be removed. Again, the argument is that nbsp is not whitespace as per the XSL-FO definition and need not be removed. What makes you think that both the nbsp and the space need to be removed around a FOP-generated linebreak?

Oops, I forgot to add an important condition: if the user explicitly states that the nbsp must be discarded around a line break:

  <fo:inline suppress-at-line-break="suppress">&nbsp;</fo:inline>

Well, the more I look at this, the more it seems unlikely to ever happen ... we are probably having a highly theoretical disquisition! :-)

Anyway, I was still not sure whether there could be a break, so I looked back at Unicode Annex #14:

  GL Non-breaking (Glue) (XB/XA) (normative)

  Non-breaking characters prohibit breaks on either side, but that prohibition can be overridden by SP or ZW. In particular, when NBSP follows SPACE, there is a break opportunity after the SPACE and NBSP will go as visible space onto the next line. See also WJ. The following lists the characters of line break class GL with additional description.

  00A0 NO-BREAK SPACE (NBSP)
  202F NARROW NO-BREAK SPACE (NNBSP)
  180E MONGOLIAN VOWEL SEPARATOR (MVS)

  NO-BREAK SPACE is the preferred character to use where two words should be visually separated but kept on the same line, as in the case of a title and a name Dr.<NBSP>Joseph Becker. When SPACE follows NBSP, there is no break, because there never is a break in front of SPACE. NARROW NO-BREAK SPACE is used in Mongolian. The Mongolian vowel separator acts like a NNBSP in its line breaking behavior. It additionally affects the shaping of certain vowel characters as described in [Unicode] Section 12.3, Mongolian.

So, it seems there could be a break between SPACE and NBSP (with NBSP starting the next line), but not between NBSP and SPACE.

Can we say this is settled?

Regards
Luca
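
The SP/NBSP conclusion drawn from the UAX#14 excerpt above can be sketched in Java as follows. This is only an illustration under stated assumptions: `SpaceBreakRule` and `breakAllowedBetween` are hypothetical names, not FOP classes and not Joerg's table-driven implementation, and the sketch models nothing beyond the two rules quoted (no break in front of SPACE; SPACE overrides the glue prohibition of NBSP).

```java
// Hypothetical helper for illustration only; not part of FOP.
final class SpaceBreakRule {
    static final char SPACE = 0x0020;
    static final char NBSP = 0x00A0;

    /**
     * Returns true if a break opportunity exists between prev and next,
     * considering only the SPACE/NBSP rules quoted from UAX#14.
     */
    static boolean breakAllowedBetween(char prev, char next) {
        if (next == SPACE) {
            // "there never is a break in front of SPACE"
            return false;
        }
        if (prev == SPACE) {
            // SPACE overrides the prohibition of a following NBSP,
            // so the NBSP would start the next line as visible space.
            return true;
        }
        // NBSP next to a non-space character prohibits the break; every
        // other pair is outside the scope of this sketch and is reported
        // as "no break" here.
        return false;
    }
}
```

With this sketch, `breakAllowedBetween(SPACE, NBSP)` is true and `breakAllowedBetween(NBSP, SPACE)` is false, matching the summary in the message above.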
Re: DO NOT REPLY [Bug 38507] - Non-breaking space in PDF title output
On Feb 6, 2006, at 08:17, Manuel Mall wrote:

[ME:]
<snip/>
>> A preserved carriage return can be treated the same way as a linefeed, under the very exceptional condition that it survives white-space handling:
>> * white-space-treatment=ignore-if-*
>> * the CR does not follow/precede a linefeed
>> * it is the first character in a sequence of whitespace, so it survives white-space-collapse
>
> Shouldn't a CR always survive whitespace handling?

Not always: if white-space-treatment=preserve, then any XML whitespace other than a linefeed is converted into a normal space. IMO, the editors put it this way because of the possibility of Windows-specific line endings, where a CR is followed by a linefeed.

> For a starters it is fairly difficult to get a CR out of a XML parser.

Difficult? It's simply a characters event, just like any other...

> Only if the CR is hidden in an entity reference can it survive. Also, as Simon pointed out in some other contribution, whitespace handling is designed to deal with pretty printing and readable XML layout introduced whitespace. A CR preserved by the XML parser certainly does not fall into that category.

Oh yes it does... Remember that not all our users are unix/linux-based, which means that for Windows users you're likely to get the sequence '&#x0D;&#x0A;' as line terminator, while Mac users saving a source file with native line endings will simply get a '&#x0D;'. (UTF-8 encoding is recommended, but not enforced... An XML file can be in any encoding the parser supports on top of the UTF-8 minimum.)

A carriage return can survive white-space handling, for instance, in the following case (suppose Mac encoding):

  <fo:block>First line, then a CR&#x0D;
     some spaces, and more text</fo:block>

The CR (which isn't necessarily a Numerical Character Reference, but could be just the byte '0D') is not converted into a space (white-space-treatment=ignore-if-surrounding-linefeed). It does not precede or follow a linefeed. It is the first character in a sequence of whitespace, so no matter what the value of white-space-collapse, it will survive...

> I am also not aware that the XSL-FO spec mentions CR as falling under whitespace. IMO for whitespace handling CR is just a non whitespace character.

Nope, it does fall into the category of XML whitespace. There are exactly four of those: &#x09; (tab), &#x0A; (linefeed), &#x0D; (carriage return) and &#x20; (space). If you don't believe me: it's indeed not in the XSL-FO Rec, but you might want to check the XML Recommendation...

> So, we only need to consider what fop layout should do if it encounters a CR. I would say, keep it simple, throw it away and log a warning.
>
>> Now, what about a tab character under the same circumstances? Do we use an elastic width of X spaces optimum, where X is purely conventional?
>
> Similar considerations as for CR apply to TAB.

...

Cheers,
Andreas
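
The two points made above (XML has exactly four whitespace characters, and white-space-treatment=preserve turns all of them except the linefeed into a normal space) can be illustrated with a small Java sketch. `XmlWhitespace`, `isXmlWhitespace` and `treatAsPreserve` are hypothetical names, not FOP API, and the `preserve` behaviour is modelled only as it is described in this message, not checked against the Recommendation.

```java
// Hypothetical helper for illustration only; not part of FOP.
final class XmlWhitespace {

    /** The four characters XML classifies as whitespace: tab, LF, CR, space. */
    static boolean isXmlWhitespace(char c) {
        return c == 0x09 || c == 0x0A || c == 0x0D || c == 0x20;
    }

    /**
     * Models white-space-treatment=preserve as described in the thread:
     * every XML whitespace character other than a linefeed becomes U+0020.
     */
    static String treatAsPreserve(String text) {
        StringBuilder out = new StringBuilder(text.length());
        for (int i = 0; i < text.length(); i++) {
            char c = text.charAt(i);
            if (isXmlWhitespace(c) && c != 0x0A) {
                out.append(' ');
            } else {
                // Linefeeds and non-whitespace (including NBSP) pass through.
                out.append(c);
            }
        }
        return out.toString();
    }
}
```

Note that U+00A0 is not XML whitespace under this check, which is the basis for the argument elsewhere in the thread that nbsp falls outside white-space-treatment and white-space-collapse.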
Re: DO NOT REPLY [Bug 38507] - Non-breaking space in PDF title output
On Feb 6, 2006, at 17:04, Luca Furini wrote:

Hi Manuel / Luca,

> Manuel Mall wrote:
>
>> IMO yes, there can be a break, and no, only the space needs to be removed. Again, the argument is that nbsp is not whitespace as per the XSL-FO definition and need not be removed. What makes you think that both the nbsp and the space need to be removed around a FOP-generated linebreak?
>
> Oops, I forgot to add an important condition: if the user explicitly states that the nbsp must be discarded around a line break:
>
>   <fo:inline suppress-at-line-break="suppress">&nbsp;</fo:inline>

Oops, typo? suppress-at-line-break is a non-inherited property, only applicable to fo:character :-)

> Well, the more I look at this, the more it seems unlikely to ever happen ... we are probably having a highly theoretical disquisition! :-)

  <fo:character character="&#xA0;" suppress-at-line-break="suppress" />

followed by a space is indeed very theoretical. So is (another alternative):

  <fo:inline suppress-at-line-break="suppress">
    <fo:character character="&#xA0;" suppress-at-line-break="inherit" />
  </fo:inline>

OTOH, if we can make the algorithm work in these exotic cases, then the commonly used scenarios will be a cake-walk. :-)

This does, in any case, shed some different light on the notion of 'pretty printing whitespace', since currently --at least that was my understanding of the discussions, and that's what I worked towards-- a fo:character is considered the same as a regular character, in that fo:characters representing XML whitespace are subject to whitespace removal... Yet, one can arguably defend the idea that any *fo:*character is inserted for *XML* pretty printing purposes, no? Should this change be reverted then?

[Maybe partly, because suppose:

  <fo:block><fo:character character="&#x20;" suppress-at-line-break="retain" /> ...

Currently, the fact that it is a fo:character is not known when running this through the algorithm. The CharIterators deal with the characters. The XMLWhiteSpaceHandler makes a decision based purely on the value of the character property. It is agnostic to the suppress-at-line-break property's value... I myself would tend to use a non-breaking space in this case, since it escapes the whitespace handling, but it is a theoretical possibility. :-)

Another alternative would be to introduce a member to the CharIterators... Something like isSuppressible(), which would return true if:

  ( the current element is a regular character and it has codepoint U+0020 )
  or
  ( the current element is a fo:character and
    ( ( the value of its character property is codepoint U+0020 and suppress-at-line-break=auto )
      or ( suppress-at-line-break=suppress ) ) )

As such, refinement (white-space) character removal could operate on this basis, and already resolve such issues at that stage. The current approach is still not 100% correct anyway...]

> Anyway, I was still not sure whether there could be a break, so I looked back at Unicode Annex #14.
<snip />
> So, it seems there could be a break between SPACE and NBSP (with NBSP starting the next line), but not between NBSP and SPACE.
>
> Can we say this is settled?

Yes! Definitely. We're looking for UAX#14 'compliance' as well here.

My 2 cents.

Cheers,
Andreas
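
The isSuppressible() condition proposed above translates fairly directly into Java. The sketch below is not the actual CharIterator API: `SuppressibleCheck`, the `SuppressAtLineBreak` enum and the `isFoCharacter` flag are hypothetical stand-ins for whatever information the iterators would have to carry, and `inherit` is assumed to have been resolved to one of the three values before the check runs.

```java
// Hypothetical sketch of the proposed isSuppressible() test; not FOP code.
final class SuppressibleCheck {

    /** Resolved values of suppress-at-line-break (inherit assumed resolved). */
    enum SuppressAtLineBreak { AUTO, SUPPRESS, RETAIN }

    /**
     * @param codePoint     the character's code point
     * @param isFoCharacter true if it comes from an fo:character
     * @param property      the resolved suppress-at-line-break value;
     *                      ignored for regular characters
     */
    static boolean isSuppressible(int codePoint, boolean isFoCharacter,
                                  SuppressAtLineBreak property) {
        if (!isFoCharacter) {
            // Regular character: only the plain space is suppressible.
            return codePoint == 0x0020;
        }
        if (property == SuppressAtLineBreak.SUPPRESS) {
            // Explicit request wins, whatever the character is.
            return true;
        }
        // fo:character with auto: suppressible only if it is U+0020.
        return codePoint == 0x0020 && property == SuppressAtLineBreak.AUTO;
    }
}
```

Under this reading, an fo:character holding U+0020 with suppress-at-line-break="retain" is not suppressible, while one holding U+00A0 with "suppress" is, which covers the two exotic cases discussed in the message.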
Re: DO NOT REPLY [Bug 38507] - Non-breaking space in PDF title output
On Feb 6, 2006, at 19:40, Andreas L Delmelle wrote:

> Currently, the fact that it is a fo:character is not known when running this through the algorithm. The CharIterators deal with the characters.

Say... I was just wondering: why does the TextLayoutManager create its own copy of the FOText's character array? Could the LMs be made to re-use the CharIterators' functionality to get to the characters, or would that mean a draw on performance somehow?

Anyone?

Cheers,
Andreas
Re: DO NOT REPLY [Bug 38507] - Non-breaking space in PDF title output
On Feb 5, 2006, at 14:13, [EMAIL PROTECTED] wrote:

Hi Manuel,

> --- Additional Comments From [EMAIL PROTECTED] 2006-02-05 14:13 ---
> Jeremias, no that is not it IMO. Knuth doesn't break between elements as such. The glue or penalty element itself is the break opportunity and is discarded when used as a break. Therefore, IMO we are not breaking before or after a space or NBSP but at the space/NBSP.

OK, IIC you're directing this at the wrong person... The last question was mine. :-)

> The problem is that the coding model used for Knuth element generation for spaces is flawed. What is done is that the only difference between normal space and NBSP is an infinite penalty at the beginning of the sequence.

Yep. A few other gaps in that coding model that I'm currently looking at. See my most recent commit, and the change to the white-space Wiki. It created some nasty side-effects in exotic situations... currently under investigation.

A preserved carriage return can be treated the same way as a linefeed, under the very exceptional condition that it survives white-space handling:
* white-space-treatment=ignore-if-*
* the CR does not follow/precede a linefeed
* it is the first character in a sequence of whitespace, so it survives white-space-collapse

Now, what about a tab character under the same circumstances? Do we use an elastic width of X spaces optimum, where X is purely conventional?

> However, some sequences are pretty long and involve multiple pen-glue combinations and therefore break opportunities further into the sequence. We probably need to separate this more cleanly. Have one function for non-breaking elastic elements (e.g. NBSP) and one function for breaking elastic elements (e.g. SPACE).
>
> The non-breaking sequences are probably very simple:
> 1. Justified text: pen INF + elastic glue
> 2. All other justification modes: either just a box of the width of the space or pen INF + fixed width glue.
>
> Curious what Luca and others think. Are the above two cases OK for NBSP or have I oversimplified and missed something? That is, for the text-align values other than justify (start, center, end), is it enough to just reserve a fixed width for the NBSP?

Still depends on text-align-last, no?

BTW, is this not one of those situations where it's possible that the used font contains a glyph for the NBSP character, so we should check that as well?

Cheers,
Andreas
Re: DO NOT REPLY [Bug 38507] - Non-breaking space in PDF title output
On Feb 5, 2006, at 14:13, [EMAIL PROTECTED] wrote:
> Hi Manuel,
>
>> --- Additional Comments From [EMAIL PROTECTED] 2006-02-05 14:13 ---
<snip/>
> A preserved carriage return can be treated the same way as a linefeed, under the very exceptional condition that it survives white-space handling:
> * white-space-treatment=ignore-if-*
> * the CR does not follow/precede a linefeed
> * it is the first character in a sequence of whitespace, so it survives white-space-collapse

Shouldn't a CR always survive whitespace handling? For a starters it is fairly difficult to get a CR out of a XML parser. Only if the CR is hidden in an entity reference can it survive. Also, as Simon pointed out in some other contribution, whitespace handling is designed to deal with pretty printing and readable XML layout introduced whitespace. A CR preserved by the XML parser certainly does not fall into that category. I am also not aware that the XSL-FO spec mentions CR as falling under whitespace. IMO for whitespace handling CR is just a non whitespace character.

So, we only need to consider what fop layout should do if it encounters a CR. I would say, keep it simple, throw it away and log a warning.

> Now, what about a tab character under the same circumstances? Do we use an elastic width of X spaces optimum, where X is purely conventional?

Similar considerations as for CR apply to TAB.

Anyway, both CR and TAB have not much to do with the problem at hand: NBSP not handled correctly.

<snip/>
>> The non-breaking sequences are probably very simple:
>> 1. Justified text: pen INF + elastic glue
>> 2. All other justification modes: either just a box of the width of the space or pen INF + fixed width glue.
>>
>> Curious what Luca and others think. Are the above two cases OK for NBSP or have I oversimplified and missed something? That is, for the text-align values other than justify (start, center, end), is it enough to just reserve a fixed width for the NBSP?
>
> Still depends on text-align-last, no?

Yes, correct, but even then do the two rules above suffice, i.e. possible justification required: Rule 1; no justification required: Rule 2?

> BTW, is this not one of those situations where it's possible that the used font contains a glyph for the NBSP character, so we should check that as well?

Yes, but again it has very little to do with the problem. If the font has a glyph for NBSP we should use that glyph's width and not the SP width in the glue elements generated. That's all.

> Cheers,
> Andreas

Manuel
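
The two rules for non-breaking sequences discussed above can be sketched in Java as follows. Everything here is an assumption for illustration: `NbspElements`, its nested `Element` class and the `INFINITE` constant are stand-ins and not FOP's own Knuth element classes. The caller is expected to pass the NBSP glyph's width if the font has one and the space width otherwise, and to decide from text-align and text-align-last whether justification may be required.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch of the two NBSP rules; not FOP's element classes.
final class NbspElements {
    /** Penalty value treated as "never break here" in this sketch. */
    static final int INFINITE = 1000;

    /** Minimal stand-in for a Knuth box, glue or penalty. */
    static final class Element {
        final String kind;   // "box", "glue" or "penalty"
        final int width;
        final int stretch;
        final int shrink;
        final int penalty;

        Element(String kind, int width, int stretch, int shrink, int penalty) {
            this.kind = kind;
            this.width = width;
            this.stretch = stretch;
            this.shrink = shrink;
            this.penalty = penalty;
        }
    }

    /**
     * Rule 1 (justification possible): pen INF + elastic glue.
     * Rule 2 (no justification): a single box of the given width.
     */
    static List<Element> forNbsp(boolean mayJustify, int width,
                                 int stretch, int shrink) {
        List<Element> seq = new ArrayList<Element>();
        if (mayJustify) {
            // The infinite penalty forbids a break at the following glue.
            seq.add(new Element("penalty", 0, 0, 0, INFINITE));
            seq.add(new Element("glue", width, stretch, shrink, 0));
        } else {
            // A box is never a break opportunity and never discarded.
            seq.add(new Element("box", width, 0, 0, 0));
        }
        return seq;
    }
}
```

The sketch picks the box variant for Rule 2; the alternative named in the thread (pen INF + fixed width glue) would be the justified branch with stretch and shrink set to zero.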