There are a number of incorrect statements. My comments below.

----- Original Message -----
From: "Peter Kirk" <[EMAIL PROTECTED]>
To: "Kenneth Whistler" <[EMAIL PROTECTED]>
Cc: <[EMAIL PROTECTED]>
Sent: Monday, August 11, 2003 16:28
Subject: Re: Questions on ZWNBS - for line initial holam plus alef

> I was aware that there should not be a line break or word break between
> the space and the NSM, although I suspect that many implementers will
> not be aware of this, or at least will not test for it properly and so
> treat any space as a word break and a line break opportunity.

Hard to be clearer than what is written in the LineBreak UAX (see below).

> As I just
> wrote, this requirement to test all spaces for following NSMs is a
> significant inefficiency built into the standard.

This is incorrect. Characters (not just spaces) only need to be checked for following NSMs in *those processes where that makes a difference*. And in most of those processes, like line-break, some lookahead is required anyway. To see, for example, whether there is a linebreak after a character X, in almost all cases I have to look at the character after X, and in many cases I have to look at more than one character. Notice, for example, that in the sequence "a<space>" I have to look ahead to see if there is a ":", so that French punctuation works correctly.

In practice, looking at a character past a space does not represent a significant performance issue. One is typically using a mechanism (like an augmented state machine) that maintains enough state that that is not an issue.

>
> But there is still a problem if there is considered by default to be a
> word break and a line break opportunity AFTER the NSM. I would suggest,
> as a candidate for a concrete proposal, that the default behaviour be
> adjusted so that there is no word break or line break opportunity here
> either.

It helps if "concrete proposals" were actually, well, concrete.

I see no problem with Line Break (http://www.unicode.org/reports/tr14/#Algorithm): Space + NSM is treated as a unit, with behavior that is pretty consistent with a stand-alone accent like "^". To quote:

    LB 7a  In all of the following rules, if a space is the base character for a combining mark, the space is changed to type ID.
    In other words, break before SP CM* in the same cases as one would break before an ID.

    Treat SP CM* as if it were ID

If you want non-breaking behavior, you use NBSP + NSM; if you want breaking behavior, you use SP + NSM. The algorithm does that.

I also see no problem with word-break (http://www.unicode.org/reports/tr29/#Word_Boundaries). Look at the specific text. To quote:

    Treat a grapheme cluster as if it were a single character: the first character of the cluster.

    GC → FC (3)

    ...

    Otherwise, break everywhere (including around ideographs).

    Any ÷ Any (14)

None of the other rules are relevant. So what this does is that SPACE + NSM will break before the space and after the NSM (assuming there is only one). So it will behave like a symbol, such as "*", or ")", or "^".

The one area where I do see a possible issue is one that you didn't mention: http://www.unicode.org/reports/tr29/#Sentence_Boundaries. Sp + NSM should not behave as Sp in the rules (8), (10), and (11). Even there, it will produce at most a minor oddity. If we wanted to change it, the *concrete* change would be to replace (4) by:

    Treat a grapheme cluster as if it were a single character: the first character of the cluster, except if that first character is a space. In that case, change to Any.

    SGC → FC (4a)
    GC → FC (4b)

>
> --
> Peter Kirk
> [EMAIL PROTECTED] (personal)
> [EMAIL PROTECTED] (work)
> http://www.qaya.org/
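
To make the "SP CM* is treated as a unit" point concrete, here is a rough Python sketch. It is not the UAX #14 algorithm, only an illustration of the one-character lookahead involved: each base character is grouped with the combining marks that follow it, and the resulting unit gets a simplified class. The class labels SP, ID, GL and AL are borrowed from UAX #14; the function names (is_combining_mark, units, simplified_class) and the catch-all "AL" fallback are assumptions made up for this example.

```python
import unicodedata


def is_combining_mark(ch):
    # General categories Mn, Mc and Me cover non-spacing, spacing
    # and enclosing combining marks.
    return unicodedata.category(ch) in ("Mn", "Mc", "Me")


def units(text):
    """Group each base character with the combining marks after it (X CM*)."""
    out = []
    for ch in text:
        if out and is_combining_mark(ch):
            out[-1] += ch  # attach the mark to the preceding base
        else:
            out.append(ch)  # start a new unit
    return out


def simplified_class(unit):
    """Very reduced classification, for illustration only (an assumption,
    not the full UAX #14 class table)."""
    base, has_marks = unit[0], len(unit) > 1
    if base == " ":
        # A space carrying a combining mark is treated like ID;
        # a bare space stays SP.
        return "ID" if has_marks else "SP"
    if base == "\u00A0":
        return "GL"  # NBSP keeps its non-breaking (glue) behavior
    return "AL"  # everything else lumped together in this sketch


# SPACE + HOLAM (U+05B9) + ALEF (U+05D0), preceded by "a"
sample = "a \u05B9\u05D0"
print([simplified_class(u) for u in units(sample)])
```

Swapping the space for NBSP (U+00A0) in the sample gives a GL unit instead of ID, which matches the "use NBSP + NSM for non-breaking behavior" advice above.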

