There are a number of incorrect statements. My comments below.
Thanks for the clarifications. Sorry about the inaccuracies. On some maybe Philippe misled me, on others it is just my inadequate understanding.
...Understood. I hope Microsoft is listening.
In practice, looking at a character past a space does not represent a
significant performance issue. One is typically using a mechanism
(like an augmented state machine) that maintains enough state that
that is not an issue.
...Of course! But I need help to get rid of any inaccuracies before the concrete sets.
It helps if "concrete proposals" were actually, well, concrete.
Thank you. I have looked at this. Well, the ideal for me would be a mechanism whereby base + NSM was AL, rather than ID or GL. The problem comes, if I understand correctly, with a sequence like SP XX CM* AL, where I want a break opportunity after SP but not before AL. If I use NBSP for XX, I get no break opportunity at all. If I use SP, I may get a break before AL. But I suppose SP SP CM* WJ AL would do what I want, perhaps also SP ZWSP NBSP CM* AL, as the break opportunity after ZWSP takes precedence over the no-break before NBSP.
I see no problem with Line Break. (http://www.unicode.org/reports/tr14/#Algorithm):
Space + NSM is treated as a unit, with behavior that is pretty consistent with a stand-alone accent like "^". To quote:
LB 7a In all of the following rules, if a space is the base character for a combining mark, the space is changed to type ID. In other words, break before SP CM* in the same cases as one would break before an ID.
Treat SP CM* as if it were ID
If you want non-breaking behavior, you use NBSP + NSM; if you want
breaking behavior, you use SP + NSM. The algorithm does that.
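To make the quoted rule concrete, here is a minimal Python sketch of just the "Treat SP CM* as if it were ID" step. It is not an implementation of the TR14 algorithm: the three-way class mapping (SP, GL for NBSP, AL for everything else) and the names toy_class and resolve_units are assumptions made up for this illustration.

```python
import unicodedata

SPACE = "\u0020"
NBSP = "\u00A0"

def is_combining_mark(ch):
    # General categories Mn, Mc, Me stand in for line-break class CM here.
    return unicodedata.category(ch) in ("Mn", "Mc", "Me")

def toy_class(ch):
    # Hypothetical, drastically reduced class table (assumption, not TR14 data).
    if ch == SPACE:
        return "SP"
    if ch == NBSP:
        return "GL"
    return "AL"

def resolve_units(text):
    """Group each base with its following marks and assign one class per unit."""
    units = []
    i = 0
    while i < len(text):
        j = i + 1
        while j < len(text) and is_combining_mark(text[j]):
            j += 1
        cls = toy_class(text[i])
        # The quoted rule: a space that is the base of a combining mark
        # is changed to type ID. A bare space stays SP.
        if cls == "SP" and j > i + 1:
            cls = "ID"
        units.append((text[i:j], cls))
        i = j
    return units
```

With this sketch, SP + U+0302 between two letters comes out as one ID unit, while NBSP + U+0302 stays a single GL unit, which is the breaking versus non-breaking distinction described above.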
OK, no real problem then. In some circumstances it might have been better for space + NSM to behave like a letter rather than a symbol, but I recognise that tailoring may be required for fine details.
I also see no problem with word-break (http://www.unicode.org/reports/tr29/#Word_Boundaries). Look at the specific text. To quote:
Treat a grapheme cluster as if it were a single character: the first character of the cluster.
GC → FC (3)
...
Otherwise, break everywhere (including around ideographs).
Any ÷ Any (14)
None of the other rules are relevant.
So what this does is that SPACE + NSM will break before the space and
after the NSM (assuming there is only one). So it will behave like a
symbol, such as "*", or ")", or "^".
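The same behaviour can be seen in a small Python sketch. It models only the two quoted rules plus a stand-in for the letter rules (adjacent letters stay together); everything else in TR29 is ignored, and the function names clusters and toy_word_segments are invented for this example.

```python
import unicodedata

def clusters(text):
    """Toy grapheme clusters: a base character plus any following marks."""
    out = []
    for ch in text:
        if out and unicodedata.category(ch) in ("Mn", "Mc", "Me"):
            out[-1] += ch
        else:
            out.append(ch)
    return out

def toy_word_segments(text):
    """Split text into segments, treating each cluster as its first character."""
    segments = []
    for cluster in clusters(text):
        first = cluster[0]
        # Stand-in for the letter rules (assumption): letters join letters.
        if segments and first.isalpha() and segments[-1][0].isalpha():
            segments[-1] += cluster
        else:
            # Otherwise, break everywhere (Any ÷ Any).
            segments.append(cluster)
    return segments
```

Under these assumptions, toy_word_segments("ab \u0302cd") and toy_word_segments("ab^cd") both yield three segments, with the space + NSM cluster standing alone exactly as "^" does.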
Do you mean: "SGC → Any (4a)"?
The one area where I do see a possible issue is one that you didn't mention, http://www.unicode.org/reports/tr29/#Sentence_Boundaries. Sp + NSM should not behave as Sp in the rules (8), (10), and (11). Even there, it will produce at most a minor oddity.
If we wanted to change it, the *concrete* change would be to replace (4) by:
Treat a grapheme cluster as if it were a single character: the first
character of the cluster, except if that first character is a space.
In that case, change to Any.
SGC → FC (4a)
GC → FC (4b)
How should I go about making a concrete proposal for this?
Anyway, many thanks for your help. I think I am beginning to realise that this is a small problem which has been blown out of proportion by others. I still see the space + NSM choice as a rather poor initial design, but one which can be lived with.
-- Peter Kirk [EMAIL PROTECTED] (personal) [EMAIL PROTECTED] (work) http://www.qaya.org/

