On 11/08/2003 18:46, Mark Davis wrote:

There are a number of incorrect statements. My comments below.


Thanks for the clarifications, and sorry about the inaccuracies. On some points Philippe may have misled me; on others it is just my inadequate understanding.

...

In practice, looking at a character past a space does not represent a
significant performance issue. One is typically using a mechanism
(like an augmented state machine) that maintains enough state that
that is not an issue.


Understood. I hope Microsoft is listening.

...

It helps if "concrete proposals" were actually, well, concrete.


Of course! But I need help to get rid of any inaccuracies before the concrete sets.

I see no problem with Line Break.
(http://www.unicode.org/reports/tr14/#Algorithm):

Space + NSM is treated as a unit, with behavior that is pretty
consistent with a stand-alone accent like "^". To quote:

LB 7a  In all of the following rules, if a space is the base character
for a combining mark, the space is changed to type ID. In other words,
break before SP CM* in the same cases as one would break before an ID.

Treat SP CM* as if it were ID

If you want non-breaking behavior, you use NBSP + NSM; if you want
breaking behavior, you use SP + NSM. The algorithm does that.
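
As a rough illustration of the quoted LB 7a, here is a minimal Python sketch that collapses base + CM* runs into single units. The function name apply_lb7a and the list-of-class-names input are hypothetical, chosen only for this example. The handling of bases other than SP (keep the base's class, so NBSP + NSM stays GL) is an assumption consistent with the NBSP remark above, not something stated in the quoted rule.

```python
def apply_lb7a(classes):
    """Collapse each base + CM* run into one line-break unit.

    `classes` holds one line-break class name per character.
    SP followed by at least one CM becomes ID (the quoted LB 7a).
    Any other base followed by CM* keeps its own class; that part
    is an assumption made for this sketch.
    """
    units = []
    i = 0
    while i < len(classes):
        base = classes[i]
        j = i + 1
        # Absorb every combining mark that follows the base.
        while j < len(classes) and classes[j] == "CM":
            j += 1
        if base == "SP" and j > i + 1:
            base = "ID"  # SP CM* is treated as if it were ID
        units.append(base)
        i = j
    return units
```

With this sketch, AL SP CM AL comes out as AL ID AL, while AL GL CM AL stays AL GL AL, which is the breaking versus non-breaking distinction described above.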


Thank you. I have looked at this. The ideal for me would be a mechanism whereby base + NSM was AL, rather than ID or GL. The problem comes, if I understand correctly, with a sequence like SP XX CM* AL, where I want a break opportunity after SP but not before AL. If I use NBSP for XX, I get no break opportunity at all. If I use SP, I may get a break before AL. But I suppose SP SP CM* WJ AL would do what I want, and perhaps also SP ZWSP NBSP CM* AL, as the break opportunity after ZWSP takes precedence over the no-break before NBSP.

I also see no problem with word-break
(http://www.unicode.org/reports/tr29/#Word_Boundaries). Look at the
specific text. To quote:

Treat a grapheme cluster as if it were a single character: the first
character of the cluster.
           GC    →    FC        (3)
...
Otherwise, break everywhere (including around ideographs).
           Any    ÷    Any        (14)

None of the other rules are relevant.

So what this does is that SPACE + NSM will break before the space and
after the NSM (assuming there is only one). So it will behave like a
symbol, such as "*", or ")", or "^".
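
The behaviour described above can be sketched in Python under heavy simplification. This is not the TR29 algorithm: it forms clusters from a base plus following characters of category Mn or Me, classifies each cluster by its first character (rule 3), keeps adjacent letter clusters together, and otherwise breaks everywhere (rule 14). The function names and the use of str.isalpha() as the letter test are assumptions made for this example.

```python
import unicodedata

def clusters(text):
    """Group each base character with the nonspacing/enclosing marks after it."""
    out = []
    for ch in text:
        if out and unicodedata.category(ch) in ("Mn", "Me"):
            out[-1] += ch
        else:
            out.append(ch)
    return out

def simple_word_segments(text):
    """Split text into segments, breaking everywhere except between letters."""
    segments = []
    for cl in clusters(text):
        # Rule 3: a cluster behaves like its first character.
        is_letter = cl[0].isalpha()
        if segments and is_letter and segments[-1][0].isalpha():
            segments[-1] += cl   # no break between two letter clusters
        else:
            segments.append(cl)  # rule 14: otherwise break
    return segments
```

Calling simple_word_segments("ab \u0302cd") gives three segments, with SPACE + U+0302 as a segment of its own between "ab" and "cd", just as "ab^cd" splits around the "^".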


OK, no real problem then. In some circumstances it might have been more appropriate for space + NSM to behave like a letter rather than a symbol, but I recognise that tailoring may be required for fine details.

The one area I do see that there may be an issue is with one that you
didn't mention,
http://www.unicode.org/reports/tr29/#Sentence_Boundaries. Sp + NSM
should not behave as Sp in the rules (8), (10), and (11). Even there,
it will produce at most a minor oddity.

If we wanted to change it, the *concrete* change would be to replace
(4) by:

Treat a grapheme cluster as if it were a single character: the first
character of the cluster, except if that first character is a space.
In that case, change to Any.
SGC → FC (4a)
GC → FC (4b)


Do you mean: "SGC → Any (4a)"?
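
If 4a is read that way, the proposed replacement for rule (4) might be sketched as below. This is only an illustration under stated assumptions: SGC is taken to mean a cluster whose first character is a space and which has at least one further character, and toy_class is a hypothetical, much reduced stand-in for the real sentence-break property lookup.

```python
def toy_class(ch):
    """Hypothetical, reduced sentence-break classes for illustration only."""
    if ch == " ":
        return "Sp"
    if ch == ".":
        return "ATerm"
    if ch.isupper():
        return "Upper"
    if ch.islower():
        return "Lower"
    return "Any"

def sentence_class(cluster):
    """Class of a grapheme cluster under the assumed reading of 4a/4b."""
    first = toy_class(cluster[0])
    # 4a (assumed reading): space + marks no longer counts as Sp.
    if first == "Sp" and len(cluster) > 1:
        return "Any"
    # 4b: every other cluster takes the class of its first character.
    return first
```

Under this reading a bare space is still Sp, while SPACE + U+0302 becomes Any and so would no longer match Sp in rules (8), (10), and (11).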

How should I go about making a concrete proposal for this?

Anyway, many thanks for your help. I think I am beginning to realise that this is a small problem which has been blown out of proportion by others. I still see the space + NSM choice as a rather poor initial design, but one which can be lived with.

--
Peter Kirk
[EMAIL PROTECTED] (personal)
[EMAIL PROTECTED] (work)
http://www.qaya.org/




