There are a number of incorrect statements. My comments below.

----- Original Message ----- 
From: "Peter Kirk" <[EMAIL PROTECTED]>
To: "Kenneth Whistler" <[EMAIL PROTECTED]>
Cc: <[EMAIL PROTECTED]>
Sent: Monday, August 11, 2003 16:28
Subject: Re: Questions on ZWNBS - for line initial holam plus alef

> I was aware that there should not be a line break or word break between
> the space and the NSM, although I suspect that many implementers will
> not be aware of this, or at least will not test for it properly and so
> treat any space as a word break and a line break opportunity.

It is hard to be clearer than what is written in the Line Breaking UAX
(UAX #14; see below).

> As I just
> wrote, this requirement to test all spaces for following NSMs is a
> significant inefficiency built into the standard.

This is incorrect. Characters (not just spaces) only need to be
checked for following NSMs in *those processes where that makes a
difference*. And in most of those processes, like line breaking, some
lookahead is required anyway. To see, for example, whether there is a
line break after a character X, in almost all cases I have to look at
the character after X, and in many cases I have to look at more than
one character. Notice, for example, that in the sequence "a<space>" I
have to look ahead to see if there is a ":", so that French
punctuation works correctly.

In practice, looking at a character past a space does not represent a
significant performance issue. One is typically using a mechanism
(like an augmented state machine) that already maintains enough state
to make the extra check essentially free.
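
As a rough illustration of the lookahead described above, here is a
minimal Python sketch. It is not the UAX #14 algorithm: the function
name break_after_space and its two rules (no break before a combining
mark, no break before French-style punctuation) are assumptions made
for the example. The point is only that deciding what happens after a
space already means inspecting the next character, so the NSM test
rides along with a check that is being made anyway.

```python
import unicodedata

def is_combining(ch):
    # General categories Mn, Mc and Me are the combining marks.
    return unicodedata.category(ch).startswith("M")

def break_after_space(text, i):
    """Toy rule: may a line break follow the space at text[i]?"""
    assert text[i] == " "
    if i + 1 >= len(text):
        return False              # end of text: nothing to break before
    nxt = text[i + 1]
    if is_combining(nxt):
        return False              # the space is the base of a mark
    if nxt in ":;!?":
        return False              # French-style punctuation stays attached
    return True
```

Both special cases cost one peek at text[i + 1], the same peek the
punctuation rule needs.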

>
> But there is still a problem if there is considered by default to be a
> word break and a line break opportunity AFTER the NSM. I would suggest,
> as a candidate for a concrete proposal, that the default behaviour be
> adjusted so that there is no word break or line break opportunity here
> either.

It would help if "concrete proposals" were actually, well, concrete.

I see no problem with Line Break
(http://www.unicode.org/reports/tr14/#Algorithm).

Space + NSM is treated as a unit, with behavior that is pretty
consistent with a stand-alone accent like "^". To quote:

LB 7a  In all of the following rules, if a space is the base character
for a combining mark, the space is changed to type ID. In other words,
break before SP CM* in the same cases as one would break before an ID.

                        Treat SP CM* as if it were ID

If you want non-breaking behavior, you use NBSP + NSM; if you want
breaking behavior, you use SP + NSM. The algorithm does that.
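
The effect of the quoted rule can be sketched in a few lines of
Python. This is a simplified illustration, not an implementation of
UAX #14: the class set is cut down to SP, GL (for NBSP), CM and AL,
and the names base_class and resolve are hypothetical. Combining marks
are attached to the preceding character, and a unit that starts with a
space is reclassified as ID, as LB 7a says.

```python
import unicodedata

def base_class(ch):
    # Drastically reduced class table, for illustration only.
    if ch == " ":
        return "SP"
    if ch == "\u00A0":
        return "GL"               # NO-BREAK SPACE
    if unicodedata.category(ch) in ("Mn", "Mc", "Me"):
        return "CM"
    return "AL"

def resolve(text):
    """Group each base with its marks; SP CM* becomes ID (LB 7a)."""
    units = []                    # each entry is [string, class]
    for ch in text:
        cls = base_class(ch)
        if cls == "CM" and units:
            units[-1][0] += ch    # attach the mark to its base
            if units[-1][1] == "SP":
                units[-1][1] = "ID"
        else:
            units.append([ch, cls])
    return [(s, c) for s, c in units]
```

With this sketch, SP + NSM comes out as a single ID unit (breakable
around it), while NBSP + NSM keeps the non-breaking GL class.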

I also see no problem with word-break
(http://www.unicode.org/reports/tr29/#Word_Boundaries). Look at the
specific text. To quote:

Treat a grapheme cluster as if it were a single character: the first
character of the cluster.
            GC    →    FC        (3)
...
Otherwise, break everywhere (including around ideographs).
            Any    ÷    Any        (14)

None of the other rules are relevant.

So SPACE + NSM will have a break before the space and after the NSM
(or after the last NSM, if there are several). It will thus behave
like a symbol, such as "*", ")", or "^".
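
A small Python sketch may make this behavior concrete. It is a toy
model, not the UAX #29 algorithm: clusters are approximated as a base
plus any following combining marks, and the only rule kept besides
"break everywhere" is an assumed one that keeps adjacent letters
together. The names clusters and word_segments are hypothetical.

```python
import unicodedata

def clusters(text):
    """Approximate grapheme clusters: a base plus following marks."""
    out = []
    for ch in text:
        if out and unicodedata.category(ch).startswith("M"):
            out[-1] += ch
        else:
            out.append(ch)
    return out

def word_segments(text):
    """Break between all clusters except between two letters."""
    segs = []
    prev_is_letter = False
    for gc in clusters(text):
        is_letter = gc[0].isalpha()   # a cluster acts like its first char
        if segs and prev_is_letter and is_letter:
            segs[-1] += gc            # letter next to letter: no break
        else:
            segs.append(gc)           # otherwise break everywhere
        prev_is_letter = is_letter
    return segs
```

In this model "ab" + SPACE + HOLAM + "cd" splits into three segments,
exactly as "ab*cd" does: the space with its mark stands alone, like a
symbol.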

The one area where I do see a possible issue is one that you didn't
mention: sentence boundaries
(http://www.unicode.org/reports/tr29/#Sentence_Boundaries). Sp + NSM
should not behave as Sp in rules (8), (10), and (11). Even there,
it will produce at most a minor oddity.

If we wanted to change it, the *concrete* change would be to replace
(4) by:

Treat a grapheme cluster as if it were a single character: the first
character of the cluster, except if that first character is a space.
In that case, change to Any.
            SGC    →    Any        (4a)
            GC    →    FC        (4b)
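
To show what such a change would amount to, here is a hedged Python
sketch. It reads SGC as "a cluster consisting of a space followed by
one or more combining marks", which is an interpretation and not
something the rule text spells out. The reduced class set (Sp, Term,
Any) and the names first_char_class and cluster_class are likewise
assumptions for the example.

```python
import unicodedata

def first_char_class(ch):
    # Toy sentence-break classes, for illustration only.
    if ch == " ":
        return "Sp"
    if ch in ".!?":
        return "Term"
    return "Any"

def cluster_class(gc, proposed=True):
    """Class of a grapheme cluster, with or without the proposed rule."""
    has_marks = any(unicodedata.category(c).startswith("M") for c in gc[1:])
    if proposed and gc[0] == " " and has_marks:
        return "Any"              # space used as a base for marks
    return first_char_class(gc[0])
```

Under the existing rule a space carrying a mark is still classed as
Sp; with the change it becomes Any, so the rules that look for Sp no
longer see it.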

>
> -- 
> Peter Kirk
> [EMAIL PROTECTED] (personal)
> [EMAIL PROTECTED] (work)
> http://www.qaya.org/
>
>
>

