Re: Questions on ZWNBS - for line initial holam plus alef

Peter Kirk Thu, 14 Aug 2003 13:06:57 -0700

On 12/08/2003 20:28, John Cowan wrote:

Peter Kirk scripsit:

2) In attribute values, LF, CR, and TAB characters are normalized to spaces. Not relevant here.

This would be relevant if it is legal for the character after LF, CR, and TAB to be a combining mark. Is this legal? In this case what was previously a defective (but legal) combining sequence would turn into a non-defective one, but the intended whitespace would be lost.

The point is that there is no such thing as an *intended* line break in an attribute value; it will *always* be translated to a space before the application sees it. (More exactly, line-break characters can be inserted into attribute values, but only with the use of a numeric character reference such as "&#xA;".)

Sorry, I'm confused. Are you saying that the input processing will translate line breaks into spaces within attribute values, unless inserted as &#xA;? Well, I suppose this is fair enough, as it is up to the user not to enter garbage.
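
The normalization discussed above can be observed with a short Python sketch. It assumes the standard-library ElementTree parser (built on expat), and the element name "e", the attribute name "v", and the helper attr_value are made up for illustration. It shows only what one conforming parser hands to the application.

```python
import xml.etree.ElementTree as ET


def attr_value(xml_text: str) -> str:
    """Parse a one-element document and return its 'v' attribute
    exactly as the application receives it (hypothetical helper)."""
    return ET.fromstring(xml_text).get("v")


# Literal LF and TAB inside the attribute value are normalized to spaces.
literal = attr_value('<e v="one\ntwo\tthree"/>')

# A line feed written as a numeric character reference survives.
escaped = attr_value('<e v="one&#xA;two"/>')

# The case raised in this thread: a literal LF followed by a combining
# mark (U+0301) becomes SPACE + combining mark after normalization.
with_mark = attr_value('<e v="a\n\u0301b"/>')

print(repr(literal))
print(repr(escaped))
print(repr(with_mark))
```

In the third case the application sees a space immediately followed by U+0301, which is the sequence the rest of this thread worries about.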

Not just a rendering glitch, I suspect. If the combining character is combined with the separating space, the space loses many of its separating functions, and perhaps keeps a confusing subset of them with all sorts of possibilities of error.

The space(s) will be used to separate individual tokens at processing time. No spacing diacritic (either single-character or space+combining) is permitted in a NMTOKEN.

OK if this is clearly illegal, but this might restrict use of some languages in NMTOKEN. Would NBSP + combining be allowed?

At best tokens beginning with combining characters will be unusable. At worst they will crash the implementation (and count on someone trying deliberately to do that!).


In effect, the combining character will constitute a defective combining
sequence at the beginning of the individual token.

Stepping away from the letter of the standard for a moment, there is
no real reason to begin a NMTOKEN with a combining character.  It is
only allowed as a result of the miscegenation of SGML concepts with
Unicode ones.

In SGML's original design of tokens, they consisted of letters and digits
(and a few punctuation marks, which functioned as letters).  There were
four kinds: a NUMBER could contain only digits, a NAME could not begin
with a digit, a NUTOKEN had to begin with a digit, and a NMTOKEN had no
restrictions.  ID and IDREF had the same syntax as NAME with additional
semantics.  Later, the categories "letter" and "digit" were generalized,
by redefining the concrete syntax, to be whatever you wanted, and were
renamed "name-start" and "name" characters (technically, a name character
was a letter *or* a digit).

When SGML was simplified to produce XML, only NMTOKEN, the most general
type of token, was kept.  However, in order to keep the semantics of
"letter" and "digit" in the Unicode world, "letter" was extended to be any
letter and "digit" to be any digit *or* combining character.  That worked
well for ID and IDREF, since treating combining characters as part of
"digit" prevented them from appearing first, as was only sensible.

Unfortunately, NMTOKENs, since there were no restrictions, became able
to begin with a combining character, though that made no real sense.
To write in a restriction would make it impossible to specify XML's
concrete syntax in SGML terms, which did not allow for three different
classes of characters within tokens.  So we wound up with a basically
useless capability that if used will only cause trouble.
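
As a rough illustration of the two character classes described above, here is a Python sketch. It is an approximation only: XML 1.0 defines its character classes through explicit tables, whereas this sketch stands in Unicode general categories for them and omits the Extender class. The function names are invented for the example.

```python
import unicodedata

# Assumed stand-ins for the XML 1.0 character classes (approximation).
NAME_START_CATEGORIES = {"Lu", "Ll", "Lt", "Lo", "Nl"}     # "letters"
EXTRA_NAME_CATEGORIES = {"Mn", "Mc", "Me", "Lm", "Nd"}     # "digits" plus combining marks


def is_name_start(ch: str) -> bool:
    """True if ch may begin a Name (letter-like, '_' or ':')."""
    return ch in "_:" or unicodedata.category(ch) in NAME_START_CATEGORIES


def is_name_char(ch: str) -> bool:
    """True if ch may appear anywhere in an Nmtoken."""
    return (
        is_name_start(ch)
        or ch in ".-"
        or unicodedata.category(ch) in EXTRA_NAME_CATEGORIES
    )


def is_nmtoken(s: str) -> bool:
    """Nmtoken: one or more name characters, no restriction on the first."""
    return len(s) > 0 and all(is_name_char(c) for c in s)


def is_name(s: str) -> bool:
    """Name (also ID/IDREF syntax): an Nmtoken whose first character is a name-start."""
    return is_nmtoken(s) and is_name_start(s[0])


leading_mark = "\u0301abc"   # begins with COMBINING ACUTE ACCENT
print(is_nmtoken(leading_mark), is_name(leading_mark))
print(is_nmtoken("abc\u0301"), is_name("abc\u0301"))
```

With only two classes, a token that begins with a combining mark passes the Nmtoken check and fails the Name check, matching the asymmetry described above.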

There is some potential for real trouble here, if one process outputs an NMTOKEN starting with a combining character preceded by a separating space, or something else which is changed into a space, and another process takes the new space plus combining character as a unit and so doesn't recognise the separation. Any hackers and virus programmers reading this will soon start flooding the Internet with tokens beginning with combining characters in the hope of crashing implementations or finding back doors. Of course this wouldn't have been a problem if Unicode had never defined space plus combining character as legal and meaningful. But this is not my problem!
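
The disagreement between two processes described above can be sketched in Python. Both "processes" here are hypothetical: one splits a token list on spaces, and the other groups each character with the combining marks that follow it, which is a simplification of real grapheme-cluster segmentation. Neither is taken from an actual implementation.

```python
import unicodedata
from typing import List


def split_tokens(value: str) -> List[str]:
    """Process A: treat every space as a token separator."""
    return value.split(" ")


def combining_sequences(value: str) -> List[str]:
    """Process B: attach each combining mark (general category M*)
    to the character before it. Simplified; not full segmentation rules."""
    groups: List[str] = []
    for ch in value:
        if groups and unicodedata.category(ch).startswith("M"):
            groups[-1] += ch
        else:
            groups.append(ch)
    return groups


# Two tokens, the second beginning with COMBINING ACUTE ACCENT.
value = "abc \u0301def"

tokens = split_tokens(value)
groups = combining_sequences(value)

print(tokens)            # process A sees two tokens
print(groups)            # process B sees the space fused with the mark
print(" " in groups)     # so B finds no bare separator at all
```

Process A reports two tokens, while process B never sees a standalone space, so a separator check based on B's units would treat the whole value as a single token.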

--
Peter Kirk
[EMAIL PROTECTED] (personal)
[EMAIL PROTECTED] (work)
http://www.qaya.org/
