> I do agree: an XML document could require the use at some place of a
> given attribute or element. If this attribute name follows the element
> name after a line break, which gets changed into a space during parsing,
> forcing XML parsers to treat SPACE+combining as an unbreakable grapheme
> cluster acting like a letter would have the effect of creating a new
> element name which may violate the element name identity. Now suppose
> that the attribute name contains a colon, you have created a custom
> namespace name, under which you can add any element you like, even if
> this was forbidden by the content-model of the reference schema.

1. SPACE is treated "blindly" as a SPACE by XML. String + space +
   combining + string would not be treated as a single token, no matter
   how that space was introduced. That's what you were complaining about
   in the first place (as far as I can make out).

2. While nmtokens can begin with a combining character, names cannot,
   nor can they contain spaces.

3. This would in no way change the content-model. So even if the above
   two points didn't hold, this would only sneak the document past
   something which performed validation before parsing(!), and where the
   content-model was already pretty loose (so it didn't complain about
   the unrecognised attribute).

You've just discovered a way to disguise one document that isn't
well-formed as a different document that isn't well-formed. l33t!

> So this would invalidate existing documents, or create holes allowing
> insertion of arbitrary XML content, if the XML application is not
> validating extremely strictly the element names (the pair namespace+
> name) and exclude completely from processing any unrecognized
> element (including all its content and attributes).

This argument is not on friendly terms with the concept of causality.

> This would be a breach in the content model which may have been
> validated and tested for security in another layer of the document
> encoding process (notably when XML documents are created from
> templates, such as XSL processors, or custom C source using simple
> template substitution).

Testing validity without testing well-formedness is not possible.

> So for me the sequence SPACE+combining should not be acceptable
> as a valid grapheme cluster within element names or attribute names,

As it already isn't.

> and thus would need to be excluded from NMTOKEN. The correct
> way to do it is to consider it NOT A LETTER, but a symbol (Sk),
> exactly like other spacing diacritics, which are already invalid in
> NMTOKEN.

Wait a second.
That was my justification for why the fact that space+combining is
ALREADY prohibited from NMTOKEN shouldn't be considered a failure on the
part of XML to allow for freedom of choice with the strings used for
NMTOKENs. Now you actually want to introduce this (already existent)
feature.

> There still remains the unresolved question of grapheme clusters
> that could span the starting "<" or ending ">" or "/>" of tags, or
> the leading "&" of an entity reference.

No there isn't.

What goes before <, >, / or & isn't a problem, since those are all
non-combining characters and so start a new unit for any sort of
processing that treats more than one codepoint as a unit.

What goes after < or & has to be a name (not an nmtoken) and as such is
already prohibited from beginning with a combiner.

What goes after > is already dealt with by Charmod, and even if you
ignore Charmod, apart from the possibility of normalisation turning the
sequence U+003E, U+0338 into U+226F (a possibility that is well noted),
it still isn't going to hurt.
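
For anyone who wants to check the name/nmtoken point and the normalisation case mechanically, here is a rough Python sketch. It assumes the Name and Nmtoken character ranges of XML 1.0 Fifth Edition (earlier editions use different tables, but none of them lets a Name start with a combining mark). The helper names is_name and is_nmtoken are made up for illustration and are not part of any library.

```python
import re
import unicodedata

# Character classes from XML 1.0 Fifth Edition, productions [4] and [4a].
# Assumption: earlier editions use different tables; illustration only.
NAME_START = (
    ":A-Z_a-z"
    "\u00C0-\u00D6\u00D8-\u00F6\u00F8-\u02FF"
    "\u0370-\u037D\u037F-\u1FFF\u200C-\u200D"
    "\u2070-\u218F\u2C00-\u2FEF\u3001-\uD7FF"
    "\uF900-\uFDCF\uFDF0-\uFFFD\U00010000-\U000EFFFF"
)
# NameChar adds "-", ".", digits, U+00B7 and the combining marks
# U+0300..U+036F (plus U+203F..U+2040) to NameStartChar.
NAME_CHAR = NAME_START + "\\-.0-9\u00B7\u0300-\u036F\u203F-\u2040"

NAME_RE = re.compile("[" + NAME_START + "][" + NAME_CHAR + "]*")
NMTOKEN_RE = re.compile("[" + NAME_CHAR + "]+")


def is_name(s: str) -> bool:
    """True if s matches the Name production (hypothetical helper)."""
    return NAME_RE.fullmatch(s) is not None


def is_nmtoken(s: str) -> bool:
    """True if s matches the Nmtoken production (hypothetical helper)."""
    return NMTOKEN_RE.fullmatch(s) is not None


ACUTE = "\u0301"  # COMBINING ACUTE ACCENT

# A combining mark may start an nmtoken but not a name.
print(is_nmtoken(ACUTE + "attr"), is_name(ACUTE + "attr"))

# A space is in neither character class, so "elem" + SPACE + combining +
# "attr" can never be a single name or a single nmtoken.
glued = "elem " + ACUTE + "attr"
print(is_nmtoken(glued), is_name(glued))

# NFC composes U+003E + U+0338 into the single codepoint U+226F.
composed = unicodedata.normalize("NFC", ">\u0338")
print(["U+%04X" % ord(c) for c in composed])
```

The regexes only test the productions in isolation; a real parser applies them while tokenising, which is why the space always ends the element name before the combining character is even looked at.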

