> In the context of XML processing, where strings should (must?) be

FYI: it's "should" for XML 1.1, which states quite explicitly that normalisation is 
not required for a document to be well-formed. XML 1.0 doesn't mention Unicode 
normalisation at all, although plenty of applications built on top of it do, sometimes 
with "must", or with "must" in certain circumstances. (Of course, that character 
normalisation is always based on the W3C character model, which uses NFC.)

Therefore you cannot assume that XML is in NFC. This is potentially problematic in the 
few contexts where non-NFC XML is fed to an application requiring NFC (and, as I said, 
some applications do require it), notably where whitespace normalisation may change 
whether the text is still in NFC. Hence NFC normalisation should occur before any 
other processing (which is the obvious logical way to do it anyway, but sometimes 
people do things the wrong way around, as the occasional security hole related to 
UTF-8 shows). An application that receives data in XML format will therefore still 
have to take the precaution of checking that spaces aren't followed by combining 
characters before treating them as breaking characters, unless it either insists on 
NFC or performs normalisation itself.
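
As a rough illustration of that precaution (a minimal sketch in Python, not taken from 
any particular XML toolkit; the helper names is_mark and split_on_plain_spaces are made 
up for this example), one could normalise to NFC first and then refuse to break on a 
space that is followed by a combining mark:

```python
import unicodedata


def is_mark(ch):
    # General categories Mn, Mc and Me cover combining marks.
    return unicodedata.category(ch).startswith("M")


def split_on_plain_spaces(text):
    """Split on U+0020, except where the space carries a combining mark."""
    # Normalise first, so every later step sees the same character sequence.
    text = unicodedata.normalize("NFC", text)
    pieces, current = [], []
    for i, ch in enumerate(text):
        nxt = text[i + 1:i + 2]  # empty string at the end of the text
        if ch == " " and not (nxt and is_mark(nxt)):
            # A plain space: treat it as a breaking character.
            pieces.append("".join(current))
            current = []
        else:
            current.append(ch)
    pieces.append("".join(current))
    return pieces
```

With this, "x \u0338y z" splits into two pieces rather than three, because the first 
space is the base character for the U+0338 that follows it.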

A more interesting potential pitfall with the position NFC has in the XML world 
(you're probably already aware of this, but maybe someone else here isn't) is that an 
application which doesn't insist on NFC will happily output U+0338 COMBINING LONG 
SOLIDUS OVERLAY as the first character of an element's content. NFC normalisation of 
the document would then combine the U+003E GREATER-THAN SIGN (>) that closes the start 
tag with the combining solidus to produce U+226F NOT GREATER-THAN. For the most part 
this is only an issue for applications that need to represent the results of substring 
operations, as there is little sense in beginning a string with U+0338 otherwise, but 
it's a strange thing to get a support request about if someone somehow gets it in 
there.
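
The effect is easy to reproduce. The following is a small Python sketch using only the 
standard library (unicodedata and xml.etree.ElementTree); the sample document and the 
is_well_formed helper are invented for illustration:

```python
import unicodedata
import xml.etree.ElementTree as ET


def is_well_formed(xml_text):
    """Return True if the standard-library parser accepts xml_text."""
    try:
        ET.fromstring(xml_text)
        return True
    except ET.ParseError:
        return False


# Well-formed, but the element content starts with U+0338.
doc = "<a>\u0338x</a>"

# Normalising the serialised document as one string fuses the ">" (U+003E)
# that closes the start tag with the combining solidus into U+226F.
normalised = unicodedata.normalize("NFC", doc)
```

Here is_well_formed(doc) is true while is_well_formed(normalised) is false, since the 
start tag has lost its closing ">".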

There's a similar problem with U+0338 as the first character in an element name, 
although such a document wouldn't be well-formed.
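
For the element-name case, a similar sketch (again with an invented sample, and 
assuming the expat-based parser that ships with Python) shows both halves of that 
remark: the parser rejects the name, and NFC would fuse the "<" with the solidus into 
U+226E NOT LESS-THAN:

```python
import unicodedata
import xml.etree.ElementTree as ET

# U+0338 is a combining character, so it may not start an XML name.
bad_name_doc = "<\u0338a/>"

try:
    ET.fromstring(bad_name_doc)
    name_accepted = True
except ET.ParseError:
    name_accepted = False

# NFC over the raw text combines "<" (U+003C) with U+0338.
fused = unicodedata.normalize("NFC", bad_name_doc)
```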

