RE: Coloured diacritics (Was: Transcoding Tamil in the presence of markup)

Philippe Verdy Mon, 08 Dec 2003 17:17:12 -0800


 -----Original Message-----
From:   Philippe Verdy [mailto:[EMAIL PROTECTED] 
Sent:   Tuesday, 9 December 2003 00:11
To:     Peter Kirk
Cc:     [EMAIL PROTECTED]
Subject: RE: Coloured diacritics (Was: Transcoding Tamil in the presence of
markup)

Peter Kirk writes:
> Agreed. But now we are told that the latter is illegal XML because a 
> combining mark is not permitted (by XML, not by Unicode) after <span>.

It is not forbidden by XML. The issue is that treating an XML file (which is
not plain text) as if it were Unicode plain text when normalizing the file
may compose combining characters with the preceding characters that are part
of the XML syntax.

This creates problems in the following cases, where a defective combining
sequence (one that starts with a combining character) immediately follows:
- the quotation mark that delimits the start of an XML attribute value,
- the opening bracket that delimits the start of a CDATA section,
- the greater-than sign that closes an XML tag or processing instruction,
- a delimiter in the text content of <script>, <style> or <object>-like
elements, which may contain various delimiting characters to enclose Unicode
string values; these problems depend on the scripting language actually used
in these elements, whose content is not plain text.
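
To make the greater-than case concrete, here is a minimal Python sketch. It
assumes U+0338 COMBINING LONG SOLIDUS OVERLAY as the leading diacritic (my
choice for illustration, since it composes canonically with ">"), and the
<p>/<span> sample document is made up:

```python
import unicodedata
import xml.etree.ElementTree as ET

# Made-up sample: U+0338 placed right after the ">" that closes <span>.
source = "<p><span>\u0338x</span></p>"

# The unnormalized file is well-formed; the span text starts with U+0338.
assert ET.fromstring(source).find("span").text == "\u0338x"

# Normalize the whole file as if it were plain text.
normalized = unicodedata.normalize("NFC", source)

# ">" + U+0338 is composed into U+226F NOT GREATER-THAN,
# so the <span> start tag is no longer closed.
tag_was_destroyed = "<span>" not in normalized and "\u226f" in normalized

try:
    ET.fromstring(normalized)
    still_well_formed = True
except ET.ParseError:
    still_well_formed = False
```

Most combining marks have no precomposed form with ">" or a quotation mark,
so NFC leaves them alone; the sketch only shows that such a collision with
the XML syntax is possible.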

For these reasons, normalization should be used with care on XML files. XML
encoders may need to consider the XML syntax first and avoid converting the
whole file as if it were plain text. They should rather encode each
plain-text string that occurs within the parsed XML tree, possibly using
numeric or named character entities to encode the initial diacritics in
those strings that start with defective combining sequences.
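
As a rough sketch of such an encoder step in Python: encode_text is a
hypothetical helper name, not part of any library, and it assumes that
"combining character" means any character of general category M:

```python
import unicodedata
import xml.etree.ElementTree as ET
from xml.sax.saxutils import escape

def encode_text(text):
    """Hypothetical helper: escape markup characters and write any leading
    combining marks as numeric character references."""
    i = 0
    # Count the combining marks (general category M*) at the start.
    while i < len(text) and unicodedata.category(text[i]).startswith("M"):
        i += 1
    leading = "".join("&#x%X;" % ord(ch) for ch in text[:i])
    return leading + escape(text[i:])

# The diacritic no longer sits directly after the ">" of the start tag.
fragment = "<span>" + encode_text("\u0338x") + "</span>"

# Normalizing the serialized fragment as plain text now changes nothing,
# and a parser still recovers the original defective combining sequence.
unchanged_by_nfc = unicodedata.normalize("NFC", fragment) == fragment
recovered = ET.fromstring(fragment).text
```

Here fragment is the ASCII-only string "<span>&#x338;x</span>", so a
normalizer has nothing to compose with the XML delimiters.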

In that case (with all due care taken in the XML encoder), an XML parser will
never be misled by an NFC normalizer applied to its input, and will still be
able to represent texts containing defective combining sequences without
collision with the XML syntax.

The W3C just _recommends_ the NFC form, but does not mandate it. In XML,
text elements and attribute values are just data and are not limited to, or
intended to represent only, plain text. That's a good reason why defective
combining sequences are not even forbidden in XML, and why an XML parser is
not supposed to force any normalization of its input:

The _Unicode canonical equivalence_ of strings is not considered as
_equality_ in XML, and XML considers canonically equivalent strings coded
with distinct sequences of code points as _distinct_ for processing purposes
(it is up to the application using the parsed XML DOM tree or InfoSet to
decide whether the "text" elements and attribute values are plain text and
should be normalized before actual processing, for example by an XSLT
stylesheet).
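
A small Python illustration of that distinction, using the standard library
parser and a made-up <a> element as the example:

```python
import unicodedata
import xml.etree.ElementTree as ET

# Same abstract text, two different code point sequences.
composed = ET.fromstring("<a>\u00e9</a>").text     # U+00E9
decomposed = ET.fromstring("<a>e\u0301</a>").text  # U+0065 U+0301

# The parser hands both over untouched, so they compare as distinct.
parser_sees_them_as_equal = composed == decomposed

# Only the application, if it decides the content is plain text,
# makes them equal by normalizing after parsing.
equal_after_nfc = (unicodedata.normalize("NFC", composed)
                   == unicodedata.normalize("NFC", decomposed))
```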

When in doubt, don't perform any normalization of XML _files_, as they are
NOT plain text: you need an XML parser to do it safely, and only in the
relevant sections of the file. All you can do safely on the whole file is
to re-encode it (for example from the UTF-8 to the UTF-16 encoding scheme).
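
Re-encoding is safe because it leaves the sequence of code points
untouched, as this minimal Python sketch shows (the sample document is made
up; a real file carrying an encoding declaration would also need that
declaration updated, which is not done here):

```python
# Made-up sample with a defective combining sequence right after a tag.
utf8_bytes = "<p><span>\u0338x</span></p>".encode("utf-8")

# Decode and re-encode: only the byte representation changes.
text = utf8_bytes.decode("utf-8")
utf16_bytes = text.encode("utf-16")  # Python's "utf-16" codec writes a BOM

# Decoding the new bytes yields exactly the same code points,
# with ">" and U+0338 still separate.
code_points_preserved = utf16_bytes.decode("utf-16") == text
```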

__________________________________________________________________
<< ella for Spam Control >> has removed Spam messages and set aside
Newsletters for me
You can use it too - and it's FREE!  http://www.ellaforspam.com

