-----Original Message-----
From: Philippe Verdy [mailto:[EMAIL PROTECTED]
Sent: Tuesday, 9 December 2003 00:11
To: Peter Kirk
Cc: [EMAIL PROTECTED]
Subject: RE: Coloured diacritics (Was: Transcoding Tamil in the presence of markup)

Peter Kirk writes:
> Agreed. But now we are told that the latter is illegal XML because a
> combining mark is not permitted (by XML, not by Unicode) after <span>.

It is not forbidden by XML. The problem is that an XML file is not plain text: normalizing it as if it were Unicode plain text may compose a combining character with a preceding character that is part of the XML syntax. This creates problems when a defective combining sequence is used in XML right after:

- a quotation mark that delimits the start of an XML attribute value,
- the opening bracket that delimits the start of a CDATA section,
- the greater-than sign that closes an XML tag or processing instruction,
- a delimiter in the content of <script>, <style>, <object> and similar elements, which may use various delimiting characters to enclose Unicode string values. These problems depend on the scripting language actually used in these elements, whose content is not plain text.

For these reasons, normalization should be used with care on XML files. XML encoders may need to consider the XML syntax first and avoid converting the whole file as if it were plain text. They should instead encode each plain-text string that occurs within the parsed XML tree, possibly using numeric or named character entities to encode the initial diacritics of strings that start with a defective combining sequence.

In that case (with all due care taken in the XML encoder), an XML parser will never be confused by an NFC normalizer applied to its input, and it will still be able to represent texts containing defective combining sequences without collision with the XML syntax.

The W3C only _recommends_ the NFC form; it does not mandate it. In XML, text elements and attribute values are just data and are not limited or intended to represent only plain text.
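
A minimal Python sketch of the collision and of the encoder-side workaround described above. It is only an illustration under stated assumptions: it uses the standard `unicodedata` module, it picks U+0338 COMBINING LONG SOLIDUS OVERLAY as the leading diacritic because that mark has a canonical composition with the greater-than sign, and the helper name `encode_text` is hypothetical.

```python
import unicodedata
from xml.sax.saxutils import escape

# A <span> whose text content starts with a combining mark (U+0338),
# i.e. a defective combining sequence right after the tag's closing '>'.
raw = "<span>\u0338x</span>"

# Normalizing the whole file as plain text composes '>' + U+0338 into
# U+226F NOT GREATER-THAN, so the start tag loses its closing delimiter.
broken = unicodedata.normalize("NFC", raw)


def encode_text(s: str) -> str:
    """Hypothetical helper: NFC-normalize one text string, then escape it.

    If the string starts with a combining mark (non-zero canonical combining
    class), that mark is written as a numeric character reference so that it
    cannot compose with the XML delimiter that precedes it.
    """
    s = unicodedata.normalize("NFC", s)
    if s and unicodedata.combining(s[0]):
        return "&#x{:X};".format(ord(s[0])) + escape(s[1:])
    return escape(s)


# Encoding the string on its own keeps the markup intact, even if the
# resulting file is later normalized as a whole.
safe = "<span>" + encode_text("\u0338x") + "</span>"
still_safe = unicodedata.normalize("NFC", safe)
```

Here `broken` no longer contains a well-formed start tag, while `safe` is left unchanged by a later NFC pass over the whole file. The check on the canonical combining class is a simplification chosen for this sketch.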

That XML content is data rather than plain text is a good reason why defective combining sequences are not even forbidden in XML, and why an XML parser is not supposed to force any normalization of its input. The _Unicode canonical equivalence_ of strings is not considered _equality_ in XML: XML treats canonically equivalent strings coded with distinct sequences of code points as _distinct_ for processing purposes. It is up to the application using the parsed XML DOM tree or InfoSet to decide whether the "text" elements and attribute values are plain text and should be normalized before actual processing (for example by an XSLT stylesheet).

When in doubt, don't perform any normalization of XML _files_, as they are NOT plain text: you need an XML parser to do it safely, and only in the relevant sections of the file. All you can safely do is re-encode XML files (for example from the UTF-8 to the UTF-16 encoding scheme).
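
A short Python sketch of the distinction between canonical equivalence and equality. It is an illustration only, assuming the standard `xml.etree.ElementTree` parser and `unicodedata` module; the element name `word` and the helper `nfc` are made up for the example.

```python
import unicodedata
import xml.etree.ElementTree as ET

# Two documents whose text is canonically equivalent but coded differently:
# precomposed U+00E9 versus 'e' followed by U+0301 COMBINING ACUTE ACCENT.
doc_a = "<word>caf\u00e9</word>"
doc_b = "<word>cafe\u0301</word>"

text_a = ET.fromstring(doc_a).text
text_b = ET.fromstring(doc_b).text

# The parser returns the code points as written; it does not normalize them.
same_as_parsed = (text_a == text_b)


def nfc(s: str) -> str:
    """Hypothetical helper: normalize one parsed string to NFC."""
    return unicodedata.normalize("NFC", s)


# The application decides that this element holds plain text and
# normalizes it after parsing, before comparing.
same_after_nfc = (nfc(text_a) == nfc(text_b))

# Re-encoding the file (here UTF-8 to UTF-16) leaves the code points as they
# were. A real file with an encoding declaration would need that updated too.
utf8_bytes = doc_b.encode("utf-8")
utf16_bytes = utf8_bytes.decode("utf-8").encode("utf-16")
roundtrip = utf16_bytes.decode("utf-16")
```

The two parsed strings compare as different until the application normalizes them, and the re-encoded document decodes back to exactly the same code points.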

