2012/11/19 "Martin J. Dürst" <[email protected]>

>> Note also that the W3C
>> does not automatically endorses the Unicode and ISO/IEC 10646 standards as
>> well (there's a delay before accepting newer releases of TUS and ISO/IEC
>> 10646, and the W3C frequently adds now several restrictions).
>
> Can you give examples? As far as I'm aware, the W3C has always tried to
> make sure that e.g. new characters encoded in Unicode can be used as soon
> as possible. There are some cases where this has been missed in the past
> (e.g. XML naming rules), but where corrective action has been taken.

I was not speaking about the characters themselves: the whole UCS is accessible, but with restrictions on use (or incompatibilities of behavior in the context of HTML). XML is more relaxed about this, and that will not change, because XML is not just a standard for transporting text but for many kinds of data (even so, some data requires a specific syntax, and there are restricted characters that need an alternate representation, handled not at the DOM level itself but in an application-specific, higher-level protocol).

The most important differences are in how Unicode character properties are handled, and in the tricky details of the Unicode algorithms. There are also differences in the subset of characters usable in identifiers: XML and HTML are more restrictive, so some identifiers need an escaping mechanism to work at the DOM level, because they cannot be encoded directly in the XML syntax without it.

HTML is not perfect either, because implementations differ in how they transform between the XML/HTML syntax level and the resulting data accessible at the DOM level. The transform is not bijective when you start from the XML syntax, because the standard allows alternate representations of the same data. The reverse direction also fails in practice, and these are implementation bugs still found everywhere: frequently in XML and HTML parsers, and more rarely in XML/HTML encoders, whose output cannot be decoded back to exactly what it was at the initial DOM level.

Interpretations of whitespace behavior also still vary (the xml:space attribute is frequently not honored exactly as it should be; such bugs show up when trying to implement document signatures).
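
To make the "not bijective" point concrete, here is a minimal sketch using Python's standard xml.etree.ElementTree (my choice for illustration, not something the thread refers to). The element name and sample text are made up. Three different spellings in the XML syntax yield the same text at the tree level, and re-serializing gives back only one of them:

```python
import xml.etree.ElementTree as ET

# Three different XML spellings of the same element content (illustrative).
variants = [
    "<a>A&amp;B</a>",          # predefined entity
    "<a>&#65;&#x26;B</a>",     # numeric character references
    "<a><![CDATA[A&B]]></a>",  # CDATA section
]

# What an application sees after parsing: the text of the root element.
texts = [ET.fromstring(v).text for v in variants]

# What comes back when the parsed tree is written out again.
reserialized = [
    ET.tostring(ET.fromstring(v), encoding="unicode") for v in variants
]

print(texts)
print(reserialized)
```

All three inputs parse to the same text, so the original spelling cannot be recovered from the tree alone. This is why signing a document needs an agreed canonical form rather than the raw bytes.
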
Other variations in interpretation are caused by named entities: behavior differs between "validating" and "non-validating" parsers, and even among validating ones when there are external document entities, and in the specifications of data schemas.
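
As a rough illustration of the named-entity problem, the sketch below again assumes Python's xml.etree.ElementTree, which is a non-validating parser. The documents and the helper try_parse are hypothetical. The same reference to "nbsp" is accepted when the entity is declared in the internal subset and rejected when no declaration is visible to the parser:

```python
import xml.etree.ElementTree as ET

# Same content, with and without a declaration for the named entity.
with_decl = '<!DOCTYPE a [<!ENTITY nbsp "&#160;">]><a>x&nbsp;y</a>'
without_decl = "<a>x&nbsp;y</a>"


def try_parse(doc):
    """Return the root element's text, or a short error description."""
    try:
        return ET.fromstring(doc).text
    except ET.ParseError as exc:
        return f"ParseError: {exc}"


result_with = try_parse(with_decl)        # entity expanded to U+00A0
result_without = try_parse(without_decl)  # undefined entity is an error

print(repr(result_with))
print(result_without)
```

When the declaration lives in an external DTD instead, the outcome depends on whether the parser fetches and reads that DTD, which is where validating and non-validating parsers can diverge.
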

