Alexandre,

> Philippe Verdy wrote:
>
> > From: "Kent Karlsson" <[EMAIL PROTECTED]>
> >
> >>Philippe Verdy wrote:
> >>
> >>> (1) a singleton (example the Angström symbol, canonically
> >>>mapped to A with diaeresis,
> >>
> >>The Ångström (note spelling) sign is canonically mapped to
> >>capital a with ring.
>
> Thanks for all explanations,

Please disregard Philippe's misleading blatherings on this topic.

The place to start is to read Unicode Technical Report #20,
Unicode in XML and other Markup Languages (despite Philippe's
disclaimers about it). See, in particular, Section 5 of that
report, "Characters with Compatibility Mappings", which provides
a series of recommendations for things to do and not to do for
compatibility characters in an XML context.

>
> Keeping the A with ring exemple, does it means that compatibility
> characters can be identified according to Unicode charts ?

See section 2.3 Compatibility Characters in the Unicode Standard:

http://www.unicode.org/versions/Unicode4.0.0/ch02.pdf

In general, compatibility characters cannot be identified simply
by looking at the Unicode code charts.

The subset of compatibility characters known as compatibility
composite characters *can* be identified by their decompositions
listed in the names list sections of the Unicode code chart. Or
you can parse them mechanically out of the UnicodeData.txt file
in the Unicode Character Database online.

U+212B ANGSTROM SIGN *is* a compatibility character in the first
sense defined in Section 2.3 of the standard. It is not, however,
a compatibility composite character.

> By exemple, in the case of \u212B ANGSTROM SIGN, it is documented :
> "preferred representation is 00C5 Å latin capital letter a with ring".
>
> Is that a clear indication that \u212B is actually a compatibility
> character

No, it is not. Such comments occur regarding other characters
which may or may not be compatibility characters.

> and then should be, according to XML 1.1 recommandation,
> replaced by the \u00C5 character ?

The reason has to do with normalization. U+212B *is* a
compatibility character. It is *not* a compatibility composite
character. But the crucial factor is that it has a singleton
canonical decomposition.
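
To make the distinction between the kinds of decomposition
concrete, here is a minimal sketch in Python. It assumes the
standard-library unicodedata module, which exposes the
decomposition field of UnicodeData.txt for whichever Unicode
version the interpreter ships with. The helper names
classify_decomposition and is_singleton_canonical are
hypothetical, chosen only for this illustration; U+FB01 LATIN
SMALL LIGATURE FI is used as an example of a character with a
tagged compatibility decomposition.

```python
import unicodedata


def classify_decomposition(ch):
    """Return 'none', 'canonical', or 'compatibility' for one character."""
    # The mapping looks like '00C5' or '<compat> 0066 0069'.
    mapping = unicodedata.decomposition(ch)
    if not mapping:
        return "none"
    # Compatibility mappings carry a leading <tag>; canonical ones do not.
    return "compatibility" if mapping.startswith("<") else "canonical"


def is_singleton_canonical(ch):
    """True if ch decomposes canonically to exactly one other character."""
    mapping = unicodedata.decomposition(ch)
    return (
        bool(mapping)
        and not mapping.startswith("<")
        and len(mapping.split()) == 1
    )


ANGSTROM_SIGN = "\u212B"
A_WITH_RING = "\u00C5"
FI_LIGATURE = "\uFB01"

for ch in (ANGSTROM_SIGN, A_WITH_RING, FI_LIGATURE):
    print(
        f"U+{ord(ch):04X}",
        classify_decomposition(ch),
        "singleton" if is_singleton_canonical(ch) else "not singleton",
    )

# Canonical mappings are applied by the canonical normalization forms,
# so NFC folds the singleton away; a <compat> mapping is left alone by NFC.
nfc_angstrom = unicodedata.normalize("NFC", ANGSTROM_SIGN)
nfc_ligature = unicodedata.normalize("NFC", FI_LIGATURE)
```

The check on the leading "<" is the mechanical test: an untagged
mapping is canonical, a tagged one is a compatibility mapping.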

If you normalize text data using Unicode normalization form
NFC, as recommended by the W3C, then U+212B will be replaced
by U+00C5, as a result of the normalization.

This stuff *is* rather confusing for people encountering it
for the first time. But the above sources should help. Also see
the W3C working draft for the Character Model for the World Wide
Web 1.0:

http://www.w3.org/TR/charmod/

--Ken

