Xerces-J supports numerous non-Unicode encodings such as ISO 8859-1 and SJIS. The parser converts these encodings to Unicode when parsing. However, many characters, such as è, have multiple representations in Unicode. It can be either the single precomposed character è (U+00E8) or an e followed by a combining grave accent (U+0300). There are many other examples of this.
Unicode defines several different normalization forms that determine which of these equivalent representations is used. Canonical XML requires that transcoding from other character sets use Normalization Form C (NFC). This means that è would be represented as a single character.
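To make the difference concrete, here is a minimal sketch in Java. It assumes a runtime that provides java.text.Normalizer (Java 6 or later). The class name NfcDemo and the helper toNfc are made up for illustration, and nothing here is meant to describe what Xerces itself does internally.

```java
import java.text.Normalizer;

public class NfcDemo {
    // Hypothetical helper: returns the NFC (composed) form of the input string.
    public static String toNfc(String s) {
        return Normalizer.normalize(s, Normalizer.Form.NFC);
    }

    public static void main(String[] args) {
        String precomposed = "\u00E8"; // e with grave accent as a single code point
        String decomposed = "e\u0300"; // plain e followed by the combining grave accent

        // The two strings render the same but are not equal code point by code point.
        System.out.println(precomposed.equals(decomposed)); // false

        // After NFC normalization the decomposed sequence becomes the single character.
        System.out.println(toNfc(decomposed).equals(precomposed)); // true

        // isNormalized reports whether a string is already in the given form.
        System.out.println(Normalizer.isNormalized(decomposed, Normalizer.Form.NFC)); // false
    }
}
```

A check like Normalizer.isNormalized on the parser's character output would be one way to see whether a given transcoding path produced NFC for a particular input.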
What normalization form does Xerces use when transcoding? Does it rely purely on Java or does it do its own transcoding? If it's really true that no one knows what Xerces is using, then it is likely that Xerces is getting at least some characters wrong from the perspective of canonical XML.
-- Elliotte Rusty Harold
