Elliotte Harold wrote:
Xerces-J supports numerous non-Unicode encodings such as ISO 8859-1 and SJIS. The parser converts these encodings to Unicode when parsing. However, for many characters such as è there are multiple representations in Unicode. It can either be the single character è or an e followed by a combining accent grave. There are many other examples of this.
In the terms of Charmod
http://www.w3.org/TR/charmod
what you seem to be asking for is a normalizing transcoder.
I believe this can be addressed with Java 1.4 or later combined with icu4j, although I have not done so myself.
The technique would be to systematically check every transcoder that ships with your Java platform to see whether it is normalizing or not, and for the ones that are not, use the icu4j normalizing routines to construct a new transcoder from the old.
I suspect that the cases in which the built-in transcoder is non-normalizing are few (probably zero).
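To make the idea concrete, here is a minimal sketch, which I have not run against Xerces-J. It makes two assumptions that are mine and not part of the discussion above: it uses java.text.Normalizer (Java 6 and later) as a stand-in for the icu4j normalizing routines, and it targets NFC. The class name NormalizingTranscoder and both method names are hypothetical. The probe only looks at single-byte input, so it is a rough first check and not a proof that a transcoder is normalizing.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.text.Normalizer;

// Hypothetical helper, not part of Xerces-J or icu4j.
class NormalizingTranscoder {

    // Decode with the platform transcoder, then normalize to NFC only
    // if the decoded text is not already in that form.
    static String decode(byte[] bytes, Charset charset) {
        String decoded = new String(bytes, charset);
        if (Normalizer.isNormalized(decoded, Normalizer.Form.NFC)) {
            return decoded;
        }
        return Normalizer.normalize(decoded, Normalizer.Form.NFC);
    }

    // Rough probe: decode each possible single byte and report whether
    // every result is already NFC. Multi-byte sequences are not covered.
    static boolean looksNormalizingForSingleBytes(Charset charset) {
        for (int b = 0; b < 256; b++) {
            String s = new String(new byte[] { (byte) b }, charset);
            if (!Normalizer.isNormalized(s, Normalizer.Form.NFC)) {
                return false;
            }
        }
        return true;
    }

    public static void main(String[] args) {
        // UTF-8 bytes for "e" followed by U+0300 COMBINING GRAVE ACCENT.
        byte[] decomposed = { 0x65, (byte) 0xCC, (byte) 0x80 };
        String text = decode(decomposed, StandardCharsets.UTF_8);
        System.out.println(text.length());
        System.out.println(looksNormalizingForSingleBytes(StandardCharsets.ISO_8859_1));
    }
}
```

With icu4j the same shape should apply, substituting its normalizer for the java.text one, and the decode step could be wrapped in a Reader that is handed to the parser.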
Jeremy
