All of our XML input files are UTF-8, and I haven't seen any problems with direct file transforms using Xerces 1.6/Xalan 1.3 or Xerces 2.3/Xalan 1.6.  I never saw a reason for ICU, since all of our stuff is UTF-8 (or UTF-16), so we don't build with it.  Like I said, the only time I saw (what should have been Japanese) characters incorrectly converted to entities was when the original got mangled, either through an incorrect character conversion or through botched URL encoding.  Sorry I can't see what you're doing wrong, offhand.
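For what it's worth, the kind of mangling I mean is easy to reproduce: if the UTF-8 bytes get misread as a single-byte encoding (Latin-1 here, just as an illustration) somewhere in the pipeline, each Japanese character turns into several garbage characters, which a serializer will then happily escape as bogus entities. A minimal sketch:

```python
# Hypothetical illustration of the mangling: Japanese UTF-8 bytes
# misinterpreted as Latin-1 somewhere in the pipeline.
original = "日本語"                      # 3 Japanese characters
utf8_bytes = original.encode("utf-8")    # 9 bytes of correct UTF-8

# A broken step decodes those bytes with the wrong charset:
mangled = utf8_bytes.decode("latin-1")
print(len(original), "->", len(mangled))  # 3 characters become 9

# Latin-1 round-trips the raw bytes losslessly...
assert mangled.encode("latin-1") == utf8_bytes
# ...but once the mangled *string* is treated as real text and
# re-encoded as UTF-8, the original bytes are gone for good:
assert mangled.encode("utf-8") != utf8_bytes
```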
 
(MSXML is really annoying, isn't it, when it ignores the encoding="UTF-8" attribute on xsl:output and produces UTF-16 whether you want it or not?  I suspect it's laziness on their part: the COM interface relies on wide-char BSTRs, and they don't want to either write a UTF-8 path or incur the overhead of converting back.)
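A quick way to confirm what a processor actually emitted, regardless of what xsl:output asked for, is to sniff the first few bytes of the output. A sketch (stand-alone Python, not tied to any particular toolchain):

```python
# Sketch: sniff the leading bytes of a transform's output to see which
# encoding the XSLT processor actually produced.
def sniff_xml_encoding(data: bytes) -> str:
    if data[:3] == b"\xef\xbb\xbf":
        return "UTF-8 (with BOM)"
    if data[:2] == b"\xff\xfe":
        return "UTF-16 little-endian"
    if data[:2] == b"\xfe\xff":
        return "UTF-16 big-endian"
    # No BOM: a leading '<' with no NUL bytes nearby implies UTF-8
    # (or some other ASCII superset); NULs interleaved imply UTF-16.
    if data[:1] == b"<" and b"\x00" not in data[:4]:
        return "UTF-8 (or other ASCII superset)"
    return "unknown"

print(sniff_xml_encoding(b"<?xml version='1.0'?><a/>"))
print(sniff_xml_encoding(b"\xff\xfe" + "<a/>".encode("utf-16-le")))
```

Running this over the MSXML output versus the Xalan output should make the UTF-16/UTF-8 difference obvious without a hex editor.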

...
This is a bit troubling. I start with a UTF-8 XML file with Japanese
text in it. We transform that XML->XML in Xalan, and then we try an
XML->HTML transformation on the result with Xalan. Interestingly, the
result of the XML->XML transformation in Xalan 1.7+ICU is UTF-8 XML,
while MSXML creates UTF-16 XML. I guess I need to go over the bytes in
the first output XML very carefully to make sure they're still
correctly encoded UTF-8. Yarg.
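Going over the bytes doesn't have to be manual: a strict UTF-8 decode will pinpoint the first malformed sequence, if any. A minimal sketch (the sample bytes are hypothetical, not from my actual files):

```python
# Sketch: locate the first invalid UTF-8 byte in some data, or report
# None if the whole buffer is well-formed UTF-8.
def first_invalid_utf8(data: bytes):
    try:
        data.decode("utf-8")   # strict by default
        return None
    except UnicodeDecodeError as e:
        return e.start         # byte offset of the bad sequence

# \xe6\x97\xa5 is the full 3-byte sequence for 日; dropping the last
# byte makes it a truncated (invalid) sequence at offset 3.
print(first_invalid_utf8(b"<j>\xe6\x97\xa5</j>"))  # None -> valid
print(first_invalid_utf8(b"<j>\xe6\x97</j>"))      # 3 -> truncated
```

Point it at the intermediate XML (`open(path, "rb").read()`) and it will tell you immediately whether the first transform's output is still clean UTF-8.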

Do I need to do anything special in the API to read the source XML
file? My understanding was that Xerces/Xalan should build the DOM tree
correctly based on the encoding declared in the file.
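That understanding matches how conforming XML parsers generally behave: the character set is taken from the document itself (the BOM and/or the encoding pseudo-attribute in the XML declaration), so no extra API work should be needed. The same principle, illustrated with Python's stdlib parser rather than Xerces:

```python
import xml.etree.ElementTree as ET

# Illustration with Python's stdlib parser (not Xerces/Xalan): a
# conforming parser picks the character set from the document itself,
# so the same text parses identically from UTF-8 or UTF-16 bytes.
jp = "日本語"
doc_utf8 = f"<?xml version='1.0' encoding='utf-8'?><t>{jp}</t>".encode("utf-8")
doc_utf16 = f"<?xml version='1.0' encoding='utf-16'?><t>{jp}</t>".encode("utf-16")

assert ET.fromstring(doc_utf8).text == jp
assert ET.fromstring(doc_utf16).text == jp
print("both encodings parse to the same text")
```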

--
Nick
