> > This just recently happened when I was creating a Xerces text node,
> > and the DOM_String (Xerces 1.6!) was constructed with a char* that
> > pointed to UTF-8, instead of a wchar_t* pointing to UTF-16.  What
> > happens is that Xerces interprets char* as a *multibyte* character
> > set, and converts it to UTF-8 using the local codepage.  If it is
> > ASCII, no harm done, but if it's really UTF-8 (encoded Japanese, for
> > instance), the UTF-8 is treated as SHIFT-JIS and "converted"
> > (corrupted) to UTF-8.  When that is output, you'll get escaped
> > characters because Xalan correctly determines that the byte-stream is
> > not valid UTF-8.  Don't know if this digression applies, but make sure
> > you've still got UTF-8 before using Xalan to process it.  If it really
> > is UTF-8, I haven't seen a problem.
>
> This is a bit troubling.  I start with a UTF-8 XML file with Japanese
> text in it.  We transform that XML->XML in Xalan, and then we try an
> XML->HTML transformation on the result with Xalan.  It's a bit
> interesting in that the result of the XML->XML transformation in Xalan
> 1.7+ICU is UTF-8 XML, while MSXML creates UTF-16 XML.  I guess I need
> to go over the bytes very carefully in the first output XML to make
> sure that they're still the correct UTF-8 encoding.  Yarg.

This digression does not apply, because what Keith is doing in his code is
wrong.  Xerces-C is not "interpreting" his string in any way.  He is using
a constructor that expects a character string encoded in the local code
page to create a text node.  What he should do is use a UTF-8 transcoder to
transcode the text to UTF-16 and create a text node using the transcoded
string.
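
To make the transcoding step concrete, here is a rough sketch of what a
UTF-8 to UTF-16 transcoder does with the bytes.  This is not the Xerces-C
API: utf8ToUtf16 and appendUtf16 are hypothetical names, and the code uses
only the standard C++ library.  In real code you would use a transcoder
supplied by Xerces-C rather than writing your own.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <stdexcept>
#include <string>

// Hypothetical helper: append one Unicode code point as UTF-16,
// using a surrogate pair for code points above U+FFFF.
static void appendUtf16(std::u16string& out, std::uint32_t cp) {
    assert(cp <= 0x10FFFF);  // callers validate the range first
    if (cp < 0x10000) {
        out.push_back(static_cast<char16_t>(cp));
    } else {
        cp -= 0x10000;
        out.push_back(static_cast<char16_t>(0xD800 + (cp >> 10)));
        out.push_back(static_cast<char16_t>(0xDC00 + (cp & 0x3FF)));
    }
}

// Hypothetical helper: decode UTF-8 bytes into UTF-16 code units.
// Throws std::runtime_error if the input is not well-formed UTF-8.
std::u16string utf8ToUtf16(const std::string& in) {
    std::u16string out;
    std::size_t i = 0;
    while (i < in.size()) {
        const unsigned char b0 = static_cast<unsigned char>(in[i]);
        std::uint32_t cp = 0;     // code point being assembled
        std::size_t extra = 0;    // number of continuation bytes expected
        std::uint32_t minCp = 0;  // smallest value allowed for this length
        if (b0 < 0x80) {
            cp = b0;
        } else if ((b0 & 0xE0) == 0xC0) {
            cp = b0 & 0x1F; extra = 1; minCp = 0x80;
        } else if ((b0 & 0xF0) == 0xE0) {
            cp = b0 & 0x0F; extra = 2; minCp = 0x800;
        } else if ((b0 & 0xF8) == 0xF0) {
            cp = b0 & 0x07; extra = 3; minCp = 0x10000;
        } else {
            throw std::runtime_error("invalid UTF-8 lead byte");
        }
        if (in.size() - i <= extra) {
            throw std::runtime_error("truncated UTF-8 sequence");
        }
        for (std::size_t k = 1; k <= extra; ++k) {
            const unsigned char b = static_cast<unsigned char>(in[i + k]);
            if ((b & 0xC0) != 0x80) {
                throw std::runtime_error("invalid UTF-8 continuation byte");
            }
            cp = (cp << 6) | (b & 0x3F);
        }
        // Reject overlong forms, surrogates, and out-of-range values.
        if (cp < minCp || cp > 0x10FFFF || (cp >= 0xD800 && cp <= 0xDFFF)) {
            throw std::runtime_error("invalid UTF-8 code point");
        }
        appendUtf16(out, cp);
        i += extra + 1;
    }
    return out;
}
```

With this sketch, the three UTF-8 bytes E6 97 A5 decode to the single
UTF-16 code unit U+65E5.  Handing the same three bytes to a constructor
that assumes the local code page would decode them as something else
entirely, which is the corruption described above.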

An XSLT processor is free to choose either UTF-8 or UTF-16 if you don't
specify an output encoding in the xsl:output element, or if it does not
support the encoding you've specified.  Xalan-C chooses UTF-8, while MSXSL
chooses UTF-16.  If you want UTF-16 from Xalan, specify that encoding on
the xsl:output element, and you'll get UTF-16.  If you want MSXSL to
produce UTF-8, specify that encoding on the xsl:output element, make sure
you're transforming to a stream, and you'll get UTF-8.

Dave
