Re: Avoiding the escaping UTF-8 unicode text

Nick Bastin 7 Mar 2004 23:22:16 -0000


On Mar 7, 2004, at 5:11 PM, Keith Rogers wrote:

Not sure which statement you're having the problem with, but if you've got xsl:output's encoding set to UTF-8, you're using disable-output-escaping="yes" (e.g., in xsl:value-of or xsl:text), and you still see it: when I've seen this problem, it turned out that the data wasn't actually UTF-8.

Yes, I forgot to mention that I did try disable-output-escaping='yes' in the xsl:value-of statement (although I don't believe I should have to do that - since I *do* want output-escaping of characters if they end up being invalid entities - but Xalan should figure out which ones those are for me automatically...i.e., it should know which chars are 1 byte wide and which are 3 bytes wide).

This just recently happened when I was creating a Xerces text node, and the DOM_String (Xerces 1.6!) was constructed with a char* that pointed to UTF-8, instead of a wchar_t* pointing to UTF-16. What happens is that Xerces interprets char* as a *multibyte* character set, and converts it to UTF-16 using the local codepage. If it is ASCII, no harm done, but if it's really UTF-8 (encoded Japanese, for instance), the UTF-8 is treated as SHIFT-JIS and "converted" (corrupted) to UTF-16. When that is output, you'll get escaped characters because Xalan correctly determines that the byte-stream is not valid UTF-8. Don't know if this digression applies, but make sure you've still got UTF-8 before using Xalan to process it. If it really is UTF-8, I haven't seen a problem.
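To make the failure mode above concrete, here is a minimal sketch of what "decoding with the wrong codepage" does to UTF-8 bytes. It does not use Xerces at all, and it assumes ISO-8859-1 (Latin-1) as a stand-in for the local codepage, because Latin-1 needs no conversion table; a SHIFT-JIS codepage would corrupt the text differently, but the mechanism is the same. The function name is hypothetical.

```cpp
#include <string>

// Hypothetical helper: treat every input byte as one Latin-1 character
// (the "wrong codepage" step), then write each of those characters back
// out as UTF-8. Latin-1 is an assumption standing in for the local codepage.
std::string misdecode_latin1_to_utf8(const std::string& bytes) {
    std::string out;
    for (char c : bytes) {
        const unsigned char b = static_cast<unsigned char>(c);
        if (b < 0x80) {
            // ASCII is identical in both encodings, so it passes through.
            out.push_back(static_cast<char>(b));
        } else {
            // Code points U+0080..U+00FF take two bytes in UTF-8.
            out.push_back(static_cast<char>(0xC0 | (b >> 6)));
            out.push_back(static_cast<char>(0x80 | (b & 0x3F)));
        }
    }
    return out;
}
```

Feeding it the three UTF-8 bytes E6 97 A5 (one Japanese character) yields six bytes, C3 A6 C2 97 C2 A5, which no longer decode to the original character. Plain ASCII input comes back unchanged, which matches the "no harm done" case.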

This is a bit troubling. I start with a UTF-8 XML file with Japanese text in it. We transform that XML->XML in Xalan, and then we try an XML->HTML transformation on the result with Xalan. It's a bit interesting in that the result of the XML->XML transformation in Xalan 1.7+ICU is UTF-8 XML, while MSXML creates UTF-16 XML. I guess I need to go over the bytes very carefully in the first output XML to make sure that they're still the correct UTF-8 encoding. Yarg.
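For going over the bytes, a small standalone checker may be easier than reading a hex dump. The sketch below is not part of Xerces or Xalan; the function name is made up for illustration. It assumes the intermediate file has been read into a std::string and reports the offset of the first sequence that is not well-formed UTF-8.

```cpp
#include <cstddef>
#include <string>

// Hypothetical helper: returns the byte offset of the first ill-formed
// UTF-8 sequence in `bytes`, or std::string::npos if the buffer is clean.
std::size_t first_invalid_utf8(const std::string& bytes) {
    const std::size_t n = bytes.size();
    std::size_t i = 0;
    while (i < n) {
        const unsigned char b0 = static_cast<unsigned char>(bytes[i]);
        std::size_t len;       // total bytes in this sequence
        unsigned long cp;      // code point being decoded
        unsigned long min_cp;  // smallest code point legal for this length
        if (b0 < 0x80)                { len = 1; cp = b0;        min_cp = 0; }
        else if ((b0 & 0xE0) == 0xC0) { len = 2; cp = b0 & 0x1F; min_cp = 0x80; }
        else if ((b0 & 0xF0) == 0xE0) { len = 3; cp = b0 & 0x0F; min_cp = 0x800; }
        else if ((b0 & 0xF8) == 0xF0) { len = 4; cp = b0 & 0x07; min_cp = 0x10000; }
        else return i;  // stray continuation byte or illegal lead byte

        if (len > n - i) return i;  // sequence runs past the end of the buffer
        for (std::size_t k = 1; k < len; ++k) {
            const unsigned char b = static_cast<unsigned char>(bytes[i + k]);
            if ((b & 0xC0) != 0x80) return i;  // not a continuation byte
            cp = (cp << 6) | (b & 0x3F);
        }
        if (cp < min_cp) return i;                   // overlong encoding
        if (cp > 0x10FFFF) return i;                 // beyond Unicode range
        if (cp >= 0xD800 && cp <= 0xDFFF) return i;  // UTF-16 surrogate
        i += len;
    }
    return std::string::npos;
}
```

A check like this catches bytes that are still in another encoding (raw SHIFT-JIS, for example). It cannot catch text that was transcoded through the wrong codepage and then re-encoded, because that output is well-formed UTF-8 holding the wrong characters; for that case, comparing a known Japanese string against its expected bytes is the more reliable test.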

Do I need to do anything special in the API to read the source XML file? My understanding was that Xerces/Xalan should handle reading the DOM tree correctly based on the character set in the file.

--
Nick
