http://nagoya.apache.org/bugzilla/show_bug.cgi?id=12105

UTF Encoding is not preserved

------- Additional Comments From [EMAIL PROTECTED] 2003-07-21 19:31 -------

Here are some test cases that demonstrate two issues:

1) The issue reported initially is that UTF-8 escaped characters (multi-byte sequences such as 0xC2 0xA9) are converted to character entities rather than being output as UTF-8 escape sequences. This may well be defined as an implementation decision.

2) The issue I commented on boils down, I think, to the ToStream.escapingNotNeeded() method incorrectly returning true for some characters. I think this is an actual defect, since it can produce invalid XML results.

To test this hypothesis, I created the following XML source document and XSL stylesheet.

XML source document:

    <?xml version="1.0" encoding="UTF-8" ?>
    <xalan_test>
      <test_case>
        <escaped>© - Escaped (C) - 0xC2 0xA9</escaped>
        <entity>&#169; - Entity (C) - character entity 169</entity>
      </test_case>
      <test_case>
        <escaped>® - Escaped (R) - 0xC2 0xAE</escaped>
        <entity>&#174; - Entity (R) - character entity 174</entity>
      </test_case>
    </xalan_test>

XSL stylesheet:

    <?xml version="1.0" ?>
    <xsl:stylesheet version="1.0"
        xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
        xmlns:xalan="http://xml.apache.org/xslt" >

      <!-- ===================================================================== -->
      <xsl:output
          method = "xml"
          omit-xml-declaration = "no"
          standalone = "yes"
          indent = "yes"
          xalan:indent-amount = "2" />

      <!-- ===================================================================== -->
      <xsl:template match="/">
        <xsl:apply-templates />
      </xsl:template>

      <!-- ===================================================================== -->
      <xsl:template match="xalan_test">
        <xalan_result>
          <xsl:apply-templates />
        </xalan_result>
      </xsl:template>

      <!-- ===================================================================== -->
      <xsl:template match="test_case">
        <test_result>
          <escaped>
            <xsl:value-of select="escaped"/>
          </escaped>
          <entity>
            <xsl:value-of select="entity"/>
          </entity>
        </test_result>
      </xsl:template>

      <!-- ===================================================================== -->
    </xsl:stylesheet>

I then modified the escapingNotNeeded() method to add a special-case hack for the character 0xA9 (the copyright symbol). For this character, I force escapingNotNeeded() to return false.

Here is the transformed result document:

    <?xml version="1.0" encoding="UTF-8" standalone="yes"?>
    <xalan_result>
      <test_result>
        <escaped>&#169; - Escaped (C) - 0xC2 0xA9</escaped>
        <entity>&#169; - Entity (C) - character entity 169</entity>
      </test_result>
      <test_result>
        <escaped>� - Escaped (R) - 0xC2 0xAE</escaped>
        <entity>� - Entity (R) - character entity 174</entity>
      </test_result>
    </xalan_result>

Observations:

First, in the case of the (R) character (the second test case), the (R) character is not properly encoded in the result document. This results in an invalid UTF-8 encoded XML document.

Second, in the case of the (C) character, where the escapingNotNeeded() method was hacked, the (C) character was encoded as a character entity, producing valid UTF-8 encoded XML.

Third, both the escaped and the entity versions of the (C) in the source document were transformed into entities in the result document, which can be defined as an implementation choice.
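
To make the first observation concrete, here is a minimal Java sketch of why a lone 0xAE byte is not valid UTF-8, and of the two forms that would be valid in a UTF-8 document: the two-byte sequence 0xC2 0xAE, or a numeric character entity. The class Utf8Check and its methods are hypothetical names made up for illustration. They are not part of Xalan and do not reproduce what ToStream does internally. The sketch uses only the standard java.nio.charset API.

```java
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.CharsetDecoder;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;

// Hypothetical helper for illustration only; not Xalan code.
class Utf8Check {

    // Returns true if the bytes form a well-formed UTF-8 sequence.
    // The decoder is set to REPORT so malformed input raises an exception
    // instead of being silently replaced.
    static boolean isValidUtf8(byte[] bytes) {
        CharsetDecoder decoder = StandardCharsets.UTF_8.newDecoder()
                .onMalformedInput(CodingErrorAction.REPORT)
                .onUnmappableCharacter(CodingErrorAction.REPORT);
        try {
            decoder.decode(ByteBuffer.wrap(bytes));
            return true;
        } catch (CharacterCodingException e) {
            return false;
        }
    }

    // Formats a character as a decimal numeric character entity.
    static String toCharRef(char c) {
        return "&#" + (int) c + ";";
    }

    // Renders bytes in the "0xC2 0xAE" style used in the test document.
    static String toHex(byte[] bytes) {
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < bytes.length; i++) {
            if (i > 0) {
                sb.append(' ');
            }
            sb.append(String.format("0x%02X", bytes[i] & 0xFF));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        char registered = '\u00AE'; // the (R) character, code point 174

        // Proper UTF-8 encoding of (R) is a two-byte sequence.
        byte[] encoded = String.valueOf(registered).getBytes(StandardCharsets.UTF_8);
        System.out.println(toHex(encoded) + " valid: " + isValidUtf8(encoded));

        // A single 0xAE byte is a continuation byte without a lead byte.
        byte[] raw = { (byte) 0xAE };
        System.out.println(toHex(raw) + " valid: " + isValidUtf8(raw));

        // The entity form is plain ASCII, so it is valid in any encoding.
        System.out.println(toCharRef(registered));
    }
}
```

Either valid form would avoid the invalid output seen for (R) above. Which of the two the serializer should emit is the implementation choice discussed in issue 1.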
