DO NOT REPLY TO THIS EMAIL, BUT PLEASE POST YOUR BUG 
RELATED COMMENTS THROUGH THE WEB INTERFACE AVAILABLE AT
<http://nagoya.apache.org/bugzilla/show_bug.cgi?id=12105>.
ANY REPLY MADE TO THIS MESSAGE WILL NOT BE COLLECTED AND 
INSERTED IN THE BUG DATABASE.

http://nagoya.apache.org/bugzilla/show_bug.cgi?id=12105

UTF Encoding is not preserved





------- Additional Comments From [EMAIL PROTECTED]  2003-07-21 19:31 -------
Here are some test cases that demonstrate two issues:

1) The issue reported initially is that UTF-8 encoded ("escaped") characters 
are being written out as numeric character references rather than as UTF-8 
byte sequences.  This may well be defined as an implementation decision.

2) The issue I commented on boils down, I think, to the 
ToStream.escapingNotNeeded() method incorrectly returning true for some 
characters.  This I think is an actual defect since it can produce invalid XML 
results.
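
To make the two forms in issue 1 concrete, here is a minimal Java sketch that 
prints the UTF-8 byte sequence and the numeric character reference for the two 
characters used in the test cases below.  The class and method names 
(CharForms, utf8Hex, charRef) are made up for illustration and are not part of 
Xalan.

```java
import java.nio.charset.StandardCharsets;

final class CharForms {
    // Hex dump of the UTF-8 bytes of a string, bytes separated by spaces.
    static String utf8Hex(String s) {
        StringBuilder sb = new StringBuilder();
        for (byte b : s.getBytes(StandardCharsets.UTF_8)) {
            if (sb.length() > 0) {
                sb.append(' ');
            }
            sb.append(String.format("%02X", b & 0xFF));
        }
        return sb.toString();
    }

    // Decimal numeric character reference for a Unicode code point.
    static String charRef(int codePoint) {
        return "&#" + codePoint + ";";
    }

    public static void main(String[] args) {
        // Copyright sign, U+00A9
        System.out.println(utf8Hex("\u00A9") + " / " + charRef(0x00A9));
        // Registered sign, U+00AE
        System.out.println(utf8Hex("\u00AE") + " / " + charRef(0x00AE));
    }
}
```

An XML parser reading a UTF-8 document treats the bytes 0xC2 0xA9 and the 
reference &#169; as the same character, which is why issue 1 can be regarded as 
a serializer choice rather than a defect.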

To test this hypothesis, I created the following XML source document and XSL 
stylesheet:

XML source document:

        <?xml version="1.0" encoding="UTF-8" ?>
        <xalan_test>
                <test_case>
                        <escaped>© - Escaped (C) - 0xC2 0xA9</escaped>
                        <entity>&#169; - Entity (C) - character entity 169</entity>
                </test_case>
                <test_case>
                        <escaped>® - Escaped (R) - 0xC2 0xAE</escaped>
                        <entity>&#174; - Entity (R) - character entity 174</entity>
                </test_case>
        </xalan_test>

XSL Stylesheet:

        <?xml version="1.0" ?>
        <xsl:stylesheet
                version="1.0"
                xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
                xmlns:xalan="http://xml.apache.org/xslt"
                >
                <!-- 
===================================================================== -->
                <xsl:output
                        method               = "xml"
                        omit-xml-declaration = "no"  
                        standalone           = "yes" 
                        indent               = "yes" 
                        xalan:indent-amount  = "2"   
                        /> 
                <!-- 
===================================================================== -->
                <xsl:template match="/">
                        <xsl:apply-templates />
                </xsl:template>
                <!-- 
===================================================================== -->
                <xsl:template match="xalan_test">
                        <xalan_result>
                                <xsl:apply-templates />
                        </xalan_result>
                </xsl:template>
                <!-- 
===================================================================== -->
                <xsl:template match="test_case">
                        <test_result>
                                <escaped>
                                        <xsl:value-of select="escaped"/>
                                </escaped>
                                <entity>
                                        <xsl:value-of select="entity"/>
                                </entity>
                        </test_result>
                </xsl:template>
                <!-- 
===================================================================== -->
        </xsl:stylesheet>

I then modified the escapingNotNeeded() method to add a special-case hack for 
the character 0xA9 (which is the copyright symbol).  For this character, I 
force escapingNotNeeded() to return false.
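
For comparison, here is a rough sketch of what a general (non-hack) decision 
could look like: a character is written as-is only if the output encoding can 
represent it, and otherwise as a numeric character reference.  This is NOT 
Xalan's ToStream code; the class EscapeSketch and the method escapeText are 
hypothetical, and the sketch leans on the standard 
java.nio.charset.CharsetEncoder.canEncode() call.

```java
import java.nio.charset.Charset;
import java.nio.charset.CharsetEncoder;
import java.nio.charset.StandardCharsets;

final class EscapeSketch {
    // Hypothetical helper: escape element text for the given output encoding.
    // Markup characters are always escaped; any other character is escaped
    // only when the encoding cannot represent it.
    static String escapeText(String text, Charset outputEncoding) {
        CharsetEncoder encoder = outputEncoding.newEncoder();
        StringBuilder out = new StringBuilder();
        for (int i = 0; i < text.length(); ) {
            int cp = text.codePointAt(i);
            String ch = new String(Character.toChars(cp));
            i += Character.charCount(cp);
            if (cp == '<') {
                out.append("&lt;");
            } else if (cp == '>') {
                out.append("&gt;");
            } else if (cp == '&') {
                out.append("&amp;");
            } else if (encoder.canEncode(ch)) {
                out.append(ch);                          // safe to write directly
            } else {
                out.append("&#").append(cp).append(';'); // numeric reference
            }
        }
        return out.toString();
    }

    public static void main(String[] args) {
        String sample = "\u00A9 & \u00AE";
        // With US-ASCII output, both symbols must become references.
        System.out.println(escapeText(sample, StandardCharsets.US_ASCII));
        // With UTF-8 output, both symbols can be written directly.
        System.out.println(escapeText(sample, StandardCharsets.UTF_8));
    }
}
```

Under this scheme the copyright symbol would not need a special case for UTF-8 
output, provided the writer that turns characters into bytes really uses 
UTF-8.  The sketch does not handle unpaired surrogates or characters that XML 
forbids outright.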

Here is the transformed result document:

        <?xml version="1.0" encoding="UTF-8" standalone="yes"?>
        <xalan_result>
                <test_result>
            <escaped>&#169; - Escaped (C) - 0xC2 0xA9</escaped>
            <entity>&#169; - Entity (C) - character entity 169</entity>
          </test_result>
                <test_result>
            <escaped>� - Escaped (R) - 0xC2 0xAE</escaped>
            <entity>� - Entity (R) - character entity 174</entity>
          </test_result>
        </xalan_result>

Observations:

First, in the case of the (R) character (in the second test case), the result 
document does not properly encode the (R) character.  This results in an 
invalid UTF-8 encoded XML document.
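
As a sanity check on the first observation, the following sketch tests whether 
a byte sequence is well-formed UTF-8 using a strict decoder.  It assumes the 
serializer wrote the (R) character as the single byte 0xAE; the result shown 
above does not confirm the exact bytes, so treat that as an assumption.  The 
class name Utf8Check is made up for illustration.

```java
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;

final class Utf8Check {
    // Returns true only if the bytes decode as UTF-8 without any error.
    static boolean isValidUtf8(byte[] bytes) {
        try {
            StandardCharsets.UTF_8.newDecoder()
                    .onMalformedInput(CodingErrorAction.REPORT)
                    .onUnmappableCharacter(CodingErrorAction.REPORT)
                    .decode(ByteBuffer.wrap(bytes));
            return true;
        } catch (CharacterCodingException e) {
            return false;
        }
    }

    public static void main(String[] args) {
        // Assumed faulty output: a lone 0xAE byte (continuation byte, no lead byte).
        byte[] loneByte = { (byte) 0xAE };
        // Correct UTF-8 encoding of U+00AE.
        byte[] twoBytes = { (byte) 0xC2, (byte) 0xAE };
        System.out.println("0xAE alone valid:  " + isValidUtf8(loneByte));
        System.out.println("0xC2 0xAE valid:   " + isValidUtf8(twoBytes));
    }
}
```

A conforming XML parser has to reject a document containing the first sequence 
when it is declared as UTF-8, which matches the claim that the result is 
invalid.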

Second, in the case of the (C) character, where the escapingNotNeeded() method 
was hacked, the (C) character was encoded, producing valid UTF-8 encoded XML.

Third, both the escaped and entity versions of the (C) in the source document 
were transformed into entities in the result document, which can be defined as 
an implementation choice.
