http://nagoya.apache.org/bugzilla/show_bug.cgi?id=22623

           Summary: Tabulator (U+0009) character in element attribute not
                    serialized as numerical entity by default xml serializer
           Product: XalanJ2
           Version: 2.5Dx
          Platform: All
               URL: http://groups.google.de/groups?q=tabulator+attribute+xslt&hl=de&lr=&ie=UTF-8&selm=1g000ah.cqv8s26o5x8gN%25roth%40visualclick.de&rnum=1
        OS/Version: MacOS X
            Status: NEW
          Severity: Major
          Priority: Other
         Component: org.apache.xalan.serialize
        AssignedTo: [EMAIL PROTECTED]
        ReportedBy: [EMAIL PROTECTED]


[applies to: XalanJ2 2.5D1]

SUMMARY:
The default XML serializer needs to write tab (U+0009), CR and LF
characters in attribute values as numeric character references.
Otherwise, because of the attribute-value normalization rules in the
XML 1.0 spec, the document changes semantically when it is re-parsed by
a conforming XML 1.0 parser (i.e. the tab is replaced by a single space).

REPRODUCTION INFO & DETAILS:
When using this "identity" processing sheet:

--snip--
<xsl:stylesheet version="1.0"
xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
<xsl:output method="xml" encoding="iso-8859-1" />

<xsl:template match="@*|node()">
  <xsl:copy>
    <xsl:apply-templates select="@*|node()"/>
  </xsl:copy>
</xsl:template>

</xsl:stylesheet>
--snip--

on this XML instance document:

--snip--
<?xml version="1.0" encoding="iso-8859-1" ?>
<element attr="a&#9;tab" />
--snip--

the result is:

--snip--
<?xml version="1.0" encoding="iso-8859-1"?>
<element attr="a  tab"/>
--snip--        ^^
Tabulator(0x9)--^^

, i.e. the &#9; numeric character reference from the input document is
not recreated at serialization time, but is written out as the literal
character, a tab.
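
For anyone who wants to drive this reproduction from code, here is a
minimal sketch using the standard JAXP API. The class and method names
(IdentityTransformDemo, applyIdentity) are made up for illustration, and
the exact output depends on whichever XSLT processor and serializer JAXP
finds at runtime, so it may or may not show the behavior reported here.

```java
import java.io.StringReader;
import java.io.StringWriter;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerException;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.stream.StreamResult;
import javax.xml.transform.stream.StreamSource;

// Hypothetical demo class; not part of Xalan.
class IdentityTransformDemo {
    // The "identity" stylesheet from this report.
    static final String IDENTITY_XSLT =
        "<xsl:stylesheet version=\"1.0\""
        + " xmlns:xsl=\"http://www.w3.org/1999/XSL/Transform\">"
        + "<xsl:output method=\"xml\" encoding=\"iso-8859-1\"/>"
        + "<xsl:template match=\"@*|node()\">"
        + "<xsl:copy><xsl:apply-templates select=\"@*|node()\"/></xsl:copy>"
        + "</xsl:template>"
        + "</xsl:stylesheet>";

    /** Applies the identity stylesheet to the given XML and returns the serialized result. */
    static String applyIdentity(String xml) {
        try {
            Transformer t = TransformerFactory.newInstance()
                    .newTransformer(new StreamSource(new StringReader(IDENTITY_XSLT)));
            StringWriter out = new StringWriter();
            t.transform(new StreamSource(new StringReader(xml)), new StreamResult(out));
            return out.toString();
        } catch (TransformerException e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        String once = applyIdentity("<element attr=\"a&#9;tab\"/>");
        // Make a literal tab visible so it can be told apart from a space.
        System.out.println(once.replace("\t", "[TAB]"));
    }
}
```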

Unfortunately, this means that re-applying the identity stylesheet above
to this output replaces the tab character with a single space character,
in accordance with the Attribute-Value Normalization rules
(<http://www.w3.org/TR/REC-xml#AVNormalize>):

--snip--
<?xml version="1.0" encoding="iso-8859-1"?>
<element attr="a tab"/>
--snip--        ^
Space(0x20)-----^
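
The normalization step itself can be observed without any XSLT involved.
The following is a small sketch using the JDK's DOM parser; the class and
method names (AttrNormalizationDemo, parseAttr) are illustrative only. It
parses the same attribute once with a literal tab and once with &#9;.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;

// Hypothetical demo class; not part of Xalan.
class AttrNormalizationDemo {
    /** Parses a one-element document and returns the value of its "attr" attribute. */
    static String parseAttr(String xml) {
        try {
            Document doc = DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder()
                    .parse(new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8)));
            return doc.getDocumentElement().getAttribute("attr");
        } catch (Exception e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        // Literal tab in the attribute value: normalized to a space by the parser.
        String literalTab = parseAttr("<element attr=\"a\ttab\"/>");
        // Tab written as a character reference: survives parsing as a tab.
        String charRef = parseAttr("<element attr=\"a&#9;tab\"/>");
        System.out.println(literalTab.equals("a tab"));   // prints "true"
        System.out.println(charRef.equals("a\ttab"));     // prints "true"
    }
}
```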

In short: The above "identity" stylesheet does not deliver a
semantically identical document. If it did, the tab character in the
attribute value would have to be written as a numeric character
reference, so that a later parser would recreate the tab character in
the attribute value (instead of normalizing it to a single space).
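
As a rough illustration of the requested behavior, the escaping could
look something like the sketch below. This is not Xalan's serializer
code; AttrEscaper and escapeAttributeValue are hypothetical names, and
the sketch assumes attribute values are delimited by double quotes.

```java
// Hypothetical helper; a sketch of the requested escaping, not Xalan's implementation.
class AttrEscaper {
    /** Escapes a string for use inside a double-quoted XML attribute value. */
    static String escapeAttributeValue(String value) {
        StringBuilder out = new StringBuilder(value.length() + 16);
        for (int i = 0; i < value.length(); i++) {
            char c = value.charAt(i);
            switch (c) {
                // Whitespace that attribute-value normalization would turn into a space.
                case '\t': out.append("&#9;"); break;
                case '\n': out.append("&#10;"); break;
                case '\r': out.append("&#13;"); break;
                // Characters that must always be escaped in attribute values.
                case '&': out.append("&amp;"); break;
                case '<': out.append("&lt;"); break;
                case '"': out.append("&quot;"); break;
                default: out.append(c);
            }
        }
        return out.toString();
    }

    public static void main(String[] args) {
        // prints: <element attr="a&#9;tab"/>
        System.out.println("<element attr=\"" + escapeAttributeValue("a\ttab") + "\"/>");
    }
}
```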

Christian Roth
