DO NOT REPLY TO THIS EMAIL, BUT PLEASE POST YOUR BUG 
RELATED COMMENTS THROUGH THE WEB INTERFACE AVAILABLE AT
<http://nagoya.apache.org/bugzilla/show_bug.cgi?id=24278>.
ANY REPLY MADE TO THIS MESSAGE WILL NOT BE COLLECTED AND 
INSERTED IN THE BUG DATABASE.

Incorrect SAXException when serializing &#338; with UTF-8 encoding

------- Additional Comments From [EMAIL PROTECTED]  2003-11-07 19:24 -------
Please let me know if you see any reason this simple patch should not be
applied, as it has some implications for method="text" output.

The patch just submitted increases the default maximum printable character
used when no output encoding is specified.

This only impacts method="text" serialization, and only when no encoding is
specified for the output.  But isn't the default supposed to be UTF-8 or UTF-16?

If a user had specified
<xsl:output method="text" encoding="UTF-8" />
or
<xsl:output method="text" encoding="UTF-16" />
then the maximum printable character would have been set to 0xFFFF, which would
override the default set in Encodings.java.

So this looks like a safe fix to me.

The problem was exposed because TransformerImpl.transformToString() in Xalan-J
interpretive uses a pool of ToTextStream() serializers to convert nodes to
Strings.  This is a slight misuse of a serializer, and some of the things that
a serializer does to its output should NOT be done for such an intermediate
serialization.

It used to be that the serializers didn't complain if a character didn't fit
in the output encoding; they just wrote it out.  But with bug 795 fixed, the
stream serializers now check.  As a result, the ToTextStream() serializer used
internally thinks that the characters don't fit in the output encoding.  Since
no encoding was specified for these internal helper serializers, the maximum
defaulted to 0x7F, and any character above that triggered a complaint.
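
To make the check described above concrete, here is a minimal sketch. It is
not the Xalan-J code: the class, the method names and the two constants are
hypothetical, and only the values 0x7F and 0xFFFF come from this comment.
Treating every encoding other than UTF-8 and UTF-16 as 0x7F is a
simplification made for the example.

```java
import java.util.Locale;

public class MaxCharCheck {
    // Hypothetical constants mirroring the values named in this comment;
    // they are not the actual fields of Encodings.java.
    public static final int DEFAULT_MAX_CHAR = 0x7F;
    public static final int UTF_MAX_CHAR = 0xFFFF;

    /** Highest char value assumed to be writable as-is for an encoding. */
    public static int maxCharFor(String encoding) {
        if (encoding == null) {
            // No encoding specified: fall back to the default maximum.
            return DEFAULT_MAX_CHAR;
        }
        String name = encoding.toUpperCase(Locale.ROOT);
        if (name.equals("UTF-8") || name.equals("UTF-16")) {
            return UTF_MAX_CHAR;
        }
        // Simplification: all other encodings are treated as 7-bit here.
        return DEFAULT_MAX_CHAR;
    }

    /** True if the char is at or below the maximum for the encoding. */
    public static boolean fits(char ch, String encoding) {
        return ch <= maxCharFor(encoding);
    }

    public static void main(String[] args) {
        char oe = '\u0152'; // the character &#338; from the bug summary
        System.out.println("no encoding: " + fits(oe, null));
        System.out.println("UTF-8:       " + fits(oe, "UTF-8"));
    }
}
```

With no encoding set, U+0152 is above 0x7F and is rejected; with UTF-8 or
UTF-16 set, it is below 0xFFFF and is accepted.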

Another possibility was to set the encoding on the helper serializers to
"UTF-16", but this would be a performance problem because each helper
serializer is reset() at the end of the transformToString() method, so one
would constantly need to re-set the encoding, even when the serializer is
grabbed from the pool.  Bumping the default maximum value is faster.

The patch passed all conformance tests.

From an external point of view the impact is this:
If you are serializing to a text stream, no encoding was specified, and a
pair of characters forming a UTF-16 surrogate pair is encountered, the pair
would previously have been written as a character reference (e.g. &#45923; ),
but now it will be written out as-is (see my earlier cut and paste of
writeUTF16Surrogate(c, ch, i, end); in this bug).
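
The two output forms can be compared with a small sketch. It does not
reproduce writeUTF16Surrogate(); the class and method names are hypothetical,
U+1D11E is just a sample character outside the 16-bit range, and
Character.toCodePoint() requires Java 5 or later.

```java
public class SurrogateOutput {
    /** Combines a surrogate pair and formats it as a character reference. */
    public static String asCharRef(char high, char low) {
        int codePoint = Character.toCodePoint(high, low);
        return "&#" + codePoint + ";";
    }

    /** Returns the surrogate pair unchanged, as two UTF-16 code units. */
    public static String asIs(char high, char low) {
        return new String(new char[] { high, low });
    }

    public static void main(String[] args) {
        // U+1D11E (MUSICAL SYMBOL G CLEF) is the pair D834 DD1E in UTF-16.
        char high = '\uD834';
        char low = '\uDD1E';
        System.out.println(asCharRef(high, low)); // prints &#119070;
        System.out.println(asIs(high, low).length()); // prints 2
    }
}
```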

This behavior should actually be better for an internal serializer-helper.


Regards,
Brian Minchau
