http://nagoya.apache.org/bugzilla/show_bug.cgi?id=24278

Incorrect SAXException when serializing Œ with UTF-8 encoding

------- Additional Comments From [EMAIL PROTECTED] 2003-11-07 19:24 -------

Please let me know if you have reasons why this simple patch should not be applied, as it has some implications for method="text" output.

The patch just submitted increases the default maximum printable character used when no output encoding is specified. This only affects method="text" serialization, and only when no encoding is specified for the output. But isn't the default supposed to be UTF-8 or UTF-16? If a user had specified <xsl:output method="text" encoding="UTF-8" /> or <xsl:output method="text" encoding="UTF-16" />, then the maximum printable character would have been set to 0xFFFF, which would override the default set in Encodings.java. So this looks like a safe fix to me.

The problem was exposed because TransformerImpl.transformToString() in Xalan-J interpretive uses a pool of ToTextStream() serializers to convert nodes to Strings. This is a slight misuse of a serializer, and some of the things that a serializer does to its output should NOT be done for such an intermediate serialization.

It used to be that the serializers didn't complain if a character didn't fit in the output encoding; they just wrote it out. Now that bug 795 is fixed, the stream serializers do check. As a result, the ToTextStream() serializer used internally thinks that the characters don't fit in the output encoding. Since no encoding was specified for these internal helper serializers, they defaulted to a maximum of 0x7F and complained about anything above it.

Another possibility was to set the encoding on the helper serializers to "UTF-16", but this would be a performance problem: each helper serializer is reset() at the end of the transformToString() method, so the encoding would have to be set again every time a serializer is grabbed from the pool. Bumping the default maximum value is faster. It passed all conformance tests.
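
As a rough illustration of the check described above (a sketch only, not the actual Xalan code): the class MaxCharSketch, its lookup table, and the methods maxCharFor() and fits() are all hypothetical names made up for this example. The 0x7F default and the 0xFFFF limit for UTF-8/UTF-16 come from the description above; the raised default of 0xFFFF used in main() is an assumption, since the exact new value is not stated here.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical sketch of a "maximum printable character" check.
// It is NOT the real Encodings.java or ToTextStream logic.
class MaxCharSketch {

    // Assumed table: highest char value written as-is for a given encoding.
    private static final Map<String, Integer> MAX_CHAR = new HashMap<>();
    static {
        MAX_CHAR.put("UTF-8", 0xFFFF);
        MAX_CHAR.put("UTF-16", 0xFFFF);
    }

    // With no encoding (or an unknown one), fall back to the supplied default.
    static int maxCharFor(String encoding, int defaultMax) {
        if (encoding == null) {
            return defaultMax;
        }
        Integer max = MAX_CHAR.get(encoding);
        return (max != null) ? max : defaultMax;
    }

    // True if the char would be written without a complaint.
    static boolean fits(char ch, int maxChar) {
        return ch <= maxChar;
    }

    public static void main(String[] args) {
        char oe = 0x152; // LATIN CAPITAL LIGATURE OE, the character in this bug

        int oldDefault = 0x7F;    // default before the patch, per the comment above
        int newDefault = 0xFFFF;  // assumed value for the raised default

        System.out.println("no encoding, old default: " + fits(oe, maxCharFor(null, oldDefault)));
        System.out.println("explicit UTF-8:           " + fits(oe, maxCharFor("UTF-8", oldDefault)));
        System.out.println("no encoding, new default: " + fits(oe, maxCharFor(null, newDefault)));
    }
}
```

The point of the sketch is that an explicit encoding bypasses the default entirely, so raising the default only changes the outcome for serializers that never had an encoding set, such as the pooled helpers.
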
From an external point of view, the impact is this: if you are serializing to a text stream, no encoding was specified, and a pair of characters forming a UTF-16 surrogate pair is encountered, the pair would previously have been written as a character reference (e.g. 덣), but now it will be written out as-is (see my earlier cut and paste of writeUTF16Surrogate(c, ch, i, end); in this bug). This behavior should actually be better for an internal serializer helper.

Regards,
Brian Minchau
