http://nagoya.apache.org/bugzilla/show_bug.cgi?id=3047
*** shadow/3047 Wed Aug 8 10:01:32 2001
--- shadow/3047.tmp.8134 Wed Aug 8 12:01:07 2001
***************
*** 27,29 ****
--- 27,56 ----
When specifying us-ascii as the encoding in the xsl it works, but this solution
is not very popular.
Is there a fix to this?
+
+ ------- Additional Comments From [EMAIL PROTECTED] 2001-08-08 12:01 -------
+ There's no such thing as a "non-UTF-8 character". UTF-8, like UTF-16, can
+ represent any Unicode character, though it may have to be represented as
+ multiple bytes at the UTF-8 level. So showing the XML character escape should be
+ unnecessary.
+
+ However, I agree that the specific character in question, 252 (U+00FC), falls
+ in the range that UTF-8 encodes as a two-byte sequence, so it should not appear
+ in the output as the single byte 0xFC.
+
+ As shown in http://www.unicode.org/unicode/reports/tr17/index.html, the width
+ map for UTF-8 is:
+   0x00..0x7F                     ==> 1 byte
+   0x80..0x7FF                    ==> 2 bytes
+   0x800..0xD7FF, 0xE000..0xFFFF  ==> 3 bytes
+   0x10000..0x10FFFF              ==> 4 bytes
+
+ Character 252 (decimal) is 0xFC. UTF-8 should output it as two bytes. The
+ conversion applied in this case should be
+ 00000yyy yyxxxxxx => 110yyyyy 10xxxxxx
+ or more specifically (if I haven't messed this up)
+ 00000000 11111100 => 11000011 10111100
+ In hex, that's (again, assuming no careless errors)
+ 0x00FC => 0xC3 0xBC
+
+ So I would expect to see the two bytes 195 and 188, in order, written to your
+ UTF-8 output. If that isn't what we're doing, that would indeed be a bug.
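
The width map and the two-byte conversion described in the comment above can be checked with a short script. This is only a sketch for verifying the arithmetic, not the serializer's actual code; the function names utf8_width and encode_two_byte are made up for illustration.

```python
def utf8_width(code_point):
    """Return how many bytes UTF-8 needs for a code point (per the width map)."""
    if 0xD800 <= code_point <= 0xDFFF:
        # Surrogates are excluded from the 3-byte range in the width map.
        raise ValueError("surrogate code points are not encodable")
    if code_point < 0:
        raise ValueError("code point must be non-negative")
    if code_point <= 0x7F:
        return 1
    if code_point <= 0x7FF:
        return 2
    if code_point <= 0xFFFF:
        return 3
    if code_point <= 0x10FFFF:
        return 4
    raise ValueError("code point is beyond U+10FFFF")


def encode_two_byte(code_point):
    """Apply 00000yyy yyxxxxxx => 110yyyyy 10xxxxxx to a code point."""
    if not 0x80 <= code_point <= 0x7FF:
        raise ValueError("not in the two-byte range 0x80..0x7FF")
    lead = 0b11000000 | (code_point >> 6)           # 110yyyyy
    trail = 0b10000000 | (code_point & 0b00111111)  # 10xxxxxx
    return bytes([lead, trail])


if __name__ == "__main__":
    encoded = encode_two_byte(0xFC)
    print(utf8_width(0xFC), list(encoded), encoded.hex())
    # Cross-check against Python's built-in UTF-8 encoder.
    print(encoded == chr(0xFC).encode("utf-8"))
```

Comparing the hand-built bytes with the built-in encoder is a quick way to confirm that 0xC3 0xBC (195, 188) is the sequence a conforming UTF-8 writer should emit for character 252.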