PLEASE DO NOT REPLY TO THIS MESSAGE. TO FURTHER COMMENT
ON THE STATUS OF THIS BUG PLEASE FOLLOW THE LINK BELOW
AND USE THE ON-LINE APPLICATION. REPLYING TO THIS MESSAGE
DOES NOT UPDATE THE DATABASE, AND SO YOUR COMMENT WILL
BE LOST SOMEWHERE.

http://nagoya.apache.org/bugzilla/show_bug.cgi?id=3047

*** shadow/3047 Wed Aug  8 10:01:32 2001
--- shadow/3047.tmp.8134        Wed Aug  8 12:01:07 2001
***************
*** 27,29 ****
--- 27,56 ----
  When specifying us-ascii as the encoding in the xsl it works, but this solution 
  is not very popular.
  Is there a fix to this?
+ 
+ ------- Additional Comments From [EMAIL PROTECTED]  2001-08-08 12:01 -------
+ There's no such thing as a "non-UTF-8 character". UTF-8, like UTF-16, can 
+ represent any Unicode character, though characters above 0x7F take more than 
+ one byte once encoded in UTF-8. So falling back to an XML character reference 
+ should be unnecessary.
+ 
+ However, I agree that the specific character in question, 252, is in the range 
+ where it should be written as a two-byte sequence rather than as a single byte.
+ 
+ As shown in http://www.unicode.org/unicode/reports/tr17/index.html, the width 
+ map for UTF-8 is: 
+      0x00..0x7F                      ==>   1 byte
+      0x80..0x7FF                     ==>   2 bytes
+      0x800..0xD7FF, 0xE000..0xFFFF   ==>   3 bytes
+      0x10000..0x10FFFF               ==>   4 bytes
+ 
+ Character 252 (decimal) is 0xFC. UTF-8 should output it as two bytes. The 
+ conversion applied in this case should be
+   00000yyy yyxxxxxx =>  110yyyyy 10xxxxxx
+ or more specifically (if I haven't messed this up)
+   00000000 11111100 =>  11000011 10111100
+ In hex, that's (again, assuming no careless errors)
+   0x00FC => 0xC3 0xBC
+ 
+ So I would expect to see the two bytes 195 and 188, in order, written to your 
+ UTF-8 output. If that isn't what we're doing, that would indeed be a bug.
