Re: UTF-8 encoding question

Joseph Kesselman 21 Nov 2002 23:18:59 -0000

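In UTF-8, characters over 0x7F are encoded as multi-byte sequences, so a 
bare 0xD2 byte is not a complete character.  Your 0xD2 character (binary 
11010010) should be encoded as the two bytes 11000011 10010010, or 
0xC3 0x92: the lead byte 110xxxxx carries the high bits and the 
continuation byte 10xxxxxx carries the low six.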
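
If it helps to see the bit-shuffling spelled out, here is a minimal Python 
sketch of the two-byte case only. The function name utf8_encode_two_byte is 
made up for illustration; it is not part of Xerces or any library, and it 
deliberately ignores the one-, three- and four-byte forms.

```python
def utf8_encode_two_byte(code_point: int) -> bytes:
    """Encode a code point in U+0080..U+07FF as two UTF-8 bytes.

    Hypothetical helper for illustration; real code should just call
    str.encode("utf-8").
    """
    if not 0x80 <= code_point <= 0x7FF:
        raise ValueError("only U+0080..U+07FF use the two-byte form")
    lead = 0b11000000 | (code_point >> 6)           # 110xxxxx: high 5 bits
    trail = 0b10000000 | (code_point & 0b00111111)  # 10xxxxxx: low 6 bits
    return bytes([lead, trail])


encoded = utf8_encode_two_byte(0xD2)
print(" ".join(f"{b:08b}" for b in encoded))  # 11000011 10010010
print(" ".join(f"0x{b:02X}" for b in encoded))  # 0xC3 0x92

# Cross-check against Python's built-in encoder.
assert encoded == chr(0xD2).encode("utf-8")
```

The final assert is just a sanity check that the hand-rolled result agrees 
with the standard encoder.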


See http://www.faqs.org/rfcs/rfc2279.html for the exact details.

As to why an ancient version of Xerces accepted it: It was a bug. Try a 
modern release of Xerces and see if it still accepts that byte; I'd bet it 
won't.
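
For comparison, this is roughly what "not accepting it" looks like in a 
strict decoder. The sketch below uses Python's built-in UTF-8 codec, not 
Xerces, and the byte strings are invented examples rather than your actual 
document, so treat it only as an illustration of the expected behaviour.

```python
def is_well_formed_utf8(data: bytes) -> bool:
    """Return True if data decodes under Python's strict UTF-8 decoder."""
    try:
        data.decode("utf-8")  # errors="strict" is the default
    except UnicodeDecodeError:
        return False
    return True


# Hypothetical fragments, not taken from the original message.
bad = b"<a>\xd2</a>"       # lone 0xD2 lead byte followed by ASCII '<'
good = b"<a>\xc3\x92</a>"  # U+00D2 encoded as 0xC3 0x92

print(is_well_formed_utf8(bad))   # False
print(is_well_formed_utf8(good))  # True
```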

______________________________________
Joe Kesselman  / IBM Research

