Theodore,

Many explanations of UTF-8 describe the encoding of code points on Code Planes 1-16 by way of the intermediate concept of surrogates, as in UTF-16. I believe this is both unnecessary and misleading: UTF-8 is fundamentally a direct 21-bit encoding scheme, as may be seen in the attached document, so the concept of surrogates is not relevant to UTF-8 encoding of code points above the BMP.
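As a rough illustration of that point, here is a minimal Python sketch that writes the bits of a code point straight into one to four UTF-8 bytes, with no surrogate step in between. It is not taken from the attached document; the function name utf8_encode is hypothetical and chosen only for this example.

```python
def utf8_encode(cp: int) -> bytes:
    """Encode one Unicode scalar value as UTF-8 (illustrative sketch only)."""
    # Surrogate code points (U+D800..U+DFFF) belong to UTF-16 and are not
    # valid input here; neither is anything above U+10FFFF.
    if not 0 <= cp <= 0x10FFFF or 0xD800 <= cp <= 0xDFFF:
        raise ValueError(f"not a Unicode scalar value: {cp:#x}")

    if cp < 0x80:
        # 1 byte: 0xxxxxxx (7 payload bits)
        return bytes([cp])
    if cp < 0x800:
        # 2 bytes: 110xxxxx 10xxxxxx (11 payload bits)
        return bytes([0xC0 | (cp >> 6),
                      0x80 | (cp & 0x3F)])
    if cp < 0x10000:
        # 3 bytes: 1110xxxx 10xxxxxx 10xxxxxx (16 payload bits)
        return bytes([0xE0 | (cp >> 12),
                      0x80 | ((cp >> 6) & 0x3F),
                      0x80 | (cp & 0x3F)])
    # 4 bytes: 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx (21 payload bits)
    # This covers Planes 1-16 directly from the code point value.
    return bytes([0xF0 | (cp >> 18),
                  0x80 | ((cp >> 12) & 0x3F),
                  0x80 | ((cp >> 6) & 0x3F),
                  0x80 | (cp & 0x3F)])


# U+1F600 lies on Plane 1; its 21 bits go straight into four bytes.
print(utf8_encode(0x1F600).hex(" "))  # f0 9f 98 80
```

The four-byte branch carries 3 + 6 + 6 + 6 = 21 payload bits, which is where the "21-bit" description comes from.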
The attached document is a slightly different explanation of how UTF-8 works, written by me for the Ultracode(r) bar code spec (Ultracode encodes all of Unicode 3 directly). If any Unicodotti find any errors in it... please let me know!

Regards,
Clive

*************************************
Clive P Hohberger, PhD
VP, Technology Development & Director of Patent Affairs
Zebra Technologies Corporation
333 Corporate Woods Parkway
Vernon Hills IL 60061-3109 USA
Voice: +1 847 793 2740
FAX: +1 847 793 5573
Cellular: +1 847 910 8794
E-mail: [EMAIL PROTECTED]

-----Original Message-----
From: Theodore H. Smith [mailto:[EMAIL PROTECTED]]
Sent: Wednesday, May 29, 2002 7:12 AM
To: [EMAIL PROTECTED]
Subject: How is UTF8, UTF16 and UTF32 encoded?

I need to know exactly how UTF8, UTF16 and UTF32 are encoded. I heard that UTF32 can have surrogates, so I can't just expect them to be scalar values.

A nice, detailed and clear explanation would help, with plenty of examples, the effects of the encoding, and all kinds of things to make it easier to understand.

Or perhaps I'm just reacting to the confusion of the Unicode website, and it's not that hard to understand and a simple definition would do? But the first idea certainly wouldn't hurt.

--
Theodore H. Smith - Macintosh Consultant / Contractor.
My website: <www.elfdata.com/>
UTF-8Explained.doc
Description: MS-Word document

