Peter,

There was another abomination proposed.  Oracle, rather than adding UTF-16
support, proposed that non-plane-0 characters be encoded to and from UTF-8
by encoding each half of the surrogate pair as a separate UTF-8 character.

This way they could encode UTF-16 as if it were UCS-2, producing two 3-byte
UTF-8 sequences per supplementary character.  Correct UTF-16 to UTF-8
conversion requires that the UTF-16 first be converted to UTF-32 (decoding
the surrogate pair into a 32-bit integer) and then encoded into UTF-8.
This can be done on a character-by-character basis, so there is no
intermediate buffering requirement.
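A minimal sketch of the correct conversion just described, in Python (the
function name and error handling are my own illustration, not from any
particular implementation): each high/low surrogate pair is decoded into a
single code point before the UTF-8 bytes are emitted, one character at a
time.

```python
def utf16_to_utf8(units):
    """Convert a list of UTF-16 code units to UTF-8 bytes.

    Surrogate pairs are decoded into one code point first, so
    supplementary characters come out as a single 4-byte sequence,
    never as two 3-byte sequences.
    """
    out = bytearray()
    it = iter(units)
    for u in it:
        if 0xD800 <= u <= 0xDBFF:            # high surrogate
            low = next(it, None)
            if low is None or not (0xDC00 <= low <= 0xDFFF):
                raise ValueError("unpaired high surrogate")
            # Recombine the pair into a code point in U+10000..U+10FFFF
            cp = 0x10000 + ((u - 0xD800) << 10) + (low - 0xDC00)
        elif 0xDC00 <= u <= 0xDFFF:
            raise ValueError("unpaired low surrogate")
        else:
            cp = u
        # Standard UTF-8 encoding of a single code point
        if cp < 0x80:
            out.append(cp)
        elif cp < 0x800:
            out += bytes([0xC0 | (cp >> 6), 0x80 | (cp & 0x3F)])
        elif cp < 0x10000:
            out += bytes([0xE0 | (cp >> 12),
                          0x80 | ((cp >> 6) & 0x3F),
                          0x80 | (cp & 0x3F)])
        else:
            out += bytes([0xF0 | (cp >> 18),
                          0x80 | ((cp >> 12) & 0x3F),
                          0x80 | ((cp >> 6) & 0x3F),
                          0x80 | (cp & 0x3F)])
    return bytes(out)
```

Note that no buffer of decoded code points is ever built up; each unit (or
pair) is converted and emitted immediately, as described above.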

Carl

-----Original Message-----
From: [EMAIL PROTECTED] [mailto:[EMAIL PROTECTED]] On
Behalf Of [EMAIL PROTECTED]
Sent: Friday, May 25, 2001 8:29 AM
To: [EMAIL PROTECTED]
Subject: ISO vs Unicode UTF-8 (was RE: UTF-8 signature in web and email)



On 05/25/2001 02:13:36 AM Bill Kurmey wrote:

>Are there not 2 versions of UTF-8, the Unicode Standard (maximum of 4
>octets) and the ISO/IEC Annex/Amendment to 10646 (maximum of 6 octets)?

The distinction between the Unicode and ISO versions of UTF-8 is pretty
irrelevant. ISO UTF-8 allows a maximum of 6 octets because it is designed
to accommodate a larger codespace than Unicode, but the portion of the
codespace beyond U+10FFFF is now permanently reserved. For all practical
purposes, the usable ISO codespace is the same as that for Unicode, and
thus the usable ISO UTF-8 sequences are at most 4 octets long.
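As a quick sanity check of that 4-octet ceiling (Python used here purely
for illustration), the highest Unicode code point, U+10FFFF, encodes to
exactly four octets, so the 5- and 6-octet ISO forms are never needed for
Unicode text:

```python
# U+10FFFF is the top of the Unicode codespace; its UTF-8 form is
# F4 8F BF BF -- four octets, the longest sequence Unicode ever needs.
encoded = "\U0010FFFF".encode("utf-8")
print(encoded.hex(), len(encoded))  # f48fbfbf 4
```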



- Peter


---------------------------------------------------------------------------
Peter Constable

Non-Roman Script Initiative, SIL International
7500 W. Camp Wisdom Rd., Dallas, TX 75236, USA
Tel: +1 972 708 7485
E-mail: <[EMAIL PROTECTED]>



