Re: any unicode conversion tools?

Ernest Cline Fri, 07 May 2004 14:34:15 -0700

> [Original Message]
> From: Jon Hanna <[EMAIL PROTECTED]>
>
> UTF-8 as defined in Unicode 4.0 can never be greater than 4 bytes long.
> However, illegal sequences can be up to 6 (not just 5) bytes long.
>
> UTF-8 has been variously defined in various standards and specs as
> an encoding of either Unicode or of ISO 10646. ISO 10646 has space
> up to U+7FFFFFFF, although there is a commitment not to use anything
> above U+10FFFF to maintain compatibility with Unicode.
>
> Because of this some of the specifications for UTF-8 that have been
> published allow for U+7FFFFFFF and below to be encoded
> (U+7FFFFFFF would be encoded as FD BF BF BF BF BF)[1]. For
> example RFC 2279 (which is defined in terms of ISO 10646 alone)
> allows this, but it is obsoleted by RFC 3629 (STD 63) which references
> the Unicode standard.


Theoretically, it is possible to encounter valid 5- or 6-byte sequences
in UTF-8.  ISO 10646 IIRC had some private use areas above U+10FFFF.
Therefore a version of UTF-8 that referenced the earlier ISO 10646
definition could have data that referred to such a character.  Why anyone
would need or want to do this is beyond me, but it would be possible
for there to exist such data.  However, like the possibility of encountering
Unicode 1 Hangul syllables, it isn't something I'd especially worry about.
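
For anyone who wants to see what such data would look like, here is a
rough Python sketch of the older RFC 2279-style encoding, which extends
the usual bit layout to 5- and 6-byte forms.  The function names
encode_rfc2279 and accepted_as_unicode_utf8 are made up for this
illustration, and the second one simply assumes that Python's built-in
"utf-8" codec follows the Unicode definition (4 bytes maximum).  It is
not a production converter; for one thing, it does not reject surrogates.

```python
def encode_rfc2279(cp: int) -> bytes:
    """Encode a code point with the older UTF-8 scheme (up to 6 bytes).

    Hypothetical helper for illustration only.
    """
    if not 0 <= cp <= 0x7FFFFFFF:
        raise ValueError("code point out of range for RFC 2279 UTF-8")
    if cp < 0x80:
        return bytes([cp])  # plain ASCII, one byte

    # (largest code point, lead-byte marker) for the 2- to 6-byte forms
    forms = [
        (0x7FF, 0xC0),
        (0xFFFF, 0xE0),
        (0x1FFFFF, 0xF0),
        (0x3FFFFFF, 0xF8),
        (0x7FFFFFFF, 0xFC),
    ]
    for n_trail, (limit, lead) in enumerate(forms, start=1):
        if cp <= limit:
            break

    out = []
    for _ in range(n_trail):
        out.append(0x80 | (cp & 0x3F))  # trail byte carries the low 6 bits
        cp >>= 6
    out.append(lead | cp)               # remaining bits go in the lead byte
    return bytes(reversed(out))


def accepted_as_unicode_utf8(data: bytes) -> bool:
    """True if Python's strict "utf-8" codec decodes the data."""
    try:
        data.decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False


if __name__ == "__main__":
    for cp in (0x20AC, 0x10FFFF, 0x200000, 0x7FFFFFFF):
        seq = encode_rfc2279(cp)
        print(f"U+{cp:X}: {seq.hex(' ').upper()} "
              f"accepted={accepted_as_unicode_utf8(seq)}")
```

With this sketch, U+7FFFFFFF comes out as FD BF BF BF BF BF, matching
the sequence Jon quoted, while anything up to U+10FFFF comes out the
same as ordinary 4-byte-maximum UTF-8.  A decoder written to the current
definition should reject the longer forms, so a tool that has to read
such old data would need its own lenient decoder.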
