UTF-8 in RFC 2279 and ISO 10646

2001-05-01 Thread Florian Weimer

Sorry for this question which is slightly off topic:

Are the UTF-8 definitions in ISO/IEC 10646-1:2000 and RFC 2279
identical or equivalent?  Can any harm result if a normative document
refers to both definitions?  (This would be a bad idea if the
definitions are slightly different.)

And BTW: Does ISO 10646 define character properties (such as lowercase
letter, uppercase letter, titlecase letter, other letter, decimal
digit, other digit and so on)?
-
Linux-UTF8:   i18n of Linux on all levels
Archive:  http://mail.nl.linux.org/lists/



Re: UTF-8 in RFC 2279 and ISO 10646

2001-05-01 Thread Markus Kuhn

Florian Weimer wrote on 2001-05-01 12:48 UTC:
> Sorry for this question which is slightly off topic:
>
> Are the UTF-8 definitions in ISO/IEC 10646-1:2000 and RFC 2279
> identical or equivalent?

The differences are rather subtle. For instance, ISO/IEC 10646-1:2000
makes it clear that the UTF-8 sequences of U+D800 .. U+DFFF, U+FFFE and
U+FFFF are not allowed to occur in a UTF-8 stream, whereas RFC 2279
doesn't mention that. RFC 2279 on the other hand warns of the risk of
UTF-8 decoders accepting overlong sequences, which ISO 10646 does not
mention explicitly. ISO 10646-1 specifies ISO 2022 ESC sequences for
UTF-8, whereas the other standards don't. Unicode 3.0 required UTF-8
decoders to decode overlong sequences, whereas Unicode 3.1 requires them
to be treated as malformed sequences. Etc.
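
To make the checks discussed above concrete, here is a minimal sketch
of a strict decoder in Python. It is an illustration only, not code
taken from any of the standards: the name decode_strict and the table
_FORMS are made up for this example, the 1- to 6-byte forms follow the
layout given in RFC 2279, and the rejected set (overlong forms,
U+D800 .. U+DFFF, U+FFFE and U+FFFF) is an assumption based on the
differences described in this message.

```python
# Hypothetical helper for illustration; names are not from any standard.
# Each row: (lead-byte test mask, expected pattern, payload mask,
#            sequence length, smallest code point that needs this length)
_FORMS = [
    (0x80, 0x00, 0x7F, 1, 0x0),
    (0xE0, 0xC0, 0x1F, 2, 0x80),
    (0xF0, 0xE0, 0x0F, 3, 0x800),
    (0xF8, 0xF0, 0x07, 4, 0x10000),
    (0xFC, 0xF8, 0x03, 5, 0x200000),
    (0xFE, 0xFC, 0x01, 6, 0x4000000),
]


def decode_strict(data):
    """Decode UTF-8 bytes into a list of code points, raising ValueError
    on overlong sequences, surrogates, U+FFFE and U+FFFF."""
    out = []
    i = 0
    while i < len(data):
        lead = data[i]
        for test, pattern, payload, length, smallest in _FORMS:
            if (lead & test) == pattern:
                break
        else:
            raise ValueError(f"invalid lead byte 0x{lead:02X} at offset {i}")

        tail = data[i + 1 : i + length]
        if len(tail) != length - 1 or any((b & 0xC0) != 0x80 for b in tail):
            raise ValueError(f"truncated or malformed sequence at offset {i}")

        cp = lead & payload
        for b in tail:
            cp = (cp << 6) | (b & 0x3F)

        # Overlong: the value would have fit into a shorter sequence.
        if cp < smallest:
            raise ValueError(f"overlong sequence for U+{cp:04X} at offset {i}")
        # Code points assumed here to be forbidden in a UTF-8 stream.
        if 0xD800 <= cp <= 0xDFFF or cp in (0xFFFE, 0xFFFF):
            raise ValueError(f"U+{cp:04X} not allowed in a UTF-8 stream")

        out.append(cp)
        i += length
    return out
```

With this sketch, decode_strict(b"\xC0\xAF") raises ValueError because
0xC0 0xAF is an overlong encoding of "/", while the regular one-byte
form b"/" decodes to [0x2F].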

> Can any harm result if a normative document
> refers to both definitions (this is a bad idea if the definitions are
> slightly different).

I'd see RFC 2279 more as the official MIME registration of the UTF-8
encoding as defined in ISO/IEC 10646-1. They are intended to be the same
thing, so I would probably reference RFC 2279 only in the context of
using the MIME charset namespace.

> And BTW: Does ISO 10646 define character properties (such as lowercase
> letter, uppercase letter, titlecase letter, other letter, decimal
> digit, other digit and so on)?

No, only Unicode does that.
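
As a small illustration of where such properties can be looked up in
practice: Python's standard unicodedata module exposes the general
category from the Unicode Character Database. The helper name
general_categories and the sample characters are chosen for this
example only; the two-letter codes (Ll, Lu, Lt, Lo, Nd, No) are the
Unicode abbreviations for the categories named in the question.

```python
import unicodedata


def general_categories(text):
    """Return (code point, general category) pairs for each character."""
    return [(f"U+{ord(ch):04X}", unicodedata.category(ch)) for ch in text]


# Lowercase, uppercase, titlecase and other letter, then decimal and
# other digit: a, A, U+01C5, U+05D0 (Hebrew alef), 7, U+00B2 (superscript two)
for code_point, category in general_categories("aA\u01C5\u05D07\u00B2"):
    print(code_point, category)
```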

Why don't you get a copy of ISO/IEC 10646-1:2000 yourself? At just
80 CHF for the PDF CD-ROM, it is the ISO standard with the lowest
per-page price ever seen.

http://www.iso.ch/cate/d29819.html

Markus

-- 
Markus G. Kuhn, Computer Laboratory, University of Cambridge, UK
Email: mkuhn at acm.org,  WWW: http://www.cl.cam.ac.uk/~mgk25/
