Re: UTF-8 in RFC 2279 and ISO 10646

2001-05-07 Thread Keld Jørn Simonsen

On Wed, May 02, 2001 at 11:56:45AM -0400, Sandra O'donnell USG wrote:
> 
> Unicode 3.1 has up-to-date character property information for all 94,000
> characters in the repertoire. As you note, 14652 has information about
> some parts of the 10646 repertoire as it was in 1998.

14652 does cover *all* parts of 10646-1 as of 1998, up to
Amendment 9. Yes, the current 10646 and Unicode specs
do cover more, but all the most important characters are covered
by 14652. 14652 should also be updated in due time, but let's first
have it approved as a TR.

Keld
-
Linux-UTF8:   i18n of Linux on all levels
Archive:  http://mail.nl.linux.org/lists/



Re: UTF-8 in RFC 2279 and ISO 10646

2001-05-02 Thread Sandra O'donnell USG

   > . . .Also, the character repertoire in that draft document
   > most closely matches ISO/IEC 10646-1:1993 and Unicode 2.0, both of which
   > are now quite old.  The current versions are ISO/IEC 10646-1:2000 and
   > Unicode 3.1.
   
   Yes, 14652 is a DTR - a Draft Technical Report.
   It covers ISO/IEC 10646-1 up to 1998, with a few characters adopted
   in 1999, so it has not got the most recent additions to 10646.

According to the Unicode book, by 1998, ISO/IEC 10646-1 was up to the
repertoire of Unicode 2.1, and that included 38,887 assigned characters.
The 2000 version of ISO/IEC 10646-1 and Unicode 3.0 have repertoires of
49,194 characters. And Unicode 3.1 already has the supplementary
characters that will be in ISO/IEC 10646-2 (currently in final
ballot). This represents approximately 45,000 new characters, bringing
the total repertoire in Unicode and both parts of 10646 to approximately
94,000 characters.

Unicode 3.1 has up-to-date character property information for all 94,000
characters in the repertoire. As you note, 14652 has information about
some parts of the 10646 repertoire as it was in 1998.

-- Sandra
---
Sandra Martin O'Donnell
Compaq Computer Corporation
[EMAIL PROTECTED]
[EMAIL PROTECTED]




Re: UTF-8 in RFC 2279 and ISO 10646

2001-05-02 Thread Keld Jørn Simonsen

On Wed, May 02, 2001 at 09:01:49AM -0400, Sandra O'donnell USG wrote:
>> And BTW: Does ISO 10646 define character properties (such as lowercase
>> letter, uppercase letter, titlecase letter, other letter, decimal
>> digit, other digit and so on)?
>
>ISO TR 14652 defines lowercase, uppercase, decimal digits but
>not titlecase and other digit, for all of 10646.
> 
> Keld, I'm sure you're aware that 14652 is a *draft*, and that it has
> not been approved. Also, the character repertoire in that draft document
> most closely matches ISO/IEC 10646-1:1993 and Unicode 2.0, both of which
> are now quite old.  The current versions are ISO/IEC 10646-1:2000 and
> Unicode 3.1.

Yes, 14652 is a DTR - a Draft Technical Report.
It covers ISO/IEC 10646-1 up to 1998, with a few characters adopted
in 1999, so it has not got the most recent additions to 10646.

> And, to answer the specific question, no, ISO 10646 does not define
> character properties.

I just added the information for people that may have thought that
ISO maintains character properties in 10646, while it is actually
done in another specification. Thanks for your clarifications, Sandra!

Keld



Re: UTF-8 in RFC 2279 and ISO 10646

2001-05-02 Thread Sandra O'donnell USG

   > And BTW: Does ISO 10646 define character properties (such as lowercase
   > letter, uppercase letter, titlecase letter, other letter, decimal
   > digit, other digit and so on)?
   
   ISO TR 14652 defines lowercase, uppercase, decimal digits but
   not titlecase and other digit, for all of 10646.

Keld, I'm sure you're aware that 14652 is a *draft*, and that it has
not been approved. Also, the character repertoire in that draft document
most closely matches ISO/IEC 10646-1:1993 and Unicode 2.0, both of which
are now quite old.  The current versions are ISO/IEC 10646-1:2000 and
Unicode 3.1.

The most consistent, continuously updated listing of character
properties is available with the Unicode standard (www.unicode.org).

And, to answer the specific question, no, ISO 10646 does not define
character properties.
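
As a minimal sketch of where those properties can be looked up, the
snippet below reads the Unicode General Category through Python's
standard unicodedata module. The mapping from the names used in the
question (including "other digit" for category No) is an assumption
made for illustration, and the data reflects whichever Unicode version
the local Python ships with, not Unicode 3.1.

```python
import unicodedata

# General Category codes from the Unicode Character Database.
# The English labels follow the wording of the question in this thread;
# pairing "other digit" with "No" is an assumption of this sketch.
CATEGORY_NAMES = {
    "Ll": "lowercase letter",
    "Lu": "uppercase letter",
    "Lt": "titlecase letter",
    "Lo": "other letter",
    "Nd": "decimal digit",
    "No": "other digit",
}


def describe(ch):
    """Return a readable label for the General Category of one character."""
    code = unicodedata.category(ch)
    # Fall back to the raw two-letter code for categories not listed above.
    return CATEGORY_NAMES.get(code, code)


if __name__ == "__main__":
    # a, A, U+01C5 (a titlecase digraph), U+05D0 (Hebrew alef),
    # 7, U+00B2 (superscript two)
    for ch in "aA\u01C5\u05D0" + "7\u00B2":
        print(f"U+{ord(ch):04X} {describe(ch)}")
```

Nothing comparable can be derived from ISO 10646 alone, since it
carries only code points and names.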

-- Sandra
---
Sandra Martin O'Donnell
Compaq Computer Corporation
[EMAIL PROTECTED]
[EMAIL PROTECTED]



Re: UTF-8 in RFC 2279 and ISO 10646

2001-05-02 Thread Keld Jørn Simonsen

On Tue, May 01, 2001 at 02:48:04PM +0200, Florian Weimer wrote:
> 
> And BTW: Does ISO 10646 define character properties (such as lowercase
> letter, uppercase letter, titlecase letter, other letter, decimal
> digit, other digit and so on)?

ISO TR 14652 defines lowercase, uppercase, decimal digits but
not titlecase and other digit, for all of 10646.

Kind regards
Keld



Re: UTF-8 in RFC 2279 and ISO 10646

2001-05-01 Thread Markus Kuhn

Florian Weimer wrote on 2001-05-01 12:48 UTC:
> Sorry for this question which is slightly off topic:
> 
> Are the UTF-8 definitions in ISO/IEC 10646-1:2000 and RFC 2279
> identical or equivalent?

The differences are rather subtle. For instance, ISO/IEC 10646-1:2000
makes it clear that the UTF-8 sequences of U+D800 .. U+DFFF, U+FFFE and
U+FFFF are not allowed to occur in a UTF-8 stream, whereas RFC 2279
doesn't mention that. RFC 2279 on the other hand warns of the risk of
UTF-8 decoders accepting overlong sequences, which ISO 10646 does not
mention explicitly. ISO 10646-1 specifies ISO 2022 ESC sequences for
UTF-8, whereas the other standards don't. Unicode 3.0 required UTF-8
decoders to decode overlong sequences, whereas Unicode 3.1 requires them
to be treated as malformed sequences. Etc.
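
A minimal sketch of a decoder that applies all of the checks mentioned
above at once: it rejects overlong sequences, the surrogate range
U+D800 .. U+DFFF, and U+FFFE / U+FFFF. The function name and the error
messages are invented for illustration, and the lead-byte table assumes
the six-form layout given in RFC 2279. It is not taken from any of the
standards' texts.

```python
def decode_utf8_strict(data: bytes) -> list:
    """Decode UTF-8 bytes into a list of code points (ints).

    Raises ValueError on malformed input, overlong forms, surrogates,
    and the code points U+FFFE and U+FFFF.
    """
    # (lead-byte mask, lead-byte pattern, continuation bytes,
    #  smallest code point that legitimately needs this form)
    forms = [
        (0x80, 0x00, 0, 0x0000),
        (0xE0, 0xC0, 1, 0x0080),
        (0xF0, 0xE0, 2, 0x0800),
        (0xF8, 0xF0, 3, 0x10000),
        (0xFC, 0xF8, 4, 0x200000),
        (0xFE, 0xFC, 5, 0x4000000),
    ]
    out = []
    i = 0
    while i < len(data):
        lead = data[i]
        for mask, pattern, extra, minimum in forms:
            if lead & mask == pattern:
                break
        else:
            raise ValueError(f"invalid lead byte 0x{lead:02X} at offset {i}")

        # Payload bits of the lead byte are those not covered by the mask.
        cp = lead & (~mask & 0xFF)
        tail = data[i + 1 : i + 1 + extra]
        if len(tail) != extra:
            raise ValueError(f"truncated sequence at offset {i}")
        for b in tail:
            if b & 0xC0 != 0x80:
                raise ValueError(f"bad continuation byte at offset {i}")
            cp = (cp << 6) | (b & 0x3F)

        if cp < minimum:
            raise ValueError(f"overlong sequence at offset {i}")
        if 0xD800 <= cp <= 0xDFFF:
            raise ValueError(f"surrogate U+{cp:04X} at offset {i}")
        if cp in (0xFFFE, 0xFFFF):
            raise ValueError(f"U+{cp:04X} not allowed at offset {i}")

        out.append(cp)
        i += 1 + extra
    return out
```

For example, decode_utf8_strict(b"\xc0\x80") raises ValueError because
C0 80 is an overlong encoding of U+0000, which is the case RFC 2279
warns about.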

> Can any harm result if a normative document
> refers to both definitions (this is a bad idea if the definitions are
> slightly different)?

I'd see RFC 2279 more as the official MIME registration of the UTF-8
encoding as defined in ISO/IEC 10646-1. They are intended to be the same
thing, so I would reference RFC 2279 probably only in the context of
using the MIME charset namespace.

> And BTW: Does ISO 10646 define character properties (such as lowercase
> letter, uppercase letter, titlecase letter, other letter, decimal
> digit, other digit and so on)?

No, only Unicode does that.

Why don't you get a copy of ISO/IEC 10646-1:2000 yourself? At just
80 CHF for the PDF CD-ROM, it is the ISO standard with the lowest
per-page price ever seen.

http://www.iso.ch/cate/d29819.html

Markus

-- 
Markus G. Kuhn, Computer Laboratory, University of Cambridge, UK
Email: mkuhn at acm.org,  WWW: 




UTF-8 in RFC 2279 and ISO 10646

2001-05-01 Thread Florian Weimer

Sorry for this question which is slightly off topic:

Are the UTF-8 definitions in ISO/IEC 10646-1:2000 and RFC 2279
identical or equivalent?  Can any harm result if a normative document
refers to both definitions (this is a bad idea if the definitions are
slightly different)?

And BTW: Does ISO 10646 define character properties (such as lowercase
letter, uppercase letter, titlecase letter, other letter, decimal
digit, other digit and so on)?