RE: More on the ASCII/Unicode support

Ove Kaaven Thu, 27 Apr 2000 12:05:03 -0700

On Thu, 27 Apr 2000, Patrik Stridvall wrote:

> >   -- no support for other encodings (say for Asian languages
> > which may need more bytes than Unicode supports)

Well that is nonsense. Unicode incorporates Asian languages.

> Speaking of Asian languages, reminds me.
> wchar_t in Unix is 32-bit long.

Not in Unix generally, but on GNU systems it is. Other (older) Unix
variants may have a 16-bit wchar_t, since Unicode used to be 16-bit. It
has grown a bit since then, though the Unicode Consortium "guarantees"
that no more than 21 bits will ever get used (64K+1M characters).
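
The arithmetic behind the 21-bit figure can be checked in a few lines. This
is only a sketch in Python (my choice for illustration, nothing from the
thread depends on it); the constant names are made up, and the ctypes query
simply reports whatever wchar_t size the local C library uses, so that
value differs between platforms.

```python
import ctypes

BMP_SIZE = 0x10000          # the original 64K (16-bit) range
SUPPLEMENTARY = 0x100000    # the extra 1M beyond it
TOTAL = BMP_SIZE + SUPPLEMENTARY
MAX_CODE_POINT = TOTAL - 1  # 0x10FFFF

# Number of bits needed to hold the largest code point.
BITS_NEEDED = MAX_CODE_POINT.bit_length()

# Width of the platform's wchar_t in bits (platform-dependent).
WCHAR_BITS = ctypes.sizeof(ctypes.c_wchar) * 8

print(hex(MAX_CODE_POINT), BITS_NEEDED, WCHAR_BITS)
```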

> Anybody care to enlighten me?

ASCII: 7-bit, one byte per character

ISO 8859 encodings, ordinary SBCS codepages: 8-bit (often extended
ASCII), one byte per character.

Asian languages, DBCS codepages: 8-bit; either one or two bytes per
character (if the first byte is a "lead byte", it's a two-byte character).
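
To make the lead-byte rule concrete, here is a rough sketch that walks a
DBCS string one character at a time. It assumes code page 932 (Shift-JIS)
purely as an example; the lead-byte ranges are hard-coded for that codepage
only, and the helper names are hypothetical.

```python
def is_lead_byte(b: int) -> bool:
    """Lead-byte ranges for code page 932 only (assumed for illustration)."""
    return 0x81 <= b <= 0x9F or 0xE0 <= b <= 0xFC


def split_dbcs(data: bytes) -> list:
    """Split a DBCS byte string into one- or two-byte characters."""
    chars = []
    i = 0
    while i < len(data):
        # A lead byte takes the following byte with it, if there is one.
        width = 2 if is_lead_byte(data[i]) and i + 1 < len(data) else 1
        chars.append(data[i:i + width])
        i += width
    return chars


encoded = "A\u3042".encode("cp932")  # 'A' plus HIRAGANA LETTER A
print(split_dbcs(encoded))
```

Note that the string has two characters but three bytes, which is exactly
why byte-counting code written for SBCS breaks on these codepages.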

UTF16: Unicode encoding, two bytes per character (preferably big-endian
but I doubt MS cares). May employ surrogate pairs (two UTF16 characters in
reserved ranges) to encode Unicode characters beyond the first 64K; the
surrogate pairs allow access to 1M more characters (may be necessary for
very exotic Asian languages, but no such characters are defined yet).
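
The surrogate pair calculation itself is short. A sketch, with a
hypothetical helper name, cross-checked against Python's own UTF-16 codec:

```python
def to_surrogates(cp: int) -> tuple:
    """Split a code point beyond the first 64K into a UTF16 surrogate pair."""
    if not 0x10000 <= cp <= 0x10FFFF:
        raise ValueError("code point is not beyond the first 64K")
    offset = cp - 0x10000            # 20-bit value
    high = 0xD800 + (offset >> 10)   # top 10 bits go in the high surrogate
    low = 0xDC00 + (offset & 0x3FF)  # bottom 10 bits go in the low surrogate
    return high, low


high, low = to_surrogates(0x10000)
manual = high.to_bytes(2, "big") + low.to_bytes(2, "big")
codec = chr(0x10000).encode("utf-16-be")
print(manual == codec)
```

Two 10-bit halves give 1024 * 1024 combinations, which is where the "1M
more characters" comes from.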

UCS2: Unicode encoding, two bytes per character, but not surrogate pairs.

UCS4: Unicode encoding, four bytes per character, easily and conveniently
encodes the full Unicode set. This is what GNU systems prefer, since they
don't want to deal with surrogate pairs.

UTF32: Same as UCS4, just defined by different organizations (UCS4 is ISO,
UTF32 is Unicode Consortium, plus the added restriction that no more than
64K+1M different characters may exist in UTF32).
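
The convenience of the four-byte form is that character N always sits at
byte offset 4*N, which is not true of UTF16 once surrogate pairs show up. A
small sketch (the helper name is hypothetical, and big-endian is just an
assumption to keep the byte order fixed):

```python
# ASCII, a Latin-1 range character, and one beyond the first 64K.
text = "A\u00e9\U00010000"

utf32 = text.encode("utf-32-be")
utf16 = text.encode("utf-16-be")


def code_point_at(buf: bytes, index: int) -> int:
    """Fetch the index-th character from a big-endian four-byte-per-char buffer."""
    return int.from_bytes(buf[4 * index:4 * index + 4], "big")


print(len(utf32), len(utf16), hex(code_point_at(utf32, 2)))
```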

UTF8 (UTF-FSS): Unicode encoding useful for compatibility with software
written for 8-bit C strings. Variable-width (between 1 and 6 bytes per
character). Lower 128 characters are encoded as plain ASCII.
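
A quick way to see the variable width, and the ASCII compatibility, is to
encode a few sample characters. This is just a sketch using Python's
built-in codec; note that this codec stops at four bytes per character,
since nothing within the 64K+1M range needs the longer forms.

```python
# One sample each from the 1-, 2-, 3- and 4-byte ranges.
samples = ["A", "\u00e9", "\u20ac", "\U00010000"]
lengths = [len(ch.encode("utf-8")) for ch in samples]

# Plain ASCII text is byte-for-byte identical in UTF8.
ascii_text = "plain C string"
same_as_ascii = ascii_text.encode("utf-8") == ascii_text.encode("ascii")

print(lengths, same_as_ascii)
```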

UTF7: Unicode encoding for compatibility with software written for 7-bit
characters (email, news, etc). A hybrid of Base64 and Quoted-Printable.
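
For illustration, Python ships a UTF7 codec, so the 7-bit property can be
checked directly. The sample string is an arbitrary choice; the exact bytes
an encoder emits can vary between implementations, so this sketch only
relies on the round trip and on decoding one known-valid form.

```python
text = "Hi Mom -\u263a-!"  # contains one non-ASCII character (U+263A)

encoded = text.encode("utf-7")

# Every byte of the encoded form fits in 7 bits.
is_7bit = all(b < 0x80 for b in encoded)

# "+Jjo-" is the Base64-style escape for U+263A.
decoded = b"Hi Mom -+Jjo--!".decode("utf-7")

print(encoded, is_7bit, decoded == text)
```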

> Regardless, the "ASCII" (A) functions in Windows can
> take character encodings other than Latin-1 (Latin-[1-?]),
> so UTF8 and "ASCII" are not really interchangeable without
> conversion, are they?

Of course, all the A functions depend on the current ANSI codepage,
retrievable with GetACP(), and settable by various different means.
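
To illustrate why the codepage matters, here is a sketch showing the same
byte meaning different things under two ANSI codepages, and UTF8 bytes
turning into garbage when treated as ANSI. The choice of codepages 1252 and
1251 is only an example; on a real system the active one is whatever
GetACP() reports.

```python
raw = b"\xe9"

as_cp1252 = raw.decode("cp1252")  # Western European: e with acute accent
as_cp1251 = raw.decode("cp1251")  # Cyrillic: a different letter entirely

# The same accented letter takes two bytes in UTF8 ...
utf8_bytes = "\u00e9".encode("utf-8")

# ... and reading those bytes as codepage 1252 yields two wrong characters.
misread = utf8_bytes.decode("cp1252")

print(as_cp1252, as_cp1251, utf8_bytes, misread)
```

So a UTF8 string can only be handed to an A function unconverted if it
happens to contain nothing but plain ASCII.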

> Then we have the strange fact that wchar_t is 32-bit,
> which I never really understood since most Unicode
> support in Unix is UTF8 IIRC.

Unicode support in Unix is UTF8? Not sure what you meant here; there are
no Unicode routines anywhere in Unix that take UTF8 strings that I know of
(apart from the charset conversion routines like mbstowcs and iconv, of
course).