Nice work, Lars.

A few months ago I started writing an XML tutorial for my coworkers, but got
so bogged down in understanding these issues that I decided it was better to
write it as "a reintroduction to XML with an emphasis on character
encoding." I still haven't finished it, but with a lot of help from the
Unicode list I got it to a point that I think is about 98% accurate.

http://www.skew.org/xml/tutorial/ is where it lives. Use it, digest it,
regurgitate it, and please, if there are any inaccuracies whatsoever, report
them to me!

I think I still need to clean up some ambiguities between IANA-registered
character set names and the names of the standards they are based on
("US-ASCII" vs ISO 646-US and ANSI X3.4 for example).

On that note, here is something I wrote for someone just a few minutes ago.
Can someone review the statements I make below regarding ASCII?

Thanks.

---

The ANSI X3.4 "ASCII" standard from 1968 defines 128 characters: control
characters at hex numbers 00 through 1F and 7F, and printable characters at
20 through 7E, which are pretty much just the things you see on American
keyboards (keys like Shift and Enter are control functions, not printable
characters). In an 8-bit encoding scheme, the byte used to represent an
ASCII character is a single byte with the same value as the character's
number.
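
To make that concrete, here is a quick Python 3 sketch (just an
illustration of the point, not anything from the standard):

    # Each ASCII character is stored as one byte whose value equals the
    # character's number.
    for ch in "A", "a", "~", " ":
        number = ord(ch)                    # the ASCII character number
        byte_value = ch.encode("ascii")[0]  # the byte that represents it
        print(f"{ch!r}: number={number:#04x}, byte={byte_value:#04x}")
        assert number == byte_value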

The ISO 646 standard was formalized in 1972 and provides national variants
of ASCII (ISO 646-XX, where XX is one of about a dozen country codes). Like
ASCII, it covers the 20 through 7E range, the C0 control set of
non-displayable characters assigned to 00 through 1F, and the delete
character at 7F. If the ECMA-6 standard is as equivalent to ISO 646 as I am
led to believe, then some leeway is allowed for currency symbols: hex
position 23 can be # or £, and 24 can be $ or ¤.

The character set defined by the ISO 646-US standard is now known as
"US-ASCII" due to its IANA registration for use on the Internet. It defines
hex position 23 to be # and 24 to be $. It is a subset of nearly every
character encoding in common use today, the notable exception being IBM's
EBCDIC, a mainframe encoding whose layout descends from punched-card codes.
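
A rough way to see this with Python 3's standard codecs (cp500 is just one
of the several EBCDIC variants Python happens to ship):

    # The same ASCII-only text yields identical bytes under several
    # encodings, but not under an EBCDIC codec.
    text = "Hello, world"
    ascii_bytes = text.encode("us-ascii")
    assert ascii_bytes == text.encode("iso-8859-1")
    assert ascii_bytes == text.encode("utf-8")
    assert ascii_bytes != text.encode("cp500")  # EBCDIC uses different values
    print(ascii_bytes.hex(), "vs", text.encode("cp500").hex())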

The ISO/IEC 8859-1 "Latin-1" standard defines character assignments for hex
numbers A0 through FF, covering the characters used in the major Western
European languages that are not already covered by ASCII, plus a few
international symbols. This includes characters with diacritical
marks/accents, «French quotation marks», the non-breaking space, the
copyright symbol, etc.

"ISO-8859-1" (note the extra hyphen) is the IANA-registered character set
that covers hex positions 00 to FF, subsetting US-ASCII and the C1 control
set (80 to 9F).
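
A quick Python 3 check of that coverage, for the curious:

    # ISO-8859-1 maps every byte 00-FF straight to the character with the
    # same number, and its 00-7F half is exactly US-ASCII.
    all_bytes = bytes(range(0x100))
    decoded = all_bytes.decode("iso-8859-1")
    assert [ord(c) for c in decoded] == list(range(0x100))
    assert all_bytes[:0x80].decode("us-ascii") == decoded[:0x80]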

US-ASCII is by definition a 7-bit character set, although 8-bit bytes are
commonly used nowadays to transmit US-ASCII encoded character sequences.
ISO-8859-1 requires 8-bit bytes. In either case, you have 1 byte per
character.
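
Again in Python 3 terms, just to illustrate the one-byte-per-character
point (the sample string is mine):

    # A Latin-1 string encodes to exactly as many bytes as it has
    # characters, and the ASCII subset never needs the high (8th) bit.
    s = "café «très» ©"
    assert len(s.encode("iso-8859-1")) == len(s)
    assert all(b < 0x80 for b in "plain ASCII".encode("us-ascii"))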

The ISO/IEC 10646-1 "Universal Character Set" standard, which from a user's
standpoint is equivalent to The Unicode Standard, defines character
assignments for hex numbers 00 through 10FFFF, although not in a completely
contiguous range. Since the range goes beyond FF, it cannot simply imply 1
byte per character like its predecessors. Thus, among other things, it
introduces a distinction between the assignment of characters to numbers,
and the conversion of numbers to sequences of bytes or other fixed-bit-width
code values.
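
Here is a small Python 3 sketch of that distinction; the musical G clef
character is just an arbitrary example of a number beyond FF:

    ch = "\U0001D11E"                    # MUSICAL SYMBOL G CLEF, hex 1D11E
    print(hex(ord(ch)))                  # the abstract character number
    print(ch.encode("utf-8").hex())      # f09d849e -- four 8-bit bytes
    print(ch.encode("utf-16-be").hex())  # d834dd1e -- two 16-bit code values
    print(ch.encode("utf-32-be").hex())  # 0001d11e -- one 32-bit code value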

The UTF-8 amendment to the ISO/IEC 10646-1 standard defines an algorithm
for converting ISO 10646-1 character numbers to sequences of 1 to 4
eight-bit bytes (strictly speaking, the scheme as formalized in the IETF's
RFC 2279 runs to 6 bytes, but numbers through 10FFFF never need more than
4). "UTF-8" is also an IANA-registered character set.
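
A short Python 3 illustration of how the byte count grows with the
character number (the sample characters are arbitrary):

    for ch in "A", "é", "€", "\U0001D11E":
        utf8 = ch.encode("utf-8")
        print(f"U+{ord(ch):04X} -> {len(utf8)} byte(s): {utf8.hex()}")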

In some ways, UTF-8 is nice, because if you are dealing mostly with
ASCII-range characters, the mapping is 1-to-1 (US-ASCII 00-7F = UTF-8 00-7F)
and you can use your favorite text editor, terminal display, or web browser
with it, without caring whether the application is aware that it's dealing
with UTF-8 rather than ISO-8859-1 or an OS-specific encoding like Windows'
CP-1252.
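
For example, in Python 3 terms (purely an illustration):

    # ASCII-only text produces the very same bytes a Latin-1 or CP-1252
    # aware tool expects, so misidentifying the encoding is harmless here.
    text = "Plain ASCII text, <tag>safe</tag>"
    utf8 = text.encode("utf-8")
    assert utf8.decode("iso-8859-1") == text
    assert utf8.decode("cp1252") == text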

In other ways, UTF-8 is problematic, because most people aren't aware that
ISO 8859-1 range characters don't enjoy the same 1-to-1 byte mapping, and
they run into trouble when they try to work with those characters via their
ISO-8859-1 byte values. It might actually help people understand character
encoding issues if UTF-8 text always looked like gibberish in
encoding-unaware applications, rather than only when it strays outside
ASCII; that is effectively the situation for people whose writing uses no
ASCII characters at all.
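
To make the problem concrete in Python 3 (the "café" example is mine, not
anything from the standards):

    # A Latin-1 range character becomes two bytes in UTF-8, and a tool
    # that assumes ISO-8859-1 shows those two bytes as two characters.
    original = "café"
    utf8_bytes = original.encode("utf-8")      # 63 61 66 c3 a9
    misread = utf8_bytes.decode("iso-8859-1")  # what a Latin-1 viewer shows
    print(misread)                             # "cafÃ©" -- the classic garbling
    assert misread != original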


[I must credit Roman Czyborra for the ASCII and EBCDIC information that I
gleaned from czyborra.com.]

   - Mike
____________________________________________________________________
Mike J. Brown, software engineer at         My XML/XSL resources:
webb.net in Denver, Colorado, USA           http://www.skew.org/xml/
