Re: [users] Character Encodings

Michael Adams Fri, 23 May 2008 05:19:27 -0700

On Fri, 23 May 2008 11:35:08 +0300
Dotan Cohen wrote:

> 2008/5/22 Michelle Konzack <[EMAIL PROTECTED]>:
> 
> > My messages are ALWAYS 7-bit if I write in english.  Otherwise they 
> > are using iso-8859-1, iso-8859-15 (german with Euro)  or  maybe 
> > iso-10646-1(utf-8).
> 
> Why ISO-8859-15 at all? Has it some advantage over UTF-8?
>


I followed a thread on the W3 validator list and learned about character
encodings recently, so bear with me here. This may contain errors, but it
is the way I remember it.

WARNING: This may be way more than you wanted to know, but I have tried
to condense it as much as possible.

Most of these character encodings have their roots in ASCII (now formally
US-ASCII), which is a 7-bit encoding where 0 - 31 are control codes,
32 - 126 are the space, the characters 0-9, A-Z, a-z and the most common
punctuation, and 127 is the delete character. The issue with this
encoding is that even in English we use words borrowed from other
languages, and often we have imported their accents along with the words
(café is a very common example). US-ASCII does not cater to any
of these accents.
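
As a small illustration of the point above, here is a sketch in Python
(my choice of language, not something from the thread) showing that
"café" cannot be written in US-ASCII but fits in one byte per character
in an 8-bit encoding such as ISO-8859-1. The variable names are made up
for the example.

```python
# Illustrative sample word; any text with an accented letter would do.
word = "café"

# US-ASCII only covers code points 0-127, so the accented letter fails.
try:
    word.encode("ascii")
    ascii_ok = True
except UnicodeEncodeError:
    ascii_ok = False

# An 8-bit encoding such as ISO-8859-1 has room for it (byte 0xE9).
latin1_bytes = word.encode("iso-8859-1")

print(ascii_ok)                           # False
print(latin1_bytes)                       # b'caf\xe9'
print(all(b < 128 for b in latin1_bytes)) # False: 0xE9 needs the 8th bit
```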

Nearly every computer manufacturer decided to expand ASCII in their own
way by making it an 8-bit code, which allowed a further 128
characters. A standard was set which said something like "Sure, do that,
but leave 128 - 159 as control codes like 0 - 31 are".

NOTE: Microsoft defied this and used those character spaces for their
"smart quotes" and other characters in the WINDOWS-1252 encoding which
does not have a lot of approval as an international standard except
by IANA (the Internet Assigned Numbers Authority) for web use.
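
To make the difference concrete, here is a rough Python sketch that
decodes the same two bytes from the 128 - 159 range both ways. The
sample bytes are my own choice; the codec names are the ones Python's
standard library registers for these encodings.

```python
import unicodedata

# Illustrative bytes: 0x93 and 0x94 both sit in the 128-159 range.
raw = b"\x93Hi\x94"

# WINDOWS-1252 assigns these two bytes to curly double quotes.
as_cp1252 = raw.decode("windows-1252")

# ISO-8859-1 leaves the same bytes as (invisible) control codes.
as_latin1 = raw.decode("iso-8859-1")

print([hex(ord(ch)) for ch in as_cp1252])
# ['0x201c', '0x48', '0x69', '0x201d']

print([unicodedata.category(ch) for ch in as_latin1])
# ['Cc', 'Lu', 'Ll', 'Cc']   (Cc = control character)
```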

Internationally ISO-8859-1 was championed through ISO certification by
ECMA in 1987 and after a minor revision in 1998 has been pretty much the
international standard for Western Latinised Language 8-bit encoding.

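For Cyrillic and other scripts and languages that ISO-8859-1 does not
cover (Central European, Greek, Arabic, Hebrew, Turkish and so on),
ISO-8859-2 to 8859-14 were brought into existence. German (for Michelle)
and Dutch (for Cor) are, as far as I know, already covered by ISO-8859-1
itself.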

NOTE: Interestingly, most of the pages on the web which claim to be
ISO-8859-1 are not accurate because they contain the WINDOWS-1252 smart
quotes or other WINDOWS-1252 characters. Most browsers allow this and
read ISO-8859-1 pages as WINDOWS-1252, because the ISO-8859-1 control
codes are illegal for use in a web page anyway.

With the advent of the Euro, ISO-8859-15 was introduced. It is
ISO-8859-1 with eight little-used characters swapped out for the Euro
sign and some letters needed for French and Finnish. As far as I can
tell it does not take in Microsoft's "smart quotes": 128 - 159 are still
control codes, so it does not combine or supersede WINDOWS-1252, and
ISO-8859-1 is still legal. I am not even sure if ISO-8859-1 is
officially deprecated (to be phased out over time).
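
One way to check what ISO-8859-15 changed relative to ISO-8859-1 is to
decode every byte value with both and list where they disagree. This is
only a sketch that relies on the codec tables shipped with Python; the
variable names are mine.

```python
# Compare the two code pages byte by byte using Python's built-in codecs.
diffs = {}
for value in range(256):
    raw = bytes([value])
    in_8859_1 = raw.decode("iso-8859-1")
    in_8859_15 = raw.decode("iso-8859-15")
    if in_8859_1 != in_8859_15:
        diffs[value] = (in_8859_1, in_8859_15)

for value, (old, new) in sorted(diffs.items()):
    print(hex(value), old, "->", new)

print(len(diffs))   # 8 positions differ

# The 128-159 range is the same in both (still control codes).
c1_unchanged = all(value not in diffs for value in range(128, 160))
print(c1_unchanged)  # True
```

The Euro sign ends up at byte 0xA4, where ISO-8859-1 has the generic
currency sign.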

UTF-8, UTF-16 and UTF-32 are a whole other story.

This bit I am fairly hazy about: UTF-16 works in 2-byte units, which
gives 256 * 256 = 65,536 values, and UTF-32 works in 4-byte units, which
gives 256 * 256 * 256 * 256 values. Both are international standards.
UTF-32 always uses 4 bytes per character. UTF-16 uses 2 bytes for most
characters and a pair of 2-byte units (4 bytes) for the rarer ones that
do not fit in 65,536. But many languages, when written, leave a lot of
empty bytes because they use only the first 256 or so characters for
most letters of their alphabet (like German and Dutch).
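
The sizes can be checked with a few lines of Python. This is just a
sketch: the helper name and the sample characters are made up, and I
use the "-le" codec variants so that no Byte Order Mark is added to the
count.

```python
def encoded_size(text, encoding):
    """Return how many bytes `text` occupies in the given encoding."""
    return len(text.encode(encoding))

# Illustrative samples: plain letter, accented letter, Euro sign, and a
# character outside the first 65,536 code points.
samples = {
    "A": "A",
    "e-acute": "\u00e9",
    "euro": "\u20ac",
    "emoji": "\U0001F600",
}

for name, ch in samples.items():
    print(name,
          encoded_size(ch, "utf-16-le"),
          encoded_size(ch, "utf-32-le"))
# A 2 4
# e-acute 2 4
# euro 2 4
# emoji 4 4
```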

UTF-8 gets around the above issue by using 1 byte for the US-ASCII
characters and, for everything else, a lead byte which says that the
next 1 to 3 bytes belong to the same character. For text that is mostly
in the Latin alphabet this takes up a lot less memory, disk space and
bandwidth than UTF-16 and UTF-32. It still allows the rarer characters,
just in more bytes each. A UTF-8 document may start with a special
character called a Byte Order Mark (BOM); as far as I know it is
optional in UTF-8 and matters mainly for UTF-16 and UTF-32. I will do no
more than mention it as it would take this too far OT (plus I don't
understand it completely).
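
Here is a rough Python sketch of the lead-byte idea: the top bits of
the first byte tell you how many bytes the character takes in total.
The function name is hypothetical and the range checks are simplified;
a real decoder validates a good deal more. The last lines show that
Python's plain "utf-8" codec writes no BOM, while "utf-8-sig" adds one.

```python
import codecs

def utf8_length_from_lead_byte(lead):
    """Total bytes in a UTF-8 sequence, judged from its first byte."""
    if lead < 0x80:
        return 1            # 0xxxxxxx: a US-ASCII character
    if 0xC2 <= lead <= 0xDF:
        return 2            # 110xxxxx: one more byte follows
    if 0xE0 <= lead <= 0xEF:
        return 3            # 1110xxxx: two more bytes follow
    if 0xF0 <= lead <= 0xF4:
        return 4            # 11110xxx: three more bytes follow
    raise ValueError("not a valid UTF-8 lead byte")

# Illustrative samples: letter, accented letter, Euro sign, emoji.
for ch in ["A", "\u00e9", "\u20ac", "\U0001F600"]:
    encoded = ch.encode("utf-8")
    print(len(encoded), utf8_length_from_lead_byte(encoded[0]))
# 1 1
# 2 2
# 3 3
# 4 4

plain = "abc".encode("utf-8")
with_bom = "abc".encode("utf-8-sig")
print(plain.startswith(codecs.BOM_UTF8))     # False
print(with_bom.startswith(codecs.BOM_UTF8))  # True
```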

The official recommendation is that the UTF encodings replace all the
8859 encodings over time, and most modern tools can be set to use
UTF-8.

Here endeth the sketchy lesson. There quite possibly are mistakes due to
granularity in the above, but unless glaring mistakes exist please don't
extend it unnecessarily.


-- 
Michael

All shall be well, and all shall be well, and all manner of things shall
be well

 - Julian of Norwich 1342 - 1416
