On Fri, 23 May 2008 11:35:08 +0300 Dotan Cohen wrote:
> 2008/5/22 Michelle Konzack <[EMAIL PROTECTED]>:
>
> > My messages are ALWAYS 7-bit if I write in english. Otherwise they
> > are using iso-8859-1, iso-8859-15 (german with Euro) or maybe
> > iso-10646-1(utf-8).
>
> Why ISO-8859-15 at all? Has it some advantage over UTF-8?

I followed a thread on the W3 validator list and learned about character encodings recently, so bear with me here. This may contain errors, but it is the way I remember it.

WARNING: This may be way more than you wanted to know, but I have tried to encapsulate it as much as possible.

Most of these character encodings have their roots in ASCII (now formally US-ASCII), which is a 7-bit encoding where 0 - 31 are control codes and 32 - 126 are the space, 0-9, A-Z, a-z and the most common punctuation, with 127 being the delete character. The issue with this encoding is that even in English we use words borrowed from other languages, and often we have imported their accents as well (café is a very common example). US-ASCII does not cater to any of these accents.

Nearly every computer manufacturer decided to expand ASCII in their own way by making it an 8-bit code, which allowed a further 128 characters. A standard was set which said something like "Sure, do that, but leave 128 - 159 as control codes like 0 - 31 are".

NOTE: Microsoft defied this and used those positions for their "smart quotes" and other characters in the WINDOWS-1252 encoding, which has little standing as an international standard beyond its registration with IANA (the Internet Assigned Numbers Authority) for web use.

Internationally, ISO-8859-1 was championed through ISO certification by ECMA in 1987 and, after a minor revision in 1998, has been pretty much the international standard for 8-bit encoding of Western European (Latin-script) languages. German (for Michelle) and Dutch (for Cor) are covered by ISO-8859-1 itself; for Cyrillic and other languages and scripts, ISO-8859-2 to ISO-8859-14 were brought into existence.

NOTE: Interestingly, many of the pages on the web which claim to be ISO-8859-1 are not accurate, because they contain the WINDOWS-1252 smart quotes or other WINDOWS-1252 characters. Most browsers allow this and read ISO-8859-1 pages as WINDOWS-1252, because the ISO-8859-1 control codes in 128 - 159 are not legal in a web page anyway.

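A minimal Python sketch of the points above. It assumes that Python's standard codec names "ascii", "latin-1" and "cp1252" correspond to US-ASCII, ISO-8859-1 and WINDOWS-1252; the sample strings and variable names are made up for illustration.

```python
# US-ASCII has no accented letters, so "café" cannot be encoded in it.
try:
    "café".encode("ascii")
    ascii_ok = True
except UnicodeEncodeError:
    ascii_ok = False

# The 8-bit encodings give é a single byte (0xE9 in both tables).
latin1_bytes = "café".encode("latin-1")
cp1252_bytes = "café".encode("cp1252")

# Bytes 0x93 and 0x94 fall in the 128 - 159 range: control codes in
# ISO-8859-1, but curly "smart quotes" in WINDOWS-1252.
raw = b"\x93hello\x94"
as_latin1 = raw.decode("latin-1")   # control characters U+0093 and U+0094
as_cp1252 = raw.decode("cp1252")    # U+201C and U+201D (curly quotes)

print(ascii_ok)
print(latin1_bytes, cp1252_bytes)
print(repr(as_latin1), repr(as_cp1252))
```

The same two bytes come out as invisible control characters under one label and as quotation marks under the other, which is why a page labelled ISO-8859-1 but written with WINDOWS-1252 only displays as intended when the browser treats it as WINDOWS-1252.
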
With the advent of the Euro, ISO-8859-15 introduced the Euro sign along with a few letters that ISO-8859-1 lacked (Š, š, Ž, ž, Œ, œ and Ÿ), putting them in place of eight rarely used symbols. It does not incorporate Microsoft's "smart quotes"; 128 - 159 remain control codes, so it is a revision of ISO-8859-1 rather than a replacement for WINDOWS-1252. ISO-8859-1 is still legal. I am not even sure if ISO-8859-1 is officially deprecated (to be phased out over time).

UTF-8, UTF-16 and UTF-32 are a whole other story. This bit I am fairly hazy about: they are all encodings of Unicode, an international standard with room for a little over a million characters. UTF-16 uses 2 bytes (256 * 256 = 65,536 values) for the common characters and 4 bytes for the rest, and UTF-32 always uses 4 bytes per character. But many languages when written leave a lot of those bytes empty, because they use only the first 256 or so characters for most letters of their alphabet (like German and Dutch).

UTF-8 gets around this by using 1 byte for each US-ASCII character and 2 to 4 bytes for everything else; the first byte of a longer sequence says how many continuation bytes (1 to 3) follow. For text that is mostly Latin letters this takes up a lot less memory, disk space and bandwidth than UTF-16 and UTF-32, while still allowing the rarer characters.

A UTF-8 document may start with a special character called a Byte Order Mark (BOM), which is optional in UTF-8 and matters mainly for UTF-16 and UTF-32. I will do no more than mention it, as it would take this too far OT (plus I don't understand it completely).

The official recommendation is that the UTF encodings replace all the 8859 encodings over time, and most modern tools can be set to UTF-8.

Here endeth the sketchy lesson. There quite possibly are mistakes due to granularity in the above, but unless glaring mistakes exist please don't extend it unnecessarily.

-- 
Michael

All shall be well, and all shall be well, and all manner of things shall be well

 - Julian of Norwich 1342 - 1416
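
As a footnote to the ISO-8859-15 and UTF paragraphs above, here is a rough Python sketch that counts the bytes each encoding needs. The helper name encoded_sizes and the sample characters are hypothetical, chosen only for illustration; the codec names are the ones Python's standard library is assumed to provide.

```python
def encoded_sizes(ch):
    """Return the byte length of ch in (UTF-8, UTF-16, UTF-32)."""
    # The "-le" variants are used so that no BOM is prepended to the count.
    return (
        len(ch.encode("utf-8")),
        len(ch.encode("utf-16-le")),
        len(ch.encode("utf-32-le")),
    )

# One sample character per UTF-8 length: A, é, the Euro sign, and a
# musical clef (U+1D11E) that lies beyond the first 65,536 code points.
for ch in ["A", "é", "€", "\U0001D11E"]:
    print(f"U+{ord(ch):04X}", encoded_sizes(ch))

# The Euro sign has a one-byte slot in ISO-8859-15 but none in ISO-8859-1.
euro_8859_15 = "€".encode("iso-8859-15")
try:
    "€".encode("latin-1")
    euro_in_latin1 = True
except UnicodeEncodeError:
    euro_in_latin1 = False

# The BOM is optional in UTF-8; Python's "utf-8-sig" codec writes one.
with_bom = "A".encode("utf-8-sig")
print(euro_8859_15, euro_in_latin1, with_bom)
```

For plain English text every character behaves like "A" here, so UTF-8 costs the same as US-ASCII, while UTF-16 doubles and UTF-32 quadruples the size.
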
