[users] Re: Character Encodings

Jim Allan Fri, 23 May 2008 07:40:31 -0700

Michael Adams wrote:

NOTE: Microsoft defied this and used those character spaces for their
"smart quotes" and other characters in the WINDOWS-1252 encoding which
does not have a lot of approval as an international standard except
by IANA (the Internet Assigned Numbers Authority) for web use.


Quite seriously, what else could they do at the time?

They were attempting to compete against Apple, who had their own proprietary character sets which included the curly quotation marks, dashes, and various other non-ISO characters.


And hardly anyone has ever used the official 8-bit control characters.

I suppose they could have just followed the DOS route of letting every word processor and every desktop publishing program have its own way of producing characters which are essential to typographically correct publishing, continuing the mess established under DOS.

Defying the standards on this point was one of the best things they did, in my opinion.

NOTE: Interestingly most of the pages on the web which claim to be
ISO-8859-1 are not accurate because they contain the WINDOWS-1252 smart
quotes or other WINDOWS-1252 characters. Most browsers allow this and
read ISO-8859-1 pages as WINDOWS-1252 anyway because the ISO-8859-1
control codes are illegal for use in a web page anyway.

You can, of course, declare that your webpage is coded as Windows-1252 or another Windows encoding. That’s really what should be done. I’ve never read a discussion that indicates why it wasn’t done more, save the explanation of ignorance on the part of the page creators.
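
To make the difference concrete, here is a small Python sketch (my own illustration, not anything taken from a browser's source) that decodes the same four bytes both ways. The byte values 0x93 and 0x94 are the Windows-1252 curly double quotes; under ISO-8859-1 the same bytes are C1 control characters.

```python
# The same bytes, interpreted under the two encodings discussed above.
raw = bytes([0x93, 0x48, 0x69, 0x94])  # "Hi" wrapped in bytes 0x93 and 0x94

as_latin1 = raw.decode("iso-8859-1")  # 0x93/0x94 become C1 control characters
as_cp1252 = raw.decode("cp1252")      # 0x93/0x94 become curly double quotes

# Show the code points rather than the glyphs, since control
# characters are invisible when printed.
print([hex(ord(ch)) for ch in as_latin1])
print([hex(ord(ch)) for ch in as_cp1252])
```

A page that contains such bytes but is labelled ISO-8859-1 only displays as intended when the reader decodes it as Windows-1252, which is what the quoted note says most browsers do.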

With the advent of the Euro, in ISO-8859-15 the Euro sign was introduced
as well as incorporating Microsoft's "smart quotes" into the code (though
some moved into legitimate character locations). In theory this code
supersedes and combines both WINDOWS-1252 and ISO-8859-1, but ISO-8859-1
is still legal. I am not even sure if ISO-8859-1 is officially
deprecated (to be phased out over time).


I think it was barely used by anyone.

UTF-8 gets around the above issue by using 1 byte for most letters, and
a special control character byte which says the next 1 to 3 bytes are
an extended character for rarer characters. This takes up a lot less
memory, disk space and bandwidth than UTF-16 and UTF-32 in normal use.

True for Latin-alphabet coding. Not true for normal use if you are writing Chinese, or even Greek or Cyrillic.
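
A quick way to check this for yourself is to count bytes. The following Python sketch is only an illustration: the sample words and the helper name encoded_sizes are my own choices, and the "-le" codec variants are used so that no BOM is added to the counts.

```python
# Byte counts for the same text in the three Unicode encoding forms.
samples = {
    "Latin": "encoding",
    "Greek": "αλφα",
    "Cyrillic": "кодировка",
    "Chinese": "字符编码",
}

def encoded_sizes(text):
    """Return the encoded length in bytes for each encoding form."""
    encodings = ("utf-8", "utf-16-le", "utf-32-le")
    return {enc: len(text.encode(enc)) for enc in encodings}

for script, word in samples.items():
    print(script, len(word), "characters:", encoded_sizes(word))
```

For these samples the Latin word takes one byte per character in UTF-8, the Greek and Cyrillic words take two (the same as UTF-16), and the Chinese word takes three (more than UTF-16). Real documents that mix in ASCII spaces, digits and markup will shift the balance somewhat.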

A UTF-8
document starts with a special character called a Byte Order Mark (BOM)
which I will do no more than mention as it would take this too far OT
(plus I don't understand it completely).

BOM should not be used in UTF-8. It is required for some UTF-16 and UTF-32 formats.
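
For anyone curious what the BOM looks like in practice, here is a short Python sketch using only the standard codecs module. The sample string is arbitrary, and this shows one language's behaviour rather than what any particular editor or browser does.

```python
import codecs

text = "caf\u00e9"

plain = text.encode("utf-8")         # no BOM
with_sig = text.encode("utf-8-sig")  # same bytes with the 3-byte BOM in front

# The UTF-8 BOM is the fixed byte sequence EF BB BF.
print(codecs.BOM_UTF8, with_sig == codecs.BOM_UTF8 + plain)

# A plain UTF-8 decoder keeps the BOM as a U+FEFF character at the start;
# the "utf-8-sig" decoder strips it.
kept = with_sig.decode("utf-8")
stripped = with_sig.decode("utf-8-sig")
print(repr(kept), repr(stripped))

# The generic "utf-16" codec writes a BOM so that a reader can tell
# which byte order was used.
utf16 = text.encode("utf-16")
print(utf16[:2] in (codecs.BOM_UTF16_LE, codecs.BOM_UTF16_BE))
```

In UTF-8 there is only one possible byte order, so the mark carries no ordering information there; in UTF-16 and UTF-32 it is what distinguishes the big-endian and little-endian layouts when the label does not say.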


Jim Allan

