> > The problem with slavishly following the charset parameter is that it 
> > is often incorrect.

> I wonder how you could draw such a conclusion. In order to make such
> a statement, there must be some other (god-given?) parameter, which is the 
> "real charset".

> Each and every program (webbrowser, newsreader, e-mailer ...)

Actually, historically, that's not quite right.  NOW they do (if they're 
behaving), but in the past they often just used whatever the system code page 
was.  Even worse, people would write in one local code page, stick it on an 
en-US server, and then "test" it on the same source machine (same locale), so 
it "worked", but only for them.  Once it got read by a different machine, 
it didn't work.

Even worse, either the editing software or the server might mistag the code 
page because it was trying to fill in missing information.  And there was a 
common abuse of ISO code page labels for what was really Windows code 
page-encoded data.

So now, in theory, and in well-behaved environments, the taggings are much 
more accurate; however, it can be difficult to distinguish correctly tagged 
data from mis-tagged data.  Using UTF-8 helps a ton, because it's pretty 
obvious when data is UTF-8.
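To illustrate the "it's pretty obvious that it's UTF-8" point: UTF-8 has strict rules about which byte sequences are valid, so text encoded in a legacy code page almost never happens to be valid UTF-8 (unless it's pure ASCII). A minimal sketch in Python, with a hypothetical helper name:

```python
def looks_like_utf8(data: bytes) -> bool:
    """Return True if the bytes strictly decode as UTF-8.

    Legacy single-byte encodings (windows-1252, ISO 8859-x, ...) use
    lone bytes >= 0x80, which violate UTF-8's lead/continuation-byte
    structure, so valid UTF-8 is a strong signal.
    """
    try:
        data.decode("utf-8", errors="strict")
        return True
    except UnicodeDecodeError:
        return False

# "héllo" encoded as UTF-8 uses the two-byte sequence 0xC3 0xA9 for é.
print(looks_like_utf8("héllo".encode("utf-8")))   # True
# The same text in windows-1252 uses the lone byte 0xE9, invalid in UTF-8.
print(looks_like_utf8("héllo".encode("cp1252")))  # False
# Pure ASCII is also valid UTF-8, so it stays ambiguous (but safe to treat as UTF-8).
print(looks_like_utf8(b"plain ascii"))            # True
```

The converse check doesn't work: most byte sequences decode "successfully" in a single-byte code page like windows-1252, just to the wrong characters, which is exactly the mis-tagging problem described above.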

Anyway, I have no clue what Google's doing; however, mis-tagging of data is a 
common problem in the industry, and a great reason to use Unicode.  Some 
countries have an even bigger problem due to variations in implementations of 
their commonly used code pages, and extensions which may, or may not, always be 
supported.  It's also part of why you occasionally see things like badly 
marked-up rich quotes on major news sites, even now.

-Shawn
