Steven Atreju, Mon, 16 Jul 2012 13:35:04 +0200:
> "Doug Ewell" <[email protected]> wrote:
> And:
>
> > Q: Is the UTF-8 encoding scheme the same irrespective of whether
> > the underlying processor is little endian or big endian?
> > ...
> > Where a BOM is used with UTF-8, it is only used as an ecoding
> > signature to distinguish UTF-8 from other encodings — it has
> > nothing to do with byte order.
>
> Fifteen years ago i think i would have put effort in including the
> BOM after reading this, for complete correctness! I'm pretty sure
> that i really would have done so.

I believe that most people who are conscientious about inserting the
BOM do so because, without it, Web browsers (with Chrome as the
exception, at least whenever the page contains non-ASCII characters)
are unlikely to sniff a UTF-8 encoded page as being UTF-8 encoded. So
it has nothing to do with "complete correctness", but everything to do
with complete safety.

> So, given that this page ranks 3 when searching for «utf-8 bom»
> from within Germany i would 1), fix the «ecoding» typo and 2)
> would change this to be less «neutral». The answer to «Q.» is
> simply «Yes. Software should be capable to strip an encoded BOM
> in UTF, because some softish Unicode processors fail to do so when
> converting in between different multioctet UTF schemes. Using BOM
> with UTF-8 is not recommended.»

The current text is much to be preferred. Also, you put the cart
before the horse: you place tools over users.

There is one reason to use the UTF-8 BOM which that FAQ point doesn't
mention, however: Chrome/Safari/WebKit plus IE treat a UTF-8 encoded
text/html page with a BOM differently from one without a BOM, even
when the page is otherwise properly labelled as UTF-8. For the former,
the user is not able to override the encoding manually, whereas for a
page without the BOM, the user can override the encoding and shoot
themselves (and others) in the foot.

> And UTF-8 got an additional «wohooo - i'm Unicode text» signature
> tag, though optional.
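For illustration, the "encoding signature" role of the BOM comes down
to one fixed byte prefix: U+FEFF encoded as UTF-8 is the three bytes
EF BB BF. A minimal sketch (the function name and sample input are
hypothetical, not from the FAQ) of detecting and stripping it:

```python
# The UTF-8 encoded form of U+FEFF (the BOM) is always these three bytes,
# regardless of processor endianness.
UTF8_BOM = b"\xef\xbb\xbf"

def strip_utf8_bom(data: bytes) -> tuple[bool, bytes]:
    """Return (had_bom, data_with_any_leading_BOM_removed)."""
    if data.startswith(UTF8_BOM):
        return True, data[len(UTF8_BOM):]
    return False, data

# Hypothetical page bytes, with and without a leading BOM:
strip_utf8_bom(b"\xef\xbb\xbfHello")  # (True, b"Hello")
strip_utf8_bom(b"Hello")              # (False, b"Hello")
```

This is the kind of stripping the quoted suggestion says software
should be capable of when converting between UTF schemes.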
I like the term «extremely rare» sooo much!!

> :-) What's the problem?
> If you know how to deal with UTF-8, you can deal with UTF-8.
> If you don't, no signature ever will help you, no?!

Do you mean that, instead of the wohoo, one should do more thorough
sniffing? I have no insight into how reliable such non-BOM sniffing
is, but I take it that it is much less reliable than BOM sniffing.
Hence it would be risky (?) to prevent users from overriding the
encoding of a non-BOM-sniffed page. Which, bottom line, means that the
BOM has an advantage.

> If you don't know the charset of some text that comes from
> nowhere, i.e., no container format with meta-information, no
> filetype extension with implicit meta-information, as is used on
> Mac OS and DOS, then UTF-8 is still very easily identifiable by
> itself due to the way the algorithm is designed.

Is it? As I just said in a reply to Doug: of the Web browsers in
current use, Chrome is the very best. This is, I think, because it,
to a higher degree than the competition, assumes UTF-8 whenever it
finds non-ASCII characters.

Clearly, sniffing could improve, at least in the browser world. But is
that also true for command-line tools?

--
Leif H Silli
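The "identifiable by itself" claim quoted above can be sketched as
follows. This is a simplified illustration, not any browser's actual
sniffing algorithm: bytes that decode strictly as UTF-8 and contain at
least one non-ASCII character are unlikely to be in a legacy
single-byte encoding, because UTF-8's lead-byte/continuation-byte
patterns rarely occur by accident.

```python
def looks_like_utf8(data: bytes) -> bool:
    """Heuristic: valid UTF-8 containing at least one non-ASCII character."""
    try:
        text = data.decode("utf-8", errors="strict")
    except UnicodeDecodeError:
        return False
    # Pure ASCII is valid UTF-8 but gives no evidence either way.
    return any(ord(ch) > 0x7F for ch in text)

# The same non-ASCII text in two encodings (sample string is arbitrary):
looks_like_utf8("blåbær".encode("utf-8"))    # True
looks_like_utf8("blåbær".encode("latin-1"))  # False: 0xE5 lacks continuation bytes
```

Note the gap this sketch exposes: it says nothing about pure-ASCII
input, and it cannot distinguish which legacy encoding a non-UTF-8
page is in, which is where user override still matters.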

