On 2012/07/14 1:33, Philippe Verdy wrote:
From: Jukka K. Korpela <[email protected]>
"When the BOM is used in web pages or editors for UTF-8 encoded content it
can sometimes introduce blank spaces or short sequences of strange-looking
characters (such as ï»¿). For this reason, it is usually best for
interoperability to omit the BOM, when given a choice, for UTF-8 content."

    http://www.w3.org/International/questions/qa-byte-order-mark

This statement about maximum interoperability may have been true in the
past, when Unicode support was not so universal and had not yet been
formally adopted for all newer developments in RFCs published by the
IETF. But now the situation is reversed: maximum interoperability is
offered when BOMs are present, not really to indicate the byte order
itself, but to confirm that the content is Unicode encoded and extremely
likely to be text content and not arbitrary binary content (which
today almost always uses a distinctive leading signature).

As you mention the IETF, what people in the IETF like most about UTF-8 is that it's upward-compatible with ASCII. Because the protocol/syntax-relevant part is usually ASCII only, that means that a lot of stuff can work just by making things 8-bit clean (which in this day and age may mean essentially no work in some cases).

A BOM anywhere in a protocol therefore just removes the biggest advantage of UTF-8. While it's usually okay to use a BOM at the start of a whole file (or the file equivalent in transmission, which is a MIME entity), anywhere else (e.g. in small protocol fields), a BOM is a big no-no.
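
As a rough illustration of the two points above, here is a minimal Python sketch. It is not taken from any particular protocol: the "Name: value" field syntax and the helper names `parse_header` and `strip_leading_bom` are assumptions made up for this example. The parser treats the syntax-relevant part as plain ASCII and only needs to be 8-bit clean for the value, and a BOM in front of the field breaks exactly that.

```python
import codecs

def parse_header(line: bytes):
    """Split a hypothetical 'Name: value' field byte-wise.

    The field name must be pure ASCII; the value is passed through
    8-bit clean and decoded as UTF-8 afterwards.
    """
    name, sep, value = line.partition(b":")
    if not sep:
        raise ValueError("not a header field")
    return name.decode("ascii"), value.strip().decode("utf-8")

def strip_leading_bom(data: bytes) -> bytes:
    """Drop a UTF-8 BOM only at the very start of a whole file or entity."""
    if data.startswith(codecs.BOM_UTF8):
        return data[len(codecs.BOM_UTF8):]
    return data

plain = "Subject: Grüße".encode("utf-8")
with_bom = codecs.BOM_UTF8 + plain

# Without a BOM, ASCII-only parsing of the field name just works.
name, value = parse_header(plain)

# With a BOM in front, the field name starts with bytes EF BB BF,
# which are not ASCII, so the same parser rejects the field.
try:
    parse_header(with_bom)
    bom_accepted = True
except UnicodeDecodeError:
    bom_accepted = False
```

A consumer of whole files can call something like `strip_leading_bom` once at the start of the data, but every small protocol field would need the same special case, which is the extra work that ASCII compatibility otherwise avoids.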

Regards,   Martin.
