On 7/13/2012 2:42 PM, David Starner wrote:
On Fri, Jul 13, 2012 at 1:29 PM, Jukka K. Korpela <[email protected]> wrote:
2012-07-13 22:37, David Starner wrote:
Wikipedia says "The Unicode standard recommends against the BOM for
UTF-8." and refers to page 30 of the Unicode Standard, version 6.0,
that says "Use of a BOM is neither required nor recommended for
UTF-8..." Calling it a myth seems bizarre.
“Not recommended” is distinct from “recommends against”.
I disagree; the meaning of the two phrases overlaps in my idiolect, and
while it would be somewhat laconic, I might use "not recommended" to
mean "if you insist on doing that, please give us a chance to get the
fire extinguisher first",
I can state confidently and unequivocally that it is not used in that
sense in the standard, and by reading the whole phrase it's clear that
it is intended as a statement of neutrality on the part of the Unicode
Standard - respectfully being aware of the difference between a
character encoding and a data transmission (or file format) protocol.
A more appropriate formulation would be “Use of a BOM is not required for
UTF-8, but it may be used as a signature that indicates, with practical
certainty, that data is UTF-8 encoded.”
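
To illustrate the "signature" reading in the formulation quoted above, here is a minimal Python sketch. The function names are hypothetical; the only facts it relies on are that the UTF-8 form of U+FEFF is the byte sequence EF BB BF and that Python's "utf-8-sig" codec drops one leading BOM when it is present.

```python
import codecs


def has_utf8_signature(data: bytes) -> bool:
    """Return True if data begins with the UTF-8 encoding of U+FEFF (EF BB BF)."""
    return data.startswith(codecs.BOM_UTF8)


def decode_utf8(data: bytes) -> str:
    """Decode UTF-8 bytes, dropping a single leading BOM if there is one."""
    # "utf-8-sig" behaves like "utf-8" when no BOM is present.
    return data.decode("utf-8-sig")
```

A consumer that must decide between "some legacy encoding" and UTF-8 could use the first function as a hint, while the second accepts input with or without the signature.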
In the environment that UTF-8 was developed for, a BOM is a nuisance;
a BOM will stop the shell from properly interpreting a hashbang, and
other existing programs will lose the BOM, duplicate the BOM, and
scatter BOMs throughout files. Given the number of text-like file
formats (like old-school PNM) and number of scripts depending on
existing behavior, these aren't going to be changed.
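
The two failure modes described above can be sketched in a few lines of Python. This is an illustration only: it assumes the usual Unix behavior of recognizing an interpreter line by the raw first two bytes "#!", and it models byte-wise concatenation (as a tool like cat would do) with plain bytes addition. The helper name is hypothetical.

```python
import codecs


def starts_with_hashbang(data: bytes) -> bool:
    """Mimic the check on the raw first two bytes of a script (assumed behavior)."""
    return data[:2] == b"#!"


script = b"#!/bin/sh\necho hello\n"
script_with_bom = codecs.BOM_UTF8 + script  # same script saved with a UTF-8 BOM

# Byte-wise concatenation of two BOM-prefixed files leaves a BOM mid-stream.
part_a = codecs.BOM_UTF8 + b"a\n"
part_b = codecs.BOM_UTF8 + b"b\n"
joined = (part_a + part_b).decode("utf-8-sig")  # only the leading BOM is dropped
```

With the BOM in front, the first two bytes are EF BB rather than "#!", so the check fails; after concatenation, a stray U+FEFF remains before "b".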
I think it's the cost of doing business. Unix succeeded in getting
the web to use UTF-8 files, rather than UTF-16 etc., as the basis for
the exchange of markup language data. In environments that are
predicated on mandatory conversion TO Unicode, not knowing whether
something is "text" or "utf-8" text isn't as benign as it might be in
the former environment. Hence, the implementation of the UTF-8 BOM there.
As I said before, Unicode simplified but did not solve the fact that
text from other operating systems requires some modification before
working just right. But I don't think that Unicode should recommend
unconditionally the UTF-8 BOM, because it is problematic in the field
of use UTF-8 was created for and is still used for.
And, as you can see, Unicode, as a standard, is neutral on the issue.
For precisely that reason!
A.