Re: BOM ambiguity?

Stephan Stiller Fri, 13 Jul 2012 19:40:42 -0700

So there is a BOM-ambiguity when a file starts with
    FF FE
and then a couple of U+0000 characters, yes? Because this could be either UTF-16 or UTF-32 under little-endianness. Has this been pointed out and discussed beforehand?
No, there is not a "BOM-ambiguity". Rather, there is an English ambiguity
in your question concerning the meaning of "a file" and its contents.
If "a file" is a byte stream interpreted as an LE Unicode 16-bit string, then:
FF FE 00 00 82 04 01 00 ...  --> <U+FEFF, U+0000, U+0482, U+0001>
If "a file" is a byte stream interpreted as an LE Unicode 32-bit string, then:
FF FE 00 00 82 04 01 00 ...  --> <U+FEFF, U+10482>
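
The two readings above can be checked with a short Python sketch. It assumes only the eight bytes shown in the example; the helper name code_points is made up for illustration. The explicit "-le" codecs are used so that the BOM is kept as U+FEFF instead of being stripped.

```python
# The eight example bytes (the trailing "..." in the example is not included).
data = bytes.fromhex("FF FE 00 00 82 04 01 00")

# Decode the same bytes under both little-endian interpretations.
as_utf16 = data.decode("utf-16-le")
as_utf32 = data.decode("utf-32-le")

def code_points(s):
    """Hypothetical helper: list the code points of s in U+XXXX notation."""
    return ["U+%04X" % ord(c) for c in s]

print(code_points(as_utf16))  # ['U+FEFF', 'U+0000', 'U+0482', 'U+0001']
print(code_points(as_utf32))  # ['U+FEFF', 'U+10482']
```

Both decodes succeed, so the bytes alone do not say which interpretation was intended.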

[...]

I appreciate the input, but I think it's not that simple. There are a number of contexts where I know that a file is for sure a text file and I also either know that it's Unicode or I'm assuming that it is because it starts with one of the common bit-incarnations of the BOM.

With that in mind, there is value in documenting, however briefly, that reading FF FE 00 00 is by itself technically ambiguous, because a lot of software developers might not want to think so much about such things and would rather be told. I wish I could comment more here about how it's done in reality, but I can't, because I don't even know what the file format heuristics of various editors and Unix tools look like; they're usually not documented.
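
To make the ambiguity concrete, here is a minimal Python sketch that reports every encoding whose BOM is a prefix of the data. The function name sniff_bom and the table _BOMS are invented for this illustration and are not taken from any editor or Unix tool; only the BOM constants come from Python's standard codecs module.

```python
import codecs

# Known BOM byte sequences, longest first (ordering is a choice made here).
_BOMS = [
    (codecs.BOM_UTF32_LE, "utf-32-le"),  # FF FE 00 00
    (codecs.BOM_UTF32_BE, "utf-32-be"),  # 00 00 FE FF
    (codecs.BOM_UTF8, "utf-8"),          # EF BB BF
    (codecs.BOM_UTF16_LE, "utf-16-le"),  # FF FE
    (codecs.BOM_UTF16_BE, "utf-16-be"),  # FE FF
]

def sniff_bom(data):
    """Hypothetical helper: return all encodings whose BOM prefixes data."""
    return [name for bom, name in _BOMS if data.startswith(bom)]

print(sniff_bom(bytes.fromhex("FF FE 00 00 82 04 01 00")))
print(sniff_bom(bytes.fromhex("FF FE 41 00")))
```

For the first input the sketch returns both utf-32-le and utf-16-le; for the second it returns only utf-16-le. A tool that picks the first match would silently choose UTF-32 here, which is exactly the kind of behaviour that would be worth documenting.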


Stephan
