So there is a BOM-ambiguity when a file starts with
FF FE
and then a couple of U+0000 characters, yes? Because this could be
either little-endian UTF-16 or little-endian UTF-32. Has this been
pointed out and discussed before?
No, there is not a "BOM-ambiguity". Rather, there is an English ambiguity
in your question concerning the meaning of "a file" and its contents.
If "a file" is a byte stream interpreted as an LE Unicode 16-bit
string, then:
FF FE 00 00 82 04 01 00 ... --> <U+FEFF, U+0000, U+0482, U+0001>
If "a file" is a byte stream interpreted as an LE Unicode 32-bit
string, then:
FF FE 00 00 82 04 01 00 ... --> <U+FEFF, U+10482>
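
The two readings above can be reproduced with a few lines of Python. This
is only an illustrative sketch: the helper name code_points is made up
for this example, and the explicit "-le" codec names are used so that
Python does not strip the BOM while decoding.

```python
# The eight example bytes from above.
data = bytes.fromhex("FF FE 00 00 82 04 01 00")

def code_points(raw, encoding):
    """Decode raw bytes and list the resulting code points as U+XXXX."""
    return [f"U+{ord(ch):04X}" for ch in raw.decode(encoding)]

# The explicit little-endian codecs keep U+FEFF in the decoded text.
as_utf16 = code_points(data, "utf-16-le")
as_utf32 = code_points(data, "utf-32-le")

print(as_utf16)  # ['U+FEFF', 'U+0000', 'U+0482', 'U+0001']
print(as_utf32)  # ['U+FEFF', 'U+10482']
```

Both decodes succeed without error, so the bytes alone do not say which
interpretation was intended.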
[...]
I appreciate the input, but I think it's not that simple. There are a
number of contexts where I know for sure that a file is a text file, and
I either know that it's Unicode or assume that it is because it starts
with one of the common byte serializations of the BOM.
With that in mind, there is value in documenting, however briefly, that
reading FF FE 00 00 is by itself technically ambiguous, because a lot of
software developers would rather be told than have to think so much
about such things. I wish I could say more about how it's done in
practice, but I can't: I don't know what the file format heuristics of
various editors and Unix tools look like, because they're usually not
documented.
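
To make the point concrete, here is a minimal sketch of what a BOM
sniffer could look like if it reported the ambiguity instead of silently
picking one reading. It is not taken from any real editor or Unix tool;
the function name sniff_bom and the order of the candidates are
assumptions made purely for illustration. Only the BOM constants from
Python's standard codecs module are used.

```python
import codecs

def sniff_bom(head):
    """Return the candidate encodings suggested by the leading bytes.

    Hypothetical helper: an empty list means no BOM was recognized.
    """
    # The 4-byte patterns must be tested before the 2-byte ones, since
    # FF FE 00 00 also begins with the UTF-16LE BOM FF FE.
    if head.startswith(codecs.BOM_UTF32_LE):   # FF FE 00 00
        # Could equally be a UTF-16LE BOM followed by U+0000.
        return ["utf-32-le", "utf-16-le"]
    if head.startswith(codecs.BOM_UTF32_BE):   # 00 00 FE FF
        return ["utf-32-be"]
    if head.startswith(codecs.BOM_UTF8):       # EF BB BF
        return ["utf-8"]
    if head.startswith(codecs.BOM_UTF16_LE):   # FF FE
        return ["utf-16-le"]
    if head.startswith(codecs.BOM_UTF16_BE):   # FE FF
        return ["utf-16-be"]
    return []

print(sniff_bom(bytes.fromhex("FF FE 00 00 82 04 01 00")))
# ['utf-32-le', 'utf-16-le']
```

A caller that gets two candidates back would still need some other
evidence (file length, context, a user setting) to choose between them.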
Stephan