So there is a BOM-ambiguity when a file starts with
    FF FE
and then a couple of U+0000 characters, yes? Because this could be either UTF-16 or UTF-32 under little-endianness. Has this been pointed out and discussed beforehand?

No, there is not a "BOM-ambiguity". Rather, there is an English ambiguity
in your question concerning the meaning of "a file" and its contents.

If "a file" is a byte stream interpreted as an LE Unicode 16-bit string, then:
FF FE 00 00 82 04 01 00 ...  --> <U+FEFF, U+0000, U+0482, U+0001>

If "a file" is a byte stream interpreted as an LE Unicode 32-bit string, then:
FF FE 00 00 82 04 01 00 ...  --> <U+FEFF, U+10482>
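
For readers who want to check the two readings above, here is a minimal Python sketch. It decodes the same eight bytes with the standard `utf-16-le` and `utf-32-le` codecs, which keep the BOM as U+FEFF rather than stripping it. The helper name `code_points` is made up for this illustration.

```python
# The eight example bytes from the two interpretations above.
data = bytes.fromhex("FF FE 00 00 82 04 01 00")

def code_points(text):
    """Format each character of text as a U+XXXX label (illustrative helper)."""
    return [f"U+{ord(ch):04X}" for ch in text]

# Explicit little-endian codecs do not consume the BOM, so U+FEFF stays visible.
as_utf16 = code_points(data.decode("utf-16-le"))
as_utf32 = code_points(data.decode("utf-32-le"))

print(as_utf16)  # ['U+FEFF', 'U+0000', 'U+0482', 'U+0001']
print(as_utf32)  # ['U+FEFF', 'U+10482']
```

Both decodes succeed without error, so the bytes alone do not say which reading was intended.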

[...]

I appreciate the input, but I don't think it's that simple. There are a number of contexts where I know for sure that a file is a text file, and I either know that it's Unicode or assume that it is because it starts with one of the common byte sequences of the BOM.

With that in mind, there is value in documenting, however briefly, that reading FF FE 00 00 is by itself technically ambiguous, because a lot of software developers would rather be told than have to think much about such things. I wish I could say more about how it's done in practice, but I can't, because I don't know what the file-format heuristics of various editors and Unix tools look like; they're usually not documented.
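
To make the ambiguity concrete, here is a small Python sketch of a naive BOM sniffer. It is only an illustration under my own assumptions, not how any particular editor or Unix tool works. The BOM constants come from the standard `codecs` module; the table `BOMS` and the function `bom_candidates` are hypothetical names.

```python
import codecs

# (BOM byte sequence, encoding label) pairs, using the constants from codecs.
BOMS = [
    (codecs.BOM_UTF8, "utf-8"),
    (codecs.BOM_UTF16_LE, "utf-16-le"),
    (codecs.BOM_UTF16_BE, "utf-16-be"),
    (codecs.BOM_UTF32_LE, "utf-32-le"),
    (codecs.BOM_UTF32_BE, "utf-32-be"),
]

def bom_candidates(prefix):
    """Return every encoding whose BOM matches the start of prefix."""
    return [name for bom, name in BOMS if prefix.startswith(bom)]

ambiguous = bom_candidates(bytes.fromhex("FF FE 00 00"))
unambiguous = bom_candidates(bytes.fromhex("FF FE 41 00"))

print(ambiguous)    # ['utf-16-le', 'utf-32-le']
print(unambiguous)  # ['utf-16-le']
```

A sniffer that returns only the first or the longest match hides this choice; returning all candidates at least makes it visible that FF FE 00 00 fits two encodings.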

Stephan

