From: "Doug Ewell" <[EMAIL PROTECTED]> > In UTF-16 practically any sequence of bytes is valid, and since you > can't assume you know the language, you can't employ distribution > statistics. Twelve years ago, when most text was not Unicode and all > Unicode text was UTF-16, Microsoft documentation suggested a heuristic > of checking every other byte to see if it was zero, which of course > would only work for Latin-1 text encoded in UTF-16. If you need to > detect the encoding of non-Western-European text, you would have to be > more sophisticated than this.
Here I completely disagree: even though almost any 16-bit value is valid in UTF-16, the values are NOT uniformly distributed. You will see immediately that even and odd bytes have very distinct distributions: the bytes carrying the least significant bits of the code units are spread fairly flatly over a wide range, while the other bytes cluster in very few values (rarely more than 2 or 3 distinct values for European languages, or a roughly flat distribution over a few limited ranges for Korean or Chinese).

Even today, when Unicode has more than one plane, UTF-16 is still easy to recognize, because surrogate pairs leave a distinctive byte pattern: any byte in the range 0xD8..0xDB is followed two bytes later by a byte in the range 0xDC..0xDF. The parity (even or odd offset) of those two bytes tells you whether the text is UTF-16BE or UTF-16LE, and you can then look at the ranges of the decoded code units and reject the UTF-16 hypothesis if they include unassigned or illegal code points. A rough sketch of this check in C follows.
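To make this concrete, here is a minimal sketch of the surrogate heuristic in C. The function names (guess_utf16_order, looks_like_valid_utf16) are mine, not from any library, and the parity test assumes the buffer starts on a code-unit boundary; a real detector would combine this with the even/odd byte statistics described above.

    #include <stdio.h>
    #include <stddef.h>
    #include <stdint.h>

    /* 0xD8..0xDB: high byte of a high surrogate (U+D800..U+DBFF). */
    static int is_hi_surrogate_byte(uint8_t b) { return b >= 0xD8 && b <= 0xDB; }
    /* 0xDC..0xDF: high byte of a low surrogate (U+DC00..U+DFFF). */
    static int is_lo_surrogate_byte(uint8_t b) { return b >= 0xDC && b <= 0xDF; }

    /* Scan raw bytes for the surrogate-pair pattern: a 0xD8..0xDB byte
     * followed two bytes later by a 0xDC..0xDF byte.  The parity of the
     * offset where the pattern occurs hints at the byte order:
     * even offset -> UTF-16BE, odd offset -> UTF-16LE.
     * Returns 'B', 'L', or 0 if no surrogate pair was seen. */
    static int guess_utf16_order(const uint8_t *buf, size_t len)
    {
        for (size_t i = 0; i + 2 < len; i++) {
            if (is_hi_surrogate_byte(buf[i]) && is_lo_surrogate_byte(buf[i + 2]))
                return (i % 2 == 0) ? 'B' : 'L';
        }
        return 0; /* no supplementary-plane character found */
    }

    /* Decode 16-bit units in the guessed order and check surrogate
     * pairing; a lone surrogate invalidates the UTF-16 hypothesis. */
    static int looks_like_valid_utf16(const uint8_t *buf, size_t len, int order)
    {
        if (len % 2) return 0;                      /* odd length: not UTF-16 */
        for (size_t i = 0; i + 1 < len; i += 2) {
            uint16_t u = (order == 'B') ? (uint16_t)(buf[i] << 8 | buf[i + 1])
                                        : (uint16_t)(buf[i + 1] << 8 | buf[i]);
            if (u >= 0xD800 && u <= 0xDBFF) {       /* high surrogate... */
                if (i + 3 >= len) return 0;         /* truncated pair */
                uint16_t v = (order == 'B') ? (uint16_t)(buf[i + 2] << 8 | buf[i + 3])
                                            : (uint16_t)(buf[i + 3] << 8 | buf[i + 2]);
                if (v < 0xDC00 || v > 0xDFFF) return 0; /* ...needs a low one */
                i += 2;                             /* skip the low surrogate */
            } else if (u >= 0xDC00 && u <= 0xDFFF) {
                return 0;                           /* lone low surrogate */
            }
        }
        return 1;
    }

    int main(void)
    {
        /* "A" followed by U+1D11E (musical G clef), whose UTF-16 code
         * units are D834 DD1E, in both byte orders. */
        const uint8_t be[] = { 0x00,0x41, 0xD8,0x34, 0xDD,0x1E };
        const uint8_t le[] = { 0x41,0x00, 0x34,0xD8, 0x1E,0xDD };

        printf("be sample: %c\n", guess_utf16_order(be, sizeof be)); /* B */
        printf("le sample: %c\n", guess_utf16_order(le, sizeof le)); /* L */
        printf("be valid:  %d\n", looks_like_valid_utf16(be, sizeof be, 'B')); /* 1 */
        return 0;
    }

Note that the parity trick only fires when the text actually contains a supplementary-plane character; for BMP-only text you fall back on the byte-distribution statistics, and the validation pass still lets you throw out candidates that decode to lone surrogates.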

