From: "Doug Ewell" <[EMAIL PROTECTED]> > In UTF-16 practically any sequence of bytes is valid, and since you > can't assume you know the language, you can't employ distribution > statistics. Twelve years ago, when most text was not Unicode and all > Unicode text was UTF-16, Microsoft documentation suggested a heuristic > of checking every other byte to see if it was zero, which of course > would only work for Latin-1 text encoded in UTF-16. If you need to > detect the encoding of non-Western-European text, you would have to be > more sophisticated than this.
Here I completely disagree: even though almost any 16-bit value is valid in UTF-16, the values are NOT uniformly distributed. You will see immediately that even and odd bytes have very distinct distributions: the bytes carrying the least significant bits of the code units are spread fairly flatly over a wide range, while the other bytes cluster in very few values (rarely more than 2 or 3 distinct values for European languages, or a roughly flat distribution over a few limited ranges for Korean or Chinese).

Even today, when Unicode has more than one plane, UTF-16 is still easy to recognize, because surrogate pairs leave a distinctive byte pattern: any byte in the range 0xD8..0xDB is followed two bytes later by a byte in the range 0xDC..0xDF. The parity (even or odd offset) of those two bytes tells you whether the text is UTF-16BE or UTF-16LE, and you can then look at the ranges of the decoded code units and reject the UTF-16 hypothesis if they include unassigned or illegal code points. A rough sketch of this check in C follows.
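To make this concrete, here is a minimal sketch of the surrogate heuristic in C. The function names (guess_utf16_order, looks_like_valid_utf16) are mine, not from any library, and the parity test assumes the buffer starts on a code-unit boundary; a real detector would combine this with the even/odd byte statistics described above.

    #include <stdio.h>
    #include <stddef.h>
    #include <stdint.h>

    /* 0xD8..0xDB: high byte of a high surrogate (U+D800..U+DBFF). */
    static int is_hi_surrogate_byte(uint8_t b) { return b >= 0xD8 && b <= 0xDB; }
    /* 0xDC..0xDF: high byte of a low surrogate (U+DC00..U+DFFF). */
    static int is_lo_surrogate_byte(uint8_t b) { return b >= 0xDC && b <= 0xDF; }

    /* Scan raw bytes for the surrogate-pair pattern: a 0xD8..0xDB byte
     * followed two bytes later by a 0xDC..0xDF byte.  The parity of the
     * offset where the pattern occurs hints at the byte order:
     * even offset -> UTF-16BE, odd offset -> UTF-16LE.
     * Returns 'B', 'L', or 0 if no surrogate pair was seen. */
    static int guess_utf16_order(const uint8_t *buf, size_t len)
    {
        for (size_t i = 0; i + 2 < len; i++) {
            if (is_hi_surrogate_byte(buf[i]) && is_lo_surrogate_byte(buf[i + 2]))
                return (i % 2 == 0) ? 'B' : 'L';
        }
        return 0; /* no supplementary-plane character found */
    }

    /* Decode 16-bit units in the guessed order and check surrogate
     * pairing; a lone surrogate invalidates the UTF-16 hypothesis. */
    static int looks_like_valid_utf16(const uint8_t *buf, size_t len, int order)
    {
        if (len % 2) return 0;                      /* odd length: not UTF-16 */
        for (size_t i = 0; i + 1 < len; i += 2) {
            uint16_t u = (order == 'B') ? (uint16_t)(buf[i] << 8 | buf[i + 1])
                                        : (uint16_t)(buf[i + 1] << 8 | buf[i]);
            if (u >= 0xD800 && u <= 0xDBFF) {       /* high surrogate... */
                if (i + 3 >= len) return 0;         /* truncated pair */
                uint16_t v = (order == 'B') ? (uint16_t)(buf[i + 2] << 8 | buf[i + 3])
                                            : (uint16_t)(buf[i + 3] << 8 | buf[i + 2]);
                if (v < 0xDC00 || v > 0xDFFF) return 0; /* ...needs a low one */
                i += 2;                             /* skip the low surrogate */
            } else if (u >= 0xDC00 && u <= 0xDFFF) {
                return 0;                           /* lone low surrogate */
            }
        }
        return 1;
    }

    int main(void)
    {
        /* "A" followed by U+1D11E (musical G clef), whose UTF-16 code
         * units are D834 DD1E, in both byte orders. */
        const uint8_t be[] = { 0x00,0x41, 0xD8,0x34, 0xDD,0x1E };
        const uint8_t le[] = { 0x41,0x00, 0x34,0xD8, 0x1E,0xDD };

        printf("be sample: %c\n", guess_utf16_order(be, sizeof be)); /* B */
        printf("le sample: %c\n", guess_utf16_order(le, sizeof le)); /* L */
        printf("be valid:  %d\n", looks_like_valid_utf16(be, sizeof be, 'B')); /* 1 */
        return 0;
    }

Note that the parity trick only fires when the text actually contains a supplementary-plane character; for BMP-only text you fall back on the byte-distribution statistics, and the validation pass still lets you throw out candidates that decode to lone surrogates.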

