On Monday, July 14, 2003 10:14 PM, [EMAIL PROTECTED] <[EMAIL PROTECTED]> wrote:
> Are there any libraries out there (open-source or otherwise) that can
> be used to detect the character encoding of a file or data stream?

Yes, but these libraries actually try to detect the encoded language: first they apply strict validity rules to discard impossible encodings, then statistical rules to match languages against their various encoded byte sequences, then dictionaries of common words. The result is probabilistic; what you get is an ordered list of language-encoding pairs. In many cases the final decision remains ambiguous, so it may be tuned by the reader.

Internet Explorer uses simple algorithms of this kind in its "auto-determined" mode, but it often fails, detecting a Chinese text encoded with EUC-CN or UTF-7 when in fact the text is plain English coded in ASCII. This failure points to Chinese simply because there is no actual dictionary available to match the common ideographs frequently used in Chinese text (notably its ideographic punctuation and square spaces). However, purely statistical rules often work when detecting only the encoding (though with no guarantee).

I don't use Mozilla, but it may have such a mode for detecting the actual encoding; if so, it should be in its sources (I did not check).

-- 
Philippe.
Spam not tolerated: any unsolicited message will be reported to your Internet service providers.
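The two-phase approach described above (strict validity filtering, then a statistical ranking) can be sketched roughly as follows. This is a minimal illustration, not how any real library is implemented: the candidate list and the "fraction of printable characters" score are stand-ins for the per-language byte-sequence statistics and word dictionaries that actual detectors use.

```python
# Hypothetical sketch of two-phase charset detection:
# phase 1 discards encodings that cannot decode the bytes at all,
# phase 2 ranks the survivors with a crude plausibility statistic.

CANDIDATES = ["ascii", "utf-8", "utf-16", "iso-8859-1"]

def detect(data: bytes) -> list[tuple[str, float]]:
    """Return (encoding, score) pairs, best guess first."""
    results = []
    for enc in CANDIDATES:
        # Phase 1: strict validity -- an encoding that raises an error
        # on these bytes cannot be the right one.
        try:
            text = data.decode(enc, errors="strict")
        except (UnicodeDecodeError, ValueError):
            continue
        # Phase 2: crude statistic -- prefer decodings that yield mostly
        # printable text. Real detectors score byte n-grams per language.
        printable = sum(ch.isprintable() or ch.isspace() for ch in text)
        results.append((enc, printable / max(len(text), 1)))
    # An ordered list of guesses; ties reflect the ambiguity mentioned
    # above, which a reader may need to resolve by hand.
    return sorted(results, key=lambda pair: pair[1], reverse=True)
```

Note that iso-8859-1 never fails phase 1 (every byte sequence is valid in it), so a real detector must rely entirely on the statistical phase to demote it; that is exactly the situation where detection stays probabilistic.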

