On Monday, July 14, 2003 10:14 PM, [EMAIL PROTECTED] <[EMAIL PROTECTED]> wrote:
> Are there any libraries out there (open-source or otherwise) that can
> be used to detect the character encoding of a file or data stream?

Yes, but these libraries actually try to detect the encoded language: first they apply strict validity rules to discard impossible encodings, then statistical rules to match languages against their various encoded byte sequences, then dictionaries of common words. The result is probabilistic; what you get is an ordered list of language-encoding pairs. In many cases the final decision remains ambiguous, so it may be tuned by the reader.

Internet Explorer uses simple algorithms of this kind in its "auto-determined" mode, but it often fails, detecting a Chinese text encoded with EUC-CN or UTF-7 when in fact the text is plain English coded in ASCII. This failure points to Chinese simply because there is no actual dictionary available to match the common ideographs frequently used in Chinese text (notably its ideographic punctuation and square spaces). However, purely statistical rules often work when detecting only the encoding (though with no guarantee).

I don't use Mozilla, but it may have such a mode for detecting the actual encoding; if so, it should be in its sources (I did not check).

-- 
Philippe.
Spam not tolerated: any unsolicited message will be reported to your Internet service providers.
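The two-phase approach described above (strict validity filtering, then a statistical ranking) can be sketched roughly as follows. This is a minimal illustration, not how any real library is implemented: the candidate list and the "fraction of printable characters" score are stand-ins for the per-language byte-sequence statistics and word dictionaries that actual detectors use.

```python
# Hypothetical sketch of two-phase charset detection:
# phase 1 discards encodings that cannot decode the bytes at all,
# phase 2 ranks the survivors with a crude plausibility statistic.

CANDIDATES = ["ascii", "utf-8", "utf-16", "iso-8859-1"]

def detect(data: bytes) -> list[tuple[str, float]]:
    """Return (encoding, score) pairs, best guess first."""
    results = []
    for enc in CANDIDATES:
        # Phase 1: strict validity -- an encoding that raises an error
        # on these bytes cannot be the right one.
        try:
            text = data.decode(enc, errors="strict")
        except (UnicodeDecodeError, ValueError):
            continue
        # Phase 2: crude statistic -- prefer decodings that yield mostly
        # printable text. Real detectors score byte n-grams per language.
        printable = sum(ch.isprintable() or ch.isspace() for ch in text)
        results.append((enc, printable / max(len(text), 1)))
    # An ordered list of guesses; ties reflect the ambiguity mentioned
    # above, which a reader may need to resolve by hand.
    return sorted(results, key=lambda pair: pair[1], reverse=True)
```

Note that iso-8859-1 never fails phase 1 (every byte sequence is valid in it), so a real detector must rely entirely on the statistical phase to demote it; that is exactly the situation where detection stays probabilistic.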

