Asmus distinguishes between two kinds of cases: the first is guessing the charset incorrectly in a way that completely degrades the text, e.g. 8859-1 vs. 8859-2. The second is a more subtle, and arguably much less objectionable, kind of mistake, e.g. 8859-1 vs. 1252, or the "smart quotes" problem.

I like this distinction, and would point out that we can probably quantify it into a continuum, in the sense that most of the code points in 8859-1 and 1252 are equivalent, while far fewer are shared between 8859-1 and 8859-2. (If we wished, we could refine this further by assigning a different penalty for showing the wrong glyph for an alphabetic character than for punctuation.)
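
To make that concrete, here is a minimal sketch (Python is my choice here, nothing from the original discussion) that quantifies the continuum by counting how many of the 256 byte values decode to different characters under two single-byte charsets:

    def confusion_penalty(cs_a, cs_b):
        """Fraction of byte values that decode differently under the
        two charsets; bytes undefined in one charset count as
        mismatches."""
        mismatches = 0
        for b in range(256):
            raw = bytes([b])
            try:
                ch_a = raw.decode(cs_a)
            except UnicodeDecodeError:
                ch_a = None
            try:
                ch_b = raw.decode(cs_b)
            except UnicodeDecodeError:
                ch_b = None
            if ch_a != ch_b:
                mismatches += 1
        return mismatches / 256.0

    # 8859-1 vs. 1252 should score as far more benign than
    # 8859-1 vs. 8859-2:
    print(confusion_penalty('iso-8859-1', 'cp1252'))
    print(confusion_penalty('iso-8859-1', 'iso-8859-2'))

The per-character refinement would simply replace the 0/1 mismatch count with a weight that depends on whether the byte is alphabetic or punctuation.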

If I were to design a charset-verifier, I would distinguish between
these two cases. If something came tagged with a region-specific
charset, I would honor that, unless I found strong evidence of the "this
can't be right" nature. In some cases, collecting such evidence would
require significant statistics. The rule here should be "do no harm",
that is, destroying a document by incorrectly changing a true charset
should receive a much higher penalty than failing to detect a broken
charset. (That way, you don't penalize people who live by the rules :).

I have always thought that the "right way" to determine the correct charset of a document is to treat it as a statistical classification problem. Given a collection of documents as training data, we could extract features including the following (a sketch in code follows the list):

- "suggested" charset, document type, and other information from metadata,
  such as HTTP Content-Type, HTML <META> tags, email headers, etc.
- various statistical signatures from the text itself, e.g. ngrams
- top-level domain of the originating web site
- anything else we can think of
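
Here is a hedged sketch of what extracting such features might look like (the function and feature names are invented for illustration):

    import re
    from collections import Counter
    from urllib.parse import urlparse

    def extract_features(raw_bytes, headers, url):
        features = {}
        # Declared charset from the HTTP Content-Type header, if any.
        m = re.search(r'charset=([\w-]+)',
                      headers.get('Content-Type', ''), re.I)
        features['declared_charset'] = m.group(1).lower() if m else None
        # Declared charset from an HTML <META> tag, if any.
        m = re.search(rb'charset=["\']?([\w-]+)', raw_bytes[:2048], re.I)
        features['meta_charset'] = (m.group(1).decode('ascii').lower()
                                    if m else None)
        # Byte-bigram signature of the text itself.
        features['byte_bigrams'] = Counter(
            raw_bytes[i:i + 2] for i in range(len(raw_bytes) - 1))
        # Top-level domain of the originating site.
        host = urlparse(url).hostname or ''
        features['tld'] = host.rsplit('.', 1)[-1] if '.' in host else None
        return features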

We can then apply one of many possible multi-class algorithms developed by the machine learning community to this training set. Such an algorithm would learn how to weight the different features so as to tag the most documents correctly. (For some of these algorithms we would have to tag each document in the training set with the "real" charset, but there are also semi-supervised and unsupervised algorithms that would discover the most consistent assignment, if we were unable or unwilling to correctly tag everything in our dataset.)
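
For instance, with a present-day toolkit like scikit-learn (my choice of tool, not anything from the original discussion), a first cut at the supervised version might look like this, treating each document's raw bytes as character n-grams:

    from sklearn.feature_extraction.text import CountVectorizer
    from sklearn.linear_model import LogisticRegression
    from sklearn.pipeline import make_pipeline

    # Toy stand-ins: real training data would be large and varied.
    # Each document's raw bytes are read as latin-1 so that every
    # byte value survives as a distinct character.
    docs = ["caf\xe9 cr\xe8me br\xfbl\xe9e", "\xbeivot je kr\xe1sny"]
    labels = ["iso-8859-1", "iso-8859-2"]

    clf = make_pipeline(
        CountVectorizer(analyzer='char', ngram_range=(1, 3)),
        LogisticRegression(max_iter=1000),  # one class per charset
    )
    clf.fit(docs, labels)
    print(clf.predict(["d\xe9j\xe0 vu"]))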

I have always assumed that Google, or someone, must already be doing this sort of thing (although perhaps not on Google Groups!).

Asmus' comments made me realize that the machine learning approach I outline above can be taken even further: there are many classification algorithms that can be trained with different penalties for different kinds of mistakes. These penalties could be determined by hand, or could come from quantifying the potential degradation as I describe above. This provides a natural and principled way to require far more evidence for overriding 8859-1 with 8859-2 than with 1252, for example.
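
A minimal sketch of that decision rule, assuming we already have posterior probabilities from some classifier and a hand-built cost matrix (the numbers below are made up): choose the charset that minimizes expected cost, not the one with the highest probability.

    import numpy as np

    charsets = ['iso-8859-1', 'cp1252', 'iso-8859-2']

    # cost[i][j] = penalty for answering charsets[j] when the truth
    # is charsets[i].  Confusing 8859-1 with 1252 is nearly free; a
    # destructive override to 8859-2 costs far more than merely
    # failing to detect it.
    cost = np.array([
        [0.0, 0.1, 5.0],
        [0.1, 0.0, 5.0],
        [1.0, 1.0, 0.0],
    ])

    def min_expected_cost(posterior):
        # posterior[i] = P(true charset is charsets[i] | evidence)
        expected = posterior @ cost  # expected cost of each answer
        return charsets[int(np.argmin(expected))]

    # 8859-2 is the single most probable answer here, yet the
    # expected-cost rule still refuses to override 8859-1:
    print(min_expected_cost(np.array([0.45, 0.05, 0.50])))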

- John D. Burger
  MITRE

