Asmus distinguishes between two kinds of cases: the first is guessing the charset incorrectly in a way that completely degrades the text, e.g. 8859-1 vs. 8859-2. The second is a more subtle, and arguably much less objectionable, kind of mistake, e.g. 8859-1 vs. 1252, or the "smart quotes" problem.

I like this distinction, and would point out that we can probably quantify it into a continuum, in the sense that most of the code points in 8859-1 and 1252 are equivalent, while far fewer are shared between 8859-1 and 8859-2. (If we wished, we could refine this further by assigning a different penalty for showing the wrong glyph for an alphabetic character than for punctuation.)
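
To make that concrete, here is a minimal sketch (Python is my choice here, nothing from the original discussion) that quantifies the continuum by counting how many of the 256 byte values decode to different characters under two single-byte charsets:

    def confusion_penalty(cs_a, cs_b):
        """Fraction of byte values that decode differently under the
        two charsets; bytes undefined in one charset count as
        mismatches."""
        mismatches = 0
        for b in range(256):
            raw = bytes([b])
            try:
                ch_a = raw.decode(cs_a)
            except UnicodeDecodeError:
                ch_a = None
            try:
                ch_b = raw.decode(cs_b)
            except UnicodeDecodeError:
                ch_b = None
            if ch_a != ch_b:
                mismatches += 1
        return mismatches / 256.0

    # 8859-1 vs. 1252 should score as far more benign than
    # 8859-1 vs. 8859-2:
    print(confusion_penalty('iso-8859-1', 'cp1252'))
    print(confusion_penalty('iso-8859-1', 'iso-8859-2'))

The per-character refinement would simply replace the 0/1 mismatch count with a weight that depends on whether the byte is alphabetic or punctuation.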

If I were to design a charset-verifier, I would distinguish between
these two cases. If something came tagged with a region-specific
charset, I would honor that, unless I found strong evidence of the "this
can't be right" nature. In some cases, collecting such evidence would
require significant statistics. The rule here should be "do no harm",
that is, destroying a document by incorrectly changing a true charset
should receive a much higher penalty than failing to detect a broken
charset. (That way, you don't penalize people who live by the rules :).

I have always thought that the "right way" to determine the correct charset of a document is to treat it as a statistical classification problem. Given a collection of documents as training data, we could extract features including the following (a sketch in code follows the list):

- "suggested" charset, document type, and other information from metadata,
  such as HTTP Content-Type, HTML <META> tags, email headers, etc.
- various statistical signatures from the text itself, e.g. ngrams
- top-level domain of the originating web site
- anything else we can think of
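
Here is a hedged sketch of what extracting such features might look like (the function and feature names are invented for illustration):

    import re
    from collections import Counter
    from urllib.parse import urlparse

    def extract_features(raw_bytes, headers, url):
        features = {}
        # Declared charset from the HTTP Content-Type header, if any.
        m = re.search(r'charset=([\w-]+)',
                      headers.get('Content-Type', ''), re.I)
        features['declared_charset'] = m.group(1).lower() if m else None
        # Declared charset from an HTML <META> tag, if any.
        m = re.search(rb'charset=["\']?([\w-]+)', raw_bytes[:2048], re.I)
        features['meta_charset'] = (m.group(1).decode('ascii').lower()
                                    if m else None)
        # Byte-bigram signature of the text itself.
        features['byte_bigrams'] = Counter(
            raw_bytes[i:i + 2] for i in range(len(raw_bytes) - 1))
        # Top-level domain of the originating site.
        host = urlparse(url).hostname or ''
        features['tld'] = host.rsplit('.', 1)[-1] if '.' in host else None
        return features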

We can then apply one of many possible multi-class algorithms developed by the machine learning community to this training set. Such an algorithm would learn how to weight the different features so as to tag the most documents correctly. (For some of these algorithms we would have to tag each document in the training set with the "real" charset, but there are also semi-supervised and unsupervised algorithms that would discover the most consistent assignment, if we were unable or unwilling to correctly tag everything in our dataset.)
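
For instance, with a present-day toolkit like scikit-learn (my choice of tool, not anything from the original discussion), a first cut at the supervised version might look like this, treating each document's raw bytes as character n-grams:

    from sklearn.feature_extraction.text import CountVectorizer
    from sklearn.linear_model import LogisticRegression
    from sklearn.pipeline import make_pipeline

    # Toy stand-ins: real training data would be large and varied.
    # Each document's raw bytes are read as latin-1 so that every
    # byte value survives as a distinct character.
    docs = ["caf\xe9 cr\xe8me br\xfbl\xe9e", "\xbeivot je kr\xe1sny"]
    labels = ["iso-8859-1", "iso-8859-2"]

    clf = make_pipeline(
        CountVectorizer(analyzer='char', ngram_range=(1, 3)),
        LogisticRegression(max_iter=1000),  # one class per charset
    )
    clf.fit(docs, labels)
    print(clf.predict(["d\xe9j\xe0 vu"]))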

I have always assumed that Google, or someone, must already be doing this sort of thing (although perhaps not on Google Groups!).

Asmus' comments made me realize that the machine learning approach I outline above can be taken even further: there are many classification algorithms that can be trained with different penalties for different kinds of mistakes. These penalties could be determined by hand, or could come from quantifying the potential degradation as I describe above. This provides a natural and principled way to require far more evidence for overriding 8859-1 with 8859-2 than with 1252, for example.
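
A minimal sketch of that decision rule, assuming we already have posterior probabilities from some classifier and a hand-built cost matrix (the numbers below are made up): choose the charset that minimizes expected cost, not the one with the highest probability.

    import numpy as np

    charsets = ['iso-8859-1', 'cp1252', 'iso-8859-2']

    # cost[i][j] = penalty for answering charsets[j] when the truth
    # is charsets[i].  Confusing 8859-1 with 1252 is nearly free; a
    # destructive override to 8859-2 costs far more than merely
    # failing to detect it.
    cost = np.array([
        [0.0, 0.1, 5.0],
        [0.1, 0.0, 5.0],
        [1.0, 1.0, 0.0],
    ])

    def min_expected_cost(posterior):
        # posterior[i] = P(true charset is charsets[i] | evidence)
        expected = posterior @ cost  # expected cost of each answer
        return charsets[int(np.argmin(expected))]

    # 8859-2 is the single most probable answer here, yet the
    # expected-cost rule still refuses to override 8859-1:
    print(min_expected_cost(np.array([0.45, 0.05, 0.50])))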

- John D. Burger
  MITRE

