Results for a couple of timcv.py tests that I've done recently are here:

<http://entrian.com/sbwiki/SpfTokenizing>
<http://entrian.com/sbwiki/DeAnagraming>

The former was in response to a request to tokenize the Received-SPF headers. I don't have a great deal of mail with those headers (and looking at the specs, it's not clear whether they are still meant to be used). Hardly anything changed, anyway, so it doesn't seem worth doing anything with them at the moment.

The latter was prompted by a comment in JGC's latest newsletter (though I'm sure I've seen this somewhere before, too). To counter deliberate misspellings and the so-called 'Cambridge effect', you replace each token with (or generate an additional token from) the letters of the original token sorted into a constant order (e.g. alphabetical). So "god" becomes "dgo", but so does "dog".

I tried both replacing the original token and adding a new one, and tried making the change in the headers, in the body, and in both. In the good cases FPs weren't really affected, but FNs always increased, as did unsures. That, combined with the side effect of making the database harder to read, makes this seem like a bad idea.

Anyway, just FYI :)

=Tony.Meyer
_______________________________________________
spambayes-dev mailing list
spambayes-dev@python.org
http://mail.python.org/mailman/listinfo/spambayes-dev
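
For anyone who wants to try the letter-sorting idea themselves, here is a minimal sketch of the transformation described above. It is not the code used for the timcv.py runs: the function names, the `replace` flag and the `anagram:` prefix are all hypothetical, chosen only for illustration, and it works on a plain iterable of token strings rather than hooking into the SpamBayes tokenizer.

```python
def sort_letters(token):
    """Return the token's characters sorted into a constant (alphabetical) order."""
    return "".join(sorted(token))


def deanagram(tokens, replace=True, prefix="anagram:"):
    """Yield de-anagrammed tokens (hypothetical helper, not a SpamBayes API).

    replace=True  -> each token is replaced by its sorted form.
    replace=False -> the original token is kept and a prefixed,
                     sorted copy is generated alongside it.
    """
    for token in tokens:
        if replace:
            yield sort_letters(token)
        else:
            yield token
            yield prefix + sort_letters(token)


if __name__ == "__main__":
    print(list(deanagram(["god", "dog"])))
    print(list(deanagram(["god"], replace=False)))
```

In replace mode "god" and "dog" collapse into the same token, which is the loss of information mentioned above; the add mode keeps the original tokens but grows the database.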