[spambayes-dev] More stupid beats smart timcv.py results

Tony Meyer Tue, 18 Jan 2005 15:22:25 -0800

Results for a couple of timcv.py tests that I've done recently are here:

<http://entrian.com/sbwiki/SpfTokenizing>
<http://entrian.com/sbwiki/DeAnagraming>


The former was in response to a request to tokenize the Received-SPF
headers.  I don't have a great deal of mail with those headers (and looking
at the specs, it's not clear whether they are still meant to be used).
Hardly anything changed, anyway, so it doesn't seem worth doing anything
with them at the moment.

The latter was prompted by a comment in JGC's latest newsletter (though I'm
sure I've seen this somewhere before, too).  To counter deliberate
misspellings and the so-called 'Cambridge effect', you replace each token
with (or add alongside it) a new token made up of the letters of the
original token sorted into a constant order (e.g. alphabetical).  So "god"
becomes "dgo", but so does "dog".
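
For anyone who wants to see the idea concretely, here is a minimal sketch of
the transformation, assuming tokens are plain strings.  The names deanagram
and deanagram_tokens are hypothetical (they are not SpamBayes functions), and
this is not the code used for the tests, just an illustration of the
replace/add variants:

```python
def deanagram(token):
    """Return the token's characters sorted into a constant order.

    Alphabetical order is used here; any fixed order would do.
    """
    return "".join(sorted(token))


def deanagram_tokens(tokens, replace=True):
    """Yield de-anagrammed tokens (hypothetical helper, for illustration).

    With replace=True each original token is replaced by its sorted form;
    with replace=False the original is kept and the sorted form is added
    after it.
    """
    for token in tokens:
        if not replace:
            yield token
        yield deanagram(token)


print(list(deanagram_tokens(["god", "dog"])))
print(list(deanagram_tokens(["god"], replace=False)))
```

Note that in the "add" variant a token whose letters are already in sorted
order simply appears twice.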

I tried both replacing the original token and adding a new one, and tried
making the change in the headers, in the body, and in both.  In the good
cases FPs weren't really affected, but FNs always increased, as did unsures.
Combined with the side effect of making the database harder to read, that
makes this seem like a bad idea.

Anyway, just FYI :)

=Tony.Meyer

_______________________________________________
spambayes-dev mailing list
spambayes-dev@python.org
http://mail.python.org/mailman/listinfo/spambayes-dev
