[Tony Meyer]
> Results for a couple of timcv.py tests that I've done recently are
> here:
It's sure nice to see someone is still testing ideas!  It would be even
nicer if we could find a good one <wink>.

> <http://entrian.com/sbwiki/SpfTokenizing>
> <http://entrian.com/sbwiki/DeAnagraming>
>
> The former was in response to a request to tokenize the
> Received-SPF headers.  I don't have a great deal of mail with
> those headers (and looking at the specs, it's not clear whether
> they are still meant to be used).  Hardly anything changed,
> anyway, so it doesn't seem worth doing anything with them
> at the moment.

Indeed, I had to stare hard to find any difference at all.

> The latter was prompted by a comment in JGC's latest
> newsletter (though I'm sure I've seen this somewhere before,
> too).  To avoid deliberate misspellings and the so-called
> 'cambridge effect' you replace each (or generate a new) token
> that is made up of the letters in the original token sorted into a
> constant order (e.g. alphabetical).  So "god" becomes "dgo",
> but so does "dog".
>
> I tried both replacing the original token and adding a new one,
> and tried making the change in the headers, in the body, and
> both.  In the good cases FPs weren't really effected, but FNs
> always increased, as did unsures, so that with the effect of
> making the database harder to read, makes this a bad
> idea it seems.

Yup.  I see very little Camridbge Unvierstiy obfuscation, so I wouldn't
expect this to help.  In effect, replacing tokens with a canonicalized
form is a limited kind of hashing (mapping multiple tokens to one), and
the only kind of deliberate token-confusion that ever won in tests was
the "skip:" gimmick for very long tokens.

In the cases where you added the canonicalized form (in addition to
retaining the original form), it may have a bad interaction with the
bigram option (which I believe you use), destroying the natural bigrams.
It would be clearer to turn bigrams off in that case.  But I wouldn't
expect it to help anyway.
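
For anyone who wants to see the two variants side by side, here is a rough sketch of the letter-sorting idea. It is not the code used in the tests above and not part of the SpamBayes tokenizer: the function names and the "anagram:" prefix are made up for illustration, and the pairing function only assumes that bigrams are formed from adjacent tokens in the stream.

```python
def canonicalize(token):
    """Return the token's letters sorted into alphabetical order."""
    return "".join(sorted(token))


def replace_tokens(tokens):
    """Variant 1: replace every token with its canonical form."""
    return [canonicalize(tok) for tok in tokens]


def add_tokens(tokens, prefix="anagram:"):
    """Variant 2: keep each token and add a canonical one right after it.

    The prefix is hypothetical; it just keeps the generated tokens
    distinct from ordinary words.
    """
    out = []
    for tok in tokens:
        out.append(tok)
        out.append(prefix + canonicalize(tok))
    return out


def adjacent_pairs(tokens):
    """Pair each token with its successor (a stand-in for bigrams)."""
    return list(zip(tokens, tokens[1:]))


words = ["cheap", "pills"]
print(replace_tokens(["god", "dog"]))
print(adjacent_pairs(words))
print(adjacent_pairs(add_tokens(words)))
```

With the tokens interleaved like this, the natural pair ("cheap", "pills") no longer shows up among the adjacent pairs, which is the kind of bigram interaction suggested above. Whether that actually happens depends on where the extra tokens are injected relative to bigram generation.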
_______________________________________________
spambayes-dev mailing list
spambayes-dev@python.org
http://mail.python.org/mailman/listinfo/spambayes-dev