[Tony Meyer]
> Results for a couple of timcv.py tests that I've done recently are
> here:

It's sure nice to see someone is still testing ideas!  It would be
even nicer if we could find a good one <wink>.

> <http://entrian.com/sbwiki/SpfTokenizing>
> <http://entrian.com/sbwiki/DeAnagraming>
>
> The former was in response to a request to tokenize the
> Received-SPF headers.  I don't have a great deal of mail with
> those headers (and looking at the specs, it's not clear whether
> they are still meant to be used).  Hardly anything changed,
> anyway, so it doesn't seem worth doing anything with them
> at the moment.

Indeed, I had to stare hard to find any difference at all.

> The latter was prompted by a comment in JGC's latest
> newsletter (though I'm sure I've seen this somewhere before,
> too).  To counter deliberate misspellings and the so-called
> 'Cambridge effect', you replace each token with (or generate in
> addition) a token made up of the letters of the original sorted
> into a constant order (e.g. alphabetical).  So "god" becomes "dgo",
> but so does "dog".
>
> I tried both replacing the original token and adding a new one,
> and tried making the change in the headers, in the body, and
> both.  In the good cases FPs weren't really affected, but FNs
> always increased, as did unsures.  That, together with the effect
> of making the database harder to read, makes this a bad idea, it
> seems.

Yup.  I see very little Camridbge Unvierstiy obfuscation, so I
wouldn't expect this to help.  In effect, replacing tokens with a
canonicalized form is a limited kind of hashing (mapping multiple
tokens to one), and the only kind of deliberate token-confusion that
ever won in tests was the "skip:" gimmick for very long tokens.
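
To make the "mapping multiple tokens to one" point concrete, here is a
minimal sketch of the canonicalization under discussion.  It assumes plain
alphabetical sorting of each token's characters; the function names are
made up for illustration and are not part of the SpamBayes tokenizer.

```python
def canonicalize(token):
    """Return the token's characters sorted into a constant order."""
    return "".join(sorted(token))


def replace_tokens(tokens):
    """Variant 1: replace every token with its canonical form."""
    return [canonicalize(t) for t in tokens]


def add_tokens(tokens):
    """Variant 2: keep each original token and add its canonical form."""
    out = []
    for t in tokens:
        out.append(t)
        out.append(canonicalize(t))
    return out


# Distinct words collapse onto one token, which is the hashing effect:
# "god" and "dog" can no longer be told apart after replacement.
collapsed = {canonicalize(w) for w in ["god", "dog"]}
```

With replacement, any evidence that distinguished "god" from "dog" is lost;
with addition, the original evidence survives but every word contributes a
second, correlated token.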

In the cases where you added the canonicalized form (in addition to
retaining the original form), it may have a bad interaction with the
bigram option (which I believe you use), destroying the natural
bigrams.  It would be clearer to turn bigrams off in that case.  But I
wouldn't expect it to help anyway.
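
A rough sketch of the interaction I have in mind.  It assumes the canonical
form is emitted immediately after each original token, and it models bigrams
as simple pairs of adjacent tokens; both are assumptions for illustration,
not a description of how the tokenizer or the bigram option is actually
implemented.

```python
def canonicalize(token):
    """Return the token's characters sorted into a constant order."""
    return "".join(sorted(token))


def interleave_canonical(tokens):
    """Emit each original token followed by its canonical form (assumed order)."""
    out = []
    for t in tokens:
        out.extend([t, canonicalize(t)])
    return out


def bigrams(tokens):
    """Pair each token with its successor (simplified model of bigrams)."""
    return list(zip(tokens, tokens[1:]))


original = ["cheap", "pills", "online"]
natural = bigrams(original)
mixed = bigrams(interleave_canonical(original))

# Natural bigrams that still appear once canonical forms are interleaved.
surviving = [pair for pair in natural if pair in mixed]
```

Under these assumptions no natural bigram survives: every adjacent pair in
the mixed stream has a canonical token on at least one side.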
_______________________________________________
spambayes-dev mailing list
spambayes-dev@python.org
http://mail.python.org/mailman/listinfo/spambayes-dev