Daniel Quinlan wrote:

>> So, before we make the pre2 release and start mass-checks, there's one
>> thing I want to nail down in the corpus policy: should we just remove
>> any spam list that has tons of false positives?

Theo Van Dinter <[EMAIL PROTECTED]> writes:

> It would depend what the FPs are from I'd say.

Well, I'd rather just have a hard and fast rule such as "remove
anti-spam mailing lists if spam snippets or domain names are frequently
posted". If we're going to remove any FPs (by rule or message), then
there's really no point in including other messages because they won't
affect the perceptron results.

>> Removing the SpamAssassin ones is just common sense, but I looked at my
>> false positives and 59 out of 102 of my false positives are from another
>> anti-spam mailing list that frequently includes snippets of spam, URLs,

> Ah. IMO, any spam-related mails have no place in a ham corpus.
> They're not going to be considered "standard" for most people, and as
> you've said, they have a large tendency to include spam snippets/etc
> that cause filters to go all gonzo.

Well, since you ask... out of the non-net FPs:

     35 BIZ_TLD
     19 DRUGS_ERECTILE
     18 FORGED_RCVD_HELO
     15 DRUGS_ANXIETY
     12 DRUGS_ANXIETY_EREC
     11 DRUGS_PAIN
     10 INFO_TLD
      9 DRUGS_ERECTILE_OBFU
      9 DRUGS_ANXIETY_OBFU
      8 MAILTO_TO_SPAM_ADDR
      6 NORMAL_HTTP_TO_IP
      6 DRUGS_PAIN_EREC
      6 DOMAIN_4U2
      5 HTTP_EXCESSIVE_ESCAPES
      5 FROM_NO_LOWER
      5 DRUGS_MANYKINDS
      5 DRUGS_DIET_EREC
      5 DRUGS_DIET

... and with network tests, there are amazingly only 5 false positives
from that list instead of 59 because we rely on the body a lot less:

      2 BIZ_TLD
      2 DOMAIN_4U2
      2 DOMAIN_RATIO
      2 FROM_NO_LOWER
      2 INFO_TLD
      2 URI_OFFERS

Clearly, domain names are the main issue here. Snippets of drug spam
are popular too.

I think removing them *all* would be better and would better match
actual practice (while I do tag them with SA headers, I don't filter
this list or the SpamAssassin lists).

Daniel

-- 
Daniel Quinlan
http://www.pathname.com/~quinlan/
