-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA1

+0.9 on those proposed changes. Only 1 change: I would suggest that
legit bounce messages, where you (the user) send a ham (obviously ;)
mail and it bounces, should be retained in the ham corpus where they
occur.

- --j.

Daniel Quinlan writes:
> I think we should consider some updates to the policy, especially
> considering the copious amounts of spam we have, the recent explosion in
> joe-job bounces, etc. Current policy below, but first, here are my
> proposed changes:
>
>  1. firm age limits:
>     - no ham older than 12 months
>     - no spam older than 6 months
>  2. no SpamAssassin mailing lists from sourceforge.net or apache.org
>     (corpus bias, false positives, etc.)
>  3. no viruses (please check spam and ham with ClamAV or another
>     anti-virus program to remove these)
>  4. no messages with envelope-sender of <> or <[EMAIL PROTECTED]> to
>     remove bounces
>  5. no mailing list moderation administrative messages since these also
>     contain spam
>
> current policy is:
>
> ------- start of cut text --------------
> SpamAssassin relies on corpus data to generate good scores. Here's the policy
> we use to judge if a corpus is "good" or not. It should be:
>
>   - hand-verified as "spam" and "ham" (non-spam) piles -- *not* just
>     classified using existing spam-classification algorithms (such as
>     SpamAssassin itself)
>
>   - containing a representative mix of ham mail -- that includes
>     commercial-sounding-but-not-spam messages, legitimate business discussion
>     (which may include talk of "sales", "marketing", "offers" etc), or
>     verified opt-in mail newsletters. This is a *very* important point!
>
>   - containing no old spam mail. Older spam uses different tricks and
>     terminology, which will impact SpamAssassin's accuracy when it's filtering
>     "live", new mail. Please try not to scan spam older than 6 months.
>
>   - cleaned of viruses, and forwarded spam messages. These will skew the
>     results.
>
>   - and finally, cleaned of discussion of spam or virus messages or signatures
>     (such as SpamAssassin-talk or bugtraq mailing list messages). Even though
>     they are ham, these often contain snippets of code that incorrectly
>     trigger tests, and again will skew the results. (Rewriting the tests to
>     avoid triggering on SpamAssassin-talk messages is not realistic!)
>
> Once you run "mass-check" on a corpus, see the instructions in "CORPUS_SUBMIT"
> for details of how to verify that the top scorers are not accidental spam that
> got through.
>
> lastmod: Jan 13 2003 jm
> ------- end ----------------------------
>
> Daniel

-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1.2.4 (GNU/Linux)
Comment: Exmh CVS

iD8DBQFA18QOQTcbUG5Y7woRAhe+AKDACdmkixsGBGKXltEmzG9K49nyjACgl0tv
QCy34Ned5SKuSl1zPqkueKE=
=aUEz
-----END PGP SIGNATURE-----
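
For anyone pre-filtering a corpus before mass-check, the proposed age limits (item 1) and the bounce rule (item 4, with the ham exception suggested in the reply) could be sketched roughly as below. This is only an illustration, not part of SpamAssassin or its tools: the function name keep_message, the keep_ham_bounces flag, the use of the Return-Path header as a stand-in for the envelope sender, and the rounding of 12 and 6 months to 365 and 183 days are all assumptions. It also cannot tell a legitimate bounce from a joe-job bounce; that still depends on hand-verification into the ham or spam pile.

```python
from datetime import datetime, timedelta, timezone
from email import message_from_string
from email.utils import parsedate_to_datetime

# Assumption: 12 and 6 months are approximated as 365 and 183 days.
MAX_AGE = {"ham": timedelta(days=365), "spam": timedelta(days=183)}


def keep_message(raw, kind, now, keep_ham_bounces=True):
    """Return True if a hand-classified message passes the age and bounce checks.

    raw  -- full message text, headers included
    kind -- "ham" or "spam", as decided by hand-verification
    now  -- timezone-aware datetime used as the reference point
    """
    msg = message_from_string(raw)

    # Age limit: a message with a missing or unparseable Date is rejected.
    date_header = msg["Date"]
    if date_header is None:
        return False
    try:
        sent = parsedate_to_datetime(date_header)
    except (TypeError, ValueError):
        return False
    if sent.tzinfo is None:
        # A "-0000" zone yields a naive datetime; treat it as UTC.
        sent = sent.replace(tzinfo=timezone.utc)
    if now - sent > MAX_AGE[kind]:
        return False

    # Bounce check: assumes the envelope sender was recorded in Return-Path.
    return_path = msg["Return-Path"]
    if return_path is not None and return_path.strip() == "<>":
        # Bounces of the user's own ham may be retained in the ham corpus.
        return kind == "ham" and keep_ham_bounces

    return True
```

Calling keep_message(raw, "spam", datetime.now(timezone.utc)) on each file in a spam pile would drop bounces and anything older than roughly six months; passing keep_ham_bounces=False applies item 4 as originally proposed, with no exception for ham.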
