proposed changes to CORPUS_POLICY

Daniel Quinlan 22 Jun 2004 02:53:07 -0000

I think we should consider some updates to the policy, especially
considering the copious amounts of spam we have, the recent explosion in
joe-job bounces, etc.  Current policy below, but first, here are my
proposed changes:


1. firm age limits:
   - no ham older than 12 months
   - no spam older than 6 months
2. no SpamAssassin mailing lists from sourceforge.net or apache.org
   (corpus bias, false positives, etc.)
3. no viruses (please check spam and ham with ClamAV or another
   anti-virus program to remove these)
4. no messages with envelope-sender of <> or <[EMAIL PROTECTED]> to
   remove bounces
5. no mailing list moderation administrative messages since these also
   contain spam
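
To make the mechanical rules above concrete, here is a rough Python sketch of a pre-submission check for rules 1, 2 and 4. It is only an illustration under several assumptions that are not part of the proposal: the function name rejection_reason is hypothetical, message age is judged by the Date header (which spam can forge), the envelope sender is assumed to have been saved in a Return-Path header at delivery, and SpamAssassin list traffic is guessed from the List-Id header. Rules 3 and 5 are left out because they need a virus scanner and list-specific knowledge.

```python
from datetime import datetime, timedelta, timezone
from email import message_from_string
from email.utils import parsedate_to_datetime

# Approximate limits; the proposal only says "12 months" and "6 months".
MAX_AGE = {"ham": timedelta(days=365), "spam": timedelta(days=183)}


def rejection_reason(raw_message, kind, now):
    """Return a reason string if the message breaks a rule, else None.

    kind is "ham" or "spam"; now is a timezone-aware datetime.
    """
    msg = message_from_string(raw_message)

    # Rule 1: firm age limits, judged by the Date header (an assumption).
    date_header = msg.get("Date")
    if date_header is None:
        return "no Date header"
    try:
        sent = parsedate_to_datetime(date_header)
    except (TypeError, ValueError):
        return "unparseable Date header"
    if sent.tzinfo is None:
        # Treat dates without a zone as UTC so the subtraction works.
        sent = sent.replace(tzinfo=timezone.utc)
    if now - sent > MAX_AGE[kind]:
        return "too old"

    # Rule 4: null envelope sender, assuming it was saved as Return-Path.
    if msg.get("Return-Path", "").strip() == "<>":
        return "bounce"

    # Rule 2: SpamAssassin lists, guessed from the List-Id header.
    list_id = msg.get("List-Id", "").lower()
    if "spamassassin" in list_id and (
        "sourceforge.net" in list_id or "apache.org" in list_id
    ):
        return "SpamAssassin list"

    return None
```

Anything for which this returns a reason would be dropped before running mass-check. Passing "now" explicitly keeps the age check repeatable when the same corpus is re-checked later.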

current policy is:

------- start of cut text --------------
SpamAssassin relies on corpus data to generate good scores.  Here's the policy
we use to judge if a corpus is "good" or not.  It should be:

  - hand-verified as "spam" and "ham" (non-spam) piles -- *not* just classified
    using existing spam-classification algorithms (such as SpamAssassin itself)

  - containing a representative mix of ham mail -- that includes
    commercial-sounding-but-not-spam messages, legitimate business discussion
    (which may include talk of "sales", "marketing", "offers" etc), or verified
    opt-in mail newsletters. This is a *very* important point!

  - containing no old spam mail.  Older spam uses different tricks and
    terminology, which will impact SpamAssassin's accuracy when it's filtering
    "live", new mail.  Please try not to scan spam older than 6 months.

  - cleaned of viruses, and forwarded spam messages.  These will skew the
    results.

  - and finally, cleaned of discussion of spam or virus messages or signatures
    (such as SpamAssassin-talk or bugtraq mailing list messages).  Even though
    they are ham, these often contain snippets of code that incorrectly
    trigger tests, and again will skew the results.  (Rewriting the tests to
    avoid triggering on SpamAssassin-talk messages is not realistic!)

Once you run "mass-check" on a corpus, see the instructions in "CORPUS_SUBMIT"
for details of how to verify that the top scorers are not accidental spam that
got through.

lastmod: Jan 13 2003 jm
------- end ----------------------------

Daniel

-- 
Daniel Quinlan
http://www.pathname.com/~quinlan/
