Re: proposed changes to CORPUS_POLICY

Justin Mason 22 Jun 2004 05:31:37 -0000

-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA1


+0.9 on those proposed changes.

Only 1 change: I would suggest that legit bounce messages, where you (the
user) send a ham (obviously ;) mail and it bounces, should be retained in
the ham corpus where they occur.

- --j.

Daniel Quinlan writes:
> I think we should consider some updates to the policy, especially
> considering the copious amounts of spam we have, the recent explosion in
> joe-job bounces, etc.  Current policy below, but first, here are my
> proposed changes:
> 
> 1. firm age limits:
>    - no ham older than 12 months
>    - no spam older than 6 months
> 2. no SpamAssassin mailing lists from sourceforge.net or apache.org
>    (corpus bias, false positives, etc.)
> 3. no viruses (please check spam and ham with ClamAV or another
>    anti-virus program to remove these)
> 4. no messages with envelope-sender of <> or <[EMAIL PROTECTED]> to
>    remove bounces
> 5. no mailing list moderation administrative messages since these also
>    contain spam
> 
> current policy is:
> 
> ------- start of cut text --------------
> SpamAssassin relies on corpus data to generate good scores.  Here's the policy
> we use to judge if a corpus is "good" or not.  It should be:
> 
>   - hand-verified as "spam" and "ham" (non-spam) piles -- *not* just classified
>     using existing spam-classification algorithms (such as SpamAssassin itself)
> 
>   - containing a representative mix of ham mail -- that includes
>     commercial-sounding-but-not-spam messages, legitimate business discussion
>     (which may include talk of "sales", "marketing", "offers" etc), or verified
>     opt-in mail newsletters. This is a *very* important point!
> 
>   - containing no old spam mail.  Older spam uses different tricks and
>     terminology, which will impact SpamAssassin's accuracy when it's filtering
>     "live", new mail.  Please try not to scan spam older than 6 months.
> 
>   - cleaned of viruses, and forwarded spam messages.  These will skew the
>     results.
> 
>   - and finally, cleaned of discussion of spam or virus messages or signatures
>     (such as SpamAssassin-talk or bugtraq mailing list messages).  Even though
>     they are ham, these often contain snippets of code that incorrectly
>     trigger tests, and again will skew the results.  (Rewriting the tests to
>     avoid triggering on SpamAssassin-talk messages is not realistic!)
> 
> Once you run "mass-check" on a corpus, see the instructions in "CORPUS_SUBMIT"
> for details of how to verify that the top scorers are not accidental spam that
> got through.
> 
> lastmod: Jan 13 2003 jm
> ------- end ----------------------------
> 
> Daniel
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1.2.4 (GNU/Linux)
Comment: Exmh CVS

iD8DBQFA18QOQTcbUG5Y7woRAhe+AKDACdmkixsGBGKXltEmzG9K49nyjACgl0tv
QCy34Ned5SKuSl1zPqkueKE=
=aUEz
-----END PGP SIGNATURE-----
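
As a rough illustration of proposed changes 1 and 4 in the message above, together with the suggested exception for legitimate bounces in the ham corpus, a corpus could be pre-filtered along the following lines. This is only a sketch under stated assumptions and is not part of SpamAssassin or its mass-check tooling: the function names, the `keep_ham_bounces` flag, the use of the Return-Path header as a stand-in for the envelope sender, and the day counts chosen for "12 months" and "6 months" are all hypothetical. It only checks the null sender `<>`, and it cannot tell a legitimate bounce from a joe-job bounce, so hand verification would still be needed.

```python
from datetime import datetime, timedelta, timezone
from email import message_from_string
from email.utils import parsedate_to_datetime

# Assumed day counts for the proposed limits: 12 months for ham, 6 for spam.
MAX_AGE = {"ham": timedelta(days=365), "spam": timedelta(days=183)}


def is_null_sender(msg):
    """True if the Return-Path header records the null envelope sender <>."""
    return (msg.get("Return-Path") or "").strip() == "<>"


def keep_message(raw, kind, now, keep_ham_bounces=True):
    """Decide whether a raw message belongs in the 'ham' or 'spam' corpus.

    kind is "ham" or "spam"; now is a timezone-aware datetime.
    """
    msg = message_from_string(raw)
    date_header = msg.get("Date")
    if date_header is None:
        return False  # age cannot be checked, so leave the message out
    try:
        sent = parsedate_to_datetime(date_header)
    except (TypeError, ValueError):
        return False  # unparseable Date header
    if sent.tzinfo is None:
        # A "-0000" zone yields a naive datetime; treat it as UTC.
        sent = sent.replace(tzinfo=timezone.utc)
    if now - sent > MAX_AGE[kind]:
        return False  # older than the age limit for this corpus
    if is_null_sender(msg):
        # Bounces are dropped, except from ham when the exception is enabled.
        return kind == "ham" and keep_ham_bounces
    return True
```

Calling `keep_message(raw, "spam", now)` on each hand-verified spam message, and likewise with `"ham"` for the ham pile, would give a first pass before the manual review described in the policy.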
