Re: recommended setup to catch spam for thousands of domains?

Ryan Thompson 28 Jul 2004 22:45:10 -0000

Miles Keaton wrote to Ryan Thompson:

AWESOME info, Ryan. I really appreciate the time & insight.


Great! Glad I could help.

Just curious - for site-wide Bayesian training, then:
- (OR) - What are you training it with, if not emails from your clients/users?


We've really worked on our rules. Our average ham score is well below
zero, and 99.5% of our spam scores > 10.0 (our threshold is 7.0). We're
autolearning close to 90% of all mail that comes through our systems,
and I've yet to find an autolearn mistake.
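
A minimal sketch of the threshold-based autolearn decision described
above, in Python. Only the 7.0 tagging threshold comes from this
message. The two learning thresholds are assumptions, set to what I
understand to be SpamAssassin's documented defaults for
bayes_auto_learn_threshold_nonspam and bayes_auto_learn_threshold_spam,
and the function names are hypothetical. SpamAssassin's real autolearn
logic applies further conditions that are left out here.

```python
# Illustrative values only; assumptions, not this site's real settings.
HAM_LEARN_BELOW = 0.1      # assumed: learn as ham below this score
SPAM_LEARN_AT_LEAST = 12.0 # assumed: learn as spam at or above this score
SPAM_THRESHOLD = 7.0       # tagging threshold quoted in the message


def autolearn_action(score: float) -> str:
    """Return 'ham', 'spam', or 'none' for a rule-based message score."""
    if score < HAM_LEARN_BELOW:
        return "ham"
    if score >= SPAM_LEARN_AT_LEAST:
        return "spam"
    # Borderline mail is tagged (or not) but never fed back to Bayes.
    return "none"


def autolearn_rate(scores) -> float:
    """Fraction of messages that would be autolearned either way."""
    scores = list(scores)
    if not scores:
        return 0.0
    learned = sum(1 for s in scores if autolearn_action(s) != "none")
    return learned / len(scores)
```

With strongly separated scores (ham well below zero, most spam above
10.0), most mail falls outside the middle band, which is consistent
with a high autolearn rate.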

The nice thing about autolearning is, even though it's just making
scores for strongly-classified spam and ham more extreme, it *does*
still identify spam and ham terms that greatly influence the Bayes
scores for borderline emails that don't meet either autolearn threshold.
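
To illustrate how tokens learned from strongly classified mail can
shift a borderline message, here is a toy sketch in Python. It uses
plain naive Bayes with add-one smoothing and equal priors, which is
NOT the combining method SpamAssassin itself uses. The class name,
tokenizer, and sample messages are all made up for the example.

```python
import math
from collections import Counter


class TinyBayes:
    """Toy token classifier; a stand-in, not SpamAssassin's algorithm."""

    def __init__(self):
        self.spam_tokens = Counter()  # messages containing token, spam side
        self.ham_tokens = Counter()   # messages containing token, ham side
        self.n_spam = 0
        self.n_ham = 0

    def learn(self, text: str, is_spam: bool) -> None:
        tokens = set(text.lower().split())
        if is_spam:
            self.spam_tokens.update(tokens)
            self.n_spam += 1
        else:
            self.ham_tokens.update(tokens)
            self.n_ham += 1

    def spam_probability(self, text: str) -> float:
        log_odds = 0.0
        for tok in set(text.lower().split()):
            # Add-one smoothing so unseen tokens stay neutral.
            p_spam = (self.spam_tokens[tok] + 1) / (self.n_spam + 2)
            p_ham = (self.ham_tokens[tok] + 1) / (self.n_ham + 2)
            log_odds += math.log(p_spam / p_ham)
        return 1.0 / (1.0 + math.exp(-log_odds))


bayes = TinyBayes()
# Pretend these two were autolearned from strongly classified mail.
bayes.learn("cheap pills online now", is_spam=True)
bayes.learn("meeting agenda for monday", is_spam=False)
```

A later message that shares tokens with the learned spam, such as
"cheap pills", now leans spammy even if the static rules alone would
have left it in the middle band.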

We still do as much manual training as possible to catch the edge cases.
Finding plenty of spam is easy. Finding ham is a bit harder. Still,
we can usually feed a few thousand messages per week through sa-learn
that *weren't* previously auto-learned. We get this email from our own
mailboxes, spamtraps, and any submissions that happen to come in,
although those are relatively few. As it turns out, this seems to give
us quite a representative sample of our spam and ham. I haven't run the
numbers for a while, but there aren't many contradictory Bayes scores
in our spam and ham corpora.
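
A small sketch of how manual training through sa-learn could be
scripted, in Python. The --spam and --ham options are sa-learn's own.
The helper names and the directory paths are hypothetical, and the
dry-run default is a choice made for this example.

```python
import subprocess


def sa_learn_command(corpus_path: str, is_spam: bool) -> list:
    """Build the sa-learn argument list for one corpus directory."""
    flag = "--spam" if is_spam else "--ham"
    return ["sa-learn", flag, corpus_path]


def train(corpus_path: str, is_spam: bool, dry_run: bool = True) -> list:
    """Run sa-learn on a corpus, or just return the command if dry_run."""
    cmd = sa_learn_command(corpus_path, is_spam)
    if not dry_run:
        # Raises CalledProcessError if sa-learn exits non-zero.
        subprocess.run(cmd, check=True)
    return cmd


# Hypothetical locations for hand-sorted mail.
spam_cmd = train("/var/spool/corpus/spam", is_spam=True)
ham_cmd = train("/var/spool/corpus/ham", is_spam=False)
```

Keeping spamtrap mail, mailbox samples, and user submissions in
separate directories before training makes it easier to review each
source for misfiled messages.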

- Ryan

--
  Ryan Thompson <[EMAIL PROTECTED]>

  SaskNow Technologies - http://www.sasknow.com
  901-1st Avenue North - Saskatoon, SK - S7K 1Y4

        Tel: 306-664-3600   Fax: 306-244-7037   Saskatoon
  Toll-Free: 877-727-5669     (877-SASKNOW)     North America
