Miles Keaton wrote to Ryan Thompson:
> AWESOME info, Ryan. I really appreciate the time & insight.
Great! Glad I could help.
> Just curious - for site-wide Bayesian training, then:
> - (OR) - What are you training it with, if not emails from your clients/users?
We've really worked on our rules. Our average ham score is well below zero, and 99.5% of our spam scores > 10.0 (our threshold is 7.0). We're autolearning close to 90% of all mail that comes through our systems, and I've yet to find an autolearn mistake.
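To make the autolearn cut-offs concrete, here is a rough sketch of the decision in Python. The two constants are assumptions: they are SpamAssassin's documented defaults for bayes_auto_learn_threshold_nonspam and bayes_auto_learn_threshold_spam, not necessarily the values used on this site, and autolearn_decision is a hypothetical helper. Real autolearning also works from a score computed without the Bayes rules and applies a few extra conditions, which this ignores.

```python
from typing import Optional

# Assumed values: SpamAssassin's documented defaults, not this site's settings.
AUTOLEARN_HAM_BELOW = 0.1    # bayes_auto_learn_threshold_nonspam
AUTOLEARN_SPAM_ABOVE = 12.0  # bayes_auto_learn_threshold_spam


def autolearn_decision(score: float) -> Optional[str]:
    """Return 'ham', 'spam', or None (message is not autolearned).

    Simplified: only the score is checked, nothing else.
    """
    if score < AUTOLEARN_HAM_BELOW:
        return "ham"
    if score > AUTOLEARN_SPAM_ABOVE:
        return "spam"
    return None  # borderline mail falls between the thresholds


if __name__ == "__main__":
    for s in (-2.3, 6.0, 15.0):
        print(s, autolearn_decision(s))
```

Anything that comes back as None is the borderline mail that only manual training or later Bayes scoring can sort out.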
The nice thing about autolearning is, even though it's just making scores for strongly-classified spam and ham more extreme, it *does* still identify spam and ham terms that greatly influence the Bayes scores for borderline emails that don't meet either autolearn threshold.
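A toy illustration of that effect, in Python. This is a plain naive Bayes sketch with add-one smoothing, not SpamAssassin's actual Bayes implementation, and the messages, the train function and the spam_probability function are all made up for the example. Tokens are learned only from strongly-classified mail, yet they still move the probability of a message that was never autolearned.

```python
import math
from collections import Counter


def train(messages):
    """Count, per label, how many messages contain each token."""
    counts = {"spam": Counter(), "ham": Counter()}
    totals = Counter()
    for label, text in messages:
        totals[label] += 1
        counts[label].update(set(text.lower().split()))
    return counts, totals


def spam_probability(text, counts, totals):
    """Combine per-token evidence into a spam probability (naive Bayes,
    add-one smoothing, equal priors)."""
    log_odds = 0.0
    for tok in set(text.lower().split()):
        p_spam = (counts["spam"][tok] + 1) / (totals["spam"] + 2)
        p_ham = (counts["ham"][tok] + 1) / (totals["ham"] + 2)
        log_odds += math.log(p_spam / p_ham)
    return 1.0 / (1.0 + math.exp(-log_odds))


# Hypothetical autolearned mail: only the obvious spam and ham.
autolearned = [
    ("spam", "cheap pills online now"),
    ("spam", "cheap watches online"),
    ("ham", "meeting agenda for monday"),
    ("ham", "monday lunch agenda"),
]
counts, totals = train(autolearned)

if __name__ == "__main__":
    # Borderline messages that were never autolearned themselves.
    print(spam_probability("cheap online offer", counts, totals))
    print(spam_probability("monday agenda update", counts, totals))
```

The first borderline message leans spam and the second leans ham, purely because of tokens picked up from mail that was already easy to classify.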
We still do as much manual training as possible to catch the edge cases. Finding plenty of spam is easy. Finding ham is a bit harder. Still, we're usually able to feed a few thousand messages per week through sa-learn that *weren't* previously auto-learned. We get this mail from our own mailboxes, spamtraps, and any submissions which do happen to come in, although those are relatively few. As it turns out, we're apparently able to get quite a representative sample of our spam and ham. I haven't run the numbers for a while, but there aren't many contradictory Bayes scores in our spam and ham corpora.
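For what "running the numbers" could look like, here is a minimal sketch in Python. Everything in it is an assumption for illustration: the record format (a corpus label plus a Bayes probability per message), the 0.5 cut-off, the sample data, and the find_contradictions function. It simply lists messages whose Bayes probability disagrees with the corpus they were filed in.

```python
def find_contradictions(records, cutoff=0.5):
    """Return records whose Bayes probability disagrees with their label.

    records: iterable of (label, bayes_probability) pairs, where label is
    'spam' or 'ham'. The cutoff of 0.5 is an assumed dividing line.
    """
    contradictions = []
    for label, prob in records:
        if label == "spam" and prob < cutoff:
            contradictions.append((label, prob))
        elif label == "ham" and prob > cutoff:
            contradictions.append((label, prob))
    return contradictions


# Made-up sample data, not real corpus results.
sample = [
    ("spam", 0.99),
    ("spam", 0.20),  # filed as spam, but Bayes thinks it is hammy
    ("ham", 0.01),
    ("ham", 0.85),   # filed as ham, but Bayes thinks it is spammy
    ("ham", 0.40),
]

if __name__ == "__main__":
    bad = find_contradictions(sample)
    print(f"{len(bad)} of {len(sample)} messages contradict their label")
```

A low contradiction count suggests the hand-trained and autolearned mail agree with each other; a high one would be a hint to review the corpora.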
- Ryan
-- Ryan Thompson <[EMAIL PROTECTED]>
SaskNow Technologies - http://www.sasknow.com 901-1st Avenue North - Saskatoon, SK - S7K 1Y4
Tel: 306-664-3600 Fax: 306-244-7037 Saskatoon Toll-Free: 877-727-5669 (877-SASKNOW) North America