On 05/30/07 10:56, [EMAIL PROTECTED] (Charles Mangin) wrote:
> one of them that i'm working on now is bayesian filtering within
> spamassassin. i've got it marking/learning spam and ham, but it's slow
> going. what i'd love to find is a compilation of example spams that i
> can dump into my database so it can start with a critical mass of spam
> to check against. jumpstart the "training" process, so to speak.
> do any mail admins on this list know where to get such an archive,
> other than to open up one of my own domains to the floodgates and just
> capture it myself?
It is highly recommended that you train your Bayes database only
with messages that have actually been received at your own
installation. Using someone else's spam and ham is likely to skew
your database and result in inaccuracies. In other words,
SpamAssassin needs to know what _you_ see as spam and ham, not
what someone else sees.
Obtaining ham messages is fairly easy. Just use known good
messages that you and your users have received. For spam, if you
have any stocks of old spam that you've received, start with
that (as long as they're not too old; they should reflect the
character of the spam that you are currently receiving). If
you're having SA quarantine messages that it has classified as
spam, feed those messages to sa-learn or `spamassassin -r` after
reviewing them to weed out false positives. If you haven't done
so already, turn on SA's autolearn feature, with the caveat that
if you are autolearning, you need to keep an eye on false
positives and false negatives so that you can correct SA if it
autolearns a message incorrectly. Of course, you need to keep an
eye on false results anyway, but it's especially important if
you are autolearning.
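The manual training described above can be sketched as a small shell helper. This is only a sketch under assumptions: the folder paths, the variable names, and the function names `train_bayes` and `relearn` are made up for illustration, and you should review the folders by hand before learning from them. The `--ham` and `--spam` options are sa-learn's own. Re-learning a message with the opposite class is how you correct a mistake; as far as I know, sa-learn forgets the earlier classification on its own when you do that.

```shell
#!/bin/sh
# Sketch: feed reviewed, locally received mail to the Bayes database.
# HAM_DIR and SPAM_DIR are hypothetical paths; point them at your own folders.
HAM_DIR="${HAM_DIR:-$HOME/Maildir/.ReviewedHam/cur}"
SPAM_DIR="${SPAM_DIR:-$HOME/Maildir/.ReviewedSpam/cur}"
# Set SA_LEARN="echo sa-learn" for a dry run that only prints the commands.
SA_LEARN="${SA_LEARN:-sa-learn}"

# Learn known-good mail as ham, then reviewed quarantine mail as spam.
train_bayes() {
    $SA_LEARN --ham "$HAM_DIR" && $SA_LEARN --spam "$SPAM_DIR"
}

# Correct a false positive or false negative by re-learning one message
# file with the right class.
# Usage: relearn ham /path/to/message   or   relearn spam /path/to/message
relearn() {
    case "$1" in
        ham|spam) $SA_LEARN "--$1" "$2" ;;
        *) echo "usage: relearn ham|spam FILE" >&2; return 2 ;;
    esac
}
```

Source the file and call `train_bayes` after each review pass. Run it as the same user that SA runs as for your mail, so that the right Bayes database gets updated.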
As you've noted, training a Bayes database can take a while,
depending on how exactly you collect sample messages from which
to learn, but if you do it carefully you'll end up with much
better accuracy in the long term.
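To see how far along the training is, you can ask sa-learn for its counters. The sketch below is an assumption-laden helper, not an official tool: the function names are made up, and it assumes that `sa-learn --dump magic` prints the message count in the third column of the `nspam` and `nham` lines, which is what I see here. By default SA does not use Bayes results at all until it has learned 200 spam and 200 ham messages (the `bayes_min_spam_num` and `bayes_min_ham_num` settings), so these counts tell you when Bayes will start contributing to scores.

```shell
#!/bin/sh
# Sketch: report how many messages the Bayes database has learned so far.
SA_LEARN="${SA_LEARN:-sa-learn}"

# Pull one counter ("nspam" or "nham") out of `sa-learn --dump magic`
# output read from stdin. Assumes the count is in the third column.
bayes_count() {
    awk -v key="$1" '$0 ~ ("non-token data: " key "$") { print $3 }'
}

# Print both counters on one line; missing counters are shown as 0.
bayes_progress() {
    dump=$($SA_LEARN --dump magic) || return 1
    nspam=$(printf '%s\n' "$dump" | bayes_count nspam)
    nham=$(printf '%s\n' "$dump" | bayes_count nham)
    echo "learned spam: ${nspam:-0}, learned ham: ${nham:-0}"
}
```

Call `bayes_progress` as the user that owns the Bayes database.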
Some other ways to improve SA's accuracy, in no particular order:
- Run sa-update regularly to keep SA's default rulesets up to date.
- Consider using additional rulesets like the ones from
<http://www.rulesemporium.com/rules.htm>. Review each ruleset
before you enable it, though, because not all of them are
appropriate for every installation.
- If you decide to use rulesets from Rules Emporium, use the
RulesDuJour script to keep them up to date.
- If you can afford the computing and network overhead, consider
turning on SA's network tests so that you can take advantage of
things like the URIBL tests.
- If you can afford the computing and network overhead, consider
installing and using Razor, Pyzor, and/or DCC. The great majority
of spam in my quarantine folder has hits on Razor and/or URIBL
rules.
- Don't just accept the default spam threshold of 5.0. If you're
getting too many false positives, raise it; if you're getting too
many false negatives, lower it. Adjust in small increments and
wait a while between adjustments to make sure that you don't get
more false results than you can tolerate.
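Several of the items above come down to one scheduled job and a few lines in local.cf, roughly as sketched below. Treat all of it as assumptions to adapt: the function names, the `SA_UPDATE`, `SA_LINT`, and `RELOAD_CMD` variables, and the `true` placeholder for the reload command are hypothetical, and the location of local.cf varies by installation. The config options themselves are standard SA settings, but `use_razor2`, `use_pyzor`, and `use_dcc` only have an effect if the matching plugin is loaded and the client software is installed, and network tests in general only run if SA is not started in local-only mode (`-L`).

```shell
#!/bin/sh
# Sketch: nightly rule update plus candidate local.cf settings.
SA_UPDATE="${SA_UPDATE:-sa-update}"
SA_LINT="${SA_LINT:-spamassassin --lint}"
# Placeholder: set this to whatever restarts or reloads spamd on your system.
RELOAD_CMD="${RELOAD_CMD:-true}"

# sa-update exits 0 when new rules were installed and 1 when there was
# nothing new; anything else is treated as an error here.
update_rules() {
    status=0
    $SA_UPDATE || status=$?
    case "$status" in
        0) $SA_LINT && $RELOAD_CMD ;;
        1) return 0 ;;
        *) echo "sa-update failed with exit code $status" >&2
           return "$status" ;;
    esac
}

# Print candidate local.cf lines; review them before adding them to your config.
print_local_cf() {
    cat <<'EOF'
# Spam threshold; 5.0 is the default. Change it in small steps.
required_score 5.0
# Autolearn, with the stock thresholds spelled out.
bayes_auto_learn 1
bayes_auto_learn_threshold_nonspam 0.1
bayes_auto_learn_threshold_spam 12.0
# Network tests: DNS blocklists plus the collaborative checksum services.
skip_rbl_checks 0
use_razor2 1
use_pyzor 1
use_dcc 1
EOF
}
```

Call `update_rules` from cron once a day, and run `spamassassin --lint` after any manual change to local.cf so that a typo does not silently disable your settings.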
--
Christopher Bort
Homes Magazine
email: <[EMAIL PROTECTED]>
website: <http://www.homesmagazine.com/>
FAX: 775-284-1298
Phone: 775-284-1294
Real Estate Advertising/ Web Products/ Digital Printing Services
Serving: Wine Country Napa & Sonoma County, Marin County, San
Francisco Bay Area, Santa Cruz County, Monterey County, San Luis
Obispo County & Santa Barbara County, Reno/Sparks & Carson Valley,
North Lake Tahoe & Truckee & South Lake Tahoe