On 05/31/07 11:31, [EMAIL PROTECTED] (Lewis Butler) wrote:
On 30-May-2007, at 13:16, Christopher Bort wrote:
It is highly recommended that you train your Bayes database
only with messages that have actually been received at your
own installation. Using someone else's spam and ham is likely
to skew your database and result in inaccuracies. In other words,
SpamAssassin needs to know what _you_ see as spam and ham, not
what someone else sees.
Spam AND ham, yes, but there is nothing wrong with using
someone else's spamarchive to help train up your bayes. While
definitions of ham vary widely, the same cannot be said for spam.
Sure it can. I've seen a lot of different definitions of 'spam,'
'UCE,' 'UBE,' etc. They don't all agree on everything, starting
with what to call it. Even within a given definition, what is or
is not spam can be subjective. As a case in point, I work for a
company that publishes real estate advertising magazines. Our
advertisers routinely send us ad copy by e-mail. In many
contexts, much of it would look quite spammy. In our context it
is not and I need to make certain that such messages are not
classified as spam. Also, our sales reps get a lot of e-mail
advertisements and 'newsletters' all about the latest greatest
things happening in real estate, mortgages and generally bilking
people out of as much money as possible for houses that are
several times bigger than what they need. To me, a great deal of
it is absolutely spam. To our sales people who are receiving it,
it's keeping up with trends in the market to which they sell
(i.e. realtors). As much as the stuff might turn my stomach, a
lot of it I cannot have SA classifying as spam.
The point is still that the spam you train SpamAssassin with
must reflect the nature of spam that your installation receives.
Can you get away with using someone else's 'generic' spam
archive for training? Sure, but you would be well advised to
first review it thoroughly to make sure that it fairly closely
matches the spam you need to catch on your own server(s).
Another thing that has been implied in this thread, but not
stated explicitly, is that Bayes training needs to continue on a
regular basis beyond the initial training period. The mix of
spam and ham that hits a given server now may not be exactly the
same as what that server will see six months from now, so
training needs to keep up with what is currently being received.
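
As a rough sketch of what regular retraining could look like, the
snippet below builds the sa-learn invocations for a spam folder and
a ham folder so they can be run from a scheduled job. The folder
paths and the helper names (sa_learn_command, retrain) are
hypothetical; only the sa-learn --spam and --ham options come from
SpamAssassin itself. It assumes the folders are directories of
individual messages; mbox files would also need --mbox.

```python
import subprocess


def sa_learn_command(kind: str, folder: str) -> list[str]:
    """Build the sa-learn command line for one folder of messages."""
    if kind not in ("spam", "ham"):
        raise ValueError("kind must be 'spam' or 'ham'")
    return ["sa-learn", f"--{kind}", folder]


def retrain(spam_dir: str, ham_dir: str, dry_run: bool = True) -> list[list[str]]:
    """Return the commands for one training pass; run them unless dry_run."""
    commands = [
        sa_learn_command("spam", spam_dir),
        sa_learn_command("ham", ham_dir),
    ]
    if not dry_run:
        for cmd in commands:
            # check=True raises CalledProcessError if sa-learn exits non-zero
            subprocess.run(cmd, check=True)
    return commands


if __name__ == "__main__":
    # Hypothetical folders holding recently hand-sorted mail from this server.
    for cmd in retrain("/var/mail/training/spam", "/var/mail/training/ham"):
        print(" ".join(cmd))
```

With dry_run left at its default the script only prints the
commands, which makes it easy to check the paths before letting a
scheduled job run them for real.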
The best thing for accuracy is keeping SpamAssassin up to date
(I am right now updating to 3.2), keeping its rules updated,
and running some of the other rule sets. I use RulesDuJour
myself, but read up on the wiki:
<http://wiki.apache.org/spamassassin/CustomRulesets>
<http://www.exit0.us/index.php?pagename=RulesDuJour>
Yes, I recommended using additional rulesets and RulesDuJour in
my previous post, as well as regular use of sa-update to keep
SA's default rules up to date. Good point about keeping SA
itself up to date. Subscribing to the SpamAssassin-Announce list
is a good way to know when updates are available.
- Don't necessarily just accept the default spam threshold of
5.0. If you're getting too many false results, adjust the
threshold in small increments and wait a while between
adjustments to make sure that you don't get more false results
than you can tolerate.
I have to disagree. I never adjust my threshold. I throw
things at bayes until they 'stick' and I may, on rare
occasions, adjust the score for a rule.
Note that I said 'don't necessarily.' By that I meant that 5.0
may work for a lot of installations, but it is something to keep
track of because adjusting it might be beneficial. This just
points out once again that everyone's installation is different
enough that there's no one-size-fits-all strategy. When I first
installed SpamAssassin, before I had a well-trained Bayes
database and before adding any custom rulesets, I got a lot of
false positives with the threshold at 5.0 points, at least
partly due to the circumstances described above. I raised the
threshold to, IIRC, 7.5 or so to eliminate the false positives.
Of course, that introduced a lot of false negatives. As I
trained my Bayes database so that it became more finely tuned to
my server's view of both spam and ham, and added a few rulesets,
I gradually ratcheted the threshold downward. It's currently at
4.9. I get a fair amount of spam that scores around 5.0 or so,
but only the occasional piece that scores at 4.9, so I've kept
the threshold steady there for a while now.
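
One way to make the threshold decision less of a guess is to count
false positives and false negatives at a few candidate thresholds
over mail that has already been hand-classified. The sketch below
does that. The sample scores are invented for illustration, not
measurements from any server, and the function names are
hypothetical. It assumes a message is tagged as spam when its score
is greater than or equal to the threshold.

```python
def count_errors(scored, threshold):
    """Return (false_positives, false_negatives) at the given threshold.

    scored is a list of (score, is_spam) pairs, where is_spam is the
    hand-assigned classification of the message.
    """
    # Assumption: score >= threshold means the message is tagged as spam.
    fp = sum(1 for score, is_spam in scored if score >= threshold and not is_spam)
    fn = sum(1 for score, is_spam in scored if score < threshold and is_spam)
    return fp, fn


def sweep(scored, thresholds):
    """Map each candidate threshold to its (fp, fn) counts."""
    return {t: count_errors(scored, t) for t in thresholds}


# Invented sample: (SpamAssassin score, hand classification).
sample = [
    (2.1, False),
    (5.6, False),  # e.g. spammy-looking ad copy that is really ham
    (6.9, False),
    (4.9, True),
    (5.0, True),
    (8.3, True),
    (12.4, True),
]

if __name__ == "__main__":
    for threshold, (fp, fn) in sweep(sample, [4.9, 5.0, 7.5]).items():
        print(f"threshold {threshold}: {fp} false positives, {fn} false negatives")
```

Re-running a count like this after each round of Bayes training
shows whether the threshold can be ratcheted back down without
letting false positives creep in again.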
--
Christopher Bort
Homes Magazine
email: <[EMAIL PROTECTED]>
website: <http://www.homesmagazine.com/>
FAX: 775-284-1298
Phone: 775-284-1294
Real Estate Advertising/ Web Products/ Digital Printing Services
Serving: Wine Country Napa & Sonoma County, Marin County, San
Francisco Bay Area, Santa Cruz County, Monterey County, San Luis
Obispo County & Santa Barbara County, Reno/Sparks & Carson Valley,
North Lake Tahoe & Truckee & South Lake Tahoe