Re: Bayes less effective over time?

Scot L. Harris 4 Jul 2004 00:35:01 -0000

On Sat, 2004-07-03 at 19:35, Mike McMullen wrote: 
> For me bayes just keeps getting better and better. Over the last month, I've
> seen it go from the 3rd most often triggered rule to the 1st.
> 
> We use IMAP and have procmail store messages marked as Spam in
> a Spam folder for each user. The users also have a Ham folder. If they get
> a FP, they move it to the Ham folder. A script runs every night and trains
> against each person's Spam and Ham folders.
> 
> Our "magic" looks like this:
> 
> 0.000          0          2          0  non-token data: bayes db version
> 0.000          0       8672          0  non-token data: nspam
> 0.000          0      63552          0  non-token data: nham
> 0.000          0     175517          0  non-token data: ntokens
> 0.000          0 1087428649          0  non-token data: oldest atime
> 0.000          0 1088895973          0  non-token data: newest atime
> 0.000          0 1088882101          0  non-token data: last journal sync atime
> 0.000          0 1088825438          0  non-token data: last expiry atime
> 0.000          0    1382400          0  non-token data: last expire atime delta
> 0.000          0       8023          0  non-token data: last expire reduction count
> 
> We are a small shop. We average about 1200 delivered messages a day. 
> We reject at the sendmail level those blocklisted sites.  Our rejection rate
> is a little over 34%. After that, about 6.5% of delivered mail is marked
> as Spam with very very few FPs.
> 
> Of the 6.5% spam rate, bayes_99 was triggered 82% of the time and that
> percentage keeps rising.
> 
> I get maybe one spam that slips through every 2-3 days.
> 
> Regarding spammers going back to their old tricks, appropriate
> up-to-date rules take care of that when they first do it. With
> continued training, Bayes will restore any needed tokens that may
> have been deleted.
> 
> Spamassassin's best success lies in not depending on just rules or just bayes
> to solve spamming problems. It's the combination of each that gives the
> best overall performance. Especially when used with MTA level rejection.
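
A nightly training job like the one described above could be sketched
roughly as follows.  This is only an illustration, not the script Mike
actually runs: the mail root layout (one directory per user, each holding
"Spam" and "Ham" folders with one message per file) and the function names
are assumptions.  It relies on sa-learn's --spam, --ham, --no-sync and
--sync options; mbox-format folders would also need --mbox, and per-user
bayes databases would need the job to run as each user.

```python
import subprocess
from pathlib import Path


def build_training_commands(mail_root, spam_name="Spam", ham_name="Ham"):
    """Return one sa-learn command per Spam/Ham folder found under mail_root.

    Assumed (hypothetical) layout: mail_root/<user>/Spam and mail_root/<user>/Ham.
    """
    commands = []
    for user_dir in sorted(Path(mail_root).iterdir()):
        if not user_dir.is_dir():
            continue
        for flag, folder in (("--spam", spam_name), ("--ham", ham_name)):
            target = user_dir / folder
            if target.exists():
                # --no-sync defers the journal sync until all folders are learned
                commands.append(["sa-learn", "--no-sync", flag, str(target)])
    return commands


def train_all(mail_root):
    """Run every training command, then sync the bayes journal once."""
    for cmd in build_training_commands(mail_root):
        subprocess.run(cmd, check=True)
    subprocess.run(["sa-learn", "--sync"], check=True)
```

Calling train_all() on the mail root from cron each night would match the
schedule described; build_training_commands() alone only lists what would
be run.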


This past week I figured out that to get really good results with SA I
need to get a wider spread between ham and spam scores.  Since the first
part of the year SA had been doing pretty well.  In the last couple of
weeks more messages started getting by.  I realized that we had been
lucky since the spread between spam and the threshold score was very
thin all along.  

In the past week I have applied a couple of the SARE rule sets which
helped a lot.  I also wrote one rule specific to our site which appears
to be catching a large number of spam and increasing the score on other
spam significantly.  

I had become concerned that the bayes database had become skewed in some
way even though it is tagging virtually all spam messages.  Also the
spam/ham count is something like 47000/4000 in my bayes database.  I note
in your dump that you have significantly more ham than spam, the opposite
of mine.  I had asked before about care and feeding of the database, with
only a couple of responses.  I did implement some of the SARE rulesets
as recommended and it has helped significantly.  I am also investigating
implementing the SURBL list.  
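
To compare databases like this without eyeballing the dump, something
along the lines of the sketch below may help.  It assumes the text comes
from "sa-learn --dump magic" in the layout quoted above, where the value
of each "non-token data" row sits in the third column; the function names
are made up for illustration.

```python
def parse_magic(dump_text):
    """Map each 'non-token data' label to its integer value.

    Assumes the value is the third whitespace-separated column, as in the
    dump quoted in this thread.  Leading '>' quote markers are ignored.
    """
    values = {}
    for line in dump_text.splitlines():
        line = line.lstrip("> ").strip()
        if "non-token data:" not in line:
            continue
        fields, label = line.split("non-token data:", 1)
        parts = fields.split()
        if len(parts) < 3:
            continue
        values[label.strip()] = int(parts[2])
    return values


def spam_to_ham_ratio(values):
    """Return nspam divided by nham from a parsed dump."""
    return values["nspam"] / values["nham"]
```

With the quoted figures (nspam 8672, nham 63552) this gives a ratio of
roughly 0.14, while my own counts of about 47000/4000 come out near 11.75.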

This also led me to start trying to find some way to gather statistics
on what is happening.  I have implemented one suggestion which simply
greps the number of ham/spam from the log file and am currently feeding
that to an rrd database which can be displayed on a web page.  That is
the first item.  The next is to collect info on the number of messages
in each score bucket.  I have run some stuff by hand which shows a major
clumping of scores close to the threshold value.  I am hoping that with
the implementation of the SARE rules and other items I will see
this clumping move up the scale, giving me a wider gap between ham and
spam.  I figure this is what will show me that the changes are having a
positive effect (besides not getting any spam through and no false
positives).  
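
For the score-bucket part, a rough sketch of the kind of log scan I have
in mind is below.  It assumes spamd result lines contain either
"identified spam (score/threshold)" or "clean message (score/threshold)";
other setups log differently and would need a different pattern.  The
function name and bucket width are just placeholders.

```python
import math
import re
from collections import Counter

# Assumed spamd result format, e.g. "... identified spam (15.2/5.0) for ..."
RESULT_RE = re.compile(
    r"(identified spam|clean message) "
    r"\((-?\d+(?:\.\d+)?)/(-?\d+(?:\.\d+)?)\)"
)


def summarize(lines, bucket_width=1.0):
    """Count ham/spam verdicts and group scores into fixed-width buckets.

    Returns (totals, buckets); each bucket key is the lower edge of its range.
    """
    totals = Counter()
    buckets = Counter()
    for line in lines:
        match = RESULT_RE.search(line)
        if not match:
            continue
        verdict = match.group(1)
        score = float(match.group(2))
        totals["spam" if verdict == "identified spam" else "ham"] += 1
        buckets[math.floor(score / bucket_width) * bucket_width] += 1
    return totals, buckets
```

Feeding it the lines of the mail log gives the two totals for the rrd
graph plus a histogram, so a clump of scores sitting just above or below
the threshold shows up as a few heavily populated neighbouring buckets.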

What has really surprised me is the dramatic increase in spam in the
last 6 months.  The company previously was getting maybe a couple of
hundred spam messages a day.  Now it appears that we are getting a
couple of thousand a day.  I have implemented blocks on certain IP
addresses which I have identified as sending spam.  The other day I had
over 4000 connection attempts on port 25 from those IP addresses.  And
that is stuff that did not even get to SA.  

I have been told that if we had not implemented SA, the company
would probably have just turned off email.  (It is a fairly small
company and could get by working with faxes and phones.)

Apologies for the long post.  But this past week I have spent a lot of
time on this stuff.  Hopefully I will get the stats I need set up next
week as well as a few new rule sets and can get on with doing real work
for a while.  :)
-- 
Scot L. Harris
[EMAIL PROTECTED]

BEWARE!  People acting under the influence of human nature.
