Re: Bayes less effective over time?

Mike McMullen 4 Jul 2004 01:46:53 -0000

> On Sat, 2004-07-03 at 19:35, Mike McMullen wrote: 
> > For me bayes just keeps getting better and better. Over the last month,
> > I've seen it go from the 3rd most often triggered rule to the 1st.
> > 
> > We use IMAP and have procmail file messages marked as Spam into a Spam
> > folder for each user. The users also have a Ham folder. If they get a FP,
> > they move it to the Ham folder. A script runs every night and trains
> > against each person's Spam and Ham folders.
> > 
> > Our "magic" looks like this:
> > 
> > 0.000          0          2          0  non-token data: bayes db version
> > 0.000          0       8672          0  non-token data: nspam
> > 0.000          0      63552          0  non-token data: nham
> > 0.000          0     175517          0  non-token data: ntokens
> > 0.000          0 1087428649          0  non-token data: oldest atime
> > 0.000          0 1088895973          0  non-token data: newest atime
> > 0.000          0 1088882101          0  non-token data: last journal sync atime
> > 0.000          0 1088825438          0  non-token data: last expiry atime
> > 0.000          0    1382400          0  non-token data: last expire atime delta
> > 0.000          0       8023          0  non-token data: last expire reduction count
> > 
> > We are a small shop. We average about 1200 delivered messages a day.
> > We reject mail from blocklisted sites at the sendmail level.  Our rejection
> > rate is a little over 34%. After that, about 6.5% of delivered mail is marked
> > as Spam, with very few FPs.
> > 
> > Of the 6.5% spam rate, bayes_99 was triggered 82% of the time and that
> > percentage keeps rising.
> > 
> > I get maybe one spam that slips through every 2-3 days.
> > 
> > Regarding spammers going back to their old tricks: up-to-date rules
> > take care of that when they first try it. With continued training,
> > Bayes will restore any needed tokens that may have been deleted.
> > 
> > SpamAssassin's best success lies in not depending on just rules or just
> > bayes to solve spamming problems. It's the combination of the two that
> > gives the best overall performance, especially when used with MTA-level
> > rejection.
> 
> This past week I figured out that to get really good results with SA I
> need to get a wider spread between ham and spam scores.  Since the first
> part of the year SA had been doing pretty well.  In the last couple of
> weeks more messages started getting by.  I realized that we had been
> lucky since the spread between spam and the threshold score was very
> thin all along.  
> 
> In the past week I have applied a couple of the SARE rule sets which
> helped a lot.  I also wrote one rule specific to our site which appears
> to be catching a large number of spam and increasing the score on other
> spam significantly.  
> 
> I had become concerned that the bayes database had become skewed somehow,
> even though it is tagging virtually all spam messages.  Also, the ratio of
> spam to ham is something like 47000/4000 in my bayes database.  I note in
> your dump that you have significantly more ham than spam, opposite of
> mine.  I had asked before about care and feeding of the database with
> only a couple of responses.  I did implement some of the SARE rulesets
> as recommended and it has helped significantly.  I am also investigating
> implementing the surbl list.  
> 

The reason we have more ham than spam is that we use DNS-based
blacklists such as Spamhaus and SpamCop to reject mail from reported spam sites.
This way we stop them before they even get into the system.
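
For anyone who has not looked at how a DNS-based blacklist lookup works:
the MTA reverses the octets of the connecting IP, appends the list's zone,
and does an ordinary A-record query. A rough sketch of building that query
name follows. The function name dnsbl_name is made up for illustration,
and the zone shown in the usage note is only an example, so check the
list's own documentation for the zone and usage terms that apply to you.

```shell
#!/bin/sh
# Build the DNSBL query name for an IPv4 address:
# octets reversed, then the list's zone appended.
# dnsbl_name is a hypothetical helper, not part of any tool.
dnsbl_name() {
    ip="$1"
    zone="$2"
    # Print nothing unless the input has exactly four dot-separated fields.
    echo "$ip" | awk -F. -v z="$zone" \
        'NF == 4 { print $4 "." $3 "." $2 "." $1 "." z }'
}
```

Something like dig +short "$(dnsbl_name 127.0.0.2 sbl.spamhaus.org)" A
then does the actual lookup. 127.0.0.2 is the conventional test address
that lists are expected to answer for. An answer generally means "listed"
and no answer means "not listed", but some lists return special codes
(for example when they refuse the resolver), so look at the answer itself
before trusting it.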

For us that meant that in the last 70 days we stopped 40k+ emails from even
being seen by MailScanner, SA or ClamAV. When your mail system is
an old box built out of pieces found in the PC "bone pile", that makes
a big difference!

So that 40k+ of known spam was stopped, leaving just under another 5k
that got nabbed by SA. That's the roughly 6.5% I mentioned in my original post.

The best thing you can do for bayes is to train it religiously each night.
Since we only have 6-7 email users, I have an absurdly simple script
that runs sa-learn on each user's Spam and Ham mailbox each night.
I just name each person's mailbox explicitly in the shell script. 

Crude but effective.
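
A minimal sketch of that kind of script is below. It is not our actual
script. The folder layout (mbox files at /home/USER/mail/Spam and
/home/USER/mail/Ham), the user names, and the function names are all
assumptions for illustration, and it assumes one shared Bayes database.
If your folders are Maildir rather than mbox, drop --mbox and point
sa-learn at the directory instead.

```shell
#!/bin/sh
# Nightly Bayes training sketch. Paths and names are placeholders.

SA_LEARN="${SA_LEARN:-sa-learn}"   # overridable, e.g. to test with a stub
MAIL_ROOT="${MAIL_ROOT:-/home}"    # assumption: folders live in ~/mail

# Learn one user's Spam and Ham folders (assumed to be mbox files).
train_user() {
    user="$1"
    for kind in spam ham; do
        case "$kind" in
            spam) box="$MAIL_ROOT/$user/mail/Spam" ;;
            ham)  box="$MAIL_ROOT/$user/mail/Ham" ;;
        esac
        # Skip folders that are missing or empty.
        [ -s "$box" ] || continue
        # --no-sync defers the journal sync until all folders are done.
        "$SA_LEARN" --no-sync "--$kind" --mbox "$box"
    done
}

# Train every user named on the command line, then sync once.
train_all() {
    for user in "$@"; do
        train_user "$user"
    done
    "$SA_LEARN" --sync
}

if [ "$#" -gt 0 ]; then
    train_all "$@"
fi
```

Saved as, say, train-bayes.sh, it would be run from cron each night with
the users named explicitly, e.g. train-bayes.sh alice bob carol. Afterwards
sa-learn --dump magic shows whether nspam and nham moved.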

Hope this helps,

Mike