Re: false positives and negatives

Loren Wilton Tue, 31 May 2005 00:13:15 -0700

> Sorry for my late reply - my evening is your morning.
> There is 1000 spam a week that leaks through and perhaps another 500-600
> that get filtered by spamassassin.
> If my Bayes is poorly trained what options do I have.
> Here is a typical letter that gets through.
>
>
> ======================================================================
> Return-Path: <[EMAIL PROTECTED]>
>  Received: from fw.doverie.bg (doh-gw.customer.0rbitel.net
> [195.24.44.114])
> by mail1.mr-bricolage.bg (8.13.3/8.13.3/Debian-6) with SMTP id
> j4V11DGj014435
> for <[EMAIL PROTECTED]>; Tue, 31 May 2005 04:01:15 +0300
>  Received: (qmail 13680 invoked by uid 507); 31 May 2005 00:58:54 -0000
>  Delivered-To: [EMAIL PROTECTED]
>  Received: (qmail 13672 invoked by uid 503); 31 May 2005 00:58:48 -0000
>  Received: from [EMAIL PROTECTED] by fw.doverie.bg by uid 500 with
> qmail-scanner-1.15
> (f-prot: 3.12. Clear:.
> Processed in 12.821956 secs); 31 May 2005 00:58:48 -0000
>  Received: from cow100.orbitel.bg (HELO ns.orbitel.bg) (195.24.32.18)
> by 0 with SMTP; 31 May 2005 00:58:20 -0000
>  Received: (qmail 607 invoked from network); 31 May 2005 01:01:36 -0000
>  Received: from unknown (HELO street67.net) (219.134.152.97)
> by ns.orbitel.bg with SMTP; 31 May 2005 01:01:36 -0000
>  Message-ID: <[EMAIL PROTECTED]>
>  Date: Mon, 30 May 2005 16:15:11 +1100
>  From: "michael torrey" <[EMAIL PROTECTED]>
>  User-Agent: QUALCOMM Windows Eudora Version 6.0.0.22
>  X-Accept-Language: en-us
>  MIME-Version: 1.0
>  To: "Elden Irving" <[EMAIL PROTECTED]>
>  Cc: <[EMAIL PROTECTED]>,
> <[EMAIL PROTECTED]>
>  Subject: It is all about quality tableets sold at the finest prices.
>  Content-Type: text/plain;
> charset="us-ascii"
>  Content-Transfer-Encoding: 7bit
>  X-Spam-Checker-Version: SpamAssassin 3.0.2 (2004-11-16) on
> mail1.mr-bricolage.bg
>  X-Spam-Level:
>  X-Spam-Status: No, score=0.1 required=2.0 tests=FORGED_RCVD_HELO
> autolearn=ham version=3.0.2
>  Status: R
>  X-Status: N
>  X-KMail-EncryptionState:
>  X-KMail-SignatureState:
>  X-KMail-MDN-Sent:
>
> At our rxdrug-site, you can choose top-selling rxmeds at a reduced prices.
> Legitimate way to e-shoppe for tableets. We provide customers flexible and
> reliable distribution services.
> ======================================================================


It is a holiday in the US, so you probably won't receive more replies for
some hours.

The spam you show is difficult to handle.  One important thing is that there
is no URL or other link in the message body to a drug site where people
could get the spammed product.  I am assuming the original spam must have
had one, since a spam without a link is fairly useless.  If you are getting
spams without links similar to this, then other methods, such as writing
some custom rules, will be required to eliminate the problem.
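
A minimal sketch of such custom rules for local.cf, assuming the odd
spellings in this sample ("rxdrug", "rxmeds", "tableets") keep recurring in
the spam you see.  The rule names and scores are made up for illustration
and would need tuning against your own mail:

    body     LOCAL_RX_WORDS       /\brx(?:drug|meds)\b/i
    describe LOCAL_RX_WORDS       Body mentions rxdrug or rxmeds
    score    LOCAL_RX_WORDS       1.5

    header   LOCAL_SUBJ_TABLEETS  Subject =~ /\btableets\b/i
    describe LOCAL_SUBJ_TABLEETS  Subject contains the misspelling tableets
    score    LOCAL_SUBJ_TABLEETS  1.5

With required=2.0, both rules hitting would be enough to tag the sample
above.  Running "spamassassin --lint" after editing local.cf will catch
syntax mistakes before they affect live mail.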

Bayes did not trigger on this message, either for or against.  I'm somewhat
surprised that Bayes didn't even show a BAYES_50 score though.  So bayes is
neither helping nor hindering.  It should be helping.  But that gets us to
the next point:

> autolearn=ham

Bayes autolearn is enabled, as it is by default.  Since this got a low
score, it has been learned as ham rather than spam.  Sooner or later Bayes
will start helping messages like this get through by giving them scores of
BAYES_00.

You could back this particular message out of Bayes by learning it manually
as spam.  However, if you are having 1000 messages a week leak through with
low scores, your Bayes database probably believes that all spams are hams at
this point.  So there is no point in learning individual messages correctly
just yet; your Bayes database is probably junk.

Start by setting bayes_auto_learn to 0 in local.cf to disable auto
learning - it is doing much more harm than good at this point.  Later you
will probably be able to turn it back on, once you have a Bayes database
that knows spam from ham.  But not yet.
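
In local.cf that is a single line (restart spamd afterwards if you use it,
so the change is picked up):

    bayes_auto_learn 0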

Also add a score line for BAYES_99 to fix the poor scoring in 3.0.2 for this
rule:
    score BAYES_99    4
should do the trick.

Next remove your existing Bayes database and start over.  You will need to
manually train it on at least 200 messages each of ham and spam before it
starts scoring.  If you make a couple of mbox files, one with manually
sorted spam, the other with manually sorted ham, and feed these to sa-learn
correctly, you should be able to get Bayes working for you in no more than a
day or two, probably only a few hours, depending on your mail rate.
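
A rough sketch of the reset and retraining, assuming the sorted mail is in
two mbox files called spam.mbox and ham.mbox (hypothetical names) and that
you run the commands as the same user SpamAssassin filters mail as, so they
touch the same database:

    sa-learn --clear                              # wipe the existing Bayes database
    sa-learn --spam --mbox --showdots spam.mbox   # learn the sorted spam
    sa-learn --ham  --mbox --showdots ham.mbox    # learn the sorted ham

If your sa-learn does not know --clear, removing the bayes_* files from the
Bayes directory (~/.spamassassin by default) has the same effect.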

Keep training Bayes manually every now and then.  You should get a good base
of at least a few thousand hams and spams each, representative of the sort
of mail you get.  If you start seeing new spams that are scoring below
BAYES_70 or so, learn a few of them.  Every so often learn a few new hams to
keep things balanced.  You typically will only have to spend a few minutes
a week dealing with this.  If you get Bayes trained well, you could turn on
auto-learning again.  But I'm personally nervous doing this, and it isn't
that hard to toss a few messages to Bayes every now and then.
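
As a sketch of that routine, assuming a missed spam and a good message have
been saved to the hypothetical files missed-spam.txt and good-mail.txt:

    sa-learn --dump magic            # the nspam and nham lines show how many are learned
    sa-learn --spam missed-spam.txt  # a spam that scored too low
    sa-learn --ham  good-mail.txt    # a recent legitimate message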

That should get bayes on your side pretty quickly.

The next thing that could help you is to enable net tests, specifically the
SURBL checks.  These will catch a lot of your spams.
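
A minimal sketch of what to check, assuming a stock 3.0.x install where the
SURBL lookups come from the URIDNSBL plugin and the Net::DNS Perl module is
installed.  Net tests are skipped entirely if spamd or spamassassin is
started with -L (local tests only), so that flag has to go as well:

    # init.pre: load the plugin that performs the SURBL lookups
    loadplugin Mail::SpamAssassin::Plugin::URIDNSBL

    # local.cf: 0 is the default; a 1 here turns the DNS blocklist checks off
    skip_rbl_checks 0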

You might need to be careful with any other net checks.  You have a really
screwy sequence of received headers, with all of those qmail headers between
all the real headers.  I don't know if SA will be able to deal with that and
figure out where your main mail gateway is so that it can determine the
trusted hosts correctly.
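
If it cannot, you can tell it explicitly with trusted_networks in local.cf.
The line below is only a sketch: it assumes the relay at 195.24.44.114
(fw.doverie.bg in the headers above) is under your control, which I can't
tell from here, so substitute the addresses of your own gateways:

    # relays whose Received headers can be believed (assumed, adjust to suit)
    trusted_networks 195.24.44.114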

        Loren
