Re: Identifying the real problem (was: Re: Blacklist for spam-words)

Karsten Bräckelmann Thu, 16 Sep 2010 12:31:24 -0700

On Thu, 2010-09-16 at 11:32 -0700, franc wrote:
> > ... Do you train *both*, spam *and* ham? Any chance these
> > have been trained incorrectly before? What Bayes score do they actually
> > get? The X-Spam-Status header would be sufficient to see.
> > 
> > The few lines of 'sa-learn --dump magic' would be good, too. Oh, and you
> > are training Bayes as the same user SA checks the mail for, right?
> 
> Yes, I trained both. By the way, I use SpamAssassin with Amavis.
> This is my bayes result:


So you trained (manually) as the amavis user, using the system-wide
Bayes DB, right?

> ~# sa-learn --dbpath /var/lib/amavis/.spamassassin/bayes --dump magic
> 0.000          0          3          0  non-token data: bayes db version
> 0.000          0       3270          0  non-token data: nspam
> 0.000          0       8809          0  non-token data: nham
> 0.000          0     120576          0  non-token data: ntokens

You need to train on more spam (nspam is 3270 against an nham of 8809).
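
To keep an eye on those two counters, here is a minimal Python sketch
that pulls them out of the 'sa-learn --dump magic' output. It assumes
the four-column layout quoted above; parse_dump_magic is a hypothetical
helper name, not part of SA.

```python
def parse_dump_magic(text):
    """Map each 'non-token data' label to its value (third column)."""
    counts = {}
    for line in text.splitlines():
        # Drop mail quoting ("> ") if the output was pasted from a reply.
        line = line.lstrip("> ").strip()
        if "non-token data:" not in line:
            continue
        fields, label = line.split("non-token data:")
        counts[label.strip()] = int(fields.split()[2])
    return counts


DUMP = """\
0.000          0          3          0  non-token data: bayes db version
0.000          0       3270          0  non-token data: nspam
0.000          0       8809          0  non-token data: nham
0.000          0     120576          0  non-token data: ntokens
"""

counts = parse_dump_magic(DUMP)
ratio = counts["nspam"] / counts["nham"]
print(counts["nspam"], counts["nham"], round(ratio, 2))
```

The ratio is only a quick indicator of how lopsided the training is; it
is not a value SA itself reports.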

> I know that just some blacklisted words are really not the solution. So I
> put the spam threshold lower in the Amavis conf:
> 
> $sa_tag_level_deflt  = undef;
> $sa_tag2_level_deflt = 6.31;  
> $sa_kill_level_deflt = 15;            
> $sa_dsn_cutoff_level = 25;            
> 
> A typical score of a "Uhren"-mail is:
> 
> X-Virus-Scanned: Debian amavisd-new at ew6.org
> X-Amavis-Alert: BAD HEADER, Duplicate header field: "Cc"
> X-Spam-Flag: NO
> X-Spam-Score: 12.989

Err... an SA score of ~13 and a status of not spam. *sigh*  See, you just
needed to identify your real problem. *THIS* is it.

The SA default spam threshold is 5. Everything exceeding that threshold
is classified as spam. Five. So this example, scoring 12.989, would have
been caught without any problem by vanilla SA.

The scores of the individual rules have been set with that default
threshold of 5 in mind. Raising it *slightly* is OK if you want to stay
even more on the FP-safe side. Raising it to 15, as the header below
shows (required=15), is just plain wrong. And it is the reason you are
not catching this spam.
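
To make the comparison concrete, a minimal Python sketch of the
threshold check. The function and constant names are made up for
illustration, and treating a score exactly at the threshold as spam is
an assumption here.

```python
SA_DEFAULT_REQUIRED = 5.0  # SpamAssassin's default threshold, per the text
RAISED_REQUIRED = 15.0     # the raised value (required=15) in this thread


def is_spam(score, required=SA_DEFAULT_REQUIRED):
    # Assumption: a score at or above the threshold counts as spam.
    return score >= required


score = 12.989  # X-Spam-Score of the example mail
print(is_spam(score))                   # True: caught at the default of 5
print(is_spam(score, RAISED_REQUIRED))  # False: missed at 15
```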

> X-Spam-Level: ************
> X-Spam-Status: No, score=12.989 required=15 tests=[BAYES_99=3.5,
>       DNS_FROM_OPENWHOIS=1.13, HTML_MESSAGE=0.001, PYZOR_CHECK=3.7,
>       RCVD_IN_PBL=0.905, RCVD_IN_SORBS_HTTP=0.001, RCVD_IN_SORBS_WEB=0.619,
>       RCVD_IN_XBL=3.033, RDNS_NONE=0.1]

No URI DNSBL hits here, but that does not necessarily indicate an issue.
There are DNSBL hits (RCVD_IN_PBL, RCVD_IN_XBL, RCVD_IN_SORBS_*), so DNS
works for you.

BAYES_99 means the Bayes sub-system considers it spam with a probability
of 0.99 or higher, where 0.0 means ham, 0.5 is neutral, and 1.0 is the
highest, pure evil spam. Bayes has been sufficiently trained on this
kind of spam.

This also means that Bayes obviously already considers the words you
wanted to blacklist as spam. That results in a partial score of 3.5 (of
5.0 by default, again) from Bayes alone, which is 70% of the way to
being marked as spam...
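
The arithmetic can be checked against the quoted X-Spam-Status header
with a short Python sketch. It assumes the tests=[NAME=score, ...]
layout shown above; parse_tests is a hypothetical helper, not an SA API.

```python
HEADER = """No, score=12.989 required=15 tests=[BAYES_99=3.5,
      DNS_FROM_OPENWHOIS=1.13, HTML_MESSAGE=0.001, PYZOR_CHECK=3.7,
      RCVD_IN_PBL=0.905, RCVD_IN_SORBS_HTTP=0.001, RCVD_IN_SORBS_WEB=0.619,
      RCVD_IN_XBL=3.033, RDNS_NONE=0.1]"""


def parse_tests(header):
    """Return {rule name: score} from the tests=[...] part of the header."""
    inside = header[header.index("[") + 1 : header.rindex("]")]
    tests = {}
    for item in inside.split(","):
        name, _, value = item.strip().partition("=")
        tests[name] = float(value)
    return tests


tests = parse_tests(HEADER)
total = round(sum(tests.values()), 3)  # sum of the per-rule scores
bayes_share = tests["BAYES_99"] / 5.0  # relative to the default threshold
print(total, bayes_share)
```

The per-rule scores add up to the reported 12.989, and BAYES_99 alone
covers 0.7 of the default threshold of 5.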

> So with "$sa_tag2_level_deflt = 6.31" it is OK. Before, I had 15. Above 6.31
> the mails are put directly into the Spam folder, so with IMAP, the user can
> still look at them.

I am not an Amavis user, but isn't 6.31 the Amavis default? Why did you
raise the threshold in the first place!? Again, that is (was) your
problem.


> Anyway, do you think I need to update to 3.3.x or is 3.2 still OK?

3.2 is less effective than 3.3, but as long as you're still happy with
the results, there is no immediate need to upgrade. Using a sane spam
threshold, mind you. You would have seen pretty much the exact same
"problem" with SA 3.3 and the threshold raised to 15.


-- 
char *t="\10pse\0r\0dtu...@ghno\x4e\xc8\x79\xf4\xab\x51\x8a\x10\xf4\xf4\xc4";
main(){ char h,m=h=*t++,*x=t+2*h,c,i,l=*x,s=0; for (i=0;i<l;i++){ i%8? c<<=1:
(c=*++x); c&128 && (s+=h); if (!(h>>=1)||!t[s+h]){ putchar(t[s]);h=m;s=0; }}}
