Re: Tuning the bayes-system?

Karsten Bräckelmann Tue, 21 Oct 2008 07:40:06 -0700

On Tue, 2008-10-21 at 14:32 +0200, Heinrich Christian Peters wrote:
> Hello Karsten/guenther, (?)


Real name, commonly known nick name (and email address). :)


> >> I am using a system-wide spamassassin setup (MailScanner). Nearly all my
> >> spam-mails are detected correctly (~0,1% is not), no FP. But, especially
> >> German spam-mails, are "wrongly" classified by the bayes-system. Should
> > 
> > According to your stats snippets:  BAYES_50 is not "wrongly" classified,
> > but not-classified-at-all. The difference is the very meaning of a
> > Bayesian score of 0.5 -- undecided, neither really spammy nor hammy
> > tokens.
> 
> I see, but what about the 1.6% of spam (around 57 mails) classified
> by the bayes-system as ham (BAYES_00)? And, another thing, as you can

That's mis-classified alright.

> see, if the mail was classified as "BAYES_50" it is in nearly every case
> spam, so I think, the mails are wrongly classified, they should be
> BAYES_60 or higher...

Again, BAYES_50 is neither classified as ham nor spam. According to Bayes
there's just no indication to classify it. Thus, IMHO it is not wrongly
classified. Think about it this way -- the absence of a given URL from
both black and white lists does not constitute a false hit for either
list.
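
To illustrate the point, here is a minimal sketch of how a Bayes
probability maps onto the BAYES_xx rule names. The bucket edges are an
assumption modeled on the stock SpamAssassin rules, and bayes_rule() is
a made-up helper, not part of SA. Check the rules shipped with your
version for the real ranges.

```python
# Assumed upper bucket edges, modeled on the stock BAYES_* rules.
# These numbers are illustrative; verify them against your own rules files.
BUCKETS = [
    (0.01, "BAYES_00"),
    (0.05, "BAYES_05"),
    (0.20, "BAYES_20"),
    (0.40, "BAYES_40"),
    (0.60, "BAYES_50"),  # the undecided middle band around 0.5
    (0.80, "BAYES_60"),
    (0.95, "BAYES_80"),
    (0.99, "BAYES_95"),
]


def bayes_rule(probability):
    """Return the rule name for a Bayes probability between 0.0 and 1.0."""
    if not 0.0 <= probability <= 1.0:
        raise ValueError("probability must be within [0.0, 1.0]")
    for upper_edge, name in BUCKETS:
        if probability < upper_edge:
            return name
    return "BAYES_99"  # everything at or above the last edge


if __name__ == "__main__":
    for p in (0.001, 0.5, 0.7, 0.999):
        print(p, bayes_rule(p))
```

A probability of 0.5 lands in the middle band: the tokens in the mail
gave Bayes no reason to lean either way.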


> > Since you merely mentioned "German spam", the details might make a
> > difference, though. What are you talking about exactly?
> 
> German is my first language and nearly all (ham-)mails I get, are
> German.  The few English (ham-)mails I get are correctly classified as
> BAYES_10 or below.

> The (spam-)mails I am talking about are eg.:
>  - phishing-mails (today: DABbank AG)
>  - casino (Fiesta Club Casino, Euro Club Casino)

These are not exactly spam IMHO. They are phishing mail and mail
carrying trojan URLs, respectively. ClamAV and the SaneSecurity phish
sigs weed those out before SA even processes the mail in my setup.

With the notable exception of the very recent DAB Bank phishes, which
started today. Massively. Apparently there's no AV sig yet for those.
However, even though Bayes didn't catch them for me either, they
typically score around *20* here, with hits in XBL, PBL and URIBL_BLACK.
If you really have a problem with these, I guess Bayes isn't your main
issue. ;)


>  - pharmacy, mostly caught by ZMIde_Pharmacy

German pharmacy spam. Similar to the above for me. Hits blacklists
galore, Bayes of 80 or higher. The bulk of these I get features rather
static text anyway -- do you really have a problem training them in
Bayes?

Since you are using site-wide Bayes, are you sure that your manual
training uses the *same* Bayes DB? A common oops, and you'd effectively
end up with auto-learning only, no manual training on low scorers.


>  - "job offers", finance-sector

Not as easy to catch indeed.


> > Given your timing, my guess is you're talking about the recent flood of
> > German porn spam, advertising cam sites. Even though they are using
> > pretty explicit phrases, these appear to be hard to catch.
> 
> These mails are not the problem, I didn't get them...

Consider yourself lucky. :)

  guenther


-- 
char *t="[EMAIL PROTECTED]";
main(){ char h,m=h=*t++,*x=t+2*h,c,i,l=*x,s=0; for (i=0;i<l;i++){ i%8? c<<=1:
(c=*++x); c&128 && (s+=h); if (!(h>>=1)||!t[s+h]){ putchar(t[s]);h=m;s=0; }}}
