On Wed, 2008-11-12 at 12:24 +0100, Thomas Zastrow wrote:
> Karsten Bräckelmann schrieb:
> > On Tue, 2008-11-11 at 21:55 +0100, Thomas Zastrow wrote:
> >   
> >> I'm still not happy with my Spamassassin ... it doesn't recognize a lot
> >> of spam mails; even my Thunderbird with default settings recognizes
> >> more than SA.

Coincidentally, I just migrated a few home users from the TB internal
Bayes filter to a full-featured SA install. So far, they are more than
happy. (Even the one who didn't have sufficient spam on the first run,
so Bayes only started working later.)

Are you sure it's Bayes you're having trouble with? Any chance it's
actually something else, like disabled network tests? Do you see rules
hitting like URIBL_* or RCVD_IN_*?
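If you're unsure, a quick tally over the headers of recent mail shows
which rules actually fire. (The ~/Mail/* paths below are examples only;
point the command at wherever your messages actually live.)

```shell
# Count network-test rule hits (URIBL_*, RCVD_IN_*) in recent mail.
# ~/Mail/inbox and ~/Mail/spam are example paths -- adjust to taste.
grep -h 'X-Spam-Status:' ~/Mail/inbox ~/Mail/spam 2>/dev/null |
  grep -oE '(URIBL|RCVD_IN)_[A-Z0-9_]+' |
  sort | uniq -c | sort -rn
```

If that prints nothing at all, the network tests likely aren't running.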


> >> Every day, I train the Bayes filter with all the spam that was not
> >> already recognized as spam. My question is now: does it also make
> >> sense to feed the mails already marked as spam to sa-learn?
> >
> > Yes, but... (you know, there just has to be a but. ;)

Getting back to this with more detail -- Yes, it does make sense. In
particular while either  (a) there are low-scoring spams with a Bayes
score below 0.9 (no BAYES_9x hits) or  (b) you recently started
training and there may still be a lot of different spam that hasn't
been learned yet.

Even if auto-learn is enabled, SA will not automatically learn messages
that score below a couple of different thresholds, for safety reasons.
These should be learned manually if you suspect a problem.

SA will not learn a message twice, so it is safe to simply feed it the
entire (recent) spam folder.
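For example (folder paths and the mbox format below are assumptions
about your setup, not gospel -- adjust to your actual mail store):

```shell
# Safe to re-run on the whole folder: sa-learn skips messages it has
# already learned. Paths and --mbox are examples for a typical setup.
sa-learn --spam --mbox ~/Mail/spam
sa-learn --ham  --mbox ~/Mail/inbox
sa-learn --dump magic    # nspam / nham should have grown afterwards
```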


> > If this might be the case, you will not have seen BAYES rules in any of
> > your messages SA headers. To know for sure about your training so far,
> > see nham and nspam in this command:
> >
> >   sa-learn --dump magic
> 
> There are Bayes rules, but often the value is very small so that it does 
> not change the status of the mail. Here is the output of the sa-learn 
> --dump magic command:
> 
> 0.000          0          3          0  non-token data: bayes db version
> 0.000          0        713          0  non-token data: nspam
> 0.000          0        788          0  non-token data: nham
> 0.000          0      88333          0  non-token data: ntokens

Can you elaborate, please -- what exactly do you mean by "small values"?
If you check the headers, which BAYES_XX rules are commonly triggered
for your recent spam and ham, respectively?

Also, it would be good to see some representative samples of mail that
isn't being detected as expected. Please upload them somewhere (like a
pastebin or your webspace) and provide the link; do not post them to
the list directly.


Another thing that comes to mind:  How *exactly* are you learning?

I guess you're running sa-learn on some mail folders. Which ones,
exactly? Thunderbird local mail storage, or maybe IMAP? Are you running
sa-learn on the raw mbox files? Any chance there has been a bunch of
mis-classified mail in there, which you moved or deleted?

Like, say, spam in your Inbox, which you move to a train-this folder,
then run sa-learn --spam (any other switches?) on it, and do the same
with --ham for the Inbox. If you didn't expunge ("Compact", in TB
lingo), these spams are *still* in your Inbox, merely marked as
deleted. They won't be physically removed until the folder is
compacted.
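If you want a rough check, Thunderbird records the deleted flag as bit
0x8 of the X-Mozilla-Status header, so you can count messages that are
still physically in the mbox but only marked deleted. (The path is an
example, and this assumes TB's usual 4-hex-digit lowercase status
values.)

```shell
# Count messages merely *marked* deleted: X-Mozilla-Status bit 0x8
# set means the last hex digit is >= 8. ~/Mail/Inbox is an example.
grep -cE '^X-Mozilla-Status: [0-9a-f]{3}[89a-f]' ~/Mail/Inbox
```

A non-zero count before compacting means an sa-learn --ham run on that
mbox would feed those spams in as ham.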

In short: More details and evidence, please. :)

  guenther


-- 
char *t="[EMAIL PROTECTED]";
main(){ char h,m=h=*t++,*x=t+2*h,c,i,l=*x,s=0; for (i=0;i<l;i++){ i%8? c<<=1:
(c=*++x); c&128 && (s+=h); if (!(h>>=1)||!t[s+h]){ putchar(t[s]);h=m;s=0; }}}
