On Wed, 2008-11-12 at 12:24 +0100, Thomas Zastrow wrote:
> Karsten Bräckelmann wrote:
> > On Tue, 2008-11-11 at 21:55 +0100, Thomas Zastrow wrote:
> >
> >> I'm still not happy with my SpamAssassin ... it doesn't recognize a
> >> lot of spam mails; even my Thunderbird with default settings
> >> recognizes more than SA.
Coincidentally, I just migrated a few home users from the TB-internal
Bayes filter to a full-featured SA install. So far, they are more than
happy. (Even the one who didn't have sufficient spam on the first run,
so Bayes only started working later.)

Are you sure it's Bayes you're having trouble with? Any chance it's
actually something else, like disabled network tests? Do you see rules
hitting like URIBL_* or RCVD_IN_*?

> >> Every day, I train the Bayes filter with all the spam which was not
> >> already recognized as spam. My question is now: does it make sense
> >> to also use the mails already marked as spam as input for sa-learn?
> >
> > Yes, but... (you know, there just has to be a but. ;)

Getting back to this with more detail -- yes, it does make sense. In
particular while (a) there are low-scoring spams with a Bayes score
below 0.9 (no BAYES_9x hits), or (b) you recently started training and
there may still be a lot of different spam that hasn't been learned
yet.

Even if auto-learn is enabled, SA will not automatically learn messages
that score below a couple of different thresholds, for safety reasons.
These should be learned manually if you suspect a problem. SA will not
learn a message twice, so it is safe to simply feed it the entire
(recent) spam folder.

> > If this might be the case, you will not have seen BAYES rules in any
> > of your messages' SA headers. To know for sure about your training
> > so far, see nham and nspam in the output of this command:
> >
> >   sa-learn --dump magic
>
> There are Bayes rules, but often the value is very small, so that it
> does not change the status of the mail. Here is the output of the
> sa-learn --dump magic command:
>
> 0.000          0          3          0  non-token data: bayes db version
> 0.000          0        713          0  non-token data: nspam
> 0.000          0        788          0  non-token data: nham
> 0.000          0      88333          0  non-token data: ntokens

Can you elaborate, please -- what exactly do you mean by "small
values"?
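For reference, a typical manual training run might look like the
sketch below. The folder paths are made up; substitute whatever your
local mail storage actually uses.

```shell
# Hypothetical mbox paths -- adjust to your own setup.
# Learn everything in the spam folder as spam. Already-learned
# messages are skipped automatically, so re-feeding the whole
# folder is safe.
sa-learn --spam --mbox ~/mail/Spam

# Learn known-good mail as ham, so Bayes sees both classes.
sa-learn --ham --mbox ~/mail/Inbox

# Check the counters afterwards: nspam and nham should have grown.
sa-learn --dump magic
```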
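To see which BAYES rule actually fired on a given message, grepping
the X-Spam-Status header is enough. A minimal sketch, using a made-up
header line and a hypothetical temp file purely for illustration:

```shell
# Write a made-up X-Spam-Status header to a scratch file, just to
# have something to grep against. Real mail will of course vary.
printf 'X-Spam-Status: No, score=2.1 tests=BAYES_50,HTML_MESSAGE\n' \
    > /tmp/sample-headers.txt

# Extract the BAYES rule that hit; prints BAYES_50 for this sample.
grep -o 'BAYES_[0-9]*' /tmp/sample-headers.txt
```

Running that over your recent spam and ham folders will quickly show
whether Bayes is stuck around BAYES_50 or actually reaching BAYES_9x.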
If you check the headers, what are common BAYES_XX rules triggered for
both your recent spam and ham?

Also, it would be good to see a few representative samples of mail
that isn't being detected as expected. Please upload them somewhere
(like a pastebin or your webspace) and provide the link; do not post
them to the list directly.

Another thing that comes to mind: How *exactly* are you learning? I
guess you're running sa-learn on some mail folders. Which exactly are
they? Thunderbird local mail storage, or maybe IMAP? Are you running
sa-learn on the raw mbox files?

Any chance there have been a bunch of mis-classified mails in there,
which you moved or deleted? Like, say, spam in your Inbox, which you
move to a train-this folder, then run sa-learn --spam (any other
switches?) on it, and do the same with --ham for the Inbox. If you
didn't expunge ("compact", or whatever it's called in TB lingo), these
spams are *still* in your Inbox, marked as deleted. They won't be
physically removed unless you compact the folder.

In short: More details and evidence, please. :)

  guenther

-- 
char *t="[EMAIL PROTECTED]";
main(){ char h,m=h=*t++,*x=t+2*h,c,i,l=*x,s=0; for (i=0;i<l;i++){ i%8?
c<<=1: (c=*++x); c&128 && (s+=h); if (!(h>>=1)||!t[s+h]){
putchar(t[s]);h=m;s=0; }}}