* maillist <[EMAIL PROTECTED]> [2007-01-25 10:21:47 -0600]:
> Kim Christensen wrote:
> >Hey list,
> >
> >I've recently started training our bayesian filter with spam/ham from my
> >personal mailbox, to prepare for live usage on our customer accounts.
> >
> >% sa-learn --dump magic
> >...
> >0.000 0 340 0 non-token data: nspam
> >0.000 0 475 0 non-token data: nham
> >0.000 0 53404 0 non-token data: ntokens
> >...
> >
> >So far so good, and spamd is actually using the bayesian db when
> >examining incoming mails. However, I find that a few of the legit ham
> >(not a majority) mails get unusually high bayesian points, while some
> >of the real spam (which gets scored as spam by sa) often get bayesian
> >points < 1.
> >
> >Now, I'm sure I haven't trained the database with wrong messages. Is it
> >a good idea to continue feeding sa-learn with example spam and ham until
> >it reaches a few thousands messages, before relying on the results?
> >
> >I would think my current amount is sufficient, but I guess something's
> >wrong with this picture :-)
> >
> >
> >Best regards
> >
>
> Run spamassassin --test-mode on the messages that are scoring high and
> low. See if they are actually running through any BAYES_* tests. I'm
> not 100% sure but I think that by default, the bayes do not even begin
> until you have 500 trained messages of each spam and ham.
>
> You can of course get around this by setting bayes_min_ham_num and
> bayes_min_spam_num in your local.cf file.
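
For what it's worth, the counts from the dump can be compared against those
two thresholds with a few lines of Python. This is only a sketch:
parse_magic and bayes_ready are made-up helper names, the column layout is
assumed from the `sa-learn --dump magic` output quoted above, and the
200/200 defaults are what I understand the stock values of
bayes_min_spam_num and bayes_min_ham_num to be, so check them against the
Mail::SpamAssassin::Conf documentation for your version.

```python
import re

# Sample lines in the shape of the `sa-learn --dump magic` output quoted above.
SAMPLE = """\
...
0.000          0        340          0  non-token data: nspam
0.000          0        475          0  non-token data: nham
0.000          0      53404          0  non-token data: ntokens
...
"""

# Assumed layout: four whitespace-separated columns, the third holding the
# value, followed by "non-token data: <name>".
MAGIC_LINE = re.compile(r"\s*\S+\s+\S+\s+(\d+)\s+\S+\s+non-token data: (.+?)\s*$")


def parse_magic(text):
    """Return {name: value} for every 'non-token data' line (hypothetical helper)."""
    counts = {}
    for line in text.splitlines():
        match = MAGIC_LINE.match(line)
        if match:
            counts[match.group(2)] = int(match.group(1))
    return counts


def bayes_ready(counts, min_spam=200, min_ham=200):
    """True if both trained-message counts reach the given minimums.

    The 200/200 defaults are an assumption about the stock settings;
    pass the values from your own local.cf if you have changed them.
    """
    return counts.get("nspam", 0) >= min_spam and counts.get("nham", 0) >= min_ham


if __name__ == "__main__":
    counts = parse_magic(SAMPLE)
    print(counts)
    print("usable with 200/200:", bayes_ready(counts))
    print("usable with 500/500:", bayes_ready(counts, min_spam=500, min_ham=500))
```

With the numbers quoted above (340 spam, 475 ham) this reports the database
as usable under 200/200 thresholds but not under 500/500.
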

Yeah, an example spam message marked with 17 points by SA gets the
following result when running a test scan against it:

...
0.0 BAYES_50 BODY: Bayesian spam probability is 40 to 60%
             [score: 0.5106]
...

Surely it runs through a Bayesian filter, and all the other scanning
methods are going wild about it - but not the BAYES_* test!

Best regards
-- 
Kim Christensen

"I am Jack's smirking revenge."