* maillist <[EMAIL PROTECTED]> [2007-01-25 10:21:47 -0600]:

> Kim Christensen wrote:
> >Hey list,
> >
> >I've recently started training our bayesian filter with spam/ham from my
> >personal mailbox, to prepare for live usage on our customer accounts.
> >
> >% sa-learn --dump magic
> >...
> >0.000          0        340          0  non-token data: nspam
> >0.000          0        475          0  non-token data: nham
> >0.000          0      53404          0  non-token data: ntokens
> >...
> >
> >So far so good, and spamd is actually using the bayesian db when
> >examining incoming mails. However, I find that a few legitimate ham
> >mails (not a majority) get unusually high Bayesian points, while some
> >of the real spam (which SA scores as spam overall) often gets Bayesian
> >points < 1.
> >
> >Now, I'm sure I haven't trained the database with wrong messages. Is it
> >a good idea to continue feeding sa-learn with example spam and ham until
> >it reaches a few thousand messages, before relying on the results?
> >
> >I would think my current amount is sufficient, but I guess something's
> >wrong with this picture :-)
> >
> >
> >Best regards
> >  
> Run spamassassin --test-mode on the messages that are scoring high and
> low.  See if they are actually hitting any BAYES_* tests.  I'm
> not 100% sure, but I think that by default Bayes does not even kick in
> until you have trained 200 messages each of spam and ham.
> 
> You can of course get around this by setting bayes_min_ham_num  and  
> bayes_min_spam_num in your local.cf file.

Yeah, an example spam message marked with 17 points by SA gets the
following result when running a test scan against it:

...
 0.0 BAYES_50               BODY: Bayesian spam probability is 40 to 60%
                            [score: 0.5106]
...

So it clearly does run through the Bayesian filter, and all the other
tests are going wild about it - but not the BAYES_* test!
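
To rule out the training minimums as the cause, the counts from the dump
can be compared with bayes_min_spam_num and bayes_min_ham_num. Below is a
minimal sketch in Python. It is not part of SpamAssassin, the function
names are made up for illustration, and the value of 200 is only what I
believe the stock default to be, so pass in whatever your local.cf sets.

```python
import re

def parse_magic(dump_text):
    """Collect the nspam/nham/ntokens counters from `sa-learn --dump magic` output."""
    counts = {}
    for line in dump_text.splitlines():
        # Expected shape: "0.000   0   340   0  non-token data: nspam"
        # The third numeric column carries the counter value.
        m = re.match(
            r"\s*\S+\s+\S+\s+(\d+)\s+\S+\s+non-token data: (\w+)\s*$", line
        )
        if m:
            counts[m.group(2)] = int(m.group(1))
    return counts

def bayes_active(counts, min_spam=200, min_ham=200):
    """True if both counters meet the minimums (200 is an assumed default)."""
    return (counts.get("nspam", 0) >= min_spam
            and counts.get("nham", 0) >= min_ham)

sample = """\
0.000          0        340          0  non-token data: nspam
0.000          0        475          0  non-token data: nham
0.000          0      53404          0  non-token data: ntokens
"""
counts = parse_magic(sample)
print(counts, bayes_active(counts))
```

With my numbers (340 spam, 475 ham) both counters are above 200, which
fits the BAYES_50 hit above: the classifier is running, it is just
undecided about this message.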


Best regards
-- 
Kim Christensen
"I am Jack's smirking revenge."
