RE: not seeing any advantage to sa-learn?

Karsten Bräckelmann Thu, 26 Feb 2009 16:36:21 -0800

On Thu, 2009-02-26 at 17:14 -0700, Savoy, Jim wrote:
>    I seem to have the same problem as Ricardo. I feed the same stuff every day
> into Bayes, using sa-learn, but the tagging never changes. Otherwise, SA seems
> to be working perfectly on all other messages, but not with the ones I
> constantly feed it (they always seem to hit only BAYES_50).


Yes, I too see messages that appear to be resistant to Bayes training.
Typically those are pretty short: a little bit of "company" text and a
picture. They just love to score Bayes 50 or 60...

Learning them sky-rockets the Bayes score. For *that* message. Other,
identical *looking* messages get me back to square one -- Bayes 50.

Point here being: Do you see the Bayes score improve when you scan the
same message a second time, after learning it?


> Spamassassin and the bayes databases are owned by user/group exim:exim.

> I run the imap-sa-learn.pl program as "root" (which could be the
> problem) but inside that program I see:
> 
> my $username = 'spamassassin';

Dunno that script. Can you verify that learning messages has an impact
on the output of "sa-learn --dump magic", that is, that the nspam, nham
and ntokens counters actually change?
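
For what it's worth, a minimal sketch of such a before/after check in
Python, assuming the dump format quoted further down in this thread. The
"before" figures are the ones from that dump; the "after" snapshot is
made up for illustration, and the helper names (parse_magic, delta) are
hypothetical. In practice you would paste in two real dumps, taken
before and after a training run.

```python
import re

# One "non-token data" line of the dump: four columns, the third one
# holds the value, followed by the counter name.
MAGIC_LINE = re.compile(
    r"^\s*\S+\s+\S+\s+(\d+)\s+\S+\s+non-token data:\s*(.+?)\s*$"
)

def parse_magic(text):
    """Return {counter name: value} for every non-token data line."""
    counters = {}
    for line in text.splitlines():
        match = MAGIC_LINE.match(line)
        if match:
            counters[match.group(2)] = int(match.group(1))
    return counters

def delta(before, after, keys=("nspam", "nham", "ntokens")):
    """Difference (after - before) for the counters that learning should move."""
    return {k: after[k] - before[k] for k in keys if k in before and k in after}

# Figures as posted in this thread.
BEFORE = """\
0.000          0          3          0  non-token data: bayes db version
0.000          0     206774          0  non-token data: nspam
0.000          0    1515235          0  non-token data: nham
0.000          0     917146          0  non-token data: ntokens
"""

# Hypothetical snapshot after learning one spam message (invented numbers).
AFTER = """\
0.000          0          3          0  non-token data: bayes db version
0.000          0     206775          0  non-token data: nspam
0.000          0    1515235          0  non-token data: nham
0.000          0     917190          0  non-token data: ntokens
"""

changes = delta(parse_magic(BEFORE), parse_magic(AFTER))
print(changes)
if not any(changes.values()):
    print("No counter moved: the training probably went to a different DB.")
```

If every delta is zero after a training run on new messages, the
learning most likely never reached the DB that SA is scanning with.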


> Which is the name of the user that owns the Public Folders on the Exchange
> server. Perhaps that needs to be changed to exim? Or I need to run the
> imap-sa-learn.pl program as user exim?

Again, dunno that script -- but it is always a good idea to train Bayes
as the user using that DB... ;)  Otherwise you may well end up training
a different DB than the one SA scans with.


> The primary MX server is a little wilder though:
> 
> 0.000          0          3          0  non-token data: bayes db version
> 0.000          0     206774          0  non-token data: nspam
> 0.000          0    1515235          0  non-token data: nham
> 0.000          0     917146          0  non-token data: ntokens
> 
> I hope that's OK (the wide disparity between nham and nspam).

No experience with that bias -- however, my guess is that it is not OK.

According to the docs, "about equal" is best. Though it has also been
reported that 10 times more spam than ham works just fine.

Personally, with my ham in-stream, I can claim that 30 times as much
spam as ham works just fine. (Keep in mind that spam changes *much*
faster than ham.) However, having only about a 10th as much spam as ham
is most likely to result in bad Bayes scores.
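
To put a number on that bias, here is a quick back-of-the-envelope
calculation in Python using the nspam and nham figures quoted above. It
is only arithmetic on the posted counters; the variable names are mine.

```python
# Counters as posted for the primary MX.
nspam = 206774
nham = 1515235

ham_per_spam = nham / nspam          # how many ham learned per spam learned
spam_share = nspam / (nspam + nham)  # fraction of all learned mail that is spam

print(f"{ham_per_spam:.1f} ham per spam, spam share {spam_share:.1%}")
```

So that DB has seen roughly seven ham for every spam, which is the
opposite direction of the spam-heavy ratios reported to work fine.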

  guenther


-- 
char *t="\10pse\0r\0dtu...@ghno\x4e\xc8\x79\xf4\xab\x51\x8a\x10\xf4\xf4\xc4";
main(){ char h,m=h=*t++,*x=t+2*h,c,i,l=*x,s=0; for (i=0;i<l;i++){ i%8? c<<=1:
(c=*++x); c&128 && (s+=h); if (!(h>>=1)||!t[s+h]){ putchar(t[s]);h=m;s=0; }}}
