GRP Productions wrote on Mon, 14 Mar 2005 00:32:42 +0200:

> You are right, I am using MailWatch. I just posted this output to be easy 
> for one to see the actual dates without having to convert.

That's okay; the problem is just that one cannot be sure how accurate it is.
Knowing that you use MS would have been useful anyway :-)
(BTW: my version of MailWatch can't show this; do you use a CVS version?)

> Here is the
> actual output:
>  
> # /usr/bin/sa-learn -p /opt/MailScanner/etc/spam.assassin.prefs.conf --dump 
> magic 
> 0.000          0          3          0  non-token data: bayes db version 
> 0.000          0      49740          0  non-token data: nspam 
> 0.000          0      47167          0  non-token data: nham 
> 0.000          0     123325          0  non-token data: ntokens

I didn't look at this closely before, but I think this ratio indicates a
problem. For instance, this is from our own mail server (which only receives
our own mail, not our clients'):

0.000          0      30089          0  non-token data: nspam
0.000          0      12515          0  non-token data: nham
0.000          0    1001630          0  non-token data: ntokens

Look at the number of tokens: we have about eight times as many as you do,
from fewer than half as many learned messages. That means our db has many more
tokens to draw on when qualifying an email as ham or spam. Also, your "hold
time" is quite low, about a month. I think we have tokens from even a year
ago. That's maybe a bit too much, but I strongly suggest raising your
bayes_expiry_max_db_size to something like 500000 or so. Since you have a
much higher flux of messages than we have on that machine, you are literally
"burning" your db to uselessness.

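To put a number on that comparison, here is a minimal Python sketch that
parses the two "--dump magic" listings above and works out tokens per learned
message. The function names are mine, and tokens per message is only a rough
indicator I am assuming is useful here, not something sa-learn reports.

```python
# Sketch only: compare two "sa-learn --dump magic" listings.
# parse_magic() and tokens_per_message() are illustrative helper names.

YOURS = """\
0.000          0          3          0  non-token data: bayes db version
0.000          0      49740          0  non-token data: nspam
0.000          0      47167          0  non-token data: nham
0.000          0     123325          0  non-token data: ntokens
"""

OURS = """\
0.000          0      30089          0  non-token data: nspam
0.000          0      12515          0  non-token data: nham
0.000          0    1001630          0  non-token data: ntokens
"""


def parse_magic(dump_text):
    """Return a dict like {'nspam': 49740, ...} from --dump magic lines."""
    counts = {}
    for line in dump_text.splitlines():
        line = line.lstrip("> ").strip()  # tolerate quoted mail text
        if "non-token data:" not in line:
            continue
        fields, label = line.split("non-token data:", 1)
        parts = fields.split()
        # In the listings above the value sits in the third column.
        if len(parts) >= 3 and parts[2].isdigit():
            counts[label.strip()] = int(parts[2])
    return counts


def tokens_per_message(counts):
    """Tokens in the db divided by the number of learned messages."""
    return counts["ntokens"] / (counts["nspam"] + counts["nham"])


if __name__ == "__main__":
    for name, dump in (("yours", YOURS), ("ours", OURS)):
        print(name, round(tokens_per_message(parse_magic(dump)), 2))
```

With the figures quoted above this comes to roughly 1.3 tokens per learned
message on your side and roughly 23.5 on ours.
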
> No it isn't. This is exactly the point I mentioned.

But you didn't prove it ;-)

> But as I said earlier,
> sa-learn claims it has learned, even from the web interface: 
> >SA Learn: Learned from 1 message(s) (1 message(s) examined). 

And did you specify the config file when learning? I suspect that you are, at
least occasionally, using two SA configurations: the one coming with MS and
the one coming with SA.

> This is getting more suspicious: there is no bayes_journal file! 

Oh. Still possible, though. You don't need to have one, but on high-volume
systems it's highly recommended. Check your SA config (wherever it is :-) for
bayes_learn_to_journal 1. I don't know if it is 1 by default, though. Which
lines starting with bayes do you have in your config file?
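
If it helps, a small Python sketch like the following pulls out every option
whose name starts with "bayes" from a config file's text. The SAMPLE text,
including the path in it, is a made-up placeholder and not your real config;
bayes_settings() is just an illustrative helper name.

```python
# Sketch only: list the bayes* options found in SpamAssassin-style config text.

SAMPLE = """\
# hypothetical example, NOT the poster's real config
bayes_path /var/spool/spamassassin/bayes
bayes_learn_to_journal 1
bayes_expiry_max_db_size 500000
required_hits 5
"""


def bayes_settings(config_text):
    """Return {option: value} for every option starting with 'bayes'."""
    settings = {}
    for raw in config_text.splitlines():
        line = raw.split("#", 1)[0].strip()  # drop comments and whitespace
        if not line:
            continue
        parts = line.split(None, 1)
        key = parts[0]
        if key.startswith("bayes"):
            settings[key] = parts[1].strip() if len(parts) > 1 else ""
    return settings


if __name__ == "__main__":
    for key, value in bayes_settings(SAMPLE).items():
        print(key, value)
```

To run it against a real file, read the file's contents and pass them in, once
for the prefs file that MS uses and once for the one SA uses on its own, and
compare the two results.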

> -rw-rw-rw-  1 root nobody     1236 Mar 14 00:22 bayes.mutex 
> -rw-rw-rw-  1 root nobody 10452992 Mar 14 00:22 bayes_seen 
> -rw-rw-rw-  1 root nobody  5509120 Mar 14 00:02 bayes_toks 

bayes_seen is quite large. I have never seen it larger than bayes_toks on our
systems. But maybe that's normal for high-volume systems, I don't know. On the
MailScanner list many people complain about very big bayes_seen files.
Someone else on this list should comment on the size.

> I can assure you noone has touched anything inside this directory. If this 
> is the reason for the problems I've been facing, is there a way to recreate 
> the file without having to lose my current data? (perhaps by copying the 
> above files somewhere, execute sa-learn --clear and some time later restore 
> the above files?)

Don't know if this would be of any help. As I said, I suspect you are using at
least two different bayes dbs, at least when you work from the command line.
Run "updatedb" and then "locate bayes" (this may not find all files, for
instance those in /var!).
MS, of course, can only use one configuration and has no chance of confusing
them, so when it uses SA, learning and checking go against the same db. And so
far that part seems to be okay (except for the bigger size of bayes_seen, but
as I said, this may be normal for your setup, I really don't know). But you
burn your tokens too fast. At least that's what I think.
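
If locate misses some directories, a rough Python alternative is to walk the
filesystem and report every directory that holds a bayes_toks file. This is
only a sketch: bayes_db_dirs() is an illustrative name, and I am assuming the
dbs use the default bayes_toks file name seen in your listing.

```python
# Sketch only: find directories that contain a bayes_toks file.
import os
import sys


def bayes_db_dirs(root):
    """Return a sorted list of directories under root holding bayes_toks."""
    found = []
    # os.walk skips unreadable directories silently by default.
    for dirpath, _dirnames, filenames in os.walk(root):
        if "bayes_toks" in filenames:
            found.append(dirpath)
    return sorted(found)


if __name__ == "__main__":
    # Pass the starting directory as the first argument; default is ".".
    start = sys.argv[1] if len(sys.argv) > 1 else "."
    for directory in bayes_db_dirs(start):
        print(directory)
```

If this prints more than one directory, that would support the suspicion that
more than one bayes db is in use.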


Kai

-- 
Kai Schätzl, Berlin, Germany
Get your web at Conactive Internet Services: http://www.conactive.com
IE-Center: http://ie5.de & http://msie.winware.org


