Hi,

I have a bayes db that's about 160MB, with a 40MB token db, on a system
handling about 100k messages per day. I've just raised max_db_size to
1.1M tokens (there are currently 1.06M tokens in there). I've also
changed bayes to write to the journal instead of directly to the
database, and I now check periodically to see whether the journal needs
to be synced.
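For reference, here's roughly how I understand that setup maps onto
local.cf options (a sketch; the specific values are from my setup as
described above, and bayes_auto_expire is an assumption about how one
might disable automatic expiry when syncing manually):

```
# local.cf -- illustrative sketch, not a recommendation
bayes_expiry_max_db_size  1100000   # token cap raised to 1.1M
bayes_learn_to_journal    1         # write learns to the journal, not the db
bayes_auto_expire         0         # only if expiring manually instead
# then sync periodically, e.g. from cron:
#   sa-learn --sync
```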

Can someone explain to me the relationship between the frequency of
"1-occurrence tokens" and the size of the database? Here is the output
from a recent manual sync:

token frequency: 1-occurrence tokens: 72.60%
token frequency: less than 8 occurrences: 18.11%
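Just doing the arithmetic on those two figures (the 1.06M token count
and the 72.60% share are from my sync output; the rest is simple
multiplication), the singletons alone account for the bulk of the
database:

```python
# Back-of-the-envelope: how many of the ~1.06M tokens were seen only once?
total_tokens = 1_060_000
singleton_share = 0.726          # "1-occurrence tokens: 72.60%"
singletons = round(total_tokens * singleton_share)
print(f"{singletons:,} tokens seen exactly once")  # 769,560
```

So roughly three quarters of the tokens occupy space while each having
been seen only a single time.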

I was thinking that because so many tokens are seen only once, the
database was too big, so I lowered the limit back down, but I think
that was a mistake. Now some of the same emails consistently hit only
BAYES_50 while other, seemingly identical ones hit BAYES_99. I've since
raised the number of tokens available again and continue to manually
train the database with spam and ham (there are about 1.1M spam and
500k ham currently).

Are these the numbers I should expect to see? What does a typical
bayes db look like for a larger site?

Have I configured something wrong, or am I misunderstanding how this
works? Is there something else I should read?

Thanks,
Alex
