Hi, I have a bayes db that's about 160MB with a 40MB token db, on a system handling about 100k messages per day. I've just raised max_db_size to 1.1M tokens (there are currently 1.06M tokens in there). I've also changed bayes to write to the journal instead of directly to the database, and I check periodically to see if the journal needs to be synced.
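In case it helps, here's roughly what I have in local.cf for the changes above (assuming I've got the option names right):

```
# local.cf -- Bayes settings described above
bayes_expiry_max_db_size 1100000   # ~1.1M tokens
bayes_learn_to_journal   1         # write learns to the journal, not the db
```

The periodic sync is just `sa-learn --sync`, run from cron.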
Can someone explain to me the relationship between the frequency of "1-occurrence tokens" and the size of the database? Here is the output from a recent manual sync:

token frequency: 1-occurrence tokens: 72.60%
token frequency: less than 8 occurrences: 18.11%

I was thinking that because the tokens are seen only once, the database was too big, so I lowered it back down, but I think that was a mistake. Now some of the same emails consistently hit only BAYES_50 while others that look the same hit BAYES_99. I've since raised the number of tokens available and continue to manually train the database with spam and ham (there are about 1.1M spam and 500k ham currently).

Are these the numbers I should expect to see? What does a typical bayes db look like for a larger site? Have I configured something wrong, or am I misunderstanding how this works? Is there something else I should read?

Thanks,
Alex