Our mail server processes about 75k-100k messages a day and runs a
force-expire once a day at around 2am. The bayes db (in SQL) currently has
bayes_expiry_max_db_size set to 500k, and when the expiration runs the db is
normally cut roughly in half:

[7435] dbg: bayes: expiry check keep size, 0.75 * max: 375000
[7435] dbg: bayes: token count: 1270581, final goal reduction size: 895581
[7435] dbg: bayes: first pass? current: 1161323405, Last: 1161237076, atime: 43200, count: 796764, newdelta: 38433, ratio: 1.12402292272241, period: 43200
[7435] dbg: bayes: can't use estimation method for expiry, unexpected result, calculating optimal atime delta (first pass)
[7435] dbg: bayes: expiry max exponent: 9
[7435] dbg: bayes: atime     token reduction
[7435] dbg: bayes: ========  ===============
[7435] dbg: bayes: 43200     700149
[7435] dbg: bayes: 86400     340572
[7435] dbg: bayes: 172800    0
...
[7435] dbg: bayes: first pass decided on 43200 for atime delta
[7435] dbg: bayes: expiry completed
expired old bayes database entries in 72 seconds
570476 entries kept, 700105 deleted
token frequency: 1-occurrence tokens: 60.16%
token frequency: less than 8 occurrences: 21.57%
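
As a sanity check on the debug output above, the arithmetic can be reproduced in a few lines of Python. This is only a sketch: the constants are copied from the log, while the 12-hour lower bound on the estimated delta and the "closest reduction to the goal" selection rule are assumptions inferred from the output, not taken from the SpamAssassin source.

```python
# Values copied from the debug output.
MAX_DB_SIZE = 500_000       # configured max db size (500k)
TOKEN_COUNT = 1_270_581     # "token count" before this expiry
LAST_ATIME_DELTA = 43_200   # "atime" of the previous expiry, in seconds
LAST_REDUCTION = 796_764    # "count" of tokens removed by the previous expiry
MIN_ATIME_DELTA = 43_200    # ASSUMPTION: estimates below 12 hours are rejected

# "expiry check keep size, 0.75 * max"
keep_size = int(0.75 * MAX_DB_SIZE)

# "final goal reduction size"
goal_reduction = TOKEN_COUNT - keep_size

# Estimation pass: scale the previous atime delta by how much more
# (or less) needs to be removed this time.
ratio = goal_reduction / LAST_REDUCTION
new_delta = int(LAST_ATIME_DELTA / ratio)
estimate_usable = new_delta >= MIN_ATIME_DELTA

# First pass: candidate atime deltas and their reductions, from the table.
candidates = {43_200: 700_149, 86_400: 340_572, 172_800: 0}
# ASSUMPTION: the candidate whose reduction is closest to the goal wins.
chosen = min(candidates, key=lambda atime: abs(candidates[atime] - goal_reduction))

print("keep size:      ", keep_size)
print("goal reduction: ", goal_reduction)
print("ratio:          ", ratio)
print("new delta:      ", new_delta, "(usable)" if estimate_usable else "(rejected)")
print("first pass pick:", chosen)
```

Under those assumptions the sketch gives the same keep size, goal reduction, ratio, newdelta and chosen atime delta as the log, which would explain why the estimation method was skipped in favour of a first pass.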

I'm guessing the 1-occurrence tokens aren't all that useful, so trimming
them isn't doing much harm, but I wanted to get some advice to make sure.
I haven't changed the max db size in over a year, and the token count before
expiration is much higher than it was a year ago. Would raising the max db
size by 50-100% help make the classifier more accurate?
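
To put numbers on the 50-100% increase, here is a rough Python sketch of what the expiry targets would be for a few max db sizes, using the pre-expiry token count from the run above. The helper name `expiry_targets` is hypothetical; the only rule it relies on is the "0.75 * max" keep size shown in the debug output.

```python
def expiry_targets(token_count, max_db_size):
    """Return (keep_size, goal_reduction) for a given max db size.

    keep_size follows the "0.75 * max" line in the debug output;
    goal_reduction is how many tokens expiry would try to remove.
    """
    keep_size = int(0.75 * max_db_size)
    goal_reduction = max(0, token_count - keep_size)
    return keep_size, goal_reduction


TOKEN_COUNT = 1_270_581  # token count before expiry, from the log

# Current setting, +50% and +100%.
for max_db_size in (500_000, 750_000, 1_000_000):
    keep, goal = expiry_targets(TOKEN_COUNT, max_db_size)
    print(f"max={max_db_size:>9,}  keep={keep:>7,}  goal reduction={goal:>7,}")
```

For comparison, the run above kept 570476 tokens, which is already well over the 375000 keep size, apparently because the smallest atime delta tried (43200) only removed about 700k tokens.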
--
Ryan Moore
----------
Perigee.net Corporation
704-849-8355 (sales)
704-849-8017 (tech)
www.perigee.net