-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA1

Kai Schaetzl writes:
> The problem seems to exist on all of our Bayes databases, and I think the
> cause is not "bad" data but simply the way the SA expiry algorithm works.
> There are no negative atimes or atimes in the future. If the database
> contains tokens from a wide time range, it is not able to calculate a
> reasonable expiry atime and quits. This typically happens when you
> set bayes_expiry_max_db_size to a high value and it takes some time to
> fill up. When it finally hits the limit and wants to start the first
> expire after maybe months of never expiring, it fails.

So you wind up with a very big, but unexpirable, db?  I think that would
be worth a bug, yes.  In my opinion, expiry should always do *something*
to get the db below a target size, even if that *something* isn't
strictly token removal by atime.

- --j.

> Can something be done about the problem? Shall I submit a bug on it?
> (I already submitted bug #3872, where I mention this problem, but it's
> not directly related to bug #3872.) SA could either do more iterations
> or try a completely different approach. For instance, if it is told to
> expire 50,000 tokens, it should remove the oldest entries until 50,000
> tokens are removed and then stop. I understand that this would take a
> bit longer since the db needs to be sorted first, but it should be
> feasible.
>
> If this problem isn't fixed, using "bayes_auto_expire 1" is an open game.
>
> Here are examples (each one is from a different database since I don't
> have examples from the same db "before and after", but they are very
> similar in size and structure. Some are also version 2 and not 3.)
>
> n9:/home/spamd/bayes # sa-learn --dump magic
> 0.000          0          3          0  non-token data: bayes db version
> 0.000          0      19760          0  non-token data: nspam
> 0.000          0       5706          0  non-token data: nham
> 0.000          0     736251          0  non-token data: ntokens
> 0.000          0 1052059392          0  non-token data: oldest atime
> 0.000          0 1097242496          0  non-token data: newest atime
> 0.000          0 1097248297          0  non-token data: last journal sync atime
> 0.000          0 1097248490          0  non-token data: last expiry atime
> 0.000          0   29754654          0  non-token data: last expire atime delta
> 0.000          0         36          0  non-token data: last expire reduction count
>
> This db contains tokens going back to March 2003 or so. It works quite
> fine and marks almost every spam message with BAYES_99. Size is about
> 20 MB; max_db_size was set to 1,000,000, which made it skip any expire
> for some time (I don't know from when to when).
>
> Here's the failed result for a forced expire (with max_db_size set to
> 500,000).
>
> debug: bayes: expiry check keep size, 0.75 * max: 375000
> debug: bayes: token count: 736251, final goal reduction size: 361251
> debug: bayes: First pass?  Current: 1097248298, Last: 1096983812, atime: 29754654, count: 36, newdelta: 2965, ratio: 10034.75, period: 43200
> debug: bayes: Can't use estimation method for expiry, something fishy, calculating optimal atime delta (first pass)
> debug: bayes: expiry max exponent: 9
> debug: bayes: atime     token reduction
> debug: bayes: ========  ===============
> debug: bayes: 43200     735241
> debug: bayes: 86400     734058
> debug: bayes: 172800    733218
> debug: bayes: 345600    731427
> debug: bayes: 691200    728680
> debug: bayes: 1382400   721684
> debug: bayes: 2764800   712668
> debug: bayes: 5529600   679017
> debug: bayes: 11059200  668118
> debug: bayes: 22118400  553162
> debug: bayes: couldn't find a good delta atime, need more token difference, skipping expire.
> debug: Syncing complete.
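
The figures in the "First pass?" debug line above can be reproduced with a little arithmetic. Here is a minimal Python sketch, assuming the rules that the log output and the sa-learn documentation suggest: keep the larger of 75% of bayes_expiry_max_db_size or 100,000 tokens, and guess a new atime delta by scaling the previous delta by the previous reduction count over the new goal. The function name and its parameters are made up for illustration; this is not SpamAssassin's actual Perl code.

```python
def expiry_estimate(ntokens, max_db_size, last_delta, last_count,
                    keep_fraction=0.75, min_keep=100_000):
    """Return (keep_size, goal_reduction, new_delta, ratio).

    Assumed rules (inferred from the debug output, not taken from SA source):
      keep_size      = max(keep_fraction * max_db_size, min_keep)
      goal_reduction = ntokens - keep_size
      new_delta      = last_delta * last_count / goal_reduction
      ratio          = larger of (last_count, goal) divided by the smaller
    """
    keep_size = max(int(max_db_size * keep_fraction), min_keep)
    goal = ntokens - keep_size
    if goal <= 0 or last_count <= 0:
        raise ValueError("nothing to expire or no previous expire to scale from")
    # Integer division keeps the result exact for large products.
    new_delta = last_delta * last_count // goal
    ratio = max(last_count, goal) / min(last_count, goal)
    return keep_size, goal, new_delta, ratio


if __name__ == "__main__":
    # Numbers from the first database above (max_db_size forced to 500,000).
    print(expiry_estimate(736251, 500_000, 29754654, 36))
```

With the first database's numbers this yields a keep size of 375000, a goal of 361251, a guessed delta of 2965 seconds and a ratio of 10034.75, matching the log. A guessed delta far below the 43200-second period and a ratio that large are presumably what the "something fishy" message is reacting to.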
>
> Finally, after setting bayes_expiry_max_db_size to 100,000, the expire
> works because the reduction goal is high enough, and it expires down to
> 162,000. It's just that I didn't want to throw out more than 500,000
> tokens :-(
>
> Here's the result after expiring so many tokens (remember, this is not
> the same db; it was some days ago on another machine!)
>
> 0.000          0          2          0  non-token data: bayes db version
> 0.000          0      19172          0  non-token data: nspam
> 0.000          0       5379          0  non-token data: nham
> 0.000          0     162010          0  non-token data: ntokens
> 0.000          0 1074822619          0  non-token data: oldest atime
> 0.000          0 1096936738          0  non-token data: newest atime
> 0.000          0          0          0  non-token data: last journal sync atime
> 0.000          0 1096992499          0  non-token data: last expiry atime
> 0.000          0   22118400          0  non-token data: last expire atime delta
> 0.000          0     553013          0  non-token data: last expire reduction count
>
> But the problem already hits again with the next --force-expire:
>
> debug: bayes: found bayes db version 2
> debug: bayes: expiry check keep size, 75% of max: 75000
> debug: bayes: expiry keep size too small, resetting to 100,000 tokens
> debug: bayes: token count: 162010, final goal reduction size: 62010
> debug: bayes: First pass?  Current: 1096992487, Last: 1096988477, atime: 22118400, count: 553013, newdelta: 197254680, ratio: 8.91812610869215
> debug: bayes: Can't use estimation method for expiry, something fishy, calculating optimal atime delta (first pass)
> debug: bayes: atime     token reduction
> debug: bayes: ========  ===============
> debug: bayes: 43200     162006
> debug: bayes: 86400     162006
> debug: bayes: 172800    162006
> debug: bayes: 345600    162006
> debug: bayes: 691200    162006
> debug: bayes: 1382400   162006
> debug: bayes: 2764800   161954
> debug: bayes: 5529600   130225
> debug: bayes: 11059200  119126
> debug: bayes: 22118400  0
> debug: bayes: couldn't find a good delta atime, need more token difference, skipping expire.
>
> This was a few days ago.
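
The two "atime / token reduction" tables above end in the same "couldn't find a good delta atime" message for what look like opposite reasons. The following Python sketch shows one way that can happen, assuming the selection rule described in the sa-learn documentation: try deltas of 12 hours times powers of two, walk from the largest delta downwards, take the delta that expires the most tokens without exceeding the goal, and give up if that would remove fewer than 1,000 tokens. The function `pick_delta` is hypothetical and only approximates what SpamAssassin does.

```python
PERIOD = 43200                                  # 12 hours, as in "period: 43200"
DELTAS = [PERIOD * 2 ** n for n in range(10)]   # 43200 ... 22118400 (256 days)


def pick_delta(reductions, goal, min_reduction=1000):
    """Pick an atime delta from a {delta: tokens_that_would_expire} table.

    Returns the chosen delta, or None if no delta is usable. This mirrors the
    documented rule only loosely and is not SpamAssassin's own code.
    """
    best = None
    for delta in sorted(reductions, reverse=True):   # largest delta first
        if reductions[delta] > goal:
            break                                    # would expire too much
        best = delta
    if best is None or reductions[best] < min_reduction:
        return None
    return best


if __name__ == "__main__":
    first = dict(zip(DELTAS, [735241, 734058, 733218, 731427, 728680,
                              721684, 712668, 679017, 668118, 553162]))
    second = dict(zip(DELTAS, [162006, 162006, 162006, 162006, 162006,
                               162006, 161954, 130225, 119126, 0]))
    print(pick_delta(first, 361251))    # even 256 days overshoots the goal
    print(pick_delta(second, 62010))    # 256 days removes 0, 128 days overshoots
```

In the first table even the largest delta would remove 553162 tokens against a goal of 361251. In the second, the largest delta removes nothing and the next one down already removes 119126 against a goal of 62010. Under the assumed rule neither table contains a usable step, because doubling the delta makes the candidate reductions too coarse for these databases.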
> Today, finally, the expiry worked again and removed about a thousand
> tokens. And, again, the next forced expiry doesn't work. Maybe it will
> work again in three days. Here's the magic dump at the moment:
>
> 0.000          0          2          0  non-token data: bayes db version
> 0.000          0      19172          0  non-token data: nspam
> 0.000          0       5379          0  non-token data: nham
> 0.000          0     160600          0  non-token data: ntokens
> 0.000          0 1075078200          0  non-token data: oldest atime
> 0.000          0 1097195892          0  non-token data: newest atime
> 0.000          0          0          0  non-token data: last journal sync atime
> 0.000          0 1097250405          0  non-token data: last expiry atime
> 0.000          0   22118400          0  non-token data: last expire atime delta
> 0.000          0       1410          0  non-token data: last expire reduction count
>
> Kai
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1.2.4 (GNU/Linux)
Comment: Exmh CVS

iD8DBQFBZsKzQTcbUG5Y7woRAkxcAKDf0gvThy4vVp2feI+hcaeFTDBKQQCgok19
/rIaDDSyyfQc0WlV3naz3ao=
=B2lX
-----END PGP SIGNATURE-----
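
Kai's suggestion quoted above (remove the oldest entries until the requested number of tokens is gone, then stop) could be sketched roughly as follows. This is an illustration in Python over a plain in-memory dict of token to atime; the function name is invented, and a real implementation would have to work against the Bayes store backend instead.

```python
import heapq


def expire_oldest(atimes, count):
    """Remove the `count` tokens with the smallest atime from `atimes`.

    `atimes` maps token -> last access time (seconds since the epoch).
    Returns the removed tokens, oldest first. Unlike a delta-based expiry,
    this always hits the requested count exactly (or empties the dict).
    """
    # nsmallest avoids a full sort when count is much smaller than the db.
    victims = heapq.nsmallest(count, atimes, key=atimes.get)
    for token in victims:
        del atimes[token]
    return victims


if __name__ == "__main__":
    db = {"viagra": 1052059392, "meeting": 1097242496,
          "refinance": 1074822619, "patch": 1096936738}
    print(expire_oldest(db, 2))
    print(sorted(db))
```

The trade-off is the one Kai mentions: selecting the oldest entries needs a pass over every token plus a partial sort, but the result no longer depends on whether a power-of-two delta happens to land near the goal.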