-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA1

Kai Schaetzl writes:
> The problem seems to exist on all of our Bayes databases and I think the 
> cause is not "bad" data, but simply the way the SA expiry algorithm works. 
> There are no negative atimes or atimes in the future. If the database 
> contains tokens from a wide time range it's not able to calculate a 
> reasonable expiry atime and quits. This typically happens when you 
> set bayes_expiry_max_db_size to a high value and it takes some time to 
> fill up. When it finally hits the limit and wants to start the first 
> expire after maybe months of never expiring it fails.

So you wind up with a very big, but unexpirable, db?   I think
that would be worth a bug, yes.

In my opinion, expiry should always do *something* to get the db
below a target size, even if that *something* isn't strictly token
removal by atime.

- --j.

> Can something be done about the problem, shall I submit a bug on it? 
> (Already submitted bug #3872 where I mention this problem, but it's not 
> directly related to bug #3872.) SA could either do more iterations or try 
> a completely different approach. For instance, if it is told to expire 50.000 
> tokens it should remove all old entries until the 50.000 tokens are 
> removed and then stop. I understand that this would take a bit longer 
> since the db needs to be sorted first but it should be feasible.
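
The sort-based approach proposed above can be sketched as follows. This is only an illustration of the idea, not SpamAssassin code: the function name expire_oldest and the in-memory dict of token -> atime are hypothetical stand-ins for the real Bayes store.

```python
import heapq

def expire_oldest(tokens, n_to_remove):
    """Remove the n_to_remove tokens with the oldest atime.

    tokens is a hypothetical in-memory map of token -> atime (epoch seconds);
    the real Bayes store is a database, not a dict.
    Returns the removed tokens, oldest first.
    """
    # nsmallest avoids a full sort when only part of the db is expired.
    oldest = heapq.nsmallest(n_to_remove, tokens.items(), key=lambda kv: kv[1])
    for tok, _atime in oldest:
        del tokens[tok]
    return [tok for tok, _atime in oldest]

db = {"a": 100, "b": 50, "c": 300, "d": 200}
removed = expire_oldest(db, 2)
```

With this approach the number of removed tokens is exactly the requested count (or the whole db if it is smaller), regardless of how the atimes are distributed.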
> 
> If this problem isn't fixed, using "bayes_auto_expire 1" is an open game.
> 
> Here are examples (each one is from a different database since I don't 
> have examples from the same db "before and after", but they are very 
> similar in size and structure. Some are also version 2 and not 3.)
> 
> n9:/home/spamd/bayes # sa-learn --dump magic
> 0.000          0          3          0  non-token data: bayes db version
> 0.000          0      19760          0  non-token data: nspam
> 0.000          0       5706          0  non-token data: nham
> 0.000          0     736251          0  non-token data: ntokens
> 0.000          0 1052059392          0  non-token data: oldest atime
> 0.000          0 1097242496          0  non-token data: newest atime
> 0.000          0 1097248297          0  non-token data: last journal sync atime
> 0.000          0 1097248490          0  non-token data: last expiry atime
> 0.000          0   29754654          0  non-token data: last expire atime delta
> 0.000          0         36          0  non-token data: last expire reduction count
> 
> This db contains tokens going back to March 2003 or so. It works quite 
> fine and marks almost every spam message with BAYES_99. Size is about 20 
> MB, max_db_size was set to 1.000.000 which made it skip any expire for 
> some time (don't know from when to when).
> 
> Here's the failed result for a forced expire (with max_db_size set to 
> 500.000).
> 
> debug: bayes: expiry check keep size, 0.75 * max: 375000
> debug: bayes: token count: 736251, final goal reduction size: 361251
> debug: bayes: First pass?  Current: 1097248298, Last: 1096983812, atime: 29754654, count: 36, newdelta: 2965, ratio: 10034.75, period: 43200
> debug: bayes: Can't use estimation method for expiry, something fishy, calculating optimal atime delta (first pass)
> debug: bayes: expiry max exponent: 9
> debug: bayes: atime     token reduction
> debug: bayes: ========  ===============
> debug: bayes: 43200     735241
> debug: bayes: 86400     734058
> debug: bayes: 172800    733218
> debug: bayes: 345600    731427
> debug: bayes: 691200    728680
> debug: bayes: 1382400   721684
> debug: bayes: 2764800   712668
> debug: bayes: 5529600   679017
> debug: bayes: 11059200  668118
> debug: bayes: 22118400  553162
> debug: bayes: couldn't find a good delta atime, need more token difference, skipping expire.
> debug: Syncing complete.
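
The numbers in this log can be reproduced with a short sketch. It is inferred from the debug output only and is not the SpamAssassin source: the function names are hypothetical, the 100,000-token minimum keep size is taken from the second log further down, and the 1000-token minimum reduction and the "largest reduction that does not exceed the goal" selection rule are assumptions that happen to match the outcomes shown here.

```python
START_DELTA = 43200    # smallest atime delta in the debug table (12 hours)
MAX_EXPONENT = 9       # "expiry max exponent: 9"
MIN_KEEP = 100_000     # "expiry keep size too small, resetting to 100,000 tokens"
MIN_REDUCTION = 1000   # assumption: expiring fewer tokens is not worth it

def goal_reduction(ntokens, max_db_size):
    """Tokens to remove so that 75% of max_db_size (at least MIN_KEEP) remain."""
    keep = max(int(0.75 * max_db_size), MIN_KEEP)
    return ntokens - keep

def candidate_deltas():
    """The atime deltas tried in the first pass: 43200 * 2**0 .. 43200 * 2**9."""
    return [START_DELTA * 2**i for i in range(MAX_EXPONENT + 1)]

def pick_delta(table, goal):
    """table is a list of (delta, reduction) rows as printed in the log.

    Assumed rule: take the delta that removes the most tokens without
    removing more than goal; return None if no row qualifies.
    """
    usable = [(red, delta) for delta, red in table if MIN_REDUCTION <= red <= goal]
    return max(usable)[1] if usable else None

# Token reductions copied from the log above, one per candidate delta.
reductions = [735241, 734058, 733218, 731427, 728680,
              721684, 712668, 679017, 668118, 553162]
table = list(zip(candidate_deltas(), reductions))

goal_500k = goal_reduction(736251, 500_000)   # 361251, as logged
goal_100k = goal_reduction(736251, 100_000)   # 636251
choice_500k = pick_delta(table, goal_500k)    # None: every row removes too much
choice_100k = pick_delta(table, goal_100k)    # 22118400
```

Under these assumptions even the largest delta would remove 553162 tokens, more than the goal of 361251, so no delta qualifies. With the limit lowered to 100.000 the goal rises to 636251 and the 22118400 row becomes usable.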
> 
> Finally, after setting bayes_expiry_max_db_size to 100.000 the expire 
> works because the reduction goal is high enough and expires down to 
> 162.000. Just that I didn't want to throw out more than 500.000 tokens :-(
> 
> Here's the result after expiring so many tokens (remember, this is not the 
> same db, it was some days ago on another machine!)
> 
> 0.000          0          2          0  non-token data: bayes db version
> 0.000          0      19172          0  non-token data: nspam
> 0.000          0       5379          0  non-token data: nham
> 0.000          0     162010          0  non-token data: ntokens
> 0.000          0 1074822619          0  non-token data: oldest atime
> 0.000          0 1096936738          0  non-token data: newest atime
> 0.000          0          0          0  non-token data: last journal sync atime
> 0.000          0 1096992499          0  non-token data: last expiry atime
> 0.000          0   22118400          0  non-token data: last expire atime delta
> 0.000          0     553013          0  non-token data: last expire reduction count
> 
> but the problem already hits again with the next --force-expire:
> 
> debug: bayes: found bayes db version 2
> debug: bayes: expiry check keep size, 75% of max: 75000
> debug: bayes: expiry keep size too small, resetting to 100,000 tokens
> debug: bayes: token count: 162010, final goal reduction size: 62010
> debug: bayes: First pass?  Current: 1096992487, Last: 1096988477, atime: 22118400, count: 553013, newdelta: 197254680, ratio: 8.91812610869215
> debug: bayes: Can't use estimation method for expiry, something fishy, calculating optimal atime delta (first pass)
> debug: bayes: atime     token reduction
> debug: bayes: ========  ===============
> debug: bayes: 43200     162006
> debug: bayes: 86400     162006
> debug: bayes: 172800    162006
> debug: bayes: 345600    162006
> debug: bayes: 691200    162006
> debug: bayes: 1382400   162006
> debug: bayes: 2764800   161954
> debug: bayes: 5529600   130225
> debug: bayes: 11059200  119126
> debug: bayes: 22118400  0
> debug: bayes: couldn't find a good delta atime, need more token difference, skipping expire.
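
The "First pass?" lines in both logs are consistent with a simple proportional estimate: scale the last atime delta by the last reduction count over the current goal. The sketch below reproduces the logged newdelta and ratio values. The function name estimate is hypothetical, and the rejection thresholds (ratio above 1.5, new delta below 43200) are assumptions, not values shown in the logs.

```python
def estimate(last_delta, last_count, goal):
    """Return (new_delta, ratio), or None if the inputs cannot be scaled.

    new_delta scales the previous atime delta by last_count / goal;
    ratio says how far the previous reduction count is from the new goal.
    """
    if last_count <= 0 or goal <= 0:
        return None
    ratio = max(goal / last_count, last_count / goal)
    new_delta = int(last_delta * last_count / goal)  # truncated, as in the logs
    return new_delta, ratio

def estimation_usable(new_delta, ratio, max_ratio=1.5, min_delta=43200):
    # Assumed thresholds; the logs only say "something fishy".
    return ratio <= max_ratio and new_delta >= min_delta

first = estimate(29754654, 36, 361251)       # first log
second = estimate(22118400, 553013, 62010)   # second log
```

In both cases the previous expiry removed a very different number of tokens than the current goal, so under the assumed thresholds the estimate is rejected and the first-pass table is computed instead.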
> 
> This was a few days ago. Today, finally, the expiry worked again and 
> removed about a thousand tokens. And, again, next forced expiry doesn't 
> work. Maybe it will work in three days again. Here's the magic dump at the 
> moment:
> 
> 0.000          0          2          0  non-token data: bayes db version
> 0.000          0      19172          0  non-token data: nspam
> 0.000          0       5379          0  non-token data: nham
> 0.000          0     160600          0  non-token data: ntokens
> 0.000          0 1075078200          0  non-token data: oldest atime
> 0.000          0 1097195892          0  non-token data: newest atime
> 0.000          0          0          0  non-token data: last journal sync atime
> 0.000          0 1097250405          0  non-token data: last expiry atime
> 0.000          0   22118400          0  non-token data: last expire atime delta
> 0.000          0       1410          0  non-token data: last expire reduction count
> 
> Kai
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1.2.4 (GNU/Linux)
Comment: Exmh CVS

iD8DBQFBZsKzQTcbUG5Y7woRAkxcAKDf0gvThy4vVp2feI+hcaeFTDBKQQCgok19
/rIaDDSyyfQc0WlV3naz3ao=
=B2lX
-----END PGP SIGNATURE-----
