user-db size, content confusions (how many toks?)

Linda Walsh Sun, 29 Mar 2009 16:08:49 -0700


I see 3 DBs in my user directory (.spamassassin):


auto-whitelist  (~80MB)
bayes_seen      (~40MB)
bayes_toks      (~20MB)

I was trying to find the relation of 'bayes_expiry_max_db_size' to the physical
size of the above files.  I'm finding some answers, but I've run into some
seeming "contradictions".  I had db_size set to 500,000, then reduced it to
250,000 and to the 'default' (150,000) during testing.

In trying to lower 'db_size' and see how that affected physical sizes,
I ran sa-learn --force-expire and saw these debug messages of note:

[30905] dbg: bayes: expiry check keep size, 0.75 * max: 112500
[30905] dbg: bayes: token count: 0, final goal reduction size: -112500
[30905] dbg: bayes: reduction goal of -112500 is under 1,000 tokens, skipping expire
[30905] dbg: bayes: expiry completed
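
The arithmetic these debug lines appear to describe can be sketched as follows. This is only inferred from the messages themselves, not taken from the SpamAssassin source; the function name `expiry_goal` and its parameter names are made up for illustration, and the 0.75 factor and 1,000-token threshold are simply the values printed above.

```python
def expiry_goal(token_count, max_db_size=150_000,
                keep_fraction=0.75, min_reduction=1_000):
    """Reproduce the numbers in the debug output (hypothetical helper).

    keep_size : "expiry check keep size, 0.75 * max"
    goal      : "final goal reduction size" = token count - keep size
    skip      : True when the goal is under the 1,000-token threshold
    """
    keep_size = int(keep_fraction * max_db_size)
    goal = token_count - keep_size
    skip = goal < min_reduction
    return keep_size, goal, skip


# With the token count the expire run reported (0):
print(expiry_goal(0))        # (112500, -112500, True)

# With the ntokens value that --dump magic reports (491743):
print(expiry_goal(491743))   # (112500, 379243, False)
```

Under this reading, the skipped expire follows entirely from the "token count: 0" input; with the count that --dump magic reports, the same arithmetic would ask for a reduction of 379,243 tokens.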

---
First problem (contradiction): the debug output above says "token count: 0".
(This is with a combined bayes db size of 60MB (_seen + _toks).)

It seems to think I have no bayes data.  I saw another debug message that
indicated the bayes classifier was untrained (<~150? entries) & disabled.

I don't know how it got zeroed, but I tried adding 'ham' by running sa-learn
over a despam'ed mailbox.  The first run showed:

Learned tokens from 55 message(s) (55 message(s) examined)

But subsequent runs of sa-learn with debug + expire still show "token count: 0".

sa-learn --dump magic shows something different:
0.000          0          3          0  non-token data: bayes db version
0.000          0     556414          0  non-token data: nspam
0.000          0     574441          0  non-token data: nham
0.000          0     491743          0  non-token data: ntokens
0.000          0 1216456288          0  non-token data: oldest atime
0.000          0 1237796146          0  non-token data: newest atime
0.000          0 1220476831          0  non-token data: last journal sync atime
0.000          0 1217838535          0  non-token data: last expiry atime
0.000          0    1382400          0  non-token data: last expire atime delta
0.000          0      70612          0  non-token data: last expire reduction count
---------
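
To compare these values against the expire debug output without reading them by eye, the dump can be parsed with a short script. This is a minimal sketch that assumes the layout shown above, where each "non-token data:" line carries its value in the third column; `parse_magic` is a hypothetical helper, not part of SpamAssassin.

```python
def parse_magic(text):
    """Map each 'non-token data' label to its integer value.

    Assumes the layout of the dump above: four numeric columns
    followed by 'non-token data: <label>', value in column three.
    """
    values = {}
    for line in text.splitlines():
        if "non-token data:" not in line:
            continue  # skip separators and any real token lines
        fields, label = line.split("non-token data:", 1)
        columns = fields.split()
        values[label.strip()] = int(columns[2])
    return values


sample = """\
0.000          0          3          0  non-token data: bayes db version
0.000          0     556414          0  non-token data: nspam
0.000          0     574441          0  non-token data: nham
0.000          0     491743          0  non-token data: ntokens
"""

magic = parse_magic(sample)
print(magic["ntokens"])   # 491743
```

The sample could instead be the captured output of `sa-learn --dump magic`, for example piped into the script on standard input.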

Does the above indicate 0 tokens?  I.e., doesn't 'ntokens' = 491743 mean
slightly under 500K tokens (my original limit before I tried running
sa-learn --force-expire with debug manually)?


It's as if sa-learn --dump magic shows a 'db' corresponding to my old limit
(one that I think is still being 'auto-expired', and so might not have been
pruned to the new figure yet, as auto-expire runs about once per 24 hours, if
I understand normal spamd workings).

So is the --dump magic output perhaps what is seen and 'size-controlled' by
auto-expire (the limit was ~500K before my recent test changes)?

Why isn't 'sa-learn --force-expire' seeing the tokens indicated in
sa-learn --dump magic?  The debug messages point at the same file
for both operations, so how can --dump magic indicate ~500K tokens while the
debug output of sa-learn --force-expire somehow sees 0 tokens?

Am I misinterpreting the debug output?

Thanks,
Linda
