Re: user-db size, content confusions (how many toks?)

Matt Kettler Tue, 31 Mar 2009 20:22:51 -0700

Linda Walsh wrote:
> Matt Kettler wrote:
>>> I see 3 DBs in my user directory (.spamassassin):
>>>    auto-whitelist (~80MB),  bayes_seen (~40MB),  bayes_toks (~20MB)
>>> Was trying to find the relation of 'bayes_expiry_max_db_size' to the
>>> physical size of the above files.
> ---
>
>> expiry will only affect bayes_toks. Currently neither auto-whitelist nor
>> bayes_seen have any expiry mechanism at all.
> ---
> So they just grow without limit?
Yep. Not ideal, and there are bugs open on both.
>   How often are they loaded?
IIRC, at the creation of a Mail::SpamAssassin instance, but I'm not well
versed in that aspect of the code.
> Does only "spamd" access the auto-whitelist?
Well, any Mail::SpamAssassin instance. (spamd, the "spamassassin"
script, etc). spamc, on the other hand, is not a Mail::SpamAssassin
instance, and doesn't access *any* of the SA config files or databases.
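
For checking which of these files a given directory actually holds, and how large and how recently modified each one is, here is a minimal Python sketch. It is not part of SA; the helper name db_report is made up for illustration, and it assumes the three default file names mentioned in this thread sit directly in the directory you pass in.

```python
from datetime import datetime
from pathlib import Path

# File names as they appear in this thread; other setups may differ.
DB_NAMES = ("auto-whitelist", "bayes_seen", "bayes_toks")


def db_report(directory):
    """Return {name: (size_in_bytes, last_modified)} for each DB file found."""
    report = {}
    for name in DB_NAMES:
        path = Path(directory) / name
        if path.is_file():
            st = path.stat()
            report[name] = (st.st_size, datetime.fromtimestamp(st.st_mtime))
    return report


if __name__ == "__main__":
    # Assumes the default per-user location, ~/.spamassassin
    for name, (size, mtime) in db_report(Path.home() / ".spamassassin").items():
        print(f"{name:15s} {size / 2**20:8.1f} MB  modified {mtime:%Y-%m-%d %H:%M}")
```

Running it as each user that invokes SA (your own account, and whatever user spamd runs under) should show whether they are all looking at the same files.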


>
> Optimally, I would assume spamd opens it upon start, but it needs to
> update the disk file periodically (sync the db) for reliability.  How
> often does it 'sync'?
In the case of the whitelist, it's per-message.

In the case of bayes_seen, it's every time a message is learned.
>
>> bayes_seen can safely be deleted if you need to. It keeps track of what
>> messages have already been learned to prevent relearning them. However,
>> unless you're likely to re-feed messages to SA, bayes_seen isn't strictly
>> necessary.
> ---
>     The only refeeding would usually be 'ham', because I might rerun over
> an "Inbox" that might have old messages in it.  I don't rerun "ham"
> training often -- except to "despam" a message (one that was marked spam
> and shouldn't have been).
>
>
>
>>> I'm finding some answers, I've run into some seeming
>>> "contradictions".  ...
>>> ---
>>> First prob (contradiction): dbg above says "token count: 0".  (This is
>>> with a combined bayes db size of 60MB (_seen, _toks).)
>> Are you sure your sa-learn was using the same DB path?
> ---
>     Sure??  It listed the same filename (default location
> /home/<user>/.spamassassin/<bayes...>).  Other than that, I haven't
> tried to trace perl running spamassassin to see if it is really
> accessing the same file.  Only going off the 'debug' messages (which
> correspond to the settings in "user_prefs" that's in the default
> location dir).
>
>
>> From the sounds of it, sa-learn is using a directory with an empty DB.
> ----
>     Yeah...Doesn't make sense to me -- how would "sa-learn --dump magic"
> use a different location?  I.e. it showed ~500K tokens...
>
>
>>> I.e. doesn't 'ntokens' = 491743 mean slightly under 500K tokens?
>> Yep, looks like you have 491,743 tokens to me.
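
If you want to pull that number out programmatically rather than by eye, here is a rough Python sketch. It assumes the "non-token data" lines of the dump have the usual layout, with the value in the third column. Apart from the ntokens figure quoted in this thread, the sample values are invented for illustration, and parse_magic is a made-up helper name, not anything shipped with SA.

```python
import re

# Illustrative sample only: ntokens is the figure from this thread,
# the other values are invented placeholders.
SAMPLE = """\
0.000          0          3          0  non-token data: bayes db version
0.000          0      20000          0  non-token data: nspam
0.000          0      10000          0  non-token data: nham
0.000          0     491743          0  non-token data: ntokens
"""

# Assumed layout: four whitespace-separated columns, then the label.
MAGIC_LINE = re.compile(r"\s*\S+\s+\S+\s+(\d+)\s+\S+\s+non-token data: (.+)$")


def parse_magic(text):
    """Map each 'non-token data' label to the integer in the third column."""
    magic = {}
    for line in text.splitlines():
        match = MAGIC_LINE.match(line)
        if match:
            magic[match.group(2).strip()] = int(match.group(1))
    return magic


if __name__ == "__main__":
    print(parse_magic(SAMPLE)["ntokens"])
```

To run it against the real database, feed it the text printed by `sa-learn --dump magic` instead of SAMPLE.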
>
>>> It's like the sa-learn magic shows a 'db' corresponding to my old limit
>>> (that I think is still being 'auto-expired', so might not have pruned
>>> figure as it runs about once per 24 hours, if I understand normal spamd
>>> workings).
>> Approximately. Also, be aware that in order for spamd to use new
>> settings it needs to be restarted.
> ----
>     Having changed the user_prefs files back to the default
> setting (i.e. deleted my previous addition) -- 2 days ago, and system was
> rebooted 1day14hours ago, I'm certain spamd has been restarted.
Hmm, can you set bayes_expiry_max_db_size in a user_prefs file? That
seems like an option that might be privileged and only honored at the
site-wide level. An absurdly large value can bog the whole server down
when processing mail, so an end user could DoS your machine if allowed
to set this.



>
> YET: all db sizes are the same as before (no reduction in size
> corresponding to going 'back' to a default 150K limit), though sa-learn
> run with dbg and force-expire indicated 0 tokens -- but sa-learn
> w/dump magic indicates 500K tokens.  How can "expire" say 0 toks but
> dump-magic say 500K?
That's a big mystery to me. Doesn't make sense.
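
For reference, here is a back-of-the-envelope Python sketch of what I would expect an expiry run to aim for. It assumes the rule in the bayes_expiry_max_db_size documentation as I recall it: keep either 75% of the maximum or 100,000 tokens, whichever is larger. The function names are made up for illustration. Real expiry works on token access times, so it only approximates this goal.

```python
DEFAULT_MAX_DB_SIZE = 150_000  # default bayes_expiry_max_db_size, in tokens


def expiry_goal(max_db_size=DEFAULT_MAX_DB_SIZE):
    """Tokens expiry tries to keep: 75% of the max, but at least 100,000.

    Assumption: this mirrors the documented rule, not the actual SA code.
    """
    return max(int(max_db_size * 0.75), 100_000)


def tokens_to_remove(ntokens, max_db_size=DEFAULT_MAX_DB_SIZE):
    """Rough count of tokens an expiry run would aim to drop (0 if under the max)."""
    if ntokens <= max_db_size:
        return 0
    return ntokens - expiry_goal(max_db_size)


if __name__ == "__main__":
    # ntokens figure from this thread, against the default limit
    print(expiry_goal(), tokens_to_remove(491_743))
```

With the default limit and the 491,743 tokens from your dump, that works out to a goal of 112,500 tokens, so a working expiry should have a lot to remove rather than seeing 0 tokens.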
>
>     File timestamps show all 3-db files have been updated today.
> (Presumably by spamd processing email as it comes in).  But file sizes
> still are @ sizes indicated at top of this message: 80/40/20-MB.
>
>
>>> So is the --magic output, maybe what is seen and being
>>> 'size-controlled' by auto-expire?
>> Yes, at least, it should be.
>
>
>>> Why isn't 'sa-learn --force-expire' seeing the TOKENs indicated in
>>> sa-learn --dump magic?
>> That is particularly strange to me, and it sounds like there are some
>> problems there.
> ---
> *sigh*
>
>>
>> Can you give a bit of detail, ie: what paths are you looking at for the
>> files, what version of SA,
> ---
>     SA = old version of 3.1.7.
>     Which at the very least points to an upgrade possibly solving the
> problem, BUT this was working at one point, and I don't know why it
> 'stopped'.  I'm generally uncomfortable with fixing things that were
> working just because they have randomly stopped working without knowing
> *why* (though that discomfort has become something I've just had to deal
> with more as the Microsoft SW maintenance method becomes the norm:
> update and see if bug is gone...yes?  ok, bug gone; unclear if fixed or
> hidden, unclear about effects of other changes in a new version...).
Understood.

That said, 3.1.7 is vulnerable to CVE-2007-0451 and CVE-2007-2873.

You should seriously consider upgrading for the first one.

http://wiki.apache.org/spamassassin/Security
<http://cve.mitre.org/cgi-bin/cvename.cgi?name=CVE-2007-2873>   
>
>
>>> Am I misinterpreting the debug output?
>> No, you don't seem to be.
> ---
>     Thanks for the confirmation of my 'reality'.  Really, the most
> logical and time-efficient way to proceed is likely to upgrade to a newer
> version at some point soon (and ignore my discontent regarding 'not
> knowing' why or what caused the break).
>
> *sigh*
> Linda
>
>>>
>>>
>>
>
