Dan Mahoney, System Admin wrote:
On Fri, 31 Mar 2006, Daryl C. W. O'Shea wrote:
I think it's actually load related... under high load, spamd is timing
out the copy_config before it has had time to finish. If you
were to change the alarm value from 10 to 100 or so, around spamd line
949 this may go away.
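
To illustrate the failure mode being described (an alarm that is fine on an idle box but too short under load), here is a minimal sketch in Python. It is not spamd's code, which is Perl. The names run_with_alarm, CopyTimeout and slow_copy_config are hypothetical, the durations are made up for the demo, and it assumes a Unix system with the call made from the main thread.

```python
import signal
import time


class CopyTimeout(Exception):
    """Raised when the guarded call outlives its alarm."""


def run_with_alarm(func, timeout_seconds):
    """Run func(), aborting with CopyTimeout after timeout_seconds."""

    def on_alarm(signum, frame):
        raise CopyTimeout(f"timed out after {timeout_seconds}s")

    previous = signal.signal(signal.SIGALRM, on_alarm)
    signal.setitimer(signal.ITIMER_REAL, timeout_seconds)
    try:
        return func()
    finally:
        signal.setitimer(signal.ITIMER_REAL, 0)  # always cancel the timer
        signal.signal(signal.SIGALRM, previous)  # restore the old handler


def slow_copy_config(duration=0.3):
    # Hypothetical stand-in for a config copy that is slow on a loaded box.
    time.sleep(duration)
    return "copied"
```

With a generous timeout the call returns normally; with a timeout shorter than the work takes, the caller sees CopyTimeout even though nothing was actually wrong, which is the effect a longer alarm value would avoid.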
Oddly enough, one of our other machines on our network (which also runs
spamd) also seems to die around the same time. I'm concerned about IT
as well, but less than this one. Still, snagging logs there is probably
not a bad idea.
If your mail load is distributed somewhat evenly, I'd say it isn't too
odd.
Any idea what sort of load averages you've got when this starts to
happen? It looks like it starts off with a couple children timing out,
then you become short on children, mail starts stacking up, and it
snowballs from there.
I know somewhere in those logs it started rejecting mail on load average
12. A simple one-liner in spamd to echo the load into the logs could be
useful (I don't need a patch, but telling me what to put and where to
put it could be useful). Alternatively I could just do something with
logger(1), echo(1), uptime(1) and cron.
cron'ing it is probably the best way to go.
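
For the cron route, a rough sketch of a load logger in Python follows; logger(1) plus uptime(1) would do the same job. The "loadwatch" tag, the message format, and the choice of the mail facility are assumptions made for illustration, picked so the entries land near the spamd lines in the mail log.

```python
import os
import syslog


def format_load(load):
    """Format a (1min, 5min, 15min) load average tuple as one log line."""
    one, five, fifteen = load
    return f"loadavg 1m={one:.2f} 5m={five:.2f} 15m={fifteen:.2f}"


def log_load():
    """Write the current load averages to syslog and return the message."""
    message = format_load(os.getloadavg())
    # Assumed tag and facility; adjust to match where spamd logs.
    syslog.openlog("loadwatch", syslog.LOG_PID, syslog.LOG_MAIL)
    syslog.syslog(syslog.LOG_INFO, message)
    syslog.closelog()
    return message


if __name__ == "__main__":
    log_load()
```

Run once a minute from cron, this gives a load figure to line up against the timestamps of the first child timeouts.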
Like Justin mentioned, a strace -f -ttt would say for sure, but I don't
know how realistic that is if it's not reproducible on demand given the
amount of mail you're processing. I guess you could try it and see what
happens.
Somewhere along the line last night I also lost my connection to AIM
(which runs from that netblock) so it's quite possibly network
congestion related. Even so, if I theoretically had 30 seconds of
latency 6 hours ago, spamd should theoretically NOT still be hanging now...
It could be prolonged due to losing children for around three minutes
every time one times out and exits. I didn't follow the logs all the
way through... I haven't slept yet, and they weren't helping.
BTW, we should probably find or open a bugzilla ticket for this. Bug
4699 is related. The pre-fork issue is probably another bug of its own.
If you say to do that, then I'll certainly do that -- last thing I
wanted to do was open a false bug, at least until the other unrelated
annoyances were cleared up (and thank you guys immensely for that).
Didn't you say you yourself (Daryl) were having a similar issue?
That's alright, I've got INVALID as a hot key. :)
There's definitely a bug with:
- SpamdForkScaling not being notified about the child exiting
- maybe too short of an alarm on the copy_config (the alarm probably
isn't even necessary anymore... we could safely lengthen it)
- the craziness of the missing $@ value patched in bug 4699
Yes, I believe I did. I think I even have a strace of it if I can trim
it out of a 2GB strace log. *shudder* I keep forgetting about it... I'm
probably fearing the massive log. I've only seen the lone "__alarm__"s
though... spamd always recovers... I only process a few thousand
messages a day, so not much load.
Daryl