Dan Mahoney, System Admin wrote:
On Fri, 31 Mar 2006, Daryl C. W. O'Shea wrote:
Dan Mahoney, System Admin wrote:
Mar 30 21:52:14 quark spamd[45835]: __alarm__
Mar 30 21:52:14 quark spamd[45835]: __alarm__
Mar 30 21:52:14 quark spamd[45835]: spamd: copy_config timeout (with
empty $@), respawning child process after 25 messages at
/usr/local/bin/spamd line 982.
Mar 30 21:52:16 quark spamd[52479]: __alarm__
Mar 30 21:52:16 quark spamd[52479]: __alarm__
Mar 30 21:52:16 quark spamd[52479]: spamd: copy_config timeout (with
empty $@), respawning child process after 9 messages at
/usr/local/bin/spamd line 982.
This indicates that the patch from bug 4699 is working -- spamd now
recognizes that the alarm timed out on copy_config.
I'm still thinking that you should increase this alarm timeout value --
to something like 600 seconds, or remove it entirely -- if you're going
to be running under high load.
Mar 30 21:52:31 quark spamd[42292]: syswrite() on closed filehandle
GEN88 at /usr/local/lib/perl5/5.8.6/mach/IO/Handle.pm line 451.
Mar 30 21:52:31 quark spamd[42292]: Use of uninitialized value in
concatenation (.) or string at
/usr/local/lib/perl5/site_perl/5.8.6/Mail/SpamAssassin/SpamdForkScaling.pm
line 330.
Mar 30 21:52:31 quark spamd[42292]: prefork: write of ping failed to
52479 fd=: at
/usr/local/lib/perl5/site_perl/5.8.6/Mail/SpamAssassin/SpamdForkScaling.pm
line 330.
Mar 30 21:52:31 quark spamd[42292]: Use of uninitialized value in
concatenation (.) or string at
/usr/local/lib/perl5/site_perl/5.8.6/Mail/SpamAssassin/SpamdForkScaling.pm
line 127.
Mar 30 21:52:31 quark spamd[42292]: prefork: killing failed child
52479 fd= at
/usr/local/lib/perl5/site_perl/5.8.6/Mail/SpamAssassin/SpamdForkScaling.pm
line 127.
Mar 30 21:52:31 quark spamd[42292]: prefork: killed child 52479 at
/usr/local/lib/perl5/site_perl/5.8.6/Mail/SpamAssassin/SpamdForkScaling.pm
line 141.
This indicates that the child is exiting, but SpamdForkScaling doesn't
know about it until a ping fails 150 seconds later, so a new child
isn't spawned for a long time after one of them commits suicide.
As noted in bug 4852, this appears to only be cosmetic. spamd is
continuing to spawn new children as necessary and is continuing to
process mail successfully.
It does so slowly, but with a load average of over 12 (and sometimes
over 15+), I'm not surprised.
Oddly enough, one of our other machines on our network (which also runs
spamd) also seems to die around the same time. I'm concerned about IT
as well, but less than this one. Still, snagging logs there is probably
not a bad idea.
Any idea what sort of load averages you've got when this starts to
happen? It looks like it starts off with a couple children timing out,
then you become short on children, mail starts stacking up, and it
snowballs from there.
I know somewhere in those logs it started rejecting mail on load average
12. A simple one-liner in spamd to echo the load into the logs could be
useful (I don't need a patch, but telling me what to put and where to
put it could be useful). Alternatively I could just do something with
logger(1), echo(1), uptime(1) and cron.
As far as I can tell from the log, it appears that the load average was
near or above 12 the entire time that the time outs were occurring.
After the restart of spamd at Mar 30 14:15:46, everything was fine until
another burst of mail around Mar 30 21:10:04 at which point the load
average is reported to be 12. At the same time, spamd starts spawning
more children, which isn't a great thing to be doing with a load average
of 12 or so.
As this happens, processing slows down, and more children are spawned.
This clears up and we're back to the min number of children around Mar
30 21:13:38.
Another burst happens around Mar 30 21:49:56, again with a load average
around 12. A lot more children are spawned this time, and things really
slow down. copy_config timeouts start happening again, but mail is
still being processed and children are exiting and spawning
appropriately. This continues to the end of the log.
Somewhere along the lines last night I also lost connection to AIM
(which runs from that netblock) so it's quite possibly network
congestion related. Even so, if I theoretically had 30 seconds of
latency 6 hours ago, spamd should theoretically NOT still be hanging now...
It doesn't look like spamd ever hangs. It continues to process mail,
albeit slowly due to load.
This load could be being caused, and sustained, due to the machine
hitting swap and thrashing. Do you really have enough memory for 40
spamd children? I don't think I'd try it with less than 2.0GB of RAM
dedicated to spamd use. With an MTA, SQL, and whatever else running,
I'd probably want 3GB+ of RAM in the machine.
This really looks similar to what happens to Sendmail when a machine
hits a load average of over 900! :)
Daryl