Re: Spamd keeps getting hung up!

Daryl C. W. O'Shea Sat, 01 Apr 2006 21:12:14 -0800

Dan Mahoney, System Admin wrote:

On Fri, 31 Mar 2006, Daryl C. W. O'Shea wrote:

Dan Mahoney, System Admin wrote:

Mar 30 21:52:14 quark spamd[45835]: __alarm__
Mar 30 21:52:14 quark spamd[45835]: __alarm__
Mar 30 21:52:14 quark spamd[45835]: spamd: copy_config timeout (withempty $@), respawning child process after 25 messages at/usr/local/bin/spamd line 982.
Mar 30 21:52:16 quark spamd[52479]: __alarm__
Mar 30 21:52:16 quark spamd[52479]: __alarm__
Mar 30 21:52:16 quark spamd[52479]: spamd: copy_config timeout (withempty $@), respawning child process after 9 messages at/usr/local/bin/spamd line 982.
This indicates that the patch from bug 4699 is working -- spamd nowrecognizes that the alarm timed out on copy_config.

I'm still thinking that you should increase this alarm timeout value --to something like 600 seconds, or remove it entirely -- if you're goingto be running under high load.

Mar 30 21:52:31 quark spamd[42292]: syswrite() on closed filehandleGEN88 at /usr/local/lib/perl5/5.8.6/mach/IO/Handle.pm line 451.Mar 30 21:52:31 quark spamd[42292]: Use of uninitialized value inconcatenation (.) or string at/usr/local/lib/perl5/site_perl/5.8.6/Mail/SpamAssassin/SpamdForkScaling.pmline 330.Mar 30 21:52:31 quark spamd[42292]: prefork: write of ping failed to52479 fd=: at/usr/local/lib/perl5/site_perl/5.8.6/Mail/SpamAssassin/SpamdForkScaling.pmline 330.Mar 30 21:52:31 quark spamd[42292]: Use of uninitialized value inconcatenation (.) or string at/usr/local/lib/perl5/site_perl/5.8.6/Mail/SpamAssassin/SpamdForkScaling.pmline 127.Mar 30 21:52:31 quark spamd[42292]: prefork: killing failed child52479 fd= at/usr/local/lib/perl5/site_perl/5.8.6/Mail/SpamAssassin/SpamdForkScaling.pmline 127.Mar 30 21:52:31 quark spamd[42292]: prefork: killed child 52479 at/usr/local/lib/perl5/site_perl/5.8.6/Mail/SpamAssassin/SpamdForkScaling.pmline 141.

This indicates that the child is exiting, but SpamdForkScaling doesn'tknow about it until a ping fails 150 seconds later, so a new childisn't spawned for a long time after one of them commits suicide.

As noted in bug 4852, this appears to only be cosmetic. spamd iscontinuing to spawn new children as necessary and is continuing toprocess mail successfully.

It does so slowly, but with a load average of over 12 (and sometimesover 15+), I'm not surprised.

Oddly enough, one of our other machines on our network (which also runsspamd) also seems to die around the same time. I'm concerned about ITas well, but less than this one. Still, snagging logs there is probablynot a bad idea.
Any idea what sort of load averages you've got when this starts tohappen? It looks like it starts off with a couple children timing out,then you become short on children, mail starts stacking up, and itsnowballs from there.
I know somewhere in those logs it started rejecting mail on load average12. A simple one-liner in spamd to echo the load into the logs could beuseful (I don't need a patch, but telling me what to put and where toput it could be useful). Alternatively I could just do something withlogger(1), echo(1), uptime(1) and cron.

As far as I can tell from the log, it appears that the load average wasnear or above 12 the entire time that the time outs were occurring.

After the restart of spamd at Mar 30 14:15:46, everything was fine untilanother burst of mail around Mar 30 21:10:04 at which point the loadaverage is reported to be 12. At the same time, spamd starts spawningmore children, which isn't a great thing to be doing with a load averageof 12 or so.

As this happens, processing slows down, and more children are spawned.This clears up and we're back to the min number of children around Mar30 21:13:38.

Another burst happens around Mar 30 21:49:56, again with a load averagearound 12. A lot more children are spawned this time, and things reallyslow down. copy_config timeouts start happening again, but mail isstill being processed and children are exiting and spawningappropriately. This continues to the end of the log.

Somewhere along the lines last night I also lost connection to AIM(which runs from that netblock) so it's quite possibly networkcongestion related. Even so, if I theoretically had 30 seconds oflatency 6 hours ago, spamd should theoretically NOT still be hanging now...

It doesn't look like spamd ever hangs. It continues to process mail,albeit slowly due to load.

This load could be being caused, and sustained, due to the machinehitting swap and thrashing. Do you really have enough memory for 40spamd children? I don't think I'd try it with less than 2.0GB of RAMdedicated to spamd use. With an MTA, SQL, and whatever else running,I'd probably want 3GB+ of RAM in the machine.

This really looks similar to what happens to Sendmail when a machinehits a load average of over 900! :)



Daryl

Re: Spamd keeps getting hung up!

Reply via email to