On Sun, 2 Apr 2006, Daryl C. W. O'Shea wrote:

Dan Mahoney, System Admin wrote:
On Fri, 31 Mar 2006, Daryl C. W. O'Shea wrote:

Dan Mahoney, System Admin wrote:

Mar 30 21:52:14 quark spamd[45835]: __alarm__
Mar 30 21:52:14 quark spamd[45835]: __alarm__
Mar 30 21:52:14 quark spamd[45835]: spamd: copy_config timeout (with empty $@), respawning child process after 25 messages at /usr/local/bin/spamd line 982.
Mar 30 21:52:16 quark spamd[52479]: __alarm__
Mar 30 21:52:16 quark spamd[52479]: __alarm__
Mar 30 21:52:16 quark spamd[52479]: spamd: copy_config timeout (with empty $@), respawning child process after 9 messages at /usr/local/bin/spamd line 982.

This indicates that the patch from bug 4699 is working -- spamd now recognizes that the alarm timed out on copy_config.

I'm still thinking that you should increase this alarm timeout value -- to something like 600 seconds, or remove it entirely -- if you're going to be running under high load.


Mar 30 21:52:31 quark spamd[42292]: syswrite() on closed filehandle GEN88 at /usr/local/lib/perl5/5.8.6/mach/IO/Handle.pm line 451. Mar 30 21:52:31 quark spamd[42292]: Use of uninitialized value in concatenation (.) or string at /usr/local/lib/perl5/site_perl/5.8.6/Mail/SpamAssassin/SpamdForkScaling.pm line 330. Mar 30 21:52:31 quark spamd[42292]: prefork: write of ping failed to 52479 fd=: at /usr/local/lib/perl5/site_perl/5.8.6/Mail/SpamAssassin/SpamdForkScaling.pm line 330. Mar 30 21:52:31 quark spamd[42292]: Use of uninitialized value in concatenation (.) or string at /usr/local/lib/perl5/site_perl/5.8.6/Mail/SpamAssassin/SpamdForkScaling.pm line 127. Mar 30 21:52:31 quark spamd[42292]: prefork: killing failed child 52479 fd= at /usr/local/lib/perl5/site_perl/5.8.6/Mail/SpamAssassin/SpamdForkScaling.pm line 127. Mar 30 21:52:31 quark spamd[42292]: prefork: killed child 52479 at /usr/local/lib/perl5/site_perl/5.8.6/Mail/SpamAssassin/SpamdForkScaling.pm line 141.

This indicates that the child is exiting, but SpamdForkScaling doesn't know about it until a ping fails 150 seconds later, so a new child isn't spawned for a long time after one of them commits suicide.

As noted in bug 4852, this appears to only be cosmetic. spamd is continuing to spawn new children as necessary and is continuing to process mail successfully.

It does so slowly, but with a load average of over 12 (and sometimes over 15+), I'm not surprised.


Oddly enough, one of our other machines on our network (which also runs spamd) also seems to die around the same time. I'm concerned about IT as well, but less than this one. Still, snagging logs there is probably not a bad idea.

Any idea what sort of load averages you've got when this starts to happen? It looks like it starts off with a couple children timing out, then you become short on children, mail starts stacking up, and it snowballs from there.

I know somewhere in those logs it started rejecting mail on load average 12. A simple one-liner in spamd to echo the load into the logs could be useful (I don't need a patch, but telling me what to put and where to put it could be useful). Alternatively I could just do something with logger(1), echo(1), uptime(1) and cron.

As far as I can tell from the log, it appears that the load average was near or above 12 the entire time that the time outs were occurring.

After the restart of spamd at Mar 30 14:15:46, everything was fine until another burst of mail around Mar 30 21:10:04 at which point the load average is reported to be 12. At the same time, spamd starts spawning more children, which isn't a great thing to be doing with a load average of 12 or so.

As this happens, processing slows down, and more children are spawned. This clears up and we're back to the min number of children around Mar 30 21:13:38.

Another burst happens around Mar 30 21:49:56, again with a load average around 12. A lot more children are spawned this time, and things really slow down. copy_config timeouts start happening again, but mail is still being processed and children are exiting and spawning appropriately. This continues to the end of the log.


Somewhere along the lines last night I also lost connection to AIM (which runs from that netblock) so it's quite possibly network congestion related. Even so, if I theoretically had 30 seconds of latency 6 hours ago, spamd should theoretically NOT still be hanging now...

It doesn't look like spamd ever hangs. It continues to process mail, albeit slowly due to load.

This load could be being caused, and sustained, due to the machine hitting swap and thrashing. Do you really have enough memory for 40 spamd children? I don't think I'd try it with less than 2.0GB of RAM dedicated to spamd use. With an MTA, SQL, and whatever else running, I'd probably want 3GB+ of RAM in the machine.

This really looks similar to what happens to Sendmail when a machine hits a load average of over 900! :)

The problem is, Daryl...there's a handful of users that get a certain type of spam that DOESN'T process in the usual timeframe. In fact, even on a spamassassin restart, those messages come right back in, and eat a process space and NEVER finish. Presumably they also consume CPU time and whatever else.

If this was simply a load issue, it would be far more constant...but five users on a box with a 500 user pool...and I get probably more spam than anyone, and my own messages never hang spamd up like this.

When I get done, I'd be more than happy to send you a stripped corpus of tinsc's mail, and let you run them on your own setup, and see if they cause similar delays. If they don't, fine, I'm crazy and need more ram, but I can't believe that I'd run at a load average of <2 90 percent of the time, and have this only when cww or tinsc get flooded with a suspicious amount of mail with very similar messageids with processes that NEVER exit.

In the mean, feel free to tell me how to strace this, just in case. Or poke me out of band and I'll give you root.

-Dan



Daryl


--

"this is too stupid even for irc"

-mtreal, EFnet #macintosh, 09/15/2K, 12:33 AM

--------Dan Mahoney--------
Techie,  Sysadmin,  WebGeek
Gushi on efnet/undernet IRC
ICQ: 13735144   AIM: LarpGM
Site:  http://www.gushi.org
---------------------------

Reply via email to