On Sun, 2 Apr 2006, Daryl C. W. O'Shea wrote:
Dan Mahoney, System Admin wrote:
On Fri, 31 Mar 2006, Daryl C. W. O'Shea wrote:
Dan Mahoney, System Admin wrote:
Mar 30 21:52:14 quark spamd[45835]: __alarm__
Mar 30 21:52:14 quark spamd[45835]: __alarm__
Mar 30 21:52:14 quark spamd[45835]: spamd: copy_config timeout (with
empty $@), respawning child process after 25 messages at
/usr/local/bin/spamd line 982.
Mar 30 21:52:16 quark spamd[52479]: __alarm__
Mar 30 21:52:16 quark spamd[52479]: __alarm__
Mar 30 21:52:16 quark spamd[52479]: spamd: copy_config timeout (with
empty $@), respawning child process after 9 messages at
/usr/local/bin/spamd line 982.
This indicates that the patch from bug 4699 is working -- spamd now
recognizes that the alarm timed out on copy_config.
I'm still thinking that you should increase this alarm timeout value -- to
something like 600 seconds, or remove it entirely -- if you're going to be
running under high load.
Mar 30 21:52:31 quark spamd[42292]: syswrite() on closed filehandle GEN88
at /usr/local/lib/perl5/5.8.6/mach/IO/Handle.pm line 451.
Mar 30 21:52:31 quark spamd[42292]: Use of uninitialized value in
concatenation (.) or string at
/usr/local/lib/perl5/site_perl/5.8.6/Mail/SpamAssassin/SpamdForkScaling.pm
line 330.
Mar 30 21:52:31 quark spamd[42292]: prefork: write of ping failed to
52479 fd=: at
/usr/local/lib/perl5/site_perl/5.8.6/Mail/SpamAssassin/SpamdForkScaling.pm
line 330.
Mar 30 21:52:31 quark spamd[42292]: Use of uninitialized value in
concatenation (.) or string at
/usr/local/lib/perl5/site_perl/5.8.6/Mail/SpamAssassin/SpamdForkScaling.pm
line 127.
Mar 30 21:52:31 quark spamd[42292]: prefork: killing failed child 52479
fd= at
/usr/local/lib/perl5/site_perl/5.8.6/Mail/SpamAssassin/SpamdForkScaling.pm
line 127.
Mar 30 21:52:31 quark spamd[42292]: prefork: killed child 52479 at
/usr/local/lib/perl5/site_perl/5.8.6/Mail/SpamAssassin/SpamdForkScaling.pm
line 141.
This indicates that the child is exiting, but SpamdForkScaling doesn't
know about it until a ping fails 150 seconds later, so a new child isn't
spawned for a long time after one of them commits suicide.
As noted in bug 4852, this appears to only be cosmetic. spamd is continuing
to spawn new children as necessary and is continuing to process mail
successfully.
It does so slowly, but with a load average of over 12 (and sometimes over
15+), I'm not surprised.
Oddly enough, one of our other machines on our network (which also runs
spamd) also seems to die around the same time. I'm concerned about IT as
well, but less than this one. Still, snagging logs there is probably not a
bad idea.
Any idea what sort of load averages you've got when this starts to happen?
It looks like it starts off with a couple children timing out, then you
become short on children, mail starts stacking up, and it snowballs from
there.
I know somewhere in those logs it started rejecting mail on load average
12. A simple one-liner in spamd to echo the load into the logs could be
useful (I don't need a patch, but telling me what to put and where to put
it could be useful). Alternatively I could just do something with
logger(1), echo(1), uptime(1) and cron.
As far as I can tell from the log, it appears that the load average was near
or above 12 the entire time that the time outs were occurring.
After the restart of spamd at Mar 30 14:15:46, everything was fine until
another burst of mail around Mar 30 21:10:04 at which point the load average
is reported to be 12. At the same time, spamd starts spawning more children,
which isn't a great thing to be doing with a load average of 12 or so.
As this happens, processing slows down, and more children are spawned. This
clears up and we're back to the min number of children around Mar 30
21:13:38.
Another burst happens around Mar 30 21:49:56, again with a load average
around 12. A lot more children are spawned this time, and things really slow
down. copy_config timeouts start happening again, but mail is still being
processed and children are exiting and spawning appropriately. This
continues to the end of the log.
Somewhere along the lines last night I also lost connection to AIM (which
runs from that netblock) so it's quite possibly network congestion related.
Even so, if I theoretically had 30 seconds of latency 6 hours ago, spamd
should theoretically NOT still be hanging now...
It doesn't look like spamd ever hangs. It continues to process mail, albeit
slowly due to load.
This load could be being caused, and sustained, due to the machine hitting
swap and thrashing. Do you really have enough memory for 40 spamd children?
I don't think I'd try it with less than 2.0GB of RAM dedicated to spamd use.
With an MTA, SQL, and whatever else running, I'd probably want 3GB+ of RAM in
the machine.
This really looks similar to what happens to Sendmail when a machine hits a
load average of over 900! :)
The problem is, Daryl...there's a handful of users that get a certain type
of spam that DOESN'T process in the usual timeframe. In fact, even on a
spamassassin restart, those messages come right back in, and eat a process
space and NEVER finish. Presumably they also consume CPU time and
whatever else.
If this was simply a load issue, it would be far more constant...but five
users on a box with a 500 user pool...and I get probably more spam than
anyone, and my own messages never hang spamd up like this.
When I get done, I'd be more than happy to send you a stripped corpus of
tinsc's mail, and let you run them on your own setup, and see if they
cause similar delays. If they don't, fine, I'm crazy and need more ram,
but I can't believe that I'd run at a load average of <2 90 percent of the
time, and have this only when cww or tinsc get flooded with a suspicious
amount of mail with very similar messageids with processes that NEVER
exit.
In the mean, feel free to tell me how to strace this, just in case. Or
poke me out of band and I'll give you root.
-Dan
Daryl
--
"this is too stupid even for irc"
-mtreal, EFnet #macintosh, 09/15/2K, 12:33 AM
--------Dan Mahoney--------
Techie, Sysadmin, WebGeek
Gushi on efnet/undernet IRC
ICQ: 13735144 AIM: LarpGM
Site: http://www.gushi.org
---------------------------