On Fri, 31 Mar 2006, Daryl C. W. O'Shea wrote:
(copying Justin since this has to do with pre-forking)
Dan Mahoney, System Admin wrote:
On Fri, 10 Mar 2006, Daryl C. W. O'Shea wrote:
On 3/10/2006 11:22 AM, Dan Mahoney, System Admin wrote:
Okay,
I'm still getting these issues. I've corrected every other issue that's
plagued us, and the thing still locks up. USUALLY when a user gets some
form of dictionary spam. For the users I can identify I've been keeping
copies of their stuff.
NOTE: This is under a stock 3.1.1, if there are any other patches I should
be using from the previous conversations that are NOT in 3.1.1, please let
me know, and I'll make sure I have those too. I'm seeing lots of the
following:
Mar 30 21:52:14 quark spamd[45835]: __alarm__
Mar 30 21:52:14 quark spamd[45835]: __alarm__
Mar 30 21:52:14 quark spamd[45835]: spamd: copy_config timeout (with empty
$@), respawning child process after 25 messages at /usr/local/bin/spamd
line 982.
Mar 30 21:52:16 quark spamd[52479]: __alarm__
Mar 30 21:52:16 quark spamd[52479]: __alarm__
Mar 30 21:52:16 quark spamd[52479]: spamd: copy_config timeout (with empty
$@), respawning child process after 9 messages at /usr/local/bin/spamd line
982.
This indicates that the patch from bug 4699 is working -- spamd now
recognizes that the alarm timed out on copy_config.
And also some of this:
Mar 30 21:52:31 quark spamd[42292]: syswrite() on closed filehandle GEN88
at /usr/local/lib/perl5/5.8.6/mach/IO/Handle.pm line 451.
Mar 30 21:52:31 quark spamd[42292]: Use of uninitialized value in
concatenation (.) or string at
/usr/local/lib/perl5/site_perl/5.8.6/Mail/SpamAssassin/SpamdForkScaling.pm
line 330.
Mar 30 21:52:31 quark spamd[42292]: prefork: write of ping failed to 52479
fd=: at
/usr/local/lib/perl5/site_perl/5.8.6/Mail/SpamAssassin/SpamdForkScaling.pm
line 330.
Mar 30 21:52:31 quark spamd[42292]: Use of uninitialized value in
concatenation (.) or string at
/usr/local/lib/perl5/site_perl/5.8.6/Mail/SpamAssassin/SpamdForkScaling.pm
line 127.
Mar 30 21:52:31 quark spamd[42292]: prefork: killing failed child 52479 fd=
at
/usr/local/lib/perl5/site_perl/5.8.6/Mail/SpamAssassin/SpamdForkScaling.pm
line 127.
Mar 30 21:52:31 quark spamd[42292]: prefork: killed child 52479 at
/usr/local/lib/perl5/site_perl/5.8.6/Mail/SpamAssassin/SpamdForkScaling.pm
line 141.
Mar 30 21:52:31 quark spamd[42292]: syswrite() on closed filehandle GEN70
at /usr/local/lib/perl5/5.8.6/mach/IO/Handle.pm line 451.
Mar 30 21:52:31 quark spamd[42292]: Use of uninitialized value in
concatenation (.) or string at
/usr/local/lib/perl5/site_perl/5.8.6/Mail/SpamAssassin/SpamdForkScaling.pm
line 330.
Mar 30 21:52:31 quark spamd[42292]: prefork: write of ping failed to 45835
fd=: at
/usr/local/lib/perl5/site_perl/5.8.6/Mail/SpamAssassin/SpamdForkScaling.pm
line 330.
Mar 30 21:52:31 quark spamd[42292]: Use of uninitialized value in
concatenation (.) or string at
/usr/local/lib/perl5/site_perl/5.8.6/Mail/SpamAssassin/SpamdForkScaling.pm
line 127.
Mar 30 21:52:31 quark spamd[42292]: prefork: killing failed child 45835 fd=
at
/usr/local/lib/perl5/site_perl/5.8.6/Mail/SpamAssassin/SpamdForkScaling.pm
line 127.
Mar 30 21:52:31 quark spamd[42292]: prefork: killed child 45835 at
/usr/local/lib/perl5/site_perl/5.8.6/Mail/SpamAssassin/SpamdForkScaling.pm
line 141.
This indicates that the child is exiting, but SpamdForkScaling doesn't know
about it until a ping fails 150 seconds later, so a new child isn't spawned
for a long time after one of them commits suicide.
Example at or around Mar 30 01:48:16 in this file:
http://www.gushi.org/maillog33106-0.txt
And another similar lockup at Mar 30 21:49:50 -- SAME USER, go figure.
I don't have archived copies of this user's mail -- yet. I've set up
archiving for them, and we have everything from now forward, but I'm
convinced there's SOMETHING in the spam they're getting that causes a
lockup.
I think it's actually load related... spamd is timing out the copy_config
sooner than it's really taking under high load. If you were to change the
alarm value from 10 to 100 or so, around spamd line 949 this may go away.
Oddly enough, one of our other machines on our network (which also runs
spamd) also seems to die around the same time. I'm concerned about IT as
well, but less than this one. Still, snagging logs there is probably not
a bad idea.
Any idea what sort of load averages you've got when this starts to happen?
It looks like it starts off with a couple children timing out, then you
become short on children, mail starts stacking up, and it snowballs from
there.
I know somewhere in those logs it started rejecting mail on load average
12. A simple one-liner in spamd to echo the load into the logs could be
useful (I don't need a patch, but telling me what to put and where to put
it could be useful). Alternatively I could just do something with
logger(1), echo(1), uptime(1) and cron.
Somewhere along the lines last night I also lost connection to AIM (which
runs from that netblock) so it's quite possibly network congestion
related. Even so, if I theoretically had 30 seconds of latency 6 hours
ago, spamd should theoretically NOT still be hanging now...
BTW, we should probably find or open a bugzilla ticket for this. Bug 4699 is
related. The pre-fork issue is probably another bug of its own.
If you say to do that, then I'll certainly do that -- last thing I wanted
to do was open a false bug, at least until the other unrelated annoyances
were cleared up (and thank you guys immensely for that). Didn't you say
you yourself (Daryl) were having a similar issue?
-Dan
--
--------Dan Mahoney--------
Techie, Sysadmin, WebGeek
Gushi on efnet/undernet IRC
ICQ: 13735144 AIM: LarpGM
Site: http://www.gushi.org
---------------------------