FYI, this is a conversation Neal and I were having that should have
gone to the list instead of only me.
Forwarded here in case anyone wants to join in.
Bob Gobeille
Begin forwarded message:
From: Bob Gobeille <bob.gobei...@hp.com>
Date: July 14, 2009 1:12:42 PM MDT
To: "Dr. Neal Krawetz" <ne...@fossology.org>
Subject: Re: Scheduler hangs
On Jul 14, 2009, at 1:13 PM, Dr. Neal Krawetz wrote:
Hi Bob,
I just had a thought about detecting and debugging scheduler hangs, as
well as a possible workaround.
The last I knew... the bug was a signal handling conflict between the DB
library and the scheduler.
Right now, the scheduler intercepts signals.
When sigaction() is called in scheduler.c, there is an optional parameter
for storing the old signal handler. I currently ignore the old handler
value since I am hijacking the signal handler and have no intention of
putting it back.
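For reference, a minimal sketch of what keeping the old handler could look like. The names scheduler_handler, install_handler and got_signal are made up for illustration and are not taken from scheduler.c; only sigaction() and its third argument are the real API.

```c
#define _POSIX_C_SOURCE 200809L
#include <signal.h>
#include <string.h>

/* Hypothetical flag set by the handler so the main loop can react. */
static volatile sig_atomic_t got_signal = 0;

/* Hypothetical scheduler handler: only records that a signal arrived. */
static void scheduler_handler(int signo)
{
    (void)signo;
    got_signal = 1;
}

/* Install scheduler_handler for signo. If old is not NULL, sigaction()
 * stores the previous disposition there instead of discarding it.
 * Returns 0 on success, -1 on error (same as sigaction). */
static int install_handler(int signo, struct sigaction *old)
{
    struct sigaction sa;

    memset(&sa, 0, sizeof sa);
    sa.sa_handler = scheduler_handler;
    sigemptyset(&sa.sa_mask);
    sa.sa_flags = 0;
    return sigaction(signo, &sa, old);
}
```

Passing NULL as the last argument reproduces the current behavior of throwing the old handler away; passing a struct sigaction keeps it around for later.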
In some new code I added (when testing agents), I save the old handler,
set SIG_IGN, and then restore it after giving the agents enough time
to die and send their SIGCHLD. Without this we were reporting
"unexpected" child deaths when, in fact, we wanted to ignore those
deaths.
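The save / ignore / restore pattern described above might look roughly like this. It is a sketch, not the actual agent-testing code; ignore_signal and restore_signal are hypothetical helper names.

```c
#define _POSIX_C_SOURCE 200809L
#include <signal.h>
#include <string.h>

/* Set signo to SIG_IGN and save the previous disposition in *saved.
 * Returns 0 on success, -1 on error. */
static int ignore_signal(int signo, struct sigaction *saved)
{
    struct sigaction ign;

    memset(&ign, 0, sizeof ign);
    ign.sa_handler = SIG_IGN;
    sigemptyset(&ign.sa_mask);
    return sigaction(signo, &ign, saved);
}

/* Put back the disposition captured by ignore_signal(). */
static int restore_signal(int signo, const struct sigaction *saved)
{
    return sigaction(signo, saved, NULL);
}
```

The intended use would be ignore_signal(SIGCHLD, &saved), then wait long enough for the agents to exit, then restore_signal(SIGCHLD, &saved).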
However... Try this:
- Save the old signal handler.
  In the scheduler's signal handler, call the original handler before
  exiting. (If old is not null, then call old with the same parameters
  that my signal handler received.)
- I also set a few signals to SIG_IGN.
  Instead of ignoring them, create a new signal handler that receives
  them, does nothing, and calls the old handler.
- Occasionally (in the main loop, after the sleep) check to see if
  the handler is still set the way I set it.
  If it isn't, then it means that Postgres hijacked my handler.
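The three steps above could be sketched as follows, for a single signal. All names here (chained_handler, install_chained, handler_still_ours, previous_handler) are hypothetical; previous_handler only stands in for whatever was installed before the scheduler took over. The sketch assumes the handler is installed with SA_SIGINFO so that the same parameters can be forwarded to either style of old handler.

```c
#define _POSIX_C_SOURCE 200809L
#include <signal.h>
#include <string.h>

static struct sigaction old_action;             /* disposition we replaced */
static volatile sig_atomic_t seen = 0;           /* set by our handler */
static volatile sig_atomic_t previous_seen = 0;  /* set by the stand-in */

/* Stand-in for a handler that some other code installed earlier. */
static void previous_handler(int signo)
{
    (void)signo;
    previous_seen = 1;
}

/* Our handler: do our own work, then pass the signal to the old handler. */
static void chained_handler(int signo, siginfo_t *info, void *ctx)
{
    seen = 1;
    if (old_action.sa_flags & SA_SIGINFO) {
        if (old_action.sa_sigaction != NULL)
            old_action.sa_sigaction(signo, info, ctx);
    } else if (old_action.sa_handler != SIG_DFL &&
               old_action.sa_handler != SIG_IGN) {
        /* SIG_DFL and SIG_IGN are markers, not callable functions. */
        old_action.sa_handler(signo);
    }
}

/* Install chained_handler and remember what was there before. */
static int install_chained(int signo)
{
    struct sigaction sa;

    memset(&sa, 0, sizeof sa);
    sa.sa_sigaction = chained_handler;
    sigemptyset(&sa.sa_mask);
    sa.sa_flags = SA_SIGINFO;
    return sigaction(signo, &sa, &old_action);
}

/* Main-loop check: 1 if our handler is still installed for signo,
 * 0 if something replaced it, -1 if the query itself failed. */
static int handler_still_ours(int signo)
{
    struct sigaction cur;

    if (sigaction(signo, NULL, &cur) != 0)
        return -1;
    return (cur.sa_flags & SA_SIGINFO) &&
           cur.sa_sigaction == chained_handler;
}
```

A real scheduler would need one saved struct sigaction per signal rather than the single old_action used here. A return of 0 from handler_still_ours() after the sleep would be the evidence that something else replaced the handler.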
I suspect that Postgres is hijacking interrupts. Between my hijacking and
their hijacking, things are getting messed up. If this turns out to be
the case, then there is a solution:
- Before hijacking any handlers: call the DB and give it a few simple
  exercises. This will force the SQL library to hijack any interrupts
  it wants.
- Then configure the scheduler's interrupts, holding onto the old
  handlers.
- Have the scheduler pass through all signals to the old handler.
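One way to check whether the DB exercises really change any handlers is to snapshot the signal dispositions before and after them and compare. This is only a diagnostic sketch: snapshot_handlers, same_handler, diff_handlers and MAX_SIG are hypothetical names, and limiting the scan to signals 1..31 is an assumption.

```c
#define _POSIX_C_SOURCE 200809L
#include <signal.h>
#include <string.h>

#define MAX_SIG 32   /* assumption: only the classic signals 1..31 matter */

/* Record the current disposition of signals 1..MAX_SIG-1 in snap[]. */
static void snapshot_handlers(struct sigaction snap[MAX_SIG])
{
    memset(snap, 0, MAX_SIG * sizeof snap[0]);
    for (int s = 1; s < MAX_SIG; s++)
        (void)sigaction(s, NULL, &snap[s]);  /* query only; errors ignored */
}

/* 1 if both entries name the same handler function or marker, else 0. */
static int same_handler(const struct sigaction *a, const struct sigaction *b)
{
    if ((a->sa_flags & SA_SIGINFO) != (b->sa_flags & SA_SIGINFO))
        return 0;
    if (a->sa_flags & SA_SIGINFO)
        return a->sa_sigaction == b->sa_sigaction;
    return a->sa_handler == b->sa_handler;
}

/* Flag every signal whose handler differs between the two snapshots.
 * Returns the number of signals that changed. */
static int diff_handlers(const struct sigaction before[MAX_SIG],
                         const struct sigaction after[MAX_SIG],
                         int changed[MAX_SIG])
{
    int count = 0;

    changed[0] = 0;
    for (int s = 1; s < MAX_SIG; s++) {
        changed[s] = !same_handler(&before[s], &after[s]);
        count += changed[s];
    }
    return count;
}
```

The startup order would then be: snapshot, run the DB exercises, snapshot again and log the differences, and only after that install the scheduler's handlers while keeping the old ones for pass-through.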
Good ideas. I wrote the scheduler watchdog as a hack because the
hangs/dies are rare and I thought that diagnosing the problem would
be a bitch. Actually, diagnosing the problem would be a good thing
to do. ;-)
Bob