FYI, this is a conversation Neal and I were having that should have gone to the list instead of only to me.
Forwarded here in case anyone wants to join in.

Bob Gobeille

Begin forwarded message:

From: Bob Gobeille <bob.gobei...@hp.com>
Date: July 14, 2009 1:12:42 PM MDT
To: "Dr. Neal Krawetz" <ne...@fossology.org>
Subject: Re: Scheduler hangs


On Jul 14, 2009, at 1:13 PM, Dr. Neal Krawetz wrote:

Hi Bob,

I just had a thought about detecting and debugging scheduler hangs, as
well as a possible workaround.

The last I knew... the bug was a signal handling conflict between the DB
library and the scheduler.

Right now, the scheduler intercepts signals.
When sigaction() is called in scheduler.c, there is an optional parameter for storing the old signal handler. I currently ignore the old handler value since I am hijacking the signal handler and have no intention of
putting it back.
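
A minimal sketch of that optional parameter, for anyone following along. The names on_signal and take_over_signal are made up for illustration and are not from scheduler.c. Passing NULL as the last argument discards the old handler; passing a pointer keeps it.

```c
#define _POSIX_C_SOURCE 200809L
#include <signal.h>
#include <string.h>

static volatile sig_atomic_t got_signal = 0;

static void on_signal(int signo)
{
    got_signal = signo;   /* only record it; keeps the handler async-signal-safe */
}

/* Install on_signal for signo. If old is non-NULL, sigaction() stores the
 * previously installed action there; passing NULL throws it away. */
int take_over_signal(int signo, struct sigaction *old)
{
    struct sigaction sa;

    memset(&sa, 0, sizeof sa);
    sa.sa_handler = on_signal;
    sigemptyset(&sa.sa_mask);
    return sigaction(signo, &sa, old);
}
```

Calling take_over_signal(SIGTERM, NULL) corresponds to ignoring the old value; take_over_signal(SIGTERM, &old) keeps it around for later.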

In some new code I added (when testing agents), I save the old handler, set SIG_IGN, and then restore it after giving the agents enough time to die and send their SIGCHLD. Without this we were reporting "unexpected" child deaths, when in fact, we wanted to ignore those deaths.
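
Roughly, the save / ignore / restore sequence could look like the sketch below. This is not the actual scheduler code: ignore_sigchld_for is a hypothetical name, and the sleep() is only a stand-in for however the scheduler waits for the agents to exit.

```c
#define _POSIX_C_SOURCE 200809L
#include <signal.h>
#include <string.h>
#include <unistd.h>

/* Ignore SIGCHLD for about grace_seconds, then put back whatever
 * disposition was installed before. Returns 0 on success, -1 on error. */
int ignore_sigchld_for(unsigned int grace_seconds)
{
    struct sigaction ignore, saved;

    memset(&ignore, 0, sizeof ignore);
    ignore.sa_handler = SIG_IGN;
    sigemptyset(&ignore.sa_mask);

    /* Third argument receives the old handler so it can be restored. */
    if (sigaction(SIGCHLD, &ignore, &saved) != 0)
        return -1;

    sleep(grace_seconds);   /* may return early if another signal arrives */

    return sigaction(SIGCHLD, &saved, NULL);
}
```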


However... Try this:
- Save the old signal handler.
  In the scheduler's signal handler, call the original handler before
  exiting. (If old is not null, then call old with the same parameters
  that my signal handler received.)

- I also set a few signals to SIG_IGN.
  Instead of ignoring them, create a new signal handler that receives
  them, does nothing, and calls the old handler.

- Occasionally (in the main loop, after the sleep) check to see if
  the handler is still set the way I set it.
  If it isn't, then it means that Postgres hijacked my handler.
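
The three steps above can be sketched as follows. All names here (chained_handler, install_chained, handler_still_ours) are hypothetical. The sketch assumes the handler is installed with SA_SIGINFO so that the same parameters can be handed to an old handler of either style.

```c
#define _POSIX_C_SOURCE 200809L
#include <signal.h>
#include <string.h>

static struct sigaction prev_action;        /* whatever was installed before us */
static volatile sig_atomic_t seen = 0;

static void chained_handler(int signo, siginfo_t *info, void *ctx)
{
    seen = signo;                           /* the scheduler's own work goes here */

    /* Call the old handler with the same parameters we received.
     * SIG_DFL and SIG_IGN are markers, not functions, so skip them. */
    if (prev_action.sa_flags & SA_SIGINFO) {
        if (prev_action.sa_sigaction != NULL)
            prev_action.sa_sigaction(signo, info, ctx);
    } else if (prev_action.sa_handler != SIG_DFL &&
               prev_action.sa_handler != SIG_IGN) {
        prev_action.sa_handler(signo);
    }
}

/* Install chained_handler for signo, saving the old action. */
int install_chained(int signo)
{
    struct sigaction sa;

    memset(&sa, 0, sizeof sa);
    sa.sa_sigaction = chained_handler;
    sigemptyset(&sa.sa_mask);
    sa.sa_flags = SA_SIGINFO;
    return sigaction(signo, &sa, &prev_action);
}

/* Returns 1 if our handler is still installed, 0 if something
 * replaced it, -1 if the query itself failed. */
int handler_still_ours(int signo)
{
    struct sigaction cur;

    if (sigaction(signo, NULL, &cur) != 0)   /* NULL new action = query only */
        return -1;
    return (cur.sa_flags & SA_SIGINFO) && cur.sa_sigaction == chained_handler;
}
```

A call to handler_still_ours() after the sleep in the main loop would show whether anything replaced the handler. Using chained_handler for a signal that used to be SIG_IGN gives the "receives it, does nothing, calls the old handler" behavior.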

I suspect that Postgres is hijacking interrupts. Between my hijacking and their hijacking, things are getting messed up. If this turns out to be
the case, then there is a solution:
- Before hijacking any handlers: call the DB and give it a few simple
  exercises. This will force the SQL library to hijack any interrupts
  it wants.
- Then configure the scheduler's interrupts, holding onto the old handlers.
- Have the scheduler pass through all signals to the old handler.
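
A sketch of that ordering, under assumptions: db_warmup is a hypothetical callback standing in for whatever simple DB calls the scheduler would make first, and the watched[] list is illustrative, not the scheduler's real set of signals.

```c
#define _POSIX_C_SOURCE 200809L
#include <signal.h>
#include <string.h>

/* Hypothetical list of signals the scheduler wants to see. */
static const int watched[] = { SIGINT, SIGTERM, SIGCHLD };
#define NWATCHED (sizeof watched / sizeof watched[0])

static struct sigaction saved[NWATCHED];    /* handlers found at install time */
static volatile sig_atomic_t last_signal = 0;

static void passthrough(int signo, siginfo_t *info, void *ctx)
{
    size_t i;

    last_signal = signo;                    /* scheduler's own bookkeeping */
    for (i = 0; i < NWATCHED; i++) {
        const struct sigaction *old = &saved[i];

        if (watched[i] != signo)
            continue;
        /* Hand the signal to the old handler, if it is a real function. */
        if (old->sa_flags & SA_SIGINFO) {
            if (old->sa_sigaction != NULL)
                old->sa_sigaction(signo, info, ctx);
        } else if (old->sa_handler != SIG_DFL && old->sa_handler != SIG_IGN) {
            old->sa_handler(signo);
        }
        return;
    }
}

int setup_signals_after_db(int (*db_warmup)(void))
{
    struct sigaction sa;
    size_t i;

    /* Step 1: exercise the DB so its library installs any handlers it wants. */
    if (db_warmup != NULL && db_warmup() != 0)
        return -1;

    /* Step 2: install ours, holding onto whatever was there before. */
    memset(&sa, 0, sizeof sa);
    sa.sa_sigaction = passthrough;
    sigemptyset(&sa.sa_mask);
    sa.sa_flags = SA_SIGINFO;
    for (i = 0; i < NWATCHED; i++)
        if (sigaction(watched[i], &sa, &saved[i]) != 0)
            return -1;
    return 0;
}
```

Note that this sketch does not re-create default actions: if the old disposition was SIG_DFL, nothing is called, so the scheduler would still have to handle things like termination itself.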

Good ideas. I wrote the scheduler watchdog as a hack because the hangs/dies are rare and I thought that diagnosing the problem would be a bitch. Actually, diagnosing the problem would be a good thing to do. ;-)

Bob

_______________________________________________
fossology mailing list
fossology@fossology.org
http://fossology.org/mailman/listinfo/fossology
