[FOSSology] Fwd: Scheduler hangs

Bob Gobeille Tue, 14 Jul 2009 14:03:59 -0700

FYI, this is a conversation Neal and I were having that should have gone to the list instead of only me.

Forwarded here in case anyone wants to join in.

Bob Gobeille


Begin forwarded message:

From: Bob Gobeille <bob.gobei...@hp.com>
Date: July 14, 2009 1:12:42 PM MDT
To: "Dr. Neal Krawetz" <ne...@fossology.org>
Subject: Re: Scheduler hangs


On Jul 14, 2009, at 1:13 PM, Dr. Neal Krawetz wrote:
Hi Bob,
I just had a thought about detecting and debugging scheduler hangs, as
well as a possible workaround.
The last I knew... the bug was a signal handling conflict between the DB
library and the scheduler.

Right now, the scheduler intercepts signals.
When sigaction() is called in scheduler.c, there is an optional
parameter for storing the old signal handler. I currently ignore the old
handler value since I am hijacking the signal handler and have no
intention of putting it back.
In some new code I added (when testing agents), I save the old handler,
set SIG_IGN, and then restore it after giving the agents enough time to
die and send their SIGCHLD. Without this we were reporting "unexpected"
child deaths, when in fact, we wanted to ignore those deaths.
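
The save / ignore / restore pattern described above can be sketched roughly as follows. This is only an illustration, not the code in scheduler.c: the function name ignore_signal_during() and its parameters are hypothetical, and the real code may wait for the agents differently.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <signal.h>
#include <stddef.h>
#include <string.h>
#include <unistd.h>

/* Hypothetical helper: ignore `signo` for `grace_seconds`, then restore
 * whatever disposition was installed before the call.
 * Returns 0 on success, -1 if sigaction() fails. */
int ignore_signal_during(int signo, unsigned int grace_seconds)
{
    struct sigaction ignore_action;
    struct sigaction old_action;

    assert(signo > 0);

    memset(&ignore_action, 0, sizeof(ignore_action));
    ignore_action.sa_handler = SIG_IGN;
    sigemptyset(&ignore_action.sa_mask);

    /* The third argument receives the previous disposition. */
    if (sigaction(signo, &ignore_action, &old_action) != 0)
        return -1;

    /* Give the agents time to exit. sleep() can return early if some
     * other signal is delivered; a real implementation might loop. */
    sleep(grace_seconds);

    /* Put the saved disposition back. */
    return sigaction(signo, &old_action, NULL);
}
```

A call such as ignore_signal_during(SIGCHLD, 5) would match the case described here. One side effect to keep in mind: under POSIX, while SIGCHLD is set to SIG_IGN, exiting children are reaped automatically, so wait() will not report them.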
However... Try this:
- Save the old signal handler.
  In the scheduler's signal handler, call the original handler before
  exiting. (if old is not null then call old with same parameters that
  my signal handler received)

- I also set a few signals to SIG_IGN.
  Instead of ignoring them, create a new signal handler that receives
  them, does nothing, and calls the old handler.

- Occasionally (in the main loop, after the sleep) check to see if
  the handler is still set the way I set it.
  If it isn't, then it means that Postgres hijacked my handler.
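
The three steps above could look something like the sketch below. All names (previous_action, chained_handler, install_chained_handler, handler_still_installed, scheduler_signal_count) are hypothetical and not taken from scheduler.c. For brevity it tracks a single signal; a real scheduler would keep one saved struct sigaction per signal it handles.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <signal.h>
#include <stddef.h>
#include <string.h>

/* Disposition that was in place before we installed our handler. */
static struct sigaction previous_action;

/* Stand-in for the scheduler's own signal work. */
static volatile sig_atomic_t scheduler_signal_count;

static void chained_handler(int signo, siginfo_t *info, void *context)
{
    scheduler_signal_count++;

    /* Forward to the old handler with the same parameters we received,
     * but only if it is a real function (not SIG_DFL or SIG_IGN). */
    if (previous_action.sa_flags & SA_SIGINFO) {
        if (previous_action.sa_sigaction != NULL)
            previous_action.sa_sigaction(signo, info, context);
    } else if (previous_action.sa_handler != SIG_DFL &&
               previous_action.sa_handler != SIG_IGN) {
        previous_action.sa_handler(signo);
    }
}

/* Install our handler and remember the old one. Returns 0 or -1. */
int install_chained_handler(int signo)
{
    struct sigaction action;

    assert(signo > 0);

    memset(&action, 0, sizeof(action));
    action.sa_sigaction = chained_handler;
    sigemptyset(&action.sa_mask);
    action.sa_flags = SA_SIGINFO | SA_RESTART;
    return sigaction(signo, &action, &previous_action);
}

/* For the main loop: 1 if our handler is still installed,
 * 0 if something replaced it, -1 if the query failed. */
int handler_still_installed(int signo)
{
    struct sigaction current;

    /* A NULL second argument only queries the current disposition. */
    if (sigaction(signo, NULL, &current) != 0)
        return -1;
    return (current.sa_flags & SA_SIGINFO) &&
           current.sa_sigaction == chained_handler;
}
```

If handler_still_installed() ever returns 0 after the sleep in the main loop, something else in the process has replaced the handler, which is the situation the last bullet is trying to detect.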
I suspect that Postgres is hijacking interrupts. Between my hijacking and
their hijacking, things are getting messed up. If this turns out to be
the case, then there is a solution:
- Before hijacking any handlers: call the DB and give it a few simple
  exercises. This will force the SQL library to hijack any interrupts
  it wants.
- Then configure the scheduler's interrupts, holding onto the old handlers.
- Have the scheduler pass through all signals to the old handler.
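
A rough sketch of that startup order is below. Everything named here is an assumption made for illustration: the signal list, scheduler_signal_setup(), and db_warm_up_stub(), which stands in for a few trivial queries through the real DB client library. Whether the DB library actually installs handlers is exactly the open question, so treat this as a way to test the theory rather than a confirmed fix.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <signal.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical list; the signals the scheduler really handles may differ. */
static const int handled_signals[] = { SIGHUP, SIGTERM, SIGCHLD };
#define HANDLED_COUNT (sizeof(handled_signals) / sizeof(handled_signals[0]))

/* Dispositions found after the DB warm-up, one per handled signal. */
static struct sigaction old_actions[HANDLED_COUNT];
static volatile sig_atomic_t signals_seen;

/* Placeholder: real code would connect and run a few simple queries. */
int db_warm_up_stub(void)
{
    return 0;
}

static void pass_through_handler(int signo, siginfo_t *info, void *context)
{
    size_t i;

    signals_seen++;                 /* scheduler's own work goes here */

    for (i = 0; i < HANDLED_COUNT; i++) {
        const struct sigaction *old = &old_actions[i];

        if (handled_signals[i] != signo)
            continue;
        /* Pass the signal through to whatever was installed before us. */
        if (old->sa_flags & SA_SIGINFO) {
            if (old->sa_sigaction != NULL)
                old->sa_sigaction(signo, info, context);
        } else if (old->sa_handler != SIG_DFL &&
                   old->sa_handler != SIG_IGN) {
            old->sa_handler(signo);
        }
        return;
    }
}

/* Step 1: exercise the DB. Step 2: install handlers, keeping the old ones. */
int scheduler_signal_setup(int (*warm_up)(void))
{
    struct sigaction action;
    size_t i;

    assert(warm_up != NULL);
    if (warm_up() != 0)
        return -1;

    memset(&action, 0, sizeof(action));
    action.sa_sigaction = pass_through_handler;
    sigemptyset(&action.sa_mask);
    action.sa_flags = SA_SIGINFO | SA_RESTART;

    for (i = 0; i < HANDLED_COUNT; i++) {
        if (sigaction(handled_signals[i], &action, &old_actions[i]) != 0)
            return -1;
    }
    return 0;
}
```

Calling scheduler_signal_setup(db_warm_up_stub) once at startup, before the main loop, gives the ordering described above: the DB goes first, then the scheduler wraps whatever it finds.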

Good ideas. I wrote the scheduler watchdog as a hack because the
hangs/dies are rare and I thought that diagnosing the problem would be a
bitch. Actually, diagnosing the problem would be a good thing to do. ;-)
Bob


_______________________________________________
fossology mailing list
fossology@fossology.org
http://fossology.org/mailman/listinfo/fossology
