On Sep 6, 2013, at 6:46 PM, Dave Love <[email protected]> wrote:

> François-Michel L'Heureux <[email protected]> writes:
> 
>> While investigating jobs that have been running for way too long, I've
>> found out that qhost shows nodes that are dead with "alive stats" such
>> as load, memuse and swapus. qstat also shows them processing jobs with
>> state "r", as if the node was there and working.
> 
> Yes, it's a known bug somewhere in the issue tracker that I've never got
> round to tracking down.

My first question is: are the nodes actually dead (as in powered down) or 
stuck (as in overloaded)?

The shepherd is a fork-and-wait process around the built-in starter method (or 
your own starter method). Once the job is scheduled, it does not report job 
status until the job completes. If the job never completes because it's 
wedged (along with everything else), then the wait never reaps the child, and 
that leaves the job status as 'r' even though, technically, the job is doing 
squat (again, along with everything else).
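
To make the fork-and-wait pattern above concrete, here is a minimal sketch in 
C. It is not the shepherd's actual code; run_job_blocking is a hypothetical 
name used only for illustration. The point is that nothing is reported back 
while waitpid() is blocked.

```c
#include <errno.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Hypothetical helper (not from the Grid Engine source): fork a child,
 * exec the job in it, and block until the child exits.
 * Returns the child's exit code, or -1 on error or abnormal exit. */
int run_job_blocking(char *const argv[])
{
    pid_t pid = fork();
    if (pid < 0)
        return -1;                      /* fork failed */

    if (pid == 0) {
        execvp(argv[0], argv);          /* child: become the job */
        _exit(127);                     /* only reached if exec failed */
    }

    int status;
    /* Parent: options == 0 means a blocking wait with no timeout.
     * If the child never exits, this call never returns. */
    while (waitpid(pid, &status, 0) < 0) {
        if (errno != EINTR)
            return -1;
    }
    return WIFEXITED(status) ? WEXITSTATUS(status) : -1;
}
```

A caller would pass a NULL-terminated argument vector, for example 
{"/bin/sh", "-c", "exit 3", NULL}, and get the job's exit code back only once 
the job has finished.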

The wait is not called with WNOHANG (at least not that I saw when I stepped 
through the code this afternoon; I was tired, however), and there is no 
SIGALRM to time out the wait. So, as far as the qmaster knows, the job is 
still running, because the blocked shepherd hasn't told it otherwise.
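
For comparison, a wait that can time out might look like the sketch below. 
This is an assumption about what a WNOHANG-based alternative could look like, 
not something present in the shepherd; wait_with_timeout and the one-second 
polling interval are made up for illustration.

```c
#include <errno.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <time.h>
#include <unistd.h>

/* Hypothetical helper (not from the Grid Engine source): poll for the
 * child with WNOHANG instead of blocking indefinitely.
 * Returns 1 if the child was reaped (*status is filled in),
 *         0 if timeout_sec elapsed and the child is still running,
 *        -1 on error. */
int wait_with_timeout(pid_t pid, int *status, int timeout_sec)
{
    time_t deadline = time(NULL) + timeout_sec;

    for (;;) {
        /* WNOHANG makes waitpid() return 0 right away if the child
         * has not changed state yet. */
        pid_t r = waitpid(pid, status, WNOHANG);
        if (r == pid)
            return 1;                   /* child exited and was reaped */
        if (r < 0 && errno != EINTR)
            return -1;                  /* e.g. ECHILD: no such child */
        if (time(NULL) >= deadline)
            return 0;                   /* still running; caller decides */
        sleep(1);                       /* assumed polling interval */
    }
}
```

On a 0 return the caller still owns the child and could, for instance, send a 
status update and call the function again. Whether that helps on a node that 
is wedged so badly that no user-space program runs is a separate question.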

My second question is, how did the nodes "die"?  Did you kill them or 
gracefully shut them down?  Or, again, are they just wedged, so that no user 
space programs can really run, but the kernel can respond to a ping test?

If the nodes can respond to a ping test then the qmaster still thinks they're 
up, I think.  The backend communication between a node and qmaster is via an 
event queue and I'm still working through that code; it's not as 
straightforward as the shepherd.

I just started going over the code last week and have yet to fully understand 
it (that will take a long time, I think), so take everything I say with a grain 
of salt.

Thanks.

        John.

> 
> -- 
> Community Grid Engine:  http://arc.liv.ac.uk/SGE/
> 
> _______________________________________________
> users mailing list
> [email protected]
> https://gridengine.org/mailman/listinfo/users

