On Sep 6, 2013, at 6:46 PM, Dave Love <[email protected]> wrote:
> François-Michel L'Heureux <[email protected]> writes:
>
>> While investigating jobs that have been running for way too long, I've
>> found out that qhost shows nodes that are dead with "alive stats" such
>> as load, memuse and swapus. qstat also shows them processing jobs with
>> state "r", as if the node was there and working.
>
> Yes, it's a known bug somewhere in the issue tracker that I've never got
> round to tracking down.

My first question is: are the nodes actually dead (as in powered down) or stuck (as in overloaded)?

The shepherd is a fork-and-wait process with the built-in starter method (or your own starter method). Once the job is scheduled, it does not report job status until the job completes. If the job never completes because it's wedged (along with everything else), then the wait never reaps the child, and that leaves the job status as 'r' even though, technically, the job is doing squat (again, along with everything else). The wait is not called with WNOHANG (at least not that I saw when I stepped through the code this afternoon; I was tired, however), and there is no SIGALRM to time out the wait. So, as far as the qmaster knows, the job is still running, because the blocked shepherd hasn't told it any different.

My second question is: how did the nodes "die"? Did you kill them or gracefully shut them down? Or, again, are they just wedged, so that no user-space programs can really run but the kernel can still respond to a ping test? If the nodes can respond to a ping test, then I think the qmaster still considers them up. The backend communication between a node and the qmaster is via an event queue, and I'm still working through that code; it's not as straightforward as the shepherd.

I only started going over the code last week and have yet to fully understand it (that will take a long time, I think), so take everything I say with a grain of salt.

Thanks.

John.
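
To illustrate the WNOHANG point, here is a minimal sketch of a polled wait with a deadline, as opposed to a plain blocking wait. It is written in Python for brevity and is not taken from the Grid Engine source; `spawn_sleeper` and `wait_with_timeout` are hypothetical names, and the sleeping child only stands in for a job. It assumes a Unix-like system where `os.fork` is available.

```python
import os
import signal
import time


def spawn_sleeper(seconds):
    """Fork a child that sleeps and exits 0 (a stand-in for a job; hypothetical helper)."""
    pid = os.fork()
    if pid == 0:
        # Child: pretend to do work, then exit without running parent cleanup.
        time.sleep(seconds)
        os._exit(0)
    return pid


def wait_with_timeout(pid, timeout_s, poll_s=0.05):
    """Poll the child with WNOHANG.

    Returns the raw wait status if the child exited before the deadline,
    or None if it was still running. A plain os.waitpid(pid, 0) would
    instead block here for as long as the child lives.
    """
    deadline = time.monotonic() + timeout_s
    while time.monotonic() < deadline:
        done_pid, status = os.waitpid(pid, os.WNOHANG)
        if done_pid == pid:
            return status
        time.sleep(poll_s)
    return None


if __name__ == "__main__":
    child = spawn_sleeper(60)  # a "job" that outlives our patience
    if wait_with_timeout(child, 0.2) is None:
        print("child still running; a blocking wait would report nothing yet")
        os.kill(child, signal.SIGKILL)
        os.waitpid(child, 0)  # reap the killed child
```

With a blocking wait, the parent has nothing to report until the child exits, which matches the stale 'r' state described above. The polled variant at least lets the parent notice that the deadline has passed.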
>
> --
> Community Grid Engine: http://arc.liv.ac.uk/SGE/
>
> _______________________________________________
> users mailing list
> [email protected]
> https://gridengine.org/mailman/listinfo/users
