On 11.08.2011 at 15:29, Dave Love wrote:

> Stuart Barkley <[email protected]> writes:
>
>> If a node dies or is rebooted SGE does not do anything about hung jobs
>> when the node comes back online. The jobs continue to appear in
>> the queue as if they were running.
>>
>> This may be related to my using diskless nodes where the local spool
>> directory is cleared on reboot. I will be looking into putting the
>> execd spool files on a shared directory in the future which may
>> address this problem.
>
> I don't think it will. I see the same with a shared spool, at least for
> nodes running tightly-integrated parallel jobs, and I think others have
> in the archives. I thought there was an issue filed already, but
> apparently not. I'll file it, at least.

I think the message in the subject happens when there is something in
the spool directory of the node like
"$SGE_ROOT/default/spool/node01/jobs/00/0000/515" while there is nothing
in "active_jobs" any longer. So it can't kill anything. Clearing the
node's "jobs" directory may resolve it.

-- Reuti

> _______________________________________________
> users mailing list
> [email protected]
> https://gridengine.org/mailman/listinfo/users
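
To check whether a node is in the state described above, something like the following Python sketch could list job entries that are left under "jobs" but have no counterpart in "active_jobs". It only reports and deletes nothing. Several things here are assumptions and not taken from the SGE documentation: that joining the path components below "jobs" (e.g. 00/0000/515) gives the job id, that entries in "active_jobs" are named "<job_id>.<task_id>", and that the node's spool directory sits at $SGE_ROOT/default/spool/<node> as in the example path. The function names are made up for illustration.

```python
import os
import sys
from pathlib import Path
from typing import List, Set


def job_ids_in_jobs_dir(jobs_dir: Path) -> Set[int]:
    """Collect job ids from entries three levels below jobs/, e.g. jobs/00/0000/515.

    Assumption: concatenating the three path components yields the job id.
    """
    ids: Set[int] = set()
    if not jobs_dir.is_dir():
        return ids
    for path in jobs_dir.glob("*/*/*"):
        digits = "".join(path.relative_to(jobs_dir).parts)
        if digits.isdigit():
            ids.add(int(digits))
    return ids


def job_ids_in_active_jobs(active_dir: Path) -> Set[int]:
    """Collect job ids from active_jobs/.

    Assumption: entries are named '<job_id>.<task_id>', e.g. '515.1'.
    """
    ids: Set[int] = set()
    if not active_dir.is_dir():
        return ids
    for entry in active_dir.iterdir():
        head = entry.name.split(".", 1)[0]
        if head.isdigit():
            ids.add(int(head))
    return ids


def stale_job_ids(node_spool: Path) -> List[int]:
    """Job ids present under jobs/ but absent from active_jobs/."""
    in_jobs = job_ids_in_jobs_dir(node_spool / "jobs")
    active = job_ids_in_active_jobs(node_spool / "active_jobs")
    return sorted(in_jobs - active)


if __name__ == "__main__":
    # Usage: python stale_jobs.py node01
    sge_root = os.environ.get("SGE_ROOT", "")
    if len(sys.argv) != 2 or not sge_root:
        sys.exit("usage: SGE_ROOT must be set; pass the node name as the only argument")
    spool = Path(sge_root) / "default" / "spool" / sys.argv[1]
    for job_id in stale_job_ids(spool):
        print("job %d: entry under jobs/ but nothing in active_jobs/" % job_id)
```

If this prints the job ids that still show up as running in the queue, that would be consistent with the explanation above; whether clearing those entries is safe on a given installation is something to verify first.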
