On Thu, 11 Aug 2011 at 10:31 -0000, Reuti wrote:
> I think the message in the subject happens when there is something
> in the spool directory of the node like
> "$SGE_ROOT/default/spool/node01/jobs/00/0000/515" while there is
> nothing in "active_jobs" any longer. So it can't kill anything.
>
> Clearing the node's "jobs" directory may resolve it.

Just to be clear (for the archives and future users): the message in
the subject of this thread occurs when the following are set in the
SGE configuration:

    reschedule_unknown 00:15:00
    qmaster_params     ENABLE_RESCHEDULE_KILL=true \
                       ENABLE_RESCHEDULE_SLAVE=true

With these settings, unrelated jobs get incorrectly killed on
many/most/all other nodes when a single node hits the 15-minute
reschedule_unknown time limit, for example because that node was
powered off or locked up for some other reason.

My "solution" was simply to turn these settings back off, which is
probably the simplest fix for anyone else seeing this problem.
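
For anyone who wants to check whether a cluster has this combination
enabled, here is a minimal Python sketch. It assumes the configuration
text is in the "name value" form with backslash line continuations shown
above (for example, the output of "qconf -sconf"). The function names
parse_sge_conf and risky_reschedule_settings are made up for this
example and are not part of SGE.

```python
import re

def parse_sge_conf(text):
    """Parse 'name value' lines, joining backslash-continued lines first."""
    joined = re.sub(r"\\\s*\n\s*", " ", text)
    conf = {}
    for line in joined.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        parts = line.split(None, 1)
        conf[parts[0]] = parts[1].strip() if len(parts) > 1 else ""
    return conf

def risky_reschedule_settings(conf):
    """Return the settings discussed in this thread that are switched on."""
    found = []
    # A timeout made up only of zeros (00:00:00) is treated as disabled.
    timeout = conf.get("reschedule_unknown", "00:00:00")
    if any(ch in "123456789" for ch in timeout):
        found.append("reschedule_unknown=" + timeout)
    # Assumption: parameters are separated by commas and/or whitespace.
    for param in re.split(r"[,\s]+", conf.get("qmaster_params", "none")):
        name, _, value = param.partition("=")
        if name in ("ENABLE_RESCHEDULE_KILL", "ENABLE_RESCHEDULE_SLAVE") \
                and value.lower() in ("true", "1"):
            found.append(name + "=" + value)
    return found

# Sample input copied from the settings quoted in this message.
sample = """reschedule_unknown 00:15:00
qmaster_params ENABLE_RESCHEDULE_KILL=true \\
 ENABLE_RESCHEDULE_SLAVE=true
"""

if __name__ == "__main__":
    for setting in risky_reschedule_settings(parse_sge_conf(sample)):
        print("enabled:", setting)
```

If the list comes back non-empty on a real cluster, the settings can be
changed with "qconf -mconf".
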
Stuart