I've been working on adding BLCR checkpointing for OpenMPI jobs on our cluster. Although the checkpoint and restart themselves seem to work, along the way I encountered a few issues when I reschedule a multi-node job via qmod -rq or qmod -rj.

1) I get errors in the messages file of the nodes running slave tasks, but not the master:

12/07/2012 15:43:52| main|node-f10|E|slave shepherd of job 1843574.1 exited with exit status = 11
12/07/2012 15:43:52| main|node-f10|E|can't find directory active_jobs/1843574.1/1.node-f10 for reaping job 1843574.1 task 1.node-f10
--
12/10/2012 13:27:41| main|node-f10|E|slave shepherd of job 2061719.1 exited with exit status = 11
12/10/2012 13:27:41| main|node-f10|E|can't find directory active_jobs/2061719.1/1.node-f10 for reaping job 2061719.1 task 1.node-f10
--
12/10/2012 14:42:37| main|node-f10|E|slave shepherd of job 2062825.1 exited with exit status = 11
12/10/2012 14:42:37| main|node-f10|E|can't find directory active_jobs/2062825.1/1.node-f10 for reaping job 2062825.1 task 1.node-f10
--
12/10/2012 14:57:57| main|node-f10|E|slave shepherd of job 2062825.1 exited with exit status = 11
12/10/2012 14:57:57| main|node-f10|E|can't find directory active_jobs/2062825.1/1.node-f10 for reaping job 2062825.1 task 1.node-f10
--
12/11/2012 09:27:45| main|node-f10|E|slave shepherd of job 2066267.1 exited with exit status = 11
12/11/2012 09:27:45| main|node-f10|E|can't find directory active_jobs/2066267.1/1.node-f10 for reaping job 2066267.1 task 1.node-f10
--
12/12/2012 09:38:02| main|node-f10|E|slave shepherd of job 2067358.1 exited with exit status = 11
12/12/2012 09:38:02| main|node-f10|E|can't find directory active_jobs/2067358.1/1.node-f10 for reaping job 2067358.1 task 1.node-f10
12/12/2012 11:51:53| main|node-f10|E|slave shepherd of job 2067359.1 exited with exit status = 11
12/12/2012 11:51:53| main|node-f10|E|can't find directory active_jobs/2067359.1/1.node-f10 for reaping job 2067359.1 task 1.node-f10
2) On the non-master nodes the actual job processes do not die; instead they get re-parented to init.

3) The job in question will not run on the non-master nodes of its previous incarnation. If it tries to start on them it gets stuck in an Rt state until I restart (softstop then start) the sge_execd.

I can probably find a way to kill the rogue processes and kick the sge_execd when the errors appear (rough sketch below), but I wonder if anyone has encountered this before and has a way to prevent the issues in the first place.

Thanks,
William
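
P.S. The interim cleanup I have in mind is roughly the untested Python sketch below. It walks /proc for processes that have been re-parented to init and still carry JOB_ID=<jobid> in their environment, kills them, and then bounces sge_execd the same way I currently do by hand. The JOB_ID-in-environment check and the /etc/init.d/sgeexecd path are assumptions about our installation, so treat those names as placeholders.

#!/usr/bin/env python
# Untested sketch: kill job processes that were re-parented to init,
# then softstop/start sge_execd.  Assumes Linux /proc, that SGE put
# JOB_ID into the job environment, and the usual init script location.
import os
import signal
import subprocess
import sys

def orphaned_job_pids(job_id):
    """Return PIDs whose parent is init (PPID 1) and whose environment
    contains JOB_ID=<job_id>."""
    pids = []
    for entry in os.listdir('/proc'):
        if not entry.isdigit():
            continue
        try:
            with open('/proc/%s/stat' % entry) as f:
                # stat is "pid (comm) state ppid ..."; comm may contain spaces
                ppid = int(f.read().rsplit(')', 1)[1].split()[1])
            with open('/proc/%s/environ' % entry) as f:
                env = f.read().split('\0')
        except (IOError, OSError):
            continue  # process exited or we lack permission; skip it
        if ppid == 1 and ('JOB_ID=%s' % job_id) in env:
            pids.append(int(entry))
    return pids

def main(job_id):
    for pid in orphaned_job_pids(job_id):
        print('killing orphaned pid %d of job %s' % (pid, job_id))
        os.kill(pid, signal.SIGKILL)
    # softstop then start, as I do by hand at the moment (path may differ)
    subprocess.call(['/etc/init.d/sgeexecd', 'softstop'])
    subprocess.call(['/etc/init.d/sgeexecd', 'start'])

if __name__ == '__main__':
    main(sys.argv[1])

(It would be run as root on the affected slave node with the numeric job id as the only argument.)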
