William Hay <[email protected]> writes:

> I've been working on adding BLCR checkpointing for OpenMPI jobs on our
> cluster.

Is that the one with InfiniPath?  If so, where did you get the
checkpointing support?

> Although the checkpoint and restart themselves seem to work, in
> the process I encountered a few issues if I reschedule a multi-node job via
> qmod -rq or qmod -rj.
> 1)I get errors in the messages file of nodes running slave tasks but not
> the master.
> 12/07/2012 15:43:52|  main|node-f10|E|slave shepherd of job 1843574.1
> exited with exit status = 11

Presumably the first thing to do is figure out why the slave shepherd
exited with status 11.

> 2)On the non-master nodes the actual job processes do not die but instead
> get re-parented to init.
> 3)The job in question will not run on the non-master nodes of its previous
> incarnation.  If it tries to start on them it gets stuck in an Rt state
> until I restart (softstop then start) the sge_execd.

No log messages, I assume.  Do other jobs start?  Is it a problem with
the original spool directories not being deleted?

> I can probably find a way to kill the rogue processes and kick the
> sge_execd when the errors appear but I wonder if anyone has encountered
> this before and has a way to prevent the issues in the first place

The latest SGE should prevent that, but otherwise (if you can avoid
loosely integrated jobs) proc_police should do the job
<http://arc.liv.ac.uk/SGE/howto/remove_orphaned_processes.html>.
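
In the meantime, if you just want to see what has been re-parented to
init on a node, something like the following rough sketch may help.  It
is not how proc_police works; it only lists processes owned by a given
user whose parent is PID 1, by reading Linux /proc.  The function names
are made up for illustration, and it assumes orphans really do end up
under PID 1 (as you describe) and that the /proc/<pid> directory owner
matches the process owner, which is normally the case.

```python
import os

def parse_stat(line):
    """Return (pid, comm, ppid) from the contents of /proc/<pid>/stat."""
    # comm is wrapped in parentheses and may itself contain spaces or ')',
    # so split on the LAST ')' rather than on whitespace.
    head, _, tail = line.rpartition(")")
    pid_str, _, comm = head.partition("(")
    fields = tail.split()
    # After comm, fields[0] is the state and fields[1] is the parent PID.
    return int(pid_str), comm, int(fields[1])

def orphans(uid, proc="/proc"):
    """Yield (pid, comm) for processes owned by uid whose parent is PID 1."""
    for entry in os.listdir(proc):
        if not entry.isdigit():
            continue
        path = os.path.join(proc, entry)
        try:
            if os.stat(path).st_uid != uid:
                continue
            with open(os.path.join(path, "stat")) as f:
                pid, comm, ppid = parse_stat(f.read())
        except (OSError, ValueError, IndexError):
            continue  # process went away (or was unreadable) mid-scan
        if ppid == 1:
            yield pid, comm

if __name__ == "__main__":
    # List the calling user's processes that now hang off init.
    for pid, comm in orphans(os.getuid()):
        print(pid, comm)
```

Legitimate daemons also have init as their parent, so treat the output
as a list of candidates to inspect, not as a kill list.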

-- 
Community Grid Engine:  http://arc.liv.ac.uk/SGE/
_______________________________________________
users mailing list
[email protected]
https://gridengine.org/mailman/listinfo/users
