Adam Tygart <[email protected]> writes:

> Hello everyone,
>
> I've seen DMTCP mentioned on occasion here, and was hoping someone had
> some notes or, even better, scripts to handle checkpointing and
> restarting applications via dmtcp and a -ckpt environment in SGE.
> Either my google-fu is weak, or there aren't any publicly available
> scripts/notes for doing this already.
>
> Thoughts anyone?

I was considering it for single-node jobs.  I mailed the person who
mentioned SGE in the DMTCP mail archives, but didn't hear back.  I
couldn't find anything for Torque et al either.

The practical problem with multiple tasks per node is managing the DMTCP
socket.  You either need to specify a port to dmtcp_checkpoint or let it
pick a random one (batch mode).  The former requires keeping some sort
of list of ports in use on the node (with locking).  With the latter,
there's no convenient means of finding out the port in use -- it's just
printed to dmtcp_checkpoint's stdout and set in the environment of the
sub-process.
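
For what it's worth, the "list of ports in use (with locking)" option
might look roughly like the sketch below.  It's only an illustration:
the registry file, the port range and the function names are all made
up here, and it only records ports claimed through the registry; it
doesn't check whether something else on the node has already bound one.
How the chosen port then gets handed to dmtcp_checkpoint is left out.

```python
import fcntl
import json
import os


def _update(registry, change):
    """Apply change(ports) to the registry while holding an exclusive lock."""
    fd = os.open(registry, os.O_RDWR | os.O_CREAT, 0o644)
    with os.fdopen(fd, "r+") as f:
        fcntl.flock(f, fcntl.LOCK_EX)  # lock is dropped when f is closed
        text = f.read()
        ports = json.loads(text) if text else {}
        result = change(ports)
        f.seek(0)
        f.truncate()
        json.dump(ports, f)
        return result


def claim_port(registry, owner, low=7800, high=7899):
    """Reserve the lowest unclaimed port in [low, high] for owner.

    The default range is an arbitrary assumption, not a DMTCP convention.
    """
    def change(ports):
        for port in range(low, high + 1):
            if str(port) not in ports:
                ports[str(port)] = owner
                return port
        raise RuntimeError("no free port in range")
    return _update(registry, change)


def release_port(registry, port):
    """Drop a port from the registry, e.g. from a job epilog."""
    _update(registry, lambda ports: ports.pop(str(port), None))
```

The owner string would presumably be something like the SGE job and
task id, so stale entries can be matched against running jobs and
cleaned up.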

I looked at modifying dmtcp_checkpoint to log the random port somewhere,
but wasn't sure how and where best to do it.  (C++ doesn't
encourage me to hack on things.)  I haven't got back to it, and doing so
should probably involve discussion on the dmtcp list.  I think you can
set it to checkpoint itself and resume from the last checkpoint in case
of disaster OK, but you need the control socket to do a checkpoint prior
to migration, at least.

I reckon it should be able to use a Unix domain socket in a sensible
place -- like the temporary directory DMTCP references (but doesn't
actually use as far as I can tell) -- assuming the SGE processes can
access it.  That would be more secure, and you'd know where to look for
the control socket, assuming it's named for the SGE task.
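
To illustrate the naming idea only (as far as I know DMTCP has no such
Unix-socket option, so this is hypothetical), a path could be derived
from the environment SGE gives the job.  JOB_ID, SGE_TASK_ID and TMPDIR
are the usual SGE variables; the "dmtcp-<job>.<task>.sock" name and the
function are invented for this sketch.

```python
import os


def control_socket_path(env=os.environ):
    """Build a per-task socket path from SGE's job environment.

    Hypothetical naming scheme; nothing in DMTCP reads this path.
    """
    job = env["JOB_ID"]
    # SGE sets SGE_TASK_ID to "undefined" for non-array jobs.
    task = env.get("SGE_TASK_ID", "undefined")
    if task == "undefined":
        task = "1"
    base = env.get("TMPDIR", "/tmp")  # per-job scratch directory under SGE
    return os.path.join(base, "dmtcp-%s.%s.sock" % (job, task))
```

One thing to watch is that Unix socket paths are length-limited (about
108 bytes on Linux), so a deep TMPDIR could be a problem.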

(Open MPI has added DMTCP support.  I haven't looked at how it works
there, but I guess it won't be relevant for dealing with simple jobs,
and it's not terribly useful for parallel ones without InfiniBand
support.)

Sorting this out would be a useful, straightforward contribution if
someone would like to tackle it and make the result available.

-- 
Community Grid Engine:  http://arc.liv.ac.uk/SGE/
_______________________________________________
users mailing list
[email protected]
https://gridengine.org/mailman/listinfo/users
