Adam Tygart <[email protected]> writes:

> Hello everyone,
>
> I've seen DMTCP mentioned on occasion here, and was hoping someone had
> some notes or, even better, scripts to handle checkpointing and
> restarting applications via dmtcp and a -ckpt environment in SGE.
> Either my google-fu is weak, or there aren't any publicly available
> scripts/notes for doing this already.
>
> Thoughts anyone?

I was considering it for single-node jobs.  I mailed the person who
mentioned SGE in the DMTCP mail archives, but didn't hear back.  I
couldn't find anything for Torque et al either.

The practical problem with multiple tasks per node is managing the
DMTCP socket.  You either need to specify a port to dmtcp_checkpoint
or let it pick a random one (batch mode).  The former requires keeping
some sort of list of ports in use on the node (with locking).  With
the latter, there's no convenient means of finding out which port is
in use: it's just printed to dmtcp_checkpoint's stdout and set in the
environment of the sub-process.

I looked at modifying dmtcp_checkpoint to log the random port
somewhere, but wasn't sure how and where best to do it.  (C++ doesn't
encourage me to hack on things.)  I haven't got back to it, and doing
so should probably involve discussion on the dmtcp list.

I think you can set it to checkpoint itself periodically and resume
from the last checkpoint in case of disaster OK, but you need access
to the control socket to trigger a checkpoint prior to migration, at
least.

I reckon it should be able to use a Unix domain socket in a sensible
place, such as the temporary directory DMTCP references (but doesn't
actually use, as far as I can tell), assuming the SGE processes can
access it.  That would be more secure, and you'd know where to look
for the control socket, assuming it's named for the SGE task.

(Open MPI has added DMTCP support.  I haven't looked at how it works
there, but I guess it won't be relevant for dealing with simple jobs,
and it's not terribly useful for parallel ones without InfiniBand
support.)

Sorting this out would be a useful, straightforward contribution if
someone would like to tackle it and make the result available.

-- 
Community Grid Engine:  http://arc.liv.ac.uk/SGE/
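
For what it's worth, the "list of ports in use on the node (with
locking)" option could be sketched roughly as below.  This is only an
illustration, not anything DMTCP or SGE provides: the registry file,
its JSON layout, and the names claim_port and release_port are all
made up for the example.  It also assumes a local filesystem where
flock works, and it only records the port; nothing stops an unrelated
process from binding it before the coordinator does.

```python
import fcntl
import json
import os
import socket
import tempfile


def _free_port():
    # Ask the kernel for a currently unused TCP port.  The port is
    # released again on close, so this is a hint, not a reservation.
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as s:
        s.bind(("", 0))
        return s.getsockname()[1]


def claim_port(registry, task):
    """Record a port for `task` in the per-node registry and return it."""
    fd = os.open(registry, os.O_RDWR | os.O_CREAT, 0o600)
    with os.fdopen(fd, "r+") as f:
        # Exclusive lock, dropped when the file is closed.
        fcntl.flock(f, fcntl.LOCK_EX)
        ports = json.loads(f.read() or "{}")
        if task not in ports:
            port = _free_port()
            while port in ports.values():
                port = _free_port()
            ports[task] = port
        f.seek(0)
        f.truncate()
        json.dump(ports, f)
        return ports[task]


def release_port(registry, task):
    """Remove the entry for `task`, e.g. from the job's epilog."""
    with open(registry, "r+") as f:
        fcntl.flock(f, fcntl.LOCK_EX)
        ports = json.loads(f.read() or "{}")
        ports.pop(task, None)
        f.seek(0)
        f.truncate()
        json.dump(ports, f)


if __name__ == "__main__":
    # Hypothetical registry location; JOB_ID and SGE_TASK_ID are the
    # variables SGE sets in the job environment.
    registry = os.path.join(tempfile.gettempdir(), "dmtcp_ports.json")
    task = "%s.%s" % (os.environ.get("JOB_ID", "0"),
                      os.environ.get("SGE_TASK_ID", "0"))
    print(claim_port(registry, task))
```

The job script would then hand the printed port to dmtcp_checkpoint in
whatever way the installed DMTCP version accepts one, and the
checkpoint and migration methods could look the port up again under
the same task name.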
