On Wed, Mar 16, 2011 at 10:51 AM, Dave Love <[email protected]> wrote:
> Lane Schwartz <[email protected]> writes:
>
> > On Fri, Mar 11, 2011 at 6:09 PM, Dave Love <[email protected]> wrote:
> >> What's the attraction of the Condor stuff over BLCR or DMTCP?
> >
> > I don't have any prior experience with checkpointing. I simply
> > searched around, and Condor seemed to be the easiest and most
> > straightforward to set up.
>
> Don't you need to rebuild the application with it, as the doc suggests,
> or can you do it with LD_PRELOAD?
>

So far I've used condor_compile, which has worked fine on the small
programs I've tested. Since almost all the code I'm running is open
source, re-linking hasn't been an issue yet.

> On the other hand, DMTCP works in user space with unaltered applications
> (with certain restrictions, like only socket-based communication as far
> as I remember). I haven't yet tried to set it up under SGE, though, so
> have no experience to offer. BLCR works reasonably, especially with
> some MPIs, but you have to maintain the kernel module.

Over the past few days, I've tried using condor and DMTCP for
checkpointing. I've been able to get both to work locally. I can launch
checkpointable jobs on my machine, checkpoint and kill them, then
successfully restart the jobs from the checkpoint files.

For condor and DMTCP, the process I've tried for launching jobs is very
similar:

Launch job using condor:
    my_relinked_app -_condor_ckpt /path/to/save/checkpoint.file

Restart job using condor:
    my_relinked_app -_condor_restart /path/to/save/checkpoint.file

Launch job using DMTCP:
    dmtcp_checkpoint my_app

Restart job using DMTCP:
    dmtcp_restart /path/to/saved/checkpoint.file

Unfortunately, things break down when I try submitting checkpointable
jobs via qsub. If I submit either of the above "Launch job" commands
using qsub, my application never starts running:

    qsub ./my.condor.sh (or qsub ./my.dmtcp.sh)

The job gets queued up and assigned to run, and the stderr and stdout
files are created.
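For reference, a wrapper such as my.dmtcp.sh could look roughly like the
sketch below. This is not the exact script used here: the #$ directives,
the APP variable, and the launch_cmd helper are illustrative assumptions.
It only wraps the "Launch job using DMTCP" command from above and records
which host the job landed on, which can help when comparing a qsub run
against a local run.

```shell
#!/bin/sh
# Hypothetical SGE job script (illustrative sketch, not the original my.dmtcp.sh).
#$ -S /bin/sh
#$ -cwd
#$ -j y

# APP is an assumed variable name; it defaults to the placeholder "my_app".
APP="${APP:-my_app}"

# Build the DMTCP launch line for a given application name.
launch_cmd() {
    printf 'dmtcp_checkpoint %s\n' "$1"
}

# Record where the job runs and what it is about to launch.
echo "host: $(uname -n)"
echo "launch: $(launch_cmd "$APP")"

# Launch under DMTCP only if the tool is on PATH in the batch environment;
# otherwise just report the problem on stderr.
if command -v dmtcp_checkpoint >/dev/null 2>&1; then
    dmtcp_checkpoint "$APP"
else
    echo "dmtcp_checkpoint not found on PATH" >&2
fi
```

The condor variant would differ only in the launch line, using the
relinked binary with -_condor_ckpt as shown above.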
When a checkpointable job starts, condor and DMTCP each print a small
log message. That log message shows up in the logs, but no output from
my program appears. SGE lists my job's status as "r", but when I ssh in
to the machine where the job is running and run ps aux, ps lists my
job's status as suspended.

When I launch my checkpointable jobs locally (not using qsub) they run
and produce immediate output. When I run those same jobs using qsub,
they go into "r" status, but never produce output and do not appear to
be actually running.

On a related topic, using 6.2u5p1 I've had mixed results following the
checkpointing interface tutorial at
http://gridscheduler.sourceforge.net/howto/checkpointing.html.

The initial examples describe setting up a transparent interface and
running it with some simple shell scripts; I've been able to get these
to work as described. I've also followed the examples for setting up an
application-level interface with shell scripts; that works, but only the
migr_command and clean_command appear to run. When I run example 6,
which uses condor in conjunction with transparent checkpointing, no
condor checkpoint files are created.

I'd love to use checkpointing, and it feels like I'm tantalizingly close
to having things working. Does anyone actually have checkpointing
working with Condor, DMTCP, or any other library using 6.2u5p1?

Thanks,
Lane
_______________________________________________
users mailing list
[email protected]
https://gridengine.org/mailman/listinfo/users
