On Wed, Mar 16, 2011 at 10:51 AM, Dave Love <[email protected]> wrote:

> Lane Schwartz <[email protected]> writes:
>
> > On Fri, Mar 11, 2011 at 6:09 PM, Dave Love <[email protected]> wrote:
> >> What's the attraction of the Condor stuff over BLCR or DMTCP?
> >
> > I don't have any prior experience with checkpointing. I simply
> > searched around, and Condor seemed to be the easiest and most
> > straightforward to set up.
>
> Don't you need to rebuild the application with it, as the doc suggests,
> or can you do it with LD_PRELOAD?
>

So far I've used condor_compile, which has worked fine on the small
programs I've tested. Since almost all the code I'm running is open source,
re-linking hasn't been an issue yet.


> On the other hand, DMTCP works in user space with unaltered applications
> (with certain restrictions, like only socket-based communication as far
> as I remember).  I haven't yet tried to set it up under SGE, though, so
> have no experience to offer.  BLCR works reasonably, especially with
> some MPIs, but you have to maintain the kernel module.


Over the past few days, I've tried using Condor and DMTCP for checkpointing.
I've been able to get both to work locally. I can launch checkpointable jobs
on my machine, checkpoint and kill them, then successfully restart the jobs
from the checkpoint files. For Condor and DMTCP, the process I've used for
launching jobs is very similar:

Launch job using Condor:  my_relinked_app -_condor_ckpt /path/to/save/checkpoint.file
Restart job using Condor:  my_relinked_app -_condor_restart /path/to/save/checkpoint.file

Launch job using DMTCP: dmtcp_checkpoint my_app
Restart job using DMTCP: dmtcp_restart /path/to/saved/checkpoint.file
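
Since the launch and restart commands differ only in the flag or wrapper
used, a job script could choose between them depending on whether a
checkpoint file already exists. The following is a minimal sketch under that
assumption. The function name ckpt_cmdline and its argument order are
hypothetical; only the commands and flags quoted above are used, and the
function merely prints the command line rather than running it.

```shell
#!/bin/sh
# Print the command line for a fresh launch or for a restart.
# usage: ckpt_cmdline <condor|dmtcp> <app> <checkpoint-file>
# (hypothetical helper, not part of Condor, DMTCP, or SGE)
ckpt_cmdline() {
    tool=$1
    app=$2
    ckpt=$3
    case $tool in
        condor)
            # A relinked Condor binary takes the checkpoint path either way.
            if [ -f "$ckpt" ]; then
                printf '%s -_condor_restart %s\n' "$app" "$ckpt"
            else
                printf '%s -_condor_ckpt %s\n' "$app" "$ckpt"
            fi
            ;;
        dmtcp)
            # DMTCP restarts from the checkpoint file alone.
            if [ -f "$ckpt" ]; then
                printf 'dmtcp_restart %s\n' "$ckpt"
            else
                printf 'dmtcp_checkpoint %s\n' "$app"
            fi
            ;;
        *)
            echo "unknown tool: $tool" >&2
            return 1
            ;;
    esac
}
```

Printing the line first makes it easy to check in the job's stdout file which
branch was taken before actually running the command.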

Unfortunately, things break down when I try submitting checkpointable
jobs via qsub. If I submit either of the above "Launch job" commands using
qsub, my application never starts running:

qsub ./my.condor.sh   (or qsub ./my.dmtcp.sh)

The job gets queued up and assigned to run, and the stderr and stdout files
are created. When a checkpointable job starts, Condor and DMTCP each print a
small log message. That log message shows up in the logs, but no output from
my program appears. SGE lists my job's status as "r", but when I ssh into
the machine where the job is running and run ps aux, ps lists my job's
status as suspended.
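
To make the ps observation concrete, here is a small sketch that decodes the
first letter of the STAT column as documented for Linux ps. The function name
describe_stat is hypothetical. A state of T means the process was stopped by
a signal (for example SIGSTOP, SIGTSTP, SIGTTIN, or SIGTTOU); which of those,
if any, applies to the jobs described here is not known.

```shell
#!/bin/sh
# Translate the first letter of a ps STAT field into a short description.
# (hypothetical helper; the state letters are those documented for Linux ps)
describe_stat() {
    case $1 in
        R*) echo "running or runnable" ;;
        S*) echo "sleeping (interruptible)" ;;
        D*) echo "uninterruptible sleep" ;;
        T*) echo "stopped by a signal" ;;
        Z*) echo "zombie" ;;
        *)  echo "other: $1" ;;
    esac
}

# Example (not run here): list my processes with their state column.
#   ps -o pid,stat,args -u "$USER"
```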

When I launch my checkpointable jobs locally (not using qsub), they run and
produce immediate output. When I run those same jobs using qsub, they go
into "r" status, but never produce output and do not appear to actually be
running.

On a related topic, using 6.2u5p1 I've had mixed results following the
checkpointing interface tutorial at
http://gridscheduler.sourceforge.net/howto/checkpointing.html. The initial
examples describe setting up a transparent interface and running it with
some simple shell scripts; I've been able to get these to work as described.
I've also followed the examples for setting up an application-level interface
with shell scripts; that works, but only the migr_command and clean_command
appear to run. When I run example 6, which uses Condor in conjunction with
transparent checkpointing, no Condor checkpoint files are created.
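
For context, an SGE checkpointing environment is a small configuration
object, as printed by qconf -sckpt <name>. The fragment below is only a
sketch of what an application-level environment can look like. The name
my_ckpt_env and the script paths are hypothetical placeholders, and none of
the values are taken from the tutorial.

```
ckpt_name          my_ckpt_env
interface          application-level
ckpt_command       /path/to/ckpt_command.sh
migr_command       /path/to/migr_command.sh
restart_command    none
clean_command      /path/to/clean_command.sh
ckpt_dir           /tmp
signal             none
when               xsmr
```

According to the checkpoint(5) man page, the letters in the when field select
the events that trigger checkpointing (s: execd shutdown, m: periodically at
the queue's min_cpu_interval, x: job suspension, r: rescheduling), so that
field and min_cpu_interval may be worth checking when only some of the
commands appear to run.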

I'd love to use checkpointing, and it feels like I'm tantalizingly close to
having things working. Does anyone actually have checkpointing working with
Condor, DMTCP, or any other library using 6.2u5p1?

Thanks,
Lane
_______________________________________________
users mailing list
[email protected]
https://gridengine.org/mailman/listinfo/users
