Hi,

I would like to use condor's standalone checkpointing to enable
checkpointing jobs that are run via Sun Grid Engine (SGE). I've
successfully compiled a toy C program using condor_compile, and I can
successfully run, stop, and resume the job with its checkpoint file.

When I attempt to run my toy using qsub as an SGE job with
checkpointing enabled, the job gets queued up but never runs. The job
runs fine if submitted without checkpointing. Has anyone here
successfully run SGE jobs using condor checkpointing?

For reference, here's my configuration. Within SGE's qmon utility, I
defined a checkpoint object called "condor" with the following
configuration:

Name: condor
Interface: TRANSPARENT
Checkpoint command: NONE
Migrate command: NONE
Clean command: NONE
Checkpoint directory: /tmp
Checkpoint When: xsr
Checkpoint Signal: NONE

To submit the job with checkpointing, I ran this:
qsub -ckpt condor /home/lane/toy.sh -_condor

Where toy.sh is:
#!/bin/bash

/usr/bin/setarch x86_64 -R -L /home/lane/toy -_condor_D_ALL


The job as submitted above gets a "qw" status, but never runs. If I
submit the job without "-ckpt condor", it runs.
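
For anyone reproducing this, here is a minimal sketch of the checks that
seem most relevant to a job stuck in "qw" with a checkpoint request. It
assumes the standard SGE client tools (qstat, qconf) are on the PATH. The
job ID 12345 and the queue name all.q are hypothetical placeholders, not
values from my setup. One commonly cited cause is that the checkpoint
object is not listed in any queue's ckpt_list, so the scheduler finds no
eligible queue; I have not confirmed that this is the cause here. By
default the script only prints the commands; set RUN=1 to execute them.

```shell
#!/bin/bash
# Sketch only: prints the inspection commands; set RUN=1 to execute them
# on a host where the SGE client tools are installed.
# The JOB_ID and QUEUE defaults are hypothetical placeholders.
JOB_ID="${JOB_ID:-12345}"
QUEUE="${QUEUE:-all.q}"
CKPT="${CKPT:-condor}"

sge_ckpt_checks() {
    local cmd
    for cmd in \
        "qstat -j ${JOB_ID}" \
        "qconf -sckpt ${CKPT}" \
        "qconf -sq ${QUEUE}"
    do
        # Show each command before (optionally) running it.
        echo "+ ${cmd}"
        if [ "${RUN:-0}" = "1" ]; then
            ${cmd}
        fi
    done
}

sge_ckpt_checks
```

The first command reports why the job is still pending (if the scheduler
is configured to record that information), the second prints the
checkpoint object, and the third prints the queue configuration,
including its ckpt_list entry.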

Any pointers or tips would be appreciated. I've done quite a bit of
research online; it appears that this should be possible, but I
haven't had any success figuring out how.

Cheers,
Lane
_______________________________________________
users mailing list
[email protected]
https://gridengine.org/mailman/listinfo/users