http://www.sc.ehu.es/acwmialj/ptech/checkpointing.html
On 03/ 9/11 11:38 AM, "Hung-Sheng Tsao (laotsao 老曹 ) Ph.D" wrote:
It seems that you need to add all.q to the queue_list in your checkpoint object.

On 03/ 9/11 11:25 AM, Lane Schwartz wrote:
Hi,

I would like to use Condor's standalone checkpointing to enable checkpointing for jobs that are run via Sun Grid Engine (SGE). I've successfully compiled a toy C program using condor_compile, and I can run, stop, and resume the job from its checkpoint file.

When I attempt to run my toy program using qsub as an SGE job with checkpointing enabled, the job gets queued but never runs. The job runs fine if submitted without checkpointing. Has anyone here successfully run SGE jobs using Condor checkpointing?

For reference, here's my configuration. Within SGE's qmon utility, I defined a checkpoint object called "condor" with the following configuration:

    Name:                  condor
    Interface:             TRANSPARENT
    Checkpoint command:    NONE
    Migrate command:       NONE
    Clean command:         NONE
    Checkpoint directory:  /tmp
    Checkpoint When:       xsr
    Checkpoint Signal:     NONE

To submit the job with checkpointing, I ran this:

    qsub -ckpt condor /home/lane/toy.sh -_condor

where toy.sh is:

    #!/bin/bash
    /usr/bin/setarch x86_64 -R -L /home/lane/toy -_condor_D_ALL

The job as submitted above gets a "qw" status but never runs. If I submit the job without "-ckpt condor", it runs.

Any pointers or tips would be appreciated. I've done quite a bit of research online; it appears that this should be possible, but I just haven't had any success figuring out how.

Cheers,
Lane
_______________________________________________
users mailing list
[email protected]
https://gridengine.org/mailman/listinfo/users
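A job submitted with -ckpt stays in "qw" when no queue offers the requested checkpoint object. As far as I know, on SGE 6.x the checkpoint object no longer carries its own queue list; instead each cluster queue names the checkpoint objects it accepts in its ckpt_list attribute. The helper below is a hypothetical sketch (the function name queue_accepts_ckpt is made up for illustration) that checks that attribute in text formatted like the output of "qconf -sq all.q". It assumes the queue is all.q and the object is condor, as in the message above, and it does not parse host-specific overrides or backslash-continued lines.

```shell
#!/usr/bin/env bash
# Hypothetical helper: exits 0 if the ckpt_list attribute in a queue
# configuration dump (as printed by `qconf -sq <queue>`) names the given
# checkpoint object, 1 otherwise. Host-specific overrides in square
# brackets and backslash-continued lines are not parsed.
queue_accepts_ckpt() {
  local conf="$1"   # full text of the queue configuration
  local want="$2"   # checkpoint object name, e.g. condor
  printf '%s\n' "$conf" | awk -v want="$want" '
    $1 == "ckpt_list" {
      # The value may be NONE or a comma- and/or space-separated list.
      for (i = 2; i <= NF; i++) {
        k = split($i, parts, ",")
        for (j = 1; j <= k; j++)
          if (parts[j] == want) found = 1
      }
    }
    END { exit(found ? 0 : 1) }
  '
}
```

Usage on a live cluster might look like: queue_accepts_ckpt "$(qconf -sq all.q)" condor || echo "all.q does not offer checkpoint object condor". If the object is missing, "qconf -aattr queue ckpt_list condor all.q" should append it (assuming SGE 6.x; check qconf(1) and queue_conf(5) for your version).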
