I'm trying to get BLCR checkpointing running on our cluster. I've
created a checkpointing environment that looks like this:

ckpt_name          blcr
interface          application-level
ckpt_command       /cm/shared/apps/sge/assist/ckpt/blcr/checkpoint.sh $job_pid
migr_command       /cm/shared/apps/sge/assist/ckpt/blcr/checkpoint.sh $job_pid
restart_command    none
clean_command      /bin/true
ckpt_dir           /tmp
signal             none
when               xsmr

I submit a serial job to the checkpointing environment with

#$ -c mxs
#$ -ckpt blcr

and after it starts running I suspend it.
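
For reference, a minimal sketch of the kind of job script described above. Only the two `-c` and `-ckpt` directives come from my actual submission; the `-S /bin/sh` line, the `work` function, and its step counts are hypothetical stand-ins for the real serial job.

```shell
#!/bin/sh
#$ -S /bin/sh
#$ -c mxs
#$ -ckpt blcr

# Hypothetical stand-in workload: print one line per step, pausing between steps.
work() {
    steps=$1
    delay=$2
    i=1
    while [ "$i" -le "$steps" ]; do
        echo "step $i of $steps"
        sleep "$delay"
        i=$((i + 1))
    done
}

# JOB_ID is set by Grid Engine inside a running job, so the long loop
# only starts there and the job stays alive long enough to be suspended.
if [ -n "${JOB_ID:-}" ]; then
    work 600 1
fi
```

A script like this would be submitted with qsub and, once running, suspended with something like `qmod -sj <job_id>`.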

The messages file for the node it runs on contains the following:

09/17/2012 15:42:44|  main|node-o03|I|initiate migration at job suspend for job 898195 task 1
09/17/2012 15:42:44|  main|node-o03|I|SIGNAL jid: 898195 jatask: 1 signal: MIGRATE

However, as far as I can tell, neither the ckpt_command nor the
migr_command is run. The first line of the checkpoint.sh script
touches a file in /tmp; that file never appears, and neither do any
checkpoints.
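
To make the check concrete, here is a minimal sketch of how the top of such a script can record that it was invoked. This is not the actual checkpoint.sh: the `mark_invoked` helper, the marker file names, and the context file path are all hypothetical, and the `cr_checkpoint -f` call assumes BLCR's command-line tool is on the PATH of the execution host.

```shell
#!/bin/sh
# Hypothetical sketch of a checkpoint.sh; $1 is the PID that SGE
# substitutes for $job_pid in ckpt_command / migr_command.

# Record every invocation so it is obvious whether the script ran at all.
mark_invoked() {
    marker_dir=$1
    pid=$2
    touch "$marker_dir/ckpt_invoked.$pid" &&
        echo "$(date) checkpoint.sh called for pid $pid" >> "$marker_dir/ckpt_invoked.log"
}

if [ -n "${1:-}" ]; then
    mark_invoked /tmp "$1"
    # Only attempt the checkpoint if BLCR's cr_checkpoint is available.
    # The context file name below is an assumption, not a BLCR default.
    if command -v cr_checkpoint >/dev/null 2>&1; then
        cr_checkpoint -f "/tmp/context.$1" "$1"
    fi
fi
```

If neither the marker file nor the log line shows up after a suspend, the script was never started, which is the situation I am seeing.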

The ckpt_command is deliberately duplicated in migr_command. At first
I was trying to get checkpointing to run without migration, but since
the logs mention migration, I copied the checkpoint script to
migr_command in case it is run instead of ckpt_command when a suitable
job is suspended, rather than as an optional addition to it, as the
man page implies.

We're using 6.2u3 (still).