I'm trying to get blcr checkpointing running on our cluster. I've created a checkpointing environment that looks like this:
ckpt_name          blcr
interface          application-level
ckpt_command       /cm/shared/apps/sge/assist/ckpt/blcr/checkpoint.sh $job_pid
migr_command       /cm/shared/apps/sge/assist/ckpt/blcr/checkpoint.sh $job_pid
restart_command    none
clean_command      /bin/true
ckpt_dir           /tmp
signal             none
when               xsmr

I submit a serial job to the checkpointing environment with

#$ -c mxs
#$ -ckpt blcr

and after it starts running I suspend it. The messages file for the node it runs on contains the following:

09/17/2012 15:42:44| main|node-o03|I|initiate migration at job suspend for job 898195 task 1
09/17/2012 15:42:44| main|node-o03|I|SIGNAL jid: 898195 jatask: 1 signal: MIGRATE

However, as far as I can tell, neither the ckpt_command nor the migr_command is run. The first line of the checkpoint.sh script touches a file in /tmp, which does not appear (nor do any checkpoints).

The ckpt_command is duplicated in migr_command because at first I was trying to get checkpointing to run without migration. Since the logs mentioned migration, I copied the checkpoint script to migr_command to see whether it is run instead of ckpt_command when a suitable job is suspended, rather than as an optional addition to it as the man page implies.

We're using 6.2u3 (still).
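
To make the marker-file check concrete, a stripped-down stand-in for the script could look like the sketch below. It is not the actual checkpoint.sh: the function name record_invocation, the MARKER_DIR variable and the marker file name are placeholders chosen for illustration, and the real BLCR call is deliberately left out. It only records that it was invoked and with which PID, which is enough to tell whether the execd ran ckpt_command or migr_command at all.

```shell
#!/bin/sh
# Hypothetical stand-in for checkpoint.sh: it only records that it was called.
# MARKER_DIR and the marker file name are illustrative, not from the real script.
MARKER_DIR="${MARKER_DIR:-/tmp}"

record_invocation() {
    # $1 is the PID that is substituted for $job_pid in ckpt_command.
    job_pid="$1"
    marker="$MARKER_DIR/ckpt_called.$job_pid"
    # Create the marker first, so it exists even if a later step fails.
    touch "$marker" || return 1
    # Append a timestamped line so repeated invocations can be told apart.
    echo "$(date '+%Y-%m-%d %H:%M:%S') called with job_pid=$job_pid" >> "$marker"
    # The real script would invoke the checkpointer here; omitted in this sketch.
}

# Only act when a PID argument was actually supplied.
if [ -n "${1:-}" ]; then
    record_invocation "$1"
fi
```

If no marker file shows up after the job is suspended, the script was never started, which points at the checkpoint environment or the job's -c/-ckpt settings rather than at the script itself.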
