On 17.09.2012 at 17:11, William Hay wrote:

> I'm trying to get blcr checkpointing running on our cluster. I've
> created a checkpointing environment that looks like this:
>
> ckpt_name blcr
> interface application-level
> ckpt_command /cm/shared/apps/sge/assist/ckpt/blcr/checkpoint.sh $job_pid
> migr_command /cm/shared/apps/sge/assist/ckpt/blcr/checkpoint.sh $job_pid
> restart_command none
> clean_command /bin/true
> ckpt_dir /tmp
> signal none
> when xsmr
>
> I submit a serial job to the checkpointing environment with
> #$ -c mxs
> #$ -ckpt blcr
> and after it starts running I suspend it.
>
> The messages file for the node it runs on contains the following:
>
> 09/17/2012 15:42:44| main|node-o03|I|initiate migration at job suspend for job 898195 task 1
> 09/17/2012 15:42:44| main|node-o03|I|SIGNAL jid: 898195 jatask: 1 signal: MIGRATE
>
> However, as far as I can tell neither the ckpt_command nor the
> migr_command is run. The first line of the checkpoint.sh script
> touches a file in /tmp which does not appear (nor do any checkpoints).

You checked /tmp on the node? The ckpt_command is only run every "min_cpu_interval", which you define in the queue.

> The ckpt_command is duplicated to migr_command because I was trying to
> get checkpointing to run without migration at first but since the logs
> mentioned migration I copied the checkpoint script to migr_command to
> see if it was being run instead of ckpt_command when a suitable job is
> suspended rather than as an optional addition to it as the man page
> implies.

Yes, it should. But the man page is wrong in saying that a checkpoint is created just before the migration. You have to do this on your own in the defined migr_command.

There are Howtos:

http://arc.liv.ac.uk/SGE/howto/checkpointing.html
http://arc.liv.ac.uk/SGE/howto/APSTC-TB-2004-005.pdf

-- Reuti

> We're using 6.2u3 (still).
> _______________________________________________
> users mailing list
> [email protected]
> https://gridengine.org/mailman/listinfo/users
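
To illustrate the point that the migr_command has to take the checkpoint itself, here is a minimal sketch of such a script. It is not the script from the thread: the function name do_checkpoint, the CKPT_DIR and CR_CHECKPOINT variables, and the marker and context file names are all made up for illustration. It assumes BLCR's cr_checkpoint is on the PATH on the execution host and that the job was started in a way BLCR can checkpoint (for example under cr_run).

```shell
#!/bin/sh
# Hypothetical migr_command sketch: take the checkpoint ourselves, since
# SGE does not create one before signalling migration.
# Assumed names (not from the thread): do_checkpoint, CKPT_DIR,
# CR_CHECKPOINT, and the marker/context file names.

do_checkpoint() {
    job_pid=$1
    ckpt_dir=${CKPT_DIR:-/tmp}               # matches ckpt_dir /tmp above
    cr_bin=${CR_CHECKPOINT:-cr_checkpoint}   # BLCR checkpoint utility

    # Marker file, so it is easy to see on the node whether the script ran.
    touch "$ckpt_dir/migr_command_ran.$job_pid" || return 1

    # Write the process context to a file in the checkpoint directory.
    "$cr_bin" -f "$ckpt_dir/context.$job_pid" "$job_pid"
}

# Only act when called with a PID, as in: script.sh $job_pid
if [ "$#" -ge 1 ]; then
    do_checkpoint "$1"
fi
```

In the checkpointing environment this would be referenced the same way as the existing entry, i.e. the script path followed by $job_pid on the migr_command line. Whether the process should also be terminated after the checkpoint is a site decision and is left out here.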
