Accidentally replied straight to Reuti rather than the list where it might be more generally useful.
---------- Forwarded message ----------
From: William Hay <[email protected]>
Date: 18 September 2012 03:02
Subject: Re: [gridengine users] ckpt_command and migr_command not running
To: Reuti <[email protected]>

On 17 September 2012 16:32, Reuti <[email protected]> wrote:
> On 17.09.2012 at 17:11, William Hay wrote:
>
>> I'm trying to get blcr checkpointing running on our cluster. I've
>> created a checkpointing environment that looks like this:
>>
>> ckpt_name          blcr
>> interface          application-level
>> ckpt_command       /cm/shared/apps/sge/assist/ckpt/blcr/checkpoint.sh $job_pid
>> migr_command       /cm/shared/apps/sge/assist/ckpt/blcr/checkpoint.sh $job_pid
>> restart_command    none
>> clean_command      /bin/true
>> ckpt_dir           /tmp
>> signal             none
>> when               xsmr
>>
>> I submit a serial job to the checkpointing environment with
>> #$ -c mxs
>> #$ -ckpt blcr
>> and after it starts running I suspend it.
>>
>> The messages file for the node it runs on contains the following:
>>
>> 09/17/2012 15:42:44| main|node-o03|I|initiate migration at job suspend for job 898195 task 1
>> 09/17/2012 15:42:44| main|node-o03|I|SIGNAL jid: 898195 jatask: 1 signal: MIGRATE
>>
>> However, as far as I can tell neither the ckpt_command nor the
>> migr_command is run. The first line of the checkpoint.sh script
>> touches a file in /tmp which does not appear (nor do any checkpoints).
>
> You checked /tmp on the node?

Yes, nothing shows up there.

> The ckpt_command is only run at intervals of "min_cpu_interval", which
> you define in the queue.

I tried waiting for min_cpu_interval plus a fudge factor as well, and there
was no sign of the checkpoint.sh script being run.

>> The ckpt_command is duplicated to migr_command because I was trying to
>> get checkpointing to run without migration at first, but since the logs
>> mentioned migration I copied the checkpoint script to migr_command to
>> see if it was being run instead of ckpt_command when a suitable job is
>> suspended, rather than as an optional addition to it as the man page
>> implies.
>
> Yes, it should. But the man page is wrong in one respect: it says that a
> checkpoint is created just before the migration. This you have to do on
> your own in the defined migr_command.

First I have to convince it to run some code. Then I can worry about
running the right code.

> There are Howtos:
>
> http://arc.liv.ac.uk/SGE/howto/checkpointing.html
> http://arc.liv.ac.uk/SGE/howto/APSTC-TB-2004-005.pdf

Indeed, but they appear to refer to rather old versions of both SGE and blcr.

> -- Reuti
>
>> We're using 6.2u3 (still).
>> _______________________________________________
>> users mailing list
>> [email protected]
>> https://gridengine.org/mailman/listinfo/users
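
For anyone chasing the same symptom, here is a minimal sketch of the kind of "did it run at all?" check described above, slightly more informative than a bare touch. It is only an illustration: the log path, the LOG_FILE variable and the log_invocation function are hypothetical names, not part of SGE or BLCR, and the sketch assumes the script is handed $job_pid as its only argument, as in the ckpt_command line quoted above.

```shell
#!/bin/sh
# Hypothetical diagnostic sketch: record every invocation before doing
# anything else, so an empty log means the script was never started.
# LOG_FILE and log_invocation are illustrative names, not SGE or BLCR ones.
LOG_FILE="${LOG_FILE:-/tmp/ckpt_invocations.log}"

log_invocation() {
    # Append a timestamp, the invoking user and the arguments received.
    printf '%s user=%s args=%s\n' \
        "$(date '+%Y-%m-%d %H:%M:%S')" "$(id -un)" "$*" >> "$LOG_FILE"
}

# Log first; the real checkpoint call would follow this line.
log_invocation "$@"
```

If the log stays empty after a suspend, the command is probably not being started at all; if lines appear, the user and arguments show how it was called.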
