On 18.09.2012 at 04:04, William Hay wrote:

> Accidentally replied straight to Reuti rather than the list where it
> might be more generally useful.
>
>
> ---------- Forwarded message ----------
> From: William Hay <[email protected]>
> Date: 18 September 2012 03:02
> Subject: Re: [gridengine users] ckpt_command and migr_command not running
> To: Reuti <[email protected]>
>
>
> On 17 September 2012 16:32, Reuti <[email protected]> wrote:
>> On 17.09.2012 at 17:11, William Hay wrote:
>>
>>> I'm trying to get blcr checkpointing running on our cluster. I've
>>> created a checkpointing environment that looks like this:
>>>
>>> ckpt_name          blcr
>>> interface          application-level
>>> ckpt_command       /cm/shared/apps/sge/assist/ckpt/blcr/checkpoint.sh $job_pid
>>> migr_command       /cm/shared/apps/sge/assist/ckpt/blcr/checkpoint.sh $job_pid
>>> restart_command    none
>>> clean_command      /bin/true
>>> ckpt_dir           /tmp
>>> signal             none
>>> when               xsmr
>>>
>>> I submit a serial job to the checkpointing environment with
>>> #$ -c mxs
>>> #$ -ckpt blcr
>>> and after it starts running I suspend it.
>>>
>>> The messages file for the node it runs on contains the following:
>>>
>>> 09/17/2012 15:42:44| main|node-o03|I|initiate migration at job suspend for job 898195 task 1
>>> 09/17/2012 15:42:44| main|node-o03|I|SIGNAL jid: 898195 jatask: 1 signal: MIGRATE
>>>
>>> However as far as I can tell neither the ckpt_command nor the
>>> migr_command are run. The first line of the
>>> checkpoint.sh script touches a file in /tmp which does not appear (nor
>>> do any checkpoints).
>>
>> You checked /tmp on the node?
> Yes nothing shows up there.
>>
>> The ckpt_command is only run in "min_cpu_interval" which you define in the
>> queue.
>
> I tried waiting for min_cpu_interval plus a fudge factor as well and
> no sign of the checkpoint.sh script being run.
>
>>
>>> The ckpt_command is duplicated to migr_command because I was trying to
>>> get checkpointing to run without migration
>>> at first but since the logs mentioned migration I copied the
>>> checkpoint script to migr_command to see if it was being run
>>> instead of ckpt_command when a suitable job is suspended rather than
>>> as an optional addition to it as the man page implies.
>>
>> Yes, it should. But the man page is wrong in the aspect, that a checkpoint
>> is created just before the migration. This you have to do on your own in
>> the defined migr_command.
>
> First I have to convince it to run some code. Then I can worry about
> running the right code.
>
>>
>> There are Howto's:
>>
>> http://arc.liv.ac.uk/SGE/howto/checkpointing.html
>> http://arc.liv.ac.uk/SGE/howto/APSTC-TB-2004-005.pdf
>
> Indeed but they appear to refer to rather old versions of both SGE and blcr.

They still apply; otherwise I would have updated them (at least the Howto and the state diagrams in the BLCR document). If you find that they miss recent changes in SGE, please let me know.

-- Reuti
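
For anyone debugging the same symptom: the approach described in the thread (a checkpoint.sh whose first line touches a file in /tmp) can be extended a little so that each call also records when it ran, as which user, and with which arguments. The following is only a minimal sketch under assumptions: the CKPT_LOG_DIR variable, the log file name and the log_invocation function are hypothetical names chosen for illustration, not anything SGE or BLCR provides.

```shell
#!/bin/sh
# Diagnostic stand-in for checkpoint.sh: it records every invocation so you
# can tell whether SGE calls ckpt_command / migr_command at all.
# CKPT_LOG_DIR and the log file name are hypothetical; adjust as needed.

LOG_DIR="${CKPT_LOG_DIR:-/tmp}"
LOG_FILE="$LOG_DIR/ckpt_invocations.log"

log_invocation() {
    # Append one line per call: timestamp, node name, effective user and the
    # arguments (with the configuration quoted above, the first argument
    # would be the expanded $job_pid).
    printf '%s host=%s user=%s args=%s\n' \
        "$(date '+%Y-%m-%d %H:%M:%S')" "$(uname -n)" "$(id -un)" "$*" \
        >> "$LOG_FILE"
}

log_invocation "$@"

# The real checkpoint call (for example BLCR's cr_checkpoint) would follow here.
```

If the log file stays empty on the execution node after a suspend, the script is not being invoked at all, which points at the checkpointing environment or queue configuration rather than at the script itself.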
