On 18.09.2012 at 04:04, William Hay wrote:

> Accidentally replied straight to Reuti rather than the list where it
> might be more generally useful.
> 
> 
> ---------- Forwarded message ----------
> From: William Hay <[email protected]>
> Date: 18 September 2012 03:02
> Subject: Re: [gridengine users] ckpt_command and migr_command not running
> To: Reuti <[email protected]>
> 
> 
> On 17 September 2012 16:32, Reuti <[email protected]> wrote:
>> On 17.09.2012 at 17:11, William Hay wrote:
>> 
>>> I'm trying to get blcr checkpointing running on our cluster. I've
>>> created a checkpointing environment that looks like this:
>>> 
>>> ckpt_name          blcr
>>> interface          application-level
>>> ckpt_command       /cm/shared/apps/sge/assist/ckpt/blcr/checkpoint.sh $job_pid
>>> migr_command       /cm/shared/apps/sge/assist/ckpt/blcr/checkpoint.sh $job_pid
>>> restart_command    none
>>> clean_command      /bin/true
>>> ckpt_dir           /tmp
>>> signal             none
>>> when               xsmr
>>> 
>>> I submit a serial job to the checkpointing environment with
>>> #$ -c mxs
>>> #$ -ckpt blcr
>>> and after it starts running I suspend it.
>>> 
>>> The messages file for the node it runs on contains the following:
>>> 
>>> 09/17/2012 15:42:44|  main|node-o03|I|initiate migration at job suspend for job 898195 task 1
>>> 09/17/2012 15:42:44|  main|node-o03|I|SIGNAL jid: 898195 jatask: 1 signal: MIGRATE
>>> 
>>> However, as far as I can tell, neither the ckpt_command nor the
>>> migr_command is run. The first line of the checkpoint.sh script
>>> touches a file in /tmp, which does not appear (nor do any checkpoints).
>> 
>> You checked /tmp on the node?
> Yes nothing shows up there.
>> 
>> The ckpt_command is only run at every "min_cpu_interval", which you
>> define in the queue.
> 
> I tried waiting for min_cpu_interval plus a fudge factor as well and
> no sign of the checkpoint.sh script being run.
> 
>> 
>> 
>>> The ckpt_command is duplicated to migr_command because at first I was
>>> trying to get checkpointing to run without migration. Since the logs
>>> mentioned migration, I copied the checkpoint script to migr_command to
>>> see whether it is run instead of ckpt_command when a suitable job is
>>> suspended, rather than as an optional addition to it as the man page
>>> implies.
>> 
>> Yes, it should. But the man page is wrong in one respect: it says that a
>> checkpoint is created just before the migration. You have to do this on
>> your own in the defined migr_command.
> 
> First I have to convince it to run some code.  Then I can worry about
> running the right code.
> 
> 
>> 
>> There are Howtos:
>> 
>> http://arc.liv.ac.uk/SGE/howto/checkpointing.html
>> http://arc.liv.ac.uk/SGE/howto/APSTC-TB-2004-005.pdf
> 
> Indeed but they appear to refer to rather old versions of both SGE and blcr.

They still apply, otherwise I would have updated them (at least the Howto and
the state diagrams in the BLCR document). If you find that they are missing
actual changes in SGE, please let me know.
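
To find out whether the ckpt_command or migr_command is invoked at all, a stub that only records its own invocation can stand in for the real script. The following is a minimal sketch, not a working checkpointer: the function name `log_invocation`, the variable `CKPT_DEBUG_LOG` and the default log path are assumptions made for illustration and are not part of SGE or BLCR.

```shell
#!/bin/sh
# Hypothetical debug stub for ckpt_command / migr_command.
# It only records that it was called; it does NOT take a checkpoint.

log_invocation() {
    # Log path is an assumption; override it with CKPT_DEBUG_LOG.
    log_file="${CKPT_DEBUG_LOG:-/tmp/ckpt_debug.log}"
    # Record timestamp, node name, effective user and the arguments received.
    printf '%s host=%s user=%s args=%s\n' \
        "$(date '+%Y-%m-%d %H:%M:%S')" "$(uname -n)" "$(id -un)" "$*" \
        >> "$log_file"
}

# When used as ckpt_command, the arguments would be whatever the
# checkpointing environment passes, e.g. the expanded $job_pid.
log_invocation "$@"
```

If the log file never appears on the execution node after a suspend, the command is not being started at all; if it does appear, the logged user and arguments show under which account and with which PID it ran.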

-- Reuti
_______________________________________________
users mailing list
[email protected]
https://gridengine.org/mailman/listinfo/users
