I accidentally replied directly to Reuti rather than to the list, where it
might be more generally useful.


---------- Forwarded message ----------
From: William Hay <[email protected]>
Date: 18 September 2012 03:02
Subject: Re: [gridengine users] ckpt_command and migr_command not running
To: Reuti <[email protected]>


On 17 September 2012 16:32, Reuti <[email protected]> wrote:
> Am 17.09.2012 um 17:11 schrieb William Hay:
>
>> I'm trying to get blcr checkpointing running on our cluster.   I've
>> created a checkpointing environment that looks
>> like this:
>>
>> ckpt_name          blcr
>> interface          application-level
>> ckpt_command       /cm/shared/apps/sge/assist/ckpt/blcr/checkpoint.sh $job_pid
>> migr_command       /cm/shared/apps/sge/assist/ckpt/blcr/checkpoint.sh $job_pid
>> restart_command    none
>> clean_command      /bin/true
>> ckpt_dir           /tmp
>> signal             none
>> when               xsmr
>>
>> I submit a serial job to the checkpointing environment with
>> #$ -c mxs
>> #$ -ckpt blcr
>> and after it starts running I suspend it.
>>
>> The messages file for the node it runs on contains the following:
>>
>> 09/17/2012 15:42:44|  main|node-o03|I|initiate migration at job suspend for job 898195 task 1
>> 09/17/2012 15:42:44|  main|node-o03|I|SIGNAL jid: 898195 jatask: 1 signal: MIGRATE
>>
>> However, as far as I can tell, neither the ckpt_command nor the
>> migr_command is run.  The first line of the checkpoint.sh script
>> touches a file in /tmp, which does not appear (nor do any
>> checkpoints).
>
> You checked /tmp on the node?
Yes, nothing shows up there.
>
> The ckpt_command is only run once every "min_cpu_interval", which you
> define in the queue.

I also tried waiting for min_cpu_interval plus a fudge factor, and there
was still no sign of the checkpoint.sh script being run.
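
For anyone checking the same thing, the interval to wait for can be read
from the queue configuration. The snippet below is only a sketch: the
helper names and the queue name all.q are made up for illustration, and it
assumes that qconf -sq prints a line of the form
"min_cpu_interval 00:05:00" (an h:m:s value).

```shell
#!/bin/sh
# Sketch only: these helper names are hypothetical, not part of SGE.

# Read `qconf -sq <queue>` style output on stdin and print the value of
# the min_cpu_interval attribute (assumed to be in h:m:s form).
min_cpu_interval() {
    awk '$1 == "min_cpu_interval" { print $2 }'
}

# Convert an h:m:s value read from stdin into seconds.
hms_to_seconds() {
    awk -F: '{ print $1 * 3600 + $2 * 60 + $3 }'
}

# Example use on a submit host (the queue name is an assumption):
#   qconf -sq all.q | min_cpu_interval | hms_to_seconds
```

Adding a margin to the printed number of seconds gives a reasonable lower
bound on how long to wait before concluding that ckpt_command never ran.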

>
>
>> The ckpt_command is duplicated in migr_command because at first I was
>> trying to get checkpointing to run without migration.  Since the logs
>> mentioned migration, I copied the checkpoint script to migr_command to
>> see whether it is run instead of ckpt_command when a suitable job is
>> suspended, rather than as an optional addition to it as the man page
>> implies.
>
> Yes, it should. But the man page is wrong in stating that a checkpoint is
> created just before the migration. You have to do this on your own in the
> defined migr_command.

First I have to convince it to run some code.  Then I can worry about
running the right code.
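
A minimal sketch of the kind of tracing wrapper that helps with the first
step. It records every invocation before doing anything else, so an empty
trace file means the script was never started at all. This is not the
actual checkpoint.sh: the trace file name and the MARKER_DIR and CKPT_BIN
variables are illustrative names only, and it assumes that BLCR's
cr_checkpoint is on the PATH and accepts "-f <file> <pid>".

```shell
#!/bin/sh
# Hypothetical debugging wrapper for use as ckpt_command/migr_command.
# MARKER_DIR and CKPT_BIN are illustrative names, not SGE settings.
MARKER_DIR="${MARKER_DIR:-/tmp}"
CKPT_BIN="${CKPT_BIN:-cr_checkpoint}"

# Append a timestamped line so that every invocation leaves evidence.
log_invocation() {
    printf '%s checkpoint.sh called pid=%s\n' \
        "$(date '+%Y-%m-%d %H:%M:%S')" "$1" \
        >> "$MARKER_DIR/checkpoint_sh.trace"
}

# Log first, then try to checkpoint the given PID.
run_checkpoint() {
    pid="$1"
    log_invocation "$pid"
    if command -v "$CKPT_BIN" >/dev/null 2>&1; then
        # Assumed BLCR usage: write the context file for this PID.
        "$CKPT_BIN" -f "$MARKER_DIR/context.$pid" "$pid"
    else
        printf '%s not found in PATH\n' "$CKPT_BIN" \
            >> "$MARKER_DIR/checkpoint_sh.trace"
        return 127
    fi
}

# Only act when a PID argument was supplied, e.g. via $job_pid.
if [ "$#" -ge 1 ]; then
    run_checkpoint "$1"
fi
```

Logging before calling the checkpoint tool separates "the script was never
invoked" from "the script ran but the checkpoint failed".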


>
> There are Howtos:
>
> http://arc.liv.ac.uk/SGE/howto/checkpointing.html
> http://arc.liv.ac.uk/SGE/howto/APSTC-TB-2004-005.pdf

Indeed, but they appear to refer to rather old versions of both SGE and blcr.

>
> -- Reuti
>
>
>> We're using 6.2u3 (still).
>> _______________________________________________
>> users mailing list
>> [email protected]
>> https://gridengine.org/mailman/listinfo/users
>
>
>