On 23.03.2012 at 11:55, Lars van der bijl wrote:

> On 23 March 2012 11:46, Reuti <[email protected]> wrote:
>> Hi,
>> 
>> On 23.03.2012 at 10:46, Lars van der bijl wrote:
>> 
>>> Hey everyone,
>>> 
>>> I have a small script.
>>> 
>>> #!/bin/bash
>>> 
>>> echo "start"
>>> 
>>> NUMBER=$(( ( RANDOM % 100 ) + 1 ))
>>> 
>>> path="/production/people/lars/sge-test/output.$NUMBER.txt"
>>> 
>>> for I in {1..20}; do
>>>  echo "----$I----" >> $path;
>>>  sleep 1;
>>> done
>>> echo "end"
>>> 
>>> I submit it to sge (6.2u5)
>>> 
>>> qsub -r y -ckpt realise-checkpoint -q test.q@@atoms -o
>>> /production/people/lars/sge-test -e /production/people/lars/sge-test
>>> /tmp/test.sh
>>> 
>>> the checkpoint is pretty default.
>>> 
>>> On shutdown of execd
>>> On Job Suspend
>>> On Reschedule Job
>> 
>> There is no setting "On Reschedule Job". What is your ckpt definition in
>> detail (qconf -sckpt ...)?
> 
> ckpt_name          realise-checkpoint
> interface          USERDEFINED
> ckpt_command       NONE
> migr_command       NONE
> restart_command    NONE
> clean_command      NONE
> ckpt_dir           /tmp
> signal             NONE
> when               xsr
> 
> It's called "Reschedule Job" (without the "On").

This is the action, but the condition is the unknown state of the exechost.
 

>> 
>>> Halfway through its run on a host I hit Reschedule in qmon.
>>> 
>>> It removes the job from the running list and puts it on a different host. All fine.
>> 
>> Where does the checkpointing environment come into play here when you 
>> rescheduled it by hand already? I.e. you are doing like "qmod -rj ..." if I 
>> get you right, just in the GUI.
> 
> yes just in the GUI.
> 
> Regardless of the checkpoint, shouldn't SGE kill the task immediately on
> the old host?

After a delay: yes.

There was issue 1521, which was fixed in 6.2u3, so it shouldn't be there any
longer.

http://permalink.gmane.org/gmane.comp.clustering.gridengine.users/20388

IMO the use of a checkpointing environment doesn't change anything here.
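
To see when the instance on the old host actually gets killed, it may help to log the host name and a time stamp with every line. A minimal sketch, assuming the job runs under SGE (which sets JOB_ID in the job's environment); the function names log_line and run_steps and the output file name are made up for illustration:

```shell
#!/bin/bash
# Sketch of a variant of the test script; function names are hypothetical.

# log_line FILE TEXT: append TEXT, prefixed with host name and time, to FILE.
log_line() {
    printf '%s %s %s\n' "$(uname -n)" "$(date +%H:%M:%S)" "$2" >> "$1"
}

# run_steps FILE COUNT DELAY: write COUNT numbered lines, DELAY seconds apart.
run_steps() {
    steps_file=$1
    steps_count=$2
    steps_delay=$3
    i=1
    while [ "$i" -le "$steps_count" ]; do
        log_line "$steps_file" "----$i----"
        sleep "$steps_delay"
        i=$((i + 1))
    done
}

# Only start the loop inside a job; SGE sets JOB_ID in the job's environment.
if [ -n "${JOB_ID:-}" ]; then
    echo "start"
    run_steps "/production/people/lars/sge-test/output.$JOB_ID.$(uname -n).txt" 20 1
    echo "end"
fi
```

With the host name in the file name each instance writes its own file, and the last time stamp in the file from the first host shows how long the old instance survived the reschedule.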

-- Reuti


>>> 
>>> I then look at the output of my random output paths.
>>> 
>>> $ cat output.22.txt
>>> ----1----
>>> ----2----
>>> ...
>>> ----19----
>>> ----20----
>>> fluffy  production ~/sge-test
>>> $ cat output.81.txt
>>> ----1----
>>> ----2----
>>> ...
>>> ----19----
>>> ----20----
>>> 
>>> Looks like SGE didn't kill the task on the first host.
>>> 
>>> I do the same submission with 100 seconds, reschedule it a few seconds
>>> into the task running, and the first task will stop around 70.
>> 
>> Do you specify anything like -notify?
> 
> no I do not.
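
For reference: with -notify the job gets a warning signal before the real one, SIGUSR2 ahead of SIGKILL and SIGUSR1 ahead of SIGSTOP. A minimal sketch of catching the warning in the job script, to record when the kill was announced; the handler name on_warning and the log file location NOTIFY_LOG are assumptions for illustration:

```shell
#!/bin/bash
# Sketch only: on_warning and NOTIFY_LOG are hypothetical names.
NOTIFY_LOG=${NOTIFY_LOG:-/tmp/notify-test.$$.log}

# Called when SIGUSR2 arrives, i.e. the warning sent ahead of SIGKILL.
on_warning() {
    echo "$(date +%H:%M:%S) SIGUSR2 on $(uname -n), kill is pending" >> "$NOTIFY_LOG"
}

# Install the handler; the rest of the job script follows below this line.
trap on_warning USR2
```

This would go at the top of the job script, submitted with qsub -notify. The shell runs the handler only after the current foreground command (here the sleep 1) has returned, so the logged time can lag by up to a second.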
> 
>> 
>> -- Reuti
>> 
>> 
>>> is this expected behaviour?
>>> 
>>> Lars
>>> 
>>> _______________________________________________
>>> users mailing list
>>> [email protected]
>>> https://gridengine.org/mailman/listinfo/users
>> 
> 


