On 23 March 2012 13:03, Reuti <[email protected]> wrote:
> On 23.03.2012 at 11:55, Lars van der bijl wrote:
>
>> On 23 March 2012 11:46, Reuti <[email protected]> wrote:
>>> Hi,
>>>
>>> On 23.03.2012 at 10:46, Lars van der bijl wrote:
>>>
>>>> Hey everyone,
>>>>
>>>> I have a small script.
>>>>
>>>> #!/bin/bash
>>>>
>>>> echo "start"
>>>>
>>>> NUMBER=$(( (RANDOM % 100) + 1 ))
>>>>
>>>> path="/production/people/lars/sge-test/output.$NUMBER.txt"
>>>>
>>>> for I in {1..20}; do
>>>>  echo "----$I----" >> "$path";
>>>>  sleep 1;
>>>> done
>>>> echo "end"
>>>>
>>>> I submit it to sge (6.2u5)
>>>>
>>>> qsub -r y -ckpt realise-checkpoint -q test.q@@atoms -o
>>>> /production/people/lars/sge-test -e /production/people/lars/sge-test
>>>> /tmp/test.sh
>>>>
>>>> The checkpoint environment is pretty much default:
>>>>
>>>> On shutdown of execd
>>>> On Job Suspend
>>>> On Reschedule Job
>>>
>>> There is no setting "On Reschedule Job". What is your ckpt definition in
>>> detail (qconf -sckpt ...)?
>>
>> ckpt_name          realise-checkpoint
>> interface          USERDEFINED
>> ckpt_command       NONE
>> migr_command       NONE
>> restart_command    NONE
>> clean_command      NONE
>> ckpt_dir           /tmp
>> signal             NONE
>> when               xsr
>>
>> It's called "Reschedule Job" (without the "On").
>
> This is the action, but the condition is the unknown state of the exechost.
>
>
>>>
>>>> Halfway through its run on a host, I hit Reschedule in qmon.
>>>>
>>>> It is removed from the running list and put on a different host. All fine.
>>>
>>> Where does the checkpointing environment come into play here when you
>>> have already rescheduled it by hand? I.e. you are doing the equivalent of
>>> "qmod -rj ...", if I understand you correctly, just in the GUI.
>>
>> Yes, just in the GUI.
>>
>> Regardless of the checkpoint, shouldn't SGE kill the task immediately on
>> the old host?
>
> After a delay: yes.

Is it possible to change the length of the delay?
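
To measure how long that delay really is, the test script could stamp every
line with the time and the host that wrote it. A minimal sketch, assuming a
POSIX shell; stamp_line and run_loop are names made up here, and JOB_ID is
the variable SGE exports inside a job (the "nojob" fallback is only for runs
outside SGE):

```shell
#!/bin/sh
# Sketch only: same loop as test.sh, but each line records time and host.
# stamp_line and run_loop are hypothetical helper names.

stamp_line() {
    # $1 = loop counter
    printf '%s %s job=%s ----%s----\n' \
        "$(date '+%H:%M:%S')" "$(uname -n)" "${JOB_ID:-nojob}" "$1"
}

run_loop() {
    # $1 = number of iterations, $2 = output file
    i=1
    while [ "$i" -le "$1" ]; do
        stamp_line "$i" >> "$2"
        sleep 1
        i=$((i + 1))
    done
}
```

Calling run_loop 100 "$path" in place of the for loop would leave the last
timestamp written on the old host in the file, which can then be compared
with the time Reschedule was pressed.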

>
> There was issue 1521, which was fixed in 6.2u3, so it shouldn't be there any
> longer.
>
> http://permalink.gmane.org/gmane.comp.clustering.gridengine.users/20388
>
> IMO the use of a checkpointing environment doesn't change anything here.
>
> -- Reuti
>
>
>>>>
>>>> I then look at the output of my random output paths.
>>>>
>>>> $ cat output.22.txt
>>>> ----1----
>>>> ----2----
>>>> ...
>>>> ----19----
>>>> ----20----
>>>> fluffy  production ~/sge-test
>>>> $ cat output.81.txt
>>>> ----1----
>>>> ----2----
>>>> ...
>>>> ----19----
>>>> ----20----
>>>>
>>>> It looks like SGE didn't kill the task on the first host.
>>>>
>>>> I did the same submission with 100 seconds and rescheduled it a few
>>>> seconds into the run; the first task stopped at around 70.
>>>
>>> Do you specify anything like -notify?
>>
>> no I do not.
>>
>>>
>>> -- Reuti
>>>
>>>
>>>> is this expected behaviour?
>>>>
>>>> Lars
>>>>
>>>> _______________________________________________
>>>> users mailing list
>>>> [email protected]
>>>> https://gridengine.org/mailman/listinfo/users
>>>
>>
>
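
Regarding -notify: as far as the qsub man page describes it, a job submitted
with -notify gets a SIGUSR2 shortly before the SIGKILL, and the queue's
notify value sets the gap between the two. Whether a manual reschedule goes
through that path here is untested. A minimal sketch of a handler that
records when the warning arrives, assuming a POSIX shell; NOTIFY_LOG and
on_notify are hypothetical names:

```shell
#!/bin/sh
# Sketch only: log the warning signal sent to jobs submitted with
# "qsub -notify". NOTIFY_LOG is a hypothetical variable, not an SGE one.
NOTIFY_LOG="${NOTIFY_LOG:-/tmp/notify.$$.log}"

on_notify() {
    # Record when and where the warning arrived.
    printf '%s %s: SIGUSR2 received, SIGKILL should follow\n' \
        "$(date '+%H:%M:%S')" "$(uname -n)" >> "$NOTIFY_LOG"
}

# The shell runs the handler once the current foreground command returns.
trap on_notify USR2
```

Placed at the top of test.sh, this would leave one line per warning in the
log file, next to the loop output.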

