On 23.03.2012 at 13:18, Lars van der bijl wrote:

> On 23 March 2012 13:03, Reuti <[email protected]> wrote:
>> On 23.03.2012 at 11:55, Lars van der bijl wrote:
>> 
>>> On 23 March 2012 11:46, Reuti <[email protected]> wrote:
>>>> Hi,
>>>> 
>>>> On 23.03.2012 at 10:46, Lars van der bijl wrote:
>>>> 
>>>>> Hey everyone,
>>>>> 
>>>>> I have a small script.
>>>>> 
>>>>> #!/bin/bash
>>>>> 
>>>>> echo "start"
>>>>> 
>>>>> NUMBER=$(( (RANDOM % 100) + 1 ))
>>>>> 
>>>>> path="/production/people/lars/sge-test/output.$NUMBER.txt"
>>>>> 
>>>>> for I in {1..20}; do
>>>>>  echo "----$I----" >> "$path";
>>>>>  sleep 1;
>>>>> done
>>>>> echo "end"
>>>>> 
>>>>> I submit it to sge (6.2u5)
>>>>> 
>>>>> qsub -r y -ckpt realise-checkpoint -q test.q@@atoms \
>>>>>   -o /production/people/lars/sge-test -e /production/people/lars/sge-test \
>>>>>   /tmp/test.sh
>>>>> 
>>>>> the checkpoint is pretty default.
>>>>> 
>>>>> On shutdown of execd
>>>>> On Job Suspend
>>>>> On Reschedule Job
>>>> 
>>>> There is no setting "On Reschedule Job". What is your ckpt definition in 
>>>> detail (qconf -sckpt ...)?
>>> 
>>> ckpt_name          realise-checkpoint
>>> interface          USERDEFINED
>>> ckpt_command       NONE
>>> migr_command       NONE
>>> restart_command    NONE
>>> clean_command      NONE
>>> ckpt_dir           /tmp
>>> signal             NONE
>>> when               xsr
>>> 
>>> It's called Reschedule Job (- the On)
>> 
>> This is the action, but the condition is the unknown state of the exechost.
>> 
>> 
>>>> 
>>>>> half way through it running on a host i hit reschedule in qmon
>>>>> 
>>>>> It removes it from the running list. puts it on a different host. all 
>>>>> fine.
>>>> 
>>>> Where does the checkpointing environment come into play here when you 
>>>> rescheduled it by hand already? I.e. you are doing the equivalent of 
>>>> "qmod -rj ..." if I get you right, just in the GUI.
>>> 
>>> yes just in the GUI.
>>> 
>>> regardless of a checkpoint. shouldn't sge kill the task immediately on
>>> the old host?
>> 
>> After a delay: yes.
> 
> is it possible to change the time of the delay?

Not that I'm aware of. You can check in the jobscript whether any input/output 
file you use is still open with the commands `lsof` or `fuser` and wait until 
it has been closed by the other process. This could also detect if the old 
process is for any reason not vanishing at all.
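
A minimal sketch of such a check, assuming `fuser` (from psmisc) is installed on 
the execution hosts. The function name `wait_until_closed` and its retry 
parameters are made up for illustration and are not part of SGE:

```shell
#!/bin/bash

# Hypothetical helper: block until no process holds FILE open.
# Usage: wait_until_closed FILE [MAX_TRIES] [INTERVAL_SECONDS]
# Returns 0 once the file is free, 1 if it is still open after
# MAX_TRIES checks, 2 if fuser is not available.
wait_until_closed() {
    local file=$1 max_tries=${2:-60} interval=${3:-1}
    local tries=0

    command -v fuser >/dev/null 2>&1 || {
        echo "fuser not found" >&2
        return 2
    }

    # fuser exits with status 0 while at least one process accesses the file.
    while fuser -s "$file" 2>/dev/null; do
        tries=$((tries + 1))
        if [ "$tries" -ge "$max_tries" ]; then
            return 1    # old process did not vanish in time
        fi
        sleep "$interval"
    done
    return 0
}

# Possible use near the top of the jobscript, before writing to "$path":
#   wait_until_closed "$path" 120 1 || { echo "file still in use" >&2; exit 1; }
```

As far as I know, `fuser` and `lsof` only report processes on the host where 
they run, so this would catch a leftover process only if the job lands on the 
same host again or the check is run on the old host.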

-- Reuti


>> There was issue 1521 which was fixed in 6.2u3, so it shouldn't be there any 
>> longer.
>> 
>> http://permalink.gmane.org/gmane.comp.clustering.gridengine.users/20388
>> 
>> IMO the use of a checkpointing environment doesn't change anything here.
>> 
>> -- Reuti
>> 
>> 
>>>>> 
>>>>> I then look at the output of my random output paths.
>>>>> 
>>>>> $ cat output.22.txt
>>>>> ----1----
>>>>> ----2----
>>>>> ...
>>>>> ----19----
>>>>> ----20----
>>>>> fluffy  production ~/sge-test
>>>>> $ cat output.81.txt
>>>>> ----1----
>>>>> ----2----
>>>>> ...
>>>>> ----19----
>>>>> ----20----
>>>>> 
>>>>> looks like sge didn't kill the task on the first host.
>>>>> 
>>>>> If I do the same submission with a 100-second loop and reschedule it a few
>>>>> seconds into the run, the first task stops at around 70.
>>>> 
>>>> Do you specify anything like -notify?
>>> 
>>> no I do not.
>>> 
>>>> 
>>>> -- Reuti
>>>> 
>>>> 
>>>>> is this expected behaviour?
>>>>> 
>>>>> Lars
>>>>> 
>>>>> _______________________________________________
>>>>> users mailing list
>>>>> [email protected]
>>>>> https://gridengine.org/mailman/listinfo/users
>>>> 
>>> 
>> 
> 

