On 23.03.2012 at 13:18, Lars van der bijl wrote:
> On 23 March 2012 13:03, Reuti <[email protected]> wrote:
>> On 23.03.2012 at 11:55, Lars van der bijl wrote:
>>
>>> On 23 March 2012 11:46, Reuti <[email protected]> wrote:
>>>> Hi,
>>>>
>>>> On 23.03.2012 at 10:46, Lars van der bijl wrote:
>>>>
>>>>> Hey everyone,
>>>>>
>>>>> I have a small script.
>>>>>
>>>>> #!/bin/bash
>>>>>
>>>>> echo "start"
>>>>>
>>>>> NUMBER=$[ ( $RANDOM % 100 ) + 1 ]
>>>>>
>>>>> path="/production/people/lars/sge-test/output.$NUMBER.txt"
>>>>>
>>>>> for I in {1..20}; do
>>>>> echo "----$I----" >> $path;
>>>>> sleep 1;
>>>>> done
>>>>> echo "end"
>>>>>
>>>>> I submit it to sge (6.2u5)
>>>>>
>>>>> qsub -r y -ckpt realise-checkpoint -q test.q@@atoms -o
>>>>> /production/people/lars/sge-test -e /production/people/lars/sge-test
>>>>> /tmp/test.sh
>>>>>
>>>>> the checkpoint is pretty default.
>>>>>
>>>>> On shutdown of execd
>>>>> On Job Suspend
>>>>> On Reschedule Job
>>>>
>>>> There is no setting "On Reschedule Job", what is your ckpt definition in
>>>> detail qconf -sckpt ...
>>>
>>> ckpt_name        realise-checkpoint
>>> interface        USERDEFINED
>>> ckpt_command     NONE
>>> migr_command     NONE
>>> restart_command  NONE
>>> clean_command    NONE
>>> ckpt_dir         /tmp
>>> signal           NONE
>>> when             xsr
>>>
>>> It's called Reschedule Job (- the On)
>>
>> This is the action, but the condition is the unknown state of the exechost.
>>
>>
>>>>
>>>>> half way through it running on a host i hit reschedule in qmon
>>>>>
>>>>> It removes it from the running list. puts it on a different host. all
>>>>> fine.
>>>>
>>>> Where does the checkpointing environment come into play here when you
>>>> rescheduled it by hand already? I.e. you are doing like "qmod -rj ..." if
>>>> I get you right, just in the GUI.
>>>
>>> yes just in the GUI.
>>>
>>> regardless of a checkpoint. shouldn't sge kill the task immediately on
>>> the old host?
>>
>> After a delay: yes.
>
> is it possible to change the time of the delay?

Not that I'm aware of. You can check in the job script whether any input/output file you use is still open, using `lsof` or `fuser`, and wait until the other process has closed it. This would also detect an old process that, for whatever reason, never goes away.

-- Reuti

>> There was issue 1521 which was fixed in 6.2u3, so it shouldn't be there any
>> longer.
>>
>> http://permalink.gmane.org/gmane.comp.clustering.gridengine.users/20388
>>
>> IMO the use of a checkpointing environment doesn't change anything here.
>>
>> -- Reuti
>>
>>
>>>>>
>>>>> I then look at the output of my random output paths.
>>>>>
>>>>> $ cat output.22.txt
>>>>> ----1----
>>>>> ----2----
>>>>> ...
>>>>> ----19----
>>>>> ----20----
>>>>> fluffy production ~/sge-test
>>>>> $ cat output.81.txt
>>>>> ----1----
>>>>> ----2----
>>>>> ...
>>>>> ----19----
>>>>> ----20----
>>>>>
>>>>> looks like sge didn't kill the task on the first host.
>>>>>
>>>>> i do the same submission with 100 seconds. reschedule it a few seconds
>>>>> into the task running and the first task will stop around 70.
>>>>
>>>> Do you specify anything like -notify?
>>>
>>> no I do not.
>>>
>>>>
>>>> -- Reuti
>>>>
>>>>
>>>>> is this expected behaviour?
>>>>>
>>>>> Lars
>>>>>
>>>>> _______________________________________________
>>>>> users mailing list
>>>>> [email protected]
>>>>> https://gridengine.org/mailman/listinfo/users
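
For illustration, the `fuser` check suggested in the reply could look roughly like the sketch below. It is only a sketch under a few assumptions: `wait_until_closed` is a made-up helper name (not something SGE provides), the `-s` (silent) flag is the one from the psmisc version of `fuser`, and the timeout values are arbitrary. Note also that `fuser` only sees processes on the host where it runs, so this can only catch a leftover process when the restarted job lands on the same host as the old one.

```bash
#!/bin/bash
# Sketch: block until no process on this host holds FILE open,
# or give up after TIMEOUT seconds.
# wait_until_closed is a hypothetical helper, not part of SGE.
wait_until_closed() {
    local file=$1 timeout=${2:-60} waited=0

    # Nothing to wait for if the file does not exist yet.
    [ -e "$file" ] || return 0

    # Assumes psmisc fuser: -s prints nothing, and exit status 0
    # means at least one process has the file open.
    while fuser -s "$file" 2>/dev/null; do
        if [ "$waited" -ge "$timeout" ]; then
            echo "still open after ${timeout}s: $file" >&2
            return 1
        fi
        sleep 1
        waited=$((waited + 1))
    done
    return 0
}

# Possible use in the job script, before the first write to the file:
#   wait_until_closed "$path" 120 || exit 1
```

A non-zero return after the timeout is also the case Reuti mentions where the old process is not vanishing at all; what the job should do then (exit, retry, write elsewhere) is left open here.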
