On 23.03.2012 at 11:55, Lars van der bijl wrote:

> On 23 March 2012 11:46, Reuti <[email protected]> wrote:
>> Hi,
>>
>> On 23.03.2012 at 10:46, Lars van der bijl wrote:
>>
>>> Hey everyone,
>>>
>>> I have a small script.
>>>
>>> #!/bin/bash
>>>
>>> echo "start"
>>>
>>> NUMBER=$[ ( $RANDOM % 100 ) + 1 ]
>>>
>>> path="/production/people/lars/sge-test/output.$NUMBER.txt"
>>>
>>> for I in {1..20}; do
>>> echo "----$I----" >> $path;
>>> sleep 1;
>>> done
>>> echo "end"
>>>
>>> I submit it to sge (6.2u5)
>>>
>>> qsub -r y -ckpt realise-checkpoint -q test.q@@atoms -o
>>> /production/people/lars/sge-test -e /production/people/lars/sge-test
>>> /tmp/test.sh
>>>
>>> the checkpoint is pretty default.
>>>
>>> On shutdown of execd
>>> On Job Suspend
>>> On Reschedule Job
>>
>> There is no setting "On Reschedule Job", what is your ckpt definition in
>> detail qconf -sckpt ...
>
> ckpt_name realise-checkpoint
> interface USERDEFINED
> ckpt_command NONE
> migr_command NONE
> restart_command NONE
> clean_command NONE
> ckpt_dir /tmp
> signal NONE
> when xsr
>
> It's called Reschedule Job (- the On)
This is the action, but the condition is the unknown state of the exechost.

>>
>>> half way through it running on a host i hit reschedule in qmon
>>>
>>> It removes it from the running list. puts it on a different host. all fine.
>>
>> Where does the checkpointing environment come into play here when you
>> rescheduled it by hand already? I.e. you are doing like "qmod -rj ..." if I
>> get you right, just in the GUI.
>
> yes just in the GUI.
>
> regardless of a checkpoint. shouldn't sge kill the task immediately on
> the old host?

After a delay: yes. There was issue 1521 which was fixed in 6.2u3, so it shouldn't be there any longer.

http://permalink.gmane.org/gmane.comp.clustering.gridengine.users/20388

IMO the use of a checkpointing environment doesn't change anything here.

-- Reuti

>>>
>>> I then look at the output of my random output paths.
>>>
>>> $ cat output.22.txt
>>> ----1----
>>> ----2----
>>> ...
>>> ----19----
>>> ----20----
>>> fluffy production ~/sge-test
>>> $ cat output.81.txt
>>> ----1----
>>> ----2----
>>> ...
>>> ----19----
>>> ----20----
>>>
>>> looks like sge didn't kill the task on the first host.
>>>
>>> i do the same submission with 100 seconds. reschedule it a few seconds
>>> into the task running and the first task will stop around 70.
>>
>> Do you specify anything like -notify?
>
> no I do not.
>
>>
>> -- Reuti
>>
>>
>>> is this expected behaviour?
>>>
>>> Lars
>>>
>>> _______________________________________________
>>> users mailing list
>>> [email protected]
>>> https://gridengine.org/mailman/listinfo/users
>>
>
_______________________________________________
users mailing list
[email protected]
https://gridengine.org/mailman/listinfo/users
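
For anyone reproducing this: with two randomly numbered output files it is hard to see which host wrote which lines, or when the first instance actually stopped. One possible approach, only a sketch and not something taken from the thread, is to tag every line with host name, PID and a timestamp. The names OUTDIR, COUNT, DELAY, tag_line and run_loop below are made up for illustration, and the /tmp default is an assumption.

```shell
#!/bin/bash
# Sketch only: a variant of the test script from this thread that tags every
# line with host, PID and time. OUTDIR, COUNT, DELAY, tag_line and run_loop
# are hypothetical names, not part of the original script or of SGE.

OUTDIR="${OUTDIR:-/tmp}"   # where the output file goes (assumed default)
COUNT="${COUNT:-20}"       # number of lines to write
DELAY="${DELAY:-1}"        # seconds to sleep between lines

# Print one progress line prefixed with host name, PID and epoch seconds.
tag_line() {
    printf '%s pid=%s t=%s ----%s----\n' "$(uname -n)" "$$" "$(date +%s)" "$1"
}

# Append COUNT tagged lines to a per-process file, then print the file name.
run_loop() {
    local path="$OUTDIR/output.$$.txt"
    local i
    for ((i = 1; i <= COUNT; i++)); do
        tag_line "$i" >> "$path"
        sleep "$DELAY"
    done
    echo "$path"
}

# In a real job script, the last line would simply call:
#   run_loop
```

Comparing the t= values of the last line written on the old host with the first line written on the new host should show how long the first instance kept running after the reschedule.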
