Hi,

On 23.03.2012 at 10:46, Lars van der bijl wrote:

> Hey everyone,
> 
> I have a small script.
> 
> #!/bin/bash
> 
> echo "start"
> 
> NUMBER=$[ ( $RANDOM % 100 )  + 1 ]
> 
> path="/production/people/lars/sge-test/output.$NUMBER.txt"
> 
> for I in {1..20}; do
>  echo "----$I----" >> $path;
>  sleep 1;
> done
> echo "end"
> 
> I submit it to sge (6.2u5)
> 
> qsub -r y -ckpt realise-checkpoint -q test.q@@atoms \
>   -o /production/people/lars/sge-test -e /production/people/lars/sge-test \
>   /tmp/test.sh
> 
> the checkpoint is pretty default.
> 
> On shutdown of execd
> On Job Suspend
> On Reschedule Job

There is no setting "On Reschedule Job". What is your ckpt definition in detail?
qconf -sckpt ...

> Halfway through its run on a host, I hit reschedule in qmon.
> 
> It removes it from the running list and puts it on a different host. All fine.

Where does the checkpointing environment come into play here when you have 
already rescheduled it by hand? I.e., if I get you right, you are doing the 
equivalent of "qmod -rj ...", just in the GUI.


> I then look at the output of my random output paths.
> 
> $ cat output.22.txt
> ----1----
> ----2----
> ...
> ----19----
> ----20----
> fluffy production ~/sge-test
> $ cat output.81.txt
> ----1----
> ----2----
> ...
> ----19----
> ----20----
> 
> It looks like SGE didn't kill the task on the first host.
> 
> If I do the same submission with 100 seconds and reschedule it a few seconds
> into the run, the first task stops at around 70.

Do you specify anything like -notify?
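
For reference: with -notify the job receives SIGUSR1 some seconds before a SIGSTOP and SIGUSR2 some seconds before a SIGKILL, so a script can trap them and stop writing on its own. A minimal sketch of that idea, not a tested fix for your setup. The names write_steps and stop_requested are made up for this example, and the loop only mirrors the one in your test.sh:

```shell
# Sketch only: assumes the job is submitted with -notify, so that the
# warning signals SIGUSR1/SIGUSR2 arrive before SIGSTOP/SIGKILL.
stop_requested=0
trap 'stop_requested=1' USR1 USR2

# write_steps OUTFILE COUNT [DELAY]  (hypothetical helper)
# Appends one line per step and stops early once a warning signal was seen.
write_steps() {
    ws_out=$1
    ws_count=$2
    ws_delay=${3:-1}
    ws_i=1
    while [ "$ws_i" -le "$ws_count" ]; do
        if [ "$stop_requested" -eq 1 ]; then
            # Leave a marker so the output shows where this run gave up.
            echo "stopped before step $ws_i" >> "$ws_out"
            return 1
        fi
        echo "----$ws_i----" >> "$ws_out"
        sleep "$ws_delay"
        ws_i=$((ws_i + 1))
    done
    return 0
}
```

In the job script this would be called as something like write_steps "$path" 20, with -notify added to the qsub line. The shell runs the trap only after the current sleep returns, so the reaction can lag by up to one step.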

-- Reuti


> Is this expected behaviour?
> 
> Lars
> 
> _______________________________________________
> users mailing list
> [email protected]
> https://gridengine.org/mailman/listinfo/users

