Hi,
On 23.03.2012 at 10:46, Lars van der bijl wrote:
> Hey everyone,
>
> I have a small script.
>
> #!/bin/bash
>
> echo "start"
>
> NUMBER=$(( ( RANDOM % 100 ) + 1 ))
>
> path="/production/people/lars/sge-test/output.$NUMBER.txt"
>
> for I in {1..20}; do
> echo "----$I----" >> $path;
> sleep 1;
> done
> echo "end"
>
> I submit it to sge (6.2u5)
>
> qsub -r y -ckpt realise-checkpoint -q test.q@@atoms -o
> /production/people/lars/sge-test -e /production/people/lars/sge-test
> /tmp/test.sh
>
> the checkpoint is pretty default.
>
> On shutdown of execd
> On Job Suspend
> On Reschedule Job
There is no setting "On Reschedule Job". What is your ckpt definition in detail? Please post the output of:
qconf -sckpt ...
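In case it helps, here is a small sketch of how to dump the definition. It assumes the environment is the one named on your qsub line (realise-checkpoint) and that qconf is in the PATH of the host you run it on; the function name show_ckpt is only made up for this example.

```shell
#!/bin/bash
# Sketch: print the checkpoint environments known to the cluster.
# show_ckpt is a hypothetical helper name, not part of SGE.
show_ckpt() {
    local name="$1"
    if ! command -v qconf >/dev/null 2>&1; then
        echo "qconf not found in PATH" >&2
        return 1
    fi
    qconf -sckptl            # names of all checkpoint environments
    qconf -sckpt "$name"     # full definition of the named environment
}

# Usage (name taken from the qsub line above):
#   show_ckpt realise-checkpoint
```

The "when" line of that output is the interesting part for this question.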
> half way through it running on a host i hit reschedule in qmon
>
> It removes it from the running list. puts it on a different host. all fine.
Where does the checkpointing environment come into play here, when you have
already rescheduled the job by hand? If I understand you correctly, you are
doing the equivalent of "qmod -rj ...", just in the GUI.
> I then look at the output of my random output paths.
>
> $ cat output.22.txt
> ----1----
> ----2----
> ...
> ----19----
> ----20----
> fluffy production ~/sge-test
> $ cat output.81.txt
> ----1----
> ----2----
> ...
> ----19----
> ----20----
>
> looks like sge didn't kill the task on the first host.
>
> i do the same submission with 100 seconds. reschedule it a few seconds
> into the task running and the first task will stop around 70.
Do you specify anything like -notify?
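For background: with -notify the job gets SIGUSR1 some seconds before a pending SIGSTOP and SIGUSR2 some seconds before a pending SIGKILL; the delay is the notify value in the queue configuration. Below is a minimal sketch of how a test script like yours could react to that warning, assuming it is submitted with -notify. The names stop_requested, on_notify and write_lines are invented for illustration, and whether this matches what happens in your setup is something I can't tell yet.

```shell
#!/bin/bash
# Sketch: stop the loop cleanly when the SIGUSR2 warning arrives.
# All names below are hypothetical; adapt them to the real job script.

stop_requested=0

# Handler for the warning signal that precedes SIGKILL under -notify.
on_notify() {
    stop_requested=1
}
trap on_notify USR2

# write_lines OUTFILE COUNT [DELAY]
# Appends COUNT numbered lines to OUTFILE, pausing DELAY seconds between
# them, and leaves early once the handler has set stop_requested.
write_lines() {
    local out="$1" count="$2" delay="${3:-1}" i
    for ((i = 1; i <= count; i++)); do
        if [ "$stop_requested" -eq 1 ]; then
            echo "stopped before line $i" >> "$out"
            return 0
        fi
        echo "----$i----" >> "$out"
        sleep "$delay"
    done
}

# Usage inside the job script, e.g.:
#   write_lines "/production/people/lars/sge-test/output.$NUMBER.txt" 20
```

Note that bash runs the trap only after the current foreground command (here the sleep) has returned, so the reaction can lag by up to one DELAY.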
-- Reuti
> is this expected behaviour?
>
> Lars
>
> _______________________________________________
> users mailing list
> [email protected]
> https://gridengine.org/mailman/listinfo/users