According to the queue_conf man page, kill -9 should be sent by default, and that is what we tried first. The kill script below was an attempt to fix the problem. Neither approach works.

Bert

> -----Original Message-----
> From: Reuti [mailto:[email protected]]
> Sent: Wednesday, December 04, 2013 6:28 PM
> To: Wiegers, Bert
> Cc: [email protected]
> Subject: Re: [gridengine users] qlogin with ssh
>
> On 04.12.2013 at 17:47, Wiegers, Bert wrote:
>
> > our setup is
> >
> > sge_conf:
> > qlogin_command
> > /export/opt/SGE-8.1.6/utilbin/lx-amd64/qlogin_wrapper.sh
> >
> > cat /export/opt/SGE-8.1.6/utilbin/lx-amd64/qlogin_wrapper.sh
> > #!/bin/sh
> > HOST=$1
> > PORT=$2
> > /usr/bin/ssh -Y -p $PORT $HOST
> >
> >
> > queue_conf:
> > terminate_method
> > /export/opt/SGE-8.1.6/scripts/case_terminate_method.sh \
> > $job_pid $job_owner
>
> What was the motivation to have a custom method?
>
> The default is to send a kill to the complete process group, i.e. something
> like
>
> kill -9 -- -$1
>
> in your setup.
>
>
> > cat /export/opt/SGE-8.1.6/scripts/case_terminate_method.sh
> > #!/bin/bash
> >
> > if [ $# -ne 2 ] ; then
> >     echo "Usage:" $0 job_pid job_owner
> >     exit 1
> > fi
> >
> > job_pid=$1
> > job_owner=$2
> >
> > # try and kill the session group - the group leader is the shell
> > # executing the job script
> > pkill -s $job_pid
> > if [ $? -ne 0 ] ; then
> >     kill $job_pid
>
> AFAICS the sid can be different from the pid or pgrp. And even when they
> are the same: it's the sid of the sshd, not the shell.
>
> -- Reuti
>
>
> > fi
> >
> > # cleanup grace period
> > sleep 10
> > pkill -9 -s $job_pid
> > if [ $? -ne 0 ] ; then
> >     kill -9 $job_pid
> > fi
> >
> >
> > Bert
> >
> >
> >> -----Original Message-----
> >> From: Reuti [mailto:[email protected]]
> >> Sent: Wednesday, December 04, 2013 5:33 PM
> >> To: Wiegers, Bert
> >> Cc: [email protected]
> >> Subject: Re: [gridengine users] qlogin with ssh
> >>
> >> On 04.12.2013 at 17:19, Wiegers, Bert wrote:
> >>
> >>> Hi *,
> >>>
> >>> we are using a qlogin wrapper script, as mentioned below.
> >>> It looks like this setup prevents SGE from reaching the
> >>> terminate_method.
> >>
> >> You defined a custom "terminate_method"? Can you please post it?
> >>
> >> -- Reuti
> >>
> >>
> >>> Bert
> >>>
> >>>> -----Original Message-----
> >>>> From: [email protected] [mailto:[email protected]] On Behalf Of Wiegers, Bert
> >>>> Sent: Tuesday, December 03, 2013 9:01 AM
> >>>> To: [email protected]
> >>>> Subject: Re: [gridengine users] qlogin with ssh
> >>>>
> >>>> Hi Reuti,
> >>>>
> >>>> The process tree looks like this:
> >>>> root   20939  0.0 0.0 1242552 5892 ?     Sl  Nov14 18:57 /export/opt/SGE-8.1.6/bin/lx-amd64/sge_execd
> >>>> root   33874 99.7 0.0   34164 2828 ?     R   08:47  0:22  \_ sge_shepherd-18003 -bg
> >>>> root   33882  0.0 0.0   98156 3836 pts/1 Ss+ 08:47  0:00      \_ sshd: xxxxxx [priv]
> >>>> xxxxxx 33884  0.0 0.0   98156 2044 pts/1 S+  08:47  0:00          \_ sshd: xxxxxx@pts/2
> >>>> xxxxxx 33885  1.1 0.0   14556 3260 pts/2 SNs 08:47  0:00              \_ -tcsh
> >>>> It stays the same as long as I am logged on to the node.
> >>>>
> >>>> The job is still listed in qstat.
> >>>>
> >>>> In the messages of the scheduler I find these hints:
> >>>> 12/03/2013 08:52:31|schedu|service0|W|job 18003.1 should have finished since 90s
> >>>>
> >>>> When I log out afterwards I see in the messages:
> >>>> 12/03/2013 08:58:42|worker|service0|I|removing trigger to terminate job 18003.1
> >>>> 12/03/2013 08:58:42|worker|service0|W|job 18003.1 failed on host XY qmaster enforced h_rt, h_cpu, or h_vmem limit because: <unknown reason>
> >>>>
> >>>> Bert
> >>>>
> >>>>
> >>>>> -----Original Message-----
> >>>>> From: Reuti [mailto:[email protected]]
> >>>>> Sent: Monday, December 02, 2013 6:43 PM
> >>>>> To: Wiegers, Bert
> >>>>> Cc: [email protected]
> >>>>> Subject: Re: [gridengine users] qlogin with ssh
> >>>>>
> >>>>> Hi,
> >>>>>
> >>>>> On 02.12.2013 at 18:28, Wiegers, Bert wrote:
> >>>>>
> >>>>>> we are running the SGE 8.1.6.
> >>>>>> We have configured some interactive queues and use qlogin with the
> >>>>>> wrapper-script (... /usr/bin/ssh -Y -p $PORT $HOST).
> >>>>>> In our setup the user is forced to use the h_rt variable.
> >>>>>> Unfortunately qlogin does not care if the walltime is overdue.
> >>>>>> The shepherd seems to be unable to kill the qlogin sessions when the
> >>>>>> user is still connected to the node.
> >>>>>> Does anyone have a solution or a workaround for this?
> >>>>>
> >>>>> Is the `sshd` a child of the `shepherd`, i.e. something like:
> >>>>>
> >>>>> $ ps -e f
> >>>>> ...
> >>>>>  6656 ?        Sl    56:23 /usr/sge/bin/lx24-x86/sge_execd
> >>>>>  9391 ?        S      0:00  \_ sge_shepherd-10502 -bg
> >>>>>  9392 ?        Ss     0:00      \_ sshd: reuti [priv]
> >>>>>  9398 ?        S      0:00          \_ sshd: reuti@pts/2
> >>>>>  9405 pts/2    Ss     0:00              \_ -bash
> >>>>>
> >>>>> What does the process tree look like after "h_rt" expired - did the job
> >>>>> vanish from the `qstat` too?
> >>>>>
> >>>>> -- Reuti
> >>>>
> >>>> _______________________________________________
> >>>> users mailing list
> >>>> [email protected]
> >>>> https://gridengine.org/mailman/listinfo/users
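
Reuti's remark in the thread is that the session ID passed to `pkill -s` belongs to the sshd and not to the login shell, so matching on the sid (or on the process group) can miss the user's shell. One possible alternative is to walk the process tree below $job_pid and signal every process found there. The following is a minimal, untested sketch of that idea, not a confirmed fix for this setup. It assumes that pgrep from procps is installed on the execution host and that $job_pid is the top of the tree to be removed; the helper name `descendants` is made up for illustration. Whether the terminate_method is allowed to signal the root-owned `sshd: ... [priv]` process is not settled by the thread.

```shell
#!/bin/bash
# Sketch only: kill everything below a given PID by walking the process
# tree, instead of matching on a session ID or process group.
# Assumes pgrep (procps) is installed; "descendants" is a made-up helper name.

# Print the PID of every direct and indirect child of PID $1.
descendants() {
    local child
    for child in $(pgrep -P "$1"); do
        echo "$child"
        descendants "$child"
    done
}

# Act only when a PID is passed, e.g. terminate_method <this script> $job_pid
if [ "$#" -ge 1 ]; then
    # Collect the list once: after the parent dies its children are
    # re-parented and can no longer be found through it.
    pids="$1 $(descendants "$1")"
    kill -TERM $pids 2>/dev/null
    sleep 10                      # grace period, as in the script in the thread
    kill -KILL $pids 2>/dev/null
    exit 0
fi
```

To check on a node whether the shell really runs in a different session than the sshd, `ps -o pid,ppid,pgid,sid,tty,args -u <user>` prints the parent, process group and session ID side by side. Because the PID list is collected before the grace period, a PID that is reused within those 10 seconds would also be signalled; that risk is small but not zero.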
