According to the queue_conf man page, kill -9 should be sent by default
(we tried this first).
The kill script below was an attempt to fix the problem.
Neither works.

Bert



> -----Original Message-----
> From: Reuti [mailto:[email protected]]
> Sent: Wednesday, December 04, 2013 6:28 PM
> To: Wiegers, Bert
> Cc: [email protected]
> Subject: Re: [gridengine users] qlogin with ssh
> 
> Am 04.12.2013 um 17:47 schrieb Wiegers, Bert:
> 
> > our setup is
> >
> > sge_conf:
> > qlogin_command        /export/opt/SGE-8.1.6/utilbin/lx-amd64/qlogin_wrapper.sh
> >
> > cat /export/opt/SGE-8.1.6/utilbin/lx-amd64/qlogin_wrapper.sh
> > #!/bin/sh
> > HOST=$1
> > PORT=$2
> > /usr/bin/ssh -Y -p $PORT $HOST
> >
> >
> > queue_conf:
> > terminate_method      
> > /export/opt/SGE-8.1.6/scripts/case_terminate_method.sh \
> >                      $job_pid $job_owner
> 
> What was the motivation to have a custom method?
> 
> The default is to send a kill to the complete process group, i.e. something 
> like
> 
> kill -9 -- -$1
> 
> in your setup.
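
To illustrate the process-group form of kill mentioned above, here is a
minimal sketch, assuming a POSIX shell. The helper name kill_group is made
up for this example and is not part of SGE; "-s KILL" is the POSIX spelling
of "-9".

```shell
#!/bin/sh
# Sketch only: signal every member of one process group.
# kill_group is a hypothetical helper name, not an SGE command.
kill_group() {
    # $1 = process group ID.
    # The leading "-" makes kill address the whole group instead of a single
    # PID; "--" ends option parsing so the negative number is not read as
    # an option.
    kill -s KILL -- "-$1"
}
```

Called as kill_group "$job_pid", this only reaches processes that still
share that process group ID. Anything that has moved to its own process
group or session is not touched by it.
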
> 
> 
> > cat /export/opt/SGE-8.1.6/scripts/case_terminate_method.sh
> > #!/bin/bash
> >
> > if [ $# -ne 2 ] ; then
> >  echo "Usage:" $0 job_pid job_owner
> >  exit 1
> > fi
> >
> > job_pid=$1
> > job_owner=$2
> >
> > # try and kill the session group - the group leader is the shell
> > # executing the job script
> > pkill -s $job_pid
> > if [ $? -ne 0 ] ; then
> >        kill $job_pid
> 
> AFAICS the sid can be different from the pid or pgrp. And even when they 
> are the same: it's the sid of the sshd, not the shell.
> 
> -- Reuti
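
To check whether sid, pgrp and pid actually coincide for a given job, one
could print them side by side. This is a sketch assuming the procps ps found
on Linux (the "sid" output keyword is not in POSIX ps); show_ids is a
hypothetical helper name.

```shell
#!/bin/sh
# Sketch only: print PID, parent PID, process group ID, session ID and
# command name for each PID given as an argument.
# Assumes procps ps; show_ids is a made-up name for this example.
show_ids() {
    for _pid in "$@"; do
        # An empty header ("=") for every column suppresses the header line.
        ps -o pid= -o ppid= -o pgid= -o sid= -o comm= -p "$_pid"
    done
}

# Example call with the PIDs of the sshd and shell processes of a job:
#   show_ids 33882 33884 33885
```

If the sid column of the login shell differs from the value passed as
$job_pid, "pkill -s $job_pid" will not match that shell.
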
> 
> 
> > fi
> >
> > # cleanup grace period
> > sleep 10
> > pkill -9 -s $job_pid
> > if [ $? -ne 0 ] ; then
> >        kill -9 $job_pid
> > fi
> >
> >
> >
> > Bert
> >
> >
> >> -----Original Message-----
> >> From: Reuti [mailto:[email protected]]
> >> Sent: Wednesday, December 04, 2013 5:33 PM
> >> To: Wiegers, Bert
> >> Cc: [email protected]
> >> Subject: Re: [gridengine users] qlogin with ssh
> >>
> >> Am 04.12.2013 um 17:19 schrieb Wiegers, Bert:
> >>
> >>> Hi *,
> >>>
> >>> we are using a qlogin wrapper script, as mentioned below.
> >>> It looks like this setup prevents SGE from reaching the 
> >>> terminate_method.
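
One way to check whether the terminate_method is invoked at all would be to
have it record each call before it sends any signal. The following is only a
sketch: the log path and the name log_call are assumptions, and it presumes
the method is started with $job_pid and $job_owner as its two arguments.

```shell
#!/bin/sh
# Sketch only: debugging aid for a custom terminate_method.
# TERMINATE_LOG and log_call are hypothetical names, not SGE settings.
TERMINATE_LOG=${TERMINATE_LOG:-/tmp/terminate_method.log}

log_call() {
    # $1 = job_pid, $2 = job_owner; append one timestamped line per call.
    printf '%s terminate_method called: job_pid=%s job_owner=%s\n' \
        "$(date '+%Y-%m-%d %H:%M:%S')" "$1" "$2" >> "$TERMINATE_LOG"
}

# Inside the real terminate_method this would be the first statement:
#   log_call "$1" "$2"
```

If the log stays empty after h_rt has expired, the script was never started,
and the problem lies before the kill logic rather than in it.
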
> >>
> >> You defined a custom "terminate_method"? Can you please post it?
> >>
> >> -- Reuti
> >>
> >>
> >>> Bert
> >>>
> >>>> -----Original Message-----
> >>>> From: [email protected] [mailto:[email protected]] On Behalf Of Wiegers, Bert
> >>>> Sent: Tuesday, December 03, 2013 9:01 AM
> >>>> To: [email protected]
> >>>> Subject: Re: [gridengine users] qlogin with ssh
> >>>>
> >>>> Hi Reuti,
> >>>>
> >>>> The process tree looks like this:
> >>>> root     20939  0.0  0.0 1242552 5892 ?        Sl   Nov14  18:57 /export/opt/SGE-8.1.6/bin/lx-amd64/sge_execd
> >>>> root     33874 99.7  0.0  34164  2828 ?        R    08:47   0:22  \_ sge_shepherd-18003 -bg
> >>>> root     33882  0.0  0.0  98156  3836 pts/1    Ss+  08:47   0:00      \_ sshd: xxxxxx [priv]
> >>>> xxxxxx   33884  0.0  0.0  98156  2044 pts/1    S+   08:47   0:00          \_ sshd: xxxxxx@pts/2
> >>>> xxxxxx   33885  1.1  0.0  14556  3260 pts/2    SNs  08:47   0:00              \_ -tcsh
> >>>> it stays the same as long as I am logged on to the node.
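
In the tree above both the privileged sshd and the tcsh carry the "s" state
flag, i.e. each is a session leader, so they do not share one session ID. To
list everything below a given process regardless of session or process
group, one could follow the parent/child links instead. This is a sketch
assuming pgrep from procps; descendants is a hypothetical helper name.

```shell
#!/bin/sh
# Sketch only: print the PIDs of all descendants of the process given as $1,
# one per line, parents before their own children.
# Assumes pgrep from procps; descendants is a made-up name for this example.
descendants() {
    # pgrep -P lists the direct children of the given parent PID.
    for _child in $(pgrep -P "$1"); do
        echo "$_child"
        descendants "$_child"
    done
}

# Example: list everything below the shepherd of the job shown above:
#   descendants 33874
```

Whether signalling the PIDs found this way is a suitable terminate_method
for qlogin sessions has not been tested here.
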
> >>>>
> >>>> The Job is still listed in qstat.
> >>>>
> >>>> In the messages of the scheduler I find these hints:
> >>>> 12/03/2013 08:52:31|schedu|service0|W|job 18003.1 should have finished since 90s
> >>>>
> >>>> When I log out afterwards I see in the messages:
> >>>> 12/03/2013 08:58:42|worker|service0|I|removing trigger to terminate job 18003.1
> >>>> 12/03/2013 08:58:42|worker|service0|W|job 18003.1 failed on host XY qmaster enforced h_rt, h_cpu, or h_vmem limit because: <unknown reason>
> >>>>
> >>>> Bert
> >>>>
> >>>>
> >>>>
> >>>>> -----Original Message-----
> >>>>> From: Reuti [mailto:[email protected]]
> >>>>> Sent: Monday, December 02, 2013 6:43 PM
> >>>>> To: Wiegers, Bert
> >>>>> Cc: [email protected]
> >>>>> Subject: Re: [gridengine users] qlogin with ssh
> >>>>>
> >>>>> Hi,
> >>>>>
> >>>>> Am 02.12.2013 um 18:28 schrieb Wiegers, Bert:
> >>>>>
> >>>>>> we are running the SGE 8.1.6.
> >>>>>> We have configured some interactive queues and use qlogin with the
> >>>>>> wrapper-script  (... /usr/bin/ssh -Y -p $PORT $HOST).
> >>>>>> In our setup the user is forced to use the  h_rt variable.
> >>>>>> Unfortunately, qlogin does not care if the walltime is overdue.
> >>>>>> The shepherd seems to be unable to kill the qlogin sessions while the
> >>>>>> user is still connected to the node.
> >>>>>> Does anyone have a solution or a workaround for this?
> >>>>>
> >>>>> Is the `sshd` a child of the `shepherd`, i.e. something like:
> >>>>>
> >>>>> $ ps -e f
> >>>>> ...
> >>>>> 6656 ?        Sl    56:23 /usr/sge/bin/lx24-x86/sge_execd
> >>>>> 9391 ?        S      0:00  \_ sge_shepherd-10502 -bg
> >>>>> 9392 ?        Ss     0:00      \_ sshd: reuti [priv]
> >>>>> 9398 ?        S      0:00          \_ sshd: reuti@pts/2
> >>>>> 9405 pts/2    Ss     0:00              \_ -bash
> >>>>>
> >>>>> What does the process tree look like after "h_rt" expired - did the job 
> >>>>> vanish from `qstat` too?
> >>>>>
> >>>>> -- Reuti
> >>>>
> >>>> _______________________________________________
> >>>> users mailing list
> >>>> [email protected]
> >>>> https://gridengine.org/mailman/listinfo/users


