Hi,

After implementing the tight SSH integration, the shepherd still can't interrupt
an active ssh session.

Maybe the interesting part:
As soon as the qlogin is started on the node, the shepherd produces 100% load.
Stracing its pid shows heavy output with this content:

wait4(4294967295, 0x7fff7ed4001c, WNOHANG, 0x7fff7ed400c0) = 0
alarm(0)                                = 0
wait4(4294967295, 0x7fff7ed4001c, WNOHANG, 0x7fff7ed400c0) = 0
alarm(0)                                = 0
wait4(4294967295, 0x7fff7ed4001c, WNOHANG, 0x7fff7ed400c0) = 0
alarm(0)                                = 0
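
If I read the trace correctly, 4294967295 is 2^32 - 1, i.e. a pid argument of -1
printed as an unsigned number, so each call is wait4(-1, ..., WNOHANG): a
non-blocking wait for any child. The return value 0 means children exist but
none has changed state, so the shepherd seems to poll in a loop without ever
sleeping, which would explain the load. A small sketch of the conversion
(to_signed32 is only a helper name made up for this mail):

```shell
#!/bin/sh
# Reinterpret an unsigned 32-bit value, as printed in the strace output,
# as a signed 32-bit pid_t. to_signed32 is a hypothetical helper name,
# not part of strace or SGE.
to_signed32() {
    if [ "$1" -ge 2147483648 ]; then
        # Values with the top bit set wrap around to negative numbers.
        echo $(( $1 - 4294967296 ))
    else
        echo "$1"
    fi
}

to_signed32 4294967295   # prints -1, which wait4 treats as "any child"
```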


This topic was discussed some time ago:
https://arc.liv.ac.uk/pipermail/gridengine-users/2010-November/032871.html

no solution so far...
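
Related to Reuti's remark quoted below, that the sid can differ from the pid or
pgrp: a minimal sketch for checking which ids a process really has before
relying on "pkill -s". It assumes Linux with /proc mounted, and show_ids is
only a made-up helper name, not an SGE tool.

```shell
#!/bin/sh
# Print "pid pgid sid" for one process, read from /proc/<pid>/stat.
# Assumption: Linux with /proc mounted. show_ids is a hypothetical helper.
show_ids() {
    pid=$1
    stat=$(cat "/proc/$pid/stat") || return 1
    # Drop everything up to the last ") " so a command name containing
    # spaces or parentheses cannot shift the fields.
    rest=${stat##*) }
    # Remaining fields are: state ppid pgrp session ...
    set -- $rest
    echo "$pid $3 $4"
}

show_ids $$
```

Running it for the pid passed to the terminate method and for the login shell
shows whether both share one session id, which is what "pkill -s" matches on.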

Bert





> -----Original Message-----
> From: Reuti [mailto:[email protected]]
> Sent: Thursday, December 05, 2013 1:51 PM
> To: Wiegers, Bert
> Cc: [email protected] Group
> Subject: Re: [gridengine users] qlogin with ssh
> 
> Hi,
> 
> On 04.12.2013 at 22:47, Wiegers, Bert wrote:
> 
> > I haven't tried this yet, because I can't find the right location for the
> > needed patch in the openssh sources:
> >
> > patch:
> >              in main():
> >                     init_rng();
> >                     #ifdef SGESSH_INTEGRATION
> >                     sgessh_readconfig();
> >                     #endif
> >
> > Changelog from openssh
> > 20110909
> > - (dtucker) [entropy.h] Bug #1932: remove old definition of init_rng.  From
> >   Colin Watson.
> >
> > Has anyone done it?
> 
> Comparing the older and the current source, it has to be put right after:
> 
> __progname = ssh_get_progname(av[0]);
> 
> (untested)
> 
> -- Reuti
> 
> 
> >
> > execd_params ENABLE_ADDGRP_KILL=TRUE
> > is already there.
> >
> > Bert
> >
> >> -----Original Message-----
> >> From: Reuti [mailto:[email protected]]
> >> Sent: Wednesday, December 04, 2013 10:30 PM
> >> To: Wiegers, Bert
> >> Cc: [email protected]
> >> Subject: Re: [gridengine users] qlogin with ssh
> >>
> >> On 04.12.2013 at 21:59, Wiegers, Bert wrote:
> >>
> >>> According to the man page of queue_conf, the kill -9 command should have
> >>> been sent by default (we tried this first).
> >>> The kill script below was an attempt to fix the problem.
> >>> Neither works.
> >>
> >> Then it might be promising to get a tight SSH integration:
> >>
> >> http://arc.liv.ac.uk/SGE/htmlman/htmlman5/remote_startup.html
> >>
> >> section "SSH TIGHT INTEGRATION". I wonder why I forgot to mention there
> >> that it needs "execd_params ENABLE_ADDGRP_KILL=TRUE" in SGE's configuration.
> >>
> >> -- Reuti
> >>
> >>
> >>> Bert
> >>>
> >>>
> >>>
> >>>> -----Original Message-----
> >>>> From: Reuti [mailto:[email protected]]
> >>>> Sent: Wednesday, December 04, 2013 6:28 PM
> >>>> To: Wiegers, Bert
> >>>> Cc: [email protected]
> >>>> Subject: Re: [gridengine users] qlogin with ssh
> >>>>
> >>>> On 04.12.2013 at 17:47, Wiegers, Bert wrote:
> >>>>
> >>>>> our setup is
> >>>>>
> >>>>> sge_conf:
> >>>>> qlogin_command        /export/opt/SGE-8.1.6/utilbin/lx-amd64/qlogin_wrapper.sh
> >>>>>
> >>>>> cat /export/opt/SGE-8.1.6/utilbin/lx-amd64/qlogin_wrapper.sh
> >>>>> #!/bin/sh
> >>>>> HOST=$1
> >>>>> PORT=$2
> >>>>> /usr/bin/ssh -Y -p $PORT $HOST
> >>>>>
> >>>>>
> >>>>> queue_conf:
> >>>>> terminate_method      /export/opt/SGE-8.1.6/scripts/case_terminate_method.sh \
> >>>>>                    $job_pid $job_owner
> >>>>
> >>>> What was the motivation to have a custom method?
> >>>>
> >>>> The default is to send a kill to the complete process group, i.e. 
> >>>> something like
> >>>>
> >>>> kill -9 -- -$1
> >>>>
> >>>> in your setup.
> >>>>
> >>>>
> >>>>> cat /export/opt/SGE-8.1.6/scripts/case_terminate_method.sh
> >>>>> #!/bin/bash
> >>>>>
> >>>>> if [ $# -ne 2 ] ; then
> >>>>> echo "Usage:" $0 job_pid job_owner
> >>>>> exit 1
> >>>>> fi
> >>>>>
> >>>>> job_pid=$1
> >>>>> job_owner=$2
> >>>>>
> >>>>> # try and kill the session group - the group leader is the shell
> >>>>> # executing the job script
> >>>>> pkill -s $job_pid
> >>>>> if [ $? -ne 0 ] ; then
> >>>>>      kill $job_pid
> >>>>
> >>>> AFAICS the sid can be different from the pid or pgrp. And even when they
> >>>> are the same: it's the sid of the sshd, not the shell.
> >>>>
> >>>> -- Reuti
> >>>>
> >>>>
> >>>>> fi
> >>>>>
> >>>>> # cleanup grace period
> >>>>> sleep 10
> >>>>> pkill -9 -s $job_pid
> >>>>> if [ $? -ne 0 ] ; then
> >>>>>      kill -9 $job_pid
> >>>>> fi
> >>>>>
> >>>>>
> >>>>>
> >>>>> Bert
> >>>>>
> >>>>>
> >>>>>> -----Original Message-----
> >>>>>> From: Reuti [mailto:[email protected]]
> >>>>>> Sent: Wednesday, December 04, 2013 5:33 PM
> >>>>>> To: Wiegers, Bert
> >>>>>> Cc: [email protected]
> >>>>>> Subject: Re: [gridengine users] qlogin with ssh
> >>>>>>
> >>>>>> On 04.12.2013 at 17:19, Wiegers, Bert wrote:
> >>>>>>
> >>>>>>> Hi *,
> >>>>>>>
> >>>>>>> we are using a qlogin wrapper script, as mentioned below.
> >>>>>>> It looks like this setup prevents SGE from reaching the
> >>>>>>> terminate_method.
> >>>>>>
> >>>>>> You defined a custom "terminate_method"? Can you please post it?
> >>>>>>
> >>>>>> -- Reuti
> >>>>>>
> >>>>>>
> >>>>>>> Bert
> >>>>>>>
> >>>>>>>> -----Original Message-----
> >>>>>>>> From: [email protected] [mailto:[email protected]] On Behalf Of Wiegers, Bert
> >>>>>>>> Sent: Tuesday, December 03, 2013 9:01 AM
> >>>>>>>> To: [email protected]
> >>>>>>>> Subject: Re: [gridengine users] qlogin with ssh
> >>>>>>>>
> >>>>>>>> Hi Reuti,
> >>>>>>>>
> >>>>>>>> The process tree looks like this:
> >>>>>>>> root     20939  0.0  0.0 1242552 5892 ?        Sl   Nov14  18:57 /export/opt/SGE-8.1.6/bin/lx-amd64/sge_execd
> >>>>>>>> root     33874 99.7  0.0  34164  2828 ?        R    08:47   0:22  \_ sge_shepherd-18003 -bg
> >>>>>>>> root     33882  0.0  0.0  98156  3836 pts/1    Ss+  08:47   0:00      \_ sshd: xxxxxx [priv]
> >>>>>>>> xxxxxx 33884  0.0  0.0  98156  2044 pts/1    S+   08:47   0:00          \_ sshd: xxxxxx@pts/2
> >>>>>>>> xxxxxx 33885  1.1  0.0  14556  3260 pts/2    SNs  08:47   0:00              \_ -tcsh
> >>>>>>>> it stays the same as long as I am logged on to the node.
> >>>>>>>>
> >>>>>>>> The Job is still listed in qstat.
> >>>>>>>>
> >>>>>>>> In the messages of the scheduler I find these hints:
> >>>>>>>> 12/03/2013 08:52:31|schedu|service0|W|job 18003.1 should have finished since 90s
> >>>>>>>>
> >>>>>>>> When I log out afterwards, I see in the messages:
> >>>>>>>> 12/03/2013 08:58:42|worker|service0|I|removing trigger to terminate job 18003.1
> >>>>>>>> 12/03/2013 08:58:42|worker|service0|W|job 18003.1 failed on host XY qmaster enforced h_rt, h_cpu, or h_vmem limit because: <unknown reason>
> >>>>>>>>
> >>>>>>>> Bert
> >>>>>>>>
> >>>>>>>>
> >>>>>>>>
> >>>>>>>>> -----Original Message-----
> >>>>>>>>> From: Reuti [mailto:[email protected]]
> >>>>>>>>> Sent: Monday, December 02, 2013 6:43 PM
> >>>>>>>>> To: Wiegers, Bert
> >>>>>>>>> Cc: [email protected]
> >>>>>>>>> Subject: Re: [gridengine users] qlogin with ssh
> >>>>>>>>>
> >>>>>>>>> Hi,
> >>>>>>>>>
> >>>>>>>>> On 02.12.2013 at 18:28, Wiegers, Bert wrote:
> >>>>>>>>>
> >>>>>>>>>> we are running SGE 8.1.6.
> >>>>>>>>>> We have configured some interactive queues and use qlogin with the
> >>>>>>>>>> wrapper script (... /usr/bin/ssh -Y -p $PORT $HOST).
> >>>>>>>>>> In our setup the user is forced to use the h_rt variable.
> >>>>>>>>>> Unfortunately, qlogin does not care if the walltime is exceeded.
> >>>>>>>>>> The shepherd seems to be unable to kill the qlogin sessions while the
> >>>>>>>>>> user is still connected to the node.
> >>>>>>>>>> Does anyone have a solution or a workaround for this?
> >>>>>>>>>
> >>>>>>>>> Is the `sshd` a child of the `shepherd`, i.e. something like:
> >>>>>>>>>
> >>>>>>>>> $ ps -e f
> >>>>>>>>> ...
> >>>>>>>>> 6656 ?        Sl    56:23 /usr/sge/bin/lx24-x86/sge_execd
> >>>>>>>>> 9391 ?        S      0:00  \_ sge_shepherd-10502 -bg
> >>>>>>>>> 9392 ?        Ss     0:00      \_ sshd: reuti [priv]
> >>>>>>>>> 9398 ?        S      0:00          \_ sshd: reuti@pts/2
> >>>>>>>>> 9405 pts/2    Ss     0:00              \_ -bash
> >>>>>>>>>
> >>>>>>>>> What does the process tree look like after "h_rt" expired - did the job
> >>>>>>>>> vanish from `qstat` too?
> >>>>>>>>>
> >>>>>>>>> -- Reuti
> >>>>>>>>
> >>>>>>>> _______________________________________________
> >>>>>>>> users mailing list
> >>>>>>>> [email protected]
> >>>>>>>> https://gridengine.org/mailman/listinfo/users

