Hi,

after implementing the tightly integrated SSH, the shepherd still can't interrupt an active ssh session.

Maybe the interesting part: as soon as the qlogin is started on the node, the shepherd produces 100% load. Stracing the pid shows heavy output with this content:

    wait4(4294967295, 0x7fff7ed4001c, WNOHANG, 0x7fff7ed400c0) = 0
    alarm(0) = 0
    wait4(4294967295, 0x7fff7ed4001c, WNOHANG, 0x7fff7ed400c0) = 0
    alarm(0) = 0
    wait4(4294967295, 0x7fff7ed4001c, WNOHANG, 0x7fff7ed400c0) = 0
    alarm(0) = 0

This topic was discussed some time ago:
https://arc.liv.ac.uk/pipermail/gridengine-users/2010-November/032871.html
No solution so far...

Bert

> -----Original Message-----
> From: Reuti [mailto:[email protected]]
> Sent: Thursday, December 05, 2013 1:51 PM
> To: Wiegers, Bert
> Cc: [email protected] Group
> Subject: Re: [gridengine users] qlogin with ssh
>
> Hi,
>
> On 04.12.2013 at 22:47, Wiegers, Bert wrote:
>
> > I haven't tried this yet, because I can't find the right location for the
> > needed patch in the openssh sources:
> >
> > patch:
> > in main():
> >     init_rng();
> > #ifdef SGESSH_INTEGRATION
> >     sgessh_readconfig();
> > #endif
> >
> > Changelog from openssh
> > 20110909
> > - (dtucker) [entropy.h] Bug #1932: remove old definition of init_rng. From
> >   Colin Watson.
> >
> > Has anyone done it?
>
> Comparing the older and the current sources, it has to be put right after:
>
> __progname = ssh_get_progname(av[0]);
>
> (untested)
>
> -- Reuti
>
>
> > execd_params ENABLE_ADDGRP_KILL=TRUE
> > is already there.
> >
> > Bert
> >
> >> -----Original Message-----
> >> From: Reuti [mailto:[email protected]]
> >> Sent: Wednesday, December 04, 2013 10:30 PM
> >> To: Wiegers, Bert
> >> Cc: [email protected]
> >> Subject: Re: [gridengine users] qlogin with ssh
> >>
> >> On 04.12.2013 at 21:59, Wiegers, Bert wrote:
> >>
> >>> According to the man-page of queue_conf
> >>> the kill -9 command should have been sent by default (we tried this
> >>> first).
> >>> This killscript below was an attempt to fix the problem.
> >>> Both don't work.
> >>
> >> Then it might be promising to get a tight SSH integration:
> >>
> >> http://arc.liv.ac.uk/SGE/htmlman/htmlman5/remote_startup.html
> >>
> >> section "SSH TIGHT INTEGRATION". I wonder why I forgot to mention there
> >> that it needs
> >> "execd_params ENABLE_ADDGRP_KILL=TRUE" in SGE's configuration.
> >>
> >> -- Reuti
> >>
> >>
> >>> Bert
> >>>
> >>>
> >>>
> >>>> -----Original Message-----
> >>>> From: Reuti [mailto:[email protected]]
> >>>> Sent: Wednesday, December 04, 2013 6:28 PM
> >>>> To: Wiegers, Bert
> >>>> Cc: [email protected]
> >>>> Subject: Re: [gridengine users] qlogin with ssh
> >>>>
> >>>> On 04.12.2013 at 17:47, Wiegers, Bert wrote:
> >>>>
> >>>>> our setup is
> >>>>>
> >>>>> sge_conf:
> >>>>> qlogin_command
> >>>>> /export/opt/SGE-8.1.6/utilbin/lx-amd64/qlogin_wrapper.sh
> >>>>>
> >>>>> cat /export/opt/SGE-8.1.6/utilbin/lx-amd64/qlogin_wrapper.sh
> >>>>> #!/bin/sh
> >>>>> HOST=$1
> >>>>> PORT=$2
> >>>>> /usr/bin/ssh -Y -p $PORT $HOST
> >>>>>
> >>>>>
> >>>>> queue_conf:
> >>>>> terminate_method
> >>>>> /export/opt/SGE-8.1.6/scripts/case_terminate_method.sh \
> >>>>> $job_pid $job_owner
> >>>>
> >>>> What was the motivation to have a custom method?
> >>>>
> >>>> The default is to send a kill to the complete process group, i.e.
> >>>> something like
> >>>>
> >>>> kill -9 -- -$1
> >>>>
> >>>> in your setup.
> >>>>
> >>>>
> >>>>> cat /export/opt/SGE-8.1.6/scripts/case_terminate_method.sh
> >>>>> #!/bin/bash
> >>>>>
> >>>>> if [ $# -ne 2 ] ; then
> >>>>>     echo "Usage:" $0 job_pid job_owner
> >>>>>     exit 1
> >>>>> fi
> >>>>>
> >>>>> job_pid=$1
> >>>>> job_owner=$2
> >>>>>
> >>>>> # try and kill the session group - the group leader is the shell
> >>>>> # executing the job script
> >>>>> pkill -s $job_pid
> >>>>> if [ $? -ne 0 ] ; then
> >>>>>     kill $job_pid
> >>>>
> >>>> AFAICS the sid can be different from the pid or pgrp. And even when
> >>>> they are the same: it's the sid of the sshd, not the shell.
> >>>>
> >>>> -- Reuti
> >>>>
> >>>>
> >>>>> fi
> >>>>>
> >>>>> # cleanup grace period
> >>>>> sleep 10
> >>>>> pkill -9 -s $job_pid
> >>>>> if [ $? -ne 0 ] ; then
> >>>>>     kill -9 $job_pid
> >>>>> fi
> >>>>>
> >>>>>
> >>>>>
> >>>>> Bert
> >>>>>
> >>>>>
> >>>>>> -----Original Message-----
> >>>>>> From: Reuti [mailto:[email protected]]
> >>>>>> Sent: Wednesday, December 04, 2013 5:33 PM
> >>>>>> To: Wiegers, Bert
> >>>>>> Cc: [email protected]
> >>>>>> Subject: Re: [gridengine users] qlogin with ssh
> >>>>>>
> >>>>>> On 04.12.2013 at 17:19, Wiegers, Bert wrote:
> >>>>>>
> >>>>>>> Hi *,
> >>>>>>>
> >>>>>>> we are using a qlogin wrapper script, as mentioned below.
> >>>>>>> It looks like this setup prevents SGE from reaching the
> >>>>>>> terminate_method.
> >>>>>>
> >>>>>> You defined a custom "terminate_method"? Can you please post it?
> >>>>>>
> >>>>>> -- Reuti
> >>>>>>
> >>>>>>
> >>>>>>> Bert
> >>>>>>>
> >>>>>>>> -----Original Message-----
> >>>>>>>> From: [email protected]
> >>>>>>>> [mailto:[email protected]] On Behalf Of Wiegers, Bert
> >>>>>>>> Sent: Tuesday, December 03, 2013 9:01 AM
> >>>>>>>> To: [email protected]
> >>>>>>>> Subject: Re: [gridengine users] qlogin with ssh
> >>>>>>>>
> >>>>>>>> Hi Reuti,
> >>>>>>>>
> >>>>>>>> The process tree looks like this:
> >>>>>>>> root   20939  0.0 0.0 1242552 5892 ?     Sl  Nov14 18:57 /export/opt/SGE-8.1.6/bin/lx-amd64/sge_execd
> >>>>>>>> root   33874 99.7 0.0   34164 2828 ?     R   08:47  0:22  \_ sge_shepherd-18003 -bg
> >>>>>>>> root   33882  0.0 0.0   98156 3836 pts/1 Ss+ 08:47  0:00      \_ sshd: xxxxxx [priv]
> >>>>>>>> xxxxxx 33884  0.0 0.0   98156 2044 pts/1 S+  08:47  0:00          \_ sshd: xxxxxx@pts/2
> >>>>>>>> xxxxxx 33885  1.1 0.0   14556 3260 pts/2 SNs 08:47  0:00              \_ -tcsh
> >>>>>>>> It stays the same as long as I am logged on to the node.
> >>>>>>>>
> >>>>>>>> The job is still listed in qstat.
> >>>>>>>>
> >>>>>>>> In the messages of the scheduler I find these hints:
> >>>>>>>> 12/03/2013 08:52:31|schedu|service0|W|job 18003.1 should have finished since 90s
> >>>>>>>>
> >>>>>>>> When I log out afterwards I see in the messages:
> >>>>>>>> 12/03/2013 08:58:42|worker|service0|I|removing trigger to terminate job 18003.1
> >>>>>>>> 12/03/2013 08:58:42|worker|service0|W|job 18003.1 failed on host XY qmaster enforced h_rt, h_cpu, or h_vmem limit because: <unknown reason>
> >>>>>>>>
> >>>>>>>> Bert
> >>>>>>>>
> >>>>>>>>
> >>>>>>>>
> >>>>>>>>> -----Original Message-----
> >>>>>>>>> From: Reuti [mailto:[email protected]]
> >>>>>>>>> Sent: Monday, December 02, 2013 6:43 PM
> >>>>>>>>> To: Wiegers, Bert
> >>>>>>>>> Cc: [email protected]
> >>>>>>>>> Subject: Re: [gridengine users] qlogin with ssh
> >>>>>>>>>
> >>>>>>>>> Hi,
> >>>>>>>>>
> >>>>>>>>> On 02.12.2013 at 18:28, Wiegers, Bert wrote:
> >>>>>>>>>
> >>>>>>>>>> we are running the SGE 8.1.6.
> >>>>>>>>>> We have configured some interactive queues and use qlogin with the
> >>>>>>>>>> wrapper-script (... /usr/bin/ssh -Y -p $PORT $HOST).
> >>>>>>>>>> In our setup the user is forced to use the h_rt variable.
> >>>>>>>>>> Unfortunately qlogin does not care if the walltime is overdue.
> >>>>>>>>>> The shepherd seems to be unable to kill the qlogin sessions when
> >>>>>>>>>> the user is still connected to the node.
> >>>>>>>>>> Does anyone have a solution or a workaround for this?
> >>>>>>>>>
> >>>>>>>>> Is the `sshd` a child of the `shepherd`, i.e. something like:
> >>>>>>>>>
> >>>>>>>>> $ ps -e f
> >>>>>>>>> ...
> >>>>>>>>> 6656 ?     Sl 56:23 /usr/sge/bin/lx24-x86/sge_execd
> >>>>>>>>> 9391 ?     S   0:00  \_ sge_shepherd-10502 -bg
> >>>>>>>>> 9392 ?     Ss  0:00      \_ sshd: reuti [priv]
> >>>>>>>>> 9398 ?     S   0:00          \_ sshd: reuti@pts/2
> >>>>>>>>> 9405 pts/2 Ss  0:00              \_ -bash
> >>>>>>>>>
> >>>>>>>>> What does the process tree look like after "h_rt" expired - did the
> >>>>>>>>> job vanish from `qstat` too?
> >>>>>>>>>
> >>>>>>>>> -- Reuti
> >>>>>>>>
> >>>>>>>> _______________________________________________
> >>>>>>>> users mailing list
> >>>>>>>> [email protected]
> >>>>>>>> https://gridengine.org/mailman/listinfo/users
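
A note on the strace output at the top of this thread: 4294967295 is 2^32 - 1, which is how -1 appears when printed as an unsigned 32-bit value. The shepherd is therefore calling wait4(-1, ..., WNOHANG), a non-blocking check for any child, and a return value of 0 means that children exist but none has changed state yet. Repeating that call without blocking in between would be consistent with the 100% load. Below is a minimal sketch for checking a saved trace; the helper names to_signed32 and count_idle_polls are made up for illustration and are not part of SGE or strace.

```shell
#!/bin/sh
# Hypothetical helpers for reading a saved strace log; not part of SGE or strace.

# Interpret an unsigned 32-bit number (as printed by strace) as a signed value.
# Values of 2^31 and above wrap around to negative numbers.
to_signed32() {
    if [ "$1" -ge 2147483648 ]; then
        echo $(( $1 - 4294967296 ))
    else
        echo "$1"
    fi
}

# Count the wait4() calls with WNOHANG that returned 0, i.e. polls that found
# children but no state change. Reads strace output on stdin and prints the count.
count_idle_polls() {
    grep -c 'wait4(.*WNOHANG.*) *= 0$'
}
```

For example, after saving a trace with `strace -o trace.txt -p <shepherd pid>`, running `count_idle_polls < trace.txt` prints how many idle polls were captured, and `to_signed32 4294967295` prints the pid argument as the kernel sees it.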
