I checked the failure with:

Son of Grid Engine 8.1.6
        qlogin keeps running forever
        sge_shepherd runs at 100%

Son of Grid Engine 8.0.0d
        qlogin keeps running forever
        sge_shepherd runs at 100%

Open Grid Scheduler 6.2u5p1
        qlogin gets killed correctly after h_rt
        sge_shepherd runs normally

What has happened to the shepherd going from v6.2 to v8.0?
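
A note on the strace output quoted below: 4294967295 is -1 shown as an unsigned 32-bit value, so the shepherd is calling wait4(-1, ..., WNOHANG). That call returns 0 immediately when children exist but none has changed state, and repeating it without ever blocking is a busy loop, which would account for the 100% load. A minimal sketch for quantifying this from a saved trace follows; the function name count_idle_polls is made up for illustration and is not part of SGE or strace.

```shell
#!/bin/sh
# Hypothetical helper (not part of SGE or strace): count non-blocking
# wait4() polls that returned 0 ("no child changed state") in strace
# output read from stdin.
count_idle_polls() {
    grep -c 'wait4(.*WNOHANG.*) *= 0$'
}

# 4294967295 is -1 when read as a signed 32-bit integer.
printf '%d\n' $(( 4294967295 - 4294967296 ))
```

One way to use it, assuming the trace is written to a file of your choosing: run "strace -p <shepherd pid> -o /tmp/shepherd.trace" for a few seconds, then "count_idle_polls < /tmp/shepherd.trace". A count in the thousands for a few seconds of tracing would point to polling rather than waiting.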

Bert

> -----Original Message-----
> From: [email protected] [mailto:[email protected]] On Behalf Of Wiegers, Bert
> Sent: Thursday, January 23, 2014 3:44 PM
> To: [email protected] Group
> Subject: Re: [gridengine users] qlogin with ssh
> 
> Hi,
> 
> after implementing the tightly integrated ssh, the shepherd still can't
> interrupt an active ssh session.
> 
> Maybe the interesting part:
> As soon as the qlogin is started on the node, the shepherd produces the 100%
> load.
> Running strace on the pid shows heavy output with this content:
> 
> wait4(4294967295, 0x7fff7ed4001c, WNOHANG, 0x7fff7ed400c0) = 0
> alarm(0)                                = 0
> wait4(4294967295, 0x7fff7ed4001c, WNOHANG, 0x7fff7ed400c0) = 0
> alarm(0)                                = 0
> wait4(4294967295, 0x7fff7ed4001c, WNOHANG, 0x7fff7ed400c0) = 0
> alarm(0)                                = 0
> 
> 
> This topic was discussed some time ago:
> https://arc.liv.ac.uk/pipermail/gridengine-users/2010-November/032871.html
> 
> no solution so far...
> 
> Bert
> 
> 
> 
> 
> 
> > -----Original Message-----
> > From: Reuti [mailto:[email protected]]
> > Sent: Thursday, December 05, 2013 1:51 PM
> > To: Wiegers, Bert
> > Cc: [email protected] Group
> > Subject: Re: [gridengine users] qlogin with ssh
> >
> > Hi,
> >
> > On 04.12.2013 at 22:47, Wiegers, Bert wrote:
> >
> > > I haven't tried this yet, because I can't find the right location for the
> > > needed patch in the openssh sources:
> > >
> > > patch:
> > >              in main():
> > >                     init_rng();
> > >                     #ifdef SGESSH_INTEGRATION
> > >                     sgessh_readconfig();
> > >                     #endif
> > >
> > > Changelog from openssh
> > > 20110909
> > > - (dtucker) [entropy.h] Bug #1932: remove old definition of init_rng.
> > >   From Colin Watson.
> > >
> > > Has anyone done it?
> >
> > Comparing the older and the current source, it has to be put right after:
> >
> > __progname = ssh_get_progname(av[0]);
> >
> > (untested)
> >
> > -- Reuti
> >
> >
> > >
> > > execd_params ENABLE_ADDGRP_KILL=TRUE
> > > is already there.
> > >
> > > Bert
> > >
> > >> -----Original Message-----
> > >> From: Reuti [mailto:[email protected]]
> > >> Sent: Wednesday, December 04, 2013 10:30 PM
> > >> To: Wiegers, Bert
> > >> Cc: [email protected]
> > >> Subject: Re: [gridengine users] qlogin with ssh
> > >>
> > >> On 04.12.2013 at 21:59, Wiegers, Bert wrote:
> > >>
> > >>> According to the man page of queue_conf, the kill -9 command should have
> > >>> been sent by default (we tried this first).
> > >>> The kill script below was an attempt to fix the problem.
> > >>> Neither works.
> > >>
> > >> Then it might be promising to get a tight SSH integration:
> > >>
> > >> http://arc.liv.ac.uk/SGE/htmlman/htmlman5/remote_startup.html
> > >>
> > >> section "SSH TIGHT INTEGRATION". I wonder why I forgot to mention there
> > >> that it needs "execd_params ENABLE_ADDGRP_KILL=TRUE" in SGE's configuration.
> > >>
> > >> -- Reuti
> > >>
> > >>
> > >>> Bert
> > >>>
> > >>>
> > >>>
> > >>>> -----Original Message-----
> > >>>> From: Reuti [mailto:[email protected]]
> > >>>> Sent: Wednesday, December 04, 2013 6:28 PM
> > >>>> To: Wiegers, Bert
> > >>>> Cc: [email protected]
> > >>>> Subject: Re: [gridengine users] qlogin with ssh
> > >>>>
> > >>>> On 04.12.2013 at 17:47, Wiegers, Bert wrote:
> > >>>>
> > >>>>> our setup is
> > >>>>>
> > >>>>> sge_conf:
> > >>>>> qlogin_command               /export/opt/SGE-8.1.6/utilbin/lx-amd64/qlogin_wrapper.sh
> > >>>>>
> > >>>>> cat /export/opt/SGE-8.1.6/utilbin/lx-amd64/qlogin_wrapper.sh
> > >>>>> #!/bin/sh
> > >>>>> HOST=$1
> > >>>>> PORT=$2
> > >>>>> /usr/bin/ssh -Y -p $PORT $HOST
> > >>>>>
> > >>>>>
> > >>>>> queue_conf:
> > >>>>> terminate_method      /export/opt/SGE-8.1.6/scripts/case_terminate_method.sh \
> > >>>>>                    $job_pid $job_owner
> > >>>>
> > >>>> What was the motivation to have a custom method?
> > >>>>
> > >>>> The default is to send a kill to the complete process group, i.e.
> > >>>> something like
> > >>>>
> > >>>> kill -9 -- -$1
> > >>>>
> > >>>> in your setup.
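
The process-group kill described above can be sketched as a small shell function. This is only an illustration of the "kill -9 -- -$1" idea, not SGE's actual implementation; the function name kill_process_group and its argument checks are assumptions added here.

```shell
#!/bin/sh
# Hypothetical helper: signal every process in one process group.
# $1 = process group ID, $2 = signal name or number (default: KILL).
kill_process_group() {
    pgid=$1
    sig=${2:-KILL}
    # Accept plain decimal numbers only.
    case $pgid in
        ''|*[!0-9]*) echo "invalid pgid: $pgid" >&2; return 2 ;;
    esac
    # Refuse 0 and 1: "kill -- -1" would signal every process we may signal.
    if [ "$pgid" -le 1 ]; then
        echo "refusing to signal pgid $pgid" >&2
        return 2
    fi
    # A negative PID argument addresses the whole process group;
    # "--" stops kill from parsing it as an option.
    kill -s "$sig" -- "-$pgid"
}
```

Called as "kill_process_group $job_pid", this only reaches the interactive shell if that shell is still in the group led by $job_pid. A login shell started by sshd usually becomes the leader of its own session and process group, which is one reason the plain group kill may miss it.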
> > >>>>
> > >>>>
> > >>>>> cat /export/opt/SGE-8.1.6/scripts/case_terminate_method.sh
> > >>>>> #!/bin/bash
> > >>>>>
> > >>>>> if [ $# -ne 2 ] ; then
> > >>>>> echo "Usage:" $0 job_pid job_owner
> > >>>>> exit 1
> > >>>>> fi
> > >>>>>
> > >>>>> job_pid=$1
> > >>>>> job_owner=$2
> > >>>>>
> > >>>>> # try and kill the session group - the group leader is the shell
> > >>>>> # executing the job script
> > >>>>> pkill -s $job_pid
> > >>>>> if [ $? -ne 0 ] ; then
> > >>>>>      kill $job_pid
> > >>>>
> > >>>> AFAICS the sid can be different from the pid or pgrp. And even
> > >>>> when they are the same: it's the sid of the sshd, not the shell.
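
The point about sid, pid and pgrp can be checked directly on the execution host. The sketch below reads the IDs from /proc, so it assumes Linux; the function name show_ids is made up for illustration.

```shell
#!/bin/sh
# Hypothetical helper: print PID, process group ID and session ID of one
# process. Assumes Linux, where /proc/<pid>/stat is available.
show_ids() {
    pid=$1
    read -r stat < "/proc/$pid/stat" || return 1
    # Drop "pid (comm) "; comm may itself contain spaces or parentheses.
    rest=${stat##*) }
    set -- $rest
    # Remaining fields are: state ppid pgrp session ...
    echo "pid=$pid pgid=$3 sid=$4"
}

show_ids $$
```

Running show_ids on the shepherd's child sshd and on the login shell shows whether "pkill -s $job_pid" can match anything at all: pkill -s selects by session ID, so it only helps if $job_pid really is the session ID of the processes to be killed.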
> > >>>>
> > >>>> -- Reuti
> > >>>>
> > >>>>
> > >>>>> fi
> > >>>>>
> > >>>>> # cleanup grace period
> > >>>>> sleep 10
> > >>>>> pkill -9 -s $job_pid
> > >>>>> if [ $? -ne 0 ] ; then
> > >>>>>      kill -9 $job_pid
> > >>>>> fi
> > >>>>>
> > >>>>>
> > >>>>>
> > >>>>> Bert
> > >>>>>
> > >>>>>
> > >>>>>> -----Original Message-----
> > >>>>>> From: Reuti [mailto:[email protected]]
> > >>>>>> Sent: Wednesday, December 04, 2013 5:33 PM
> > >>>>>> To: Wiegers, Bert
> > >>>>>> Cc: [email protected]
> > >>>>>> Subject: Re: [gridengine users] qlogin with ssh
> > >>>>>>
> > >>>>>> On 04.12.2013 at 17:19, Wiegers, Bert wrote:
> > >>>>>>
> > >>>>>>> Hi *,
> > >>>>>>>
> > >>>>>>> we are using a qlogin wrapper script, as mentioned below.
> > >>>>>>> It looks as if this setup prevents SGE from reaching the
> > >>>>>>> terminate_method.
> > >>>>>>
> > >>>>>> You defined a custom "terminate_method"? Can you please post it?
> > >>>>>>
> > >>>>>> -- Reuti
> > >>>>>>
> > >>>>>>
> > >>>>>>> Bert
> > >>>>>>>
> > >>>>>>>> -----Original Message-----
> > >>>>>>>> From: [email protected] [mailto:[email protected]] On Behalf Of Wiegers, Bert
> > >>>>>>>> Sent: Tuesday, December 03, 2013 9:01 AM
> > >>>>>>>> To: [email protected]
> > >>>>>>>> Subject: Re: [gridengine users] qlogin with ssh
> > >>>>>>>>
> > >>>>>>>> Hi Reuti,
> > >>>>>>>>
> > >>>>>>>> The process tree looks like this:
> > >>>>>>>> root     20939  0.0  0.0 1242552 5892 ?        Sl   Nov14  18:57 /export/opt/SGE-8.1.6/bin/lx-amd64/sge_execd
> > >>>>>>>> root     33874 99.7  0.0  34164  2828 ?        R    08:47   0:22  \_ sge_shepherd-18003 -bg
> > >>>>>>>> root     33882  0.0  0.0  98156  3836 pts/1    Ss+  08:47   0:00      \_ sshd: xxxxxx [priv]
> > >>>>>>>> xxxxxx   33884  0.0  0.0  98156  2044 pts/1    S+   08:47   0:00          \_ sshd: xxxxxx@pts/2
> > >>>>>>>> xxxxxx   33885  1.1  0.0  14556  3260 pts/2    SNs  08:47   0:00              \_ -tcsh
> > >>>>>>>> it stays the same as long as I am logged on to the node.
> > >>>>>>>>
> > >>>>>>>> The Job is still listed in qstat.
> > >>>>>>>>
> > >>>>>>>> In the messages of the scheduler I find these hints:
> > >>>>>>>> 12/03/2013 08:52:31|schedu|service0|W|job 18003.1 should have finished since 90s
> > >>>>>>>>
> > >>>>>>>> When I log out afterwards I see in the messages:
> > >>>>>>>> 12/03/2013 08:58:42|worker|service0|I|removing trigger to terminate job 18003.1
> > >>>>>>>> 12/03/2013 08:58:42|worker|service0|W|job 18003.1 failed on host XY qmaster enforced h_rt, h_cpu, or h_vmem limit because: <unknown reason>
> > >>>>>>>>
> > >>>>>>>> Bert
> > >>>>>>>>
> > >>>>>>>>
> > >>>>>>>>
> > >>>>>>>>> -----Original Message-----
> > >>>>>>>>> From: Reuti [mailto:[email protected]]
> > >>>>>>>>> Sent: Monday, December 02, 2013 6:43 PM
> > >>>>>>>>> To: Wiegers, Bert
> > >>>>>>>>> Cc: [email protected]
> > >>>>>>>>> Subject: Re: [gridengine users] qlogin with ssh
> > >>>>>>>>>
> > >>>>>>>>> Hi,
> > >>>>>>>>>
> > >>>>>>>>> On 02.12.2013 at 18:28, Wiegers, Bert wrote:
> > >>>>>>>>>
> > >>>>>>>>>> we are running SGE 8.1.6.
> > >>>>>>>>>> We have configured some interactive queues and use qlogin with the
> > >>>>>>>>>> wrapper script (... /usr/bin/ssh -Y -p $PORT $HOST).
> > >>>>>>>>>> In our setup the user is forced to specify h_rt.
> > >>>>>>>>>> Unfortunately, qlogin does not care if the walltime is overdue.
> > >>>>>>>>>> The shepherd seems to be unable to kill the qlogin sessions when the
> > >>>>>>>>>> user is still connected to the node.
> > >>>>>>>>>> Does anyone have a solution or a workaround for this?
> > >>>>>>>>>
> > >>>>>>>>> Is the `sshd` a child of the `shepherd`, i.e. something like:
> > >>>>>>>>>
> > >>>>>>>>> $ ps -e f
> > >>>>>>>>> ...
> > >>>>>>>>> 6656 ?        Sl    56:23 /usr/sge/bin/lx24-x86/sge_execd
> > >>>>>>>>> 9391 ?        S      0:00  \_ sge_shepherd-10502 -bg
> > >>>>>>>>> 9392 ?        Ss     0:00      \_ sshd: reuti [priv]
> > >>>>>>>>> 9398 ?        S      0:00          \_ sshd: reuti@pts/2
> > >>>>>>>>> 9405 pts/2    Ss     0:00              \_ -bash
> > >>>>>>>>>
> > >>>>>>>>> What does the process tree look like after "h_rt" expired - did
> > >>>>>>>>> the job vanish from `qstat` too?
> > >>>>>>>>>
> > >>>>>>>>> -- Reuti
> > >>>>>>>>
> > >>>>>>>> _______________________________________________
> > >>>>>>>> users mailing list
> > >>>>>>>> [email protected]
> > >>>>>>>> https://gridengine.org/mailman/listinfo/users
> > >>>>>
> > >>>
> > >>>
> > >
> > >
> 
> 
