I checked the failure with:

Son of Grid Engine 8.1.6
    qlogin keeps running forever
    sge_shepherd runs at 100%
Son of Grid Engine 8.0.0d
    qlogin keeps running forever
    sge_shepherd runs at 100%
Open Grid Scheduler 6.2u5p1
    qlogin gets killed correctly after h_rt
    sge_shepherd runs normally
What has happened to the shepherd going from v6.2 to v8.0?
Bert
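
A note on reading the strace output quoted below: 4294967295 is 2^32 - 1,
which is what -1 looks like when a 32-bit pid_t is printed as an unsigned
number. So the shepherd appears to call wait4(-1, ..., WNOHANG) in a tight
loop. A return value of 0 means child processes exist but none has changed
state, and with WNOHANG the call never blocks, which would explain the 100%
CPU. A minimal sketch of the conversion, assuming a shell with 64-bit
arithmetic such as bash; the function name to_signed32 is made up for
illustration:

```shell
#!/bin/sh
# Hypothetical helper: reinterpret an unsigned 32-bit value (as some strace
# versions print a pid_t argument) as a signed 32-bit value.
to_signed32() {
    value=$1
    # Values with the top bit set (>= 2^31) are negative in two's complement.
    if [ "$value" -ge 2147483648 ]; then
        value=$((value - 4294967296))   # subtract 2^32
    fi
    printf '%s\n' "$value"
}

to_signed32 4294967295   # prints -1
```

This only decodes the argument. It does not show why the shepherd in 8.0/8.1
never blocks between the calls.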
> -----Original Message-----
> From: [email protected] [mailto:[email protected]] On Behalf Of Wiegers, Bert
> Sent: Thursday, January 23, 2014 3:44 PM
> To: [email protected] Group
> Subject: Re: [gridengine users] qlogin with ssh
>
> Hi,
>
> after implementing the tightly integrated ssh, the shepherd still can't
> interrupt an active ssh session.
>
> Maybe the interesting part:
> As soon as the qlogin is started on the node, the shepherd produces the 100%
> load.
> Running strace on the pid shows a flood of output like this:
>
> wait4(4294967295, 0x7fff7ed4001c, WNOHANG, 0x7fff7ed400c0) = 0
> alarm(0) = 0
> wait4(4294967295, 0x7fff7ed4001c, WNOHANG, 0x7fff7ed400c0) = 0
> alarm(0) = 0
> wait4(4294967295, 0x7fff7ed4001c, WNOHANG, 0x7fff7ed400c0) = 0
> alarm(0) = 0
>
>
> This topic was discussed some time ago:
> https://arc.liv.ac.uk/pipermail/gridengine-users/2010-November/032871.html
>
> no solution so far...
>
> Bert
>
>
>
>
>
> > -----Original Message-----
> > From: Reuti [mailto:[email protected]]
> > Sent: Thursday, December 05, 2013 1:51 PM
> > To: Wiegers, Bert
> > Cc: [email protected] Group
> > Subject: Re: [gridengine users] qlogin with ssh
> >
> > Hi,
> >
> > On 04.12.2013 at 22:47, Wiegers, Bert wrote:
> >
> > > I haven't tried this yet, because I can't find the right location for the
> > > needed patch in the openssh sources:
> > >
> > > patch:
> > > in main():
> > > init_rng();
> > > #ifdef SGESSH_INTEGRATION
> > > sgessh_readconfig();
> > > #endif
> > >
> > > Changelog from openssh
> > > 20110909
> > > - (dtucker) [entropy.h] Bug #1932: remove old definition of init_rng.
> > >   From Colin Watson.
> > >
> > > Has anyone done it?
> >
> > Comparing the older and current sources, it has to be put right after:
> >
> > __progname = ssh_get_progname(av[0]);
> >
> > (untested)
> >
> > -- Reuti
> >
> >
> > >
> > > execd_params ENABLE_ADDGRP_KILL=TRUE
> > > is already there.
> > >
> > > Bert
> > >
> > >> -----Original Message-----
> > >> From: Reuti [mailto:[email protected]]
> > >> Sent: Wednesday, December 04, 2013 10:30 PM
> > >> To: Wiegers, Bert
> > >> Cc: [email protected]
> > >> Subject: Re: [gridengine users] qlogin with ssh
> > >>
> > > >> On 04.12.2013 at 21:59, Wiegers, Bert wrote:
> > >>
> > > >>> According to the queue_conf man page, kill -9 should have been sent
> > > >>> by default (we tried this first).
> > > >>> The kill script below was an attempt to fix the problem.
> > > >>> Neither works.
> > >>
> > >> Then it might be promising to get a tight SSH integration:
> > >>
> > >> http://arc.liv.ac.uk/SGE/htmlman/htmlman5/remote_startup.html
> > >>
> > > >> section "SSH TIGHT INTEGRATION". I wonder why I forgot to mention there
> > > >> that it needs "execd_params ENABLE_ADDGRP_KILL=TRUE" in SGE's configuration.
> > >>
> > >> -- Reuti
> > >>
> > >>
> > >>> Bert
> > >>>
> > >>>
> > >>>
> > >>>> -----Original Message-----
> > >>>> From: Reuti [mailto:[email protected]]
> > >>>> Sent: Wednesday, December 04, 2013 6:28 PM
> > >>>> To: Wiegers, Bert
> > >>>> Cc: [email protected]
> > >>>> Subject: Re: [gridengine users] qlogin with ssh
> > >>>>
> > > >>>> On 04.12.2013 at 17:47, Wiegers, Bert wrote:
> > >>>>
> > >>>>> our setup is
> > >>>>>
> > >>>>> sge_conf:
> > > >>>>> qlogin_command /export/opt/SGE-8.1.6/utilbin/lx-amd64/qlogin_wrapper.sh
> > >>>>>
> > >>>>> cat /export/opt/SGE-8.1.6/utilbin/lx-amd64/qlogin_wrapper.sh
> > >>>>> #!/bin/sh
> > >>>>> HOST=$1
> > >>>>> PORT=$2
> > >>>>> /usr/bin/ssh -Y -p $PORT $HOST
> > >>>>>
> > >>>>>
> > >>>>> queue_conf:
> > > >>>>> terminate_method /export/opt/SGE-8.1.6/scripts/case_terminate_method.sh \
> > > >>>>>     $job_pid $job_owner
> > >>>>
> > >>>> What was the motivation to have a custom method?
> > >>>>
> > >>>> The default is to send a kill to the complete process group, i.e.
> > >>>> something like
> > >>>>
> > >>>> kill -9 -- -$1
> > >>>>
> > >>>> in your setup.
> > >>>>
> > >>>>
> > >>>>> cat /export/opt/SGE-8.1.6/scripts/case_terminate_method.sh
> > >>>>> #!/bin/bash
> > >>>>>
> > >>>>> if [ $# -ne 2 ] ; then
> > >>>>> echo "Usage:" $0 job_pid job_owner
> > >>>>> exit 1
> > >>>>> fi
> > >>>>>
> > >>>>> job_pid=$1
> > >>>>> job_owner=$2
> > >>>>>
> > >>>>> # try and kill the session group - the group leader is the shell
> > >>>>> # executing the job script
> > > >>>>> pkill -s $job_pid
> > > >>>>> if [ $? -ne 0 ] ; then
> > >>>>> kill $job_pid
> > >>>>
> > > >>>> AFAICS the sid can be different from the pid or pgrp. And even when
> > > >>>> they are the same: it's the sid of the sshd, not the shell.
> > >>>>
> > >>>> -- Reuti
> > >>>>
> > >>>>
> > >>>>> fi
> > >>>>>
> > >>>>> # cleanup grace period
> > >>>>> sleep 10
> > >>>>> pkill -9 -s $job_pid
> > >>>>> if [ $? -ne 0 ] ; then
> > >>>>> kill -9 $job_pid
> > >>>>> fi
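
Reuti's remark about session IDs can be checked directly on the execution
host. In the ps output quoted further down, both "sshd: xxxxxx [priv]" and
"-tcsh" carry an "s" in the STAT column, which marks a session leader, so
they appear to sit in separate sessions. A minimal sketch for looking this
up, assuming a Linux node where /proc/<pid>/stat is readable; the helper
names stat_tail, pgid_of, sid_of and kill_group are made up for illustration
and are not part of SGE:

```shell
#!/bin/sh
# Hypothetical helpers (not part of SGE). Linux only: they read /proc/<pid>/stat.

# Print the fields of /proc/<pid>/stat that follow the "(comm)" entry:
# state, ppid, pgrp, session, ...
stat_tail() {
    read -r line < "/proc/$1/stat" || return 1
    # comm may itself contain spaces or ')', so cut after the LAST ") "
    rest=${line##*") "}
    printf '%s\n' "$rest"
}

# Process group ID of a PID (third field after comm).
pgid_of() {
    fields=$(stat_tail "$1") || return 1
    set -- $fields
    printf '%s\n' "$3"
}

# Session ID of a PID (fourth field after comm).
sid_of() {
    fields=$(stat_tail "$1") || return 1
    set -- $fields
    printf '%s\n' "$4"
}

# Send SIGKILL to the whole process group of a PID, in the spirit of
# "kill -9 -- -$1" mentioned above. Does nothing if the PID is already gone.
kill_group() {
    pgid=$(pgid_of "$1") || return 0
    kill -s KILL -- "-$pgid"
}
```

Comparing `sid_of "$job_pid"` with `sid_of` of the login shell's pid shows
whether `pkill -s $job_pid` can reach the shell at all. If the two differ,
the session-based kill cannot match the shell.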
> > >>>>>
> > >>>>>
> > >>>>>
> > >>>>> Bert
> > >>>>>
> > >>>>>
> > >>>>>> -----Original Message-----
> > >>>>>> From: Reuti [mailto:[email protected]]
> > >>>>>> Sent: Wednesday, December 04, 2013 5:33 PM
> > >>>>>> To: Wiegers, Bert
> > >>>>>> Cc: [email protected]
> > >>>>>> Subject: Re: [gridengine users] qlogin with ssh
> > >>>>>>
> > > >>>>>> On 04.12.2013 at 17:19, Wiegers, Bert wrote:
> > >>>>>>
> > >>>>>>> Hi *,
> > >>>>>>>
> > > >>>>>>> we are using a qlogin wrapper script, as mentioned below.
> > > >>>>>>> It looks like this setup prevents SGE from reaching the
> > > >>>>>>> terminate_method.
> > >>>>>>
> > >>>>>> You defined a custom "terminate_method"? Can you please post it?
> > >>>>>>
> > >>>>>> -- Reuti
> > >>>>>>
> > >>>>>>
> > >>>>>>> Bert
> > >>>>>>>
> > >>>>>>>> -----Original Message-----
> > > >>>>>>>> From: [email protected]
> > > >>>>>>>> [mailto:[email protected]] On Behalf Of Wiegers, Bert
> > >>>>>>>> Sent: Tuesday, December 03, 2013 9:01 AM
> > >>>>>>>> To: [email protected]
> > >>>>>>>> Subject: Re: [gridengine users] qlogin with ssh
> > >>>>>>>>
> > >>>>>>>> Hi Reuti,
> > >>>>>>>>
> > > >>>>>>>> The process tree looks like this:
> > > >>>>>>>> root 20939 0.0 0.0 1242552 5892 ? Sl Nov14 18:57 /export/opt/SGE-8.1.6/bin/lx-amd64/sge_execd
> > > >>>>>>>> root 33874 99.7 0.0 34164 2828 ? R 08:47 0:22  \_ sge_shepherd-18003 -bg
> > > >>>>>>>> root 33882 0.0 0.0 98156 3836 pts/1 Ss+ 08:47 0:00      \_ sshd: xxxxxx [priv]
> > > >>>>>>>> xxxxxx 33884 0.0 0.0 98156 2044 pts/1 S+ 08:47 0:00          \_ sshd: xxxxxx@pts/2
> > > >>>>>>>> xxxxxx 33885 1.1 0.0 14556 3260 pts/2 SNs 08:47 0:00              \_ -tcsh
> > > >>>>>>>> It stays the same as long as I am logged on to the node.
> > >>>>>>>>
> > >>>>>>>> The Job is still listed in qstat.
> > >>>>>>>>
> > > >>>>>>>> In the messages of the scheduler I find these hints:
> > > >>>>>>>> 12/03/2013 08:52:31|schedu|service0|W|job 18003.1 should have finished since 90s
> > > >>>>>>>>
> > > >>>>>>>> When I log out afterwards I see in the messages:
> > > >>>>>>>> 12/03/2013 08:58:42|worker|service0|I|removing trigger to terminate job 18003.1
> > > >>>>>>>> 12/03/2013 08:58:42|worker|service0|W|job 18003.1 failed on host XY qmaster enforced h_rt, h_cpu, or h_vmem limit because: <unknown reason>
> > >>>>>>>>
> > >>>>>>>> Bert
> > >>>>>>>>
> > >>>>>>>>
> > >>>>>>>>
> > >>>>>>>>> -----Original Message-----
> > >>>>>>>>> From: Reuti [mailto:[email protected]]
> > >>>>>>>>> Sent: Monday, December 02, 2013 6:43 PM
> > >>>>>>>>> To: Wiegers, Bert
> > >>>>>>>>> Cc: [email protected]
> > >>>>>>>>> Subject: Re: [gridengine users] qlogin with ssh
> > >>>>>>>>>
> > >>>>>>>>> Hi,
> > >>>>>>>>>
> > > >>>>>>>>> On 02.12.2013 at 18:28, Wiegers, Bert wrote:
> > >>>>>>>>>
> > > >>>>>>>>>> we are running SGE 8.1.6.
> > > >>>>>>>>>> We have configured some interactive queues and use qlogin with the
> > > >>>>>>>>>> wrapper script (... /usr/bin/ssh -Y -p $PORT $HOST).
> > > >>>>>>>>>> In our setup the user is forced to use the h_rt variable.
> > > >>>>>>>>>> Unfortunately, qlogin does not care if the walltime is exceeded.
> > > >>>>>>>>>> The shepherd seems to be unable to kill the qlogin sessions when the
> > > >>>>>>>>>> user is still connected to the node.
> > > >>>>>>>>>> Does anyone have a solution or a workaround for this?
> > >>>>>>>>>
> > > >>>>>>>>> Is the `sshd` a child of the `shepherd`, i.e. something like:
> > >>>>>>>>>
> > >>>>>>>>> $ ps -e f
> > >>>>>>>>> ...
> > >>>>>>>>> 6656 ? Sl 56:23 /usr/sge/bin/lx24-x86/sge_execd
> > >>>>>>>>> 9391 ? S 0:00 \_ sge_shepherd-10502 -bg
> > >>>>>>>>> 9392 ? Ss 0:00 \_ sshd: reuti [priv]
> > >>>>>>>>> 9398 ? S 0:00 \_ sshd: reuti@pts/2
> > >>>>>>>>> 9405 pts/2 Ss 0:00 \_ -bash
> > >>>>>>>>>
> > > >>>>>>>>> What does the process tree look like after "h_rt" has expired - did
> > > >>>>>>>>> the job vanish from `qstat` too?
> > >>>>>>>>>
> > >>>>>>>>> -- Reuti
> > >>>>>>>>
> > >>>>>>>> _______________________________________________
> > >>>>>>>> users mailing list
> > >>>>>>>> [email protected]
> > >>>>>>>> https://gridengine.org/mailman/listinfo/users