On 04.12.2013 at 21:59, Wiegers, Bert wrote:

> According to the man page of queue_conf,
> the kill -9 command should have been sent by default (we tried this first).
> The kill script below was an attempt to fix the problem.
> Neither works.

Then it might be promising to set up a tight SSH integration:

http://arc.liv.ac.uk/SGE/htmlman/htmlman5/remote_startup.html

section "SSH TIGHT INTEGRATION". I wonder why I forgot to mention there that it needs "execd_params ENABLE_ADDGRP_KILL=TRUE" in SGE's configuration.

-- Reuti

> Bert
>
>
>
>> -----Original Message-----
>> From: Reuti [mailto:[email protected]]
>> Sent: Wednesday, December 04, 2013 6:28 PM
>> To: Wiegers, Bert
>> Cc: [email protected]
>> Subject: Re: [gridengine users] qlogin with ssh
>>
>> On 04.12.2013 at 17:47, Wiegers, Bert wrote:
>>
>>> our setup is
>>>
>>> sge_conf:
>>> qlogin_command
>>> /export/opt/SGE-8.1.6/utilbin/lx-amd64/qlogin_wrapper.sh
>>>
>>> cat /export/opt/SGE-8.1.6/utilbin/lx-amd64/qlogin_wrapper.sh
>>> #!/bin/sh
>>> HOST=$1
>>> PORT=$2
>>> /usr/bin/ssh -Y -p $PORT $HOST
>>>
>>>
>>> queue_conf:
>>> terminate_method
>>> /export/opt/SGE-8.1.6/scripts/case_terminate_method.sh \
>>> $job_pid $job_owner
>>
>> What was the motivation to have a custom method?
>>
>> The default is to send a kill to the complete process group, i.e. something
>> like
>>
>> kill -9 -- -$1
>>
>> in your setup.
>>
>>
>>> cat /export/opt/SGE-8.1.6/scripts/case_terminate_method.sh
>>> #!/bin/bash
>>>
>>> if [ $# -ne 2 ] ; then
>>>     echo "Usage:" $0 job_pid job_owner
>>>     exit 1
>>> fi
>>>
>>> job_pid=$1
>>> job_owner=$2
>>>
>>> # try and kill the session group - the group leader is the shell
>>> # executing the job script
>>> pkill -s $job_pid
>>> if [ $? -ne 0 ] ; then
>>>     kill $job_pid
>>
>> AFAICS the sid can be different from the pid or pgrp. And even when they
>> are the same: it's the sid of the sshd, not the shell.
>>
>> -- Reuti
>>
>>
>>> fi
>>>
>>> # cleanup grace period
>>> sleep 10
>>> pkill -9 -s $job_pid
>>> if [ $? -ne 0 ] ; then
>>>     kill -9 $job_pid
>>> fi
>>>
>>>
>>>
>>> Bert
>>>
>>>
>>>> -----Original Message-----
>>>> From: Reuti [mailto:[email protected]]
>>>> Sent: Wednesday, December 04, 2013 5:33 PM
>>>> To: Wiegers, Bert
>>>> Cc: [email protected]
>>>> Subject: Re: [gridengine users] qlogin with ssh
>>>>
>>>> On 04.12.2013 at 17:19, Wiegers, Bert wrote:
>>>>
>>>>> Hi *,
>>>>>
>>>>> we are using a qlogin wrapper script, as mentioned below.
>>>>> It looks like this setup prevents SGE from reaching the
>>>>> terminate_method.
>>>>
>>>> You defined a custom "terminate_method"? Can you please post it?
>>>>
>>>> -- Reuti
>>>>
>>>>
>>>>> Bert
>>>>>
>>>>>> -----Original Message-----
>>>>>> From: [email protected] [mailto:[email protected]]
>>>>>> On Behalf Of Wiegers, Bert
>>>>>> Sent: Tuesday, December 03, 2013 9:01 AM
>>>>>> To: [email protected]
>>>>>> Subject: Re: [gridengine users] qlogin with ssh
>>>>>>
>>>>>> Hi Reuti,
>>>>>>
>>>>>> The process tree looks like this:
>>>>>> root   20939  0.0 0.0 1242552 5892 ?     Sl  Nov14 18:57 /export/opt/SGE-8.1.6/bin/lx-amd64/sge_execd
>>>>>> root   33874 99.7 0.0   34164 2828 ?     R   08:47  0:22  \_ sge_shepherd-18003 -bg
>>>>>> root   33882  0.0 0.0   98156 3836 pts/1 Ss+ 08:47  0:00      \_ sshd: xxxxxx [priv]
>>>>>> xxxxxx 33884  0.0 0.0   98156 2044 pts/1 S+  08:47  0:00          \_ sshd: xxxxxx@pts/2
>>>>>> xxxxxx 33885  1.1 0.0   14556 3260 pts/2 SNs 08:47  0:00              \_ -tcsh
>>>>>> It stays the same as long as I am logged on to the node.
>>>>>>
>>>>>> The job is still listed in qstat.
>>>>>>
>>>>>> In the messages of the scheduler I find these hints:
>>>>>> 12/03/2013 08:52:31|schedu|service0|W|job 18003.1 should have finished since 90s
>>>>>>
>>>>>> When I log out afterwards I see in the messages:
>>>>>> 12/03/2013 08:58:42|worker|service0|I|removing trigger to terminate job 18003.1
>>>>>> 12/03/2013 08:58:42|worker|service0|W|job 18003.1 failed on host XY qmaster enforced h_rt, h_cpu, or h_vmem limit because: <unknown reason>
>>>>>>
>>>>>> Bert
>>>>>>
>>>>>>
>>>>>>
>>>>>>> -----Original Message-----
>>>>>>> From: Reuti [mailto:[email protected]]
>>>>>>> Sent: Monday, December 02, 2013 6:43 PM
>>>>>>> To: Wiegers, Bert
>>>>>>> Cc: [email protected]
>>>>>>> Subject: Re: [gridengine users] qlogin with ssh
>>>>>>>
>>>>>>> Hi,
>>>>>>>
>>>>>>> On 02.12.2013 at 18:28, Wiegers, Bert wrote:
>>>>>>>
>>>>>>>> we are running SGE 8.1.6.
>>>>>>>> We have configured some interactive queues and use qlogin with the
>>>>>>>> wrapper script (... /usr/bin/ssh -Y -p $PORT $HOST).
>>>>>>>> In our setup the user is forced to use the h_rt variable.
>>>>>>>> Unfortunately qlogin does not care if the walltime is overdue.
>>>>>>>> The shepherd seems to be unable to kill the qlogin sessions when the
>>>>>>>> user is still connected to the node.
>>>>>>>> Does anyone have a solution or a workaround for this?
>>>>>>>
>>>>>>> Is the `sshd` a child of the `shepherd`, i.e. something like:
>>>>>>>
>>>>>>> $ ps -e f
>>>>>>> ...
>>>>>>>  6656 ?     Sl 56:23 /usr/sge/bin/lx24-x86/sge_execd
>>>>>>>  9391 ?     S   0:00  \_ sge_shepherd-10502 -bg
>>>>>>>  9392 ?     Ss  0:00      \_ sshd: reuti [priv]
>>>>>>>  9398 ?     S   0:00          \_ sshd: reuti@pts/2
>>>>>>>  9405 pts/2 Ss  0:00              \_ -bash
>>>>>>>
>>>>>>> What does the process tree look like after "h_rt" expired - did the job
>>>>>>> vanish from the `qstat` too?
>>>>>>>
>>>>>>> -- Reuti
>>>>>>
>>>>>> _______________________________________________
>>>>>> users mailing list
>>>>>> [email protected]
>>>>>> https://gridengine.org/mailman/listinfo/users
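
For reference, the default behaviour described above (signalling the complete process group, something like "kill -9 -- -$1") could be sketched as a small helper for a terminate_method script. This is only an illustrative sketch: the function name terminate_group, the optional grace-period argument, and the TERM-then-KILL escalation are assumptions made here, not something taken from SGE or from the script posted in this thread. As pointed out above, the shell started by sshd may live in a different session and process group, so a group kill like this will not necessarily reach it; that gap is what the tight SSH integration with ENABLE_ADDGRP_KILL=TRUE is meant to close.

```shell
# Hypothetical helper, not part of SGE: signal a whole process group.
# Usage: terminate_group PGID [GRACE_SECONDS]
terminate_group() {
    pgid=$1
    grace=${2:-10}   # assumed default grace period in seconds

    # A negative PID addresses every process in that process group.
    # If the signal cannot be delivered, assume the group is already gone.
    kill -s TERM -- "-$pgid" 2>/dev/null || return 0

    # Poll once per second until the group disappears or the grace
    # period runs out (signal 0 only checks for existence).
    i=0
    while [ "$i" -lt "$grace" ]; do
        kill -s 0 -- "-$pgid" 2>/dev/null || return 0
        sleep 1
        i=$((i + 1))
    done

    # Still present: escalate to SIGKILL for the whole group.
    kill -s KILL -- "-$pgid" 2>/dev/null
    return 0
}
```

In a terminate_method script it might be called as terminate_group "$1" 10, with $job_pid passed as the first argument in the queue configuration.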
