On 04.12.2013 at 17:47, Wiegers, Bert wrote:

> our setup is
>
> sge_conf:
> qlogin_command      /export/opt/SGE-8.1.6/utilbin/lx-amd64/qlogin_wrapper.sh
>
> cat /export/opt/SGE-8.1.6/utilbin/lx-amd64/qlogin_wrapper.sh
> #!/bin/sh
> HOST=$1
> PORT=$2
> /usr/bin/ssh -Y -p $PORT $HOST
>
>
> queue_conf:
> terminate_method    /export/opt/SGE-8.1.6/scripts/case_terminate_method.sh \
>                     $job_pid $job_owner

What was the motivation to have a custom method? The default is to send a kill to the complete process group, i.e. something like kill -9 -- -$1 in your setup.

> cat /export/opt/SGE-8.1.6/scripts/case_terminate_method.sh
> #!/bin/bash
>
> if [ $# -ne 2 ] ; then
>     echo "Usage:" $0 job_pid job_owner
>     exit 1
> fi
>
> job_pid=$1
> job_owner=$2
>
> # try and kill the session group - the group leader is the shell
> # executing the job script
> pkill -s $job_pid
> if [ $? -ne 0 ] ; then
>     kill $job_pid

AFAICS the sid can be different from the pid or pgrp. And even when they are the same: it's the sid of the sshd, not the shell.

-- Reuti

> fi
>
> # cleanup grace period
> sleep 10
> pkill -9 -s $job_pid
> if [ $? -ne 0 ] ; then
>     kill -9 $job_pid
> fi
>
>
>
> Bert
>
>
>> -----Original Message-----
>> From: Reuti [mailto:[email protected]]
>> Sent: Wednesday, December 04, 2013 5:33 PM
>> To: Wiegers, Bert
>> Cc: [email protected]
>> Subject: Re: [gridengine users] qlogin with ssh
>>
>> On 04.12.2013 at 17:19, Wiegers, Bert wrote:
>>
>>> Hi *,
>>>
>>> we are using a qlogin wrapper script, as mentioned below.
>>> It looks like this setup prevents SGE from reaching the
>>> terminate_method.
>>
>> You defined a custom "terminate_method"? Can you please post it?
>>
>> -- Reuti
>>
>>
>>> Bert
>>>
>>>> -----Original Message-----
>>>> From: [email protected] [mailto:[email protected]]
>>>> On Behalf Of Wiegers, Bert
>>>> Sent: Tuesday, December 03, 2013 9:01 AM
>>>> To: [email protected]
>>>> Subject: Re: [gridengine users] qlogin with ssh
>>>>
>>>> Hi Reuti,
>>>>
>>>> The process tree looks like this:
>>>> root     20939  0.0  0.0 1242552 5892 ?      Sl   Nov14  18:57 /export/opt/SGE-8.1.6/bin/lx-amd64/sge_execd
>>>> root     33874 99.7  0.0   34164 2828 ?      R    08:47   0:22  \_ sge_shepherd-18003 -bg
>>>> root     33882  0.0  0.0   98156 3836 pts/1  Ss+  08:47   0:00      \_ sshd: xxxxxx [priv]
>>>> xxxxxx   33884  0.0  0.0   98156 2044 pts/1  S+   08:47   0:00          \_ sshd: xxxxxx@pts/2
>>>> xxxxxx   33885  1.1  0.0   14556 3260 pts/2  SNs  08:47   0:00              \_ -tcsh
>>>> It stays the same as long as I am logged on to the node.
>>>>
>>>> The job is still listed in qstat.
>>>>
>>>> In the messages of the scheduler I find these hints:
>>>> 12/03/2013 08:52:31|schedu|service0|W|job 18003.1 should have finished since 90s
>>>>
>>>> When I log out afterwards, I see in the messages:
>>>> 12/03/2013 08:58:42|worker|service0|I|removing trigger to terminate job 18003.1
>>>> 12/03/2013 08:58:42|worker|service0|W|job 18003.1 failed on host XY qmaster enforced h_rt, h_cpu, or h_vmem limit because: <unknown reason>
>>>>
>>>> Bert
>>>>
>>>>
>>>>
>>>>> -----Original Message-----
>>>>> From: Reuti [mailto:[email protected]]
>>>>> Sent: Monday, December 02, 2013 6:43 PM
>>>>> To: Wiegers, Bert
>>>>> Cc: [email protected]
>>>>> Subject: Re: [gridengine users] qlogin with ssh
>>>>>
>>>>> Hi,
>>>>>
>>>>> On 02.12.2013 at 18:28, Wiegers, Bert wrote:
>>>>>
>>>>>> we are running SGE 8.1.6.
>>>>>> We have configured some interactive queues and use qlogin with the
>>>>>> wrapper script (... /usr/bin/ssh -Y -p $PORT $HOST).
>>>>>> In our setup the user is forced to use the h_rt variable.
>>>>>> Unfortunately, qlogin does not care if the walltime is overdue.
>>>>>> The shepherd seems to be unable to kill the qlogin sessions when the
>>>>>> user is still connected to the node.
>>>>>> Does anyone have a solution or a workaround for this?
>>>>>
>>>>> Is the `sshd` a child of the `shepherd`, i.e. something like:
>>>>>
>>>>> $ ps -e f
>>>>> ...
>>>>>  6656 ?        Sl    56:23 /usr/sge/bin/lx24-x86/sge_execd
>>>>>  9391 ?        S      0:00  \_ sge_shepherd-10502 -bg
>>>>>  9392 ?        Ss     0:00      \_ sshd: reuti [priv]
>>>>>  9398 ?        S      0:00          \_ sshd: reuti@pts/2
>>>>>  9405 pts/2    Ss     0:00              \_ -bash
>>>>>
>>>>> What does the process tree look like after "h_rt" expired - did the job
>>>>> vanish from the `qstat` too?
>>>>>
>>>>> -- Reuti
>>>>
>>>> _______________________________________________
>>>> users mailing list
>>>> [email protected]
>>>> https://gridengine.org/mailman/listinfo/users
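
A minimal sketch of the point made at the top of this thread, that the session ID passed to `pkill -s` need not match the PID or process group of the job. It only shows how the IDs could be looked up and how a signal could be sent to a whole process group, in the spirit of the `kill -9 -- -$1` default mentioned above. The function names `pgid_of`, `sid_of` and `kill_group` are hypothetical and not part of SGE, and the `sid` output column is assumed to be available, as it is in the Linux procps `ps`.

```shell
#!/bin/sh
# Sketch only: helper functions a custom terminate_method might use.
# All function names are hypothetical; nothing here is shipped with SGE.

# Print the process group ID of a PID (prints nothing if the process is gone).
pgid_of() {
    ps -o pgid= -p "$1" 2>/dev/null | tr -d ' '
}

# Print the session ID of a PID (prints nothing if the process is gone).
# Assumption: this ps supports the "sid" column (true for Linux procps).
sid_of() {
    ps -o sid= -p "$1" 2>/dev/null | tr -d ' '
}

# Send a signal to the whole process group that a PID belongs to.
# Usage: kill_group SIGNAL PID
# Returns 1 without signalling anything if the PID cannot be found.
kill_group() {
    pgid=$(pgid_of "$2")
    [ -n "$pgid" ] || return 1
    # A negative target addresses the process group instead of one process.
    kill -s "$1" -- "-$pgid"
}
```

Comparing `pgid_of "$job_pid"` and `sid_of "$job_pid"` on the execution host, and repeating this for the `sshd` and shell PIDs from the process tree, should show which of these IDs actually covers the login shell before deciding between `pkill -s` and a process group kill such as `kill_group KILL "$job_pid"`.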
