That is something we noticed, as well. The 100% CPU usage, I mean. Is that a known thing?

Tina

On 04/12/13 16:50, Wiegers, Bert wrote:
By the way, on the nodes where I am logged in via qlogin, the CPU usage of the
shepherd is always at 100%.


-----Original Message-----
From: Reuti [mailto:[email protected]]
Sent: Wednesday, December 04, 2013 5:33 PM
To: Wiegers, Bert
Cc: [email protected]
Subject: Re: [gridengine users] qlogin with ssh

On 04.12.2013 at 17:19, Wiegers, Bert wrote:

Hi *,

We are using a qlogin wrapper script, as mentioned below.
It looks as though this setup prevents SGE from reaching the terminate_method.

You defined a custom "terminate_method"? Can you please post it?

-- Reuti


Bert

-----Original Message-----
From: [email protected] [mailto:[email protected]] On Behalf Of Wiegers, Bert
Sent: Tuesday, December 03, 2013 9:01 AM
To: [email protected]
Subject: Re: [gridengine users] qlogin with ssh

Hi Reuti,

The process tree looks like this:
root     20939  0.0  0.0 1242552 5892 ?        Sl   Nov14  18:57 /export/opt/SGE-8.1.6/bin/lx-amd64/sge_execd
root     33874 99.7  0.0  34164  2828 ?        R    08:47   0:22  \_ sge_shepherd-18003 -bg
root     33882  0.0  0.0  98156  3836 pts/1    Ss+  08:47   0:00      \_ sshd: xxxxxx [priv]
xxxxxx 33884  0.0  0.0  98156  2044 pts/1    S+   08:47   0:00          \_ sshd: xxxxxx@pts/2
xxxxxx 33885  1.1  0.0  14556  3260 pts/2    SNs  08:47   0:00              \_ -tcsh
It stays the same for as long as I am logged on to the node.
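
For anyone wanting to check for the same symptom, here is a minimal sketch that picks sge_shepherd processes with high CPU usage out of ps output. The function name busy_shepherds and the default threshold of 90% are made up for illustration and are not part of Grid Engine; the ps invocation in the usage comment assumes the procps ps found on Linux.

```shell
#!/bin/sh
# Hypothetical helper: list sge_shepherd processes above a CPU threshold.
# Reads lines of the form "PID %CPU COMMAND [ARGS...]" on stdin and
# prints "PID %CPU COMMAND" for every matching shepherd.
busy_shepherds() {
    threshold="${1:-90}"   # illustrative default, not an SGE setting
    awk -v limit="$threshold" \
        '$3 ~ /^sge_shepherd/ && $2 + 0 > limit + 0 { print $1, $2, $3 }'
}

# Possible usage on an execution host (procps ps assumed):
#   ps -eo pid=,pcpu=,args= | busy_shepherds 90
```

Fed with the process list shown above, this would report only PID 33874, the shepherd of job 18003.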

The job is still listed in qstat.

In the scheduler messages I find this hint:
12/03/2013 08:52:31|schedu|service0|W|job 18003.1 should have finished since 90s

When I log out afterwards, I see this in the messages:
12/03/2013 08:58:42|worker|service0|I|removing trigger to terminate job 18003.1
12/03/2013 08:58:42|worker|service0|W|job 18003.1 failed on host XY qmaster enforced h_rt, h_cpu, or h_vmem limit because: <unknown reason>
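
To pull such entries for a single job out of the qmaster messages file, something like the following might help. It is only a sketch: job_messages is a hypothetical helper, and it assumes the pipe-separated layout seen in the lines above (timestamp|thread|host|level|text). The file location in the usage comment assumes the default spool directory and may differ on your installation.

```shell
#!/bin/sh
# Hypothetical helper: print "timestamp level text" for every messages
# line that mentions the given job.task id (for example 18003.1).
# A space is appended to the text field so that an id at the very end
# of a line still matches, while 18003.10 does not match 18003.1.
job_messages() {
    awk -F'|' -v job="$1" \
        'index($5 " ", "job " job " ") { print $1, $4, $5 }'
}

# Possible usage (default spool location assumed):
#   job_messages 18003.1 < "$SGE_ROOT/$SGE_CELL/spool/qmaster/messages"
```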

Bert



-----Original Message-----
From: Reuti [mailto:[email protected]]
Sent: Monday, December 02, 2013 6:43 PM
To: Wiegers, Bert
Cc: [email protected]
Subject: Re: [gridengine users] qlogin with ssh

Hi,

On 02.12.2013 at 18:28, Wiegers, Bert wrote:

We are running SGE 8.1.6.
We have configured some interactive queues and use qlogin with the
wrapper script (... /usr/bin/ssh -Y -p $PORT $HOST).
In our setup the user is forced to specify an h_rt limit.
Unfortunately, qlogin does not care if the walltime has been exceeded.
The shepherd seems to be unable to kill the qlogin sessions while the
user is still connected to the node.
Does anyone have a solution or a workaround for this?
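
For readers unfamiliar with this kind of setup, a wrapper along these lines is a common pattern. This is a sketch only and not the poster's actual script, whose remaining lines are elided above. It assumes that Grid Engine starts the configured qlogin_command with the target host and port as its two arguments; the function name qlogin_ssh and the QLOGIN_DRY_RUN variable are invented here so the command line can be inspected without opening a connection.

```shell
#!/bin/sh
# Hypothetical qlogin wrapper sketch (not the script used in this thread).
# Assumption: it is called as "<wrapper> HOST PORT" via qlogin_command.
qlogin_ssh() {
    HOST=$1
    PORT=$2
    # With the made-up variable QLOGIN_DRY_RUN set, the ssh command line
    # is only printed; otherwise ssh is started with X11 forwarding (-Y).
    ${QLOGIN_DRY_RUN:+echo} /usr/bin/ssh -Y -p "$PORT" "$HOST"
}

# Connect only when both arguments were supplied.
if [ "$#" -ge 2 ]; then
    qlogin_ssh "$1" "$2"
fi
```

With such a wrapper the interactive session runs under an sshd started on the execution host, which is why the replies below ask whether that sshd appears as a child of the shepherd.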

Is the `sshd` a child of the `shepherd`, i.e. something like:

$ ps -e f
...
6656 ?        Sl    56:23 /usr/sge/bin/lx24-x86/sge_execd
9391 ?        S      0:00  \_ sge_shepherd-10502 -bg
9392 ?        Ss     0:00      \_ sshd: reuti [priv]
9398 ?        S      0:00          \_ sshd: reuti@pts/2
9405 pts/2    Ss     0:00              \_ -bash

What does the process tree look like after "h_rt" has expired? Did the job vanish
from the `qstat` output too?

-- Reuti

_______________________________________________
users mailing list
[email protected]
https://gridengine.org/mailman/listinfo/users

--
Tina Friedrich, Computer Systems Administrator, Diamond Light Source Ltd
Diamond House, Harwell Science and Innovation Campus - 01235 77 8442




