That is something we noticed, as well. The 100% CPU usage, I mean. Is
that a known thing?
Tina
On 04/12/13 16:50, Wiegers, Bert wrote:
By the way: on the nodes where I am logged in via qlogin, the CPU usage of the
shepherd is always at 100%.
-----Original Message-----
From: Reuti [mailto:[email protected]]
Sent: Wednesday, December 04, 2013 5:33 PM
To: Wiegers, Bert
Cc: [email protected]
Subject: Re: [gridengine users] qlogin with ssh
On 04.12.2013 at 17:19, Wiegers, Bert wrote:
Hi *,
we are using a qlogin wrapper script, as mentioned below.
It looks like this setup prevents SGE from reaching the terminate_method.
You defined a custom "terminate_method"? Can you please post it?
-- Reuti
Bert
-----Original Message-----
From: [email protected] [mailto:[email protected]] On Behalf Of Wiegers, Bert
Sent: Tuesday, December 03, 2013 9:01 AM
To: [email protected]
Subject: Re: [gridengine users] qlogin with ssh
Hi Reuti,
The process tree looks like this:
root     20939  0.0  0.0 1242552 5892 ?     Sl   Nov14 18:57 /export/opt/SGE-8.1.6/bin/lx-amd64/sge_execd
root     33874 99.7  0.0   34164 2828 ?     R    08:47  0:22  \_ sge_shepherd-18003 -bg
root     33882  0.0  0.0   98156 3836 pts/1 Ss+  08:47  0:00      \_ sshd: xxxxxx [priv]
xxxxxx   33884  0.0  0.0   98156 2044 pts/1 S+   08:47  0:00          \_ sshd: xxxxxx@pts/2
xxxxxx   33885  1.1  0.0   14556 3260 pts/2 SNs  08:47  0:00              \_ -tcsh
It stays the same as long as I am logged in to the node.
The job is still listed in qstat.
In the messages of the scheduler I find this hint:
12/03/2013 08:52:31|schedu|service0|W|job 18003.1 should have finished since 90s
When I log out afterwards, I see this in the messages:
12/03/2013 08:58:42|worker|service0|I|removing trigger to terminate job 18003.1
12/03/2013 08:58:42|worker|service0|W|job 18003.1 failed on host XY qmaster enforced h_rt, h_cpu, or h_vmem limit because: <unknown reason>
Bert
-----Original Message-----
From: Reuti [mailto:[email protected]]
Sent: Monday, December 02, 2013 6:43 PM
To: Wiegers, Bert
Cc: [email protected]
Subject: Re: [gridengine users] qlogin with ssh
Hi,
On 02.12.2013 at 18:28, Wiegers, Bert wrote:
We are running SGE 8.1.6.
We have configured some interactive queues and use qlogin with the
wrapper script (... /usr/bin/ssh -Y -p $PORT $HOST).
In our setup the user is forced to request h_rt.
Unfortunately, qlogin does not care if the walltime has been exceeded.
The shepherd seems to be unable to kill the qlogin sessions while the
user is still connected to the node.
Does anyone have a solution or a workaround for this?
Is the `sshd` a child of the `shepherd`, i.e. something like:
$ ps -e f
...
 6656 ?        Sl    56:23 /usr/sge/bin/lx24-x86/sge_execd
 9391 ?        S      0:00  \_ sge_shepherd-10502 -bg
 9392 ?        Ss     0:00      \_ sshd: reuti [priv]
 9398 ?        S      0:00          \_ sshd: reuti@pts/2
 9405 pts/2    Ss     0:00              \_ -bash
What does the process tree look like after "h_rt" has expired? Did the job
vanish from `qstat` too?
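For anyone who wants to run this parent/child check on several nodes without reading the tree by eye, here is a minimal sketch in Python. It assumes the input is the text produced by `ps -eo pid,ppid,args`; the helper names `parse_ps` and `shepherd_descendants` are made up for illustration, and the sample text reuses the PIDs from the example tree above with parent PIDs inferred from its nesting. It is not part of SGE.

```python
from collections import defaultdict

# Illustrative sample in the layout of `ps -eo pid,ppid,args`.
# PIDs come from the example tree; the PPID column is inferred from the nesting.
SAMPLE_PS_OUTPUT = """\
  PID  PPID COMMAND
 6656     1 /usr/sge/bin/lx24-x86/sge_execd
 9391  6656 sge_shepherd-10502 -bg
 9392  9391 sshd: reuti [priv]
 9398  9392 sshd: reuti@pts/2
 9405  9398 -bash
"""

def parse_ps(ps_text):
    """Parse `ps -eo pid,ppid,args` text into {pid: (ppid, args)}."""
    procs = {}
    for line in ps_text.strip().splitlines():
        parts = line.split(None, 2)
        # Skip the header line and anything that is not "pid ppid args".
        if len(parts) < 3 or not (parts[0].isdigit() and parts[1].isdigit()):
            continue
        procs[int(parts[0])] = (int(parts[1]), parts[2])
    return procs

def shepherd_descendants(procs):
    """Return {shepherd_pid: sorted list of all descendant PIDs}."""
    children = defaultdict(list)
    for pid, (ppid, _args) in procs.items():
        children[ppid].append(pid)

    result = {}
    for pid, (_ppid, args) in procs.items():
        if not args.startswith("sge_shepherd"):
            continue
        # Walk the subtree below this shepherd.
        stack, found = list(children[pid]), []
        while stack:
            child = stack.pop()
            found.append(child)
            stack.extend(children[child])
        result[pid] = sorted(found)
    return result

if __name__ == "__main__":
    procs = parse_ps(SAMPLE_PS_OUTPUT)
    for shepherd, pids in shepherd_descendants(procs).items():
        print(shepherd, [procs[p][1] for p in pids])
```

If the `sshd` processes of the qlogin session show up in the list for the job's shepherd, the session is under the shepherd; if the list is empty, the `sshd` was started outside of it.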
-- Reuti
_______________________________________________
users mailing list
[email protected]
https://gridengine.org/mailman/listinfo/users
--
Tina Friedrich, Computer Systems Administrator, Diamond Light Source Ltd
Diamond House, Harwell Science and Innovation Campus - 01235 77 8442