Hi Folks,
New to the list! I am the sysadmin of an HPC cluster using SGE 8.1.8. The
cluster has 100+ nodes running CentOS 7 with a shared DDN storage cluster
configured as a GPFS device and a number of NFS mounts to a CentOS 7 server.
Some of my users are reporting problems with qsub that have
-discuss-boun...@liverpool.ac.uk] on behalf of
juanesteban.jime...@mdc-berlin.de [juanesteban.jime...@mdc-berlin.de]
Sent: Wednesday, February 01, 2017 23:56
To: sge-discuss@liv.ac.uk
Subject: [SGE-discuss] qsub permission denied
Hi Folks,
New to the list! I am the sysadmin of an HPC cluster using SGE
k] on behalf of
juanesteban.jime...@mdc-berlin.de [juanesteban.jime...@mdc-berlin.de]
Sent: Thursday, February 02, 2017 00:19
To: sge-discuss@liv.ac.uk
Subject: Re: [SGE-discuss] qsub permission denied
In addition, if I then qmod -cj xxx,xxx, etc., the jobs run fine. If the users
throttles th
Today I did some more testing and the problem appears to be specific to GPFS.
I changed the script to put the logs in a folder on an NFS share and *without*
the throttling, there are no errors.
Juan
On 02/02/2017, 00:23, "SGE-discuss on behalf of
juanesteban.jime...@mdc-berlin.de"
49 30 9406 2800
From: SGE-discuss [sge-discuss-boun...@liverpool.ac.uk] on behalf of
juanesteban.jime...@mdc-berlin.de [juanesteban.jime...@mdc-berlin.de]
Sent: Thursday, February 02, 2017 00:23
To: sge-discuss@liv.ac.uk
Subject: Re: [SGE-discuss] qsub permission deni
uling step
(probably requires a change to GE)
- Start all tasks with a small sequential delay
(probably requires a change to GE)
Thanks,
Stuart Barkley
On Thu, 9 Feb 2017 at 02:47 -, juanesteban.jime...@mdc-berlin.de wrote:
> Date: Thu, 9 Feb 2017 02:47:44
> From: "juanesteban.
I can get it to fail with just 478 or so attempts to stat() a directory in
GPFS, so yes, I think you're running into the same wall I did. FWIW, simply
moving that activity to NFS or a local mount takes care of it.
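
A minimal sketch for probing the stat() behaviour described above, assuming Python 3 is available on the node. The function name `stat_burst` is hypothetical, and the default of 478 attempts is simply the figure quoted in this message. It calls os.stat() repeatedly on one directory and records every call that raises OSError, so the same script can be pointed at a GPFS path and then at an NFS or local path for comparison. Sequential calls from a single process may not reproduce the failure if it depends on many tasks hitting the directory at once.

```python
import os
import sys
import time


def stat_burst(path, attempts=478):
    """Call os.stat() on `path` `attempts` times; return (failures, seconds)."""
    failures = []
    start = time.perf_counter()
    for i in range(attempts):
        try:
            os.stat(path)
        except OSError as exc:
            # Record which attempt failed and why (e.g. EACCES, ESTALE).
            failures.append((i, exc.errno, exc.strerror))
    return failures, time.perf_counter() - start


if __name__ == "__main__":
    # The directory to probe is taken from the command line; "." is a fallback.
    target = sys.argv[1] if len(sys.argv) > 1 else "."
    failures, elapsed = stat_burst(target)
    print(f"{target}: {len(failures)} failed stat() calls in {elapsed:.3f}s")
    for index, errno_, message in failures[:10]:
        print(f"  attempt {index}: errno {errno_} ({message})")
```

Running it once against a directory on GPFS and once against a directory on NFS gives a like-for-like count of failed calls.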
Mfg,
Juan Jimenez
System Administrator, HPC
MDC Berlin / IT-Dept.
Tel.: +49 30 940
Hi folks,
Has anyone developed some way of rewarding/penalizing users who request
resources that generally match what their jobs use, as opposed to those who
generally request many more resources than their jobs need? I would think that
a standard epilog script ought to be able to record this i
Hi folks,
I just ran into my first episode of the scheduler crashing because of too many
submitted jobs. It pegged memory usage to as much as I could give it (12 GB at
one point) and still crashed while it tried to work its way through the stack.
I need to figure out how to size a box properly f
+, juanesteban.jime...@mdc-berlin.de
wrote:
>Hi folks,
>
>I just ran into my first episode of the scheduler crashing because of too many
>submitted jobs. It pegged memory usage to as much as I could give it (12 GB at
>one point) and still crashed while it tried to work its
From: SGE-discuss [sge-discuss-boun...@liverpool.ac.uk] on behalf of
juanesteban.jime...@mdc-berlin.de [juanesteban.jime...@mdc-berlin.de]
Sent: Tuesday, March 21, 2017 09:41
To: Jesse Becker
Cc: SGE-discuss@liv.ac.uk
Subject: Re: [SGE-discuss] Sizing the qmaster
>The "size" of job metadata (scripts, ENV, etc) doesn't really affect
>the RAM usage appreciably that I've seen. We routinely have jobs with
>ENVs of almost 4k or more, and it's never been a problem. The
>"data" processed by jobs isn't a factor in qmaster RAM usage, so far as
>I kn
I have a lot of problems with AD, Kerberos, SSSD, LDAP and GridEngine, but I
think it is related to the fact that I connect to AD servers that do not
synchronize with the master quickly enough. Once in a while I have to clear
the SSSD cache and restart the SSSD services on all the nodes, and un
rch 21, 2017 17:19
To: Jimenez, Juan Esteban
Cc: Jesse Becker; SGE-discuss@liv.ac.uk
Subject: Re: [SGE-discuss] Sizing the qmaster
> Am 21.03.2017 um 16:15 schrieb juanesteban.jime...@mdc-berlin.de:
>
>> The "size" of job metadata (scripts, ENV, etc) doesn't really
uti [re...@staff.uni-marburg.de]
Sent: Sunday, April 09, 2017 17:09
To: Jimenez, Juan Esteban
Cc: Jesse Becker; SGE-discuss@liv.ac.uk
Subject: Re: [SGE-discuss] Sizing the qmaster
Hi,
Am 09.04.2017 um 12:38 schrieb juanesteban.jime...@mdc-berlin.de:
> Upda
cuss-boun...@liverpool.ac.uk] on behalf of
juanesteban.jime...@mdc-berlin.de [juanesteban.jime...@mdc-berlin.de]
Sent: Sunday, April 09, 2017 12:28
To: Orion Poplawski; sge-disc...@liverpool.ac.uk
Subject: Re: [SGE-discuss] Kerberos authentication
I have a lot of problems with AD, Kerberos, SSSD,
12.04.17, 10:21, "William Hay" wrote:
On Tue, Apr 11, 2017 at 05:11:58PM +, juanesteban.jime...@mdc-berlin.de
wrote:
> I've got a serious problem here with authentication with AD and
Kerberos. I have already done away with all the possibilities I can think of
outside
m Hay" wrote:
On Wed, Apr 12, 2017 at 12:33:07PM +, juanesteban.jime...@mdc-berlin.de
wrote:
> We're still in the same boat.
>
> What I am trying to figure out is why QRSH is looking for any password in
the first place when the system is configured t
Behalf
> Of juanesteban.jime...@mdc-berlin.de
> Sent: Wednesday, April 12, 2017 5:15 PM
> To: William Hay
> Cc: SGE-discuss@liv.ac.uk
> Subject: Re: [SGE-discuss] Kerberos authentication
>
> The problem is that GridEngine doesn’t tell me the context of th
I am running SGE on nodes with both 7.1 and 7.3. Works fine on both.
Just make sure that if you are using Active Directory/Kerberos for
authentication and authorization, your DCs are capable of handling a lot of
traffic/requests. If not, things like DRMAA will uncover any shortcomings.
Mfg,
Ju
ectory.
For centralized cluster user management I'm using NIS.
Will NIS work fine with SGE?
Thanks & Regards
Yasir Israr
-----Original Message-----
From: juanesteban.jime...@mdc-berlin.de
[mailto:juanesteban.jime...@mdc-berlin.de]
Sent: 27 April
Thanks & Regards
Yasir Israr
-----Original Message-----
From: juanesteban.jime...@mdc-berlin.de
[mailto:juanesteban.jime...@mdc-berlin.de]
Sent: 27 April 2017 04:00 PM
To: ya...@orionsolutions.co.in; 'Maximilian Friedersdorff';
sge-disc...@liverpool.ac.
Has anyone ever managed to tie permission to use a resource like GPUs on a
node to membership in an Active Directory and/or Linux group?
Mfg,
Juan Jimenez
System Administrator, BIH HPC Cluster
MDC Berlin / IT-Dept.
Tel.: +49 30 9406 2800
___
SGE-discu
In our cluster we have one node with two Nvidia GPUs. I have been trying to
figure out how to set them up as consumable resources tied to an ACL, but I
can't get SGE to handle them correctly. It always says the resource is not
available.
Can someone walk me through the steps required to set thi
...@liverpool.ac.uk
Subject: Re: [SGE-discuss] Tying resource use to AD/Linux groups
Hi Juan,
On 16 May 2017 at 12:32, juanesteban.jime...@mdc-berlin.de wrote:
Has anyone ever managed to tie permiss
I don't know what that means.
On Tue, May 16, 2017 at 10:10 PM +0200, "Reuti" <re...@staff.uni-marburg.de> wrote:
Am 16.05.2017 um 22:07 schrieb juanesteban.j
told that this is not recommended. ??
Mfg,
Juan Jimenez
System Administrator, BIH HPC Cluster
MDC Berlin / IT-Dept.
Tel.: +49 30 9406 2800
On 17.05.17, 09:44, "William Hay" wrote:
On Tue, May 16, 2017 at 08:07:15PM +, juanesteban.jime...@mdc-berlin.de
wrote:
> In our cl
E-discuss] GPUs as a resource
Hi,
For GPU integration we are using a solution like this one:
https://github.com/kyamagu/sge-gpuprolog
Regards
On Thu, May 18, 2017 at 12:37 PM,
juanesteban.jime...@mdc-berlin.de
So, I now have a working gpu.q. However, users in the ACL eat up slots even if
they have not requested a GPU resource. How do I keep out jobs that do not
specifically request a GPU? I only want jobs to run on that queue/node if they
want to use one of the two GPUs.
Thanks!
Juan
...@staff.uni-marburg.de]
Sent: Friday, May 19, 2017 14:20
To: Jimenez, Juan Esteban
Cc: Kamel Mazouzi; SGE-discuss@liv.ac.uk
Subject: Re: [SGE-discuss] GPUs as a resource
Hi,
> Am 18.05.2017 um 14:15 schrieb juanesteban.jime...@mdc-berlin.de:
>
> I tried it according to the instructions, but it w
> You are being told by who or what? If it is a what then the exact message is
> helpful?
By my colleagues who are running a second cluster using Univa GridEngine. This
was a warning from Univa not to do it that way because it increases qmaster
workload.
Juan
__
> As William mentions below: are these nodes exclusively reserved for dedicated
> users, or should other users be able to use them, but not the GPU?
One node, reserved for a user list. The problem is that those users run non-gpu
jobs and some of the jobs get put into the gpu node even though the
...@staff.uni-marburg.de]
Sent: Friday, May 19, 2017 16:37
To: Jimenez, Juan Esteban
Cc: Kamel Mazouzi; SGE-discuss@liv.ac.uk
Subject: Re: [SGE-discuss] GPUs as a resource
> Am 19.05.2017 um 16:33 schrieb juanesteban.jime...@mdc-berlin.de:
>
> I put them in /opt/sge/default/common/sge-gpuprolog.
> It does indeed but not by a whole lot for a queue on a couple of nodes.
> Since you want to reserve these nodes for GPU users then the extra queue is
> needless.
> I suggest:
> 1. Make the GPU complex FORCED (so users who don't request a gpu can't end up
> on a node with gpus).
> 2. Define the
Esteban
Cc: William Hay; SGE-discuss@liv.ac.uk
Subject: Re: [SGE-discuss] GPUs as a resource
> Am 19.05.2017 um 16:35 schrieb juanesteban.jime...@mdc-berlin.de:
>
>> You are being told by who or what? If it is a what then the exact message
>> is helpful?
>
> By my colleagu
:12:42 schrieb "juanesteban.jime...@mdc-berlin.de"
:
> I am just telling you what my colleagues say they were told by Univa.
>
> Mfg,
> Juan Jimenez
> System Administrator, HPC
> MDC Berlin / IT-Dept.
> Tel.: +49 30 9406 2800
>
>
> __
Out of the blue, qrsh refuses to work. It will not allow me to connect to any
node, but ssh works fine from all nodes to all nodes. ???
[jjimene@med-login2 bin]$ qrsh -verbose
Your job 4909901 ("QRLOGIN") has been submitted
waiting for interactive job to be scheduled ...
Your interactive job 4909
BTW, I did this to try to troubleshoot this, in qconf -mconf
rsh_command /usr/bin/ssh -Y -A -
But where does qrsh put the result of the - option?
Mfg,
Juan Jimenez
System Administrator, HPC
MDC Berlin / IT-Dept.
Tel.: +49 30 9406 2800
__
,
Juan Jimenez
System Administrator, BIH HPC Cluster
MDC Berlin / IT-Dept.
Tel.: +49 30 9406 2800
On 29.05.17, 16:39, "Reuti" wrote:
> Am 29.05.2017 um 16:08 schrieb juanesteban.jime...@mdc-berlin.de:
>
> Out of the blue, qrsh refuses to work. It will not all
On 29.05.17, 17:56, "Reuti" wrote:
> Am 29.05.2017 um 17:26 schrieb juanesteban.jime...@mdc-berlin.de:
>
> I am getting this very specific error:
>
> debug1: ssh_exchange_identification: /usr/sbin/sshd: error while loading
shared libraries: l
: Jimenez, Juan Esteban
Cc: SGE-discuss@liv.ac.uk
Subject: Re: [SGE-discuss] Another QRSH problem
> Am 29.05.2017 um 18:00 schrieb juanesteban.jime...@mdc-berlin.de:
>
> On 29.05.17, 17:56, "Reuti" wrote:
>
>
>> Am 29.05.2017 um 17:26 schrieb juanesteban.jime...@md
Administrator, BIH HPC Cluster
MDC Berlin / IT-Dept.
Tel.: +49 30 9406 2800
On 29.05.17, 19:45, "SGE-discuss on behalf of
juanesteban.jime...@mdc-berlin.de" wrote:
How does the shepherd bring up this separate sshd daemon? What arguments are
being used?
Mfg,
Juan Jimenez
From: Reuti [re...@staff.uni-marburg.de]
Sent: Tuesday, May 30, 2017 11:36
To: Jimenez, Juan Esteban
Cc: SGE-discuss@liv.ac.uk
Subject: Re: [SGE-discuss] Another QRSH problem
> Am 30.05.2017 um 11:32 schrieb juanesteban.jime...@mdc-berlin.de:
>
>
means nothing can actually start as every malloc() will fail with ENOMEM.
Simple.
>-Original Message-
>From: SGE-discuss [mailto:sge-discuss-boun...@liverpool.ac.uk] On Behalf Of
>juanesteban.jime...@mdc-berlin.de
>Sent: Thursday, June 01, 2017 9:49 AM
>To: Reuti
>Cc: SGE
Where should I start looking to resolve this? I've got a user complaining about
this, even though I told him the util is more for the installation of the
daemon, and that he should be using hostname instead:
$ /opt/sge/utilbin/lx-amd64/gethostname
error resolving local host: can't resolve h
A daemon -is- a process...
Mfg,
Juan Jimenez
System Administrator, HPC
MDC Berlin / IT-Dept.
Tel.: +49 30 9406 2800
From: SGE-discuss [sge-discuss-boun...@liverpool.ac.uk] on behalf of Mukesh
Chawla [mukesh.mnni...@gmail.com]
Sent: Sunday, June 04, 2017
4, 2017 8:08 PM,
"juanesteban.jime...@mdc-berlin.de"
wrote:
A daemon -is- a process...
Mfg,
Juan Jimenez
System Administrator, HPC
MDC Berlin / IT-Dept.
Tel.: +49 30 9406 2800
_
Hi folks,
I modified the SGE config in this manner:
execd_params S_DESCRIPTORS=16384 H_DESCRIPTORS=16384 \
S_MAXPROC=32768 H_MAXPROC=32768
However, this doesn’t seem to apply to qrsh sessions, only to job submissions.
How do I make it apply to qrsh a
Esteban
Cc: sge-disc...@liverpool.ac.uk
Subject: Re: [SGE-discuss] Ulimit -u in qrsh
Are the system's limits in effect for these login sessions? They could be
lower. Do the system's limits match these settings?
-- Reuti
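
One way to answer that question is to print the limits from inside each kind of session and compare them. The following is a minimal sketch, assuming Python 3 on a Linux node; the helper names `describe_limits` and `fmt` are made up for illustration. It reads the open-file and process-count limits (the values `ulimit -n` and `ulimit -u` report) through the standard `resource` module. The assumption here is that these are the limits the S_DESCRIPTORS/H_DESCRIPTORS and S_MAXPROC/H_MAXPROC settings are meant to control.

```python
import resource


def describe_limits():
    """Return {name: (soft, hard)} for the open-file and process-count limits."""
    limits = {}
    for name in ("RLIMIT_NOFILE", "RLIMIT_NPROC"):
        soft, hard = resource.getrlimit(getattr(resource, name))
        limits[name] = (soft, hard)
    return limits


def fmt(value):
    """Render a limit value, showing RLIM_INFINITY as 'unlimited'."""
    return "unlimited" if value == resource.RLIM_INFINITY else str(value)


if __name__ == "__main__":
    for name, (soft, hard) in describe_limits().items():
        print(f"{name}: soft={fmt(soft)} hard={fmt(hard)}")
```

Running the same script in a plain ssh login, in a qrsh session, and in a batch job shows which of the three environments is applying the lower value.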
> Am 09.06.2017 um 14:02 schrieb "juanesteban.jim
Esteban
Cc: Reuti; sge-disc...@liverpool.ac.uk
Subject: Re: [SGE-discuss] Ulimit -u in qrsh
On Fri, Jun 09, 2017 at 07:11:30PM +, juanesteban.jime...@mdc-berlin.de
wrote:
> qrsh is changing ulimit -u to 4096 no matter what value I set it to.
What limit do you get if you log directly into
I’ve got a problem with my qmaster. It is running but is unresponsive to
commands like qstat. The process status is mostly D for disk sleep, and when I
run it in non-daemon debug mode it spends a LOT of time reading the
Master_Job_List.
Any clues?
Mfg,
Juan Jimenez
System Administrator, BIH HP
existing data in /opt/sge?
Mfg,
Juan Jimenez
System Administrator, BIH HPC Cluster
MDC Berlin / IT-Dept.
Tel.: +49 30 9406 2800
On 27.06.17, 10:04, "SGE-discuss on behalf of
juanesteban.jime...@mdc-berlin.de" wrote:
I’ve got a problem with my qmaster. It is running but is unres
2800
On 27.06.17, 10:41, "William Hay" wrote:
On Tue, Jun 27, 2017 at 08:30:55AM +, juanesteban.jime...@mdc-berlin.de
wrote:
> Never mind. One of my users submitted a job with 139k subjobs.
>
> A few other questions:
>
> 1) Is it possible
, this will reset the master job list and give me back
control?
Mfg,
Juan Jimenez
System Administrator, BIH HPC Cluster
MDC Berlin / IT-Dept.
Tel.: +49 30 9406 2800
On 27.06.17, 11:12, "William Hay" wrote:
On Tue, Jun 27, 2017 at 08:44:30AM +, juanesteban.jime...@mdc-berlin
. Holding -everything- in memory is simply
not a good idea.
Mfg,
Juan Jimenez
System Administrator, BIH HPC Cluster
MDC Berlin / IT-Dept.
Tel.: +49 30 9406 2800
On 28.06.17, 10:30, "William Hay" wrote:
On Tue, Jun 27, 2017 at 02:40:18PM +, juanesteban.jime...@mdc-berlin.de
wrot
28.06.17, 12:12, "William Hay" wrote:
On Wed, Jun 28, 2017 at 08:35:52AM +, juanesteban.jime...@mdc-berlin.de
wrote:
> I figured it would complain if I did that live so I did shut it down
first. Good advice anyway.
>
> It wasn't one particular job. One use
ed, Jun 28, 2017 at 10:22:12AM +, juanesteban.jime...@mdc-berlin.de
wrote:
> Correct again. I have opened a ticket to move the qmaster from a VM to
its own full blade and I turned off schedd_job_info. Thanks again. :)
>
> I tried education. Doesn't always work.
jobs. Schedd_job_info is already false.
Mfg,
Juan Jimenez
System Administrator, BIH HPC Cluster
MDC Berlin / IT-Dept.
Tel.: +49 30 9406 2800
On 29.06.17, 15:47, "Mark Dixon" wrote:
On Tue, 27 Jun 2017, juanesteban.jime...@mdc-berlin.de wrote:
> Never mind. On