Re: [SGE-discuss] Qmaster unresponsive, process status "disk sleep"

2017-06-29 Thread juanesteban.jime...@mdc-berlin.de
jobs. Schedd_job_info is already false. Mfg, Juan Jimenez System Administrator, BIH HPC Cluster MDC Berlin / IT-Dept. Tel.: +49 30 9406 2800 On 29.06.17, 15:47, "Mark Dixon" <m.c.di...@leeds.ac.uk> wrote: On Tue, 27 Jun 2017, juanesteban.jime...@mdc-berlin.de wrote:
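
For reference, schedd_job_info is a scheduler configuration parameter; a minimal sketch of checking and disabling it, assuming standard qconf access on the qmaster host:

    qconf -ssconf | grep schedd_job_info   # show the current value
    qconf -msconf                          # edit and set: schedd_job_info false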

Re: [SGE-discuss] Qmaster unresponsive, process status "disk sleep"

2017-06-28 Thread juanesteban.jime...@mdc-berlin.de
On 28.06.17, 12:12, "William Hay" <w@ucl.ac.uk> wrote: On Wed, Jun 28, 2017 at 08:35:52AM +, juanesteban.jime...@mdc-berlin.de wrote: > I figured it would complain if I did that live so I did shut it down first. Good advice anyway. > > It wasn't one

Re: [SGE-discuss] Qmaster unresponsive, process status "disk sleep"

2017-06-27 Thread juanesteban.jime...@mdc-berlin.de
, this will reset the master job list and give me back control? Mfg, Juan Jimenez System Administrator, BIH HPC Cluster MDC Berlin / IT-Dept. Tel.: +49 30 9406 2800 On 27.06.17, 11:12, "William Hay" <w@ucl.ac.uk> wrote: On Tue, Jun 27, 2017 at 08:44:30AM +, juanest

Re: [SGE-discuss] Qmaster unresponsive, process status "disk sleep"

2017-06-27 Thread juanesteban.jime...@mdc-berlin.de
2800 On 27.06.17, 10:41, "William Hay" <w@ucl.ac.uk> wrote: On Tue, Jun 27, 2017 at 08:30:55AM +, juanesteban.jime...@mdc-berlin.de wrote: > Never mind. One of my users submitted a job with 139k subjobs. > > A few other questions: >
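
This thread concerns an array job with 139k tasks. A hedged sketch of the global configuration parameters that cap job and array-task counts (the values shown are illustrative, not recommendations):

    qconf -sconf | grep -E 'max_(u_jobs|aj_tasks|aj_instances)'   # current limits
    qconf -mconf    # edit the global configuration, e.g.:
                    #   max_aj_tasks  75000   # tasks allowed per array job
                    #   max_u_jobs    20000   # jobs per user (0 = unlimited)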

Re: [SGE-discuss] Qmaster unresponsive, process status "disk sleep"

2017-06-27 Thread juanesteban.jime...@mdc-berlin.de
the existing data in /opt/sge? Mfg, Juan Jimenez System Administrator, BIH HPC Cluster MDC Berlin / IT-Dept. Tel.: +49 30 9406 2800 On 27.06.17, 10:04, "SGE-discuss on behalf of juanesteban.jime...@mdc-berlin.de" <sge-discuss-boun...@liverpool.ac.uk on behalf of juanesteban.jime...@

[SGE-discuss] Qmaster unresponsive, process status "disk sleep"

2017-06-27 Thread juanesteban.jime...@mdc-berlin.de
I’ve got a problem with my qmaster. It is running but is unresponsive to commands like qstat. The process status is mostly D for disk sleep, and when I run it in non-daemon debug mode it spends a LOT of time reading the Master_Job_List. Any clues? Mfg, Juan Jimenez System Administrator, BIH
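
A minimal, SGE-agnostic sketch of confirming the "D" (uninterruptible sleep) state and the kernel wait channel with standard procps tools:

    ps -o pid,stat,wchan:32,cmd -C sge_qmaster
    # a persistent D state with an I/O-related wchan points at slow storage
    # under the qmaster spool directory rather than at sge_qmaster itself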

Re: [SGE-discuss] Ulimit -u in qrsh

2017-06-09 Thread juanesteban.jime...@mdc-berlin.de
Esteban Cc: sge-disc...@liverpool.ac.uk Subject: Re: [SGE-discuss] Ulimit -u in qrsh Are the system's limits in effect for these login sessions, which could be lower. Do the system's limits match these settings? -- Reuti > Am 09.06.2017 um 14:02 schrieb "juanesteban.jime...@mdc-b
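
The question here is whether the system's limits, which may be lower, apply inside the session. A small sketch comparing the two, assuming bash is available on the execution host:

    ulimit -a                     # limits in a local login shell
    qrsh bash -lc 'ulimit -a'     # limits inside a qrsh session (ulimit is a
                                  # shell builtin, hence the wrapper shell)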

Re: [SGE-discuss] Can SGE exec be run as a process and not daemon

2017-06-04 Thread juanesteban.jime...@mdc-berlin.de
17 8:08 PM, "juanesteban.jime...@mdc-berlin.de<mailto:juanesteban.jime...@mdc-berlin.de>" <juanesteban.jime...@mdc-berlin.de<mailto:juanesteban.jime...@mdc-berlin.de>> wrote: A daemon -is- a process... Mfg, Juan Jimenez System Administrator, HPC MDC Be

[SGE-discuss] Why would gethostname fail?

2017-06-03 Thread juanesteban.jime...@mdc-berlin.de
Where should I start looking to resolve this? I've got a user complaining about this, even though I told him the util is more for the installation of the daemon, and that he should be using hostname instead: $ /opt/sge/utilbin/lx-amd64/gethostname error resolving local host: can't resolve
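
A hedged sketch of generic name-resolution checks on the affected host (plain glibc/Linux tools, not SGE-specific):

    hostname                          # the name the host reports for itself
    getent hosts "$(hostname)"        # does NSS (files/DNS) resolve that name?
    grep ^hosts: /etc/nsswitch.conf   # which sources are consulted, in which order
    # if getent fails here, /opt/sge/utilbin/lx-amd64/gethostname fails the same way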

Re: [SGE-discuss] Another QRSH problem

2017-06-01 Thread juanesteban.jime...@mdc-berlin.de
means nothing can actually start as every malloc() will return E_NOMEM. Simple. >-Original Message- >From: SGE-discuss [mailto:sge-discuss-boun...@liverpool.ac.uk] On Behalf Of >juanesteban.jime...@mdc-berlin.de >Sent: Thursday, June 01, 2017 9:49 AM >To: Reuti <re...@st

Re: [SGE-discuss] Another QRSH problem

2017-06-01 Thread juanesteban.jime...@mdc-berlin.de
From: Reuti [re...@staff.uni-marburg.de] Sent: Tuesday, May 30, 2017 11:36 To: Jimenez, Juan Esteban Cc: SGE-discuss@liv.ac.uk Subject: Re: [SGE-discuss] Another QRSH problem > Am 30.05.2017 um 11:32 schrieb juanesteban.jime...@mdc-berlin.de: > >

Re: [SGE-discuss] Another QRSH problem

2017-05-30 Thread juanesteban.jime...@mdc-berlin.de
Administrator, BIH HPC Cluster MDC Berlin / IT-Dept. Tel.: +49 30 9406 2800 On 29.05.17, 19:45, "SGE-discuss on behalf of juanesteban.jime...@mdc-berlin.de" <sge-discuss-boun...@liverpool.ac.uk on behalf of juanesteban.jime...@mdc-berlin.de> wrote: How is the shepherd bringing up th

Re: [SGE-discuss] Another QRSH problem

2017-05-29 Thread juanesteban.jime...@mdc-berlin.de
To: Jimenez, Juan Esteban Cc: SGE-discuss@liv.ac.uk Subject: Re: [SGE-discuss] Another QRSH problem > Am 29.05.2017 um 18:00 schrieb juanesteban.jime...@mdc-berlin.de: > > On 29.05.17, 17:56, "Reuti" <re...@staff.uni-marburg.de> wrote: > > >> Am 29.05.2017 um

Re: [SGE-discuss] Another QRSH problem

2017-05-29 Thread juanesteban.jime...@mdc-berlin.de
On 29.05.17, 17:56, "Reuti" <re...@staff.uni-marburg.de> wrote: > Am 29.05.2017 um 17:26 schrieb juanesteban.jime...@mdc-berlin.de: > > I am getting this very specific error: > > debug1: ssh_exchange_identification: /usr/sbin/sshd: error

Re: [SGE-discuss] Another QRSH problem

2017-05-29 Thread juanesteban.jime...@mdc-berlin.de
, Juan Jimenez System Administrator, BIH HPC Cluster MDC Berlin / IT-Dept. Tel.: +49 30 9406 2800 On 29.05.17, 16:39, "Reuti" <re...@staff.uni-marburg.de> wrote: > Am 29.05.2017 um 16:08 schrieb juanesteban.jime...@mdc-berlin.de: > > Out of the b

Re: [SGE-discuss] Another QRSH problem

2017-05-29 Thread juanesteban.jime...@mdc-berlin.de
BTW, I did this to try to troubleshoot this, in qconf -mconf rsh_command /usr/bin/ssh -Y -A - But where does qrsh put the result of the - option? Mfg, Juan Jimenez System Administrator, HPC MDC Berlin / IT-Dept. Tel.: +49 30 9406 2800
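
For reference, the remote startup commands live in the cluster configuration; a minimal sketch of inspecting them, assuming qconf access on an admin host ("node01" is a placeholder):

    qconf -sconf | grep -E '(rsh|rlogin|qlogin)_(command|daemon)'   # global settings
    qconf -sconf node01                         # host-specific overrides, if any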

Re: [SGE-discuss] GPUs as a resource

2017-05-19 Thread juanesteban.jime...@mdc-berlin.de
:12:42 schrieb "juanesteban.jime...@mdc-berlin.de" <juanesteban.jime...@mdc-berlin.de>: > I am just telling you what my colleagues say they were told by Univa. > > Mfg, > Juan Jimenez > System Administrator, HPC > MDC Berlin /

Re: [SGE-discuss] GPUs as a resource

2017-05-19 Thread juanesteban.jime...@mdc-berlin.de
Esteban Cc: William Hay; SGE-discuss@liv.ac.uk Subject: Re: [SGE-discuss] GPUs as a resource > Am 19.05.2017 um 16:35 schrieb juanesteban.jime...@mdc-berlin.de: > >> You are being told by who or what? If it is a what then the exact message >> is helpful? > > By my colleagu

Re: [SGE-discuss] GPUs as a resource

2017-05-19 Thread juanesteban.jime...@mdc-berlin.de
> It does indeed but not by a whole lot for a queue on a couple of nodes. > Since you want to reserve these nodes for GPU users then the extra queue is > needless. > I suggest: > 1. Make the GPU complex FORCED (so users who don't request a gpu can't end up > on a node with gpus). > 2. Define the
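
A hedged sketch of the setup the reply describes: a FORCED, consumable gpu complex advertised on the GPU node (names and counts are illustrative):

    qconf -mc          # add one line to the complex table:
                       #  name  shortcut type relop requestable consumable default urgency
                       #  gpu   gpu      INT  <=    FORCED      YES        0       0
    qconf -me node42   # "node42" is a placeholder; set: complex_values gpu=2
    qsub -l gpu=1 job.sh   # jobs must now request the resource explicitly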

Re: [SGE-discuss] GPUs as a resource

2017-05-19 Thread juanesteban.jime...@mdc-berlin.de
...@staff.uni-marburg.de] Sent: Friday, May 19, 2017 16:37 To: Jimenez, Juan Esteban Cc: Kamel Mazouzi; SGE-discuss@liv.ac.uk Subject: Re: [SGE-discuss] GPUs as a resource > Am 19.05.2017 um 16:33 schrieb juanesteban.jime...@mdc-berlin.de: > > I put them in /opt/sge/default/common/sge-

Re: [SGE-discuss] GPUs as a resource

2017-05-19 Thread juanesteban.jime...@mdc-berlin.de
> You are being told by who or what? If it is a what then the exact message is > helpful? By my colleagues who are running a 2nd cluster using Univa GridEngine. This was a warning from Univa not to do it that way because it increases qmaster workload Juan

Re: [SGE-discuss] GPUs as a resource

2017-05-19 Thread juanesteban.jime...@mdc-berlin.de
...@staff.uni-marburg.de] Sent: Friday, May 19, 2017 14:20 To: Jimenez, Juan Esteban Cc: Kamel Mazouzi; SGE-discuss@liv.ac.uk Subject: Re: [SGE-discuss] GPUs as a resource Hi, > Am 18.05.2017 um 14:15 schrieb juanesteban.jime...@mdc-berlin.de: > > I tried it according to the instructions, but it w

[SGE-discuss] Exclusion

2017-05-19 Thread juanesteban.jime...@mdc-berlin.de
So, I now have a working gpu.q. However, users in the ACL eat up slots even if they have not requested a gpu resource. How do I keep out jobs that do not specifically request a GPU? I only want jobs to run on that queue/node if they want to use one of the two GPUs. Thanks! Juan

Re: [SGE-discuss] GPUs as a resource

2017-05-18 Thread juanesteban.jime...@mdc-berlin.de
for prolog and epilog or ?? Mfg, Juan Jimenez System Administrator, BIH HPC Cluster MDC Berlin / IT-Dept. Tel.: +49 30 9406 2800 From: Kamel Mazouzi <mazo...@gmail.com> Date: Thursday, 18. May 2017 at 13:07 To: "Jimenez, Juan Esteban" <juanesteban.jime...@mdc-berlin.de&

Re: [SGE-discuss] GPUs as a resource

2017-05-16 Thread juanesteban.jime...@mdc-berlin.de
16.05.2017 um 22:07 schrieb juanesteban.jime...@mdc-berlin.de: > In our cluster we have one node with two Nvidia GPUs. I have been trying to > figure out how to set them up as consumable resources tied to an ACL, but I > can't get SGE to handle them correctly. It always says the resource is not

Re: [SGE-discuss] Tying resource use to AD/Linux groups

2017-05-16 Thread juanesteban.jime...@mdc-berlin.de
...@liverpool.ac.uk Subject: Re: [SGE-discuss] Tying resource use to AD/Linux groups Hi Juan, On 16 May 2017 at 12:32, juanesteban.jime...@mdc-berlin.de<mailto:juanesteban.jime...@mdc-berlin.de> <juanesteban.jime...@mdc-berlin.de<mailto:juanesteban.jime...@mdc-berlin.de>> wrot

[SGE-discuss] GPUs as a resource

2017-05-16 Thread juanesteban.jime...@mdc-berlin.de
In our cluster we have one node with two Nvidia GPUs. I have been trying to figure out how to set them up as consumable resources tied to an ACL, but I can't get SGE to handle them correctly. It always says the resource is not available. Can someone walk me through the steps required to set

[SGE-discuss] Tying resource use to AD/Linux groups

2017-05-16 Thread juanesteban.jime...@mdc-berlin.de
Has anyone ever managed to tie permission to use a resource like GPUs on a node to membership in an Active Directory and/or Linux group? Mfg, Juan Jimenez System Administrator, BIH HPC Cluster MDC Berlin / IT-Dept. Tel.: +49 30 9406 2800
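
One common approach is an SGE access list whose entries reference UNIX groups (the AD groups as exposed through SSSD/winbind), attached to a queue's user_lists. A hedged sketch with illustrative names; whether qconf -au accepts @group entries directly is an assumption here:

    qconf -au @gpu_users gpu_acl   # ACL entries prefixed with @ refer to UNIX groups
    qconf -mq gpu.q                # set: user_lists  gpu_acl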

Re: [SGE-discuss] SGE Installation on Centos 7

2017-04-27 Thread juanesteban.jime...@mdc-berlin.de
rs/loveshack/SGE/ Thanks & Regards Yasir Israr -Original Message- From: juanesteban.jime...@mdc-berlin.de [mailto:juanesteban.jime...@mdc-berlin.de] Sent: 27 April 2017 04:00 PM To: ya...@orionsolutions.co.in; 'Maximilian Friedersdorff'; sge-dis

Re: [SGE-discuss] Kerberos authentication

2017-04-21 Thread juanesteban.jime...@mdc-berlin.de
oun...@liverpool.ac.uk] On Behalf > Of juanesteban.jime...@mdc-berlin.de > Sent: Wednesday, April 12, 2017 5:15 PM > To: William Hay <w@ucl.ac.uk> > Cc: SGE-discuss@liv.ac.uk <sge-disc...@liverpool.ac.uk> > Subject: Re: [SGE-discuss]

Re: [SGE-discuss] Kerberos authentication

2017-04-12 Thread juanesteban.jime...@mdc-berlin.de
On 12.04.17, 10:21, "William Hay" <w@ucl.ac.uk> wrote: On Tue, Apr 11, 2017 at 05:11:58PM +, juanesteban.jime...@mdc-berlin.de wrote: > I've got a serious problem here with authenetication with AD and Kerberos. I have already done away with all the possibilities I ca

Re: [SGE-discuss] Sizing the qmaster

2017-04-09 Thread juanesteban.jime...@mdc-berlin.de
-marburg.de] Sent: Sunday, April 09, 2017 17:09 To: Jimenez, Juan Esteban Cc: Jesse Becker; SGE-discuss@liv.ac.uk Subject: Re: [SGE-discuss] Sizing the qmaster -BEGIN PGP SIGNED MESSAGE- Hash: SHA1 Hi, Am 09.04.2017 um 12:38 schrieb juanesteban.jime...@mdc-berlin.de: > Update. > > We

Re: [SGE-discuss] Sizing the qmaster

2017-04-09 Thread juanesteban.jime...@mdc-berlin.de
o: Jimenez, Juan Esteban Cc: Jesse Becker; SGE-discuss@liv.ac.uk Subject: Re: [SGE-discuss] Sizing the qmaster > Am 21.03.2017 um 16:15 schrieb juanesteban.jime...@mdc-berlin.de: > >> The "size" of job metadata (scripts, ENV, etc) doesn't really affect >> the RAM

Re: [SGE-discuss] Sizing the qmaster

2017-03-21 Thread juanesteban.jime...@mdc-berlin.de
>The "size" of job metadata (scripts, ENV, etc) doesn't really affect >the RAM usage appreciably that I've seen. We routinely have jobs >ENVs of almost 4k or more, and it's never been a problem. The >"data" processed by jobs isn't a factor in qmaster RAM usage, so far as >I

Re: [SGE-discuss] Sizing the qmaster

2017-03-21 Thread juanesteban.jime...@mdc-berlin.de
From: SGE-discuss [sge-discuss-boun...@liverpool.ac.uk] on behalf of juanesteban.jime...@mdc-berlin.de [juanesteban.jime...@mdc-berlin.de] Sent: Tuesday, March 21, 2017 09:41 To: Jesse Becker Cc: SGE-discuss@liv.ac.uk Subject: Re: [SGE-discuss] Sizing the qmaster

Re: [SGE-discuss] Sizing the qmaster

2017-03-21 Thread juanesteban.jime...@mdc-berlin.de
+, juanesteban.jime...@mdc-berlin.de wrote: >Hi folks, > >I just ran into my first episode of the scheduler crashing because of too many >submitted jobs. It pegged memory usage to as much as I could give it (12gb at >one point) and still crashed while it tries to work

[SGE-discuss] Sizing the qmaster

2017-03-20 Thread juanesteban.jime...@mdc-berlin.de
Hi folks, I just ran into my first episode of the scheduler crashing because of too many submitted jobs. It pegged memory usage to as much as I could give it (12gb at one point) and still crashed while it tries to work its way through the stack. I need to figure out how to size a box properly
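
Besides RAM, the global configuration can cap how many jobs the qmaster has to hold at once; a hedged sketch for observing memory use and setting limits (values illustrative, not sizing advice):

    ps -o rss,vsz,cmd -C sge_qmaster   # current resident/virtual memory of the qmaster
    qconf -mconf                       # e.g.: max_jobs   200000   (0 = unlimited)
                                       #       max_u_jobs  20000   (0 = unlimited)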

Re: [SGE-discuss] qsub permission denied

2017-02-02 Thread juanesteban.jime...@mdc-berlin.de
Today I did some more testing and the problem appears to be specific to GPFS. I changed the script to put the logs in a folder on an NFS share and *without* the throttling, there are no errors. Juan On 02/02/2017, 00:23, "SGE-discuss on behalf of juanesteban.jime...@mdc-berlin.de&

[SGE-discuss] qsub permission denied

2017-02-01 Thread juanesteban.jime...@mdc-berlin.de
Hi Folks, New to the list! I am the sysadmin of an HPC cluster using SGE 8.1.8. The cluster has 100+ nodes running Centos 7 with a shared DDN storage cluster configured as a GPFS device and a number of NFS mounts to a Centos 7 server. Some of my users are reporting problems with qsub that have