Re: [gridengine users] Jobs sitting in queue despite suitable slots and resources available

2018-04-24 Thread Aarseth, Are
Hi, I saw this issue in the archive and I just wanted to say that we see the same thing:
04/24/2018 14:16:41|worker|itsrv9|E|debiting 34359738368.00 of job_memory on host simsrv12.nordicsemi.no for 1 slots would exceed remaining capacity of 0.00
04/24/2018 14:16:41|worker|itsrv9|E|reso

Re: [gridengine users] Jobs sitting in queue despite suitable slots and resources available

2018-04-19 Thread Mark Dixon
On Tue, 17 Apr 2018, Joshua Baker-LePain wrote: As an alternative to fixing our current setup, I'd be most interested to hear if/how other folks are handling GPUs in their SoGE setups. I was considering changing the slot count in gpu.q to match the number of GPUs in a host (rather than CPU core

Re: [gridengine users] Jobs sitting in queue despite suitable slots and resources available

2018-04-17 Thread Joshua Baker-LePain
As an alternative to fixing our current setup, I'd be most interested to hear if/how other folks are handling GPUs in their SoGE setups. I was considering changing the slot count in gpu.q to match the number of GPUs in a host (rather than CPU cores) and have users request slots rather than the gp

Re: [gridengine users] Jobs sitting in queue despite suitable slots and resources available

2018-04-13 Thread Joshua Baker-LePain
On Fri, 13 Apr 2018 at 1:48am, William Hay wrote This looks more like the scheduler and qmaster threads of the qmaster disagreeing about the number of gpu left. This shouldn't persist but bouncing the qmaster might get them to agree. That is indeed exactly what it seems like is going on. How

Re: [gridengine users] Jobs sitting in queue despite suitable slots and resources available

2018-04-13 Thread Joshua Baker-LePain
On Fri, 13 Apr 2018 at 1:47am, Reuti wrote `qstat -f` doesn't show any queue instances being disabled/in alarm state? No, the queues in question are definitely available to accept jobs. We do have *some* queues in the cluster that are either 'a' or 'au', but when this happens there are empt

Re: [gridengine users] Jobs sitting in queue despite suitable slots and resources available

2018-04-13 Thread William Hay
On Thu, Apr 12, 2018 at 10:15:34AM -0700, Joshua Baker-LePain wrote:
> We're running SoGE 8.1.9 on a smallish (but growing) cluster. We've
> recently added GPU nodes to the cluster. On each GPU node, a consumable
> complex named 'gpu' is defined with the number of GPUs in the node. The
> complex

Re: [gridengine users] Jobs sitting in queue despite suitable slots and resources available

2018-04-13 Thread Reuti
`qstat -f` doesn't show any queue instances being disabled/in alarm state?
-- Reuti
> On 12.04.2018 at 21:31, Joshua Baker-LePain wrote:
>
> On Thu, 12 Apr 2018 at 10:15am, Joshua Baker-LePain wrote
>
>> We're running SoGE 8.1.9 on a smallish (but growing) cluster. We've
>> recently added

Re: [gridengine users] Jobs sitting in queue despite suitable slots and resources available

2018-04-12 Thread Joshua Baker-LePain
On Thu, 12 Apr 2018 at 10:15am, Joshua Baker-LePain wrote We're running SoGE 8.1.9 on a smallish (but growing) cluster. We've recently added GPU nodes to the cluster. On each GPU node, a consumable complex named 'gpu' is defined with the number of GPUs in the node. The complex definition lo