We're running SoGE 8.1.9 on a smallish (but growing) cluster to which we've recently added GPU nodes. On each GPU node, a consumable complex named 'gpu' is defined with the number of GPUs in that node. The complex definition looks like this:

#name               shortcut   type      relop requestable consumable default  urgency
#-------------------------------------------------------------------------------------
gpu                 gpu        INT       <=    YES         JOB        0        0
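(For completeness, here's how the complex can be dumped straight from qmaster; the grep pattern just picks out the header and our 'gpu' entry:)

```shell
# Show the gpu complex exactly as qmaster sees it
qconf -sc | egrep '^#name|^gpu '
```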

We're frequently seeing GPU jobs stuck in 'qw' even when slots and resources on GPU nodes are available. What appears to be happening is that SGE is choosing a node that's full and then waiting for that node to become available rather than switching to another node. For example:

$ qstat -u "*" -q gpu.q
 370002 0.05778 C3D1000b2_ user1        r     04/11/2018 00:18:17 gpu.q@msg-iogpu10                  5
 369728 0.05778 C3D4000b2_ user1        r     04/10/2018 18:00:24 gpu.q@msg-iogpu11                  5
 371490 0.06613 class3d    user2        r     04/11/2018 20:50:02 gpu.q@msg-iogpu12                  3
 367554 0.05778 C3D3000b2_ user1        r     04/08/2018 16:07:24 gpu.q@msg-iogpu3                   3
 367553 0.05778 C3D2000b2_ user1        r     04/08/2018 17:56:54 gpu.q@msg-iogpu4                   3
 367909 0.05778 C3D11k_b2Y user1        r     04/09/2018 00:04:24 gpu.q@msg-iogpu8                   3
 371511 0.06613 class3d    user2        r     04/11/2018 21:45:02 gpu.q@msg-iogpu9                   3
 371593 0.95000 refine_joi user3        qw    04/11/2018 23:05:57                                    5

Job 371593 has requested '-l gpu=2'. Nodes msg-iogpu2, 5, 6, and 7 have no jobs in gpu.q on them and have available gpu resources, e.g.:

$ qhost -F -h msg-iogpu2
.
.
   hc:gpu=2.000000
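(The scheduler's own explanation for the pending job can be pulled out like this -- 371593 is the stuck job from the listing above:)

```shell
# The "scheduling info" section at the bottom of qstat -j lists,
# per host/queue, why the job was not dispatched there.
qstat -j 371593 | sed -n '/scheduling info/,$p'
```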

However, SGE insists on running this job on msg-iogpu9, as shown by these lines in the messages file on each scheduling run:

04/12/2018 09:59:47|worker|wynq1|E|debiting 2.000000 of gpu on host msg-iogpu9 for 1 slots would exceed remaining capacity of 0.000000
04/12/2018 09:59:47|worker|wynq1|E|resources no longer available for start of job 371593.1

From past experience, job 371593 will indeed wait until msg-iogpu9 becomes
available and run there. We do advise our users to set "-R y" for these jobs -- is this a reservation issue? Where else should I look for clues? Any ideas? I'm a bit flummoxed on this one...
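(In case it matters: my understanding from sge_sched_conf(5) is that "-R y" only does anything if the scheduler permits reservations at all, and that reservation decisions can be logged. A sketch of what I'm checking:)

```shell
# max_reservation must be > 0 or "-R y" is silently ignored;
# adding MONITOR=1 to params makes the scheduler log reservation
# decisions to $SGE_ROOT/$SGE_CELL/common/schedule on each run.
qconf -ssconf | egrep 'max_reservation|params'
```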

Thanks.

--
Joshua Baker-LePain
QB3 Shared Cluster Sysadmin
UCSF