`qstat -f` doesn't show any queue instances being disabled/in alarm state?
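
Something like the following should show only the queue instances that are not in a normal state (the exact state letters for -qs and -explain may differ between versions, so please double-check the qstat man page):

$ qstat -f -q gpu.q -qs acduE
$ qstat -f -q gpu.q -explain aAcE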

-- Reuti


> Am 12.04.2018 um 21:31 schrieb Joshua Baker-LePain <j...@salilab.org>:
> 
> On Thu, 12 Apr 2018 at 10:15am, Joshua Baker-LePain wrote:
> 
>> We're running SoGE 8.1.9 on a smallish (but growing) cluster.  We've 
>> recently added GPU nodes to the cluster.  On each GPU node, a consumable 
>> complex named 'gpu' is defined with the number of GPUs in the node.  The 
>> complex definition looks like this:
>> 
>> # name              shortcut   type      relop requestable consumable default  urgency
>> # --------------------------------------------------------------------------------------
>> gpu                 gpu        INT       <=    YES         JOB        0        0
>> 
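
Side note on the setup: I assume the per-node capacity is attached to each exec host via complex_values, checked or set with something like the lines below (the host name and the count of 2 are only placeholders here):

$ qconf -se msg-iogpu2 | grep complex_values
$ qconf -mattr exechost complex_values gpu=2 msg-iogpu2
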
>> We're frequently seeing GPU jobs stuck in 'qw' even when slots and resources 
>> on GPU nodes are available.  What appears to be happening is that SGE is 
>> choosing a node that's full and then waiting for that node to become 
>> available rather than switching to another node.  For example:
>> 
>> $ qstat -u "*" -q gpu.q
>> 370002 0.05778 C3D1000b2_ user1        r     04/11/2018 00:18:17 gpu.q@msg-iogpu10                  5
>> 369728 0.05778 C3D4000b2_ user1        r     04/10/2018 18:00:24 gpu.q@msg-iogpu11                  5
>> 371490 0.06613 class3d    user2        r     04/11/2018 20:50:02 gpu.q@msg-iogpu12                  3
>> 367554 0.05778 C3D3000b2_ user1        r     04/08/2018 16:07:24 gpu.q@msg-iogpu3                   3
>> 367553 0.05778 C3D2000b2_ user1        r     04/08/2018 17:56:54 gpu.q@msg-iogpu4                   3
>> 367909 0.05778 C3D11k_b2Y user1        r     04/09/2018 00:04:24 gpu.q@msg-iogpu8                   3
>> 371511 0.06613 class3d    user2        r     04/11/2018 21:45:02 gpu.q@msg-iogpu9                   3
>> 371593 0.95000 refine_joi user3        qw    04/11/2018 23:05:57                                    5
>> 
>> Job 371593 has requested '-l gpu=2'.  Nodes msg-iogpu2, 5, 6, and 7 have no 
>> jobs in gpu.q on them and available gpu resources, e.g.:
>> 
>> $ qhost -F -h msg-iogpu2
>> .
>> .
>>   hc:gpu=2.000000
>> 
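
The per-queue-instance view of the consumable might also be handy here, e.g.:

$ qstat -F gpu -q gpu.q
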
>> However, SGE seems to insist on running this job on msg-iogpu9, as seen by these 
>> lines in the messages file for each scheduling run:
>> 
>> 04/12/2018 09:59:47|worker|wynq1|E|debiting 2.000000 of gpu on host msg-iogpu9 for 1 slots would exceed remaining capacity of 0.000000
>> 04/12/2018 09:59:47|worker|wynq1|E|resources no longer available for start of job 371593.1
>> 
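
To get the scheduler's own reasoning for the pending job, you could temporarily switch on schedd_job_info (it adds some overhead on busy clusters, so switch it off again afterwards):

$ qconf -msconf          # set schedd_job_info to true
$ qstat -j 371593        # then check the "scheduling info:" lines at the end
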
>> From past experience, job 371593 will indeed wait until msg-iogpu9 becomes 
>> available and run there.  We do advise our users to set "-R y" for these 
>> jobs -- is this a reservation issue?  Where else should I look for clues? 
>> Any ideas?  I'm a bit flummoxed on this one...
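
Regarding "-R y": as far as I know the reservation request only has an effect if max_reservation in the scheduler configuration is greater than zero (the default is 0), so it is worth checking:

$ qconf -ssconf | grep max_reservation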
> 
> One last bit of info.  Running 'qalter -w p' on the stuck job proves that it 
> *should* be able to run:
> 
> $ qalter -w p 371593
> verification: found possible assignment with 5 slots
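
If it does turn out to be a reservation problem, the scheduler can log where it places the reservation for job 371593: add MONITOR=1 to the "params" line of the scheduler configuration and watch the schedule file (the file location is from memory, so please verify against the sched_conf man page):

$ qconf -msconf                                      # params ... MONITOR=1
$ grep 371593 $SGE_ROOT/$SGE_CELL/common/schedule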
> 
> -- 
> Joshua Baker-LePain
> QB3 Shared Cluster Sysadmin
> UCSF
> _______________________________________________
> users mailing list
> users@gridengine.org
> https://gridengine.org/mailman/listinfo/users

