On Fri, 13 Apr 2018 at 1:48am, William Hay wrote:
This looks more like the scheduler and qmaster threads of the qmaster
disagreeing about the number of GPUs left. This shouldn't persist, but
bouncing the qmaster might get them to agree.
That is indeed exactly what seems to be going on. However, I've tried
bouncing the qmaster, and the problem persists after the restart.
It looks like you are defining the gpu as a host consumable. Is there
anything else that defines it: Queue consumable, global consumable,
resource quota or load sensor?
AFAIK, no. The only place the "gpu" variable occurs is in the host
definition's "complex_values", e.g.:
$ qconf -se msg-iogpu9
load_values arch=lx-amd64,num_proc=32,mem_total=128739.226562M, \
gpu.cuda.1.util=90,gpu.names=GeForce GTX 1080 Ti;GeForce \
GTX 1080 Ti;,np_load_avg=0.174063, \
and the complex definition I sent previously. I am running the bundled
load sensor, as can be seen above.
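For completeness, a host consumable like this is normally declared in two places: a line in the complex configuration (qconf -mc) and a complex_values entry on the host (qconf -me). The sketch below is assumed, not copied from my actual config (the GPU count of 4 in particular is a placeholder):

```
# complex configuration (qconf -mc) -- a consumable INT attribute
#name  shortcut  type  relop  requestable  consumable  default  urgency
gpu    gpu       INT   <=     YES          YES         0        0

# host configuration (qconf -me msg-iogpu9) -- assumed count
complex_values  gpu=4
```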
What do you get if you use
qstat -F gpu -q 'gpu-q@msg-iogpu'
$ qstat -F gpu -q gpu.q@msg-iogpu9
queuename qtype resv/used/tot. load_avg arch
gpu.q@msg-iogpu9 BP 0/3/16 5.51 lx-amd64
Any ideas? I'm a bit flummoxed on this one...
Set MONITOR=1 in the scheduler's params and have a look at the schedule
file; it should tell you what the scheduler is doing.
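(For reference, that flag lives on the params line of the scheduler configuration, edited with qconf -msconf; if I recall correctly, the resulting schedule file lands under the cell's common directory:)

```
# scheduler configuration (qconf -msconf); MONITOR=1 makes the scheduler
# log each cycle's dispatch decisions to
# $SGE_ROOT/$SGE_CELL/common/schedule
params  MONITOR=1
```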
I've had that set for a while now. JobID 373163 is currently stuck in the
queue with appropriate slots available:
$ qstat -u "*" -q gpu.q
373163 0.50000 refine_joi user1 qw 04/13/2018 09:25:08
$ qalter -w p 373163
verification: found possible assignment with 3 slots
The qmaster "messages" file has this to say about that job ID (repeatedly
-- this is the first mention and coincides with the submit time above):
04/13/2018 09:25:32|worker|wynq1|E|debiting 2.000000 of gpu on host msg-iogpu9
for 1 slots would exceed remaining capacity of 0.000000
04/13/2018 09:25:32|worker|wynq1|E|resources no longer available for start of
And this is that job's first mention in the schedule file:
That block is repeated over and over in that file.
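A toy sketch of the symptom (illustrative only; this is not Grid Engine's actual code, and the totals are assumed numbers, not my real host's): two bookkeeping views of the same consumable, where one ledger's "used" tally has drifted, reproduce exactly this scheduler/worker disagreement.

```python
# Hypothetical illustration of divergent consumable bookkeeping.

def remaining(total, used):
    """Capacity left on a host for a consumable like 'gpu'."""
    return total - used

# What the scheduler thread believes for msg-iogpu9 (assumed: 4 total, 2 used).
scheduler_view = remaining(total=4, used=2)   # scheduler: 2 gpu free

# What the worker thread's ledger apparently holds: everything debited.
worker_view = remaining(total=4, used=4)      # worker: 0 gpu free

# The scheduler assigns a job needing 2 gpu; the worker then refuses it:
request = 2
print(scheduler_view >= request)  # True  -> scheduler dispatches the job
print(worker_view >= request)     # False -> "would exceed remaining capacity of 0.000000"
```

A restart should normally rebuild both views from the same state, which is why its failure to fix this is surprising.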
Also enabling sched_job_info for the job in question and then running
qstat -j on it after the next scheduling cycle might provide some clues.
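(schedd_job_info is also set in the scheduler configuration via qconf -msconf; the job-list form below limits the bookkeeping overhead to specific jobs:)

```
# scheduler configuration (qconf -msconf)
schedd_job_info  true
# or, to record scheduling info only for the job in question:
schedd_job_info  job_list 373163
```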
Unfortunately that seems a bust as well. It just details all the queue
instances the job can't run in (all legitimate). It doesn't mention the
queue instances it *can* run in at all.
So the short version again: one part of the scheduler seems to think
there are available "gpu" complex slots on hosts where there aren't,
while another part realizes this and keeps the jobs requesting those
slots from starting. But it also won't try different hosts.
QB3 Shared Cluster Sysadmin
users mailing list