On Fri, 13 Apr 2018 at 1:48am, William Hay wrote:
> This looks more like the scheduler and qmaster threads of the qmaster
> disagreeing about the number of gpus left. This shouldn't persist, but
> bouncing the qmaster might get them to agree.
That is indeed exactly what it looks like. However, I've tried bouncing
the qmaster, and the problem persists after the restart.
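(By "bouncing" I mean a full stop/start of the qmaster daemon on the master
host, roughly like so -- the init script name is assumed here and varies by
install:)

# Graceful qmaster shutdown, then restart via the init script
# (script name/path assumed; adjust for your installation):
$ qconf -km
$ /etc/init.d/sgemaster start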
> It looks like you are defining the gpu as a host consumable. Is there
> anything else that defines it: Queue consumable, global consumable,
> resource quota or load sensor?
AFAIK, no. The only place the "gpu" variable occurs is in the host
definition "complex_values", e.g.:
$ qconf -se msg-iogpu9
hostname msg-iogpu9
load_scaling NONE
complex_values mem_free=128000M,gpu=2
load_values arch=lx-amd64,num_proc=32,mem_total=128739.226562M, \
swap_total=4095.996094M,virtual_total=132835.222656M, \
m_topology=SCTTCTTCTTCTTCTTCTTCTTCTTSCTTCTTCTTCTTCTTCTTCTTCTT, \
m_socket=2,m_core=16,m_thread=32,load_avg=5.570000, \
load_short=5.560000,load_medium=5.570000, \
load_long=5.530000,mem_free=124723.488281M, \
swap_free=4095.996094M,virtual_free=128819.484375M, \
mem_used=4015.738281M,swap_used=0.000000M, \
virtual_used=4015.738281M,cpu=21.100000, \
m_topology_inuse=SCTTCTTCTTCTTCTTCTTCTTCTTSCTTCTTCTTCTTCTTCTTCTTCTT, \
gpu.ncuda=2,gpu.ndev=2,gpu.cuda.0.mem_free=752222208, \
gpu.cuda.0.procs=1,gpu.cuda.0.clock=1911, \
gpu.cuda.0.util=91,gpu.cuda.1.mem_free=752222208, \
gpu.cuda.1.procs=1,gpu.cuda.1.clock=1911, \
gpu.cuda.1.util=90,gpu.names=GeForce GTX 1080 Ti;GeForce \
GTX 1080 Ti;,np_load_avg=0.174063, \
np_load_short=0.173750,np_load_medium=0.174063, \
np_load_long=0.172813
processors 32
user_lists NONE
xuser_lists NONE
projects NONE
xprojects NONE
usage_scaling NONE
report_variables NONE
and the complex definition I sent previously. I am running the bundled
load sensor, as can be seen above.
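For what it's worth, here's roughly how I checked the other spots you
mention (the resource quota set name below is just a placeholder):

# Queue-level, global, and resource-quota definitions that could also
# carry a "gpu" consumable, plus the configured load sensor:
$ qconf -sq gpu.q | grep -i gpu
$ qconf -se global | grep -i gpu
$ qconf -srqsl                          # list resource quota sets, if any
$ qconf -srqs <rqs_name> | grep -i gpu  # placeholder RQS name
$ qconf -sconf | grep -i load_sensor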
> What do you get if you use
> qstat -F gpu -q 'gpu.q@msg-iogpu[29]'
$ qstat -F gpu -q gpu.q@msg-iogpu9
queuename                      qtype resv/used/tot. load_avg arch          states
---------------------------------------------------------------------------------
gpu.q@msg-iogpu9               BP    0/3/16         5.51     lx-amd64
        hc:gpu=0
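For comparison, the host-level view of the same consumable and the jobs
currently running on that queue instance can be checked with:

$ qhost -F gpu -h msg-iogpu9
$ qstat -s r -q gpu.q@msg-iogpu9 -u '*'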
Any ideas? I'm a bit flummoxed on this one...
> Setting MONITOR=1 in the scheduler's params and having a look at the
> schedule file should tell you what the scheduler is doing.
I've had that set for a while now. JobID 373163 is currently stuck in the
queue with appropriate slots available:
$ qstat -u "*" -q gpu.q
.
.
 373163 0.50000 refine_joi user1        qw    04/13/2018 09:25:08                                    3
$ qalter -w p 373163
verification: found possible assignment with 3 slots
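(The -w p "poke" above validates against the current cluster state; a
verification against an otherwise empty cluster can be run for comparison:)

$ qalter -w v 373163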
The qmaster "messages" file has this to say about that job ID (repeatedly
-- this is the first mention and coincides with the submit time above):
04/13/2018 09:25:32|worker|wynq1|E|debiting 2.000000 of gpu on host msg-iogpu9 for 1 slots would exceed remaining capacity of 0.000000
04/13/2018 09:25:32|worker|wynq1|E|resources no longer available for start of job 373163.1
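(Those lines were pulled out of the qmaster spool with something like the
following -- the "debiting" line immediately precedes the per-job error,
hence -B1; the spool path is assumed to be the default:)

$ grep -B1 'job 373163\.1' $SGE_ROOT/$SGE_CELL/spool/qmaster/messages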
And this is that job's first mention in the schedule file:
373163:1:STARTING:1523638696:82860:P:mpi_onehost:slots:3.000000
373163:1:STARTING:1523638696:82860:H:msg-iogpu9:mem_free:51539607552.000000
373163:1:STARTING:1523638696:82860:H:msg-iogpu9:gpu:2.000000
373163:1:STARTING:1523638696:82860:Q:gpu.q@msg-iogpu9:slots:3.000000
That block is repeated over and over in that file.
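(For reference, MONITOR is confirmed on and that job's records are pulled
out with something like this -- the schedule file path here is the usual
default:)

$ qconf -ssconf | grep -i params
$ grep '^373163:' $SGE_ROOT/$SGE_CELL/common/schedule | sort -u
# Field layout, as I understand it:
#   jobid:task:state:time:duration:level:object:resource:amount
#   where level P = parallel environment, H = host, Q = queue instance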
> Also enabling sched_job_info for the job in question and then running
> qstat -j on it after the next scheduling cycle might provide some clues.
Unfortunately that seems a bust as well. It just details all the queue
instances the job can't run in (all legitimate). It doesn't mention the
queue instances it *can* run in at all.
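(In case I fumbled the setup: it was enabled for just this job in the
scheduler config, roughly as below -- parameter name as I understand it --
and then queried with qstat -j:)

# Scheduler config entry (qconf -msconf):
#   schedd_job_info   job_list 373163
$ qstat -j 373163 | sed -n '/scheduling info:/,$p'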
So the short version again: the scheduler thread seems to think there are
available "gpu" complex slots on hosts where there aren't any, while the
qmaster's worker thread realizes this and refuses to start the jobs
requesting those slots. But the scheduler also won't try different hosts.
--
Joshua Baker-LePain
QB3 Shared Cluster Sysadmin
UCSF
_______________________________________________
users mailing list
users@gridengine.org
https://gridengine.org/mailman/listinfo/users