Hi all,

Over the weekend I tried to set up GRES device type allocation but ran
into a bit of a problem.

The setup I'm working with is a cluster managed by Bright Cluster Manager with
10 nodes in total. Eight of these nodes contain GPUs: gpu[01-07] contain one
GPU each, whereas gpu08 contains two. The GPUs in gpu08 are not the same model;
one is a Tesla device and the other a Quadro. The remaining two nodes have MICs,
but these have not been configured yet.

Some software version information:

  *   Bright Cluster Manager: 7.3 running on SL 7.2
  *   SLURM: 16.05.2

As per https://slurm.schedmd.com/gres.html, I set up my slurm.conf to include the
Gres lines with the type specification; excerpt below (full slurm.conf attached):

# Nodes
NodeName=mic[01-02]
NodeName=gpu08  Feature=multiple-gpus Gres=gpu:tesla:1,gpu:quadro:1
NodeName=gpu[01-07]  Gres=gpu:1
# Generic resources types
GresTypes=gpu,mic

and on gpu08 I added the following configuration to the gres.conf file:

Name=gpu File=/dev/nvidia0 Type=tesla
Name=gpu File=/dev/nvidia1 Type=quadro

I added nothing to the gres.conf file on the controller.
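
In case it helps, a quick sanity check on gpu08 for the device files referenced
in gres.conf would be something like the following (assuming the usual
/dev/nvidia* device naming):

$ ls -l /dev/nvidia0 /dev/nvidia1
$ nvidia-smi -L

where nvidia-smi -L lists the GPUs by index, which should normally correspond to
the /dev/nvidiaN minor numbers, so the tesla/quadro assignment can be cross-checked.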

I believe that this type information has propagated to slurmctld, as sinfo
gives the following output:

$ sinfo -o "%10P %.5a %.15l %.6D %.6t %25G %N"
PARTITION  AVAIL       TIMELIMIT  NODES  STATE GRES                      NODELIST
longq         up      7-00:00:00      1  drain (null)                    mic02
longq         up      7-00:00:00      1  alloc gpu:1                     gpu06
longq         up      7-00:00:00      6   idle gpu:1                     gpu[01-05,07]
longq         up      7-00:00:00      1   idle gpu:tesla:1,gpu:quadro:1  gpu08
longq         up      7-00:00:00      1   idle (null)                    mic01
testq*        up           15:00      1  drain (null)                    mic02
testq*        up           15:00      1  alloc gpu:1                     gpu06
testq*        up           15:00      6   idle gpu:1                     gpu[01-05,07]
testq*        up           15:00      1   idle gpu:tesla:1,gpu:quadro:1  gpu08
testq*        up           15:00      1   idle (null)                    mic01
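
As a further check (just a sketch of what I would look at, rather than output
I have captured here), the controller's per-node view can also be inspected
with scontrol:

$ scontrol show node gpu08 | grep -i gres

which, if the type information really reached slurmctld, should report
Gres=gpu:tesla:1,gpu:quadro:1 for gpu08.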

When I try to allocate the resource using salloc, however, I get the following
error message:

$ salloc --gres=gpu:tesla:1
salloc: error: Job submit/allocate failed: Requested node configuration is not available
salloc: Job allocation 544 has been revoked.

Doing a normal allocation without the `--gres' flag works, but when I try the
following it fails as well:

$ salloc -w gpu08 --gres=gpu:1
salloc: error: Job submit/allocate failed: Requested node configuration is not available
salloc: Job allocation 545 has been revoked.
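
(If it helps with debugging, I am happy to rerun both requests with more
client-side verbosity, e.g.:

$ salloc -vvv -w gpu08 --gres=gpu:1
$ salloc -vvv --gres=gpu:tesla:1

the -vvv only increases salloc's own logging and leaves the requests unchanged.)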

Activating the Gres DebugFlag produces the following output in the
slurmctld.log file:

[2017-02-06T14:22:46.990] gres/gpu: state for gpu08
[2017-02-06T14:22:46.990]   gres_cnt found:2 configured:2 avail:2 alloc:0
[2017-02-06T14:22:46.990]   gres_bit_alloc:
[2017-02-06T14:22:46.990]   gres_used:(null)
[2017-02-06T14:22:46.990]   topo_cpus_bitmap[0]:NULL
[2017-02-06T14:22:46.990]   topo_gres_bitmap[0]:0
[2017-02-06T14:22:46.990]   topo_gres_cnt_alloc[0]:0
[2017-02-06T14:22:46.990]   topo_gres_cnt_avail[0]:1
[2017-02-06T14:22:46.990]   type[0]:tesla
[2017-02-06T14:22:46.990]   topo_cpus_bitmap[1]:NULL
[2017-02-06T14:22:46.990]   topo_gres_bitmap[1]:1
[2017-02-06T14:22:46.990]   topo_gres_cnt_alloc[1]:0
[2017-02-06T14:22:46.990]   topo_gres_cnt_avail[1]:1
[2017-02-06T14:22:46.990]   type[1]:quadro
[2017-02-06T14:22:46.990]   type_cnt_alloc[0]:0
[2017-02-06T14:22:46.990]   type_cnt_avail[0]:1
[2017-02-06T14:22:46.990]   type[0]:tesla
[2017-02-06T14:22:46.990]   type_cnt_alloc[1]:0
[2017-02-06T14:22:46.990]   type_cnt_avail[1]:1
[2017-02-06T14:22:46.990]   type[1]:quadro
[2017-02-06T14:22:46.990] gres/mic: state for gpu08
[2017-02-06T14:22:46.990]   gres_cnt found:0 configured:0 avail:0 alloc:0
[2017-02-06T14:22:46.990]   gres_bit_alloc:NULL
[2017-02-06T14:22:46.990]   gres_used:(null)

So here I am, a bit stumped.

Is there something wrong with my configuration? Have I missed or overlooked
something?

Any help with this would be greatly appreciated!

Sincerely,
Hans Viessmann

P.S. Please find attached the slurm.conf file, the slurmctld.log file, and the
gres.conf file.


Attachment: gres.conf
Attachment: slurm.conf

[2017-02-06T14:22:43.396] slurmctld version 16.05.2 started on cluster slurm_cluster
[2017-02-06T14:22:43.805] layouts: no layout to initialize
[2017-02-06T14:22:43.810] layouts: loading entities/relations information
[2017-02-06T14:22:43.810] Recovered state of 10 nodes
[2017-02-06T14:22:43.811] Recovered JobID=317 State=0x1 NodeCnt=0 Assoc=0
[2017-02-06T14:22:43.811] gres: gpu state for job 544
[2017-02-06T14:22:43.811]   gres_cnt:1 node_cnt:0 type:tesla
[2017-02-06T14:22:43.811] Recovered JobID=544 State=0x5 NodeCnt=0 Assoc=2
[2017-02-06T14:22:43.811] gres: gpu state for job 545
[2017-02-06T14:22:43.811]   gres_cnt:1 node_cnt:0 type:(null)
[2017-02-06T14:22:43.811] Recovered JobID=545 State=0x5 NodeCnt=0 Assoc=2
[2017-02-06T14:22:43.811] Recovered information about 3 jobs
[2017-02-06T14:22:43.811] gres/gpu: state for gpu01
[2017-02-06T14:22:43.811]   gres_cnt found:TBD configured:1 avail:1 alloc:0
[2017-02-06T14:22:43.811]   gres_bit_alloc:
[2017-02-06T14:22:43.811]   gres_used:(null)
[2017-02-06T14:22:43.811] gres/mic: state for gpu01
[2017-02-06T14:22:43.811]   gres_cnt found:TBD configured:0 avail:0 alloc:0
[2017-02-06T14:22:43.811]   gres_bit_alloc:NULL
[2017-02-06T14:22:43.811]   gres_used:(null)
[2017-02-06T14:22:43.811] gres/gpu: state for gpu02
[2017-02-06T14:22:43.811]   gres_cnt found:TBD configured:1 avail:1 alloc:0
[2017-02-06T14:22:43.811]   gres_bit_alloc:
[2017-02-06T14:22:43.811]   gres_used:(null)
[2017-02-06T14:22:43.811] gres/mic: state for gpu02
[2017-02-06T14:22:43.811]   gres_cnt found:TBD configured:0 avail:0 alloc:0
[2017-02-06T14:22:43.811]   gres_bit_alloc:NULL
[2017-02-06T14:22:43.811]   gres_used:(null)
[2017-02-06T14:22:43.811] gres/gpu: state for gpu03
[2017-02-06T14:22:43.811]   gres_cnt found:TBD configured:1 avail:1 alloc:0
[2017-02-06T14:22:43.811]   gres_bit_alloc:
[2017-02-06T14:22:43.811]   gres_used:(null)
[2017-02-06T14:22:43.811] gres/mic: state for gpu03
[2017-02-06T14:22:43.811]   gres_cnt found:TBD configured:0 avail:0 alloc:0
[2017-02-06T14:22:43.811]   gres_bit_alloc:NULL
[2017-02-06T14:22:43.811]   gres_used:(null)
[2017-02-06T14:22:43.811] gres/gpu: state for gpu04
[2017-02-06T14:22:43.811]   gres_cnt found:TBD configured:1 avail:1 alloc:0
[2017-02-06T14:22:43.811]   gres_bit_alloc:
[2017-02-06T14:22:43.811]   gres_used:(null)
[2017-02-06T14:22:43.811] gres/mic: state for gpu04
[2017-02-06T14:22:43.811]   gres_cnt found:TBD configured:0 avail:0 alloc:0
[2017-02-06T14:22:43.812]   gres_bit_alloc:NULL
[2017-02-06T14:22:43.812]   gres_used:(null)
[2017-02-06T14:22:43.812] gres/gpu: state for gpu05
[2017-02-06T14:22:43.812]   gres_cnt found:TBD configured:1 avail:1 alloc:0
[2017-02-06T14:22:43.812]   gres_bit_alloc:
[2017-02-06T14:22:43.812]   gres_used:(null)
[2017-02-06T14:22:43.812] gres/mic: state for gpu05
[2017-02-06T14:22:43.812]   gres_cnt found:TBD configured:0 avail:0 alloc:0
[2017-02-06T14:22:43.812]   gres_bit_alloc:NULL
[2017-02-06T14:22:43.812]   gres_used:(null)
[2017-02-06T14:22:43.812] gres/gpu: state for gpu06
[2017-02-06T14:22:43.812]   gres_cnt found:TBD configured:1 avail:1 alloc:0
[2017-02-06T14:22:43.812]   gres_bit_alloc:
[2017-02-06T14:22:43.812]   gres_used:(null)
[2017-02-06T14:22:43.812] gres/mic: state for gpu06
[2017-02-06T14:22:43.812]   gres_cnt found:TBD configured:0 avail:0 alloc:0
[2017-02-06T14:22:43.812]   gres_bit_alloc:NULL
[2017-02-06T14:22:43.812]   gres_used:(null)
[2017-02-06T14:22:43.812] gres/gpu: state for gpu07
[2017-02-06T14:22:43.812]   gres_cnt found:TBD configured:1 avail:1 alloc:0
[2017-02-06T14:22:43.812]   gres_bit_alloc:
[2017-02-06T14:22:43.812]   gres_used:(null)
[2017-02-06T14:22:43.812] gres/mic: state for gpu07
[2017-02-06T14:22:43.812]   gres_cnt found:TBD configured:0 avail:0 alloc:0
[2017-02-06T14:22:43.812]   gres_bit_alloc:NULL
[2017-02-06T14:22:43.812]   gres_used:(null)
[2017-02-06T14:22:43.812] gres/gpu: state for gpu08
[2017-02-06T14:22:43.812]   gres_cnt found:TBD configured:2 avail:2 alloc:0
[2017-02-06T14:22:43.812]   gres_bit_alloc:
[2017-02-06T14:22:43.812]   gres_used:(null)
[2017-02-06T14:22:43.812]   type_cnt_alloc[0]:0
[2017-02-06T14:22:43.812]   type_cnt_avail[0]:1
[2017-02-06T14:22:43.812]   type[0]:tesla
[2017-02-06T14:22:43.812]   type_cnt_alloc[1]:0
[2017-02-06T14:22:43.812]   type_cnt_avail[1]:1
[2017-02-06T14:22:43.812]   type[1]:quadro
[2017-02-06T14:22:43.812] gres/mic: state for gpu08
[2017-02-06T14:22:43.812]   gres_cnt found:TBD configured:0 avail:0 alloc:0
[2017-02-06T14:22:43.812]   gres_bit_alloc:NULL
[2017-02-06T14:22:43.812]   gres_used:(null)
[2017-02-06T14:22:43.812] gres/gpu: state for mic01
[2017-02-06T14:22:43.812]   gres_cnt found:TBD configured:0 avail:0 alloc:0
[2017-02-06T14:22:43.812]   gres_bit_alloc:
[2017-02-06T14:22:43.812]   gres_used:(null)
[2017-02-06T14:22:43.812] gres/mic: state for mic01
[2017-02-06T14:22:43.812]   gres_cnt found:TBD configured:0 avail:0 alloc:0
[2017-02-06T14:22:43.812]   gres_bit_alloc:NULL
[2017-02-06T14:22:43.812]   gres_used:(null)
[2017-02-06T14:22:43.812] gres/gpu: state for mic02
[2017-02-06T14:22:43.812]   gres_cnt found:TBD configured:0 avail:0 alloc:0
[2017-02-06T14:22:43.812]   gres_bit_alloc:
[2017-02-06T14:22:43.812]   gres_used:(null)
[2017-02-06T14:22:43.812] gres/mic: state for mic02
[2017-02-06T14:22:43.812]   gres_cnt found:TBD configured:0 avail:0 alloc:0
[2017-02-06T14:22:43.812]   gres_bit_alloc:NULL
[2017-02-06T14:22:43.812]   gres_used:(null)
[2017-02-06T14:22:43.812] Recovered state of 0 reservations
[2017-02-06T14:22:43.812] read_slurm_conf: backup_controller not specified.
[2017-02-06T14:22:43.812] Running as primary controller
[2017-02-06T14:22:43.812] Registering slurmctld at port 6817 with slurmdbd.
[2017-02-06T14:22:43.976] No parameter for mcs plugin, default values set
[2017-02-06T14:22:43.976] mcs: MCSParameters = (null). ondemand set.
[2017-02-06T14:22:46.988] error: Node gpu07 appears to have a different slurm.conf than the slurmctld.  This could cause issues with communication and functionality.  Please review both files and make sure they are the same.  If this is expected ignore, and set DebugFlags=NO_CONF_HASH in your slurm.conf.
[2017-02-06T14:22:46.988] gres/gpu: state for gpu07
[2017-02-06T14:22:46.988]   gres_cnt found:1 configured:1 avail:1 alloc:0
[2017-02-06T14:22:46.988]   gres_bit_alloc:
[2017-02-06T14:22:46.988]   gres_used:(null)
[2017-02-06T14:22:46.988] gres/mic: state for gpu07
[2017-02-06T14:22:46.988]   gres_cnt found:0 configured:0 avail:0 alloc:0
[2017-02-06T14:22:46.988]   gres_bit_alloc:NULL
[2017-02-06T14:22:46.988]   gres_used:(null)
[2017-02-06T14:22:46.988] error: Node mic01 appears to have a different slurm.conf than the slurmctld.  This could cause issues with communication and functionality.  Please review both files and make sure they are the same.  If this is expected ignore, and set DebugFlags=NO_CONF_HASH in your slurm.conf.
[2017-02-06T14:22:46.988] gres/gpu: state for mic01
[2017-02-06T14:22:46.988]   gres_cnt found:0 configured:0 avail:0 alloc:0
[2017-02-06T14:22:46.988]   gres_bit_alloc:
[2017-02-06T14:22:46.988]   gres_used:(null)
[2017-02-06T14:22:46.988] gres/mic: state for mic01
[2017-02-06T14:22:46.988]   gres_cnt found:0 configured:0 avail:0 alloc:0
[2017-02-06T14:22:46.989]   gres_bit_alloc:NULL
[2017-02-06T14:22:46.989]   gres_used:(null)
[2017-02-06T14:22:46.989] error: Node mic02 appears to have a different slurm.conf than the slurmctld.  This could cause issues with communication and functionality.  Please review both files and make sure they are the same.  If this is expected ignore, and set DebugFlags=NO_CONF_HASH in your slurm.conf.
[2017-02-06T14:22:46.989] error: Node gpu06 appears to have a different slurm.conf than the slurmctld.  This could cause issues with communication and functionality.  Please review both files and make sure they are the same.  If this is expected ignore, and set DebugFlags=NO_CONF_HASH in your slurm.conf.
[2017-02-06T14:22:46.989] gres/gpu: state for mic02
[2017-02-06T14:22:46.989]   gres_cnt found:0 configured:0 avail:0 alloc:0
[2017-02-06T14:22:46.989]   gres_bit_alloc:
[2017-02-06T14:22:46.989]   gres_used:(null)
[2017-02-06T14:22:46.989] gres/mic: state for mic02
[2017-02-06T14:22:46.989]   gres_cnt found:0 configured:0 avail:0 alloc:0
[2017-02-06T14:22:46.989]   gres_bit_alloc:NULL
[2017-02-06T14:22:46.989]   gres_used:(null)
[2017-02-06T14:22:46.989] gres/gpu: state for gpu06
[2017-02-06T14:22:46.989]   gres_cnt found:1 configured:1 avail:1 alloc:0
[2017-02-06T14:22:46.989]   gres_bit_alloc:
[2017-02-06T14:22:46.989]   gres_used:(null)
[2017-02-06T14:22:46.989] gres/mic: state for gpu06
[2017-02-06T14:22:46.989]   gres_cnt found:0 configured:0 avail:0 alloc:0
[2017-02-06T14:22:46.989]   gres_bit_alloc:NULL
[2017-02-06T14:22:46.989]   gres_used:(null)
[2017-02-06T14:22:46.989] error: Node gpu03 appears to have a different slurm.conf than the slurmctld.  This could cause issues with communication and functionality.  Please review both files and make sure they are the same.  If this is expected ignore, and set DebugFlags=NO_CONF_HASH in your slurm.conf.
[2017-02-06T14:22:46.989] error: Node gpu01 appears to have a different slurm.conf than the slurmctld.  This could cause issues with communication and functionality.  Please review both files and make sure they are the same.  If this is expected ignore, and set DebugFlags=NO_CONF_HASH in your slurm.conf.
[2017-02-06T14:22:46.989] gres/gpu: state for gpu03
[2017-02-06T14:22:46.989]   gres_cnt found:1 configured:1 avail:1 alloc:0
[2017-02-06T14:22:46.989]   gres_bit_alloc:
[2017-02-06T14:22:46.989]   gres_used:(null)
[2017-02-06T14:22:46.989] gres/mic: state for gpu03
[2017-02-06T14:22:46.989]   gres_cnt found:0 configured:0 avail:0 alloc:0
[2017-02-06T14:22:46.989]   gres_bit_alloc:NULL
[2017-02-06T14:22:46.989]   gres_used:(null)
[2017-02-06T14:22:46.989] error: Node gpu02 appears to have a different slurm.conf than the slurmctld.  This could cause issues with communication and functionality.  Please review both files and make sure they are the same.  If this is expected ignore, and set DebugFlags=NO_CONF_HASH in your slurm.conf.
[2017-02-06T14:22:46.989] gres/gpu: state for gpu01
[2017-02-06T14:22:46.989]   gres_cnt found:1 configured:1 avail:1 alloc:0
[2017-02-06T14:22:46.989]   gres_bit_alloc:
[2017-02-06T14:22:46.989]   gres_used:(null)
[2017-02-06T14:22:46.989] gres/mic: state for gpu01
[2017-02-06T14:22:46.989]   gres_cnt found:0 configured:0 avail:0 alloc:0
[2017-02-06T14:22:46.989]   gres_bit_alloc:NULL
[2017-02-06T14:22:46.990]   gres_used:(null)
[2017-02-06T14:22:46.990] error: Node gpu05 appears to have a different slurm.conf than the slurmctld.  This could cause issues with communication and functionality.  Please review both files and make sure they are the same.  If this is expected ignore, and set DebugFlags=NO_CONF_HASH in your slurm.conf.
[2017-02-06T14:22:46.990] gres/gpu: state for gpu02
[2017-02-06T14:22:46.990]   gres_cnt found:1 configured:1 avail:1 alloc:0
[2017-02-06T14:22:46.990]   gres_bit_alloc:
[2017-02-06T14:22:46.990]   gres_used:(null)
[2017-02-06T14:22:46.990] gres/mic: state for gpu02
[2017-02-06T14:22:46.990]   gres_cnt found:0 configured:0 avail:0 alloc:0
[2017-02-06T14:22:46.990]   gres_bit_alloc:NULL
[2017-02-06T14:22:46.990]   gres_used:(null)
[2017-02-06T14:22:46.990] gres/gpu: state for gpu05
[2017-02-06T14:22:46.990]   gres_cnt found:1 configured:1 avail:1 alloc:0
[2017-02-06T14:22:46.990]   gres_bit_alloc:
[2017-02-06T14:22:46.990]   gres_used:(null)
[2017-02-06T14:22:46.990] gres/mic: state for gpu05
[2017-02-06T14:22:46.990]   gres_cnt found:0 configured:0 avail:0 alloc:0
[2017-02-06T14:22:46.990]   gres_bit_alloc:NULL
[2017-02-06T14:22:46.990]   gres_used:(null)
[2017-02-06T14:22:46.990] error: Node gpu08 appears to have a different slurm.conf than the slurmctld.  This could cause issues with communication and functionality.  Please review both files and make sure they are the same.  If this is expected ignore, and set DebugFlags=NO_CONF_HASH in your slurm.conf.
[2017-02-06T14:22:46.990] gres/gpu: state for gpu08
[2017-02-06T14:22:46.990]   gres_cnt found:2 configured:2 avail:2 alloc:0
[2017-02-06T14:22:46.990]   gres_bit_alloc:
[2017-02-06T14:22:46.990]   gres_used:(null)
[2017-02-06T14:22:46.990]   topo_cpus_bitmap[0]:NULL
[2017-02-06T14:22:46.990]   topo_gres_bitmap[0]:0
[2017-02-06T14:22:46.990]   topo_gres_cnt_alloc[0]:0
[2017-02-06T14:22:46.990]   topo_gres_cnt_avail[0]:1
[2017-02-06T14:22:46.990]   type[0]:tesla
[2017-02-06T14:22:46.990]   topo_cpus_bitmap[1]:NULL
[2017-02-06T14:22:46.990]   topo_gres_bitmap[1]:1
[2017-02-06T14:22:46.990]   topo_gres_cnt_alloc[1]:0
[2017-02-06T14:22:46.990]   topo_gres_cnt_avail[1]:1
[2017-02-06T14:22:46.990]   type[1]:quadro
[2017-02-06T14:22:46.990]   type_cnt_alloc[0]:0
[2017-02-06T14:22:46.990]   type_cnt_avail[0]:1
[2017-02-06T14:22:46.990]   type[0]:tesla
[2017-02-06T14:22:46.990]   type_cnt_alloc[1]:0
[2017-02-06T14:22:46.990]   type_cnt_avail[1]:1
[2017-02-06T14:22:46.990]   type[1]:quadro
[2017-02-06T14:22:46.990] gres/mic: state for gpu08
[2017-02-06T14:22:46.990]   gres_cnt found:0 configured:0 avail:0 alloc:0
[2017-02-06T14:22:46.990]   gres_bit_alloc:NULL
[2017-02-06T14:22:46.990]   gres_used:(null)
[2017-02-06T14:22:46.993] error: Node gpu04 appears to have a different slurm.conf than the slurmctld.  This could cause issues with communication and functionality.  Please review both files and make sure they are the same.  If this is expected ignore, and set DebugFlags=NO_CONF_HASH in your slurm.conf.
[2017-02-06T14:22:46.994] error: gres_plugin_node_config_unpack: gres/gpu lacks File parameter for node gpu04
[2017-02-06T14:22:46.994] gres/gpu: state for gpu04
[2017-02-06T14:22:46.994]   gres_cnt found:1 configured:1 avail:1 alloc:0
[2017-02-06T14:22:46.994]   gres_bit_alloc:
[2017-02-06T14:22:46.994]   gres_used:(null)
[2017-02-06T14:22:46.994] gres/mic: state for gpu04
[2017-02-06T14:22:46.994]   gres_cnt found:0 configured:0 avail:0 alloc:0
[2017-02-06T14:22:46.994]   gres_bit_alloc:NULL
[2017-02-06T14:22:46.994]   gres_used:(null)
[2017-02-06T14:22:47.982] SchedulerParameters=default_queue_depth=100,max_rpc_cnt=0,max_sched_time=2,partition_job_depth=0,sched_max_job_start=0,sched_min_interval=0
