Hi all,

Over the weekend I tried to set up GRES device-type allocation, but stumbled onto a bit of a problem.
The setup I'm working with is a cluster of 10 nodes in total, managed by Bright Cluster Manager. Eight of these nodes contain GPUs: gpu[01-07] contain one GPU each, whereas gpu08 contains two. The two GPUs in gpu08 are not the same; one is a Tesla device and the other a Quadro. The remaining two nodes have MICs in them, but these have not been configured yet.

Some software version information:

* Bright Cluster Manager: 7.3 running on SL 7.2
* SLURM: 16.05.2

As per https://slurm.schedmd.com/gres.html, I set up my slurm.conf to have the Gres line with the type specification. Excerpt (full slurm.conf attached):

# Nodes
NodeName=mic[01-02]
NodeName=gpu08 Feature=multiple-gpus Gres=gpu:tesla:1,gpu:quadro:1
NodeName=gpu[01-07] Gres=gpu:1

# Generic resources types
GresTypes=gpu,mic

On gpu08 I added the following configuration to the gres.conf file:

Name=gpu File=/dev/nvidia0 Type=tesla
Name=gpu File=/dev/nvidia1 Type=quadro

I added nothing to the controller's gres.conf file. I believe this type information has propagated to slurmctld, as calling sinfo gives the following output:

$ sinfo -o "%10P %.5a %.15l %.6D %.6t %25G %N"
PARTITION  AVAIL  TIMELIMIT   NODES  STATE  GRES                      NODELIST
longq      up     7-00:00:00  1      drain  (null)                    mic02
longq      up     7-00:00:00  1      alloc  gpu:1                     gpu06
longq      up     7-00:00:00  6      idle   gpu:1                     gpu[01-05,07]
longq      up     7-00:00:00  1      idle   gpu:tesla:1,gpu:quadro:1  gpu08
longq      up     7-00:00:00  1      idle   (null)                    mic01
testq*     up     15:00       1      drain  (null)                    mic02
testq*     up     15:00       1      alloc  gpu:1                     gpu06
testq*     up     15:00       6      idle   gpu:1                     gpu[01-05,07]
testq*     up     15:00       1      idle   gpu:tesla:1,gpu:quadro:1  gpu08
testq*     up     15:00       1      idle   (null)                    mic01

When I try to allocate the resource using salloc, however, I get the following error message:

$ salloc --gres=gpu:tesla:1
salloc: error: Job submit/allocate failed: Requested node configuration is not available
salloc: Job allocation 544 has been revoked.
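In case it matters: my reading of the gres.html page is that Count defaults to the number of devices named by File, so the two gres.conf lines I used on gpu08 should (if I understand the documentation correctly) be equivalent to the more explicit form:

```
# Assumed-equivalent gres.conf for gpu08, with Count spelled out
Name=gpu Type=tesla  File=/dev/nvidia0 Count=1
Name=gpu Type=quadro File=/dev/nvidia1 Count=1
```

I tried it with and without Count and saw no difference, but I mention it in case the defaulting is part of the problem.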
Doing a normal allocation without the --gres flag works, but the following fails as well:

$ salloc -w gpu08 --gres=gpu:1
salloc: error: Job submit/allocate failed: Requested node configuration is not available
salloc: Job allocation 545 has been revoked.

Activating the Gres DebugFlag produces the following output in slurmctld.log:

[2017-02-06T14:22:46.990] gres/gpu: state for gpu08
[2017-02-06T14:22:46.990] gres_cnt found:2 configured:2 avail:2 alloc:0
[2017-02-06T14:22:46.990] gres_bit_alloc:
[2017-02-06T14:22:46.990] gres_used:(null)
[2017-02-06T14:22:46.990] topo_cpus_bitmap[0]:NULL
[2017-02-06T14:22:46.990] topo_gres_bitmap[0]:0
[2017-02-06T14:22:46.990] topo_gres_cnt_alloc[0]:0
[2017-02-06T14:22:46.990] topo_gres_cnt_avail[0]:1
[2017-02-06T14:22:46.990] type[0]:tesla
[2017-02-06T14:22:46.990] topo_cpus_bitmap[1]:NULL
[2017-02-06T14:22:46.990] topo_gres_bitmap[1]:1
[2017-02-06T14:22:46.990] topo_gres_cnt_alloc[1]:0
[2017-02-06T14:22:46.990] topo_gres_cnt_avail[1]:1
[2017-02-06T14:22:46.990] type[1]:quadro
[2017-02-06T14:22:46.990] type_cnt_alloc[0]:0
[2017-02-06T14:22:46.990] type_cnt_avail[0]:1
[2017-02-06T14:22:46.990] type[0]:tesla
[2017-02-06T14:22:46.990] type_cnt_alloc[1]:0
[2017-02-06T14:22:46.990] type_cnt_avail[1]:1
[2017-02-06T14:22:46.990] type[1]:quadro
[2017-02-06T14:22:46.990] gres/mic: state for gpu08
[2017-02-06T14:22:46.990] gres_cnt found:0 configured:0 avail:0 alloc:0
[2017-02-06T14:22:46.990] gres_bit_alloc:NULL
[2017-02-06T14:22:46.990] gres_used:(null)

So here I am, a bit stumped. Is there something wrong with my configuration? Have I missed or overlooked something? Any help would be greatly appreciated!

Sincerely,
Hans Viessmann

P.s. Please find attached the slurm.conf file, the slurmctld.log file, and the gres.conf file.
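P.p.s. One more observation: the attached slurmctld.log is full of "Node ... appears to have a different slurm.conf than the slurmctld" errors, so before anything else I plan to verify that every node really has the same slurm.conf as the controller. A minimal sketch of the check I have in mind (the loop over the real hostnames via ssh/scp is left out; this just compares two local copies, and the /tmp paths are placeholders, not the real config locations):

```shell
# Compare two slurm.conf copies byte-for-byte. On the real cluster the
# second file would first be fetched from each compute node (e.g. scp).
conf_matches() {
    cmp -s "$1" "$2"   # -s: silent; exit status 0 iff files are identical
}

# Throwaway demo files standing in for the controller's and a node's copy.
printf 'GresTypes=gpu,mic\n' > /tmp/ctl_slurm.conf
printf 'GresTypes=gpu,mic\n' > /tmp/node_slurm.conf

if conf_matches /tmp/ctl_slurm.conf /tmp/node_slurm.conf; then
    echo "slurm.conf copies match"
else
    echo "slurm.conf copies differ"
fi
```

If any node's copy differs, I'd sync the file and run scontrol reconfigure, in case the mismatch is related to the allocation failures.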
gres.conf
Description: Binary data
slurm.conf
Description: Binary data
[2017-02-06T14:22:43.396] slurmctld version 16.05.2 started on cluster slurm_cluster
[2017-02-06T14:22:43.805] layouts: no layout to initialize
[2017-02-06T14:22:43.810] layouts: loading entities/relations information
[2017-02-06T14:22:43.810] Recovered state of 10 nodes
[2017-02-06T14:22:43.811] Recovered JobID=317 State=0x1 NodeCnt=0 Assoc=0
[2017-02-06T14:22:43.811] gres: gpu state for job 544
[2017-02-06T14:22:43.811] gres_cnt:1 node_cnt:0 type:tesla
[2017-02-06T14:22:43.811] Recovered JobID=544 State=0x5 NodeCnt=0 Assoc=2
[2017-02-06T14:22:43.811] gres: gpu state for job 545
[2017-02-06T14:22:43.811] gres_cnt:1 node_cnt:0 type:(null)
[2017-02-06T14:22:43.811] Recovered JobID=545 State=0x5 NodeCnt=0 Assoc=2
[2017-02-06T14:22:43.811] Recovered information about 3 jobs
[2017-02-06T14:22:43.811] gres/gpu: state for gpu01
[2017-02-06T14:22:43.811] gres_cnt found:TBD configured:1 avail:1 alloc:0
[2017-02-06T14:22:43.811] gres_bit_alloc:
[2017-02-06T14:22:43.811] gres_used:(null)
[2017-02-06T14:22:43.811] gres/mic: state for gpu01
[2017-02-06T14:22:43.811] gres_cnt found:TBD configured:0 avail:0 alloc:0
[2017-02-06T14:22:43.811] gres_bit_alloc:NULL
[2017-02-06T14:22:43.811] gres_used:(null)
[2017-02-06T14:22:43.811] gres/gpu: state for gpu02
[2017-02-06T14:22:43.811] gres_cnt found:TBD configured:1 avail:1 alloc:0
[2017-02-06T14:22:43.811] gres_bit_alloc:
[2017-02-06T14:22:43.811] gres_used:(null)
[2017-02-06T14:22:43.811] gres/mic: state for gpu02
[2017-02-06T14:22:43.811] gres_cnt found:TBD configured:0 avail:0 alloc:0
[2017-02-06T14:22:43.811] gres_bit_alloc:NULL
[2017-02-06T14:22:43.811] gres_used:(null)
[2017-02-06T14:22:43.811] gres/gpu: state for gpu03
[2017-02-06T14:22:43.811] gres_cnt found:TBD configured:1 avail:1 alloc:0
[2017-02-06T14:22:43.811] gres_bit_alloc:
[2017-02-06T14:22:43.811] gres_used:(null)
[2017-02-06T14:22:43.811] gres/mic: state for gpu03
[2017-02-06T14:22:43.811] gres_cnt found:TBD configured:0 avail:0 alloc:0
[2017-02-06T14:22:43.811] gres_bit_alloc:NULL
[2017-02-06T14:22:43.811] gres_used:(null)
[2017-02-06T14:22:43.811] gres/gpu: state for gpu04
[2017-02-06T14:22:43.811] gres_cnt found:TBD configured:1 avail:1 alloc:0
[2017-02-06T14:22:43.811] gres_bit_alloc:
[2017-02-06T14:22:43.811] gres_used:(null)
[2017-02-06T14:22:43.811] gres/mic: state for gpu04
[2017-02-06T14:22:43.811] gres_cnt found:TBD configured:0 avail:0 alloc:0
[2017-02-06T14:22:43.812] gres_bit_alloc:NULL
[2017-02-06T14:22:43.812] gres_used:(null)
[2017-02-06T14:22:43.812] gres/gpu: state for gpu05
[2017-02-06T14:22:43.812] gres_cnt found:TBD configured:1 avail:1 alloc:0
[2017-02-06T14:22:43.812] gres_bit_alloc:
[2017-02-06T14:22:43.812] gres_used:(null)
[2017-02-06T14:22:43.812] gres/mic: state for gpu05
[2017-02-06T14:22:43.812] gres_cnt found:TBD configured:0 avail:0 alloc:0
[2017-02-06T14:22:43.812] gres_bit_alloc:NULL
[2017-02-06T14:22:43.812] gres_used:(null)
[2017-02-06T14:22:43.812] gres/gpu: state for gpu06
[2017-02-06T14:22:43.812] gres_cnt found:TBD configured:1 avail:1 alloc:0
[2017-02-06T14:22:43.812] gres_bit_alloc:
[2017-02-06T14:22:43.812] gres_used:(null)
[2017-02-06T14:22:43.812] gres/mic: state for gpu06
[2017-02-06T14:22:43.812] gres_cnt found:TBD configured:0 avail:0 alloc:0
[2017-02-06T14:22:43.812] gres_bit_alloc:NULL
[2017-02-06T14:22:43.812] gres_used:(null)
[2017-02-06T14:22:43.812] gres/gpu: state for gpu07
[2017-02-06T14:22:43.812] gres_cnt found:TBD configured:1 avail:1 alloc:0
[2017-02-06T14:22:43.812] gres_bit_alloc:
[2017-02-06T14:22:43.812] gres_used:(null)
[2017-02-06T14:22:43.812] gres/mic: state for gpu07
[2017-02-06T14:22:43.812] gres_cnt found:TBD configured:0 avail:0 alloc:0
[2017-02-06T14:22:43.812] gres_bit_alloc:NULL
[2017-02-06T14:22:43.812] gres_used:(null)
[2017-02-06T14:22:43.812] gres/gpu: state for gpu08
[2017-02-06T14:22:43.812] gres_cnt found:TBD configured:2 avail:2 alloc:0
[2017-02-06T14:22:43.812] gres_bit_alloc:
[2017-02-06T14:22:43.812] gres_used:(null)
[2017-02-06T14:22:43.812] type_cnt_alloc[0]:0
[2017-02-06T14:22:43.812] type_cnt_avail[0]:1
[2017-02-06T14:22:43.812] type[0]:tesla
[2017-02-06T14:22:43.812] type_cnt_alloc[1]:0
[2017-02-06T14:22:43.812] type_cnt_avail[1]:1
[2017-02-06T14:22:43.812] type[1]:quadro
[2017-02-06T14:22:43.812] gres/mic: state for gpu08
[2017-02-06T14:22:43.812] gres_cnt found:TBD configured:0 avail:0 alloc:0
[2017-02-06T14:22:43.812] gres_bit_alloc:NULL
[2017-02-06T14:22:43.812] gres_used:(null)
[2017-02-06T14:22:43.812] gres/gpu: state for mic01
[2017-02-06T14:22:43.812] gres_cnt found:TBD configured:0 avail:0 alloc:0
[2017-02-06T14:22:43.812] gres_bit_alloc:
[2017-02-06T14:22:43.812] gres_used:(null)
[2017-02-06T14:22:43.812] gres/mic: state for mic01
[2017-02-06T14:22:43.812] gres_cnt found:TBD configured:0 avail:0 alloc:0
[2017-02-06T14:22:43.812] gres_bit_alloc:NULL
[2017-02-06T14:22:43.812] gres_used:(null)
[2017-02-06T14:22:43.812] gres/gpu: state for mic02
[2017-02-06T14:22:43.812] gres_cnt found:TBD configured:0 avail:0 alloc:0
[2017-02-06T14:22:43.812] gres_bit_alloc:
[2017-02-06T14:22:43.812] gres_used:(null)
[2017-02-06T14:22:43.812] gres/mic: state for mic02
[2017-02-06T14:22:43.812] gres_cnt found:TBD configured:0 avail:0 alloc:0
[2017-02-06T14:22:43.812] gres_bit_alloc:NULL
[2017-02-06T14:22:43.812] gres_used:(null)
[2017-02-06T14:22:43.812] Recovered state of 0 reservations
[2017-02-06T14:22:43.812] read_slurm_conf: backup_controller not specified.
[2017-02-06T14:22:43.812] Running as primary controller
[2017-02-06T14:22:43.812] Registering slurmctld at port 6817 with slurmdbd.
[2017-02-06T14:22:43.976] No parameter for mcs plugin, default values set
[2017-02-06T14:22:43.976] mcs: MCSParameters = (null). ondemand set.
[2017-02-06T14:22:46.988] error: Node gpu07 appears to have a different slurm.conf than the slurmctld. This could cause issues with communication and functionality. Please review both files and make sure they are the same. If this is expected ignore, and set DebugFlags=NO_CONF_HASH in your slurm.conf.
[2017-02-06T14:22:46.988] gres/gpu: state for gpu07
[2017-02-06T14:22:46.988] gres_cnt found:1 configured:1 avail:1 alloc:0
[2017-02-06T14:22:46.988] gres_bit_alloc:
[2017-02-06T14:22:46.988] gres_used:(null)
[2017-02-06T14:22:46.988] gres/mic: state for gpu07
[2017-02-06T14:22:46.988] gres_cnt found:0 configured:0 avail:0 alloc:0
[2017-02-06T14:22:46.988] gres_bit_alloc:NULL
[2017-02-06T14:22:46.988] gres_used:(null)
[2017-02-06T14:22:46.988] error: Node mic01 appears to have a different slurm.conf than the slurmctld. This could cause issues with communication and functionality. Please review both files and make sure they are the same. If this is expected ignore, and set DebugFlags=NO_CONF_HASH in your slurm.conf.
[2017-02-06T14:22:46.988] gres/gpu: state for mic01
[2017-02-06T14:22:46.988] gres_cnt found:0 configured:0 avail:0 alloc:0
[2017-02-06T14:22:46.988] gres_bit_alloc:
[2017-02-06T14:22:46.988] gres_used:(null)
[2017-02-06T14:22:46.988] gres/mic: state for mic01
[2017-02-06T14:22:46.988] gres_cnt found:0 configured:0 avail:0 alloc:0
[2017-02-06T14:22:46.989] gres_bit_alloc:NULL
[2017-02-06T14:22:46.989] gres_used:(null)
[2017-02-06T14:22:46.989] error: Node mic02 appears to have a different slurm.conf than the slurmctld. This could cause issues with communication and functionality. Please review both files and make sure they are the same. If this is expected ignore, and set DebugFlags=NO_CONF_HASH in your slurm.conf.
[2017-02-06T14:22:46.989] error: Node gpu06 appears to have a different slurm.conf than the slurmctld. This could cause issues with communication and functionality. Please review both files and make sure they are the same. If this is expected ignore, and set DebugFlags=NO_CONF_HASH in your slurm.conf.
[2017-02-06T14:22:46.989] gres/gpu: state for mic02
[2017-02-06T14:22:46.989] gres_cnt found:0 configured:0 avail:0 alloc:0
[2017-02-06T14:22:46.989] gres_bit_alloc:
[2017-02-06T14:22:46.989] gres_used:(null)
[2017-02-06T14:22:46.989] gres/mic: state for mic02
[2017-02-06T14:22:46.989] gres_cnt found:0 configured:0 avail:0 alloc:0
[2017-02-06T14:22:46.989] gres_bit_alloc:NULL
[2017-02-06T14:22:46.989] gres_used:(null)
[2017-02-06T14:22:46.989] gres/gpu: state for gpu06
[2017-02-06T14:22:46.989] gres_cnt found:1 configured:1 avail:1 alloc:0
[2017-02-06T14:22:46.989] gres_bit_alloc:
[2017-02-06T14:22:46.989] gres_used:(null)
[2017-02-06T14:22:46.989] gres/mic: state for gpu06
[2017-02-06T14:22:46.989] gres_cnt found:0 configured:0 avail:0 alloc:0
[2017-02-06T14:22:46.989] gres_bit_alloc:NULL
[2017-02-06T14:22:46.989] gres_used:(null)
[2017-02-06T14:22:46.989] error: Node gpu03 appears to have a different slurm.conf than the slurmctld. This could cause issues with communication and functionality. Please review both files and make sure they are the same. If this is expected ignore, and set DebugFlags=NO_CONF_HASH in your slurm.conf.
[2017-02-06T14:22:46.989] error: Node gpu01 appears to have a different slurm.conf than the slurmctld. This could cause issues with communication and functionality. Please review both files and make sure they are the same. If this is expected ignore, and set DebugFlags=NO_CONF_HASH in your slurm.conf.
[2017-02-06T14:22:46.989] gres/gpu: state for gpu03
[2017-02-06T14:22:46.989] gres_cnt found:1 configured:1 avail:1 alloc:0
[2017-02-06T14:22:46.989] gres_bit_alloc:
[2017-02-06T14:22:46.989] gres_used:(null)
[2017-02-06T14:22:46.989] gres/mic: state for gpu03
[2017-02-06T14:22:46.989] gres_cnt found:0 configured:0 avail:0 alloc:0
[2017-02-06T14:22:46.989] gres_bit_alloc:NULL
[2017-02-06T14:22:46.989] gres_used:(null)
[2017-02-06T14:22:46.989] error: Node gpu02 appears to have a different slurm.conf than the slurmctld. This could cause issues with communication and functionality. Please review both files and make sure they are the same. If this is expected ignore, and set DebugFlags=NO_CONF_HASH in your slurm.conf.
[2017-02-06T14:22:46.989] gres/gpu: state for gpu01
[2017-02-06T14:22:46.989] gres_cnt found:1 configured:1 avail:1 alloc:0
[2017-02-06T14:22:46.989] gres_bit_alloc:
[2017-02-06T14:22:46.989] gres_used:(null)
[2017-02-06T14:22:46.989] gres/mic: state for gpu01
[2017-02-06T14:22:46.989] gres_cnt found:0 configured:0 avail:0 alloc:0
[2017-02-06T14:22:46.989] gres_bit_alloc:NULL
[2017-02-06T14:22:46.990] gres_used:(null)
[2017-02-06T14:22:46.990] error: Node gpu05 appears to have a different slurm.conf than the slurmctld. This could cause issues with communication and functionality. Please review both files and make sure they are the same. If this is expected ignore, and set DebugFlags=NO_CONF_HASH in your slurm.conf.
[2017-02-06T14:22:46.990] gres/gpu: state for gpu02
[2017-02-06T14:22:46.990] gres_cnt found:1 configured:1 avail:1 alloc:0
[2017-02-06T14:22:46.990] gres_bit_alloc:
[2017-02-06T14:22:46.990] gres_used:(null)
[2017-02-06T14:22:46.990] gres/mic: state for gpu02
[2017-02-06T14:22:46.990] gres_cnt found:0 configured:0 avail:0 alloc:0
[2017-02-06T14:22:46.990] gres_bit_alloc:NULL
[2017-02-06T14:22:46.990] gres_used:(null)
[2017-02-06T14:22:46.990] gres/gpu: state for gpu05
[2017-02-06T14:22:46.990] gres_cnt found:1 configured:1 avail:1 alloc:0
[2017-02-06T14:22:46.990] gres_bit_alloc:
[2017-02-06T14:22:46.990] gres_used:(null)
[2017-02-06T14:22:46.990] gres/mic: state for gpu05
[2017-02-06T14:22:46.990] gres_cnt found:0 configured:0 avail:0 alloc:0
[2017-02-06T14:22:46.990] gres_bit_alloc:NULL
[2017-02-06T14:22:46.990] gres_used:(null)
[2017-02-06T14:22:46.990] error: Node gpu08 appears to have a different slurm.conf than the slurmctld. This could cause issues with communication and functionality. Please review both files and make sure they are the same. If this is expected ignore, and set DebugFlags=NO_CONF_HASH in your slurm.conf.
[2017-02-06T14:22:46.990] gres/gpu: state for gpu08
[2017-02-06T14:22:46.990] gres_cnt found:2 configured:2 avail:2 alloc:0
[2017-02-06T14:22:46.990] gres_bit_alloc:
[2017-02-06T14:22:46.990] gres_used:(null)
[2017-02-06T14:22:46.990] topo_cpus_bitmap[0]:NULL
[2017-02-06T14:22:46.990] topo_gres_bitmap[0]:0
[2017-02-06T14:22:46.990] topo_gres_cnt_alloc[0]:0
[2017-02-06T14:22:46.990] topo_gres_cnt_avail[0]:1
[2017-02-06T14:22:46.990] type[0]:tesla
[2017-02-06T14:22:46.990] topo_cpus_bitmap[1]:NULL
[2017-02-06T14:22:46.990] topo_gres_bitmap[1]:1
[2017-02-06T14:22:46.990] topo_gres_cnt_alloc[1]:0
[2017-02-06T14:22:46.990] topo_gres_cnt_avail[1]:1
[2017-02-06T14:22:46.990] type[1]:quadro
[2017-02-06T14:22:46.990] type_cnt_alloc[0]:0
[2017-02-06T14:22:46.990] type_cnt_avail[0]:1
[2017-02-06T14:22:46.990] type[0]:tesla
[2017-02-06T14:22:46.990] type_cnt_alloc[1]:0
[2017-02-06T14:22:46.990] type_cnt_avail[1]:1
[2017-02-06T14:22:46.990] type[1]:quadro
[2017-02-06T14:22:46.990] gres/mic: state for gpu08
[2017-02-06T14:22:46.990] gres_cnt found:0 configured:0 avail:0 alloc:0
[2017-02-06T14:22:46.990] gres_bit_alloc:NULL
[2017-02-06T14:22:46.990] gres_used:(null)
[2017-02-06T14:22:46.993] error: Node gpu04 appears to have a different slurm.conf than the slurmctld. This could cause issues with communication and functionality. Please review both files and make sure they are the same. If this is expected ignore, and set DebugFlags=NO_CONF_HASH in your slurm.conf.
[2017-02-06T14:22:46.994] error: gres_plugin_node_config_unpack: gres/gpu lacks File parameter for node gpu04
[2017-02-06T14:22:46.994] gres/gpu: state for gpu04
[2017-02-06T14:22:46.994] gres_cnt found:1 configured:1 avail:1 alloc:0
[2017-02-06T14:22:46.994] gres_bit_alloc:
[2017-02-06T14:22:46.994] gres_used:(null)
[2017-02-06T14:22:46.994] gres/mic: state for gpu04
[2017-02-06T14:22:46.994] gres_cnt found:0 configured:0 avail:0 alloc:0
[2017-02-06T14:22:46.994] gres_bit_alloc:NULL
[2017-02-06T14:22:46.994] gres_used:(null)
[2017-02-06T14:22:47.982] SchedulerParameters=default_queue_depth=100,max_rpc_cnt=0,max_sched_time=2,partition_job_depth=0,sched_max_job_start=0,sched_min_interval=0