Hi Robbert,

I had just added the DebugFlags setting to slurm.conf on the head node and had not yet synchronised it with the nodes. I doubt this could cause the problem I described, as it was occurring before I made the change to slurm.conf.
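For reference, the change was only the debug flag for Gres, i.e. a line of roughly this form in slurm.conf:

    DebugFlags=Gres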
One thing I did notice is this error occurring every once in a while:

    [2016-12-30T17:36:50.963] error: gres_plugin_node_config_unpack: gres/gpu lacks File parameter for node gpu07
    [2016-12-30T17:36:50.963] error: gres_plugin_node_config_unpack: gres/gpu lacks File parameter for node gpu04
    [2016-12-30T17:36:50.963] error: gres_plugin_node_config_unpack: gres/gpu lacks File parameter for node gpu01
    [2016-12-30T17:36:50.963] error: gres_plugin_node_config_unpack: gres/gpu lacks File parameter for node gpu05
    [2016-12-30T17:36:50.963] error: gres_plugin_node_config_unpack: gres/gpu lacks File parameter for node gpu02
    [2016-12-30T17:36:50.964] error: gres_plugin_node_config_unpack: gres/gpu lacks File parameter for node gpu06
    [2016-12-30T17:36:50.966] error: gres_plugin_node_config_unpack: gres/gpu lacks File parameter for node gpu03

Is it possible that I need to specify the Gres Type for the other nodes as well, even though they have only one GPU each?

Sincerely,
Hans

On 6 February 2017 at 15:03, Robbert Eggermont <r.eggerm...@tudelft.nl> wrote:
>
> Hi Hans,
>
> Your log shows that your slurm.conf is out of sync on some nodes; is that
> on purpose?
>
> What happens when you synchronise the slurm.conf on all nodes, restart
> slurmctld, and restart slurmd on all nodes?
>
> Best,
>
> Robbert
>
> On 06-02-17 15:45, Hans-Nikolai Viessmann wrote:
>
>> Hi all,
>>
>> Over the weekend I tried to set up Gres device type allocation but
>> stumbled onto a bit of a problem.
>>
>> The setup I'm working with is a cluster with Bright Cluster Manager
>> set up and managing 10 nodes in total. Eight of these nodes contain
>> GPUs: gpu[01-07] contain one GPU each, whereas gpu08 contains two.
>> The GPUs in gpu08 are not the same; one is a Tesla device and the
>> other a Quadro. The other two nodes have MICs on them, but these
>> have not been configured yet.
>>
>> Some software version information:
>>
>> * Bright Cluster Manager: 7.3 running on SL 7.2
>> * SLURM: 16.05.2
>>
>> As per https://slurm.schedmd.com/gres.html, I set up my slurm.conf file
>> to have the Gres line with the type specification - excerpt (full
>> slurm.conf attached):
>>
>> # Nodes
>> NodeName=mic[01-02]
>> NodeName=gpu08 Feature=multiple-gpus Gres=gpu:tesla:1,gpu:quadro:1
>> NodeName=gpu[01-07] Gres=gpu:1
>> # Generic resources types
>> GresTypes=gpu,mic
>>
>> and on gpu08 I added the following configuration to the gres.conf file:
>>
>> Name=gpu File=/dev/nvidia0 Type=tesla
>> Name=gpu File=/dev/nvidia1 Type=quadro
>>
>> I added nothing to the gres.conf file on the controller.
>>
>> I believe that this type information has propagated to slurmctld,
>> as calling sinfo gives the following output:
>>
>> $ sinfo -o "%10P %.5a %.15l %.6D %.6t %25G %N"
>> PARTITION  AVAIL       TIMELIMIT  NODES  STATE GRES                      NODELIST
>> longq         up      7-00:00:00      1  drain (null)                    mic02
>> longq         up      7-00:00:00      1  alloc gpu:1                     gpu06
>> longq         up      7-00:00:00      6   idle gpu:1                     gpu[01-05,07]
>> longq         up      7-00:00:00      1   idle gpu:tesla:1,gpu:quadro:1  gpu08
>> longq         up      7-00:00:00      1   idle (null)                    mic01
>> testq*        up           15:00      1  drain (null)                    mic02
>> testq*        up           15:00      1  alloc gpu:1                     gpu06
>> testq*        up           15:00      6   idle gpu:1                     gpu[01-05,07]
>> testq*        up           15:00      1   idle gpu:tesla:1,gpu:quadro:1  gpu08
>> testq*        up           15:00      1   idle (null)                    mic01
>>
>> When I try to allocate the resource using salloc, though, I get the
>> following error message:
>>
>> $ salloc --gres=gpu:tesla:1
>> salloc: error: Job submit/allocate failed: Requested node configuration is not available
>> salloc: Job allocation 544 has been revoked.
>>
>> Doing a normal allocation without the `--gres' flag works, but when I
>> try the following it fails as well:
>>
>> $ salloc -w gpu08 --gres=gpu:1
>> salloc: error: Job submit/allocate failed: Requested node configuration is not available
>> salloc: Job allocation 545 has been revoked.
>>
>> Activating the Gres DebugFlag produces the following output in the
>> slurmctld.log file:
>>
>> [2017-02-06T14:22:46.990] gres/gpu: state for gpu08
>> [2017-02-06T14:22:46.990]   gres_cnt found:2 configured:2 avail:2 alloc:0
>> [2017-02-06T14:22:46.990]   gres_bit_alloc:
>> [2017-02-06T14:22:46.990]   gres_used:(null)
>> [2017-02-06T14:22:46.990]   topo_cpus_bitmap[0]:NULL
>> [2017-02-06T14:22:46.990]   topo_gres_bitmap[0]:0
>> [2017-02-06T14:22:46.990]   topo_gres_cnt_alloc[0]:0
>> [2017-02-06T14:22:46.990]   topo_gres_cnt_avail[0]:1
>> [2017-02-06T14:22:46.990]   type[0]:tesla
>> [2017-02-06T14:22:46.990]   topo_cpus_bitmap[1]:NULL
>> [2017-02-06T14:22:46.990]   topo_gres_bitmap[1]:1
>> [2017-02-06T14:22:46.990]   topo_gres_cnt_alloc[1]:0
>> [2017-02-06T14:22:46.990]   topo_gres_cnt_avail[1]:1
>> [2017-02-06T14:22:46.990]   type[1]:quadro
>> [2017-02-06T14:22:46.990]   type_cnt_alloc[0]:0
>> [2017-02-06T14:22:46.990]   type_cnt_avail[0]:1
>> [2017-02-06T14:22:46.990]   type[0]:tesla
>> [2017-02-06T14:22:46.990]   type_cnt_alloc[1]:0
>> [2017-02-06T14:22:46.990]   type_cnt_avail[1]:1
>> [2017-02-06T14:22:46.990]   type[1]:quadro
>> [2017-02-06T14:22:46.990] gres/mic: state for gpu08
>> [2017-02-06T14:22:46.990]   gres_cnt found:0 configured:0 avail:0 alloc:0
>> [2017-02-06T14:22:46.990]   gres_bit_alloc:NULL
>> [2017-02-06T14:22:46.990]   gres_used:(null)
>>
>> So here I am, a bit stumped.
>>
>> Is there something wrong with my configuration? Am I missing anything,
>> or have I overlooked something?
>>
>> Any help with this would be greatly appreciated!
>>
>> Sincerely,
>> Hans Viessmann
>>
>> P.s. please find attached the slurm.conf file, the slurmctld.log file,
>> and the gres.conf file.
>
> --
> Robbert Eggermont                Intelligent Systems
> r.eggerm...@tudelft.nl           Electr.Eng., Mathematics & Comp.Science
> +31 15 27 83234                  Delft University of Technology
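P.p.s. Regarding the "lacks File parameter" errors above: my current guess is that each of gpu[01-07] also needs its own gres.conf with a File parameter (and a Type, if type-based requests should match those nodes) - something along these lines, where the device path and type name are assumptions on my part:

    Name=gpu File=/dev/nvidia0 Type=tesla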