Hi Robbert,

I had just added the DebugFlags setting to slurm.conf on the head node
and did not synchronise it with the nodes. I doubt that this could cause the
problem I described, as it was occurring before I made the change to
slurm.conf.

One thing I did notice is this error occurring every once in a while:

[2016-12-30T17:36:50.963] error: gres_plugin_node_config_unpack: gres/gpu lacks File parameter for node gpu07
[2016-12-30T17:36:50.963] error: gres_plugin_node_config_unpack: gres/gpu lacks File parameter for node gpu04
[2016-12-30T17:36:50.963] error: gres_plugin_node_config_unpack: gres/gpu lacks File parameter for node gpu01
[2016-12-30T17:36:50.963] error: gres_plugin_node_config_unpack: gres/gpu lacks File parameter for node gpu05
[2016-12-30T17:36:50.963] error: gres_plugin_node_config_unpack: gres/gpu lacks File parameter for node gpu02
[2016-12-30T17:36:50.964] error: gres_plugin_node_config_unpack: gres/gpu lacks File parameter for node gpu06
[2016-12-30T17:36:50.966] error: gres_plugin_node_config_unpack: gres/gpu lacks File parameter for node gpu03

Is it possible that I need to specify the Gres Type for the other nodes as
well, even though they have only one GPU each?
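
For reference, the "lacks File parameter" errors above make me suspect that
the gres.conf on the single-GPU nodes defines the GPU without a File= line.
A minimal sketch of what I would expect in gres.conf on each of gpu[01-07],
assuming the GPU appears as /dev/nvidia0 on those nodes:

Name=gpu File=/dev/nvidia0

(Whether those nodes also need a Type= entry to satisfy a typed --gres
request is exactly what I'm unsure about.)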

Sincerely,
Hans

On 6 February 2017 at 15:03, Robbert Eggermont <r.eggerm...@tudelft.nl>
wrote:

>
> Hi Hans,
>
> Your log shows that your slurm.conf is out of sync on some nodes; is that
> on purpose?
>
> What happens when you synchronise the slurm.conf on all nodes, restart
> slurmctld and restart slurmd on all nodes?
>
> Best,
>
> Robbert
>
> On 06-02-17 15:45, Hans-Nikolai Viessmann wrote:
>
>> Hi all,
>>
>> Over the weekend I tried to set up Gres device type allocation but stumbled
>> onto a bit of a problem.
>>
>> The setup I'm working with is a cluster with Bright Cluster Manager
>> installed, managing 10 nodes in total. Eight of these nodes contain GPUs -
>> gpu[01-07] contain one GPU each, whereas gpu08 contains two GPUs. The GPUs
>> in gpu08 are not the same: one is a Tesla device and the other a Quadro.
>> The other two nodes have MICs on them, but these have not been configured
>> yet.
>>
>> Some software version information:
>>
>>   * Bright Cluster Manager: 7.3 running on SL 7.2
>>   * SLURM: 16.05.2
>>
>> As per https://slurm.schedmd.com/gres.html, I set up my slurm.conf file to
>> have the Gres line with the type specification - excerpt (full slurm.conf
>> attached):
>>
>> # Nodes
>> NodeName=mic[01-02]
>> NodeName=gpu08  Feature=multiple-gpus Gres=gpu:tesla:1,gpu:quadro:1
>> NodeName=gpu[01-07]  Gres=gpu:1
>> # Generic resources types
>> GresTypes=gpu,mic
>>
>> and on gpu08 I added the following configuration to the gres.conf file:
>>
>> Name=gpu File=/dev/nvidia0 Type=tesla
>> Name=gpu File=/dev/nvidia1 Type=quadro
>>
>> I added nothing to the controller's gres.conf file.
>>
>>
>> I believe that this type information has propagated to slurmctld, as
>> calling sinfo gives the following output:
>>
>> $ sinfo -o "%10P %.5a %.15l %.6D %.6t %25G %N"
>> PARTITION  AVAIL       TIMELIMIT  NODES  STATE GRES                      NODELIST
>> longq         up      7-00:00:00      1  drain (null)                    mic02
>> longq         up      7-00:00:00      1  alloc gpu:1                     gpu06
>> longq         up      7-00:00:00      6   idle gpu:1                     gpu[01-05,07]
>> longq         up      7-00:00:00      1   idle gpu:tesla:1,gpu:quadro:1  gpu08
>> longq         up      7-00:00:00      1   idle (null)                    mic01
>> testq*        up           15:00      1  drain (null)                    mic02
>> testq*        up           15:00      1  alloc gpu:1                     gpu06
>> testq*        up           15:00      6   idle gpu:1                     gpu[01-05,07]
>> testq*        up           15:00      1   idle gpu:tesla:1,gpu:quadro:1  gpu08
>> testq*        up           15:00      1   idle (null)                    mic01
>>
>> When I try to allocate the resource using salloc, though, I get the
>> following error message:
>>
>> $ salloc --gres=gpu:tesla:1
>> salloc: error: Job submit/allocate failed: Requested node configuration is not available
>> salloc: Job allocation 544 has been revoked.
>>
>> Doing a normal allocation without the `--gres=<>' flag works, but when I
>> try the following it fails as well:
>>
>> $ salloc -w gpu08 --gres=gpu:1
>> salloc: error: Job submit/allocate failed: Requested node configuration is not available
>> salloc: Job allocation 545 has been revoked.
>>
>> Activating the DebugFlag Gres provides the following output in the
>> slurmctld.log file:
>>
>> [2017-02-06T14:22:46.990] gres/gpu: state for gpu08
>> [2017-02-06T14:22:46.990]   gres_cnt found:2 configured:2 avail:2 alloc:0
>> [2017-02-06T14:22:46.990]   gres_bit_alloc:
>> [2017-02-06T14:22:46.990]   gres_used:(null)
>> [2017-02-06T14:22:46.990]   topo_cpus_bitmap[0]:NULL
>> [2017-02-06T14:22:46.990]   topo_gres_bitmap[0]:0
>> [2017-02-06T14:22:46.990]   topo_gres_cnt_alloc[0]:0
>> [2017-02-06T14:22:46.990]   topo_gres_cnt_avail[0]:1
>> [2017-02-06T14:22:46.990]   type[0]:tesla
>> [2017-02-06T14:22:46.990]   topo_cpus_bitmap[1]:NULL
>> [2017-02-06T14:22:46.990]   topo_gres_bitmap[1]:1
>> [2017-02-06T14:22:46.990]   topo_gres_cnt_alloc[1]:0
>> [2017-02-06T14:22:46.990]   topo_gres_cnt_avail[1]:1
>> [2017-02-06T14:22:46.990]   type[1]:quadro
>> [2017-02-06T14:22:46.990]   type_cnt_alloc[0]:0
>> [2017-02-06T14:22:46.990]   type_cnt_avail[0]:1
>> [2017-02-06T14:22:46.990]   type[0]:tesla
>> [2017-02-06T14:22:46.990]   type_cnt_alloc[1]:0
>> [2017-02-06T14:22:46.990]   type_cnt_avail[1]:1
>> [2017-02-06T14:22:46.990]   type[1]:quadro
>> [2017-02-06T14:22:46.990] gres/mic: state for gpu08
>> [2017-02-06T14:22:46.990]   gres_cnt found:0 configured:0 avail:0 alloc:0
>> [2017-02-06T14:22:46.990]   gres_bit_alloc:NULL
>> [2017-02-06T14:22:46.990]   gres_used:(null)
>>
>> So here I am, a bit stumped.
>>
>> Having said that, is there something wrong with my configuration? Have I
>> missed or overlooked something?
>>
>> Any help with this would be greatly appreciated!
>>
>> Sincerely,
>> Hans Viessmann
>>
>> P.S. Please find attached the slurm.conf file, the slurmctld.log file, and
>> the gres.conf file.
>>
>>
>>
>
> --
> Robbert Eggermont                                  Intelligent Systems
> r.eggerm...@tudelft.nl         Electr.Eng., Mathematics & Comp.Science
> +31 15 27 83234                         Delft University of Technology
>
