Daniel,

What about a count?  Try adding a count=1 after each of your GPU lines.

John DeSantis
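
For reference, one way to read that suggestion (a sketch only, and my own interpretation: one gres.conf line per GPU device file with an explicit Count=1, instead of a single bracketed range):

```
# Hypothetical gres.conf fragment -- one line per device, each with Count=1
Name=gpu Type=tesla File=/dev/nvidia0 Count=1
Name=gpu Type=tesla File=/dev/nvidia1 Count=1
# ...repeated for the remaining /dev/nvidia devices
```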

2015-05-06 11:54 GMT-04:00 Daniel Weber <[email protected]>:
>
> The same "problem" occurs when specifying the gres type in the srun syntax 
> (e.g. --gres=gpu:tesla:1).
>
> Regards,
> Daniel
>
> --
> From: John DeSantis [mailto:[email protected]]
> Sent: Wednesday, 6 May 2015 17:39
> To: slurm-dev
> Betreff: [slurm-dev] Re: Job allocation for GPU jobs doesn't work using gpu 
> plugin (node configuration not available)
>
>
> Daniel,
>
> We don't specify types in our Gres configuration, simply the resource.
>
> What happens if you update your srun syntax to:
>
> srun -n1 --gres=gpu:tesla:1
>
> Does that dispatch the job?
>
> John DeSantis
>
> 2015-05-06 9:40 GMT-04:00 Daniel Weber <[email protected]>:
>> Hello,
>>
>> Currently I'm trying to set up SLURM on a GPU cluster with a small
>> number of nodes (where smurf0[1-7] are the node names), using the gpu
>> plugin to allocate jobs that require GPUs.
>>
>> Unfortunately, when I try to run a GPU job (with any number of GPUs;
>> --gres=gpu:N), SLURM refuses to execute it, claiming the requested
>> configuration is not available.
>> I attached some logs and configuration text files in order to provide
>> any information necessary to analyze this issue.
>>
>> Note: Cross posted here: http://serverfault.com/questions/685258
>>
>> Example (using a test.sh which echoes $CUDA_VISIBLE_DEVICES):
>>
>>     srun -n1 --gres=gpu:1 test.sh
>>         --> srun: error: Unable to allocate resources: Requested node
>> configuration is not available
>>
>> The slurmctld log for such calls shows:
>>
>>     gres: gpu state for job X
>>         gres_cnt:1 node_cnt:1 type:(null)
>>         _pick_best_nodes: job X never runnable
>>         _slurm_rpc_allocate_resources: Requested node configuration is
>> not available
>>
>> Jobs with any other type of configured generic resource complete
>> successfully:
>>
>>     srun -n1 --gres=gram:500 test.sh
>>         --> CUDA_VISIBLE_DEVICES=NoDevFiles
>>
>> The node and gres configuration in slurm.conf (also attached) looks
>> like this:
>>
>>     GresTypes=gpu,ram,gram,scratch
>>     ...
>>     NodeName=smurf01 NodeAddr=192.168.1.101 Feature="intel,fermi" Boards=1 SocketsPerBoard=2 CoresPerSocket=6 ThreadsPerCore=2 Gres=gpu:tesla:8,ram:48,gram:no_consume:6000,scratch:1300
>>     NodeName=smurf02 NodeAddr=192.168.1.102 Feature="intel,fermi" Boards=1 SocketsPerBoard=2 CoresPerSocket=6 ThreadsPerCore=1 Gres=gpu:tesla:8,ram:48,gram:no_consume:6000,scratch:1300
>>
>> The respective gres.conf files contain:
>>     Name=gpu Count=8 Type=tesla File=/dev/nvidia[0-7]
>>     Name=ram Count=48
>>     Name=gram Count=6000
>>     Name=scratch Count=1300
>>
>> The output of "scontrol show node" lists all nodes with the correct
>> gres configuration, e.g.:
>>
>>     NodeName=smurf01 Arch=x86_64 CoresPerSocket=6
>>        CPUAlloc=0 CPUErr=0 CPUTot=24 CPULoad=0.01 Features=intel,fermi
>>        Gres=gpu:tesla:8,ram:48,gram:no_consume:6000,scratch:1300
>>        ...etc.
>>
>> As far as I can tell, the slurmd daemon on the nodes recognizes the
>> gpus (and other generic resources) correctly.
>>
>> My slurmd.log on node smurf01 says
>>
>>     Gres Name = gpu Type = tesla Count = 8 ID = 7696487 File = /dev/nvidia[0-7]
>>
>> The log for slurmctld shows
>>
>>     gres / gpu: state for smurf01
>>        gres_cnt found : 8 configured : 8 avail : 8 alloc : 0
>>        gres_bit_alloc :
>>        gres_used : (null)
>>
>> I can't figure out why the controller node states that jobs using
>> --gres=gpu:N are "never runnable" and why "the requested node
>> configuration is not available".
>> Any help is appreciated.
>>
>> Kind regards,
>> Daniel Weber
>>
>> PS: If further information is required, don't hesitate to ask.
