Daniel,

Instead of defining the GPU type in our Gres configuration (global
with hostnames, no count), we simply add a node feature so that users
can request a GPU (or GPUs) via Gres and the specific model via a
constraint.  This may help your situation, since your users could
still request a specific GPU model:

srun --gres=gpu:1 -C "gpu_k20"
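For reference, a minimal sketch of that kind of setup (the node name,
feature name, and device paths here are illustrative, not from your
cluster):

    # slurm.conf: advertise the GPU model as a node feature, keep the Gres untyped
    NodeName=gpunode01 Feature="gpu_k20" Gres=gpu:8 ...
    # gres.conf on that node: no Type= specification
    Name=gpu Count=8 File=/dev/nvidia[0-7]

Users then request a device through Gres and pin the model with -C.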

I didn't think of it at the time, but I remember running --gres=help
when initially setting up GPUs to help rule out errors.  I don't know
if you ran that command or not, but it's worth a shot to verify that
the Gres types on a node are being seen correctly by the controller.
I also wonder whether using a cluster-wide Gres definition (vs. one
only on the nodes in question) would make a difference.
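If memory serves, newer Slurm releases also accept a NodeName= prefix
in gres.conf, so a single cluster-wide file could look roughly like
this (untested sketch, reusing the node names from your mail):

    NodeName=smurf0[1-7] Name=gpu Count=8 Type=tesla File=/dev/nvidia[0-7]

If that variant behaves differently from the per-node files, it would
at least narrow down where the Type= handling goes wrong.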

John DeSantis


2015-05-06 15:12 GMT-04:00 Daniel Weber <[email protected]>:
>
> Hi John,
>
> I already tried using "Count=1" on each line, as well as "Count=8" on a 
> single configuration line.
>
> I "solved" (or better circumvented) the problem by removing the "Type=..." 
> specifications from the "gres.conf" files and from the slurm.conf.
>
> The jobs now run successfully, but without the possibility of requesting a 
> certain GPU type.
>
> The generic resource examples on schedmd.com explicitly show the "Type" 
> specifications on GPUs and I really would like to use them.
> I can handle that temporarily with node features instead, but I'd prefer 
> to use the GPU types.
>
> Thank you for your help (and the hint in the right direction).
>
> Kind regards
> Daniel
>
>
> -----Original Message-----
> From: John Desantis [mailto:[email protected]]
> Sent: Wednesday, May 6, 2015 18:16
> To: slurm-dev
> Subject: [slurm-dev] Re: slurm-dev Re: Job allocation for GPU jobs doesn't 
> work using gpu plugin (node configuration not available)
>
>
> Daniel,
>
> What about a count?  Try adding a Count=1 after each of your GPU lines.
>
> John DeSantis
>
> 2015-05-06 11:54 GMT-04:00 Daniel Weber <[email protected]>:
>>
>> The same "problem" occurs when using the gres type in the srun syntax (e.g. 
>> --gres=gpu:tesla:1).
>>
>> Regards,
>> Daniel
>>
>> --
>> From: John Desantis [mailto:[email protected]]
>> Sent: Wednesday, May 6, 2015 17:39
>> To: slurm-dev
>> Subject: [slurm-dev] Re: Job allocation for GPU jobs doesn't work
>> using gpu plugin (node configuration not available)
>>
>>
>> Daniel,
>>
>> We don't specify types in our Gres configuration, simply the resource.
>>
>> What happens if you update your srun syntax to:
>>
>> srun -n1 --gres=gpu:tesla:1
>>
>> Does that dispatch the job?
>>
>> John DeSantis
>>
>> 2015-05-06 9:40 GMT-04:00 Daniel Weber <[email protected]>:
>>> Hello,
>>>
>>> I am currently trying to set up SLURM on a GPU cluster with a small
>>> number of nodes (named smurf0[1-7]), using the gpu plugin to
>>> allocate jobs that require GPUs.
>>>
>>> Unfortunately, when I try to run a GPU job (any number of GPUs;
>>> --gres=gpu:N), SLURM refuses to execute it, reporting that the
>>> requested configuration is not available.
>>> I attached some logs and configuration text files in order to provide
>>> any information necessary to analyze this issue.
>>>
>>> Note: Cross posted here: http://serverfault.com/questions/685258
>>>
>>> Example (using some test.sh which is echoing $CUDA_VISIBLE_DEVICES):
>>>
>>>     srun -n1 --gres=gpu:1 test.sh
>>>         --> srun: error: Unable to allocate resources: Requested node
>>> configuration is not available
>>>
>>> The slurmctld log for such calls shows:
>>>
>>>     gres: gpu state for job X
>>>         gres_cnt:1 node_cnt:1 type:(null)
>>>         _pick_best_nodes: job X never runnable
>>>         _slurm_rpc_allocate_resources: Requested node configuration
>>> is not available
>>>
>>> Jobs with any other type of configured generic resource complete
>>> successfully:
>>>
>>>     srun -n1 --gres=gram:500 test.sh
>>>         --> CUDA_VISIBLE_DEVICES=NoDevFiles
>>>
>>> The nodes and gres configuration in slurm.conf (which is attached as
>>> well) are like:
>>>
>>>     GresTypes=gpu,ram,gram,scratch
>>>     ...
>>>     NodeName=smurf01 NodeAddr=192.168.1.101 Feature="intel,fermi" Boards=1 SocketsPerBoard=2 CoresPerSocket=6 ThreadsPerCore=2 Gres=gpu:tesla:8,ram:48,gram:no_consume:6000,scratch:1300
>>>     NodeName=smurf02 NodeAddr=192.168.1.102 Feature="intel,fermi" Boards=1 SocketsPerBoard=2 CoresPerSocket=6 ThreadsPerCore=1 Gres=gpu:tesla:8,ram:48,gram:no_consume:6000,scratch:1300
>>>
>>> The respective gres.conf files contain:
>>>     Name=gpu Count=8 Type=tesla File=/dev/nvidia[0-7]
>>>     Name=ram Count=48
>>>     Name=gram Count=6000
>>>     Name=scratch Count=1300
>>>
>>> The output of "scontrol show node" lists all the nodes with the
>>> correct gres configuration, e.g.:
>>>
>>>     NodeName=smurf01 Arch=x86_64 CoresPerSocket=6
>>>        CPUAlloc=0 CPUErr=0 CPUTot=24 CPULoad=0.01 Features=intel,fermi
>>>        Gres=gpu:tesla:8,ram:48,gram:no_consume:6000,scratch:1300
>>>        ...etc.
>>>
>>> As far as I can tell, the slurmd daemon on the nodes recognizes the
>>> gpus (and other generic resources) correctly.
>>>
>>> My slurmd.log on node smurf01 says
>>>
>>>     Gres Name = gpu Type = tesla Count = 8 ID = 7696487 File = /dev/nvidia[0-7]
>>>
>>> The log for slurmctld shows
>>>
>>>     gres / gpu: state for smurf01
>>>        gres_cnt found : 8 configured : 8 avail : 8 alloc : 0
>>>        gres_bit_alloc :
>>>        gres_used : (null)
>>>
>>> I can't figure out why the controller node states that jobs using
>>> --gres=gpu:N are "never runnable" and why "the requested node
>>> configuration is not available".
>>> Any help is appreciated.
>>>
>>> Kind regards,
>>> Daniel Weber
>>>
>>> PS: If further information is required, don't hesitate to ask.
