Daniel,

"I can handle that temporarily with node features instead but I'd prefer utilizing the gpu types."
Guilty of reading your response too quickly...

John DeSantis

2015-05-06 15:22 GMT-04:00 John Desantis <[email protected]>:
> Daniel,
>
> Instead of defining the GPU type in our Gres configuration (global
> with hostnames, no count), we simply add a feature so that users can
> request a GPU (or GPUs) via Gres and the specific model via a
> constraint. This may help out the situation so that your users can
> request a specific GPU model:
>
> srun --gres=gpu:1 -C "gpu_k20"
>
> I didn't think of it at the time, but I remember running --gres=help
> when initially setting up GPUs to help rule out errors. I don't know
> if you ran that command or not, but it's worth a shot to verify that
> Gres types are being seen correctly on a node by the controller. I
> also wonder whether using a cluster-wide Gres definition (vs. only on
> the nodes in question) would make a difference.
>
> John DeSantis
>
> 2015-05-06 15:12 GMT-04:00 Daniel Weber <[email protected]>:
>>
>> Hi John,
>>
>> I already tried using "Count=1" for each line, as well as "Count=8"
>> for a single configuration line.
>>
>> I "solved" (or rather, circumvented) the problem by removing the
>> "Type=..." specifications from the "gres.conf" files and from
>> slurm.conf.
>>
>> The jobs now run successfully, but without the possibility of
>> requesting a certain GPU type.
>>
>> The generic resource examples on schedmd.com explicitly show the
>> "Type" specifications on GPUs, and I really would like to use them.
>> I can handle that temporarily with node features instead, but I'd
>> prefer utilizing the GPU types.
>>
>> Thank you for your help (and the hint in the right direction).
>>
>> Kind regards
>> Daniel
>>
>>
>> -----Original Message-----
>> From: John Desantis [mailto:[email protected]]
>> Sent: Wednesday, May 6, 2015 18:16
>> To: slurm-dev
>> Subject: [slurm-dev] Re: slurm-dev Re: Job allocation for GPU jobs
>> doesn't work using gpu plugin (node configuration not available)
>>
>>
>> Daniel,
>>
>> What about a count? Try adding a Count=1 after each of your GPU lines.
>>
>> John DeSantis
>>
>> 2015-05-06 11:54 GMT-04:00 Daniel Weber <[email protected]>:
>>>
>>> The same "problem" occurs when using the Gres type in the srun syntax
>>> (i.e. using --gres=gpu:tesla:1).
>>>
>>> Regards,
>>> Daniel
>>>
>>> --
>>> From: John Desantis [mailto:[email protected]]
>>> Sent: Wednesday, May 6, 2015 17:39
>>> To: slurm-dev
>>> Subject: [slurm-dev] Re: Job allocation for GPU jobs doesn't work
>>> using gpu plugin (node configuration not available)
>>>
>>>
>>> Daniel,
>>>
>>> We don't specify types in our Gres configuration, simply the resource.
>>>
>>> What happens if you update your srun syntax to:
>>>
>>> srun -n1 --gres=gpu:tesla:1
>>>
>>> Does that dispatch the job?
>>>
>>> John DeSantis
>>>
>>> 2015-05-06 9:40 GMT-04:00 Daniel Weber <[email protected]>:
>>>> Hello,
>>>>
>>>> currently I'm trying to set up SLURM on a GPU cluster with a small
>>>> number of nodes (where smurf0[1-7] are the node names), using the
>>>> gpu plugin to allocate jobs (requiring GPUs).
>>>>
>>>> Unfortunately, when trying to run a GPU job (any number of GPUs;
>>>> --gres=gpu:N), SLURM doesn't execute it, asserting unavailability of
>>>> the requested configuration.
>>>> I attached some logs and configuration text files in order to provide
>>>> any information necessary to analyze this issue.
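A minimal sketch of John's feature-plus-constraint workaround mentioned above. The node names, feature names (like "gpu_k20"), and counts here are illustrative, not from the thread's actual cluster: the GPU model is advertised as a node feature, while the Gres definition stays untyped.

```
# slurm.conf: untyped gpu Gres; the model is exposed via Feature
GresTypes=gpu
NodeName=gpunode01 Gres=gpu:8 Feature="gpu_k20"
NodeName=gpunode02 Gres=gpu:8 Feature="gpu_m2070"

# gres.conf on each node: no Type= specification
Name=gpu Count=8 File=/dev/nvidia[0-7]
```

A user then requests a GPU count via Gres and a specific model via a constraint, e.g. `srun --gres=gpu:1 -C "gpu_k20" ...`.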
>>>>
>>>> Note: Cross posted here: http://serverfault.com/questions/685258
>>>>
>>>> Example (using some test.sh which echoes $CUDA_VISIBLE_DEVICES):
>>>>
>>>> srun -n1 --gres=gpu:1 test.sh
>>>> --> srun: error: Unable to allocate resources: Requested node
>>>> configuration is not available
>>>>
>>>> The slurmctld log for such calls shows:
>>>>
>>>> gres: gpu state for job X
>>>>   gres_cnt:1 node_cnt:1 type:(null)
>>>> _pick_best_nodes: job X never runnable
>>>> _slurm_rpc_allocate_resources: Requested node configuration
>>>> is not available
>>>>
>>>> Jobs with any other type of configured generic resource complete
>>>> successfully:
>>>>
>>>> srun -n1 --gres=gram:500 test.sh
>>>> --> CUDA_VISIBLE_DEVICES=NoDevFiles
>>>>
>>>> The node and Gres configuration in slurm.conf (which is attached as
>>>> well) looks like:
>>>>
>>>> GresTypes=gpu,ram,gram,scratch
>>>> ...
>>>> NodeName=smurf01 NodeAddr=192.168.1.101 Feature="intel,fermi" Boards=1
>>>> SocketsPerBoard=2 CoresPerSocket=6 ThreadsPerCore=2
>>>> Gres=gpu:tesla:8,ram:48,gram:no_consume:6000,scratch:1300
>>>> NodeName=smurf02 NodeAddr=192.168.1.102 Feature="intel,fermi" Boards=1
>>>> SocketsPerBoard=2 CoresPerSocket=6 ThreadsPerCore=1
>>>> Gres=gpu:tesla:8,ram:48,gram:no_consume:6000,scratch:1300
>>>>
>>>> The respective gres.conf files are:
>>>>
>>>> Name=gpu Count=8 Type=tesla File=/dev/nvidia[0-7]
>>>> Name=ram Count=48
>>>> Name=gram Count=6000
>>>> Name=scratch Count=1300
>>>>
>>>> The output of "scontrol show node" lists all the nodes with the
>>>> correct Gres configuration, i.e.:
>>>>
>>>> NodeName=smurf01 Arch=x86_64 CoresPerSocket=6
>>>> CPUAlloc=0 CPUErr=0 CPUTot=24 CPULoad=0.01 Features=intel,fermi
>>>> Gres=gpu:tesla:8,ram:48,gram:no_consume:6000,scratch:1300
>>>> ...etc.
>>>>
>>>> As far as I can tell, the slurmd daemon on the nodes recognizes the
>>>> GPUs (and other generic resources) correctly.
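The test.sh script used in the examples above is not attached to the thread; a minimal sketch of what it presumably does is below. Note that Slurm itself sets CUDA_VISIBLE_DEVICES for the job step (to "NoDevFiles" when no device files are bound); the fallback here only mimics that behaviour when run outside a job.

```shell
#!/bin/sh
# Sketch of test.sh: report which GPU devices Slurm granted the job step.
# The "NoDevFiles" fallback is an assumption mimicking Slurm's no-GPU value.
msg="CUDA_VISIBLE_DEVICES=${CUDA_VISIBLE_DEVICES:-NoDevFiles}"
echo "$msg"
```

Dispatched via `srun -n1 --gres=gpu:1 test.sh`, a successful allocation would print the device indices granted to the step.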
>>>>
>>>> My slurmd.log on node smurf01 says:
>>>>
>>>> Gres Name=gpu Type=tesla Count=8 ID=7696487 File=/dev/nvidia[0-7]
>>>>
>>>> The slurmctld log shows:
>>>>
>>>> gres/gpu: state for smurf01
>>>>   gres_cnt found:8 configured:8 avail:8 alloc:0
>>>>   gres_bit_alloc:
>>>>   gres_used:(null)
>>>>
>>>> I can't figure out why the controller node states that jobs using
>>>> --gres=gpu:N are "never runnable" and why "the requested node
>>>> configuration is not available".
>>>> Any help is appreciated.
>>>>
>>>> Kind regards,
>>>> Daniel Weber
>>>>
>>>> PS: If further information is required, don't hesitate to ask.
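The verification steps suggested in the thread can be collected into a short checklist. These commands must run against a live Slurm cluster, so their output is illustrative rather than reproducible here; `smurf01` is the node name from Daniel's configuration.

```
# On a submit host: list the Gres names (and types) Slurm knows about
srun --gres=help

# Confirm what the controller believes the node offers
scontrol show node smurf01 | grep -i gres

# Temporarily raise controller verbosity to see gres scheduling decisions
scontrol setdebug debug2
```

Comparing the `--gres=help` and `scontrol show node` output against gres.conf is a quick way to spot a node/controller mismatch like the one in this thread.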
