Daniel,

"I can handle that temporarily with node features instead but I'd prefer utilizing the gpu types."
Guilty of reading your response too quickly...

John DeSantis

2015-05-06 15:22 GMT-04:00 John Desantis <[email protected]>:
> Daniel,
>
> Instead of defining the GPU type in our Gres configuration (global
> with hostnames, no count), we simply add a feature so that users can
> request a GPU (or GPUs) via Gres and the specific model via a
> constraint. This may help out the situation so that your users can
> request a specific GPU model:
>
> srun --gres=gpu:1 -C "gpu_k20"
>
> I didn't think of it at the time, but I remember running --gres=help
> when initially setting up GPUs to help rule out errors. I don't know
> if you ran that command or not, but it's worth a shot to verify that
> Gres types are being seen correctly on a node by the controller. I
> also wonder whether using a cluster-wide Gres definition (vs. only on
> the nodes in question) would make a difference.
>
> John DeSantis
>
> 2015-05-06 15:12 GMT-04:00 Daniel Weber <[email protected]>:
>>
>> Hi John,
>>
>> I already tried using "Count=1" for each line, as well as "Count=8"
>> for a single configuration line.
>>
>> I "solved" (or rather, circumvented) the problem by removing the
>> "Type=..." specifications from the "gres.conf" files and from
>> slurm.conf.
>>
>> The jobs now run successfully, but without the possibility of
>> requesting a certain GPU type.
>>
>> The generic resource examples on schedmd.com explicitly show the
>> "Type" specifications on GPUs, and I really would like to use them.
>> I can handle that temporarily with node features instead, but I'd
>> prefer utilizing the GPU types.
>>
>> Thank you for your help (and the hint in the right direction).
>>
>> Kind regards
>> Daniel
>>
>>
>> -----Original Message-----
>> From: John Desantis [mailto:[email protected]]
>> Sent: Wednesday, May 6, 2015 18:16
>> To: slurm-dev
>> Subject: [slurm-dev] Re: slurm-dev Re: Job allocation for GPU jobs
>> doesn't work using gpu plugin (node configuration not available)
>>
>>
>> Daniel,
>>
>> What about a count? Try adding a Count=1 after each of your GPU lines.
>>
>> John DeSantis
>>
>> 2015-05-06 11:54 GMT-04:00 Daniel Weber <[email protected]>:
>>>
>>> The same "problem" occurs when using the Gres type in the srun syntax
>>> (i.e. using --gres=gpu:tesla:1).
>>>
>>> Regards,
>>> Daniel
>>>
>>> --
>>> From: John Desantis [mailto:[email protected]]
>>> Sent: Wednesday, May 6, 2015 17:39
>>> To: slurm-dev
>>> Subject: [slurm-dev] Re: Job allocation for GPU jobs doesn't work
>>> using gpu plugin (node configuration not available)
>>>
>>>
>>> Daniel,
>>>
>>> We don't specify types in our Gres configuration, simply the resource.
>>>
>>> What happens if you update your srun syntax to:
>>>
>>> srun -n1 --gres=gpu:tesla:1
>>>
>>> Does that dispatch the job?
>>>
>>> John DeSantis
>>>
>>> 2015-05-06 9:40 GMT-04:00 Daniel Weber <[email protected]>:
>>>> Hello,
>>>>
>>>> currently I'm trying to set up SLURM on a GPU cluster with a small
>>>> number of nodes (where smurf0[1-7] are the node names), using the
>>>> gpu plugin to allocate jobs (requiring GPUs).
>>>>
>>>> Unfortunately, when trying to run a GPU job (any number of GPUs;
>>>> --gres=gpu:N), SLURM doesn't execute it, asserting unavailability of
>>>> the requested configuration.
>>>> I attached some logs and configuration text files in order to provide
>>>> any information necessary to analyze this issue.
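A minimal sketch of John's feature-plus-constraint workaround mentioned above. The node names, feature names (like "gpu_k20"), and counts here are illustrative, not from the thread's actual cluster: the GPU model is advertised as a node feature, while the Gres definition stays untyped.

```
# slurm.conf: untyped gpu Gres; the model is exposed via Feature
GresTypes=gpu
NodeName=gpunode01 Gres=gpu:8 Feature="gpu_k20"
NodeName=gpunode02 Gres=gpu:8 Feature="gpu_m2070"

# gres.conf on each node: no Type= specification
Name=gpu Count=8 File=/dev/nvidia[0-7]
```

A user then requests a GPU count via Gres and a specific model via a constraint, e.g. `srun --gres=gpu:1 -C "gpu_k20" ...`.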
>>>>
>>>> Note: Cross posted here: http://serverfault.com/questions/685258
>>>>
>>>> Example (using some test.sh which echoes $CUDA_VISIBLE_DEVICES):
>>>>
>>>> srun -n1 --gres=gpu:1 test.sh
>>>> --> srun: error: Unable to allocate resources: Requested node
>>>> configuration is not available
>>>>
>>>> The slurmctld log for such calls shows:
>>>>
>>>> gres: gpu state for job X
>>>>   gres_cnt:1 node_cnt:1 type:(null)
>>>> _pick_best_nodes: job X never runnable
>>>> _slurm_rpc_allocate_resources: Requested node configuration
>>>> is not available
>>>>
>>>> Jobs with any other type of configured generic resource complete
>>>> successfully:
>>>>
>>>> srun -n1 --gres=gram:500 test.sh
>>>> --> CUDA_VISIBLE_DEVICES=NoDevFiles
>>>>
>>>> The node and Gres configuration in slurm.conf (which is attached as
>>>> well) looks like:
>>>>
>>>> GresTypes=gpu,ram,gram,scratch
>>>> ...
>>>> NodeName=smurf01 NodeAddr=192.168.1.101 Feature="intel,fermi" Boards=1
>>>> SocketsPerBoard=2 CoresPerSocket=6 ThreadsPerCore=2
>>>> Gres=gpu:tesla:8,ram:48,gram:no_consume:6000,scratch:1300
>>>> NodeName=smurf02 NodeAddr=192.168.1.102 Feature="intel,fermi" Boards=1
>>>> SocketsPerBoard=2 CoresPerSocket=6 ThreadsPerCore=1
>>>> Gres=gpu:tesla:8,ram:48,gram:no_consume:6000,scratch:1300
>>>>
>>>> The respective gres.conf files are:
>>>>
>>>> Name=gpu Count=8 Type=tesla File=/dev/nvidia[0-7]
>>>> Name=ram Count=48
>>>> Name=gram Count=6000
>>>> Name=scratch Count=1300
>>>>
>>>> The output of "scontrol show node" lists all the nodes with the
>>>> correct Gres configuration, i.e.:
>>>>
>>>> NodeName=smurf01 Arch=x86_64 CoresPerSocket=6
>>>> CPUAlloc=0 CPUErr=0 CPUTot=24 CPULoad=0.01 Features=intel,fermi
>>>> Gres=gpu:tesla:8,ram:48,gram:no_consume:6000,scratch:1300
>>>> ...etc.
>>>>
>>>> As far as I can tell, the slurmd daemon on the nodes recognizes the
>>>> GPUs (and other generic resources) correctly.
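The test.sh script used in the examples above is not attached to the thread; a minimal sketch of what it presumably does is below. Note that Slurm itself sets CUDA_VISIBLE_DEVICES for the job step (to "NoDevFiles" when no device files are bound); the fallback here only mimics that behaviour when run outside a job.

```shell
#!/bin/sh
# Sketch of test.sh: report which GPU devices Slurm granted the job step.
# The "NoDevFiles" fallback is an assumption mimicking Slurm's no-GPU value.
msg="CUDA_VISIBLE_DEVICES=${CUDA_VISIBLE_DEVICES:-NoDevFiles}"
echo "$msg"
```

Dispatched via `srun -n1 --gres=gpu:1 test.sh`, a successful allocation would print the device indices granted to the step.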
>>>>
>>>> My slurmd.log on node smurf01 says:
>>>>
>>>> Gres Name=gpu Type=tesla Count=8 ID=7696487 File=/dev/nvidia[0-7]
>>>>
>>>> The slurmctld log shows:
>>>>
>>>> gres/gpu: state for smurf01
>>>>   gres_cnt found:8 configured:8 avail:8 alloc:0
>>>>   gres_bit_alloc:
>>>>   gres_used:(null)
>>>>
>>>> I can't figure out why the controller node states that jobs using
>>>> --gres=gpu:N are "never runnable" and why "the requested node
>>>> configuration is not available".
>>>> Any help is appreciated.
>>>>
>>>> Kind regards,
>>>> Daniel Weber
>>>>
>>>> PS: If further information is required, don't hesitate to ask.
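The verification steps suggested in the thread can be collected into a short checklist. These commands must run against a live Slurm cluster, so their output is illustrative rather than reproducible here; `smurf01` is the node name from Daniel's configuration.

```
# On a submit host: list the Gres names (and types) Slurm knows about
srun --gres=help

# Confirm what the controller believes the node offers
scontrol show node smurf01 | grep -i gres

# Temporarily raise controller verbosity to see gres scheduling decisions
scontrol setdebug debug2
```

Comparing the `--gres=help` and `scontrol show node` output against gres.conf is a quick way to spot a node/controller mismatch like the one in this thread.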
