Daniel,

You sparked an interest.
I was able to get Gres Types working by:

1.) Ensuring that the type was defined in slurm.conf for the nodes in question;
2.) Ensuring that the global gres.conf respected the type.

salloc -n 1 --gres=gpu:Tesla-T10:1
salloc: Pending job allocation 532507
salloc: job 532507 queued and waiting for resources

# slurm.conf
Nodename=blah CPUs=16 CoresPerSocket=4 Sockets=4 RealMemory=129055 Feature=ib_ddr,ib_ofa,sse,sse2,sse3,tpa,cpu_xeon,xeon_E7330,gpu_T10,titan,mem_128G Gres=gpu:Tesla-T10:2 Weight=1000

# gres.conf

2015-05-06 15:25 GMT-04:00 John Desantis <[email protected]>:
>
> Daniel,
>
> "I can handle that temporarily with node features instead but I'd
> prefer utilizing the gpu types."
>
> Guilty of reading your response too quickly...
>
> John DeSantis
>
> 2015-05-06 15:22 GMT-04:00 John Desantis <[email protected]>:
>> Daniel,
>>
>> Instead of defining the GPU type in our Gres configuration (global
>> with hostnames, no count), we simply add a feature so that users can
>> request a GPU (or GPUs) via Gres and the specific model via a
>> constraint. This may help your situation, so that your users can
>> request a specific GPU model:
>>
>> srun --gres=gpu:1 -C "gpu_k20"
>>
>> I didn't think of it at the time, but I remember running --gres=help
>> when initially setting up GPUs to help rule out errors. I don't know
>> if you ran that command or not, but it's worth a shot to verify that
>> Gres types are being seen correctly on a node by the controller. I
>> also wonder if using a cluster-wide Gres definition (vs. only on the
>> nodes in question) would make a difference or not.
>>
>> John DeSantis
>>
>>
>> 2015-05-06 15:12 GMT-04:00 Daniel Weber <[email protected]>:
>>>
>>> Hi John,
>>>
>>> I already tried using "Count=1" for each line, as well as "Count=8" for a
>>> single configuration line.
>>>
>>> I "solved" (or rather circumvented) the problem by removing the "Type=..."
>>> specifications from the "gres.conf" files and from the slurm.conf.
>>>
>>> The jobs are now running successfully, but without the possibility to
>>> request a certain GPU type.
>>>
>>> The generic resource examples on schedmd.com explicitly show the "Type"
>>> specifications on GPUs and I really would like to use them.
>>> I can handle that temporarily with node features instead but I'd prefer
>>> utilizing the gpu types.
>>>
>>> Thank you for your help (and the hint in the right direction).
>>>
>>> Kind regards
>>> Daniel
>>>
>>>
>>> -----Original Message-----
>>> From: John Desantis [mailto:[email protected]]
>>> Sent: Wednesday, 6 May 2015 18:16
>>> To: slurm-dev
>>> Subject: [slurm-dev] Re: slurm-dev Re: Job allocation for GPU jobs doesn't
>>> work using gpu plugin (node configuration not available)
>>>
>>>
>>> Daniel,
>>>
>>> What about a count? Try adding a Count=1 after each of your GPU lines.
>>>
>>> John DeSantis
>>>
>>> 2015-05-06 11:54 GMT-04:00 Daniel Weber <[email protected]>:
>>>>
>>>> The same "problem" occurs when using the gres type in the srun syntax
>>>> (i.e. using --gres=gpu:tesla:1).
>>>>
>>>> Regards,
>>>> Daniel
>>>>
>>>> --
>>>> From: John Desantis [mailto:[email protected]]
>>>> Sent: Wednesday, 6 May 2015 17:39
>>>> To: slurm-dev
>>>> Subject: [slurm-dev] Re: Job allocation for GPU jobs doesn't work
>>>> using gpu plugin (node configuration not available)
>>>>
>>>>
>>>> Daniel,
>>>>
>>>> We don't specify types in our Gres configuration, simply the resource.
>>>>
>>>> What happens if you update your srun syntax to:
>>>>
>>>> srun -n1 --gres=gpu:tesla:1
>>>>
>>>> Does that dispatch the job?
>>>>
>>>> John DeSantis
>>>>
>>>> 2015-05-06 9:40 GMT-04:00 Daniel Weber <[email protected]>:
>>>>> Hello,
>>>>>
>>>>> currently I'm trying to set up SLURM on a GPU cluster with a small
>>>>> number of nodes (where smurf0[1-7] are the node names), using the gpu
>>>>> plugin to allocate jobs (requiring GPUs).
>>>>>
>>>>> Unfortunately, when trying to run a GPU job (any number of GPUs;
>>>>> --gres=gpu:N), SLURM doesn't execute it, asserting unavailability of
>>>>> the requested configuration.
>>>>> I attached some logs and configuration files in order to provide
>>>>> the information necessary to analyze this issue.
>>>>>
>>>>> Note: Cross-posted here: http://serverfault.com/questions/685258
>>>>>
>>>>> Example (using a test.sh which echoes $CUDA_VISIBLE_DEVICES):
>>>>>
>>>>> srun -n1 --gres=gpu:1 test.sh
>>>>> --> srun: error: Unable to allocate resources: Requested node
>>>>> configuration is not available
>>>>>
>>>>> The slurmctld log for such calls shows:
>>>>>
>>>>> gres: gpu state for job X
>>>>>   gres_cnt:1 node_cnt:1 type:(null)
>>>>> _pick_best_nodes: job X never runnable
>>>>> _slurm_rpc_allocate_resources: Requested node configuration
>>>>> is not available
>>>>>
>>>>> Jobs with any other type of configured generic resource complete
>>>>> successfully:
>>>>>
>>>>> srun -n1 --gres=gram:500 test.sh
>>>>> --> CUDA_VISIBLE_DEVICES=NoDevFiles
>>>>>
>>>>> The node and gres configuration in slurm.conf (which is attached as
>>>>> well) looks like:
>>>>>
>>>>> GresTypes=gpu,ram,gram,scratch
>>>>> ...
>>>>> NodeName=smurf01 NodeAddr=192.168.1.101 Feature="intel,fermi" Boards=1
>>>>> SocketsPerBoard=2 CoresPerSocket=6 ThreadsPerCore=2
>>>>> Gres=gpu:tesla:8,ram:48,gram:no_consume:6000,scratch:1300
>>>>> NodeName=smurf02 NodeAddr=192.168.1.102 Feature="intel,fermi" Boards=1
>>>>> SocketsPerBoard=2 CoresPerSocket=6 ThreadsPerCore=1
>>>>> Gres=gpu:tesla:8,ram:48,gram:no_consume:6000,scratch:1300
>>>>>
>>>>> The respective gres.conf files are:
>>>>>
>>>>> Name=gpu Count=8 Type=tesla File=/dev/nvidia[0-7]
>>>>> Name=ram Count=48
>>>>> Name=gram Count=6000
>>>>> Name=scratch Count=1300
>>>>>
>>>>> The output of "scontrol show node" lists all the nodes with the
>>>>> correct gres configuration, e.g.:
>>>>>
>>>>> NodeName=smurf01 Arch=x86_64 CoresPerSocket=6
>>>>> CPUAlloc=0 CPUErr=0 CPUTot=24 CPULoad=0.01 Features=intel,fermi
>>>>> Gres=gpu:tesla:8,ram:48,gram:no_consume:6000,scratch:1300
>>>>> ...etc.
>>>>>
>>>>> As far as I can tell, the slurmd daemon on the nodes recognizes the
>>>>> GPUs (and other generic resources) correctly.
>>>>>
>>>>> The slurmd.log on node smurf01 says:
>>>>>
>>>>> Gres Name=gpu Type=tesla Count=8 ID=7696487 File=/dev/nvidia[0-7]
>>>>>
>>>>> The log for slurmctld shows:
>>>>>
>>>>> gres/gpu: state for smurf01
>>>>>   gres_cnt found:8 configured:8 avail:8 alloc:0
>>>>>   gres_bit_alloc:
>>>>>   gres_used:(null)
>>>>>
>>>>> I can't figure out why the controller node states that jobs using
>>>>> --gres=gpu:N are "never runnable" and why "the requested node
>>>>> configuration is not available".
>>>>> Any help is appreciated.
>>>>>
>>>>> Kind regards,
>>>>> Daniel Weber
>>>>>
>>>>> PS: If further information is required, don't hesitate to ask.
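
For anyone landing on this thread later: combining the two working configurations quoted above, a minimal sketch of a matching slurm.conf / gres.conf pair looks like the following. Node name, count, type string, and device paths are taken from the messages in this thread; treat this as an unverified sketch, not a tested configuration.

# slurm.conf (the Type string in Gres= must match gres.conf exactly)
GresTypes=gpu
NodeName=smurf01 Gres=gpu:tesla:8 ...

# gres.conf on smurf01 (Type= plus an explicit Count=)
Name=gpu Type=tesla Count=8 File=/dev/nvidia[0-7]

A job can then request the typed resource with "srun -n1 --gres=gpu:tesla:1", and running "--gres=help" (as suggested earlier in the thread) is a quick way to check which gres names and types the controller actually sees.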
