Daniel,

What about a count? Try adding a Count=1 after each of your GPU lines.
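For instance, a sketch of what that gres.conf could look like on one of the nodes — one line per device instead of the single bracketed-range line, with an explicit Count=1 on each (device paths taken from Daniel's configuration; the layout itself is just the suggestion above, not a tested fix):

```
# Hypothetical gres.conf on smurf01: enumerate each GPU device file
# explicitly with Count=1, rather than Count=8 with File=/dev/nvidia[0-7]
Name=gpu Type=tesla File=/dev/nvidia0 Count=1
Name=gpu Type=tesla File=/dev/nvidia1 Count=1
Name=gpu Type=tesla File=/dev/nvidia2 Count=1
Name=gpu Type=tesla File=/dev/nvidia3 Count=1
Name=gpu Type=tesla File=/dev/nvidia4 Count=1
Name=gpu Type=tesla File=/dev/nvidia5 Count=1
Name=gpu Type=tesla File=/dev/nvidia6 Count=1
Name=gpu Type=tesla File=/dev/nvidia7 Count=1
```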
John DeSantis

2015-05-06 11:54 GMT-04:00 Daniel Weber <[email protected]>:
>
> The same "problem" occurs when using the gres type in the srun syntax
> (e.g. --gres=gpu:tesla:1).
>
> Regards,
> Daniel
>
> --
> From: John Desantis [mailto:[email protected]]
> Sent: Wednesday, 6 May 2015 17:39
> To: slurm-dev
> Subject: [slurm-dev] Re: Job allocation for GPU jobs doesn't work using gpu
> plugin (node configuration not available)
>
>
> Daniel,
>
> We don't specify types in our Gres configuration, simply the resource.
>
> What happens if you update your srun syntax to:
>
> srun -n1 --gres=gpu:tesla:1
>
> Does that dispatch the job?
>
> John DeSantis
>
> 2015-05-06 9:40 GMT-04:00 Daniel Weber <[email protected]>:
>> Hello,
>>
>> Currently I'm trying to set up SLURM on a GPU cluster with a small
>> number of nodes (where smurf0[1-7] are the node names), using the gpu
>> plugin to allocate jobs that require GPUs.
>>
>> Unfortunately, when trying to run a GPU job (with any number of GPUs;
>> --gres=gpu:N), SLURM doesn't execute it, asserting that the requested
>> configuration is unavailable.
>> I attached some logs and configuration text files in order to provide
>> any information necessary to analyze this issue.
>>
>> Note: Cross-posted here: http://serverfault.com/questions/685258
>>
>> Example (using some test.sh which is echoing $CUDA_VISIBLE_DEVICES):
>>
>> srun -n1 --gres=gpu:1 test.sh
>> --> srun: error: Unable to allocate resources: Requested node
>> configuration is not available
>>
>> The slurmctld log for such calls shows:
>>
>> gres: gpu state for job X
>>   gres_cnt:1 node_cnt:1 type:(null)
>> _pick_best_nodes: job X never runnable
>> _slurm_rpc_allocate_resources: Requested node configuration is
>> not available
>>
>> Jobs with any other type of configured generic resource complete
>> successfully:
>>
>> srun -n1 --gres=gram:500 test.sh
>> --> CUDA_VISIBLE_DEVICES=NoDevFiles
>>
>> The node and gres configuration in slurm.conf (which is attached as
>> well) looks like:
>>
>> GresTypes=gpu,ram,gram,scratch
>> ...
>> NodeName=smurf01 NodeAddr=192.168.1.101 Feature="intel,fermi" Boards=1
>> SocketsPerBoard=2 CoresPerSocket=6 ThreadsPerCore=2
>> Gres=gpu:tesla:8,ram:48,gram:no_consume:6000,scratch:1300
>> NodeName=smurf02 NodeAddr=192.168.1.102 Feature="intel,fermi" Boards=1
>> SocketsPerBoard=2 CoresPerSocket=6 ThreadsPerCore=1
>> Gres=gpu:tesla:8,ram:48,gram:no_consume:6000,scratch:1300
>>
>> The respective gres.conf files are:
>>
>> Name=gpu Count=8 Type=tesla File=/dev/nvidia[0-7]
>> Name=ram Count=48
>> Name=gram Count=6000
>> Name=scratch Count=1300
>>
>> The output of "scontrol show node" lists all the nodes with the
>> correct gres configuration, i.e.:
>>
>> NodeName=smurf01 Arch=x86_64 CoresPerSocket=6
>>   CPUAlloc=0 CPUErr=0 CPUTot=24 CPULoad=0.01 Features=intel,fermi
>>   Gres=gpu:tesla:8,ram:48,gram:no_consume:6000,scratch:1300
>>   ...etc.
>>
>> As far as I can tell, the slurmd daemon on the nodes recognizes the
>> GPUs (and other generic resources) correctly.
>>
>> My slurmd.log on node smurf01 says:
>>
>> Gres Name=gpu Type=tesla Count=8 ID=7696487 File=/dev/nvidia[0-7]
>>
>> The log for slurmctld shows:
>>
>> gres/gpu: state for smurf01
>>   gres_cnt found:8 configured:8 avail:8 alloc:0
>>   gres_bit_alloc:
>>   gres_used:(null)
>>
>> I can't figure out why the controller node states that jobs using
>> --gres=gpu:N are "never runnable" and why "the requested node
>> configuration is not available".
>> Any help is appreciated.
>>
>> Kind regards,
>> Daniel Weber
>>
>> PS: If further information is required, don't hesitate to ask.
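[For reference: the test.sh mentioned above could be as simple as the following sketch — its exact contents are not given in the thread, so this is an assumed minimal version that just prints the GPU devices Slurm exposes to the job step.]

```shell
#!/bin/bash
# Minimal job script: report which GPU device indices Slurm assigned
# to this job step via the CUDA_VISIBLE_DEVICES environment variable.
echo "CUDA_VISIBLE_DEVICES=$CUDA_VISIBLE_DEVICES"
```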
