Hi,

This is just a guess, but there's also a cgroup.conf file where you might
need to add:

    ConstrainDevices=yes

See https://slurm.schedmd.com/cgroup.conf.html for more details.
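In case it helps, here is a minimal cgroup.conf sketch along those lines. It is
only an illustration, not a drop-in config: the file location, the
AllowedDevicesFile path, and which other Constrain* options you want are
site-specific, so check everything against your own setup and the man page:

    # /etc/slurm/cgroup.conf  (path may differ on your installation)
    CgroupAutomount=yes
    # Hide GPU device files from jobs that did not request them with --gres
    ConstrainDevices=yes
    # Devices every job is allowed to see; adjust if your default path differs
    AllowedDevicesFile=/etc/slurm/cgroup_allowed_devices_file.conf
    # Commonly enabled together with TaskPlugin=task/cgroup
    ConstrainCores=yes
    ConstrainRAMSpace=yes

After changing cgroup.conf you would normally restart slurmd on the compute
nodes; a job submitted without --gres should then no longer see the
/dev/nvidia* device files, so your deviceQuery test should stop finding GPUs.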
HTH,
Yair.

On Mon, Mar 12 2018, Sefa Arslan <sefa.ars...@tubitak.gov.tr> wrote:
> Dear all,
>
> We have upgraded our cluster from 13 to Slurm 17.11. We have a problem with
> the GPU configuration: although I request no GPUs, the system lets me use
> the GPU cards.
>
> Let me explain.
>
> slurm.conf:
> SelectType=select/cons_res
> SelectTypeParameters=CR_CPU_Memory
> TaskPlugin=task/cgroup
> PreemptType=preempt/none
>
> NodeName=cudanode[1-20] Procs=40 Sockets=2 CoresPerSocket=20 ThreadsPerCore=2 RealMemory=384000 Gres=gpu:2
> PartitionName=cuda Nodes=cudanode[1-20] Default=no MaxTime=15-00:00:00 DefaultTime=00:02:00 State=UP DefMemPerCPU=8500 MaxMemPerNode=380000 Shared=NO Priority=1000
>
> gres.conf:
> Name=gpu File=/dev/nvidia0 CPUs=0,2,4,6,8,10,12,14,16,18,20,22,24,26,28,30,32,34,36,38,40,42,44,46,48,50,52,54,56,58,60,62,64,66,68,70,72,74,76,78
> Name=gpu File=/dev/nvidia1 CPUs=1,3,5,7,9,11,13,15,17,19,21,23,25,27,29,31,33,35,37,39,41,43,45,47,49,51,53,55,57,59,61,63,65,67,69,71,73,75,77,79
>
> I am testing the configuration with the deviceQuery app that comes with the
> CUDA 9 package.
>
> When I submit a job with 2 GPUs, the system reserves the right number of GPUs:
> srun -n 1 -p cuda --nodelist=cudanode1 --gres=gpu:2 ./cuda.sh
> CUDA_VISIBLE_DEVICES: 0,1
> deviceQuery, CUDA Driver = CUDART, CUDA Driver Version = 9.1, CUDA Runtime Version = 7.5, NumDevs = 2, Device0 = Tesla P100-PCIE-16GB, Device1 = Tesla P100-PCIE-16GB
> Result = PASS
>
> When I submit a job with 1 GPU, the system also reserves the right number of GPUs:
> srun -n 1 -p cuda --nodelist=cudanode1 --gres=gpu:1 ./cuda.sh
> CUDA_VISIBLE_DEVICES: 0
> deviceQuery, CUDA Driver = CUDART, CUDA Driver Version = 9.1, CUDA Runtime Version = 7.5, NumDevs = 1, Device0 = Tesla P100-PCIE-16GB
> Result = PASS
>
> But when I submit a job without any GPUs, the system still lets me use the
> GPUs, which I don't expect:
> srun -n 1 -p cuda --nodelist=cudanode1 ./cuda.sh
> CUDA_VISIBLE_DEVICES:
> deviceQuery, CUDA Driver = CUDART, CUDA Driver Version = 9.1, CUDA Runtime Version = 7.5, NumDevs = 2, Device0 = Tesla P100-PCIE-16GB, Device1 = Tesla P100-PCIE-16GB
> Result = PASS
>
> This way I am able to run 40 jobs on one server at the same time, all of them
> using the GPUs. Is this a bug, or did I miss something? With previous versions
> of Slurm, GPU allocation behaved as I expected. I also tried with CUDA-enabled
> NAMD, which uses higher-level hardware access methods, and I get the same
> result.
>
> Another problem I hit: when I change the GPU configuration from Gres=gpu:2 to
> Gres=gpu:no_consume:2, so that the cards can be used simultaneously by many
> jobs, the system lets me use all the cards regardless of how many I request.
>
> Regards,
> Sefa ARSLAN