Hi Thomas. We limit the maximum number of GPUs a user can have allocated in a partition through the MaxTRESPerUser field of a QoS for GPU jobs, which is set as the partition QoS on our GPU partition.
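Concretely, a minimal sketch of this kind of setup might look like the following. The QoS name `gpujobs` and the limit of 4 are the ones we use here; the partition name `PART1` is borrowed from your job script, and the `run` and `tres_limit` helpers are hypothetical, only there so the script prints the commands instead of running them by default. The slurm.conf lines are what I would expect to be needed for the limit to be tracked and enforced, so please check them against your own configuration rather than treating them as definitive.

```shell
#!/usr/bin/env bash
# Sketch only: prints the commands unless DRY_RUN=0 is set in the environment.

QOS_NAME="gpujobs"   # QoS name used in this message
GPU_LIMIT=4          # per-user GPU cap used in this message
PARTITION="PART1"    # assumption: partition name taken from the quoted job script

# Hypothetical helper: echo the command in dry-run mode (default), else run it.
run() {
    if [ "${DRY_RUN:-1}" = "1" ]; then
        echo "$*"
    else
        "$@"
    fi
}

# Hypothetical helper: build the TRES limit string, e.g. gres/gpu=4.
tres_limit() {
    printf 'gres/gpu=%s' "$1"
}

# 1. Create the QoS and cap the GPUs any one user can hold at once.
run sacctmgr -i add qos "$QOS_NAME"
run sacctmgr -i modify qos "$QOS_NAME" set "MaxTRESPerUser=$(tres_limit "$GPU_LIMIT")"

# 2. Print the slurm.conf settings assumed to go with it.
cat <<EOF
# slurm.conf (assumed fragment, merge with your existing settings):
#   AccountingStorageTRES=gres/gpu
#   AccountingStorageEnforce=limits,qos
#   add QOS=$QOS_NAME to the existing PartitionName=$PARTITION line
EOF
```

Run as-is, it only prints what it would do. With `DRY_RUN=0` on a host with `sacctmgr` admin rights it applies the QoS changes, after which `sacctmgr show qos gpujobs` should list the limit. Users can still submit as many jobs as they like; jobs beyond the cap simply stay pending.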
We have a QoS `gpujobs` that sets MaxTRESPerUser=gres/gpu=4 to limit the total number of GPUs allocated to any one user to 4, and we set the partition QoS of the GPU partition to `gpujobs`.

There is a section on the 'Resource Limits' page of the Slurm documentation entitled 'QOS specific limits supported' (https://slurm.schedmd.com/resource_limits.html) that details some care needed when using this kind of limit with typed GRES. Although it seems like you are trying to do something with generic GRES, it's worth a read!

Killian

On Thu, 23 Apr 2020 at 18:19, Theis, Thomas <thomas.th...@teledyne.com> wrote:

> Hi everyone,
>
> First message. I am trying to find a good way, or multiple ways, to limit
> the usage of jobs per node or use of gpus per node, without blocking a
> user from submitting them.
>
> Example. We have 10 nodes each with 4 gpus in a partition. We allow a team
> of 6 people to submit jobs to any or all of the nodes. One job per gpu;
> thus we can hold a total of 40 jobs concurrently in the partition.
>
> At the moment, each user usually submits 50-100 jobs at once, taking up
> all gpus, and all other users have to wait in pending.
>
> What I am trying to set up is to allow all users to submit as many jobs as
> they wish but only run on 1 out of the 4 gpus per node, or some number out
> of the total 40 gpus across the entire partition. Using slurm 18.08.3.
>
> This is roughly our slurm scripts.
>
> #SBATCH --job-name=Name        # Job name
> #SBATCH --mem=5gb              # Job memory request
> #SBATCH --ntasks=1
> #SBATCH --gres=gpu:1
> #SBATCH --partition=PART1
> #SBATCH --time=200:00:00       # Time limit hrs:min:sec
> #SBATCH --output=job_%j.log    # Standard output and error log
> #SBATCH --nodes=1
> #SBATCH --qos=high
>
> srun -n1 --gres=gpu:1 --exclusive --export=ALL bash -c \
>   "NV_GPU=$SLURM_JOB_GPUS nvidia-docker run --rm -e SLURM_JOB_ID=$SLURM_JOB_ID -e SLURM_OUTPUT=$SLURM_OUTPUT --name $SLURM_JOB_ID do_job.sh"
>
> *Thomas Theis*

--
Killian Murphy
Research Software Engineer
Wolfson Atmospheric Chemistry Laboratories
University of York
Heslington
York
YO10 5DD
+44 (0)1904 32 4753
e-mail disclaimer: http://www.york.ac.uk/docs/disclaimer/email.htm