Here are the outputs:
sacctmgr show qos -p

Name|Priority|GraceTime|Preempt|PreemptMode|Flags|UsageThres|UsageFactor|GrpTRES|GrpTRESMins|GrpTRESRunMins|GrpJobs|GrpSubmit|GrpWall|MaxTRES|MaxTRESPerNode|MaxTRESMins|MaxWall|MaxTRESPU|MaxJobsPU|MaxSubmitPU|MaxTRESPA|MaxJobsPA|MaxSubmitPA|MinTRES|
normal|10000|00:00:00||cluster|||1.000000|gres/gpu=2||||||||||gres/gpu=2|||||||
now|1000000|00:00:00||cluster|||1.000000||||||||||||||||||
high|100000|00:00:00||cluster|||1.000000||||||||||||||||||
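
(For readability, the same limits can also be listed with an explicit format string, e.g. a sketch such as:

sacctmgr show qos format=name,priority,grptres%30,maxtrespu%30

which shows only the name, priority, group TRES limits and per-user TRES limits.)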

scontrol show part

PartitionName=PART1
   AllowGroups=trace_unix_group AllowAccounts=ALL AllowQos=ALL
   AllocNodes=ALL Default=NO QoS=N/A
   DefaultTime=NONE DisableRootJobs=NO ExclusiveUser=NO GraceTime=0 Hidden=NO
   MaxNodes=UNLIMITED MaxTime=UNLIMITED MinNodes=0 LLN=NO MaxCPUsPerNode=UNLIMITED
   Nodes=node1,node2,node3,node4,….   PriorityJobFactor=1 PriorityTier=1 RootOnly=NO ReqResv=NO OverSubscribe=NO
   OverTimeLimit=NONE PreemptMode=OFF
   State=UP TotalCPUs=236 TotalNodes=11 SelectTypeParameters=NONE
   JobDefaults=(null)
   DefMemPerNode=UNLIMITED MaxMemPerNode=UNLIMITED
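
(Note that the partition itself reports QoS=N/A, i.e. no partition QoS is attached, so the gres/gpu=2 limit lives only on the normal QoS. A minimal sketch of attaching that QoS at the partition level on a running system, assuming the QoS name normal and the partition PART1 from the output above:

scontrol update PartitionName=PART1 QOS=normal

This does not persist across a slurmctld restart unless the same QOS= option is also added to the PartitionName line in slurm.conf.)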

Thomas Theis

From: slurm-users <slurm-users-boun...@lists.schedmd.com> On Behalf Of Sean Crosby
Sent: Wednesday, May 6, 2020 6:22 PM
To: Slurm User Community List <slurm-users@lists.schedmd.com>
Subject: Re: [slurm-users] [EXT] Re: Limit the number of GPUS per user per partition

Do you have other limits set? QoS limits are hierarchical, and a partition QoS in particular can override other QoS limits.
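
It is also worth confirming that limit enforcement is enabled at all. A quick check (a sketch; AccountingStorageEnforce is the relevant slurm.conf keyword):

scontrol show config | grep AccountingStorageEnforce

For a MaxTRESPerUser limit on a QoS to take effect, that setting needs to include "limits" (which also implies "associations").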

What's the output of

sacctmgr show qos -p

and

scontrol show part

Sean

--
Sean Crosby | Senior DevOpsHPC Engineer and HPC Team Lead
Research Computing Services | Business Services
The University of Melbourne, Victoria 3010 Australia


On Wed, 6 May 2020 at 23:44, Theis, Thomas <thomas.th...@teledyne.com> wrote:
Still have the same issue after I updated the user and the QoS.
The command I am using:
‘sacctmgr modify qos normal set MaxTRESPerUser=gres/gpu=2’
I restarted the services. Unfortunately I am still able to saturate the cluster with jobs.
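
As a sanity check (a sketch using standard sacctmgr format fields), the stored limit can be read back with:

sacctmgr show qos normal format=name,maxtrespu%30

and the MaxTRESPU column should come back as gres/gpu=2 if the modify took effect.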

We have a cluster of 10 nodes, each with 4 GPUs, for a total of 40 GPUs. Each node is identical in software, OS, Slurm version, etc. I am trying to limit each user to only 2 of the 40 GPUs across the entire cluster or partition: an intentional bottleneck so that no one can saturate the cluster.

E.g. the desired outcome would be: Person A submits 100 jobs; 2 would run and 98 would be pending, leaving 38 GPUs idle. Once the 2 running jobs finish, 2 more would run and 96 would be pending, still leaving 38 GPUs idle.



Thomas Theis

From: slurm-users <slurm-users-boun...@lists.schedmd.com> On Behalf Of Sean Crosby
Sent: Tuesday, May 5, 2020 6:48 PM
To: Slurm User Community List <slurm-users@lists.schedmd.com>
Subject: Re: [slurm-users] [EXT] Re: Limit the number of GPUS per user per partition

Hi Thomas,

That value should be

sacctmgr modify qos gpujobs set MaxTRESPerUser=gres/gpu=4

Sean

--
Sean Crosby | Senior DevOpsHPC Engineer and HPC Team Lead
Research Computing Services | Business Services
The University of Melbourne, Victoria 3010 Australia


On Wed, 6 May 2020 at 04:53, Theis, Thomas <thomas.th...@teledyne.com> wrote:
Hey Killian,

I tried to limit the number of GPUs a user can run on at a time by adding MaxTRESPerUser = gres:gpu4 to both the user and the QoS. I restarted the slurm control daemon, and unfortunately I am still able to run on all the GPUs in the partition. Any other ideas?

Thomas Theis

From: slurm-users <slurm-users-boun...@lists.schedmd.com> On Behalf Of Killian Murphy
Sent: Thursday, April 23, 2020 1:33 PM
To: Slurm User Community List <slurm-users@lists.schedmd.com>
Subject: Re: [slurm-users] Limit the number of GPUS per user per partition

Hi Thomas.

We limit the maximum number of GPUs a user can have allocated in a partition through the MaxTRESPerUser field of a QoS for GPU jobs, which is set as the partition QoS on our GPU partition. That is:

We have a QoS `gpujobs` that sets MaxTRESPerUser=gres/gpu=4 to limit the total number of allocated GPUs to 4, and we set the GPU partition's QoS to the `gpujobs` QoS.
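
A minimal sketch of that setup, assuming the QoS name `gpujobs` from above; the partition name and node list in the slurm.conf line are placeholders:

sacctmgr add qos gpujobs
sacctmgr modify qos gpujobs set MaxTRESPerUser=gres/gpu=4

and in slurm.conf, on the partition definition:

PartitionName=gpu Nodes=gpunode[01-10] QOS=gpujobs

followed by 'scontrol reconfigure' (or a slurmctld restart) so the partition picks up the QoS.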

There is a section on the 'Resource Limits' page of the Slurm documentation entitled 'QOS specific limits supported' (https://slurm.schedmd.com/resource_limits.html) that details some care needed when using this kind of limit with typed GRES. Although it seems like you are trying to do something with generic GRES, it's worth a read!

Killian



On Thu, 23 Apr 2020 at 18:19, Theis, Thomas <thomas.th...@teledyne.com> wrote:
Hi everyone,
First message: I am trying to find a good way, or multiple ways, to limit the number of jobs per node or the number of GPUs in use per node, without blocking a user from submitting jobs.

Example: we have 10 nodes, each with 4 GPUs, in a partition. We allow a team of 6 people to submit jobs to any or all of the nodes. One job per GPU, so we can run a total of 40 jobs concurrently in the partition.
At the moment each user usually submits 50-100 jobs at once, taking up all the GPUs, and all other users have to wait in the pending state.

What I am trying to set up is to allow all users to submit as many jobs as they wish, but have them run on only 1 of the 4 GPUs per node, or on some number of the 40 GPUs across the entire partition. We are using Slurm 18.08.3.

This is roughly our Slurm batch script:

#SBATCH --job-name=Name # Job name
#SBATCH --mem=5gb                     # Job memory request
#SBATCH --ntasks=1
#SBATCH --gres=gpu:1
#SBATCH --partition=PART1
#SBATCH --time=200:00:00               # Time limit hrs:min:sec
#SBATCH --output=job_%j.log          # Standard output and error log
#SBATCH --nodes=1
#SBATCH --qos=high

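# Launch do_job.sh inside an nvidia-docker container, pinned to the GPU(s) Slurm assigned via SLURM_JOB_GPUS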
srun -n1 --gres=gpu:1 --exclusive --export=ALL bash -c "NV_GPU=$SLURM_JOB_GPUS nvidia-docker run --rm -e SLURM_JOB_ID=$SLURM_JOB_ID -e SLURM_OUTPUT=$SLURM_OUTPUT --name $SLURM_JOB_ID do_job.sh"

Thomas Theis



--
Killian Murphy
Research Software Engineer

Wolfson Atmospheric Chemistry Laboratories
University of York
Heslington
York
YO10 5DD
+44 (0)1904 32 4753

e-mail disclaimer: http://www.york.ac.uk/docs/disclaimer/email.htm
