We think we've discovered a bug in Slurm 2.2.1: when we do an
scontrol reconfig (or kill -HUP the slurmctld), the usage->grp_used_cpus
field of QOSes and associations gets set to its previous value plus the
current actual usage. (If we restart Slurm instead, it is correctly set
to the current actual usage.)
We discovered this while implementing a GrpMEM limit (though the commands
below were run on a clean, unpatched 2.2.1 build), so it seems probable
that other Grp limits are affected as well, but we haven't investigated
that.
We can reproduce it on an unpatched Slurm 2.2.1 like this (the example
below uses QOS limits, but we see the same behaviour with account limits):
Set up GrpCPUs limit for qos staff (on the controller node):
# sacctmgr modify qos staff set GrpCPUs=20
# sacctmgr show qos staff
      Name   Priority    Preempt PreemptMode  GrpCPUs
---------- ---------- ---------- ----------- --------
     staff      10000     lowpri     cluster       20
(output trimmed to the columns that are set)
Submit jobs to hit the limit (on a login node):
$ sbatch --account=staff --mem-per-cpu=500 --time=10:00 --ntasks=10 --wrap='sleep 600'
Submitted batch job 7
$ /usr/bin/sbatch --account=staff --mem-per-cpu=500 --time=10:00 --ntasks=10 --wrap='sleep 600'
Submitted batch job 8
$ /usr/bin/sbatch --account=staff --mem-per-cpu=500 --time=10:00 --ntasks=10 --wrap='sleep 600'
Submitted batch job 9
$ squeue
JOBID PARTITION   NAME USER ST  TIME NODES NODELIST(REASON)
    9    normal sbatch  bhm PD  0:00     3 (AssociationResourceLimit)
    8    normal sbatch  bhm  R  0:17     3 compute-0-[6-8]
    7    normal sbatch  bhm  R  0:22     3 compute-0-[4-6]
(QoS staff is the default qos for account staff.)
Grep in slurmctld.log:
# grep 'exceeds group max cpu limit.*qos staff\|slurmctld starting\|Reconfigure signal\|REQUEST_RECONFIGURE' /var/log/slurm/slurmctld.log
[2011-02-13T15:04:04] debug2: job 9 being held, the job is at or exceeds group max cpu limit 20 with already used 20 + requested 10 for qos staff
[2011-02-13T15:04:04] debug2: job 9 being held, the job is at or exceeds group max cpu limit 20 with already used 20 + requested 10 for qos staff
[2011-02-13T15:04:40] debug2: job 9 being held, the job is at or exceeds group max cpu limit 20 with already used 20 + requested 10 for qos staff
[2011-02-13T15:04:40] debug2: job 9 being held, the job is at or exceeds group max cpu limit 20 with already used 20 + requested 10 for qos staff
Then do a reconfigure and grep again:
# scontrol reconfigure
# grep 'exceeds group max cpu limit.*qos staff\|slurmctld starting\|Reconfigure signal\|REQUEST_RECONFIGURE' /var/log/slurm/slurmctld.log
[2011-02-13T15:05:40] debug2: job 9 being held, the job is at or exceeds group max cpu limit 20 with already used 20 + requested 10 for qos staff
[2011-02-13T15:06:26] Processing RPC: REQUEST_RECONFIGURE from uid=0
[2011-02-13T15:06:26] debug2: job 9 being held, the job is at or exceeds group max cpu limit 20 with already used 40 + requested 10 for qos staff
Each further reconfigure adds the number of CPUs currently in use to
grp_used_cpus. For instance:
# scontrol reconfigure
# grep 'exceeds group max cpu limit.*qos staff\|slurmctld starting\|Reconfigure signal\|REQUEST_RECONFIGURE' /var/log/slurm/slurmctld.log
[2011-02-13T15:06:40] debug2: job 9 being held, the job is at or exceeds group max cpu limit 20 with already used 40 + requested 10 for qos staff
[2011-02-13T15:07:39] Processing RPC: REQUEST_RECONFIGURE from uid=0
[2011-02-13T15:07:40] debug2: job 9 being held, the job is at or exceeds group max cpu limit 20 with already used 60 + requested 10 for qos staff
# scancel 7 ## used 10 cpus
# scontrol reconfigure
# grep 'exceeds group max cpu limit.*qos staff\|slurmctld starting\|Reconfigure signal\|REQUEST_RECONFIGURE' /var/log/slurm/slurmctld.log
[2011-02-13T15:07:40] debug2: job 9 being held, the job is at or exceeds group max cpu limit 20 with already used 60 + requested 10 for qos staff
[2011-02-13T15:07:57] debug2: job 9 being held, the job is at or exceeds group max cpu limit 20 with already used 50 + requested 10 for qos staff
[2011-02-13T15:08:07] Processing RPC: REQUEST_RECONFIGURE from uid=0
[2011-02-13T15:08:07] debug2: job 9 being held, the job is at or exceeds group max cpu limit 20 with already used 60 + requested 10 for qos staff
A complete restart of slurmctld resets the counters to their correct values.
I've attached our slurm.conf:
## slurm.conf: main configuration file for SLURM
## $Id: slurm.conf,v 1.25 2011/02/12 16:59:04 root Exp root $
###
### Cluster
###
ClusterName=titan
#default: AuthType=auth/munge
#default: CryptoType=crypto/munge
SlurmctldPort=6817
SlurmdPort=6818
TmpFs=/work
#default: TreeWidth=50 Use ceil(sqrt(#nodes))
TreeWidth=5
## Timers:
#default: MessageTimeout=10
SlurmdTimeout=36000
WaitTime=0
###
### Slurmctld
###
ControlMachine=teflon
#default: MinJobAge=300
SlurmUser=slurm
StateSaveLocation=/state/partition1/slurm/slurmstate
###
### Nodes
###
FastSchedule=2
HealthCheckInterval=60
HealthCheckProgram=/sbin/healthcheck
ReturnToService=1
Nodename=DEFAULT CoresPerSocket=2 Sockets=2 RealMemory=3949 State=unknown TmpDisk=10000 Weight=2027
PartitionName=DEFAULT MaxTime=Infinite State=up Shared=NO
Include /etc/slurm/slurmnodes.conf
###
### Jobs
###
PropagateResourceLimits=NONE
DefMemPerCPU=500
EnforcePartLimits=yes
#default: InactiveLimit=0
JobFileAppend=1
#default: JobRequeue=1
JobSubmitPlugins=lua
#default: MaxJobCount=10000
#default: MpiDefault=none #FIXME: openmpi?
#default: OverTimeLimit=0
VSizeFactor=150
## Prologs/Epilogs
# run by slurmctld as SlurmUser on ControlMachine before granting a job allocation:
#PrologSlurmctld=
# run by slurmd on each node prior to the first job step on the node:
Prolog=/site/sbin/slurmprolog
# run by srun on the node running srun, prior to the launch of a job step:
#SrunProlog=
# run as user for each task prior to initiating the task:
TaskProlog=/site/sbin/taskprolog
# run as user for each task after the task finishes:
#TaskEpilog=
# run by srun on the node running srun, after a job step finishes:
#SrunEpilog=
# run as root on each node when job has completed
Epilog=/site/sbin/slurmepilog
# run as SlurmUser on ControlMachine after the allocation is released:
#EpilogSlurmctld=
###
### Job Priority
###
PriorityType=priority/multifactor
#default: PriorityCalcPeriod=5
#default: PriorityDecayHalfLife=7-0 #(7 days)
#default: PriorityUsageResetPeriod=NONE
#default: PriorityMaxAge=7-0 #(7 days)
#default: PriorityFavorSmall=no
PriorityWeightAge=10000
#default: PriorityWeightFairshare=0
PriorityWeightJobSize=1000
#default: PriorityWeightPartition=0
PriorityWeightQOS=10000
###
### Scheduling
###
SchedulerType=sched/backfill
#default:
SchedulerParameters=default_queue_depth=100,defer=?,bf_interval=30,bf_window=1440,max_job_bf=50
SelectType=select/cons_res
SelectTypeParameters=CR_CPU_Memory
PreemptMode=requeue
#PreemptMode=checkpoint # FIXME: cancels if checkpoint is not possible!
PreemptType=preempt/qos
CompleteWait=32 # KillWait + 2
#default: KillWait=30
###
### Checkpointing
###
# ************** WARNING ***********************
# *** ENABLING/DISABLING THIS KILLS ALL JOBS ***
# **********************************************
CheckpointType=checkpoint/blcr
JobCheckpointDir=/state/partition1/slurm/checkpoint
###
### Logging
###
SlurmctldDebug=6
SlurmctldLogFile=/var/log/slurm/slurmctld.log
SlurmSchedLogLevel=1
SlurmSchedLogFile=/var/log/slurm/sched.log
SlurmdDebug=5
SlurmdLogFile=/var/log/slurm/slurmd.log
#default: DebugFlags=
###
### Accounting (Slurmdbd)
###
AccountingStorageType=accounting_storage/slurmdbd
AccountingStorageHost=blaster
JobAcctGatherType=jobacct_gather/linux
#default: JobAcctGatherFrequency=30
ProctrackType=proctrack/linuxproc # FIXME: check out cgroup
AccountingStorageEnforce=limits,qos
# combination of associations < limits < wckeys, qos
--
Regards,
Bjørn-Helge Mevik, dr. scient,
Research Computing Services, University of Oslo