We think we've found a bug in slurm 2.2.1.  When we do an
scontrol reconfig (or kill -HUP the slurmctld), the usage->grp_used_cpus
field of QOSes and associations gets set to its previous value plus the
current actual usage.  (If we restart slurmctld instead, it gets set to
the current actual usage, as expected.)

We discovered this while implementing a GrpMem limit, but the commands
below were run on a clean, unpatched 2.2.1 build.  It therefore seems
probable that the same happens for the other Grp limits, but we haven't
investigated that.

We can reproduce it on an unpatched slurm 2.2.1 as follows (shown here
for QOS limits, but we see the same behaviour for account limits):


Set up GrpCPUs limit for qos staff (on the controller node):

# sacctmgr modify qos staff set GrpCPUs=20
# sacctmgr show qos staff
      Name   Priority    Preempt PreemptMode  GrpCPUs 
---------- ---------- ---------- ----------- -------- 
     staff      10000     lowpri     cluster       20 

(The remaining columns were empty and are trimmed here for readability.)

Submit jobs to hit the limit (on a login node):

$ sbatch --account=staff --mem-per-cpu=500 --time=10:00 --ntasks=10 --wrap='sleep 600'
Submitted batch job 7
$ /usr/bin/sbatch --account=staff --mem-per-cpu=500 --time=10:00 --ntasks=10 --wrap='sleep 600'
Submitted batch job 8
$ /usr/bin/sbatch --account=staff --mem-per-cpu=500 --time=10:00 --ntasks=10 --wrap='sleep 600'
Submitted batch job 9
$ squeue
  JOBID PARTITION     NAME     USER  ST       TIME  NODES NODELIST(REASON)
      9    normal   sbatch      bhm  PD       0:00      3 (AssociationResourceLimit)
      8    normal   sbatch      bhm   R       0:17      3 compute-0-[6-8]
      7    normal   sbatch      bhm   R       0:22      3 compute-0-[4-6]

(QoS staff is the default qos for account staff.)

Grep in slurmctld.log:

# grep 'exceeds group max cpu limit.*qos staff\|slurmctld starting\|Reconfigure signal\|REQUEST_RECONFIGURE' /var/log/slurm/slurmctld.log
[2011-02-13T15:04:04] debug2: job 9 being held, the job is at or exceeds group max cpu limit 20 with already used 20 + requested 10 for qos staff
[2011-02-13T15:04:04] debug2: job 9 being held, the job is at or exceeds group max cpu limit 20 with already used 20 + requested 10 for qos staff
[2011-02-13T15:04:40] debug2: job 9 being held, the job is at or exceeds group max cpu limit 20 with already used 20 + requested 10 for qos staff
[2011-02-13T15:04:40] debug2: job 9 being held, the job is at or exceeds group max cpu limit 20 with already used 20 + requested 10 for qos staff

Then do a reconfigure and grep again:

# scontrol reconfigure
# grep 'exceeds group max cpu limit.*qos staff\|slurmctld starting\|Reconfigure signal\|REQUEST_RECONFIGURE' /var/log/slurm/slurmctld.log
[2011-02-13T15:05:40] debug2: job 9 being held, the job is at or exceeds group max cpu limit 20 with already used 20 + requested 10 for qos staff
[2011-02-13T15:06:26] Processing RPC: REQUEST_RECONFIGURE from uid=0
[2011-02-13T15:06:26] debug2: job 9 being held, the job is at or exceeds group max cpu limit 20 with already used 40 + requested 10 for qos staff


Each further reconfigure adds the number of CPUs currently in use to
grp_used_cpus.  For instance:

# scontrol reconfigure
# grep 'exceeds group max cpu limit.*qos staff\|slurmctld starting\|Reconfigure signal\|REQUEST_RECONFIGURE' /var/log/slurm/slurmctld.log
[2011-02-13T15:06:40] debug2: job 9 being held, the job is at or exceeds group max cpu limit 20 with already used 40 + requested 10 for qos staff
[2011-02-13T15:07:39] Processing RPC: REQUEST_RECONFIGURE from uid=0
[2011-02-13T15:07:40] debug2: job 9 being held, the job is at or exceeds group max cpu limit 20 with already used 60 + requested 10 for qos staff

# scancel 7  ## used 10 cpus
# scontrol reconfigure
# grep 'exceeds group max cpu limit.*qos staff\|slurmctld starting\|Reconfigure signal\|REQUEST_RECONFIGURE' /var/log/slurm/slurmctld.log
[2011-02-13T15:07:40] debug2: job 9 being held, the job is at or exceeds group max cpu limit 20 with already used 60 + requested 10 for qos staff
[2011-02-13T15:07:57] debug2: job 9 being held, the job is at or exceeds group max cpu limit 20 with already used 50 + requested 10 for qos staff
[2011-02-13T15:08:07] Processing RPC: REQUEST_RECONFIGURE from uid=0
[2011-02-13T15:08:07] debug2: job 9 being held, the job is at or exceeds group max cpu limit 20 with already used 60 + requested 10 for qos staff

A complete restart resets the counters to their correct values.

I've attached our slurm.conf:

## slurm.conf: main configuration file for SLURM
## $Id: slurm.conf,v 1.25 2011/02/12 16:59:04 root Exp root $


###
### Cluster
###

ClusterName=titan
#default: AuthType=auth/munge
#default: CryptoType=crypto/munge
SlurmctldPort=6817
SlurmdPort=6818
TmpFs=/work
#default: TreeWidth=50  Use ceil(sqrt(#nodes))
TreeWidth=5

## Timers:
#default: MessageTimeout=10
SlurmdTimeout=36000
WaitTime=0


###
### Slurmctld
###

ControlMachine=teflon
#default: MinJobAge=300
SlurmUser=slurm
StateSaveLocation=/state/partition1/slurm/slurmstate


###
### Nodes
###

FastSchedule=2
HealthCheckInterval=60
HealthCheckProgram=/sbin/healthcheck
ReturnToService=1
Nodename=DEFAULT CoresPerSocket=2 Sockets=2 RealMemory=3949 State=unknown TmpDisk=10000 Weight=2027
PartitionName=DEFAULT MaxTime=Infinite State=up Shared=NO
Include /etc/slurm/slurmnodes.conf


###
### Jobs
###

PropagateResourceLimits=NONE
DefMemPerCPU=500
EnforcePartLimits=yes
#default: InactiveLimit=0
JobFileAppend=1
#default: JobRequeue=1
JobSubmitPlugins=lua
#default: MaxJobCount=10000
#default: MpiDefault=none #FIXME: openmpi?
#default: OverTimeLimit=0
VSizeFactor=150

## Prologs/Epilogs
# run by slurmctld as SlurmUser on ControlMachine before granting a job allocation:
#PrologSlurmctld=
# run by slurmd on each node prior to the first job step on the node:
Prolog=/site/sbin/slurmprolog
# run by srun on the node running srun, prior to the launch of a job step:
#SrunProlog=
# run as user for each task prior to initiating the task:
TaskProlog=/site/sbin/taskprolog
# run as user for each task after the task finishes:
#TaskEpilog=
# run by srun on the node running srun, after a job step finishes:
#SrunEpilog=
# run as root on each node when job has completed
Epilog=/site/sbin/slurmepilog
# run as SlurmUser on ControlMachine after the allocation is released:
#EpilogSlurmctld=


###
### Job Priority
###

PriorityType=priority/multifactor
#default: PriorityCalcPeriod=5
#default: PriorityDecayHalfLife=7-0 #(7 days)
#default: PriorityUsageResetPeriod=NONE
#default: PriorityMaxAge=7-0 #(7 days)
#default: PriorityFavorSmall=no
PriorityWeightAge=10000
#default: PriorityWeightFairshare=0
PriorityWeightJobSize=1000
#default: PriorityWeightPartition=0
PriorityWeightQOS=10000


###
### Scheduling
###

SchedulerType=sched/backfill
#default: 
SchedulerParameters=default_queue_depth=100,defer=?,bf_interval=30,bf_window=1440,max_job_bf=50
SelectType=select/cons_res
SelectTypeParameters=CR_CPU_Memory
PreemptMode=requeue
#PreemptMode=checkpoint         # FIXME: cancels if checkpoint is not possible!
PreemptType=preempt/qos
CompleteWait=32                 # KillWait + 2
#default: KillWait=30


###
### Checkpointing
###

# ************** WARNING ***********************
# *** ENABLING/DISABLING THIS KILLS ALL JOBS ***
# **********************************************
CheckpointType=checkpoint/blcr
JobCheckpointDir=/state/partition1/slurm/checkpoint


###
### Logging
###

SlurmctldDebug=6
SlurmctldLogFile=/var/log/slurm/slurmctld.log
SlurmSchedLogLevel=1
SlurmSchedLogFile=/var/log/slurm/sched.log
SlurmdDebug=5
SlurmdLogFile=/var/log/slurm/slurmd.log
#default: DebugFlags=


###
### Accounting (Slurmdbd)
###

AccountingStorageType=accounting_storage/slurmdbd
AccountingStorageHost=blaster
JobAcctGatherType=jobacct_gather/linux
#default: JobAcctGatherFrequency=30
ProctrackType=proctrack/linuxproc # FIXME: check out cgroup
AccountingStorageEnforce=limits,qos
# combination of associations < limits < wckeys, qos
-- 
Regards,
Bjørn-Helge Mevik, dr. scient,
Research Computing Services, University of Oslo
