Hi, I'm trying to get a feel for fairshare scheduling.
I've got 3 VMs: a scheduler and 2 worker nodes. I've created a
couple of accounts and assigned a user to each of them:
slurmtest-sched# sacctmgr list assoc tree format=account,user,share
             Account       User      Share
-------------------- ---------- ----------
root                                      1
 atlas                                   80
  atlas                  alexis      parent
 belle                                   20
  belle                 alexis2      parent
slurmtest-sched#
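For reference, this is roughly how I created that tree (reconstructed
from memory, so treat it as a sketch; -i just skips the confirmation
prompt):

    sacctmgr -i add account atlas fairshare=80
    sacctmgr -i add account belle fairshare=20
    sacctmgr -i add user alexis account=atlas fairshare=parent
    sacctmgr -i add user alexis2 account=belle fairshare=parent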
Now I submit 20,000 jobs, alternating between the two users:
for X in {1..10000}; do
    su - alexis  -c "sbatch slurmtest.sh"
    su - alexis2 -c "sbatch slurmtest.sh"
done
Both users run the same job, which is just:
alexis@slurmtest-sched:~$ cat ~/slurmtest.sh
#!/bin/bash
#SBATCH --job-name=slurmtest
#SBATCH --mem=10
#SBATCH --output=/dev/null
#SBATCH --error=/dev/null
#SBATCH --mail-user=me@mine
#SBATCH --partition=normal
true
alexis@slurmtest-sched:~$
I made the job deliberately quick so that I would see results quickly
and could watch Slurm responding quickly too.
Then I just run 'watch sshare -al' and try to understand
what I see, which includes:
alexis2@slurmtest-sched:~$ sshare -al
             Account       User Raw Shares Norm Shares  Raw Usage Norm Usage Effectv Usage  FairShare GrpCPUMins CPURunMins
-------------------- ---------- ---------- ----------- ---------- ---------- ------------- ---------- ---------- ----------
root                                           1.000000       1555                 0.000000   1.000000                 14450
 atlas                                  80     0.800000        752   0.476011      0.476011   0.662038                 11560
  atlas                  alexis     parent     0.800000        752   0.476717      0.476717   0.661633                 11560
 belle                                  20     0.200000        802   0.523989      0.523989   0.162674                  2890
  belle                 alexis2     parent     0.200000        802   0.523283      0.523283   0.163073                  2890
alexis2@slurmtest-sched:~$
and a few seconds later:
alexis2@slurmtest-sched:~$ sshare -al
             Account       User Raw Shares Norm Shares  Raw Usage Norm Usage Effectv Usage  FairShare GrpCPUMins CPURunMins
-------------------- ---------- ---------- ----------- ---------- ---------- ------------- ---------- ---------- ----------
root                                           1.000000       1578                 0.000000   1.000000                 20230
 atlas                                  80     0.800000        763   0.476011      0.476011   0.662038                  8670
  atlas                  alexis     parent     0.800000        763   0.476717      0.476717   0.661633                  8670
 belle                                  20     0.200000        814   0.523989      0.523989   0.162674                 11560
  belle                 alexis2     parent     0.200000        814   0.523283      0.523283   0.163073                 11560
alexis2@slurmtest-sched:~$
and so on.
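Incidentally, the FairShare column looks consistent with the simple
formula I've seen described for priority/multifactor, namely
F = 2^(-EffectvUsage / NormShares); checking it against the first
snapshot above:

    atlas: 2^(-0.476011 / 0.8) = 2^(-0.595014) ~= 0.662038
    belle: 2^(-0.523989 / 0.2) = 2^(-2.619945) ~= 0.162674

both of which match what sshare reports.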
So now to my questions:
The fairshare values are weightings for the priorities of *future*
jobs based on *past* scheduling behaviour; as such, the fact that
their current values are close to an 80:20 split is just because that's
the prioritisation weighting for future jobs from the two users
that would be required to "redress the balance" (because currently
the two users have *not* yet had their fair shares of CPU time).
Is that a correct reading of the fairshare values?
If that is correct, then I would expect the effective usage and
CPURunMins to tend towards an 80:20 split, in line with the dictates
of the fairshare value, but this doesn't seem to be happening. Why?
Am I just not waiting long enough? Is there a way to make Slurm
respond faster, at least for testing purposes?
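(For context, I realise the slurm.conf below has
PriorityDecayHalfLife=14-0; I'm guessing that shrinking the half-life
is the sort of thing that would make the fairshare feedback visible
sooner on a test cluster, e.g. something like this, which I have not
tried:

    PriorityDecayHalfLife=00:15:00   # decay past usage over minutes, not 14 days

but I'd welcome confirmation.)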
Any suggestions appreciated!
I'm running:
slurmtest-sched# dpkg -l | grep slurm
ii  slurm-llnl                2.6.5-1  amd64  Simple Linux Utility for Resource Management
ii  slurm-llnl-basic-plugins  2.6.5-1  amd64  SLURM basic plugins
ii  slurm-llnl-slurmdbd       2.6.5-1  amd64  Secure enterprise-wide interface to a database for SLURM
slurmtest-sched# cat /etc/issue
Ubuntu 14.04.4 LTS \n \l
slurmtest-sched# uname -a
Linux slurmtest-sched 3.13.0-79-generic #123-Ubuntu SMP Fri Feb 19 14:27:58 UTC 2016 x86_64 x86_64 x86_64 GNU/Linux
slurmtest-sched#
and the same on the two worker nodes.
slurm.conf contains:
slurmtest-sched# egrep -v '^(#| *$)' /etc/slurm-llnl/slurm.conf | sort
AccountingStorageEnforce=associations,limits
AccountingStorageHost=localhost
AccountingStorageType=accounting_storage/slurmdbd
AuthType=auth/munge
CacheGroups=0
ClusterName=slurmtest
ControlMachine=slurmtest-sched
CryptoType=crypto/munge
FastSchedule=2
InactiveLimit=0
JobAcctGatherType=jobacct_gather/linux
JobCompType=jobcomp/none
KillWait=30
MailProg=/usr/bin/mail
MinJobAge=300
MpiDefault=none
NodeName=slurmtest-wn[1-2] CPUS=4 Sockets=4 CoresPerSocket=1 ThreadsPerCore=1 RealMemory=2048 State=UNKNOWN
PartitionName=normal Nodes=slurmtest-wn[1-2] MaxTime=48:10:00 State=UP AllowGroups=alexis,alexis2
PartitionName=short Nodes=slurmtest-wn[1-2] MaxTime=24:10:00 State=UP AllowGroups=alexis,alexis2
PriorityCalcPeriod=1
PriorityDecayHalfLife=14-0
PriorityFavorSmall=NO
PriorityMaxAge=14-0
PriorityType=priority/multifactor
PriorityWeightAge=1000
PriorityWeightFairshare=10000
PriorityWeightJobSize=1000
PriorityWeightPartition=1000
PriorityWeightQOS=0
Proctracktype=proctrack/linuxproc
ReturnToService=0
SchedulerPort=7321
SchedulerType=sched/backfill
SelectType=select/cons_res
SelectTypeParameters=CR_CPU_Memory
SlurmUser=slurm
SlurmctldDebug=3
SlurmctldLogFile=/var/log/slurm-llnl/slurmctld.log
SlurmctldPidFile=/var/run/slurm-llnl/slurmctld.pid
SlurmctldPort=6817
SlurmctldTimeout=300
SlurmdDebug=3
SlurmdLogFile=/var/log/slurm-llnl/slurmd.log
SlurmdPidFile=/var/run/slurm-llnl/slurmd.pid
SlurmdPort=6818
SlurmdSpoolDir=/var/lib/slurm-llnl/slurmd
SlurmdTimeout=300
StateSaveLocation=/var/lib/slurm-llnl/slurmctld
SwitchType=switch/none
Waittime=0
slurmtest-sched#
Thanks!
Alexis