Hello,
I'm trying to set up SLURM 15.08.1 on a single multi-core node to
manage multi-threaded jobs. The machine has 16 physical cores
across 2 sockets, with HyperThreading enabled. I'm using the EASY
scheduling algorithm with backfilling. The goal is to keep all
available cores busy at all times.
I have three jobs that need 8 cores, 2 cores, and 4 cores
respectively, so I expect them to be co-scheduled and to use 14 of
the 16 available cores. Instead, SLURM runs the latter two jobs
together and does not start the first job until they finish.
(Is this the expected behavior of the EASY-backfilling algorithm?)
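To spell out the arithmetic behind that expectation, here is a small shell sketch. It does not talk to SLURM at all. The variable names are only illustrative, and the numbers are the ones from the job scripts and the NodeName line in this post.

```shell
#!/bin/bash
# Illustrative arithmetic only; nothing here queries SLURM.
# Node topology as given on the NodeName line of slurm.conf.
sockets=2
cores_per_socket=8
total_cores=$((sockets * cores_per_socket))

# Cores requested by job1, job2, job3 (their --cores-per-socket values).
requests="8 2 4"

used=0
for r in $requests; do
    used=$((used + r))
done
idle=$((total_cores - used))

echo "total=$total_cores requested=$used idle=$idle"
# prints: total=16 requested=14 idle=2
```

So on paper all three jobs fit on the node at once, with 2 cores to spare.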
Here are the three job scripts:
$ cat job1.batch
#!/bin/bash
#SBATCH --sockets-per-node=1
#SBATCH --cores-per-socket=8
#SBATCH --threads-per-core=1
srun /path/to/application1
$ cat job2.batch
#!/bin/bash
#SBATCH --sockets-per-node=1
#SBATCH --cores-per-socket=2
#SBATCH --threads-per-core=1
srun /path/to/application2
$ cat job3.batch
#!/bin/bash
#SBATCH --sockets-per-node=1
#SBATCH --cores-per-socket=4
#SBATCH --threads-per-core=1
srun /path/to/application3
Here's my SLURM config:
$ cat /path/to/slurm.conf
ControlMachine=localhost
ControlAddr=127.0.0.1
AuthType=auth/none
CacheGroups=0
CryptoType=crypto/munge
MpiDefault=none
ProctrackType=proctrack/linuxproc
ReturnToService=1
SlurmctldPidFile=/path/to/slurmctld.pid
SlurmctldPort=6817
SlurmdPidFile=/path/to/pids/slurmd.pid
SlurmdPort=6818
SlurmdSpoolDir=/path/to/slurmdspooldir
SlurmUser=myuserid
SlurmdUser=myuserid
StateSaveLocation=/path/to/states
SwitchType=switch/none
InactiveLimit=0
KillWait=30
MinJobAge=300
SlurmctldTimeout=120
SlurmdTimeout=300
Waittime=0
FastSchedule=1
SchedulerType=sched/backfill
SchedulerPort=7321
SelectType=select/cons_res
SelectTypeParameters=CR_CORE
AccountingStorageLoc=/path/to/accounting.log
AccountingStorageType=accounting_storage/filetxt
AccountingStoreJobComment=YES
ClusterName=cluster
JobCompLoc=/path/to/completion.log
JobCompType=jobcomp/filetxt
JobAcctGatherFrequency=30
JobAcctGatherType=jobacct_gather/linux
SlurmctldDebug=3
SlurmdDebug=3
NodeName=localhost NodeAddr=127.0.0.1 Sockets=2 CoresPerSocket=8 ThreadsPerCore=2 State=UNKNOWN
PartitionName=debug Nodes=localhost Default=YES MaxTime=INFINITE State=UP
DebugFlags=Backfill,CPU_Bind,Priority,SelectType
I'm a SLURM newbie, so I might be missing something obvious. I'd
appreciate any help.
Thanks,
-Rohan