Slurm: 14.11.8
CentOS 6.6
Just wondering if anyone else has run into this issue. We have a cluster
with heterogeneous nodes: some have 20 (real) cores and some have 32
(real) cores.
Submitting something like:
sbatch --ntasks 16 --cpus-per-task 8 ...
should result in 128 cores being assigned to the job, but we only end up
with 124 cores assigned, and the following appears in slurmctld.log:
error: cons_res: _compute_c_b_task_dist oversubscribe for job 655
Some SLURM* output from the job:
SLURM_NODELIST=mos[20-24]
SLURM_TOPOLOGY_ADDR=mos20
SLURMD_NODENAME=mos20
SLURM_PRIO_PROCESS=0
SLURM_NODE_ALIASES=(null)
SLURM_TOPOLOGY_ADDR_PATTERN=node
SLURM_NNODES=5
SLURM_JOBID=655
SLURM_NTASKS=16
SLURM_TASKS_PER_NODE=3,4(x3),1
SLURM_CPUS_PER_TASK=8
SLURM_JOB_ID=655
SLURM_NODEID=0
SLURM_NPROCS=16
SLURM_TASK_PID=8991
SLURM_CPUS_ON_NODE=20
SLURM_PROCID=0
SLURM_JOB_NODELIST=mos[20-24]
SLURM_LOCALID=0
SLURM_JOB_CPUS_PER_NODE=20,32(x3),8
SLURM_GTIDS=0
SLURM_JOB_PARTITION=mpi
SLURM_JOB_NUM_NODES=5
SLURM_MEM_PER_NODE=1024
Some relevant lines from slurm.conf:
ProctrackType=proctrack/cgroup
TaskPlugin=task/cgroup
SchedulerType=sched/backfill
SelectType=select/cons_res
SelectTypeParameters=CR_Core_Memory
NodeName=mos[1-20] Gres=gpu:k20:1 CPUs=20 RealMemory=258439 Sockets=2 CoresPerSocket=10 ThreadsPerCore=1 State=UNKNOWN
NodeName=mos[21-24] CPUs=32 RealMemory=775550 Sockets=4 CoresPerSocket=8 ThreadsPerCore=1 State=UNKNOWN
PartitionName=gpu Nodes=mos[1-20] DefaultTime=01:00:00 MaxTime=7-00:00:00 State=UP
PartitionName=serial Nodes=mos[1-24] Default=YES DefaultTime=01:00:00 MaxTime=7-00:00:00 State=UP
PartitionName=threaded Nodes=mos[1-24] DefaultTime=01:00:00 MaxTime=7-00:00:00 State=UP
PartitionName=mpi Nodes=mos[1-24] DefaultTime=01:00:00 MaxTime=7-00:00:00 State=UP
I think the issue here is SLURM_TASKS_PER_NODE, which seems to be
calculated by the scheduler. It places 3 tasks on mos20, which only has
20 cores, but 3 tasks at 8 CPUs each would need 24. That leaves the job
4 cores short (124 instead of 128), and hence we see the
"_compute_c_b_task_dist oversubscribe for job" error.
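
To double-check the arithmetic, here is a minimal Python sketch (not part of Slurm) that expands the compressed SLURM_TASKS_PER_NODE and SLURM_JOB_CPUS_PER_NODE values from the job output above and compares what each node needs with what it was given. The helper name expand() is made up for illustration, and the sketch assumes both lists are in SLURM_NODELIST order, so index 0 would be mos20.

```python
import re

def expand(spec):
    """Expand Slurm's compressed per-node list, e.g. '3,4(x3),1' -> [3, 4, 4, 4, 1]."""
    values = []
    for item in spec.split(","):
        m = re.fullmatch(r"(\d+)(?:\(x(\d+)\))?", item.strip())
        if m is None:
            raise ValueError(f"unrecognized entry: {item!r}")
        # A missing "(xN)" suffix means the value applies to a single node.
        values.extend([int(m.group(1))] * int(m.group(2) or 1))
    return values

# Values copied from the environment of job 655.
tasks_per_node = expand("3,4(x3),1")      # SLURM_TASKS_PER_NODE
cpus_per_node = expand("20,32(x3),8")     # SLURM_JOB_CPUS_PER_NODE
cpus_per_task = 8                         # SLURM_CPUS_PER_TASK

shortfalls = []
for idx, (tasks, cpus) in enumerate(zip(tasks_per_node, cpus_per_node)):
    needed = tasks * cpus_per_task
    if needed > cpus:
        # Assumption: idx follows SLURM_NODELIST order (0 = first node listed).
        shortfalls.append((idx, needed, cpus))
    print(f"node {idx}: {tasks} tasks need {needed} CPUs, {cpus} allocated")

total_needed = sum(tasks_per_node) * cpus_per_task
total_allocated = sum(cpus_per_node)
print(f"total needed {total_needed}, total allocated {total_allocated}")
```

With the values from this job, only the first node comes up short, which matches the 4 missing cores.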
thanks
-k
--
Kaizaad Bilimorya
Systems Administrator - SHARCNET | http://www.sharcnet.ca
Compute Canada | http://www.computecanada.ca
ph: (519) 824-4120 x52700