Slurm: 14.11.8
CentOS 6.6

Just wondering if anyone else has run into this issue. We have a cluster with heterogeneous nodes: some have 20 (real) cores and some have 32 (real) cores.

Submitting something like:

sbatch --ntasks 16 --cpus-per-task 8 ...

This should result in 128 cores being assigned to the job, but we only end up with 124 cores, along with the following in slurmctld.log:

error: cons_res: _compute_c_b_task_dist oversubscribe for job 655


Some SLURM* output from the job:

SLURM_NODELIST=mos[20-24]
SLURM_TOPOLOGY_ADDR=mos20
SLURMD_NODENAME=mos20
SLURM_PRIO_PROCESS=0
SLURM_NODE_ALIASES=(null)
SLURM_TOPOLOGY_ADDR_PATTERN=node
SLURM_NNODES=5
SLURM_JOBID=655
SLURM_NTASKS=16
SLURM_TASKS_PER_NODE=3,4(x3),1
SLURM_CPUS_PER_TASK=8
SLURM_JOB_ID=655
SLURM_NODEID=0
SLURM_NPROCS=16
SLURM_TASK_PID=8991
SLURM_CPUS_ON_NODE=20
SLURM_PROCID=0
SLURM_JOB_NODELIST=mos[20-24]
SLURM_LOCALID=0
SLURM_JOB_CPUS_PER_NODE=20,32(x3),8
SLURM_GTIDS=0
SLURM_JOB_PARTITION=mpi
SLURM_JOB_NUM_NODES=5
SLURM_MEM_PER_NODE=1024


Some relevant lines from slurm.conf

ProctrackType=proctrack/cgroup
TaskPlugin=task/cgroup
SchedulerType=sched/backfill
SelectType=select/cons_res
SelectTypeParameters=CR_Core_Memory

NodeName=mos[1-20] Gres=gpu:k20:1 CPUs=20 RealMemory=258439 Sockets=2 CoresPerSocket=10 ThreadsPerCore=1 State=UNKNOWN
NodeName=mos[21-24] CPUs=32 RealMemory=775550 Sockets=4 CoresPerSocket=8 ThreadsPerCore=1 State=UNKNOWN
PartitionName=gpu Nodes=mos[1-20] DefaultTime=01:00:00 MaxTime=7-00:00:00 State=UP
PartitionName=serial Nodes=mos[1-24] Default=YES DefaultTime=01:00:00 MaxTime=7-00:00:00 State=UP
PartitionName=threaded Nodes=mos[1-24] DefaultTime=01:00:00 MaxTime=7-00:00:00 State=UP
PartitionName=mpi Nodes=mos[1-24] DefaultTime=01:00:00 MaxTime=7-00:00:00 State=UP
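
As a quick sanity check on the node definitions above, here is a minimal Python sketch of how many 8-CPU tasks can fit on each node type. It assumes that all CPUs of a task must sit on a single node; the dictionary layout and variable names are just for illustration, the numbers are copied from the NodeName lines.

```python
# Node shapes copied from the NodeName lines in slurm.conf above.
node_types = {
    "mos[1-20]": {"sockets": 2, "cores_per_socket": 10, "threads_per_core": 1, "cpus": 20},
    "mos[21-24]": {"sockets": 4, "cores_per_socket": 8, "threads_per_core": 1, "cpus": 32},
}
cpus_per_task = 8  # from --cpus-per-task 8

capacity = {}
for name, node in node_types.items():
    # CPUs should equal Sockets * CoresPerSocket * ThreadsPerCore.
    cores = node["sockets"] * node["cores_per_socket"] * node["threads_per_core"]
    if cores != node["cpus"]:
        raise ValueError(f"{name}: CPUs={node['cpus']} does not match topology ({cores})")
    # Assumption: a task cannot span nodes, so only whole tasks count.
    capacity[name] = cores // cpus_per_task
    print(f"{name}: {cores} cores, room for {capacity[name]} tasks of {cpus_per_task} CPUs")
```

Under that assumption a 20-core node has room for only two such tasks, while a 32-core node has room for four.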


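I think the issue here is "SLURM_TASKS_PER_NODE", which seems to be calculated by the scheduler. It appears to place 3 tasks on mos20, which only has 20 cores. At 8 CPUs per task that would need 24 cores, so the node is 4 cores short (which matches 128 - 124), and hence we see the "_compute_c_b_task_dist oversubscribe for job" error.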
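
To make the arithmetic explicit, here is a rough Python sketch that expands the compressed SLURM_TASKS_PER_NODE and SLURM_JOB_CPUS_PER_NODE values from the job output above and compares what each node would need with what it was given. The helper name expand() is made up for this example and is not part of Slurm.

```python
import re

def expand(spec):
    """Expand Slurm's compressed list format, e.g. '3,4(x3),1' -> [3, 4, 4, 4, 1]."""
    values = []
    for item in spec.split(","):
        match = re.fullmatch(r"(\d+)(?:\(x(\d+)\))?", item)
        if match is None:
            raise ValueError(f"unexpected item: {item!r}")
        value = int(match.group(1))
        repeat = int(match.group(2) or 1)
        values.extend([value] * repeat)
    return values

# Values copied from the job environment above.
tasks_per_node = expand("3,4(x3),1")    # SLURM_TASKS_PER_NODE
cpus_per_node = expand("20,32(x3),8")   # SLURM_JOB_CPUS_PER_NODE
cpus_per_task = 8                       # SLURM_CPUS_PER_TASK

# Per node: (index, tasks, CPUs allocated, CPUs needed, shortfall).
report = []
for index, (tasks, cpus) in enumerate(zip(tasks_per_node, cpus_per_node)):
    needed = tasks * cpus_per_task
    report.append((index, tasks, cpus, needed, needed - cpus))

total_allocated = sum(cpus_per_node)
total_needed = sum(tasks_per_node) * cpus_per_task

for index, tasks, cpus, needed, short in report:
    print(f"node {index}: {tasks} tasks need {needed} CPUs, allocated {cpus}, short by {short}")
print(f"total needed {total_needed}, total allocated {total_allocated}")
```

Only the first node in the list (mos20) comes up short, by 4 CPUs, which accounts for the whole gap between 128 and 124.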

thanks
-k
--
Kaizaad Bilimorya
Systems Administrator - SHARCNET | http://www.sharcnet.ca
Compute Canada | http://www.computecanada.ca
ph: (519) 824-4120 x52700
