We're having an issue with CPU binding when two jobs land on the same node:
some cores are shared by the two jobs while others are left idle. Below is
output from "top" after pressing 'f' and then 'j' to display the processor
each task is running on (the P column):
  PID USER  PR NI VIRT RES  SHR  S %CPU %MEM TIME+   P COMMAND
 5577 bacon 20  0 3916 368  300  R 49.9  0.0 0:34.86 0 calcpi-parallel
 5578 bacon 20  0 3916 372  300  R 49.9  0.0 0:34.89 2 calcpi-parallel
 5609 bacon 20  0 410m 108m 3836 R 49.9  0.7 0:12.52 0 mpi_bench
 5610 bacon 20  0 410m 110m 3836 R 49.9  0.7 0:12.52 2 mpi_bench
As you can see above, both jobs are using cores 0 and 2, while cores 1
and 3 are unused.
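(For reference, the affinity masks can also be read directly, outside of top;
taskset here is the stock util-linux tool, and the PIDs are the ones from the
output above, so they would need adjusting for another run:

  taskset -cp 5577    # show the CPU list calcpi-parallel is bound to
  taskset -cp 5609    # same for mpi_bench
  grep Cpus_allowed_list /proc/5577/status /proc/5609/status

That should confirm whether the processes are genuinely restricted to cores 0
and 2, or are just happening to run there, since the P column only shows the
last-used CPU.)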
Here's what I think might be relevant from our slurm.conf:
MpiDefault=none
SchedulerType=sched/backfill
SelectType=select/cons_res
SelectTypeParameters=CR_Core_Memory
TaskPlugin=task/affinity
TaskPluginParam=cores,verbose
FastSchedule=1
NodeName=compute-001 RealMemory=8000 Sockets=2 CoresPerSocket=2 State=UNKNOWN
NodeName=compute-002 RealMemory=15946 Sockets=2 CoresPerSocket=2 State=UNKNOWN
PartitionName=batch Nodes=compute-[001-002] Default=YES MaxTime=INFINITE State=UP
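(If it matters, the Sockets/CoresPerSocket values above can be checked against
what the nodes actually report; these are the stock commands, run on
compute-001 itself apart from scontrol:

  slurmd -C                        # hardware configuration as slurmd detects it
  lscpu                            # sockets/cores/threads as the kernel sees them
  scontrol show node compute-001   # what slurmctld currently has recorded

I can post that output too if a topology mismatch could explain the binding
behavior.)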
This is a small test cluster running CentOS and SLURM 14.11.6.
Any suggestions would be appreciated.
Thanks,
Jason
--
All wars are civil wars, because all men are brothers ... Each one owes
infinitely more to the human race than to the particular country in
which he was born.
-- Francois Fenelon