We're having an issue with CPU binding when two jobs land on the same node.

Some cores are shared by the two jobs while others are left idle. Below is output from "top" after pressing 'f' then 'j' to show the last CPU used by each process (the P column):

  PID USER      PR  NI  VIRT  RES  SHR S %CPU %MEM    TIME+  P COMMAND
 5577 bacon     20   0  3916  368  300 R 49.9  0.0   0:34.86 0 calcpi-parallel
 5578 bacon     20   0  3916  372  300 R 49.9  0.0   0:34.89 2 calcpi-parallel
 5609 bacon     20   0  410m 108m 3836 R 49.9  0.7   0:12.52 0 mpi_bench
 5610 bacon     20   0  410m 110m 3836 R 49.9  0.7   0:12.52 2 mpi_bench

As you can see above, both jobs are being bound to cores 0 and 2, while cores 1 and 3 sit idle.
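If it would help, I can also grab the exact affinity masks for those PIDs, e.g.:

  # query the current CPU affinity of one task from each job
  # (PIDs taken from the top output above)
  taskset -cp 5577
  taskset -cp 5609

Happy to post that output too if it's useful.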

Here's what I think is relevant from our slurm.conf:

MpiDefault=none
SchedulerType=sched/backfill
SelectType=select/cons_res
SelectTypeParameters=CR_Core_Memory
TaskPlugin=task/affinity
TaskPluginParam=cores,verbose
FastSchedule=1
NodeName=compute-001 RealMemory=8000 Sockets=2 CoresPerSocket=2 State=UNKNOWN
NodeName=compute-002 RealMemory=15946 Sockets=2 CoresPerSocket=2 State=UNKNOWN
PartitionName=batch Nodes=compute-[001-002] Default=YES MaxTime=INFINITE State=UP
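
One thing I wasn't sure about: with FastSchedule=1 the scheduler presumably trusts the Sockets/CoresPerSocket values in slurm.conf rather than what slurmd detects, so a mismatch there could cause odd binding. This is roughly what I've been running to compare the two (standard Slurm commands, nothing custom):

  # topology as seen by the controller
  scontrol show node compute-001

  # topology as detected on the node itself (run on compute-001)
  slurmd -C

I can post that output as well if it would help.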

This is a small test cluster running CentOS and SLURM 14.11.6.

Any suggestions would be appreciated.

Thanks,

    Jason

--
All wars are civil wars, because all men are brothers ... Each one owes
infinitely more to the human race than to the particular country in
which he was born.
                -- Francois Fenelon
