I am relatively new to the Slurm system.
We have a cluster of several hundred cores connected by an InfiniBand switch.
It is run by a Slurm scheduler, version 2.6.5-1, installed on Ubuntu 14.04.
Ever since we installed the system we have been experiencing problems with
oversubscription of cores on nodes.

From what we've been able to figure out, when we run a parallel job and
request a number of cores, let's say 24:
#SBATCH -n 24
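
For context, a stripped-down version of the kind of batch script we submit
looks like this (the job name and the program are just placeholders, not our
real ones):

#!/bin/bash
#SBATCH -n 24
#SBATCH -J parallel_test
# launch one task per allocated core
srun ./my_parallel_program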

Slurm assigns the right number of cores to the job and divides them between
the nodes so that no node is overloaded. Our nodes have 16 cores each, so if
node 1 is empty and node 2 has 8 free cores, the output file will say that 16
cores were assigned to node 1 and 8 cores to node 2. Therefore each node
should end up with a load of 16.
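
The reported split we are referring to is the one Slurm itself exposes to the
job; we print it at the start of the job script with something like the
following (the echo lines are just illustrative):

echo "Allocated nodes: $SLURM_JOB_NODELIST"
echo "Tasks per node:  $SLURM_TASKS_PER_NODE"

For the example above, the tasks-per-node list should read something like
16,8 if the job were placed the way Slurm reports.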

HOWEVER, in practice each node gets a different number of cores. Node 1 may
get only 8 cores, leaving it half empty, while node 2 gets the remaining 16,
bringing it to a load of 28. We haven't figured out whether there is any rule
to how the cores are actually divided or whether it is random, but it is
definitely NOT how Slurm divides them, or how it thinks it divides them.
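
In case it matters, the way we observe the actual placement is simply by
counting where the launched tasks end up, roughly like this (run inside the
same allocation):

srun hostname | sort | uniq -c

and then comparing that per-node count against the split Slurm reported,
together with the load shown on the nodes themselves.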

Any idea how to resolve this issue?
Thanks
Omer
