Hi Omer,

As a first step, I would update Slurm to the latest version. 2.6 is quite
old, so your problem may be a bug that has since been fixed.

Besides, could you post a bit more about your system (which MPI library?)
and the relevant parts of your slurm.conf?
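
In particular, the settings below are the ones that usually govern how cores
are allocated. This is only an illustrative sketch (the option names are
standard Slurm parameters, but the values are assumptions, not your actual
config):

```
# Illustrative sketch only -- replace with your site's real values.
SelectType=select/cons_res       # allocate individual cores, not whole nodes
SelectTypeParameters=CR_Core     # treat cores as the consumable resource
```

If SelectType is left at select/linear (the default), Slurm allocates whole
nodes, which can make task placement look very different from what the
per-core accounting in the output file suggests.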

Cheers,


Manuel

2016-06-22 0:15 GMT+02:00 omer bromberg <o...@wise.tau.ac.il>:

> I am relatively new to the Slurm system.
> We have a cluster of several hundred cores connected with an InfiniBand
> switch.
> It is run by a Slurm scheduler, version 2.6.5-1, installed on Ubuntu 14.04.
> Ever since we installed the system we have been experiencing problems with
> oversubscription of cores on nodes.
>
> From what we've been able to figure out, when we run a parallel job and
> request a number of cores, let's say 24:
> #SBATCH -n 24
>
> Slurm assigns the right number of cores to the job and divides the cores
> between the nodes so that no node is overloaded. Our nodes have 16 cores
> each. So if node 1 is empty and node 2 has 8 free cores, the output file
> will say that 16 cores were assigned to node 1 and 8 cores to node 2.
> Therefore each node should have a load of 16.
>
> HOWEVER, in practice each node gets a different number of cores. Node 1
> can get only 8 cores, leaving it half empty, while node 2 will get the
> remaining 16 cores, bringing it to a load of 28.
> We haven't figured out whether there is any rule governing the way the
> cores are actually divided, or whether it is random.
> But it is definitely NOT how Slurm divides them, nor how it thinks it
> divides them.
>
> Any idea how to resolve this issue?
> Thanks
> Omer
>
