Hi Omer,

As a first step, I would update Slurm to the latest version. 2.6 is quite old, so your problem may well be a bug that has since been fixed.
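A likely culprit with this kind of oversubscription is the select plugin. For per-core scheduling you would want something along these lines in slurm.conf (just a sketch, assuming the consumable-resources plugin; your node and partition lines will of course differ):

    # Schedule individual cores instead of whole nodes
    SelectType=select/cons_res
    SelectTypeParameters=CR_Core

With SelectType=select/linear, by contrast, Slurm hands out whole nodes only, so the per-node core counts reported for a job will not mean what you expect.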
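It would also be worth checking whether the MPI ranks actually end up where Slurm placed them. You could add something like this to the job script (a sketch; SLURM_JOB_NODELIST and SLURM_TASKS_PER_NODE are environment variables Slurm sets for the job, and srun launches one hostname per allocated task):

    # What Slurm allocated
    echo "Nodes:          $SLURM_JOB_NODELIST"
    echo "Tasks per node: $SLURM_TASKS_PER_NODE"
    # Where the tasks really run: one hostname per task, counted per node
    srun hostname | sort | uniq -c

If the counts from uniq -c disagree with SLURM_TASKS_PER_NODE, the launcher is doing its own placement. That typically happens when an MPI library built without Slurm support is started with mpirun instead of srun.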
Besides, could you post a bit more about your system (MPI library?) and the relevant information from your slurm.conf?

Cheers,
Manuel

2016-06-22 0:15 GMT+02:00 omer bromberg <o...@wise.tau.ac.il>:
> I am relatively new to the Slurm system.
> We have a cluster of several hundred cores connected with an InfiniBand
> switch. It is run by a Slurm scheduler, version 2.6.5-1, installed on
> Ubuntu 14.04. Ever since we installed the system we have been
> experiencing problems with oversubscription of cores on nodes.
>
> From what we've been able to figure out, when we run a parallel job and
> request a number of cores, let's say 24:
>
> #SBATCH -n 24
>
> Slurm assigns the right number of cores to the job and divides them
> between the nodes so that no node is overloaded. Our nodes have 16 cores
> each, so if node 1 is empty and node 2 has 8 free cores, the output file
> will say that 16 cores were assigned to node 1 and 8 cores to node 2.
> Each node should therefore have a load of 16.
>
> HOWEVER, in practice each node gets a different number of cores. Node 1
> can get only 8 cores, leaving it half empty, while node 2 will get the
> remaining 16 cores, bringing it to a load of 28. We haven't figured out
> whether there is any rule to how the cores are actually divided or
> whether it is random, but it is definitely NOT how Slurm divides them
> and how it thinks it divides them.
>
> Any idea how to resolve this issue?
> Thanks,
> Omer