Hi, we've got a brand new cluster and a new SLURM installation (14.03.3), and 
I'm trying to set it up so that users can share nodes if desired.
At the moment, I'm seeing two unexpected behaviors that I could use some help 
figuring out.  Each of our nodes has 20 cores, and I'm assuming cores are the 
smallest unit of division, i.e., jobs may share a node but never a core.
When submitting small jobs, as long as they all have shared=1, everything works 
fine.  However, if a bunch of shared=1 jobs are running on a node and a 
shared=0 job comes along, it gets incorrectly packed onto that node alongside 
the shared=1 jobs.  Once a single shared=0 job is running on a node, 
subsequent jobs get assigned new nodes, regardless of their shared status.  
This seems like a bug to me, as I'd expect a shared=0 job never to share a node.
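
In case it helps, here's a minimal way to reproduce what I'm describing 
(job.sh is just a placeholder for a trivial single-core job script):

    # fill most of a 20-core node with shared single-core jobs
    for i in $(seq 1 18); do sbatch --share -n 1 job.sh; done

    # I'd expect this one to get a node to itself, but it lands on the
    # same node as the shared jobs above
    sbatch --exclusive -n 1 job.sh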

The second issue is that if the first shared=0 job to come along asks for more 
cores than remain available on a node, it gets packed onto that node anyway, 
overallocating it.  For example, if there are 18 single-core shared=1 jobs on 
a node and I submit a 20-core shared=0 job, it ends up on the same node, and 
I end up with 38 tasks competing for 20 cores.
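
The overallocation is easy to see with something like the following (the node 
name is a placeholder):

    # show job id, CPU count, shared flag, and node list
    squeue -o "%.10i %.5C %.6h %N"

    # CPUAlloc on the affected node reports more than its 20 physical cores
    scontrol show node node001 | grep CPUAlloc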

I've attached my slurm.conf; please let me know if I can provide any other 
useful info.
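
For anyone who'd rather not dig through the attachment, the kind of setup I 
mean by core-level sharing is along these lines (illustrative only; node 
names, socket layout, and partition name are placeholders, and my actual 
values are in the attached file):

    SelectType=select/cons_res
    SelectTypeParameters=CR_Core
    NodeName=node[001-100] Sockets=2 CoresPerSocket=10 ThreadsPerCore=1 State=UNKNOWN
    PartitionName=standard Nodes=node[001-100] Shared=YES Default=YES State=UP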

Thanks,
Kevin

--
Kevin Hildebrand
Division of IT
University of Maryland, College Park

Attachment: slurm.conf
Description: slurm.conf
