Hi all,

I'm working on a small cluster: 2048 cores in 128 nodes, with 32 nodes per bladecenter and 4 bladecenters in one cabinet. I'm planning to use SLURM as both resource manager and scheduler, but the first problem I have to tackle is the partition design.
You see, the main problem I face is that my Infiniband mesh is not contiguous. It doesn't span outside of a bladecenter, so it's really like I have four separate 512-core clusters.

The first and most obvious solution is to make four SLURM partitions and let users send their jobs to whichever partition they want. Of course, the downside of that is a user-education problem. In all likelihood they would always just submit to the same partition, which would stay busy while the others sat idle. Not awesome.

Then I thought it would be ideal if, as with Torque routing queues, I could create a special partition called "batch" which merely routed jobs to the other partitions in round-robin or some other fashion. However, if this is possible with SLURM, I have missed it in the docs. I've only seen examples where partitions are defined in terms of nodes, not other partitions. If I'm wrong on that, someone please correct me.

And then I had another idea. What if I could use the topology/tree plugin to tell SLURM that I have, let's say, four switches (each representing the Infiniband switch in one blade chassis), and then associate each switch in topology.conf with only the nodes within its chassis? If that worked, I could make one big partition for all the nodes and let the scheduler figure out that an MPI job spread across nodes that don't share a switch will fail, so each job has to stay within one chassis. That is, if it works like that.

These and other burning questions this SLURM newbie is dying to have answered. Your advice is much appreciated!

Using CentOS 6.4 & SLURM 2.6.0

--
Jonathan Mills
Systems Administrator
Renaissance Computing Institute
UNC-Chapel Hill
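
P.S. To make the topology idea concrete, here is a rough sketch of what I have in mind. The node names (node[001-128]) and switch names (ibsw1 through ibsw4) are made up for illustration, and I'm assuming 16 cores per node (2048 / 128). I have not tested this, and whether topology/tree copes with four leaf switches that share no parent switch is exactly the part I'm unsure about.

```
# slurm.conf (excerpt); node names are hypothetical
TopologyPlugin=topology/tree
NodeName=node[001-128] CPUs=16 State=UNKNOWN
PartitionName=batch Nodes=node[001-128] Default=YES State=UP

# topology.conf; one entry per chassis Infiniband switch, names are hypothetical
SwitchName=ibsw1 Nodes=node[001-032]
SwitchName=ibsw2 Nodes=node[033-064]
SwitchName=ibsw3 Nodes=node[065-096]
SwitchName=ibsw4 Nodes=node[097-128]
```

The intent is a single "batch" partition over all 128 nodes, with topology.conf as the only place that knows about the chassis boundaries.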
