Hi all,

I'm working on a small cluster: 2048 cores in 128 nodes, with 32 nodes per 
bladecenter and 4 bladecenters in one cabinet.  I'm planning to use SLURM as 
both my resource manager and scheduler, but the first problem I have to tackle 
is the partition design.

You see, the main problem I face is that my Infiniband mesh is not contiguous.  
It doesn't span outside of a bladecenter.  So it's really like I have four 
512-way clusters.

The first and most obvious solution is to make four SLURM partitions and let 
users submit their jobs to whichever partition they want.  Of course, the 
downside of that is a user-education problem.  In all likelihood they would 
always just submit to the same partition, which would stay busy while the 
others sat idle.  Not awesome.
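
To make that concrete, the slurm.conf for this option might look roughly like 
the sketch below.  The node names (node001 through node128) and partition 
names (bc1 through bc4) are made up for illustration, and CPUs=16 is simply 
2048 cores divided by 128 nodes.

```
# slurm.conf (sketch only; node and partition names are hypothetical)
NodeName=node[001-128] CPUs=16 State=UNKNOWN

# One partition per bladecenter, 32 nodes each
PartitionName=bc1 Nodes=node[001-032] Default=YES State=UP
PartitionName=bc2 Nodes=node[033-064] State=UP
PartitionName=bc3 Nodes=node[065-096] State=UP
PartitionName=bc4 Nodes=node[097-128] State=UP
```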

And then I thought it would be ideal if, as with Torque routing queues, I could 
create a special partition called "batch" which merely routed jobs to other 
partitions in a round-robin or other fashion.  However, if this is possible 
with SLURM, I have missed it in the docs.  I've only seen examples where 
partitions are defined in terms of nodes, not other partitions.  If I'm wrong 
on that, someone please correct me.

And then I had another idea.  What if I could use the topology/tree plugin to 
tell SLURM that I have, let's say, four switches (each representing the 
Infiniband switch in each blade chassis) -- and then associate each switch in 
topology.conf with only the nodes within its chassis?  If that worked, then I 
could just make one big partition for all the nodes, and let the scheduler 
figure out that if it spreads an MPI job across nodes that don't touch the same 
switch, then it will fail.  That is, if it works like that.
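
As an untested sketch of what I have in mind: the switch names (ib1 through 
ib4), the node names, and the assumption that nodes are numbered chassis by 
chassis are all placeholders.  Note that no top-level switch joins the four 
leaf switches, since no such link exists physically; whether topology/tree 
accepts that layout and keeps each job inside one leaf is exactly the part I'm 
unsure about.

```
# slurm.conf (sketch only; names are hypothetical)
TopologyPlugin=topology/tree
NodeName=node[001-128] CPUs=16 State=UNKNOWN
PartitionName=batch Nodes=node[001-128] Default=YES State=UP

# topology.conf (sketch only): one leaf switch per blade chassis
SwitchName=ib1 Nodes=node[001-032]
SwitchName=ib2 Nodes=node[033-064]
SwitchName=ib3 Nodes=node[065-096]
SwitchName=ib4 Nodes=node[097-128]
```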


These and other burning questions this SLURM newbie is dying to have answered.  
Your advice is much appreciated!


Using CentOS 6.4 & SLURM 2.6.0

--
Jonathan Mills
Systems Administrator
Renaissance Computing Institute
UNC-Chapel Hill
