On 2013-08-02T08:46:08 EEST, Jonathan Mills wrote:
> Hi all,
>
> I'm working on a small cluster, 2048 cores in 128 nodes -- 32 nodes per
> bladecenter -- 4 bladecenters in one cabinet. I'm planning to use SLURM as
> my RM as well as Scheduler, but the first problem I have to tackle is the
> partition design.
>
> You see, the main problem I face is that my Infiniband mesh is not
> contiguous. It doesn't span outside of a bladecenter. So it's really like I
> have four 512-way clusters.
>
> The first and most obvious solution is to make four SLURM partitions, and
> users can tell their jobs to go to whichever partition they want. Of
> course, the downside of that is a user-education problem. In all likelihood
> they would always just submit to the same partition, which would always stay
> busy while the others sat idle. Not awesome.
>
> And then I thought it would be ideal if, as with Torque routing queues, I
> could create a special partition called "batch" which merely routed jobs to
> other partitions in a round-robin or other fashion. However, if this is
> possible with SLURM, I have missed it in the docs. I've only seen examples
> where Partitions are defined in terms of nodes, not other partitions. If I'm
> wrong on that, please someone correct me.

I think it should be possible with a so-called "job submit plugin". I haven't
used those myself, but maybe those terms will help you find the relevant stuff
in the docs.

> And then I had another idea. What if I could use the topology/tree plugin to
> tell SLURM that I have, let's say, four switches (each representing the
> Infiniband switch in each blade chassis) -- and then associate each switch in
> topology.conf with only the nodes within its chassis? If that worked, then I
> could just make one big partition for all the nodes, and let the scheduler
> figure out that if it spreads an MPI job across nodes that don't touch the
> same switch, then it will fail. That is, if it works like that.

There are some options that one can set to specify how long a job will wait
for an "optimal" topology; setting that time to infinity would then accomplish
what you want. I'm not sure you can do that by default though, except with a
job submit plugin as described above.

Another option is to use the features/constraints system with a single
partition. E.g. you define for each node a "feature" which specifies which
blade enclosure it sits in, and then in the job submit script users have
something like

#SBATCH --constraint=[enc1|enc2|enc3|enc4]

which tells SLURM that all nodes for the job should be in a single enclosure.
Again, with a job submit plugin it ought to be possible to add that constraint
to all jobs automatically.

--
Janne Blomqvist, D.Sc. (Tech.), Scientific Computing Specialist
Aalto University School of Science, PHYS & BECS
+358503841576 || [email protected]
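
A minimal sketch of the features/constraints option described above, in case it
helps to see the pieces together. The node names (node001..node128), the
node-to-enclosure mapping and the report_nodes helper are assumptions made up
for illustration; only the feature names enc1..enc4 and the --constraint line
come from the message. The exact slurm.conf keyword for node features should
be checked against the man page of the installed version.

```shell
#!/bin/sh
# Sketch of a batch script for the single-partition + features approach.
#
# Assumption (hypothetical node names, not from the thread): slurm.conf tags
# every node with the enclosure it sits in, roughly like
#   NodeName=node[001-032] Feature=enc1
#   NodeName=node[033-064] Feature=enc2
#   NodeName=node[065-096] Feature=enc3
#   NodeName=node[097-128] Feature=enc4
# (other node parameters omitted here).
#
# Ask for 4 nodes that all carry the same one of the four features, i.e.
# all nodes of the job come from a single enclosure.
#SBATCH --nodes=4
#SBATCH --constraint=[enc1|enc2|enc3|enc4]

# Hypothetical helper: print the node list Slurm assigned to this job.
# Outside of a Slurm job the variable is unset, so print a placeholder.
report_nodes() {
    echo "Allocated nodes: ${SLURM_JOB_NODELIST:-none (not running under Slurm)}"
}

report_nodes
# The MPI launch itself (site-specific) would follow here.
```

Submitted with sbatch, the job should only start once four nodes from one
enclosure are free; the printed node list is a quick way to check that the
allocation really stayed inside a single enclosure.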
