[slurm-dev] Re: Regarding partition design, topology

Janne Blomqvist Fri, 02 Aug 2013 01:34:34 -0700

On 2013-08-02T08:46:08 EEST, Jonathan Mills wrote:
> Hi all,
>
> I'm working on a small cluster, 2048 cores in 128 nodes -- 32 nodes per 
> bladecenter -- 4 bladecenters in one cabinet.  I'm planning to use SLURM as 
> my RM as well as Scheduler, but the first problem I have to tackle is the 
> partition design.
>
> You see, the main problem I face is that my Infiniband mesh is not 
> contiguous.  It doesn't span outside of a bladecenter.  So it's really like I 
> have four 512-way clusters.
>
> The first and most obvious solution is to make four SLURM partitions, and 
> users can tell their jobs to go to whichever partition they want.  Of 
> course, the downside of that is a user-education problem.  In all likelihood 
> they would always just submit to the same partition, which would always stay 
> busy while the others sat idle.  Not awesome.
>
> And then I thought it would be ideal if, as with Torque routing queues, I 
> could create a special partition called "batch" which merely routed jobs to 
> other partitions in a round-robin or other fashion.  However, if this is 
> possible with SLURM, I have missed it in the docs.  I've only seen examples 
> where Partitions are defined in terms of nodes, not other partitions.  If I'm 
> wrong on that, please someone correct me.


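I think it should be possible with a so-called "job submit plugin" 
(see the JobSubmitPlugins parameter in slurm.conf), which can inspect 
and modify a job, including its partition, at submission time. I 
haven't used those myself, but maybe those terms help you find the 
relevant stuff in the docs.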

> And then I had another idea.  What if I could use the topology/tree plugin to 
> tell SLURM that I have, let's say, four switches (each representing the 
> Infiniband switch in each blade chassis) -- and then associate each switch in 
> topology.conf with only the nodes within its chassis?  If that worked, then I 
> could just make one big partition for all the nodes, and let the scheduler 
> figure out that if it spreads an MPI job across nodes that don't touch the 
> same switch, then it will fail.  That is, if it works like that.

There are some options that one can set to specify how long a job will 
wait for an "optimal" topology; setting that time to infinity (or to 
something very long) would then accomplish what you want. I'm not sure 
you can make that the default for every job though, except with a job 
submit plugin as described above.
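
As a rough, untested sketch of what that could look like: the 
fragment below assumes hypothetical node names node[001-128] and 
switch names ibsw1..ibsw4 (neither comes from your mail), and the 
wait times are only illustrative. The options I had in mind are the 
--switches=<count>[@<max-time>] option of sbatch and the 
max_switch_wait setting in SchedulerParameters, which caps how long a 
job may wait for the requested switch count. I have not verified how 
the scheduler treats leaf switches that share no common parent, so 
please test with a small job first.

```
# slurm.conf (excerpt)
TopologyPlugin=topology/tree
# Upper bound, in seconds, on how long a job may wait for its
# requested switch count (illustrative value: 7 days)
SchedulerParameters=max_switch_wait=604800

# topology.conf: one leaf switch per blade chassis
# (switch and node names are hypothetical)
SwitchName=ibsw1 Nodes=node[001-032]
SwitchName=ibsw2 Nodes=node[033-064]
SwitchName=ibsw3 Nodes=node[065-096]
SwitchName=ibsw4 Nodes=node[097-128]

# In the job script: ask for all nodes under a single leaf switch
# and wait up to 24 hours for that placement (illustrative value)
#SBATCH --switches=1@24:00:00
```

Note that once the per-job wait time expires the job may be placed 
on more switches than requested, which is why the wait would have to 
be effectively unlimited for your case.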

Another option is to use the features/constraints system with a single 
partition. E.g. you define for each node a "feature" which specifies in 
which blade enclosure it sits, and then in the job submit script users 
have something like

#SBATCH --constraint=[enc1|enc2|enc3|enc4]

which tells slurm that all nodes for the job should be in a single 
enclosure. Again, with a job submit plugin it ought to be possible to 
add that constraint to all jobs automatically.
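
For the node side, a minimal sketch of the matching slurm.conf 
entries might look like the following. The node names and the 
partition name "batch" are assumptions for illustration; CPUs=16 
follows from 2048 cores spread over 128 nodes.

```
# slurm.conf (excerpt); node and partition names are hypothetical
NodeName=node[001-032] CPUs=16 Feature=enc1
NodeName=node[033-064] CPUs=16 Feature=enc2
NodeName=node[065-096] CPUs=16 Feature=enc3
NodeName=node[097-128] CPUs=16 Feature=enc4

# One partition spanning all four enclosures
PartitionName=batch Nodes=node[001-128] Default=YES State=UP
```

The square brackets in the constraint mean that every node of the 
job must carry the same one of the listed features. When the option 
is given on the command line instead of in the script, it should be 
quoted so the shell does not interpret the brackets and pipes, e.g. 
sbatch --constraint="[enc1|enc2|enc3|enc4]" job.sh (job.sh being a 
placeholder name).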


--
Janne Blomqvist, D.Sc. (Tech.), Scientific Computing Specialist
Aalto University School of Science, PHYS & BECS
+358503841576 || [email protected]
