You can configure a single queue and use the topology/tree plugin to identify the nodes on separate fabrics.
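As a rough sketch of what that could look like (the node names, switch names, node counts, and partition name below are made up for illustration, not taken from your site): each fabric is declared as its own leaf switch in topology.conf, with no parent switch joining them, and one default partition lists all nodes. According to the Slurm topology guide, nodes on leaf switches that lack a common parent can still be used, but no job will span those switches, which should keep each job inside one fabric. Please verify that behavior against the Slurm version you run.

```
# slurm.conf (excerpt) -- hypothetical node and partition names
TopologyPlugin=topology/tree
PartitionName=global Nodes=ib1-[001-032],ib2-[001-032] Default=YES State=UP

# topology.conf -- one leaf switch per InfiniBand fabric.
# No "Switches=" line joins ib1 and ib2, so the two trees stay disconnected.
SwitchName=ib1 Nodes=ib1-[001-032]
SwitchName=ib2 Nodes=ib2-[001-032]
```

Users would then submit to the single default partition, and the scheduler would place each job on nodes under one switch.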
Quoting John Desantis <[email protected]>:
Hello all,

Unfortunately, after reading the man pages and various documentation, I am still confused about how to provide a single global partition for our users across several separate host groups. By host groups, I mean separate sets of hardware which use different InfiniBand fabrics, are located in different data centers, have different CPU architectures, etc.

During initial testing, I was able to use a default partition with all of the nodes allocated via the "Nodes=" value. All was well until a later set of nodes was added which had a separate InfiniBand fabric. Testing showed that applications were attempting to span nodes in separate fabrics, which failed miserably, and as a result we're using separate partitions, which most users don't mind.

Now that we're getting more users converted to Slurm, we're realizing that some users don't know how to check for free partitions and available hardware (boo!) and have grown used to our previous scheduler configuration of 1 global queue.

I'm looking into how to emulate this, and I'm not quite clear whether it can be done using multiple partition definitions with a DEFAULT clause. I've looked at the topology/tree plugin as well, and since it lets you specify either switches or nodes, I'm wondering whether it would be the preferred method to achieve 1 "global" partition which uses all of the separate hardware pools and respects the separate host groups.

Thank you,
John DeSantis
--
Morris "Moe" Jette
CTO, SchedMD LLC
Commercial Slurm Development and Support
