Hi,
Suppose I have a 100-node cluster with ~5% of the nodes down at any
given time (maintenance/hw failure/...).
One of the projects requires exclusive use of 5 nodes, and must also be
able to use the entire cluster when it is available (when other projects
aren't running).
I can do this easily if I maintain a static list of the exclusive nodes
in slurm.conf:
PartitionName=public Nodes=tux0[01-95] Default=YES
PartitionName=special Nodes=tux[001-100] Default=NO
and allow only that project to use partition special.
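For example, assuming the project submits under a hypothetical account
named projx (the name is only for illustration), the restriction could
look something like this:

```
# projx is a hypothetical account name; tux[096-100] are in no other
# partition, so they stay reserved for that project.
PartitionName=public  Nodes=tux0[01-95]  Default=YES
PartitionName=special Nodes=tux[001-100] Default=NO AllowAccounts=projx
```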
However, because ~5% of the nodes are down at any time, I'd like the 5
exclusive nodes to be chosen dynamically rather than from a fixed list.
Any suggestions?
The project is serial and deployed as an array of single-node jobs, so I
can run it even when the other 95 nodes are full.
Thanks,
--Dani_L.