Yeah, I guess a QoS won't really work for overflow. I was thinking more
of the QoS as a way to create a floating partition of 5 nodes, with the
rest being in the public queue. Users would submit jobs under the QoS
until its 5 nodes are full, and then submit to public as normal.
That's at least my thinking, but it's less seamless for the users, as
they will have to consciously monitor what is going on.
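Roughly, as an untested sketch of one way to read this: it assumes Slurm 15.08 or later (for partition QOS and GrpTRES) and accounting with limit enforcement turned on. The QoS name special5 and the account name projx are hypothetical.

```
# Create the QoS once in the accounting database (run as a Slurm admin):
#   sacctmgr add qos special5
#   sacctmgr modify qos special5 set GrpTRES=node=5
#
# slurm.conf: QoS limits are only enforced if accounting enforcement is on.
AccountingStorageEnforce=limits,qos
# Both partitions span the whole cluster. "special" floats over any 5 nodes
# because its partition QOS caps it at 5 nodes in use at once.
PartitionName=public  Nodes=tux[001-100] Default=YES
PartitionName=special Nodes=tux[001-100] Default=NO QOS=special5 AllowAccounts=projx
```

This caps the project at 5 nodes through special but does not reserve them, so public jobs can still fill the whole cluster. Users would submit with -p special first and fall back to -p public once those 5 nodes are busy.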
-Paul Edmon-
On 11/19/2015 10:50 AM, Daniel Letai wrote:
Can you elaborate a little? I'm not sure what kind of QoS will help,
nor how to implement one that will satisfy the requirements.
On 11/19/2015 04:52 PM, Paul Edmon wrote:
You might consider a QoS for this. It may not do everything you want
but it will give you the flexibility.
-Paul Edmon-
On 11/19/2015 04:49 AM, Daniel Letai wrote:
Hi,
Suppose I have a 100 node cluster with ~5% of nodes down at any given
time (maintenance/hw failure/...).
One of the projects requires exclusive use of 5 nodes, and should be
able to use the entire cluster when available (when other projects
aren't running).
I can do this easily if I maintain a static list of the exclusive
nodes in slurm.conf:
PartitionName=public Nodes=tux0[01-95] Default=YES
PartitionName=special Nodes=tux[001-100] Default=NO
and allow only that project to use the special partition.
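For reference, a sketch of the same static layout with the access restriction spelled out. AllowAccounts is a standard slurm.conf partition option; the account name projx is hypothetical, and this assumes the project's jobs are charged to that account.

```
# slurm.conf (sketch; "projx" is a hypothetical account name)
# Public partition: the 95 shared nodes, default for everyone.
PartitionName=public  Nodes=tux0[01-95]  Default=YES
# Special partition: all 100 nodes, usable only by jobs charged to projx.
PartitionName=special Nodes=tux[001-100] Default=NO AllowAccounts=projx
```

With this layout tux096 through tux100 are reachable only via special, which is what makes them exclusive, and also what ties the exclusive set to a fixed list of node names.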
However, due to the downtime of 5%, I'd like to maintain a dynamic
set of 5 exclusive nodes rather than a fixed list.
Any suggestions?
The project is serial and deployed as an array of single-node jobs, so
I can run it even when the other 95 nodes are full.
Thanks,
--Dani_L.