Yeah, I guess a QoS won't really work for overflow. I was thinking of the QoS more as a way to create a floating partition of 5 nodes, with the rest in the public queue. Users would submit jobs to the QoS first and, once it is full, submit to public as normal. That's at least my thinking, but it's less seamless for the users, since they would have to consciously monitor what is going on.
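
A rough sketch of the floating-partition idea, for illustration only: the QoS name float5 and the account name projx are made up, and it assumes Slurm 15.08 or later (for GrpTRES) with accounting enabled. Note that a GrpTRES limit only caps the QoS at 5 nodes; by itself it does not hold 5 nodes back from other jobs.

```
# slurm.conf fragment (sketch; "float5" is a hypothetical QoS name)
AccountingStorageEnforce=limits,qos      # QoS limits are only enforced with this set
PartitionName=public Nodes=tux[001-100] Default=YES

# QoS created once with sacctmgr, capped at 5 nodes in total:
#   sacctmgr add qos float5
#   sacctmgr modify qos float5 set GrpTRES=node=5
# Give the project's account (hypothetical "projx") access to it:
#   sacctmgr modify account projx set qos+=float5
# Jobs then target the QoS first, and go to public without it once it is full:
#   sbatch --qos=float5 job.sh
```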

-Paul Edmon-

On 11/19/2015 10:50 AM, Daniel Letai wrote:

Can you elaborate a little? I'm not sure what kind of QoS will help, nor how to implement one that will satisfy the requirements.

On 11/19/2015 04:52 PM, Paul Edmon wrote:

You might consider a QoS for this. It may not do everything you want but it will give you the flexibility.

-Paul Edmon-

On 11/19/2015 04:49 AM, Daniel Letai wrote:

Hi,

Suppose I have a 100-node cluster with ~5% of nodes down at any given time (maintenance, hardware failure, ...).

One of the projects requires exclusive use of 5 nodes, and should be able to use the entire cluster when available (when other projects aren't running).

I can do this easily if I maintain a static list of the exclusive nodes in slurm.conf:

PartitionName=public Nodes=tux0[01-95] Default=YES
PartitionName=special Nodes=tux[001-100] Default=NO

And allowing only that project to use the special partition.
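
One way to express that restriction, as a sketch: AllowAccounts on the partition line, assuming the project has its own accounting account (the name projx is hypothetical). AllowGroups would be the equivalent for a Unix group.

```
PartitionName=special Nodes=tux[001-100] Default=NO AllowAccounts=projx
```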

However, because ~5% of the nodes are down at any given time, I'd like the 5 exclusive nodes to be chosen dynamically rather than from a static list.
Any suggestions?

The project is serial and deployed as an array of single-node jobs, so I can run it even when the other 95 nodes are full.

Thanks,
--Dani_L.
