One possibility is to do this with reservations.

You could create a 5-node reservation that all of the users concerned
can access, then have a script run by cron that periodically checks the
state of the nodes in the reservation. If any go down, update the
reservation, replacing the down nodes with up nodes. If no up nodes are
free, determine the soonest a node will be free and add it to the
reservation using the IGNORE_JOBS flag.
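
A rough sketch of what such a cron script might look like follows. It is
only an illustration under several assumptions: the reservation already
exists and was created with the IGNORE_JOBS flag, the state lists in
BAD_STATES and FREE_STATES suit your site, and the helper names
(plan_reservation, sync_reservation) are made up for this example. It
also simplifies the last step: rather than working out which busy node
will be free soonest, it just takes the first busy node by name.

```python
import subprocess

# Assumption: sinfo state prefixes that mean a node cannot run jobs.
BAD_STATES = ("down", "drain", "fail", "maint", "unknown")
# Assumption: sinfo state prefixes that mean a node is up with no jobs.
FREE_STATES = ("idle",)


def is_bad(state):
    """True if a sinfo state string marks an unusable node.

    A trailing '*' means the node is not responding, so treat it as bad.
    """
    s = state.lower()
    return s.startswith(BAD_STATES) or s.endswith("*")


def plan_reservation(reserved, states, size=5):
    """Return the node list the reservation should have.

    reserved: node names currently in the reservation.
    states:   dict mapping every node name to its sinfo state string.
    Healthy reserved nodes are kept; gaps are filled with idle nodes
    first, then with busy-but-up nodes (these rely on IGNORE_JOBS).
    """
    keep = [n for n in reserved if n in states and not is_bad(states[n])]
    others = sorted(n for n in states
                    if n not in keep and not is_bad(states[n]))
    idle = [n for n in others if states[n].lower().startswith(FREE_STATES)]
    busy = [n for n in others if n not in idle]
    return (keep + idle + busy)[:size]


def run(*cmd):
    """Run a command and return its standard output as text."""
    return subprocess.run(cmd, check=True, capture_output=True,
                          text=True).stdout


def sync_reservation(name, size=5):
    """Replace unusable nodes in reservation `name` (hypothetical helper)."""
    # One "node state" pair per line; nodes in several partitions repeat.
    lines = run("sinfo", "-h", "-N", "-o", "%N %T").splitlines()
    states = dict(line.split()[:2] for line in lines if line.strip())

    # Pull the Nodes= field out of the reservation record and expand it.
    info = run("scontrol", "show", "reservation", name)
    hostlist = next(tok.split("=", 1)[1] for tok in info.split()
                    if tok.startswith("Nodes="))
    reserved = run("scontrol", "show", "hostnames", hostlist).split()

    wanted = plan_reservation(reserved, states, size)
    if wanted and sorted(wanted) != sorted(reserved):
        run("scontrol", "update", f"ReservationName={name}",
            "Nodes=" + ",".join(wanted))
```

A small wrapper run from cron would then call something like
sync_reservation("project_resv"), where the reservation name is again
just a placeholder. Keeping the selection logic in plan_reservation,
separate from the scontrol calls, makes it easy to try out against
made-up node states before letting it touch a live reservation.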

Phil Eckert
LLNL

On 11/19/15, 8:09 AM, "Paul Edmon" <[email protected]> wrote:

>
>Yeah, I guess QoS won't really work for overflow.  I was more thinking
>of the QoS as a way to create a floating partition of 5 nodes with the
>rest being in the public queue.  They would send jobs to the QoS to hit
>that and then when it is full they would submit to public as normal.
>That's at least my thinking, but it's less seamless to the users as they
>will have to consciously monitor what is going on.
>
>-Paul Edmon-
>
>On 11/19/2015 10:50 AM, Daniel Letai wrote:
>>
>> Can you elaborate a little? I'm not sure what kind of QoS will help,
>> nor how to implement one that will satisfy the requirements.
>>
>> On 11/19/2015 04:52 PM, Paul Edmon wrote:
>>>
>>> You might consider a QoS for this.  It may not do everything you want
>>> but it will give you the flexibility.
>>>
>>> -Paul Edmon-
>>>
>>> On 11/19/2015 04:49 AM, Daniel Letai wrote:
>>>>
>>>> Hi,
>>>>
>>>> Suppose I have a 100 node cluster with ~5% nodes down at any given
>>>> time (maintenance/hw failure/...).
>>>>
>>>> One of the projects requires exclusive use of 5 nodes, and must be
>>>> able to use the entire cluster when available (when other projects
>>>> aren't running).
>>>>
>>>> I can do this easily if I maintain a static list of the exclusive
>>>> nodes in slurm.conf:
>>>>
>>>> PartitionName=public Nodes=tux0[01-95] Default=YES
>>>> PartitionName=special Nodes=tux[001-100] Default=NO
>>>>
>>>> And allowing only that project to use partition special.
>>>>
>>>> However, because ~5% of the nodes are down at any given time, I'd
>>>> like the 5 exclusive nodes to be chosen dynamically.
>>>> Any suggestions?
>>>>
>>>> The project is serial and deployed as an array of single-node jobs,
>>>> so I can run it even when the other 95 nodes are full.
>>>>
>>>> Thanks,
>>>> --Dani_L.