I could have sworn that I just heard it was possible to create a floating 
reservation for any number of nodes and that you could also cause it to replace 
nodes if any went missing with the "replace" flag. Is that not all in the 
current release?
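
If it is, something along these lines is roughly what I have in mind. Treat it as an untested sketch: the reservation name "special_float" and the account "special_proj" are made up for illustration, and it assumes your release actually supports Flags=REPLACE. It only prints the command so it can be reviewed before anyone runs it.

```shell
#!/bin/sh
# Sketch: a floating 5-node reservation that Slurm refills from idle nodes.
# Assumptions (not from this thread): the reservation name "special_float"
# and the account "special_proj" are hypothetical placeholders, and
# Flags=REPLACE must be supported by the installed Slurm release.
build_resv_cmd() {
    # NodeCnt (rather than an explicit node list) lets Slurm choose the nodes.
    echo "scontrol create reservation ReservationName=special_float" \
         "Accounts=special_proj NodeCnt=5 StartTime=now Duration=infinite" \
         "Flags=REPLACE"
}

# Dry run: print the command instead of executing it.
build_resv_cmd
# To apply it on a real cluster, a Slurm administrator would run:
#   eval "$(build_resv_cmd)"
```

Jobs would have to request the reservation with --reservation=special_float; work meant to spill onto the rest of the cluster would be submitted without that option.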

--
____ *Note: UMDNJ is now Rutgers-Biomedical and Health Sciences*
|| \\UTGERS      |---------------------*O*---------------------
||_// Biomedical | Ryan Novosielski - Senior Technologist
|| \\ and Health | [email protected] - 973/972.0922 (2x0922)
||  \\  Sciences | OIRT/High Perf & Res Comp - MSB C630, Newark
    `'

On Nov 21, 2015, at 11:30, Daniel Letai <[email protected]> wrote:


John,

That's correct - exclusive use means the project must always have at
least 5 nodes available to it, at all times, even if it means those
nodes will be idle some of the time.

OTOH, if some of the other nodes are idle for whatever reason (no one
else is using the cluster), let the project use any (up to all)
available nodes.

The project is run automatically based on some data as it becomes
available to a dispatching app. Optimally it should be preemptable on
the other nodes but not on the exclusive ones, and must not preempt
other jobs, but the entire preemption issue is of secondary importance.

A reservation is somewhat better than a hardcoded node list as in my first
post, but its major drawback is that on reservation "renewal" there
might not be enough (or any) nodes available and the project will not
have enough nodes (since it can't preempt - unless somehow it can
preempt, but only on those 5 nodes in the "new" reservation?).

--Dani_L.

On 11/19/2015 10:35 PM, John Desantis wrote:
Daniel,

Could you provide more information on the project's needs?

A QOS could be configured with a generous priority and limits so that
the project cannot dominate the partition. Reservations could be used
too, but you'd need to define at a minimum a start time and duration -
and when not in use the hardware would be idle and unavailable to
other users.

John DeSantis


2015-11-19 13:31 GMT-05:00 Daniel Letai <[email protected]>:
The other issue is how to define the "public" partition. It would also have
to float, with lower priority, or else how would you achieve exclusivity of
"special" on the 5-node float?

--Dani_L.


On 11/19/2015 06:10 PM, Paul Edmon wrote:

Yeah, I guess QoS won't really work for overflow.  I was more thinking of
the QoS as a way to create a floating partition of 5 nodes with the rest
being in the public queue.  They would send jobs to the QoS to hit that and
then when it is full they would submit to public as normal.  That's at least
my thinking, but it's less seamless to the users as they will have to
consciously monitor what is going on.

-Paul Edmon-

On 11/19/2015 10:50 AM, Daniel Letai wrote:

Can you elaborate a little? I'm not sure what kind of QoS will help, nor
how to implement one that will satisfy the requirements.

On 11/19/2015 04:52 PM, Paul Edmon wrote:

You might consider a QoS for this.  It may not do everything you want
but it will give you the flexibility.

-Paul Edmon-

On 11/19/2015 04:49 AM, Daniel Letai wrote:

Hi,

Suppose I have a 100 node cluster with ~5% of nodes down at any given time
(maintenance/hw failure/...).

One of the projects requires exclusive use of 5 nodes, and should be able to
use the entire cluster when available (when other projects aren't running).

I can do this easily if I maintain a static list of the exclusive nodes
in slurm.conf:

PartitionName=public Nodes=tux0[01-95] Default=YES
PartitionName=special Nodes=tux[001-100] Default=NO

and by allowing only that project to use partition special.

However, due to the downtime of 5%, I'd like to maintain a dynamic
exclusive 5 nodes.
Any suggestions?

The project is serial and deployed as array of single node jobs, so I
can run it even when the other 95 nodes are full.

Thanks,
--Dani_L.
