I could have sworn that I just heard it was possible to create a floating reservation for any number of nodes and that you could also cause it to replace nodes if any went missing with the "replace" flag. Is that not all in the current release?
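
For reference, here is a minimal sketch of what such a floating reservation request might look like. It only builds and prints an `scontrol create reservation` command line; nothing is executed. The reservation name `special_float` and the user `projuser` are hypothetical placeholders, and whether `Flags=REPLACE` is available depends on the installed Slurm release, so check `man scontrol` on your system before relying on it.

```python
import shlex


def floating_reservation_cmd(name, user, node_count, flags=("REPLACE",)):
    """Build an `scontrol create reservation` argv list (nothing is executed).

    NodeCnt asks for a number of nodes rather than a fixed node list, and
    the REPLACE flag asks Slurm to swap in idle nodes for reserved nodes
    that go down or get allocated. The name and user are placeholders.
    """
    return [
        "scontrol", "create", "reservation",
        f"ReservationName={name}",
        f"Users={user}",
        "StartTime=now",
        "Duration=UNLIMITED",
        f"NodeCnt={node_count}",
        f"Flags={','.join(flags)}",
    ]


if __name__ == "__main__":
    # Dry run: print the command so it can be reviewed before running it.
    cmd = floating_reservation_cmd("special_float", "projuser", 5)
    print(shlex.join(cmd))
```

Jobs from the project would then be submitted with `--reservation=special_float` to land on the reserved nodes; how that interacts with overflow onto the rest of the cluster is the open question in the thread below.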
-- 
____ *Note: UMDNJ is now Rutgers-Biomedical and Health Sciences*
|| \\UTGERS      |---------------------*O*---------------------
||_// Biomedical | Ryan Novosielski - Senior Technologist
|| \\ and Health | [email protected] - 973/972.0922 (2x0922)
||  \\  Sciences | OIRT/High Perf & Res Comp - MSB C630, Newark
     `'

On Nov 21, 2015, at 11:30, Daniel Letai <[email protected]> wrote:

John,

That's correct - exclusive use means the project must always have at least 5 nodes available to it, at all times, even if it means those nodes will be idle some of the time. OTOH, if some of the other nodes are idle for whatever reason (no one else is using the cluster), let the project use any (up to all) available nodes.

The project is run automatically based on some data as it becomes available to a dispatching app. Optimally it should be preemptable on the other nodes but not on the exclusive ones, and must not preempt other jobs, but the entire preemption issue is of secondary importance.

A reservation is somewhat better than a hardcoded nodelist as in my first post, but its major drawback is that on reservation "renewal" there might not be enough (or any) nodes available and the project will not have enough nodes (since it can't preempt - unless somehow it can preempt, but only on those 5 nodes in the "new" reservation?).

--Dani_L.

On 11/19/2015 10:35 PM, John Desantis wrote:

Daniel,

Could you provide more information on the project's needs?

A QOS could be configured with a generous priority and limits so that the project cannot dominate the partition. Reservations could be used too, but you'd need to define at a minimum a start time and duration - and when not in use the hardware would be idle and unavailable to other users.

John DeSantis

2015-11-19 13:31 GMT-05:00 Daniel Letai <[email protected]>:

The other issue is how to define the "public" partition. It would also have to float, with lower priority, or else how would you achieve exclusivity of "special" on the 5-node float?

--Dani_L.

On 11/19/2015 06:10 PM, Paul Edmon wrote:

Yeah, I guess QoS won't really work for overflow. I was more thinking of the QoS as a way to create a floating partition of 5 nodes with the rest being in the public queue. They would send jobs to the QoS to hit that and then when it is full they would submit to public as normal.

That's at least my thinking, but it's less seamless to the users as they will have to consciously monitor what is going on.

-Paul Edmon-

On 11/19/2015 10:50 AM, Daniel Letai wrote:

Can you elaborate a little? I'm not sure what kind of QoS will help, nor how to implement one that will satisfy the requirements.

On 11/19/2015 04:52 PM, Paul Edmon wrote:

You might consider a QoS for this. It may not do everything you want but it will give you the flexibility.

-Paul Edmon-

On 11/19/2015 04:49 AM, Daniel Letai wrote:

Hi,

Suppose I have a 100 node cluster with ~5% nodes down at any given time (maintenance/hw failure/...).

One of the projects requires exclusive use of 5 nodes, and should be able to use the entire cluster when available (when other projects aren't running).

I can do this easily if I maintain a static list of the exclusive nodes in slurm.conf:

    PartitionName=public Nodes=tux0[01-95] Default=YES
    PartitionName=special Nodes=tux[001-100] Default=NO

And allowing only that project to use partition special.

However, due to the downtime of 5%, I'd like to maintain a dynamic exclusive 5 nodes. Any suggestions?

The project is serial and deployed as an array of single node jobs, so I can run it even when the other 95 nodes are full.

Thanks,
--Dani_L.
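
To make the static layout in the original post concrete, the sketch below expands the two `Nodes=` expressions and takes their difference, showing which nodes end up exclusive to `special`. The helper `expand_simple` is a hypothetical, deliberately minimal expander that handles only the single `prefix[lo-hi]` form used here; it is not Slurm's hostlist parser (on a real system, `scontrol show hostnames` does this expansion).

```python
import re


def expand_simple(expr):
    """Expand a single 'prefix[lo-hi]' expression into a list of host names.

    Hypothetical helper covering only the form used in the slurm.conf lines
    above; zero padding follows the width of the lower bound.
    """
    match = re.fullmatch(r"([^\[\]]*)\[(\d+)-(\d+)\]", expr)
    if match is None:
        raise ValueError(f"unsupported host expression: {expr!r}")
    prefix, lo, hi = match.groups()
    width = len(lo)
    return [f"{prefix}{i:0{width}d}" for i in range(int(lo), int(hi) + 1)]


public = set(expand_simple("tux0[01-95]"))     # tux001 .. tux095
special = set(expand_simple("tux[001-100]"))   # tux001 .. tux100

# Nodes reachable only through the "special" partition.
exclusive = sorted(special - public)
print(exclusive)
# → ['tux096', 'tux097', 'tux098', 'tux099', 'tux100']
```

The five exclusive nodes are fixed by name, which is exactly the limitation raised in the post: if any of `tux096` to `tux100` is down, the project has fewer than five guaranteed nodes.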
