Hi Ryan,

Your blog post captures much of my reasoning. I'm not sure how things will work out with our much longer maximum walltime, though that only applies to one partition. The politics and the user education will be the hardest parts, I think.

Thanks for the pointer to sshare; it will be really helpful!

Corey

On 07/24/2017 05:09 PM, Ryan Cox wrote:

Corey,

We almost exclusively use GrpCPURunMins as well as 3- or 7-day walltime
limits depending on the partition.  For my (somewhat rambling) thoughts
on the matter, see
http://tech.ryancox.net/2014/04/scheduler-limit-remaining-cputime-per.html.
It generally works pretty well.
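
In case it's useful, the mechanics are just a QOS carrying the limit
plus a per-partition walltime cap.  Untested as written, and the
numbers are placeholders, but it looks roughly like:

    # QOS limit on running core-minutes; GrpTRESRunMins=cpu=N is the
    # TRES spelling of the older GrpCPURunMins (value is a placeholder)
    sacctmgr modify qos normal set GrpTRESRunMins=cpu=5000000

    # slurm.conf: the walltime cap, set per partition
    PartitionName=batch Nodes=node[01-40] MaxTime=3-00:00:00 State=UP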

We also have https://marylou.byu.edu/simulation/grpcpurunmins.php to
simulate various settings, though it needs some improvement, such as a
realistic maximum walltime.

sshare -l (see the TRESRunMins column) should have the live stats
you're looking for.
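
For example (account name is made up; TRESRunMins shows the cpu-minutes
that running jobs still have outstanding, which is what counts against
the Grp limit):

    sshare -l -A some_acct -o Account,User,TRESRunMins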

Ryan

On 07/24/2017 02:39 PM, Corey Keasling wrote:

Hi Slurm-Dev,

I'm currently designing and testing what will ultimately be a small
Slurm cluster of about 60 heterogeneous nodes (five different
generations of hardware).  Our user base is also diverse, needing both
fast turnover of small, sequential jobs and long-duration parallel
codes (e.g., 16 cores for several months).

In the past we limited users by how many cores they could allocate at
any one time.  This has the drawback that no distinction is made
between, say, 128 cores for 2 hours and 128 cores for 2 months.  We
want users to be able to run on a large portion of the cluster when it
is available while ensuring that they cannot take advantage of an idle
period to start jobs which will monopolize it for weeks.
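
To put numbers on it: 128 cores for 2 hours is 128 * 120 = 15,360
core-minutes, while 128 cores for 2 months (~60 days) is 128 * 86,400
= 11,059,200 core-minutes - over 700 times the footprint for the same
instantaneous core count.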

Limiting by GrpCPURunMins seems like a good answer.  I think of it as
allocating computational area (i.e., cores*minutes) and not just width
(cores).  I'd love to know if anyone has any experience or thoughts on
imposing limits in this way.  Also, is anyone aware of a simple way to
calculate remaining "area"?  I can use squeue or sacct to ultimately
derive how much of a limit is in use by looking at remaining walltime
and core count, but if there's something built in - or pre-existing -
it would be nice to know.
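
For now I've been sketching it on top of squeue; rough and untested,
but something along these lines (%C is allocated CPUs, %L is time
left; jobs without a time limit would need special-casing):

    # sum cores * remaining minutes over my running jobs
    squeue -h -t RUNNING -u $USER -o '%C %L' | awk '{
        cores = $1; t = $2; days = 0
        if (t ~ /-/) { split(t, dt, "-"); days = dt[1]; t = dt[2] }
        n = split(t, p, ":")
        if (n == 3)      mins = p[1]*60 + p[2] + p[3]/60
        else if (n == 2) mins = p[1] + p[2]/60
        else             mins = p[1]/60
        total += cores * (days*1440 + mins)
      } END { printf "core-minutes in flight: %.0f\n", total }'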

It's worth noting that the cluster is divided into several partitions,
with most nodes belonging to more than one.  This is partially
political (to give groups increased priority on nodes they helped pay
for) and partially practical (to ensure users explicitly request the
slow nodes rather than having jobs dumped onto ancient Opterons).
Also, each user
gets their own Account, so the QoS Grp limits apply to each human
separately.  Accounts would also have absolute core limits.
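
Roughly what I have in mind, with placeholder names and values:

    # slurm.conf: overlapping partitions; the owner partition gets a
    # higher scheduling tier on the nodes that group helped buy
    PartitionName=general Nodes=node[01-60] Default=YES MaxTime=7-00:00:00
    PartitionName=grpA    Nodes=node[41-48] AllowGroups=grpa PriorityTier=10

    # one account per user, so the QOS Grp* limits bind per person
    sacctmgr add account alice
    sacctmgr add user alice account=alice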

Thank you for your thoughts!

Corey

