Re: [slurm-users] Areas for improvement on our site's cluster scheduling

2018-05-08 Thread Paul Edmon
We've been using a backfill priority partition for people doing HTC work.  We have requeue set so that jobs from the high-priority partitions can take over. You can do this for your interactive nodes as well if you want. We dedicate hardware to interactive work and use partition-based QOSs
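A minimal slurm.conf sketch of this kind of setup, assuming partition-priority preemption with requeue (the partition names, node list, and tier values are illustrative, not from the post):

```ini
# Hypothetical sketch: a low-priority backfill partition whose jobs are
# requeued when the high-priority partition needs the nodes.
PreemptType=preempt/partition_prio
PreemptMode=REQUEUE

# Lower PriorityTier = preemptable by the higher tier on the same nodes.
PartitionName=serial_requeue Nodes=node[01-64] PriorityTier=1  Default=NO  State=UP
PartitionName=general        Nodes=node[01-64] PriorityTier=10 Default=YES State=UP
```

Jobs submitted to serial_requeue would need to be requeue-safe (idempotent or checkpointing), since they can be killed and restarted at any time.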

Re: [slurm-users] Areas for improvement on our site's cluster scheduling

2018-05-08 Thread Renfro, Michael
That’s the first limit I placed on our cluster, and it has generally worked out well (never used a job limit). A single account can get 1000 CPU-days in whatever distribution they want. I’ve just added a root-only ‘expedited’ QOS for times when the cluster is mostly idle, but a few users have
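One plausible way to express a "1000 CPU-days in whatever distribution they want" cap is a GrpTRESRunMins limit on the running workload, set via sacctmgr. This is a hedged sketch, not the poster's actual configuration; the QOS names and the user name are hypothetical:

```shell
# 1000 CPU-days = 1000 * 1440 = 1,440,000 CPU-minutes of running work
# per account, in any mix of job sizes and walltimes.
sacctmgr modify qos normal set GrpTRESRunMins=cpu=1440000

# A root-granted "expedited" QOS could simply carry a higher priority
# and no group cap, handed out per-user when the cluster is idle.
sacctmgr add qos expedited Priority=10000
sacctmgr modify user someuser set qos+=expedited
```

GrpTRESRunMins differs from a plain CPU-count cap in that it accounts for remaining walltime, so many short jobs and a few long ones draw down the same budget.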

Re: [slurm-users] Areas for improvement on our site's cluster scheduling

2018-05-08 Thread Ole Holm Nielsen
On 05/08/2018 09:49 AM, John Hearns wrote: Actually what IS bad is users not putting cluster resources to good use. You can often see jobs which are 'stalled' - i.e. the nodes are reserved for the job, but the internal logic of the job has failed and the executables have not launched. Or maybe

Re: [slurm-users] Areas for improvement on our site's cluster scheduling

2018-05-08 Thread John Hearns
"Otherwise a user can have a sing le job that takes the entire cluster, and insidesplit it up the way he wants to." Yair, I agree. That is what I was referring to regardign interactive jobs. Perhaps not a user reserving the entire cluster, but a use reserving a lot of compute nodes and not making

Re: [slurm-users] Areas for improvement on our site's cluster scheduling

2018-05-08 Thread Yair Yarom
Hi, This is what we did, not sure those are the best solutions :) ## Queue stuffing We have set PriorityWeightAge several magnitudes lower than PriorityWeightFairshare, and we also have PriorityMaxAge set to cap the age factor of older jobs. As I see it, the fairshare is far more important than age. Besides
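A slurm.conf sketch of the weighting described above; the specific values are illustrative, only the relative magnitudes ("age several magnitudes lower than fairshare") come from the post:

```ini
# Multifactor priority: fairshare dominates, age contributes little,
# and the age factor stops growing after 7 days in the queue.
PriorityType=priority/multifactor
PriorityWeightFairshare=100000
PriorityWeightAge=100
PriorityMaxAge=7-0
```

With this ratio, no amount of queue waiting lets a heavy user's stuffed jobs outrank a light user's fresh submission.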

Re: [slurm-users] Areas for improvement on our site's cluster scheduling

2018-05-08 Thread Ole Holm Nielsen
On 05/08/2018 08:44 AM, Bjørn-Helge Mevik wrote: Jonathon A Anderson writes: ## Queue stuffing There is the bf_max_job_user SchedulerParameter, which is sort of the "poor man's MAXIJOB"; it limits the number of jobs from each user the backfiller will try to

Re: [slurm-users] Areas for improvement on our site's cluster scheduling

2018-05-08 Thread Bjørn-Helge Mevik
Jonathon A Anderson writes: > ## Queue stuffing There is the bf_max_job_user SchedulerParameter, which is sort of the "poor man's MAXIJOB"; it limits the number of jobs from each user the backfiller will try to start on each run. It doesn't do exactly what you
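The bf_max_job_user parameter mentioned above is set in slurm.conf under SchedulerParameters; a minimal sketch (the value 10 is illustrative):

```ini
# Consider at most 10 jobs per user on each backfill pass, so one user's
# long queue of submissions cannot monopolize the backfill scheduler.
SchedulerType=sched/backfill
SchedulerParameters=bf_max_job_user=10
```

As the post notes, this throttles the backfiller per pass rather than enforcing a hard cap on a user's total job count, so it is only an approximation of a MAXIJOB-style limit.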

Re: [slurm-users] Areas for improvement on our site's cluster scheduling

2018-05-07 Thread Ryan Novosielski
One of these TRES-related ones in a QOS ought to do it: https://slurm.schedmd.com/resource_limits.html Your problem there, though, is you will eventually have stuff waiting to run even when the system is idle. We had the same circumstance and the same eventual outcome.
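A hedged sketch of the kind of per-user TRES limit the linked page describes, applied to a QOS with sacctmgr (the QOS name and the value are hypothetical):

```shell
# Cap any single user at 256 CPUs' worth of running jobs in this QOS;
# further jobs queue behind the limit even if nodes are free.
sacctmgr modify qos normal set MaxTRESPerUser=cpu=256
```

That last behavior is exactly the drawback raised in the post: hard per-user caps leave jobs pending while the cluster sits idle, which is why some sites prefer fairshare weighting or a run-minutes budget instead.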