After trying a few job managers and Slurm configurations, we've settled
on a single partition with a handful of QOSes defined. We use the QOSes
for time limits and for giving specific classes of jobs higher priority.
Two QOSes get most of the jobs, and we use FairShare to keep everyone
happy. We have everything in a single group right now, but from what I
can see it should scale out nicely to multiple groups too.
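
In case it's useful, here is roughly what that layout looks like. This
is only a sketch; the QOS names, limits, and priority weights below are
illustrative, not our production values:

    # slurm.conf -- one partition, multifactor priority with fairshare
    PartitionName=main Nodes=ALL Default=YES MaxTime=7-00:00:00 State=UP
    PriorityType=priority/multifactor
    PriorityWeightFairshare=100000
    PriorityWeightQOS=10000
    AccountingStorageEnforce=limits,qos

    # define the QOSes once via sacctmgr
    sacctmgr add qos short Priority=100 MaxWall=04:00:00
    sacctmgr add qos long Priority=10 MaxWall=7-00:00:00 GrpJobs=500
    sacctmgr add qos high Priority=1000

    # fairshare hangs off accounts and users
    sacctmgr add account physics FairShare=100
    sacctmgr add user alice Account=physics FairShare=10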

On the flip side, though, I have no experience at all with PBS...
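
Since you mention job_submit.lua below: for what it's worth, a minimal
sketch of routing on requested time might look like the following. The
QOS names and the 4-hour cutoff are made up; match them to whatever you
define in sacctmgr:

    -- job_submit.lua: pick a QOS from the requested walltime
    function slurm_job_submit(job_desc, part_list, submit_uid)
        -- leave an explicitly requested QOS alone
        if job_desc.qos == nil then
            -- time_limit is in minutes; NO_VAL means none was requested
            if job_desc.time_limit ~= nil and
               job_desc.time_limit ~= slurm.NO_VAL and
               job_desc.time_limit <= 240 then
                job_desc.qos = "short"
            else
                job_desc.qos = "long"
            end
        end
        return slurm.SUCCESS
    end

    function slurm_job_modify(job_desc, job_rec, part_list, modify_uid)
        return slurm.SUCCESS
    end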


On Thu, Jan 9, 2014 at 9:34 AM, Bill Wichser <[email protected]> wrote:
>
> After years and years of PBS use, it is time to modernize.  After speaking
> with a few of the developers at SC13, we have started the switch: two new
> clusters soon to be deployed will not get Torque/Moab; we will attempt
> Slurm on them instead.
>
> Naturally, things are quite different.  I've managed to implement a
> job_submit.lua script to emulate a routing queue similar to PBS, keyed on
> the job's requested time.
>
> But instead of trying to simply convert my current setup, maybe there is a
> better way.  For instance, a single partition with QOSes defined for
> time_limit lengths instead.  Obviously there are many ways to skin the cat,
> the same as there are for other resource managers/schedulers.
>
> What I am hoping to find is just some solid advice from you folks who are
> running Slurm.  I need some fairshare for groups, per-user limits, and
> limits on the total number of jobs of length T, and that's about it for
> now.  And while this could all be implemented in a variety of ways, is
> there something I should be aware of in the overall layout to make this
> easier down the road, or should I just continue to port things from my
> years of doing PBS?
>
>
> Sincerely,
> Bill Wichser
