After trying a few job managers and Slurm configurations, we've settled on a single partition with a handful of QOS defined. We use the QOS for time limits and for giving specific classes of jobs higher priority. Two of the QOS get most of the jobs, and we use FairShare to keep everyone happy. We have everything in a single group right now, but from what I can see it should scale out nicely to multiple groups too.
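
As a rough sketch of what that looks like (the partition name, node range, QOS names, limits, and priority weights below are illustrative, not our actual values):

  # slurm.conf: one partition, multifactor priority with fairshare
  PartitionName=main Nodes=node[001-100] Default=YES MaxTime=INFINITE State=UP
  PriorityType=priority/multifactor
  PriorityWeightFairshare=100000
  PriorityWeightQOS=10000
  AccountingStorageEnforce=limits,qos

  # sacctmgr: a handful of QOS carrying the time limits and priorities
  sacctmgr add qos short
  sacctmgr modify qos short set MaxWall=04:00:00 Priority=100
  sacctmgr add qos long
  sacctmgr modify qos long set MaxWall=168:00:00 Priority=10 MaxJobsPerUser=50

  # fairshare is set on the associations, e.g. per account
  sacctmgr modify account physics set FairShare=100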
On the flip side, though, I have no experience at all with PBS...

On Thu, Jan 9, 2014 at 9:34 AM, Bill Wichser <[email protected]> wrote:
>
> After years and years of PBS use, it is time to modernize. Speaking with a
> few of the developers at SC13, we have started the switch on two new
> clusters soon to be deployed and will not install Torque/Moab on these but
> will attempt Slurm instead.
>
> Naturally, things are quite different. I've managed to implement the
> job_submit.lua script to emulate a routing queue similar to PBS, keyed on
> job request time.
>
> But instead of trying to simply convert what I have with my current setup,
> maybe there is a better way. For instance, a single partition and a QOS
> defined for time_limit lengths instead. Obviously there are many ways to
> skin the cat, the same as there are for other resource managers/schedulers.
>
> What I am hoping to find is just some solid advice from you folks who are
> running Slurm. I need some fairshare stuff for groups, and limits for users
> and total jobs of length T, and that's about it for now. And while this
> could all be implemented in a variety of ways, is there something I should
> be aware of in the overall layout to make this easier down the road, or
> should I just continue to port things from the years of doing PBS?
>
> Sincerely,
> Bill Wichser
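
P.S. In case it helps, routing on requested time can be just a few lines in job_submit.lua. A minimal sketch (the QOS names and the 4-hour cutoff are examples; per the job_submit/lua plugin API, time_limit is in minutes and slurm.NO_VAL means no limit was requested):

  -- job_submit.lua sketch: pick a QOS from the requested walltime
  function slurm_job_submit(job_desc, part_list, submit_uid)
     if job_desc.time_limit == slurm.NO_VAL then
        job_desc.qos = "short"        -- default when no limit requested
     elseif job_desc.time_limit <= 240 then
        job_desc.qos = "short"        -- up to 4 hours
     else
        job_desc.qos = "long"
     end
     return slurm.SUCCESS
  end

  function slurm_job_modify(job_desc, job_rec, part_list, modify_uid)
     return slurm.SUCCESS
  end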
