We use GrpTresRunMins for this, with the idea that it's OK for users to occupy lots of resources with short-running jobs, but not so much with long-running jobs.
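For example (just a sketch; the account name "projA" and the number are made up, size the limit to your cluster and the TRES you care about), something like

    sacctmgr modify account name=projA set GrpTRESRunMins=cpu=1000000

caps the sum, over the account's running jobs, of allocated CPUs times the minutes left on their time limits. Lots of short jobs barely dent that budget, while long-running jobs consume it quickly and further jobs stay pending until some of it frees up.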
On Wed, Feb 7, 2018 at 8:41 AM, Bill Barth <bba...@tacc.utexas.edu> wrote:
> Of course, Matteo. Happy to help. Our job completion script is:
>
> #!/bin/bash
>
> OUTFILE=/var/log/slurm/tacc_jobs_completed
>
> echo "$JOBID:$UID:$ACCOUNT:$BATCH:$START:$END:$SUBMIT:$PARTITION:$LIMIT:$JOBNAME:$JOBSTATE:$NODECNT:$PROCS" >> $OUTFILE
>
> exit 0
>
> and our config settings (from scontrol show config) are:
>
> JobCompLoc  = /etc/slurm/tacc_job_completion.sh
> JobCompType = jobcomp/script
>
> Feel free to steal as much of that as you like, just update the lines and
> names to remove the "tacc" parts. This script needs to be present on the
> machine that slurmctld is running on. Our internal accounting system is
> RESTful, so we're thinking of using this mechanism to write accounting
> records to it directly in this plugin script with curl/wget calls, rather
> than appending to our flat file and shipping that info to our database via
> a nightly cron script. That would give us the ability to do live updates of
> balances (which the Slurm DB already supports) to prevent overdrawn
> accounts. This is convoluted, but we have had to reinvent the wheel a
> little since we need to report usage to both our local accounting database
> and a national one. Yes, there were probably other ways to do this, but the
> infrastructure is now historical and set in some stone.
>
> Best,
> Bill.
>
> --
> Bill Barth, Ph.D., Director, HPC
> bba...@tacc.utexas.edu | Phone: (512) 232-7069
> Office: ROC 1.435 | Fax: (512) 475-9445
>
>
> On 2/7/18, 12:28 AM, "slurm-users on behalf of Matteo F"
> <slurm-users-boun...@lists.schedmd.com on behalf of mfasco...@gmail.com> wrote:
>
> Thanks Bill, I really appreciate the time you spent giving this detailed
> answer. I will have a look at the plugin system, as the integration with
> our accounting system would be a nice feature.
>
> @Chris thanks, I've had a look at GrpTRES but I'll probably go with the
> SPANK route.
>
> Best,
> Matteo
>
> On 6 February 2018 at 13:58, Bill Barth <bba...@tacc.utexas.edu> wrote:
>
> Chris probably gives the Slurm-iest way to do this, but we use a SPANK
> plugin that counts the jobs that a user has in queue (running and waiting)
> and sets a hard cap on how many they can have. This should probably be
> scaled to the size of the system and the partition they are submitting to,
> but on Stampede 2 (4200 KNL nodes and 1736 SKX nodes) we set this, across
> all queues, to about 50, which has been our magic number across numerous
> schedulers over the years on systems ranging from hundreds of nodes to
> Stampede 1 with 6400. Some users get more by request, and most don't even
> bump up against the limits. We've started to look at using TRES on our
> test system, but we haven't gotten there yet. Our use of the DB is
> minimal, and our process to get every user into it when their TACC account
> is created is not 100% automated yet (we use the job completion plugin to
> create a flat file with job records, which our local accounting system
> consumes to decrement allocation balances, if you care to know).
>
> Best,
> Bill.
>
> --
> Bill Barth, Ph.D., Director, HPC
> bba...@tacc.utexas.edu | Phone: (512) 232-7069
> Office: ROC 1.435 | Fax: (512) 475-9445
>
>
> On 2/6/18, 6:03 AM, "slurm-users on behalf of Christopher Samuel"
> <slurm-users-boun...@lists.schedmd.com on behalf of ch...@csamuel.org> wrote:
>
> On 06/02/18 21:40, Matteo F wrote:
>
> > I've tried to limit the number of running jobs using Qos ->
> > MaxJobsPerAccount, but this wouldn't stop a user from just filling up
> > the cluster with fewer (but bigger) jobs.
>
> You probably want to look at what you can do with the slurmdbd database
> and associations. Things like GrpTRES:
>
> https://slurm.schedmd.com/sacctmgr.html
>
> # GrpTRES=<TRES=max TRES,...>
> #     Maximum number of TRES running jobs are able to be allocated in
> #     aggregate for this association and all associations which are
> #     children of this association. To clear a previously set value use
> #     the modify command with a new value of -1 for each TRES id.
> #
> #     NOTE: This limit only applies fully when using the Select Consumable
> #     Resource plugin.
>
> Best of luck,
> Chris
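To make the GrpTRES route Chris points at above concrete, a minimal sketch (the account name "projA" and the numbers are invented, and per the doc excerpt the limit only applies fully with the consumable-resources select plugin):

    sacctmgr modify account name=projA set GrpTRES=cpu=512,node=16

With that association limit in place, the account's running jobs can never hold more than 512 CPUs or 16 nodes in aggregate; anything beyond that simply waits until earlier jobs finish, whether the user submits many small jobs or a few big ones.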
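And for the curl/wget idea Bill mentions above (posting completion records straight to a RESTful accounting service from the jobcomp/script hook), a rough sketch, assuming a made-up endpoint and ignoring auth and error handling; the variables are the same ones his script already reads from the environment:

    #!/bin/bash
    # Run by slurmctld via JobCompType=jobcomp/script at job completion;
    # the job's fields arrive as environment variables.
    curl -s -X POST https://accounting.example.org/api/jobs \
        -H "Content-Type: application/json" \
        -d "{\"jobid\":\"$JOBID\",\"uid\":\"$UID\",\"account\":\"$ACCOUNT\",\"partition\":\"$PARTITION\",\"start\":\"$START\",\"end\":\"$END\",\"state\":\"$JOBSTATE\",\"nodes\":\"$NODECNT\",\"procs\":\"$PROCS\"}"
    exit 0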