Of course, Matteo. Happy to help. Our job completion script is:

#!/bin/bash

OUTFILE=/var/log/slurm/tacc_jobs_completed

echo "$JOBID:$UID:$ACCOUNT:$BATCH:$START:$END:$SUBMIT:$PARTITION:$LIMIT:$JOBNAME:$JOBSTATE:$NODECNT:$PROCS" >> $OUTFILE

exit 0

and our config settings (from scontrol show config) are:

JobCompLoc              = /etc/slurm/tacc_job_completion.sh
JobCompType             = jobcomp/script
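
For reference, those correspond to lines like these in slurm.conf on the
slurmctld host (a sketch with the "tacc" parts removed, as noted below):

JobCompType=jobcomp/script
JobCompLoc=/etc/slurm/job_completion.sh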

Feel free to steal as much of that as you like; just update the paths and 
names to remove the “tacc” parts. This script needs to be present on the 
machine that slurmctld is running on. Our internal accounting system is 
RESTful, so we’re thinking of using this mechanism to write accounting 
records to it directly, with curl/wget calls in this plugin script, rather 
than appending to our flat file and shipping that info to our database via 
a nightly cron script. That would give us the ability to do live updates of 
balances (which the Slurm DB already supports) to prevent overdrawn 
accounts. This is convoluted, but we have had to reinvent the wheel a 
little since we need to report usage to both our local accounting database 
and a national one. Yes, there were probably other ways to do this, but the 
infrastructure is now historical and more or less set in stone.
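
If you do go the REST route, the jobcomp script might end up looking 
something like this (a sketch only; the endpoint URL and the JSON field 
names are placeholders for whatever your accounting API expects):

#!/bin/bash
# Hypothetical variant of the script above: POST the completion record
# to a REST endpoint instead of appending to a flat file.
ACCT_URL="https://accounting.example.edu/api/jobs"   # placeholder endpoint

payload=$(printf '{"jobid":"%s","uid":"%s","account":"%s","partition":"%s","state":"%s","start":"%s","end":"%s","nodes":"%s","procs":"%s"}' \
    "$JOBID" "$UID" "$ACCOUNT" "$PARTITION" "$JOBSTATE" "$START" "$END" "$NODECNT" "$PROCS")

# Time-bound the call so a slow accounting service can't hold things up,
# and fall back to a flat file so no record is lost.
if ! curl -sf -m 5 -H 'Content-Type: application/json' -d "$payload" "$ACCT_URL"; then
    echo "$JOBID:$UID:$ACCOUNT:$PARTITION:$JOBSTATE" >> /var/log/slurm/jobs_failed_post
fi

exit 0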

Best,
Bill.

-- 
Bill Barth, Ph.D., Director, HPC
bba...@tacc.utexas.edu        |   Phone: (512) 232-7069
Office: ROC 1.435            |   Fax:   (512) 475-9445
 
 

On 2/7/18, 12:28 AM, "slurm-users on behalf of Matteo F" 
<slurm-users-boun...@lists.schedmd.com on behalf of mfasco...@gmail.com> wrote:

    Thanks Bill, I really appreciate the time you spent giving this detailed 
    answer. I will have a look at the plugin system, as the integration with 
    our accounting system would be a nice feature.
    
    @Chris thanks, I've had a look at GrpTRES but I'll probably go with the 
    Spank route.
    
    Best, 
    Matteo
    
    On 6 February 2018 at 13:58, Bill Barth 
    <bba...@tacc.utexas.edu> wrote:
    
    Chris probably gives the Slurm-iest way to do this, but we use a Spank 
    plugin that counts the jobs a user has in the queue (running and 
    waiting) and sets a hard cap on how many they can have. This should 
    probably be scaled to the size of the system and the partition they are 
    submitting to, but on Stampede 2 (4200 KNL nodes and 1736 SKX nodes) we 
    set it, across all queues, to about 50. That has been our magic number 
    across numerous schedulers over the years, on systems ranging from 
    hundreds of nodes to Stampede 1 with 6400. Some users get more by 
    request, and most don’t even bump up against the limits. We’ve started 
    to look at using TRES on our test system, but we haven’t gotten there 
    yet. Our use of the DB is minimal, and our process to get every user 
    into it when their TACC account is created is not 100% automated yet 
    (we use the job completion plugin to create a flat file with job 
    records, which our local accounting system consumes to decrement 
    allocation balances, if you care to know).
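    
    The check is simple enough to sketch in shell (assumptions here: a cap 
    of 50 and counting only pending/running jobs; the real plugin does this 
    in C inside Spank at submit time):
    
    #!/bin/bash
    # Sketch of the per-user job-count cap described above.
    # MAX_JOBS is a site-chosen number, not a Slurm setting.
    MAX_JOBS=50
    
    # Count this user's jobs that are currently pending or running.
    count=$(squeue -h -u "$USER" -t pending,running | wc -l)
    
    if [ "$count" -ge "$MAX_JOBS" ]; then
        echo "You already have $count jobs in queue (limit $MAX_JOBS)." >&2
        exit 1
    fi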
    
    Best,
    Bill.
    
    --
    Bill Barth, Ph.D., Director, HPC
    bba...@tacc.utexas.edu        |   Phone: (512) 232-7069
    Office: ROC 1.435            |   Fax:   (512) 475-9445
    
    
    
    On 2/6/18, 6:03 AM, "slurm-users on behalf of Christopher Samuel" 
<slurm-users-boun...@lists.schedmd.com on behalf of
    ch...@csamuel.org> wrote:
    
        On 06/02/18 21:40, Matteo F wrote:
    
        > I've tried to limit the number of running jobs using QOS ->
        > MaxJobsPerAccount, but this wouldn't stop a user from just filling
        > up the cluster with fewer (but bigger) jobs.
    
        You probably want to look at what you can do with the slurmdbd database
        and associations. Things like GrpTRES:
    
        
        https://slurm.schedmd.com/sacctmgr.html
    
        # GrpTRES=<TRES=max TRES,...>
        #     Maximum number of TRES running jobs are able to be allocated in
        # aggregate for this association and all associations which are children
        # of this association. To clear a previously set value use the modify
        # command with a new value of -1 for each TRES id.
        #
        #  NOTE: This limit only applies fully when using the Select Consumable
        # Resource plugin.
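        
        For example, a group-wide cap might be set with something like the
        following (a sketch; "phys" is a hypothetical account name):
        
            sacctmgr modify account name=phys set GrpTRES=cpu=512,node=16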
    
        Best of luck,
        Chris