I notice the same problem occasionally. I trust the info from the slurm database and have used sacct to generate "true" usage details for reports we provide. However, that doesn't solve the problem with people using more than we meant to allocate to their project. Limit enforcement seems to be tied to the data stored in the assoc_usage file rather than what's in the slurm database.
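A minimal sketch of the kind of tabulation meant here, not the actual script. It assumes the data was pulled with something like `sacct -a -X -n -P --format=Account,CPUTimeRAW` plus suitable start and end times, which yields one `Account|CPUTimeRAW` line per job. The function name `tabulate_cpu_seconds` and the sample rows are made up for illustration.

```python
from collections import defaultdict


def tabulate_cpu_seconds(sacct_text):
    """Sum CPU seconds per account from pipe-delimited 'Account|CPUTimeRAW' lines."""
    totals = defaultdict(int)
    for line in sacct_text.splitlines():
        line = line.strip()
        if not line:
            continue
        # Split on the first pipe: account name, then CPU seconds.
        account, _, cpu_seconds = line.partition("|")
        if not account or not cpu_seconds.isdigit():
            continue  # skip rows that do not match the assumed format
        totals[account] += int(cpu_seconds)
    return dict(totals)


# Illustrative sample only, not real accounting data.
sample = "a_project|7200\nb_project|3600\na_project|1800\n"
totals = tabulate_cpu_seconds(sample)
for account, seconds in sorted(totals.items()):
    print(f"{account} {seconds / 3600:.2f} CPU hours")
```

Dividing the per-account totals by 3600 gives CPU hours for comparison with sreport.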
I've used the output from sacct and the assoc_usage documentation (source code) to create a fixed version of that file. Assuming that you've checked the docs to make sure the file format hasn't changed, it should be OK to do this, but since it's not a supported process, you're correct that it may cause problems. Removing the assoc_usage file seemed to reset all share usage to 0, so that didn't really work for us.

I've done the following several times to get share usage to match reality. To keep things simple, you'll have to do this while no jobs are running, or write code to deal with running jobs, or live with the fact that you're going to be off by a minute or so on usage data.

 - Run sacct to pull usage info and tabulate it for the assoc_usage file
 - Shut down slurmctld
 - Keep a copy of the assoc_usage file, just in case
 - Run the magic code to create a "fixed" assoc_usage based on sacct results
 - Put the fixed assoc_usage file in place and start slurmctld

It only takes a minute or so once we have the sacct data.

I've tried to locate where things go wrong, but haven't come very close. It seems that some of the projects showing differences have had new users added to them recently. I tried testing this theory on a test configuration, but couldn't reproduce the problem, so I may be wrong on that. It also seems that if I restart slurmctld regularly, I don't get the same drift, but again, I don't have anything solid that says it helps.

-----
Gary Skouson

-----Original Message-----
From: Stuart Rankin [mailto:[email protected]]
Sent: Tuesday, May 26, 2015 1:00 PM
To: slurm-dev
Subject: [slurm-dev] sreport usage vs slurmctld raw usage

Hi,

I've noticed that we've developed (slurm 14.11.4 installation) a disparity between what slurmctld believes is total usage via sshare and what slurmdbd believes via sreport (exhibited below). Rectifying this is desirable as I suspect it is the reason some projects have been able to acquire negative balances (we use sbank).
Currently I trust the slurmdbd view of things. Is there a simple procedure to realign slurmctld - e.g. would stopping slurmctld, deleting the assoc_usage checkpoint file, and restarting slurmctld do the desired thing? I discovered an earlier thread in which editing this file corrected a similar problem, which seems more dangerous.

Thanks for any advice -

Stuart

sreport -t hours cluster AccountUtilizationByUser account=a_project start=2014-02-01T00:00:00 end=2015-05-28T00:00:00 | head -7
--------------------------------------------------------------------------------
Cluster/Account/User Utilization 2014-02-01T00:00:00 - 2015-05-27T23:59:59 (41554800 secs)
Time reported in CPU Hours
--------------------------------------------------------------------------------
  Cluster         Account     Login     Proper Name       Used     Energy
--------- --------------- --------- --------------- ---------- ----------
     hpcs       a_project                              1443228          0

sshare --long -A a_project
Account                    User Raw Shares Norm Shares   Raw Usage  Norm Usage Effectv Usage  FairShare  GrpCPUMins      CPURunMins
-------------------- ---------- ---------- ----------- ----------- ----------- ------------- ---------- ----------- ---------------
a_project                              121    0.012100  5026671760    0.007770      0.016086   0.397927   100096860          277306

i.e. Raw Usage = 5026671760/3600 = 1396297.7 core hours

--
Dr. Stuart Rankin
Senior System Administrator
High Performance Computing Service
University of Cambridge
Email: [email protected]
Tel: (+)44 1223 763517
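For reference, the size of the disparity in the figures quoted above can be worked out as follows. This is only a sketch: it assumes, as the message does, that the sshare Raw Usage value is in CPU seconds, and the helper name `usage_gap_hours` is hypothetical.

```python
SECONDS_PER_HOUR = 3600


def usage_gap_hours(raw_usage_seconds, sreport_cpu_hours):
    """Return (sshare usage in hours, sreport hours minus sshare hours)."""
    sshare_hours = raw_usage_seconds / SECONDS_PER_HOUR
    return sshare_hours, sreport_cpu_hours - sshare_hours


# Values taken from the sshare and sreport output in the message.
sshare_hours, gap = usage_gap_hours(5026671760, 1443228)
print(f"sshare: {sshare_hours:.1f} core hours")
print(f"sreport minus sshare: {gap:.1f} core hours")
```

On these numbers slurmctld is about 46930 core hours below the slurmdbd figure, roughly 3% of the sreport total.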
