Hi Gary,

Thanks for this; you confirm my suspicions. Would you be prepared to share your magic code? Also, what version of SLURM are you running?
It would be great (feature request!) if there were a supported way to verify/restore agreement between these two sources of usage information.

Best regards,

Stuart

On 27/05/15 02:13, Skouson, Gary B wrote:
> I notice the same problem occasionally. I trust the info from the slurm
> database and have used sacct to generate "true" usage details for reports we
> provide. However, that doesn't solve the problem with people using more than
> we meant to allocate to their project. Limit enforcement seems to be tied to
> the data stored in the assoc_usage file rather than what's in the slurm
> database.
>
> I've used the output from sacct and the assoc_usage documentation (source
> code) to create a fixed version of that file. Assuming that you've checked
> the docs to make sure the file format hasn't changed, it should be OK to do
> this, but since it's not a supported process, you're correct that it may
> cause problems. Removing the assoc_usage file seemed to reset all share
> usage to 0, so that didn't really work for us.
>
> I've done the following several times to get share usage to match reality. To
> keep things simple, you'll have to do this while no jobs are running, or
> write code to deal with running jobs, or live with the fact that you're going
> to be off by a minute or so on usage data.
>
> - Run sacct to pull usage info and tabulate it for the assoc_usage file
>
> - Shut down slurmctld.
>
> - Keep a copy of the assoc_usage file, just in case
>
> - Run the magic code to create a "fixed" assoc_usage based on sacct results
>
> - Put the fixed assoc_usage file in place and start slurmctld
>
> It only takes a minute or so once we have the sacct data.
>
> I've tried to locate where things go wrong, but haven't come very close. It
> seems like some projects with the differences have had new users added to
> the project recently. I tried testing this theory on a test configuration,
> but couldn't reproduce the problem, so I may be wrong on that.
> It seems like if I restart slurmctld regularly, I don't get the same drift,
> but again, I don't have anything solid that says it helps.
>
> -----
> Gary Skouson
>
>
> -----Original Message-----
> From: Stuart Rankin [mailto:[email protected]]
> Sent: Tuesday, May 26, 2015 1:00 PM
> To: slurm-dev
> Subject: [slurm-dev] sreport usage vs slurmctld raw usage
>
>
> Hi,
>
> I've noticed that we've developed (slurm 14.11.4 installation) a disparity
> between what slurmctld believes is total usage via sshare and what slurmdbd
> believes via sreport (exhibited below).
>
> Rectifying this is desirable as I suspect it is the reason some projects have
> been able to acquire negative balances (we use sbank). Currently I trust the
> slurmdbd view of things. Is there a simple procedure to realign slurmctld -
> e.g. would stopping slurmctld, deleting the assoc_usage checkpoint file, and
> restarting slurmctld do the desired thing? I discovered an earlier thread in
> which editing this file corrected a similar problem, which seems more
> dangerous.
>
> Thanks for any advice -
>
> Stuart
>
>
> sreport -t hours cluster AccountUtilizationByUser account=a_project start=2014-02-01T00:00:00 end=2015-05-28T00:00:00 | head -7
>
> --------------------------------------------------------------------------------
> Cluster/Account/User Utilization 2014-02-01T00:00:00 - 2015-05-27T23:59:59 (41554800 secs)
> Time reported in CPU Hours
> --------------------------------------------------------------------------------
>   Cluster         Account     Login     Proper Name       Used     Energy
> --------- --------------- --------- --------------- ---------- ----------
>      hpcs       a_project                              1443228          0
>
>
> sshare --long -A a_project
>
>              Account       User Raw Shares Norm Shares   Raw Usage  Norm Usage Effectv Usage  FairShare  GrpCPUMins      CPURunMins
> -------------------- ---------- ---------- ----------- ----------- ----------- ------------- ---------- ----------- ---------------
> a_project                              121    0.012100  5026671760    0.007770      0.016086   0.397927   100096860          277306
>
>
> i.e. Raw Usage = 5026671760/3600 = 1396297.7 core hours
>

--
Dr. Stuart Rankin
Senior System Administrator
High Performance Computing Service
University of Cambridge
Email: [email protected]
Tel: (+)44 1223 763517
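
For reference, the size of the disparity between the two figures quoted in the original message can be worked out with a few lines of Python. This is only a sketch of the arithmetic: the function and constant names are made up for illustration, and it assumes, as the original comparison does, that the sshare Raw Usage figure is an undecayed count of CPU-seconds (if PriorityDecayHalfLife is set, Raw Usage decays over time and the two numbers are not directly comparable).

```python
# Figures copied from the sreport and sshare output quoted in the thread.
SREPORT_CPU_HOURS = 1443228          # slurmdbd view: sreport "Used", in CPU hours
SSHARE_RAW_USAGE_SECS = 5026671760   # slurmctld view: sshare "Raw Usage", in CPU seconds


def compare_usage(sreport_hours, sshare_raw_secs):
    """Return (sshare usage in hours, absolute gap in hours, gap relative to sreport).

    Assumes Raw Usage is a plain, undecayed CPU-second count.
    """
    sshare_hours = sshare_raw_secs / 3600.0
    diff_hours = sreport_hours - sshare_hours
    rel_diff = diff_hours / sreport_hours
    return sshare_hours, diff_hours, rel_diff


sshare_hours, diff_hours, rel_diff = compare_usage(
    SREPORT_CPU_HOURS, SSHARE_RAW_USAGE_SECS
)
print(f"sshare raw usage: {sshare_hours:.1f} core hours")
print(f"sreport usage:    {SREPORT_CPU_HOURS} core hours")
print(f"gap:              {diff_hours:.1f} core hours ({rel_diff:.2%} of sreport)")
```

On these numbers slurmctld is short by roughly 46,930 core hours, a little over 3% of what slurmdbd reports for the account.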

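
The first step in Gary's list (pull usage with sacct and tabulate it) might look something like the sketch below. This is not Gary's "magic code", which is not shown in the thread, and it makes no attempt to read or write the assoc_usage file. It only sums CPU-seconds per account, assuming the input came from a command along the lines of `sacct -a -X -n -P -S <start> -E <end> -o Account,User,CPUTimeRAW`, i.e. one job per line as `Account|User|CPUTimeRAW`. The function name and the sample data are hypothetical.

```python
from collections import defaultdict


def tally_cpu_seconds(sacct_text):
    """Sum CPUTimeRAW (CPU-seconds) per account.

    Expects pipe-delimited lines of the form Account|User|CPUTimeRAW,
    as assumed in the lead-in; lines that do not fit are skipped.
    """
    totals = defaultdict(int)
    for line in sacct_text.splitlines():
        line = line.strip()
        if not line:
            continue
        fields = line.split("|")
        if len(fields) != 3 or not fields[2].isdigit():
            continue  # skip anything that is not Account|User|<integer>
        account, _user, cpu_secs = fields
        totals[account] += int(cpu_secs)
    return dict(totals)


# Hypothetical sample input, not real accounting data.
SAMPLE = """\
a_project|alice|7200
a_project|bob|3600
b_project|carol|1800
"""

totals = tally_cpu_seconds(SAMPLE)
for account, secs in sorted(totals.items()):
    print(f"{account}: {secs} CPU-seconds = {secs / 3600:.1f} core hours")
```

Per-account totals like these could then be checked against the Raw Usage column of `sshare --long` to see which associations have drifted, before deciding whether anything needs rewriting.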