"Skouson, Gary B" <[email protected]> writes: > I notice the same problem occasionally. I trust the info from the > slurm database and have used sacct to generate "true" usage details > for reports we provide. However, that doesn't solve the problem with > people using more than we meant to allocate to their project. Limit > enforcement seems to be tied to the data stored in the assoc_usage > file rather than what's in the slurm database.

We have experienced a similar situation with 14.11.5. GrpCPURunMins
entries with strange contents also make stray accounts hit
AssocGrpCPURunMinsLimit, even without any jobs actually running under
those accounts.

> I've used the output from sacct and the assoc_usage documentation
> (source code) to create a fixed version of that file. Assuming that
> you've checked the docs to make sure the file format hasn't changed,
> it should be OK to do this, but since it's not a supported process,
> you're correct, that it may cause problems. Removing the assoc_usage
> file seemed to reset all share usage to 0, so that didn't really work
> for us.

Stopping slurmctld, zapping the assoc_usage file and starting it again
did remedy the immediate problem, although that is hardly workable to
do too many times.

> I've done the following several times to get share usage to match
> reality. To keep things simple, you'll have to do this while no jobs
> are running, or write code to deal with running jobs, or live with the
> fact that you're going to be off by a minute or so on usage data.
>
> - Run sacct to pull usage info and tabulate it for the assoc_usage file
>
> - Shut down slurmctl.
>
> - Keep a copy of the assoc_usage file, just in case
>
> - Run the magic code to create a "fixed" assoc_usage based on sacct results
>
> - Put the fixed assoc_usage file in place and start slurmctl
>
> It only takes a minute or so once we have the sacct data.
>
> I've tried to locate where things go wrong, but haven't come very
> close. It seems like some projects with the differences, have had new
> users added to the project recently. I tried testing this theory on a
> test configuration, but couldn't reproduce the problem, so I may be
> wrong on that. It seems like if I restart slurmctld regularly, that I
> don't get the same drift, but again, I don't have anything solid that
> says it helps.
>
> -----
> Gary Skouson

regards,
 lars.
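
For the first step in Gary's list (pulling usage from sacct and tabulating it), a minimal sketch could look like the following. It assumes the input is pipe-delimited text with the columns Account, User and CPUTimeRAW, for example from something like `sacct -a -X -n -P -o Account,User,CPUTimeRAW -S <start>`; that command line and the sample rows are assumptions for illustration, not what Gary actually ran. The function name is hypothetical, and the sketch deliberately stops at the totals: it does not write an assoc_usage file, whose format would have to be taken from the slurm source for the version in use.

```python
from collections import defaultdict


def tabulate_cpu_seconds(sacct_text):
    """Sum CPU seconds per (account, user) from pipe-delimited rows.

    Each non-empty row is expected to look like: account|user|cpu_seconds
    """
    totals = defaultdict(int)
    for line in sacct_text.splitlines():
        line = line.strip()
        if not line:
            continue
        account, user, cpu_raw = line.split("|")
        totals[(account, user)] += int(cpu_raw)
    return dict(totals)


# Hypothetical sample rows, not real accounting data.
sample = """a_project|alice|7200
a_project|bob|3600
a_project|alice|1800
b_project|carol|900
"""

usage = tabulate_cpu_seconds(sample)
for (account, user), secs in sorted(usage.items()):
    print(f"{account} {user} {secs / 3600:.2f} core hours")
```

Keying on (account, user) rather than account alone keeps the per-user split available, which matters if the drift really is related to users recently added to a project.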

> -----Original Message-----
> From: Stuart Rankin [mailto:[email protected]]
> Sent: Tuesday, May 26, 2015 1:00 PM
> To: slurm-dev
> Subject: [slurm-dev] sreport usage vs slurmctld raw usage
>
> Hi,
>
> I've noticed that we've developed (slurm 14.11.4 installation) a
> disparity between what slurmctld believes is total usage via sshare
> and what slurmdbd believes via sreport (exhibited below).
>
> Rectifying this is desirable as I suspect it is the reason some
> projects have been able to acquire negative balances (we use sbank).
> Currently I trust the slurmdbd view of things. Is there a simple
> procedure to realign slurmctld - e.g. would stopping slurmctld,
> deleting the assoc_usage checkpoint file, and restarting slurmctld do
> the desired thing? I discovered an earlier thread in which editing
> this file corrected a similar problem, which seems more dangerous.
>
> Thanks for any advice -
>
> Stuart
>
>
> sreport -t hours cluster AccountUtilizationByUser account=a_project start=2014-02-01T00:00:00 end=2015-05-28T00:00:00 | head -7
>
> --------------------------------------------------------------------------------
> Cluster/Account/User Utilization 2014-02-01T00:00:00 - 2015-05-27T23:59:59 (41554800 secs)
> Time reported in CPU Hours
> --------------------------------------------------------------------------------
>   Cluster         Account     Login     Proper Name       Used     Energy
> --------- --------------- --------- --------------- ---------- ----------
>      hpcs       a_project                              1443228          0
>
>
> sshare --long -A a_project
>
> Account                    User Raw Shares Norm Shares   Raw Usage  Norm Usage Effectv Usage  FairShare  GrpCPUMins      CPURunMins
> -------------------- ---------- ---------- ----------- ----------- ----------- ------------- ---------- ----------- ---------------
> a_project                              121    0.012100  5026671760    0.007770      0.016086   0.397927   100096860          277306
>
>
> i.e. Raw Usage = 5026671760/3600 = 1396297.7 core hours
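
The comparison at the end of Stuart's message can be written out as a short calculation. This is only a sketch using the two figures quoted above; it follows the message in treating the sshare Raw Usage as CPU seconds directly comparable to the sreport total, and the function name is hypothetical.

```python
SECONDS_PER_HOUR = 3600


def compare_usage(raw_usage_cpu_seconds, sreport_cpu_hours):
    """Convert sshare Raw Usage to core hours and compare with sreport."""
    sshare_hours = raw_usage_cpu_seconds / SECONDS_PER_HOUR
    difference = sreport_cpu_hours - sshare_hours
    relative = difference / sreport_cpu_hours
    return sshare_hours, difference, relative


# Figures taken from the sshare and sreport output quoted above.
sshare_hours, difference, relative = compare_usage(5026671760, 1443228)
print(f"sshare:     {sshare_hours:.1f} core hours")
print(f"sreport:    1443228 core hours")
print(f"difference: {difference:.1f} core hours ({relative:.2%})")
```

With these numbers slurmctld's figure comes out a few percent below slurmdbd's, which is the direction that would let a project spend more than its allocation.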
