I see the same problem occasionally. I trust the information in the Slurm 
database and have used sacct to generate "true" usage details for the reports 
we provide. However, that doesn't solve the problem of people using more than 
we meant to allocate to their project. Limit enforcement seems to be tied to 
the data stored in the assoc_usage file rather than to what's in the Slurm 
database.

I've used the output from sacct and the assoc_usage documentation (the source 
code) to create a fixed version of that file. Assuming that you've checked the 
docs to make sure the file format hasn't changed, it should be OK to do this. 
You're correct that it may cause problems, though, since it's not a supported 
process. Removing the assoc_usage file seemed to reset all share usage to 0, 
so that didn't really work for us.

I've done the following several times to get share usage to match reality. To 
keep things simple, you'll have to do this while no jobs are running, or write 
code to deal with running jobs, or live with the fact that you're going to be 
off by a minute or so on usage data.

- Run sacct to pull usage info and tabulate it for the assoc_usage file 

- Shut down slurmctld.

- Keep a copy of the assoc_usage file, just in case

- Run the magic code to create a "fixed" assoc_usage based on sacct results

- Put the fixed assoc_usage file in place and start slurmctld

It only takes a minute or so once we have the sacct data. 
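
As a rough illustration of the first step only, here is a minimal Python sketch that sums CPU seconds per account and user from pipe-separated sacct output. It is not the "magic code" mentioned above and does not write the assoc_usage file. The function names and the sample data are hypothetical, and the sacct options shown (-a, -X, -n, -P, -S, -o with Account,User,CPUTimeRAW) are an assumption about a suitable query; check them against your Slurm version. If your priority configuration decays usage over time, a plain sum like this will not be directly comparable to Raw Usage.

```python
import subprocess
from collections import defaultdict


def tabulate_cpu_seconds(sacct_text):
    """Sum CPU seconds per (account, user) from 'Account|User|CPUTimeRAW' lines."""
    totals = defaultdict(int)
    for line in sacct_text.splitlines():
        fields = line.strip().split("|")
        if len(fields) != 3 or not fields[2].isdigit():
            continue  # skip blank or malformed lines
        account, user, cpu_seconds = fields
        totals[(account, user)] += int(cpu_seconds)
    return dict(totals)


def fetch_sacct(start):
    """Run sacct for all users since `start` (assumed options, one line per allocation)."""
    cmd = ["sacct", "-a", "-X", "-n", "-P", "-S", start,
           "-o", "Account,User,CPUTimeRAW"]
    return subprocess.run(cmd, check=True, capture_output=True, text=True).stdout


# Hypothetical sample output, used here so the sketch runs without a cluster.
sample = "a_project|alice|3600\na_project|alice|7200\na_project|bob|100\n"
for (account, user), secs in sorted(tabulate_cpu_seconds(sample).items()):
    print(account, user, secs)
```

On a real system you would pass the output of fetch_sacct("2014-02-01T00:00:00") to tabulate_cpu_seconds instead of the sample string.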

I've tried to locate where things go wrong, but haven't come very close. It 
seems that some of the projects showing differences have had new users added 
recently. I tried testing this theory on a test configuration, but couldn't 
reproduce the problem, so I may be wrong about that. It also seems that if I 
restart slurmctld regularly I don't get the same drift, but again, I don't 
have anything solid that says it helps.

-----
Gary Skouson


-----Original Message-----
From: Stuart Rankin [mailto:[email protected]] 
Sent: Tuesday, May 26, 2015 1:00 PM
To: slurm-dev
Subject: [slurm-dev] sreport usage vs slurmctld raw usage


Hi,

I've noticed that we've developed (slurm 14.11.4 installation) a disparity 
between what slurmctld believes is total usage via sshare and what slurmdbd 
believes via sreport (exhibited below).

Rectifying this is desirable as I suspect it is the reason some projects have 
been able to acquire negative balances (we use sbank). Currently I trust the 
slurmdbd view of things. Is there a simple procedure to realign slurmctld - 
e.g. would stopping slurmctld, deleting the assoc_usage checkpoint file, and 
restarting slurmctld do the desired thing? I discovered an earlier thread in 
which editing this file corrected a similar problem, which seems more 
dangerous.

Thanks for any advice -

Stuart


sreport -t hours cluster AccountUtilizationByUser account=a_project start=2014-02-01T00:00:00 end=2015-05-28T00:00:00 | head -7

--------------------------------------------------------------------------------
Cluster/Account/User Utilization 2014-02-01T00:00:00 - 2015-05-27T23:59:59 (41554800 secs)
Time reported in CPU Hours
--------------------------------------------------------------------------------
  Cluster         Account     Login     Proper Name       Used     Energy
--------- --------------- --------- --------------- ---------- ----------
     hpcs        a_project                            1443228          0



sshare --long -A a_project

             Account       User Raw Shares Norm Shares   Raw Usage  Norm Usage Effectv Usage  FairShare  GrpCPUMins      CPURunMins
-------------------- ---------- ---------- ----------- ----------- ----------- ------------- ---------- ----------- ---------------
a_project                              121    0.012100  5026671760    0.007770      0.016086   0.397927   100096860          277306


i.e. Raw Usage = 5026671760/3600 = 1396297.7 core hours
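
The comparison between the two figures can be written out as a short Python sketch. It assumes, as the division by 3600 above does, that Raw Usage is in CPU seconds; the variable and function names are purely illustrative.

```python
def cpu_seconds_to_hours(cpu_seconds):
    """Convert CPU seconds to core hours."""
    return cpu_seconds / 3600


sshare_raw_usage = 5026671760  # Raw Usage reported by sshare (assumed CPU seconds)
sreport_hours = 1443228        # Used reported by sreport (CPU hours)

sshare_hours = cpu_seconds_to_hours(sshare_raw_usage)
gap_hours = sreport_hours - sshare_hours  # how far slurmctld lags slurmdbd

print(f"sshare:  {sshare_hours:.1f} core hours")
print(f"sreport: {sreport_hours} core hours")
print(f"gap:     {gap_hours:.1f} core hours")
```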






-- 
Dr. Stuart Rankin

Senior System Administrator
High Performance Computing Service
University of Cambridge
Email: [email protected]
Tel: (+)44 1223 763517
