I would make sure you have PriorityDecayHalfLife=0 set in your slurm.conf; otherwise raw usage is decayed over time and will almost never match what sacct shows. If it isn't set to 0, what you are seeing is expected. You will probably want to set PriorityUsageResetPeriod as well.

You can look these up in the slurm.conf man page http://slurm.schedmd.com/slurm.conf.html
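
To illustrate why a non-zero half-life makes the two numbers drift apart, here is a rough Python sketch of half-life decay. It is not Slurm's actual code (slurmctld applies the decay incrementally as it recalculates priorities); the function name, the sample jobs, and the 7-day half-life are assumptions made up for the example.

```python
def decayed_usage(samples, half_life_s, now_s):
    """Sum (timestamp_s, cpu_seconds) samples, halving each sample's
    weight for every half_life_s of age. half_life_s == 0 means no decay."""
    total = 0.0
    for t, cpu_s in samples:
        if half_life_s == 0:
            total += cpu_s  # no decay: matches a plain accounting sum
        else:
            total += cpu_s * 0.5 ** ((now_s - t) / half_life_s)
    return total

DAY = 86400
# Two hypothetical one-CPU-hour jobs, one a week old and one just finished.
samples = [(0, 3600.0), (7 * DAY, 3600.0)]
no_decay = decayed_usage(samples, 0, 7 * DAY)
with_decay = decayed_usage(samples, 7 * DAY, 7 * DAY)
print(no_decay, with_decay)
```

With no decay the total is the plain sum of CPU seconds, which is what an accounting query adds up; with decay the week-old job only counts for half, so the two views disagree even when nothing is wrong.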

On 05/27/15 06:28, lars malinowsky wrote:
"Skouson, Gary B" <[email protected]> writes:

I notice the same problem occasionally.  I trust the info from the
slurm database and have used sacct to generate "true" usage details
for reports we provide. However, that doesn't solve the problem with
people using more than we meant to allocate to their project. Limit
enforcement seems to be tied to the data stored in the assoc_usage
file rather than what's in the slurm database.
We have experienced a similar situation on 14.11.5.

Also, GrpCPURunMins values with strange contents make stray accounts
hit AssocGrpCPURunMinsLimit, even without any jobs actually
running under those accounts.

I've used the output from sacct and the assoc_usage documentation
(source code) to create a fixed version of that file. Assuming that
you've checked the docs to make sure the file format hasn't changed,
it should be OK to do this, but since it's not a supported process,
you're correct, that it may cause problems.  Removing the assoc_usage
file seemed to reset all share usage to 0, so that didn't really work
for us.
Stopping slurmctld, zapping the assoc_usage file and starting it
again did remedy the immediate problem, although that is hardly
workable to do too many times.

I've done the following several times to get share usage to match
reality. To keep things simple, you'll have to do this while no jobs
are running, or write code to deal with running jobs, or live with the
fact that you're going to be off by a minute or so on usage data.

- Run sacct to pull usage info and tabulate it for the assoc_usage file

- Shut down slurmctld.

- Keep a copy of the assoc_usage file, just in case

- Run the magic code to create a "fixed" assoc_usage based on sacct results

- Put the fixed assoc_usage file in place and start slurmctld

It only takes a minute or so once we have the sacct data.
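
The first step in the list above (tabulating sacct output) could look roughly like the Python sketch below. It assumes pipe-delimited text with Account, User and CPUTimeRAW columns, such as what something like `sacct -a -X -n -P -o Account,User,CPUTimeRAW` with suitable -S/-E times would print; the function name and the sample rows are made up for illustration. It deliberately stops short of writing the assoc_usage file, since that format has to be checked against the source for the version in use.

```python
from collections import defaultdict

def tabulate_cpu_seconds(sacct_text):
    """Sum CPUTimeRAW (CPU-seconds) per account from
    'Account|User|CPUTimeRAW' lines."""
    totals = defaultdict(int)
    for line in sacct_text.splitlines():
        fields = line.strip().split("|")
        if len(fields) != 3 or not fields[2].isdigit():
            continue  # skip blank or malformed lines
        account, _user, cpu_s = fields
        totals[account] += int(cpu_s)
    return dict(totals)

# Hypothetical sample rows, not real accounting data.
sample = """a_project|alice|7200
a_project|bob|3600
b_project|carol|1800
"""
totals = tabulate_cpu_seconds(sample)
print(totals)
```

The per-account totals are in CPU-seconds, the same unit as the Raw Usage column in sshare, so they can be compared directly before anything is written back.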

I've tried to locate where things go wrong, but haven't come very
close.  It seems like some of the projects with differences have had
new users added recently.  I tried testing this theory on a test
configuration, but couldn't reproduce the problem, so I may be wrong
on that. It also seems that if I restart slurmctld regularly, I don't
get the same drift, but again, I don't have anything solid that says
it helps.

-----
Gary Skouson
regards,
lars.

-----Original Message-----
From: Stuart Rankin [mailto:[email protected]]
Sent: Tuesday, May 26, 2015 1:00 PM
To: slurm-dev
Subject: [slurm-dev] sreport usage vs slurmctld raw usage


Hi,

I've noticed that we've developed (slurm 14.11.4 installation) a
disparity between what slurmctld believes is total usage via sshare
and what slurmdbd believes via sreport (exhibited below).

Rectifying this is desirable as I suspect it is the reason some
projects have been able to acquire negative balances (we use sbank).
Currently I trust the slurmdbd view of things. Is there a simple
procedure to realign slurmctld - e.g. would stopping slurmctld,
deleting the assoc_usage checkpoint file, and restarting slurmctld do
the desired thing? I discovered an earlier thread in which editing
this file corrected a similar problem, which seems more dangerous.

Thanks for any advice -

Stuart


sreport -t hours cluster AccountUtilizationByUser account=a_project start=2014-02-01T00:00:00 end=2015-05-28T00:00:00 | head -7

--------------------------------------------------------------------------------
Cluster/Account/User Utilization 2014-02-01T00:00:00 - 2015-05-27T23:59:59 (41554800 secs)
Time reported in CPU Hours
--------------------------------------------------------------------------------
   Cluster         Account     Login     Proper Name       Used     Energy
--------- --------------- --------- --------------- ---------- ----------
      hpcs        a_project                            1443228          0



sshare --long -A a_project

             Account       User Raw Shares Norm Shares   Raw Usage  Norm Usage Effectv Usage  FairShare  GrpCPUMins      CPURunMins
-------------------- ---------- ---------- ----------- ----------- ----------- ------------- ---------- ----------- ---------------
a_project                              121    0.012100  5026671760    0.007770      0.016086   0.397927   100096860          277306


i.e. Raw Usage = 5026671760/3600 = 1396297.7 core hours
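
For reference, the same comparison as a small Python sketch, using only the two figures shown above (sreport in CPU hours, sshare Raw Usage in CPU-seconds). The function name is made up for the example.

```python
def compare_usage(sreport_cpu_hours, sshare_raw_usage_s):
    """Return (sshare usage in hours, difference in hours,
    difference as a fraction of the sreport figure)."""
    sshare_hours = sshare_raw_usage_s / 3600.0
    diff = sreport_cpu_hours - sshare_hours
    return sshare_hours, diff, diff / sreport_cpu_hours

sshare_hours, diff_hours, rel = compare_usage(1443228, 5026671760)
print(f"{sshare_hours:.1f} {diff_hours:.1f} {rel:.2%}")
```

So slurmctld is short by roughly 46,930 core hours for this account, a bit over 3% of what slurmdbd reports.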
