"Skouson, Gary B" <[email protected]> writes:

> I notice the same problem occasionally.  I trust the info from the
> slurm database and have used sacct to generate "true" usage details
> for reports we provide. However, that doesn't solve the problem with
> people using more than we meant to allocate to their project. Limit
> enforcement seems to be tied to the data stored in the assoc_usage
> file rather than what's in the slurm database.

We have experienced a similar situation with 14.11.5.

Also, GrpCPURunMins values with strange contents make stray accounts
hit AssocGrpCPURunMinsLimit, even without any jobs actually
running under those accounts.

> I've used the output from sacct and the assoc_usage documentation
> (source code) to create a fixed version of that file. Assuming that
> you've checked the docs to make sure the file format hasn't changed,
> it should be OK to do this, but since it's not a supported process,
> you're correct, that it may cause problems.  Removing the assoc_usage
> file seemed to reset all share usage to 0, so that didn't really work
> for us.

Stopping slurmctld, zapping the assoc_usage file and starting it
again did remedy the immediate problem, although that is hardly
workable to do too many times.

> I've done the following several times to get share usage to match
> reality. To keep things simple, you'll have to do this while no jobs
> are running, or write code to deal with running jobs, or live with the
> fact that you're going to be off by a minute or so on usage data.
>
> - Run sacct to pull usage info and tabulate it for the assoc_usage file 
>
> - Shut down slurmctld.
>
> - Keep a copy of the assoc_usage file, just in case
>
> - Run the magic code to create a "fixed" assoc_usage based on sacct results
>
> - Put the fixed assoc_usage file in place and start slurmctld
>
> It only takes a minute or so once we have the sacct data. 
>
> I've tried to locate where things go wrong, but haven't come very
> close.  It seems that some of the projects showing differences have
> had new users added recently.  I tried testing this theory on a
> test configuration, but couldn't reproduce the problem, so I may be
> wrong on that. It also seems that if I restart slurmctld regularly, I
> don't get the same drift, but again, I don't have anything solid that
> says it helps.
>
> -----
> Gary Skouson
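
The first step in the list above (pulling usage from sacct and
tabulating it) could look roughly like the sketch below. It is only a
sketch under assumptions: the field list Account,User,CPUTimeRAW and
the helper names are my own choice, not Gary's actual code, and it
deliberately stops short of writing an assoc_usage file, since that
format is version-specific and has to be checked against the source.

```python
import subprocess
from collections import defaultdict


def tabulate_cpu_seconds(sacct_text):
    """Sum CPUTimeRAW (CPU seconds) per (account, user).

    Expects 'Account|User|CPUTimeRAW' rows, as produced by sacct with
    -n (no header) and -P (pipe-delimited output).
    """
    totals = defaultdict(int)
    for line in sacct_text.splitlines():
        fields = line.strip().split("|")
        if len(fields) != 3 or not fields[2].isdigit():
            continue  # skip blank or malformed rows
        account, user, cpu_seconds = fields
        totals[(account, user)] += int(cpu_seconds)
    return dict(totals)


def run_sacct(start, end):
    """Hypothetical helper: query sacct for all users, one row per job (-X)."""
    cmd = ["sacct", "-a", "-n", "-P", "-X", "-S", start, "-E", end,
           "-o", "Account,User,CPUTimeRAW"]
    result = subprocess.run(cmd, check=True, capture_output=True, text=True)
    return result.stdout


# Made-up sample rows so the tabulation can be tried without a cluster.
SAMPLE = (
    "a_project|alice|7200\n"
    "a_project|bob|3600\n"
    "b_project|carol|1800\n"
    "a_project|alice|1800\n"
)
totals = tabulate_cpu_seconds(SAMPLE)
for (account, user), seconds in sorted(totals.items()):
    print(account, user, seconds)
```

On a real system, tabulate_cpu_seconds(run_sacct(start, end)) would
replace the sample data. Jobs still running at query time are not
handled specially here, which matches the "off by a minute or so"
caveat above.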

regards,
lars.

> -----Original Message-----
> From: Stuart Rankin [mailto:[email protected]] 
> Sent: Tuesday, May 26, 2015 1:00 PM
> To: slurm-dev
> Subject: [slurm-dev] sreport usage vs slurmctld raw usage
>
>
> Hi,
>
> I've noticed that we've developed (slurm 14.11.4 installation) a disparity
> between what slurmctld believes is total usage via sshare and what slurmdbd
> believes via sreport (exhibited below).
>
> Rectifying this is desirable as I suspect it is the reason some projects have
> been able to acquire negative balances (we use sbank). Currently I trust the
> slurmdbd view of things. Is there a simple procedure to realign slurmctld -
> e.g. would stopping slurmctld, deleting the assoc_usage checkpoint file, and
> restarting slurmctld do the desired thing? I discovered an earlier thread in
> which editing this file corrected a similar problem, which seems more
> dangerous.
>
> Thanks for any advice -
>
> Stuart
>
>
> sreport -t hours cluster AccountUtilizationByUser account=a_project start=2014-02-01T00:00:00 end=2015-05-28T00:00:00 | head -7
>
> --------------------------------------------------------------------------------
> Cluster/Account/User Utilization 2014-02-01T00:00:00 - 2015-05-27T23:59:59 (41554800 secs)
> Time reported in CPU Hours
> --------------------------------------------------------------------------------
>   Cluster         Account     Login     Proper Name       Used     Energy
> --------- --------------- --------- --------------- ---------- ----------
>      hpcs        a_project                            1443228          0
>
>
>
> sshare --long -A a_project
>
>              Account       User Raw Shares Norm Shares   Raw Usage  Norm Usage Effectv Usage  FairShare  GrpCPUMins      CPURunMins
> -------------------- ---------- ---------- ----------- ----------- ----------- ------------- ---------- ----------- ---------------
> a_project                              121    0.012100  5026671760    0.007770      0.016086   0.397927   100096860          277306
>
>
> i.e. Raw Usage = 5026671760/3600 = 1396297.7 core hours
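
To put a number on the disparity quoted above, here is a small sketch
that converts the sshare Raw Usage (CPU seconds) to core hours and
compares it with the sreport figure. The two inputs are copied from
the output above. It assumes that no usage decay is configured and
that both tools cover the same period, neither of which the thread
confirms; the variable names are mine.

```python
SECONDS_PER_HOUR = 3600

# Values copied from the sshare and sreport output quoted above.
sshare_raw_usage_cpu_seconds = 5026671760
sreport_core_hours = 1443228

# sshare reports Raw Usage in CPU seconds; convert to core hours.
sshare_core_hours = sshare_raw_usage_cpu_seconds / SECONDS_PER_HOUR

# How far slurmctld's view falls short of slurmdbd's view.
gap_hours = sreport_core_hours - sshare_core_hours
gap_percent = 100.0 * gap_hours / sreport_core_hours

print(f"sshare : {sshare_core_hours:.1f} core hours")
print(f"sreport: {sreport_core_hours} core hours")
print(f"gap    : {gap_hours:.1f} core hours ({gap_percent:.2f}%)")
```

With these inputs slurmctld is short by roughly 46930 core hours, a
little over 3% of what slurmdbd reports.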
