On Wed, Feb 15, 2012 at 3:51 AM, Pär Andersson <[email protected]> wrote:
>
> Tim Carlson <[email protected]> writes:
>
>> Any idea how I can go back and possibly fix this mess? Is my
>> assumption about messing with QOS accurate? I could grab the database
>> from last night's backup as a last resort.
>
> I think it is hard to answer questions about the cause without more
> details about how your QOS and associations are configured, and a more
> detailed log of what you did yesterday.
>
> Anyway, I just wanted to point out that another possible solution is to
> reset the usage of just the czt account.
>
> From the sacctmgr man page:
> RawUsage=<value>
>        This allows an administrator to reset the raw usage accrued to
>        an account.  The only value currently supported is 0 (zero).
>        This is a settable specification only - it cannot be used as a
>        filter to list accounts.
>
> Kind regards,
>
> Pär Andersson
> NSC

Thanks Pär,

After some more specific googling for things like assoc_usage, I did
just that, except that I reset the usage of the users under the
problematic accounts rather than the accounts themselves. That seemed
to fix things, as the totals on the accounts adjusted correctly.
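
For anyone who needs to do the same, here is a minimal sketch of a
per-user reset. It is not the exact command that was run: the user
names (alice, bob) are hypothetical placeholders, the helper name
reset_user_usage is made up for illustration, and it assumes that
RawUsage=0 is accepted for user associations on your Slurm version.
The helper only prints the sacctmgr commands so they can be reviewed
before anything is changed.

```shell
# Dry-run helper (hypothetical name): prints one sacctmgr command per user
# instead of executing it, so the list can be checked first.
# "-i" tells sacctmgr to commit without an interactive confirmation.
reset_user_usage() {
    account="$1"
    shift
    for user in "$@"; do
        echo "sacctmgr -i modify user where name=$user account=$account cluster=olympus set RawUsage=0"
    done
}

# alice and bob are placeholder user names, not users from this thread.
reset_user_usage czt alice bob
```

Once the printed commands look right, they can be pasted into a shell
or piped to "sh" on a host with admin access to the accounting
database.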

Some more specifics of what I did:

1) Create a QOS called "buyin"

sacctmgr create qos name=buyin priority=1
sacctmgr modify account where name=czt cluster=olympus set qos=buyin
sacctmgr modify account where name=regmodel cluster=olympus set defaultqos=buyin

2) Adjust a few parameters in slurm.conf so they are non-zero

PriorityWeightFairshare=1000
PriorityWeightAge=500
PriorityWeightQOS=1000

3) Now tell slurm about the updated configuration

scontrol reconfigure

That seemed to do what I wanted, and I could see from the output of
"sprio" that the weights and QOS settings were doing what I expected.
Then a few hours later I started getting lots of questions about why
jobs were stuck in the AssociationLimit state. I started poking around
and found that the rawusage numbers were way off. We use "sbank", which
queries sshare for usage info, and that is where I first saw the problem.
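
The checks described above can be sketched roughly as follows. This is
an assumed set of read-only commands, not a record of what was actually
typed: the wrapper name check_usage is hypothetical, and the squeue
format string is just one reasonable choice of columns.

```shell
# Read-only diagnostics (hypothetical wrapper name). Nothing here modifies
# the accounting database; it only reports weights, usage and pending jobs.
check_usage() {
    account="$1"
    if ! command -v sshare >/dev/null 2>&1; then
        echo "sshare not found; run this on a Slurm host"
        return 1
    fi
    # Configured priority weights (Fairshare, Age, QOS, ...).
    sprio -w
    # Long listing of shares and RawUsage for all users under the account.
    sshare -a -l -A "$account"
    # Pending jobs for the account: job id, user, account, pending reason.
    squeue -t PD -A "$account" -o "%.10i %.9u %.10a %r"
}
```

Running "check_usage czt" on the cluster shows the per-user usage next
to the pending reason of each job, which makes an inflated usage value
easier to spot.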

Tim
