On Wed, Feb 15, 2012 at 3:51 AM, Pär Andersson <[email protected]> wrote:
>
> Tim Carlson <[email protected]> writes:
>
>> Any idea how I can go back and possibly fix this mess? Is my
>> assumption about messing with QOS accurate? I could grab the database
>> from last night's backup as a last resort.
>
> I think it is hard to answer questions about the cause without more
> details about how your QOS and associations are configured, and a more
> detailed log of what you did yesterday.
>
> Anyway, I just wanted to point out that another possible solution is to
> reset the usage of just the czt account.
>
> From the sacctmgr man page:
>     RawUsage=<value>
>         This allows an administrator to reset the raw usage accrued to
>         an account. The only value currently supported is 0 (zero).
>         This is a settable specification only - it cannot be used as a
>         filter to list accounts.
>
> Kind regards,
>
> Pär Andersson
> NSC

Thanks Pär,

After some more specific googling for things like assoc_usage, I did just
that, except on the users under the problematic accounts rather than on the
accounts themselves. That seemed to fix things, as the totals on the
accounts adjusted correctly.

Some more specifics of what I did:

1) Create a QOS called "buyin"

   sacctmgr create qos name=buyin priority=1
   sacctmgr modify account where name=czt cluster=olympus set qos=buyin
   sacctmgr modify account where name=regmodel cluster=olympus set defaultqos=buyin

2) Adjust a few parameters in slurm.conf so they are non-zero

   PriorityWeightFairshare=1000
   PriorityWeightAge=500
   PriorityWeightQOS=1000

3) Now tell slurm about the updated configuration

   scontrol reconfigure

That seemed to do what I wanted, and I could see from the output of "sprio"
that the weights and QOS settings were doing what I expected.

Then a few hours later I started getting lots of questions about why jobs
were stuck in the AssociationLimit state. I started poking around and found
that the rawusage numbers were way off. We use "sbank", which queries sshare
for usage info, and that is where I first saw the problem.

Tim
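
P.S. For anyone finding this thread later, here is a rough sketch of the per-user reset mentioned at the top of this message. It only prints the sacctmgr commands rather than running them, so they can be reviewed first. The function name reset_usage_cmds and the user names "alice" and "bob" are hypothetical placeholders (the real users under czt are not listed in this thread), and the assumption is that RawUsage=0 is accepted on a user association the same way the man page describes for accounts.

```shell
#!/bin/sh
# Dry-run sketch: print (do not execute) one sacctmgr command per user
# to reset that user's raw usage under a given account.
# ASSUMPTION: "alice" and "bob" are hypothetical user names.
# ASSUMPTION: RawUsage=0 can be set on a user association, as the
# sacctmgr man page describes for accounts.

reset_usage_cmds() {
    account=$1
    cluster=$2
    shift 2
    for user in "$@"; do
        # The only value RawUsage currently supports is 0 (zero).
        echo "sacctmgr modify user where name=$user account=$account cluster=$cluster set RawUsage=0"
    done
}

reset_usage_cmds czt olympus alice bob
```

If the printed lines look right, they can be run by hand on a host with sacctmgr admin rights, and the account totals checked afterwards with sshare.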
