I ended up stopping the slurmctld daemons and re-writing the assoc_usage checkpoint file with the "corrected" usage info based on what sacct says has run. I restarted slurmctld and it seems to have picked up the updated usage.
Probably not a great general solution, since the documentation for the file format came from the source code and the format is likely to change without notice. It did get things back in sync though. I'll have to watch more closely to see if I can tell where things begin to go wrong.

-----
Gary Skouson

-----Original Message-----
From: Lipari, Don [mailto:[email protected]]
Sent: Monday, March 17, 2014 12:24 PM
To: slurm-dev
Subject: [slurm-dev] RE: sshare and sacct

> -----Original Message-----
> From: Skouson, Gary B [mailto:[email protected]]
> Sent: Monday, March 17, 2014 11:48 AM
> To: slurm-dev
> Subject: [slurm-dev] RE: sshare and sacct
>
> Thanks.
>
> We did start with 0 usage on the accounts I'm looking at and we have the
> share set to not decay or reset. For some user/account associations, we
> have identical usage between sacct and sshare. For others, sshare shows
> significantly less usage than sacct info. I'm not sure what caused the
> difference.

I can't think of a reason for the associations that show a discrepancy. Perhaps increasing the debug levels and adding Priority to your DebugFlags would shed light.

> I was hoping I could update the sshare info to match what sacct says, but
> I could only see how to reset share usage to 0 from the info I could find.

Resetting RawUsage to zero is all that is currently possible. Support for non-zero values was never implemented.

Don

> -----
> Gary Skouson
>
>
> -----Original Message-----
> From: Lipari, Don [mailto:[email protected]]
> Sent: Monday, March 17, 2014 8:47 AM
> To: slurm-dev
> Subject: [slurm-dev] RE: sshare and sacct
>
> Gary,
>
> The sacct command retrieves job and job step records from the slurmdb and
> reports the statistics for the requested job(s).
>
> The sshare command provides the basis for the fair-scheduling component of
> the multi-factor plugin. sshare lists the two components (shares and
> usage) which are used to calculate the fair share factor for each user and
> account. By default, one of the slurm.conf parameters which affect this
> calculation (PriorityDecayHalfLife) is set to a 7 day decay. That means
> that whatever raw usage appears in the sshare report, it is bound to be
> less over time (in the absence of any more running jobs).
>
> So, it is not a surprise that there would be a discrepancy between the
> usages reported by sacct and sshare. If you set PriorityDecayHalfLife to
> not decay (zero), and if you started with zero usage, the usage numbers of
> sacct and sreport should track until the PriorityUsageResetPeriod limit
> was reached. At that point, the raw usage value would be reset to zero.
>
> Don
>
> > -----Original Message-----
> > From: Skouson, Gary B [mailto:[email protected]]
> > Sent: Friday, March 14, 2014 3:52 PM
> > To: slurm-dev
> > Subject: [slurm-dev] sshare and sacct
> >
> >
> > We started using sshare to enforce limits on usage, and it seems that
> > sshare is getting confused about actual usage.
> >
> > If I use sacct to check the usage for an account, I get different numbers
> > than sshare reports for the same account.
> >
> > Is there a way to "fix" sshare to reflect the usage found from sacct?
> >
> > I can see that I can reset the share usage to 0, but that's the only value
> > allowed at the moment. Is there some other way to set the rawusage to fix
> > sshare to reflect reality?
> >
> > -----
> > Gary Skouson
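
For anyone reading this thread later: the half-life decay Don describes in his first reply can be illustrated with a rough sketch. This is only a simplified closed-form model, not Slurm's actual code (slurmctld applies decay incrementally as it recalculates priorities). The function name decayed_usage and the sample figures are made up for illustration. It shows why sshare's raw usage would fall below the sacct total when the default 7 day half-life is in effect, and why the two should agree when decay is turned off.

```python
def decayed_usage(raw_usage, elapsed_days, half_life_days=7.0):
    """Return the usage left after elapsed_days with no new jobs running.

    Simplified model: usage halves every half_life_days.
    A half_life_days of 0 is treated as "no decay", mirroring the
    PriorityDecayHalfLife=0 setting discussed in the thread.
    """
    if half_life_days == 0:
        return raw_usage
    return raw_usage * 0.5 ** (elapsed_days / half_life_days)


if __name__ == "__main__":
    recorded = 1_000_000.0  # hypothetical CPU-seconds, as summed from sacct
    for days in (0, 7, 14, 28):
        with_decay = decayed_usage(recorded, days)
        no_decay = decayed_usage(recorded, days, half_life_days=0)
        print(f"day {days:2d}: decayed={with_decay:12.1f}  no decay={no_decay:12.1f}")
```

With decay disabled, as in Gary's setup, this model predicts no gap at all, which is why the discrepancy he saw on some associations is not explained by decay.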
