My mistake! I was looking at the wrong figure/column:

> I have logs where the reported
> d_cpu matches the total number of CPU-seconds for an hour during the hourly
> rollup, but sometimes the number is higher and sometimes lower.

No, d_cpu did not quite match the total number of CPU-seconds on our resource after all. Rather, it was total_time that matched, which is correct and does not change even when nodes are down. There were times when d_cpu did match total_time, however, e.g.,

(921600+235929600+0)(236851200) > 235929600

where 235929600 is the correct number of CPU-seconds for our system in an hour.

After I had resolved the problematic job records, the complaints seemed to go away, but they came back a few days later:

[2014-09-08T13:00:02.350] (223572256+14745600+0)(238317856) > 235929600
[2014-09-08T15:00:02.191] (229063568+14745600+0)(243809168) > 235929600
[2014-09-08T18:00:02.389] (221913328+14745600+0)(236658928) > 235929600
[2014-09-08T20:00:02.007] (227854160+14745600+0)(242599760) > 235929600
[2014-09-09T15:00:03.017] (225833760+14745600+0)(240579360) > 235929600
[2014-09-09T16:00:02.270] (225181472+14745600+0)(239927072) > 235929600
[2014-09-09T17:00:02.660] (222904176+14745600+0)(237649776) > 235929600
[2014-09-09T18:00:02.057] (231175072+14745600+0)(245920672) > 235929600
[2014-09-09T19:00:02.440] (225419680+14745600+0)(240165280) > 235929600
[2014-09-09T21:00:01.985] (222188384+14745600+0)(236933984) > 235929600
[2014-09-09T22:00:02.480] (222496064+14745600+0)(237241664) > 235929600
[2014-09-09T23:00:02.827] (225304240+14745600+0)(240049840) > 235929600
[2014-09-10T00:00:02.130] (230062480+14745600+0)(244808080) > 235929600
[2014-09-10T01:00:02.737] (231246000+14745600+0)(245991600) > 235929600
[2014-09-10T02:00:02.442] (226591488+14745600+0)(241337088) > 235929600

All jobs with state>1 have time_start values, and none have a time_end of 0, so I don't know what else could be amiss in the job table.

Regards
Jeff
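For anyone wanting to check the arithmetic in lines like the ones above, here is a minimal Python sketch. It assumes each line has the shape `[timestamp] (x+y+z)(sum) > total`, as in the excerpts quoted here. The helper name `parse_rollup_line` and the dictionary keys are made up for illustration and are not part of any Slurm tool. The script only verifies that the three terms add up to the reported sum and converts the overshoot from CPU-seconds per hour into an equivalent CPU count.

```python
import re

# Pattern for the excerpted lines: "[ts] (x+y+z)(sum) > total".
# This shape is assumed from the quoted log lines only.
LINE_RE = re.compile(
    r"\[(?P<ts>[^\]]+)\]\s*"
    r"\((?P<first>\d+)\+(?P<second>\d+)\+(?P<third>\d+)\)"
    r"\((?P<sum>\d+)\)\s*>\s*(?P<total>\d+)"
)

SECONDS_PER_HOUR = 3600


def parse_rollup_line(line):
    """Hypothetical helper: parse one line, or return None if it does not match."""
    m = LINE_RE.search(line)
    if m is None:
        return None
    first, second, third, reported, total = (
        int(m.group(k)) for k in ("first", "second", "third", "sum", "total")
    )
    excess = reported - total
    return {
        "timestamp": m.group("ts"),
        "terms": (first, second, third),
        "reported_sum": reported,
        "total_time": total,
        # Do the three terms really add up to the sum printed in the log?
        "sum_ok": first + second + third == reported,
        # Overshoot in CPU-seconds, and as CPUs busy for the whole hour.
        "excess_cpu_seconds": excess,
        "excess_cpus": excess / SECONDS_PER_HOUR,
    }


sample = "[2014-09-08T13:00:02.350] (223572256+14745600+0)(238317856) > 235929600"
rec = parse_rollup_line(sample)
print(rec["timestamp"], rec["sum_ok"], rec["excess_cpu_seconds"])
```

Dividing by 3600 shows that 235929600 corresponds to 65536 CPUs for a full hour and that the constant second term, 14745600, corresponds to 4096 CPUs, which may help narrow down which nodes or records account for the surplus.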
