Damien,

None of the nodes in your output are currently down. Any event with an end time has already been accounted for. The only events worth looking at are those without an end time, other than nodes that are actually down and the "Cluster processor count" lines.
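As a rough illustration of that check, here is a minimal Python sketch. It assumes the events were dumped in pipe-delimited form, for example with something like `sacctmgr -n -P list events format=Cluster,NodeName,Start,End,State,Reason,User` (please check the option and field names against the sacctmgr man page for your version), and that an open event shows `Unknown` in the End column, as in the listing in this thread. The function names, the sample rows, and the node `lmWn099` are hypothetical and only there for illustration.

```python
from typing import Iterable, List, NamedTuple, Set


class Event(NamedTuple):
    cluster: str
    node: str
    start: str
    end: str
    state: str
    reason: str
    user: str


def parse_events(text: str) -> List[Event]:
    """Parse pipe-delimited event rows (assumed 7 fields per row)."""
    events = []
    for line in text.splitlines():
        if not line.strip():
            continue
        fields = line.split("|")
        if len(fields) != 7:
            # Skip rows that do not match the assumed layout.
            continue
        events.append(Event(*(f.strip() for f in fields)))
    return events


def open_events_to_check(events: Iterable[Event],
                         down_nodes: Set[str] = frozenset()) -> List[Event]:
    """Return events with no end time that are not otherwise explained."""
    result = []
    for ev in events:
        if ev.end not in ("", "Unknown"):
            continue  # has an end time: already accounted for
        if ev.reason == "Cluster processor count":
            continue  # cluster size record, expected to stay open
        if ev.node in down_nodes:
            continue  # node really is down right now
        result.append(ev)
    return result


# Hypothetical sample data, loosely modelled on the listing in this thread.
SAMPLE = """\
lm||2012-06-25T13:30:42|Unknown||Cluster processor count|
lm|lmPp001|2013-01-16T11:32:37|2013-01-16T11:51:14|DOWN*|Not responding|slurm(102)
lm|lmWn057|2013-01-16T11:32:37|Unknown|DOWN*|Not responding|slurm(102)
lm|lmWn099|2013-01-16T11:32:37|Unknown|DOWN*|Not responding|slurm(102)
"""

if __name__ == "__main__":
    # lmWn057 is the node that sinfo reports as down*.
    for ev in open_events_to_check(parse_events(SAMPLE), {"lmWn057"}):
        print(ev.node, ev.start, ev.reason)
```

The set of down nodes is passed in by hand here; it could be filled from the `sinfo` output. A Reason text that itself contains a `|` would break this simple split.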
The extra time you mention is interesting, though. Do you happen to have the slurmdbd message about it?

Thanks
/David

On 01/22/2013 11:56 AM, Damien François wrote:
>
> Dear all,
>
> on our cluster with slurm 2.3.3, the command 'sacctmgr list events' shows a
> long list of 'DOWN' events even though all nodes are up and well (except for
> one, see below).
>
> All the events date back to a total blackout the whole city experienced,
> which led to an abrupt loss of power after the UPS batteries ran empty.
>
> Does this have an influence on the accounting (I get messages like 'We have
> more time than is possible' in the logs), and how can I clear them?
>
> Thanks for your help
>
> damien
>
> [root@lm ~]# sacctmgr list events
> Cluster  Node Name  Start                End                  State  Reason                   User
> -------  ---------  -------------------  -------------------  -----  -----------------------  ----------
> lmlm                2012-06-25T13:30:42  Unknown                     Cluster processor count
> lm       lmPp001    2013-01-16T11:32:37  2013-01-16T11:51:14  DOWN*  Not responding           slurm(102)
> lm       lmPp001    2013-01-16T11:51:36  2013-01-16T11:52:50  DOWN   (null)
> lm       lmPp001    2013-01-16T11:52:50  2013-01-16T12:46:13  DOWN   (null)
> lm       lmPp002    2013-01-16T11:32:37  2013-01-16T11:51:36  DOWN*  Not responding           slurm(102)
> lm       lmPp002    2013-01-16T11:51:36  2013-01-16T11:52:00  DOWN*  Not responding           slurm(102)
> lm       lmPp002    2013-01-16T11:52:50  2013-01-16T12:46:13  DOWN   (null)
> lm       lmPp003    2013-01-16T11:32:37  2013-01-16T11:51:36  DOWN*  Not responding           slurm(102)
> lm       lmPp003    2013-01-16T11:51:36  2013-01-16T11:52:26  DOWN*  Not responding           slurm(102)
> lm       lmPp003    2013-01-16T11:52:50  2013-01-16T12:46:13  DOWN   (null)
> lm       lmWn001    2013-01-16T11:32:37  2013-01-16T11:40:28  DOWN*  Not responding           slurm(102)
> lm       lmWn001    2013-01-16T11:46:49  2013-01-16T12:00:04  DOWN   (null)
> lm       lmWn001    2013-01-16T12:00:04  2013-01-16T12:03:56  DOWN   (null)
> lm       lmWn001    2013-01-16T12:03:56  2013-01-16T12:46:47  DOWN   (null)
> lm       lmWn002    2013-01-16T11:32:37  2013-01-16T11:40:16  DOWN*  Not responding           slurm(102)
> lm       lmWn002    2013-01-16T11:46:49  2013-01-16T12:00:04  DOWN   (null)
>
> [root@lm ~]# sinfo
> PARTITION  AVAIL  TIMELIMIT   NODES  STATE  NODELIST
> Def*       up     5-00:00:00      1  down*  lmWn057
> Def*       up     5-00:00:00    103  alloc  lmWn[001-056,058-104]
> Long       up     21-00:00:0      8  alloc  lmWn[105-112]
> PostP      up     6:00:00         3  alloc  lmPp[001-003]
