Dear all,
On our cluster with Slurm 2.3.3, the command 'sacctmgr list events' shows a
long list of 'DOWN' events even though all nodes are up and well (except for
one, see below).
All the events date back to a total blackout the whole city experienced,
which led to an abrupt loss of power once the UPS batteries were drained.
Does this have an influence on the accounting (I get messages like 'We have
more time than is possible' in the logs), and how can I clear these events?
Thanks for your help
damien
[root@lm ~]# sacctmgr list events
Cluster    Node Name       Start               End                 State  Reason                         User
---------- --------------- ------------------- ------------------- ------ ------------------------------ ---------------
lmlm                       2012-06-25T13:30:42 Unknown                    Cluster processor count
lm         lmPp001         2013-01-16T11:32:37 2013-01-16T11:51:14 DOWN*  Not responding                 slurm(102)
lm         lmPp001         2013-01-16T11:51:36 2013-01-16T11:52:50 DOWN   (null)
lm         lmPp001         2013-01-16T11:52:50 2013-01-16T12:46:13 DOWN   (null)
lm         lmPp002         2013-01-16T11:32:37 2013-01-16T11:51:36 DOWN*  Not responding                 slurm(102)
lm         lmPp002         2013-01-16T11:51:36 2013-01-16T11:52:00 DOWN*  Not responding                 slurm(102)
lm         lmPp002         2013-01-16T11:52:50 2013-01-16T12:46:13 DOWN   (null)
lm         lmPp003         2013-01-16T11:32:37 2013-01-16T11:51:36 DOWN*  Not responding                 slurm(102)
lm         lmPp003         2013-01-16T11:51:36 2013-01-16T11:52:26 DOWN*  Not responding                 slurm(102)
lm         lmPp003         2013-01-16T11:52:50 2013-01-16T12:46:13 DOWN   (null)
lm         lmWn001         2013-01-16T11:32:37 2013-01-16T11:40:28 DOWN*  Not responding                 slurm(102)
lm         lmWn001         2013-01-16T11:46:49 2013-01-16T12:00:04 DOWN   (null)
lm         lmWn001         2013-01-16T12:00:04 2013-01-16T12:03:56 DOWN   (null)
lm         lmWn001         2013-01-16T12:03:56 2013-01-16T12:46:47 DOWN   (null)
lm         lmWn002         2013-01-16T11:32:37 2013-01-16T11:40:16 DOWN*  Not responding                 slurm(102)
lm         lmWn002         2013-01-16T11:46:49 2013-01-16T12:00:04 DOWN   (null)
[root@lm ~]# sinfo
PARTITION AVAIL TIMELIMIT NODES STATE NODELIST
Def* up 5-00:00:00 1 down* lmWn057
Def* up 5-00:00:00 103 alloc lmWn[001-056,058-104]
Long up 21-00:00:0 8 alloc lmWn[105-112]
PostP up 6:00:00 3 alloc lmPp[001-003]
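
In case it helps to put numbers on this, here is a rough Python sketch that
totals the recorded downtime per node and counts events whose End is still
'Unknown'. The three-column input (node, start, end) is my own simplification
for illustration, not something sacctmgr prints directly, and the function
name is made up; the sample rows are copied from the events listed for
lmPp001 and lmWn002.

```python
from collections import defaultdict
from datetime import datetime

# Hypothetical simplified input: "node start end", one event per line.
# The rows are copied from the sacctmgr listing in this post.
EVENTS = """\
lmPp001 2013-01-16T11:32:37 2013-01-16T11:51:14
lmPp001 2013-01-16T11:51:36 2013-01-16T11:52:50
lmPp001 2013-01-16T11:52:50 2013-01-16T12:46:13
lmWn002 2013-01-16T11:32:37 2013-01-16T11:40:16
lmWn002 2013-01-16T11:46:49 2013-01-16T12:00:04
"""

TIME_FORMAT = "%Y-%m-%dT%H:%M:%S"


def downtime_per_node(text):
    """Return (seconds of closed downtime per node, open event count per node)."""
    totals = defaultdict(int)
    open_events = defaultdict(int)
    for line in text.splitlines():
        fields = line.split()
        if len(fields) != 3:
            continue  # skip blank or malformed lines
        node, start, end = fields
        if end == "Unknown":
            # The event has no end time yet, so it has no duration to add.
            open_events[node] += 1
            continue
        delta = datetime.strptime(end, TIME_FORMAT) - datetime.strptime(start, TIME_FORMAT)
        totals[node] += int(delta.total_seconds())
    return dict(totals), dict(open_events)


totals, open_events = downtime_per_node(EVENTS)
for node in sorted(totals):
    print(node, totals[node], "s down,", open_events.get(node, 0), "open event(s)")
```

For these rows the totals come to a bit over an hour for lmPp001 and about
twenty minutes for lmWn002, with no event left open, so the node events
themselves all appear to be closed.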