Dear all, 

On our cluster with Slurm 2.3.3, the command 'sacctmgr list events' shows a long 
list of 'DOWN' events even though all nodes are up and well (except for one, 
see below).

All the events date back to a total blackout the whole city experienced, 
which led to an abrupt loss of power once the UPS batteries ran empty.

Does this have an influence on the accounting (I get messages like 'We have 
more time than is possible' in the logs), and how can I clear these events?

Thanks for your help

damien

[root@lm ~]# sacctmgr list events
   Cluster       Node Name               Start                 End  State                         Reason            User
---------- --------------- ------------------- ------------------- ------ ------------------------------ ---------------
      lmlm                 2012-06-25T13:30:42             Unknown               Cluster processor count
        lm         lmPp001 2013-01-16T11:32:37 2013-01-16T11:51:14  DOWN*                 Not responding      slurm(102)
        lm         lmPp001 2013-01-16T11:51:36 2013-01-16T11:52:50   DOWN                         (null)
        lm         lmPp001 2013-01-16T11:52:50 2013-01-16T12:46:13   DOWN                         (null)
        lm         lmPp002 2013-01-16T11:32:37 2013-01-16T11:51:36  DOWN*                 Not responding      slurm(102)
        lm         lmPp002 2013-01-16T11:51:36 2013-01-16T11:52:00  DOWN*                 Not responding      slurm(102)
        lm         lmPp002 2013-01-16T11:52:50 2013-01-16T12:46:13   DOWN                         (null)
        lm         lmPp003 2013-01-16T11:32:37 2013-01-16T11:51:36  DOWN*                 Not responding      slurm(102)
        lm         lmPp003 2013-01-16T11:51:36 2013-01-16T11:52:26  DOWN*                 Not responding      slurm(102)
        lm         lmPp003 2013-01-16T11:52:50 2013-01-16T12:46:13   DOWN                         (null)
        lm         lmWn001 2013-01-16T11:32:37 2013-01-16T11:40:28  DOWN*                 Not responding      slurm(102)
        lm         lmWn001 2013-01-16T11:46:49 2013-01-16T12:00:04   DOWN                         (null)
        lm         lmWn001 2013-01-16T12:00:04 2013-01-16T12:03:56   DOWN                         (null)
        lm         lmWn001 2013-01-16T12:03:56 2013-01-16T12:46:47   DOWN                         (null)
        lm         lmWn002 2013-01-16T11:32:37 2013-01-16T11:40:16  DOWN*                 Not responding      slurm(102)
        lm         lmWn002 2013-01-16T11:46:49 2013-01-16T12:00:04   DOWN                         (null)

[root@lm ~]# sinfo
PARTITION AVAIL  TIMELIMIT  NODES  STATE NODELIST
Def*         up 5-00:00:00      1  down* lmWn057
Def*         up 5-00:00:00    103  alloc lmWn[001-056,058-104]
Long         up 21-00:00:0      8  alloc lmWn[105-112]
PostP        up    6:00:00      3  alloc lmPp[001-003]
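
In case it helps to judge how much downtime these records represent, here is a minimal Python sketch that adds up the closed DOWN intervals per node. The rows are hard-coded from the 'sacctmgr list events' listing above (only lmPp001 and lmWn002, to keep it short); the names EVENTS, FMT and downtime_per_node are made up for this illustration and are not part of Slurm. It only sums the Start/End pairs as printed and says nothing about how slurmdbd itself rolls up the accounting.

```python
from collections import defaultdict
from datetime import datetime

# (node, start, end) tuples copied from the event listing above.
# Only two nodes are included to keep the example short.
EVENTS = [
    ("lmPp001", "2013-01-16T11:32:37", "2013-01-16T11:51:14"),
    ("lmPp001", "2013-01-16T11:51:36", "2013-01-16T11:52:50"),
    ("lmPp001", "2013-01-16T11:52:50", "2013-01-16T12:46:13"),
    ("lmWn002", "2013-01-16T11:32:37", "2013-01-16T11:40:16"),
    ("lmWn002", "2013-01-16T11:46:49", "2013-01-16T12:00:04"),
]

# Timestamp layout used in the Start and End columns.
FMT = "%Y-%m-%dT%H:%M:%S"


def downtime_per_node(events):
    """Return {node: total seconds} summed over all closed events."""
    totals = defaultdict(int)
    for node, start, end in events:
        delta = datetime.strptime(end, FMT) - datetime.strptime(start, FMT)
        totals[node] += int(delta.total_seconds())
    return dict(totals)


if __name__ == "__main__":
    for node, seconds in sorted(downtime_per_node(EVENTS).items()):
        print(f"{node}: {seconds} s down")
```

Every DOWN row in the listing has an End timestamp, so each one describes a closed interval during the outage rather than a node that is still down.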
