Damien,

None of the nodes in your output are currently down.  Any event with an 
end time has already been accounted for.  The only events worth looking 
at are those without an end time (other than nodes that are actually 
down, or the "Cluster processor count" lines).
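
If the list is long, a small filter can apply that rule for you. The following is only a sketch, not something shipped with SLURM: it assumes you can get the events as pipe-delimited text with a header row containing "NodeName" and "End" columns (for example via sacctmgr's parsable output, which I have not checked on 2.3.3), and that open events show "Unknown" as their end time, as in your first line. The function name, the known_down parameter, and the last two sample rows are made up for illustration.

```python
def open_node_events(parsable_text, known_down=frozenset()):
    """Return node events that have no end time and are not explained.

    Assumes pipe-delimited text whose first line is a header with at
    least the columns "NodeName" and "End" (an assumption about the
    output format, not a documented guarantee).
    """
    lines = [ln for ln in parsable_text.splitlines() if ln.strip()]
    if not lines:
        return []
    header = lines[0].split("|")
    events = [dict(zip(header, ln.split("|"))) for ln in lines[1:]]
    suspicious = []
    for event in events:
        node = event.get("NodeName", "")
        if event.get("End") != "Unknown":
            continue  # has an end time: already accounted for
        if not node:
            continue  # cluster-wide line, e.g. "Cluster processor count"
        if node in known_down:
            continue  # node really is down, so an open event is expected
        suspicious.append(event)
    return suspicious


# First two rows follow the posted output; the last two are hypothetical.
SAMPLE = """Cluster|NodeName|Start|End|State|Reason|User
lm||2012-06-25T13:30:42|Unknown||Cluster processor count|
lm|lmPp001|2013-01-16T11:32:37|2013-01-16T11:51:14|DOWN*|Not responding|slurm(102)
lm|lmWn057|2013-01-20T09:00:00|Unknown|DOWN*|Not responding|slurm(102)
lm|lmWn010|2013-01-16T11:32:37|Unknown|DOWN|(null)|
"""

for event in open_node_events(SAMPLE, known_down={"lmWn057"}):
    print(event["NodeName"], event["Start"], event["State"])
```

Pass the nodes that sinfo reports as down in known_down; anything the function still returns is an open event on a node that should be up.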

The extra time is interesting, though.  Do you happen to have the exact 
slurmdbd message about it?

Thanks

/David

On 01/22/2013 11:56 AM, Damien François wrote:
>
> Dear all,
>
> on our cluster with slurm 2.3.3, the command 'sacctmgr list events' shows a 
> long list of 'DOWN' events even though all nodes are up and well (except for 
> one, see below).
>
> All the events date back to a total blackout the whole city experienced, 
> which led to an abrupt loss of power after the UPS batteries ran out.
>
> Does this have an influence on the accounting (I get messages like 'We have 
> more time than is possible' in the logs), and how can I clear them?
>
> Thanks for your help
>
> damien
>
> [root@lm ~]# sacctmgr list events
>    Cluster       Node Name               Start                 End  State                         Reason            User
> ---------- --------------- ------------------- ------------------- ------ ------------------------------ ---------------
>       lmlm                 2012-06-25T13:30:42             Unknown               Cluster processor count
>         lm lmPp001         2013-01-16T11:32:37 2013-01-16T11:51:14  DOWN*                 Not responding      slurm(102)
>         lm lmPp001         2013-01-16T11:51:36 2013-01-16T11:52:50   DOWN                         (null)
>         lm lmPp001         2013-01-16T11:52:50 2013-01-16T12:46:13   DOWN                         (null)
>         lm lmPp002         2013-01-16T11:32:37 2013-01-16T11:51:36  DOWN*                 Not responding      slurm(102)
>         lm lmPp002         2013-01-16T11:51:36 2013-01-16T11:52:00  DOWN*                 Not responding      slurm(102)
>         lm lmPp002         2013-01-16T11:52:50 2013-01-16T12:46:13   DOWN                         (null)
>         lm lmPp003         2013-01-16T11:32:37 2013-01-16T11:51:36  DOWN*                 Not responding      slurm(102)
>         lm lmPp003         2013-01-16T11:51:36 2013-01-16T11:52:26  DOWN*                 Not responding      slurm(102)
>         lm lmPp003         2013-01-16T11:52:50 2013-01-16T12:46:13   DOWN                         (null)
>         lm lmWn001         2013-01-16T11:32:37 2013-01-16T11:40:28  DOWN*                 Not responding      slurm(102)
>         lm lmWn001         2013-01-16T11:46:49 2013-01-16T12:00:04   DOWN                         (null)
>         lm lmWn001         2013-01-16T12:00:04 2013-01-16T12:03:56   DOWN                         (null)
>         lm lmWn001         2013-01-16T12:03:56 2013-01-16T12:46:47   DOWN                         (null)
>         lm lmWn002         2013-01-16T11:32:37 2013-01-16T11:40:16  DOWN*                 Not responding      slurm(102)
>         lm lmWn002         2013-01-16T11:46:49 2013-01-16T12:00:04   DOWN                         (null)
>
> [root@lm ~]# sinfo
> PARTITION AVAIL  TIMELIMIT  NODES  STATE NODELIST
> Def*         up 5-00:00:00      1  down* lmWn057
> Def*         up 5-00:00:00    103  alloc lmWn[001-056,058-104]
> Long         up 21-00:00:0      8  alloc lmWn[105-112]
> PostP        up    6:00:00      3  alloc lmPp[001-003]
