Hi Arjun,

On Fri, Jun 27, 2014 at 12:25:58AM -0700, Arjun J Rao wrote:

> Have SLURM set up on a cluster of 2 nodes qdr[3-4]
> Running sinfo shows the two nodes to be in a perpetual drain state.
> 
> sinfo -R yields the following :
> REASON               USER      TIMESTAMP           NODELIST
> Epilog error         root      2014-02-03T15:53:40 qdr3
> Epilog error         root      2014-02-03T15:52:42 qdr4
> 
> The epilog error occurred on 3rd February! (More than 4 months ago)
> 
> Why is this happening?

Maybe an obvious question, but have you set the nodes' state to 'resume' or
'idle' using scontrol since then? In our setup at least, once a node is marked
'down', it stays that way until we manually clear it to either 'resume' or
'idle'.
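
For reference, a minimal sketch of what clearing the state could look like,
assuming the node names from your report (qdr[3-4]) and that you run it as root
or the SlurmUser. The resume_nodes helper and the RUN variable are made up for
illustration; by default the script only prints the scontrol command so you can
check it before running anything.

```shell
#!/bin/sh
# Sketch only: clear a stale drain state on the nodes from the report.
# Quoted so the shell does not treat [3-4] as a filename glob.
NODES='qdr[3-4]'

# Hypothetical helper: prints the command by default, runs it when RUN=1.
resume_nodes() {
    if [ "${RUN:-0}" = "1" ]; then
        # Ask slurmctld to return the nodes to service.
        scontrol update "NodeName=$1" State=RESUME
    else
        echo "scontrol update NodeName=$1 State=RESUME"
    fi
}

resume_nodes "$NODES"
```

Once the printed command looks right, run the script again with RUN=1; after
that, 'sinfo -R' should no longer list the two nodes. If the epilog script is
still failing, they will most likely drain again after the next job, so it is
worth checking the epilog itself as well.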

Paddy

-- 
Paddy Doyle
Trinity Centre for High Performance Computing,
Lloyd Building, Trinity College Dublin, Dublin 2, Ireland.
Phone: +353-1-896-3725
http://www.tchpc.tcd.ie/
