Hi Arjun,

On Fri, Jun 27, 2014 at 12:25:58AM -0700, Arjun J Rao wrote:
> Have SLURM set up on a cluster of 2 nodes qdr[3-4]
> Running sinfo shows the two nodes to be in a perpetual drain state.
>
> sinfo -R yields the following :
> REASON               USER      TIMESTAMP           NODELIST
> Epilog error         root      2014-02-03T15:53:40 qdr3
> Epilog error         root      2014-02-03T15:52:42 qdr4
>
> The epilog error occurred on 3rd February! (More than 4 months ago)
>
> Why is this happening ?

Maybe an obvious question, but have you set the nodes to 'resume' or 'idle'
using scontrol since then?

In our setup at least, once a node is marked 'down', we have to manually clear
it to either 'resume' or 'idle'.

Paddy

--
Paddy Doyle
Trinity Centre for High Performance Computing,
Lloyd Building, Trinity College Dublin, Dublin 2, Ireland.
Phone: +353-1-896-3725
http://www.tchpc.tcd.ie/
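
P.S. In case it helps, here is a rough sketch of the manual clear mentioned above. It assumes the node list qdr[3-4] from your report and that you run the resulting command as root (or as the SlurmUser). The helper name resume_cmd is made up for illustration; it only prints the scontrol command so you can check it before running it.

```shell
# Hypothetical helper: print (do not run) the scontrol command that asks
# slurmctld to return the given nodes to service.
resume_cmd() {
    printf 'scontrol update NodeName=%s State=RESUME\n' "$1"
}

# Node list taken from the sinfo -R output in the original report.
resume_cmd 'qdr[3-4]'

# Review the printed line, then run it by hand as root.
# Afterwards, check with `sinfo -R` whether the nodes are still listed.
```

If the epilog script still fails when the next job finishes, I would expect the nodes to drain again, so it is probably worth checking the epilog itself as well.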
