I didn't mark the node as "drained". But after issuing the command scontrol update NodeName="qdr3" State="IDLE", sinfo showed both nodes to be idle and usable. I was also able to execute MPI jobs.
Thanks.

On Fri, Jun 27, 2014 at 2:35 PM, Paddy Doyle <[email protected]> wrote:

> Hi Arjun,
>
> On Fri, Jun 27, 2014 at 12:25:58AM -0700, Arjun J Rao wrote:
> > Have SLURM set up on a cluster of 2 nodes qdr[3-4]
> > Running sinfo shows the two nodes to be in a perpetual drain state.
> >
> > sinfo -R yields the following :
> > REASON        USER  TIMESTAMP            NODELIST
> > Epilog error  root  2014-02-03T15:53:40  qdr3
> > Epilog error  root  2014-02-03T15:52:42  qdr4
> >
> > The epilog error occurred on 3rd February! (More than 4 months ago)
> >
> > Why is this happening ?
>
> Maybe an obvious question, but have you set the nodes to be 'resume' or 'idle'
> using scontrol since then? In our setup at least, once a node is marked 'down',
> we have to manually clear it to either 'resume' or 'idle'.
>
> Paddy
>
> --
> Paddy Doyle
> Trinity Centre for High Performance Computing,
> Lloyd Building, Trinity College Dublin, Dublin 2, Ireland.
> Phone: +353-1-896-3725
> http://www.tchpc.tcd.ie/
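
For reference, the manual clearing step discussed in this thread can be sketched as a small dry-run script. This is only an illustration: the helper name resume_cmd is made up for the example, the node names are the ones from this thread, and the script prints the scontrol commands instead of running them, so nothing changes on the cluster until the printed lines are reviewed and run by hand as root.

```shell
#!/bin/sh
# Dry-run sketch: print (do not execute) the scontrol command that would
# return a drained node to service. resume_cmd is a hypothetical helper
# name, not part of SLURM.
resume_cmd() {
    # $1 = node name, e.g. qdr3
    printf 'scontrol update NodeName=%s State=RESUME\n' "$1"
}

# Node names taken from this thread; adjust for your own cluster.
for node in qdr3 qdr4; do
    resume_cmd "$node"
done
```

Running sinfo -R before and after shows whether the old "Epilog error" reason has been cleared.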
