Hi, Husen --

If you did not set the node to DRAIN, then it may be worthwhile trying to understand what conditions caused this to happen. Here I defer to those with more experience.
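
As a starting point, here is a minimal sketch of how one might look up the full reason recorded for the drained node, and what the command to return it to service would look like. It assumes the node name head-node from the sinfo output quoted below, and that scontrol is run by root or the Slurm administrator. The resume_cmd helper is hypothetical, added only so the command can be reviewed before it is run; it is not part of Slurm.

```shell
#!/bin/sh
# Node name taken from the sinfo output quoted in this thread.
NODE="head-node"

# Hypothetical helper (not part of Slurm): print the scontrol command
# that would return a drained node to service, without executing it.
resume_cmd() {
    printf 'scontrol update NodeName=%s State=RESUME\n' "$1"
}

if command -v scontrol >/dev/null 2>&1; then
    # Show the Reason field recorded for the node; sinfo -R may truncate it.
    scontrol show node "$NODE" | grep -i 'Reason'
else
    echo "scontrol not found; run this on a host with Slurm installed" >&2
fi

# Print the resume command for review rather than running it blindly.
resume_cmd "$NODE"
```

Resuming the node only clears the state; if the underlying cause is still present, the node may well be drained again, so it seems sensible to read the reason first.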
From an operational perspective -- setting the state is done through scontrol. See slurm.schedmd.com/scontrol.html. There are examples at the end of the page, and probably many experienced people who can provide detailed help if the man page doesn't offer clear guidance.

Cheers,
~ Emily

----------------------------------
E.M. Dragowsky, Ph.D.
ITS -- Research Computing
Case Western Reserve University
(216) 368-0082

On Fri, Apr 8, 2016 at 9:41 AM, Husen R <[email protected]> wrote:

> Hi Emily,
>
> Thank you for the information.
>
> How can I keep the node from entering the DRAIN state?
> I didn't set its state to DRAIN.
>
> Thank you in advance
>
> Regards,
>
> Husen
>
> On Fri, Apr 8, 2016 at 8:16 PM, E.M. Dragowsky <[email protected]> wrote:
>
>> Hi, Husen --
>>
>> The DRAIN state means the node is not available for jobs, at least as far
>> as I understand from the documentation describing scontrol:
>>
>> If you want to remove a node from service, you typically want to set its
>> state to "DRAIN".
>>
>> Cheers,
>> ~ Emily
>>
>> ----------------------------------
>> E.M. Dragowsky, Ph.D.
>> ITS -- Research Computing
>> Case Western Reserve University
>> (216) 368-0082
>>
>> On Fri, Apr 8, 2016 at 8:47 AM, Husen R <[email protected]> wrote:
>>
>>> Hello Remi,
>>>
>>> Thank you for your reply.
>>>
>>> Here is the output of 'sinfo' and 'sinfo -R' respectively:
>>>
>>> pro@head-node:~$ sinfo
>>> PARTITION AVAIL  TIMELIMIT  NODES  STATE NODELIST
>>> comeon*      up      30:00      1  drain head-node
>>> pro@head-node:~$ sinfo -R
>>> REASON               USER      TIMESTAMP           NODELIST
>>> batch job complete f root      2016-04-08T16:16:38 head-node
>>>
>>> The state of my node is drain. I don't understand why the resources are
>>> not available. Currently, I am not running any resource-hungry
>>> application on that node.
>>>
>>> Regards,
>>>
>>> Husen
>>>
>>> On Fri, Apr 8, 2016 at 7:23 PM, Rémi Palancher <[email protected]> wrote:
>>>
>>>> On 08/04/2016 13:39, Husen R wrote:
>>>>
>>>>> [...]
>>>>> pro@head-node:/mirror/source$ squeue
>>>>>  JOBID PARTITION     NAME     USER ST       TIME  NODES NODELIST(REASON)
>>>>>     70    comeon   MatMul      pro PD       0:00      1 (Resources)
>>>>>     71    comeon   MatMul      pro PD       0:00      1 (Resources)
>>>>>     72    comeon   MatMul      pro PD       0:00      1 (Resources)
>>>>
>>>> In the last column, squeue gives you the reason why the jobs are
>>>> pending. "Resources" means there are not enough resources available to
>>>> run the jobs.
>>>>
>>>> Check the state of your nodes using `sinfo`.
>>>>
>>>> Best,
>>>> Rémi
