Hi, Husen --

If you did not set the node to DRAIN yourself, it may be worth trying to
understand what caused this to happen. Here I defer to those with more
experience.
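
The REASON column in the 'sinfo -R' output quoted below is cut off at 20
characters ("batch job complete f"), which matches the sinfo man page's note
that -R shows only the first 20 characters of the reason. A minimal sketch
for seeing the full reason, assuming the node name head-node from that
output and that the Slurm client commands are on your PATH:

```shell
# Node name taken from the sinfo output quoted below (assumption: adjust as needed).
NODE=head-node
# By default 'sinfo -R' truncates REASON to 20 characters; widen that column.
REASON_FMT="%60E %9u %19H %N"
if command -v sinfo >/dev/null 2>&1; then
    # Full drain reason, the user who set it, and when it was set
    sinfo -R -n "$NODE" -o "$REASON_FMT"
    # scontrol also reports the reason for the node's current state
    scontrol show node "$NODE" | grep -i 'reason'
fi
```

The 60-character width is arbitrary; any width long enough for the reason
string will do.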

From an operational perspective, the node state is set through scontrol;
see slurm.schedmd.com/scontrol.html. There are examples at the end of the
page, and there are probably many experienced people who can provide
detailed help if the man page doesn't offer clear guidance.
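
For this case, returning the node to service would look roughly like the
following. It is a sketch, not a recommendation: it assumes the node name
head-node from the output quoted below and root or SlurmUser privileges.
If the underlying cause is not fixed, Slurm may drain the node again.

```shell
# Node name taken from the sinfo output quoted below (assumption: adjust as needed).
NODE=head-node
if command -v scontrol >/dev/null 2>&1; then
    # Return a drained node to service; requires root or SlurmUser privileges.
    scontrol update NodeName="$NODE" State=RESUME
    # Check the node's state after the update
    sinfo -n "$NODE"
fi
# To drain a node deliberately, scontrol requires a Reason, for example:
#   scontrol update NodeName=head-node State=DRAIN Reason="maintenance"
```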

Cheers,
~ Emily


----------------------------------
E.M. Dragowsky, Ph.D.
ITS -- Research Computing
Case Western Reserve University
(216) 368-0082

On Fri, Apr 8, 2016 at 9:41 AM, Husen R wrote:

> Hi Emily,
>
> Thank you for the information.
>
> How can I keep the node from going into the DRAIN state?
> I didn't set its state to DRAIN.
>
> Thank you in advance
>
> Regards,
>
> Husen
>
>
>
> On Fri, Apr 8, 2016 at 8:16 PM, E.M. Dragowsky wrote:
>
>> Hi, Husen --
>>
>> The DRAIN state means the node is not available for new jobs, at least as
>> far as I understand from the documentation describing scontrol:
>>
>> If you want to remove a node from service, you typically want to set its
>> state to "DRAIN".
>>
>> Cheers,
>> ~ Emily
>>
>> On Fri, Apr 8, 2016 at 8:47 AM, Husen R wrote:
>>
>>> Hello Remi,
>>>
>>> Thank you for your reply.
>>>
>>> Here is the output of 'sinfo' and 'sinfo -R', respectively:
>>>
>>> pro@head-node:~$ sinfo
>>> PARTITION AVAIL  TIMELIMIT  NODES  STATE NODELIST
>>> comeon*      up      30:00      1  drain head-node
>>> pro@head-node:~$ sinfo -R
>>> REASON               USER      TIMESTAMP           NODELIST
>>> batch job complete f root      2016-04-08T16:16:38 head-node
>>>
>>> The state of my node is drain. I don't understand why the resources are
>>> not available; I am not currently running any resource-hungry application
>>> on that node.
>>>
>>> Regards,
>>>
>>>
>>> Husen
>>>
>>>
>>> On Fri, Apr 8, 2016 at 7:23 PM, Rémi Palancher wrote:
>>>
>>>>
>>>> On 08/04/2016 13:39, Husen R wrote:
>>>>
>>>>> [...]
>>>>> pro@head-node:/mirror/source$ squeue
>>>>>              JOBID PARTITION     NAME     USER ST       TIME  NODES NODELIST(REASON)
>>>>>                 70    comeon   MatMul      pro PD       0:00      1 (Resources)
>>>>>                 71    comeon   MatMul      pro PD       0:00      1 (Resources)
>>>>>                 72    comeon   MatMul      pro PD       0:00      1 (Resources)
>>>>>
>>>>
>>>> In the last column, squeue gives you the reason why the jobs are
>>>> pending. "Resources" means there are not enough resources available to
>>>> run the jobs.
>>>>
>>>> Check the state of your nodes using `sinfo`.
>>>>
>>>> Best,
>>>> Rémi
>>>>
>>>
>>>
>>
>
