On 21 September 2017 at 17:55, Fabrice Nininahazwe <nifawil...@gmail.com>
wrote:

>
> Dear developer,
>
> I have found that some of the nodes are down. I can ping node n003 but
> not node n001. I have run scontrol update to change the state, with no
> success. Below is the result of running scontrol show node:
>
> [root@slurm ~]# scontrol show node n001
> NodeName=n001 CoresPerSocket=4
>    CPUAlloc=0 CPUErr=0 CPUTot=8 CPULoad=N/A
>    AvailableFeatures=(null)
>    ActiveFeatures=(null)
>    Gres=(null)
>    NodeAddr=n001 NodeHostName=n001 Version=(null)
>    RealMemory=16111 AllocMem=0 FreeMem=N/A Sockets=2 Boards=1
>    State=DOWN* ThreadsPerCore=1 TmpDisk=19990 Weight=1 Owner=N/A MCS_label=N/A
>    BootTime=None SlurmdStartTime=None
>    CapWatts=n/a
>    CurrentWatts=0 LowestJoules=0 ConsumedJoules=0
>    ExtSensorsJoules=n/s ExtSensorsWatts=0 ExtSensorsTemp=n/s
>    Reason=Not responding [root@2017-09-21T09:29:31]
>
>
> [root@slurm ~]# scontrol show node n003
> NodeName=n003 CoresPerSocket=4
>    CPUAlloc=0 CPUErr=0 CPUTot=8 CPULoad=N/A
>    AvailableFeatures=(null)
>    ActiveFeatures=(null)
>    Gres=(null)
>    NodeAddr=n003 NodeHostName=n003 Version=(null)
>    RealMemory=16111 AllocMem=0 FreeMem=N/A Sockets=2 Boards=1
>    State=DOWN* ThreadsPerCore=1 TmpDisk=19990 Weight=1 Owner=N/A MCS_label=N/A
>    BootTime=None SlurmdStartTime=None
>    CapWatts=n/a
>    CurrentWatts=0 LowestJoules=0 ConsumedJoules=0
>    ExtSensorsJoules=n/s ExtSensorsWatts=0 ExtSensorsTemp=n/s
>    Reason=Not responding [root@2017-09-21T09:32:51]
>


Fabrice,

When nodes go down unexpectedly, there are two things I check first:

 - Is the date and time on the "down" node correct, or at least within 60
seconds of the head node?
 - What does the slurmd log say on the node that is down, and what does the
slurmctld log say on the head node?

If the first check does not turn up the problem, the logs from the second
will point you in the right direction.
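
As a rough sketch of the first check, the clock comparison could be scripted
along these lines. The function name clock_skew_ok is made up for
illustration, the 60-second default is simply the figure mentioned above, and
the commented usage line assumes passwordless ssh from the head node to the
compute node:

```shell
#!/bin/sh
# clock_skew_ok HEAD_EPOCH NODE_EPOCH [MAX_SKEW]
# Hypothetical helper: succeeds (exit status 0) when the two epoch
# timestamps differ by at most MAX_SKEW seconds (default 60).
clock_skew_ok() {
    head_epoch=$1
    node_epoch=$2
    max_skew=${3:-60}

    # Absolute difference between the two clocks, in seconds.
    diff=$((head_epoch - node_epoch))
    if [ "$diff" -lt 0 ]; then
        diff=$((0 - diff))
    fi

    [ "$diff" -le "$max_skew" ]
}

# Usage on the head node (assumes ssh access to n001):
#   if clock_skew_ok "$(date +%s)" "$(ssh n001 date +%s)"; then
#       echo "n001: clock within tolerance"
#   else
#       echo "n001: clock skew too large"
#   fi
```

For the second check, the log locations depend on the site configuration;
"scontrol show config | grep -i logfile" on the head node should list the
SlurmctldLogFile and SlurmdLogFile settings if they are set in slurm.conf.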

cheers
L.
