Hi again

I have finally installed Slurm, and all daemons are running fine on the
control machine:

[root@master /]# ps -el | grep slurm
5 S  2000  2539     1  0  80   0 - 67803 futex_ ?        00:00:00 slurmdbd
5 S  2000  2550     1  0  80   0 - 131352 hrtime ?       00:00:01 slurmctld

I have also installed Slurm on the nodes, and the daemon is running there as
well:

[root@node_2 ~]# ps -el | grep slurm
1 S     0  2240     1  0  80   0 - 28081 inet_c ?        00:00:00 slurmd

But now the state of the nodes keeps changing from 'idle' to 'down':

[root@master /]# sinfo
     PARTITION AVAIL  TIMELIMIT  NODES  STATE NODELIST
     debug*       up   infinite      8  down* node_[1-8]

When the nodes are down, I run the command
     scontrol update nodename=node_2 state=Resume

and the node returns to the 'idle' state. But a few minutes later the state
changes back to 'down'.

When I check the information for a given node from the master node, I get
the following:

[root@master /]# scontrol show node node_2
    NodeName=node_2 CoresPerSocket=1
      CPUAlloc=0 CPUErr=0 CPUTot=1 CPULoad=N/A Features=(null)
      Gres=(null)
      NodeAddr=node_2 NodeHostName=node_2 Version=(null)
      RealMemory=1000 AllocMem=0 Sockets=1 Boards=1
      State=DOWN* ThreadsPerCore=1 TmpDisk=0 Weight=1
      BootTime=None SlurmdStartTime=None
      CurrentWatts=0 LowestJoules=0 ConsumedJoules=0
      ExtSensorsJoules=n/s ExtSensorsWatts=0 ExtSensorsTemp=n/s
      Reason=Not responding [slurm@2014-04-25T21:50:11]

However, using the ping command, I can reach every node in the cluster.
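
Since ping only exercises ICMP, it does not show whether slurmd can open a
TCP connection to slurmctld. A minimal sketch of such a check, to be run on a
node, assuming the controller host is named master and slurmctld listens on
the default SlurmctldPort 6817 (neither is confirmed by the output above).
The helper name check_tcp is hypothetical.

```shell
# Hypothetical helper: succeeds (exit 0) only if a TCP connection to
# host $1 on port $2 can be opened within 3 seconds.
check_tcp() {
    # bash's /dev/tcp pseudo-device opens a TCP socket; timeout bounds the wait.
    timeout 3 bash -c 'exec 3<>"/dev/tcp/$1/$2"' _ "$1" "$2" 2>/dev/null
}

# Example call (assumed host name and port):
#   check_tcp master 6817 && echo "reachable" || echo "NOT reachable"
```

If the check fails while ping succeeds, a firewall rule or a wrong controller
address would be plausible candidates to look at.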

This is what slurm.log on node_2 contains:

[2014-04-25T23:01:01.224] CPU frequency setting not configured for this node
[2014-04-25T23:01:01.230] slurmd version 14.03.0 started
[2014-04-25T23:01:01.233] WARNING: We will use a much slower algorithm with proctrack/pgid, use Proctracktype=proctrack/linuxproc or some other proctrack when using jobacct_gather/linux
[2014-04-25T23:01:01.246] slurmd started on Fri, 25 Apr 2014 23:01:01 +0200
[2014-04-25T23:01:01.246] CPUs=1 Boards=1 Sockets=1 Cores=1 Threads=1 Memory=1460 TmpDisk=17846 Uptime=52
[2014-04-25T23:01:10.256] error: Unable to register: Unable to contact slurm controller (connect failure)
[2014-04-25T23:01:20.266] error: Unable to register: Unable to contact slurm controller (connect failure)
[2014-04-25T23:01:30.277] error: Unable to register: Unable to contact slurm controller (connect failure)
[2014-04-25T23:01:40.287] error: Unable to register: Unable to contact slurm controller (connect failure)
[2014-04-25T23:01:50.298] error: Unable to register: Unable to contact slurm controller (connect failure)
[2014-04-25T23:02:00.309] error: Unable to register: Unable to contact slurm controller (connect failure)
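
The "connect failure" lines depend on which controller address and port the
node's slurm.conf points at. A small sketch for listing those settings so
they can be compared between master and node; the function name
show_controller_settings and the example path are assumptions, not taken
from this installation.

```shell
# Hypothetical helper: print the controller-related settings found in the
# slurm.conf file given as $1 (commented-out lines are ignored).
show_controller_settings() {
    grep -iE '^[[:space:]]*(ControlMachine|ControlAddr|SlurmctldPort|SlurmdPort)=' "$1"
}

# Example call (the path is an assumption; use the actual location):
#   show_controller_settings /etc/slurm/slurm.conf
```

Running it on both the master and a node and comparing the two outputs would
show whether they agree.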


Can somebody tell me what is wrong with my configuration, and what I have to
do to solve this problem?

Thank you very much
