Hi again
I have finally installed Slurm, and all daemons are running fine on the
control machine:
[root@master /]# ps -el | grep slurm
5 S 2000 2539 1 0 80 0 - 67803 futex_ ? 00:00:00 slurmdbd
5 S 2000 2550 1 0 80 0 - 131352 hrtime ? 00:00:01 slurmctld
I have also installed Slurm on the nodes, and the daemon is running there too:
[root@node_2 ~]# ps -el | grep slurm
1 S 0 2240 1 0 80 0 - 28081 inet_c ? 00:00:00 slurmd
But now the state of the nodes keeps changing from 'idle' to 'down':
[root@master /]# sinfo
PARTITION AVAIL TIMELIMIT NODES STATE NODELIST
debug* up infinite 8 down* node_[1-8]
When the nodes are down, I run the command
scontrol update nodename=node_2 state=Resume
and the node returns to the 'idle' state. But a few minutes later the state
changes to 'down' again.
When I check the details of a given node from the master node, I get the
following:
[root@master /]# scontrol show node node_2
NodeName=node_2 CoresPerSocket=1
CPUAlloc=0 CPUErr=0 CPUTot=1 CPULoad=N/A Features=(null)
Gres=(null)
NodeAddr=node_2 NodeHostName=node_2 Version=(null)
RealMemory=1000 AllocMem=0 Sockets=1 Boards=1
State=DOWN* ThreadsPerCore=1 TmpDisk=0 Weight=1
BootTime=None SlurmdStartTime=None
CurrentWatts=0 LowestJoules=0 ConsumedJoules=0
ExtSensorsJoules=n/s ExtSensorsWatts=0 ExtSensorsTemp=n/s
Reason=Not responding [slurm@2014-04-25T21:50:11]
However, using the ping command I can reach every node in the cluster.
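Since ping only exercises ICMP, it does not tell me whether the TCP ports used by the Slurm daemons are reachable. Below is a minimal sketch of a check I could run from node_2. The function name `can_connect` is my own, and the host `master` and port 6817 (the default SlurmctldPort) are assumptions, since my slurm.conf is not shown here and may set a different port.

```python
import socket


def can_connect(host, port, timeout=3.0):
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        # create_connection resolves the name and performs the TCP handshake.
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and name-resolution failures.
        return False


# Example call (assumed controller hostname and default SlurmctldPort):
# print(can_connect("master", 6817))
```

The same function could presumably be run from the master against node_2 on port 6818 (the default SlurmdPort), since the controller also has to reach slurmd on each node. A False result in either direction would point towards a firewall, a wrong port, or a hostname-resolution problem rather than a general network outage.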
This is what the slurm.log on node_2 contains:
[2014-04-25T23:01:01.224] CPU frequency setting not configured for this node
[2014-04-25T23:01:01.230] slurmd version 14.03.0 started
[2014-04-25T23:01:01.233] WARNING: We will use a much slower algorithm with proctrack/pgid, use Proctracktype=proctrack/linuxproc or some other proctrack when using jobacct_gather/linux
[2014-04-25T23:01:01.246] slurmd started on Fri, 25 Apr 2014 23:01:01 +0200
[2014-04-25T23:01:01.246] CPUs=1 Boards=1 Sockets=1 Cores=1 Threads=1 Memory=1460 TmpDisk=17846 Uptime=52
[2014-04-25T23:01:10.256] error: Unable to register: Unable to contact slurm controller (connect failure)
[2014-04-25T23:01:20.266] error: Unable to register: Unable to contact slurm controller (connect failure)
[2014-04-25T23:01:30.277] error: Unable to register: Unable to contact slurm controller (connect failure)
[2014-04-25T23:01:40.287] error: Unable to register: Unable to contact slurm controller (connect failure)
[2014-04-25T23:01:50.298] error: Unable to register: Unable to contact slurm controller (connect failure)
[2014-04-25T23:02:00.309] error: Unable to register: Unable to contact slurm controller (connect failure)
Can somebody tell me what is wrong with my configuration, and what I
need to do to solve this problem?
Thank you very much.