Hi Lucas,

It seems that your nodes cannot reach your Slurm controller. Do you have a firewall configured on the compute nodes? From a compute node, try using telnet to check whether you can reach the controller port.
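
If telnet is not installed on the nodes, the same check can be sketched in a few lines of Python. A successful ping only shows that the host answers ICMP; it does not show that the controller's TCP port is open. This is a minimal sketch, not a Slurm tool: the function name can_connect is made up for illustration, the hostname "master" is taken from the shell prompt in the quoted message, and 6817 is Slurm's default SlurmctldPort, which may differ from the value in your slurm.conf.

```python
import socket

# Assumptions (not stated in the original message): the controller host is
# "master" (from the shell prompt) and slurmctld listens on Slurm's default
# SlurmctldPort, 6817. Check slurm.conf for the values actually in use.
CONTROLLER_HOST = "master"
CONTROLLER_PORT = 6817


def can_connect(host, port, timeout=5.0):
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        # create_connection resolves the name and performs the TCP handshake.
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and name resolution failures.
        return False
```

Calling can_connect(CONTROLLER_HOST, CONTROLLER_PORT) on node_2 and getting False while ping works would point to a firewall or a wrong port rather than a routing problem.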

Regards,
Carles Fenoy
Barcelona Supercomputing Center

On Fri, Apr 25, 2014 at 11:47 PM, Lucas St <[email protected]> wrote:
> Hi again
>
> Finally I have installed the slurm, and all daemons are running ok in the
> control machine.
>
> [root@master /]# ps -el | grep slurm
> 5 S 2000 2539 1 0 80 0 - 67803 futex_ ? 00:00:00 slurmdbd
> 5 S 2000 2550 1 0 80 0 - 131352 hrtime ? 00:00:01 slurmctld
>
> I have also installed the slurm in the nodes and the daemon is also running
>
> [root@node_2 ~]# ps -el | grep slurm
> 1 S 0 2240 1 0 80 0 - 28081 inet_c ? 00:00:00 slurmd
>
> But now, the state of the nodes is changing from 'idle' to 'down'
>
> [root@master /]# sinfo
> PARTITION AVAIL TIMELIMIT NODES STATE NODELIST
> debug* up infinite 8 down* node_[1-8]
>
> when the nodes are down I execute the command
> scontrol update nodename=node_2 state=Resume
>
> and the node comes again to "idle" state. But some minutes later the state
> change again to 'down'
>
> and when I check the info of a given node in the master node I get the
> next info
>
> [root@master /]# scontrol show node node_2
> NodeName=node_2 CoresPerSocket=1
> CPUAlloc=0 CPUErr=0 CPUTot=1 CPULoad=N/A Features=(null)
> Gres=(null)
> NodeAddr=node_2 NodeHostName=node_2 Version=(null)
> RealMemory=1000 AllocMem=0 Sockets=1 Boards=1
> State=DOWN* ThreadsPerCore=1 TmpDisk=0 Weight=1
> BootTime=None SlurmdStartTime=None
> CurrentWatts=0 LowestJoules=0 ConsumedJoules=0
> ExtSensorsJoules=n/s ExtSensorsWatts=0 ExtSensorsTemp=n/s
> Reason=Not responding [slurm@2014-04-25T21:50:11]
>
> but using the ping command, I can reach any node in the cluster
>
> This is the information that contains the slurm.log in the node_2
>
> [2014-04-25T23:01:01.224] CPU frequency setting not configured for this node
> [2014-04-25T23:01:01.230] slurmd version 14.03.0 started
> [2014-04-25T23:01:01.233] WARNING: We will use a much slower algorithm with proctrack/pgid, use Proctracktype=proctrack/linuxproc or some other proctrack when using jobacct_gather/linux
> [2014-04-25T23:01:01.246] slurmd started on Fri, 25 Apr 2014 23:01:01 +0200
> [2014-04-25T23:01:01.246] CPUs=1 Boards=1 Sockets=1 Cores=1 Threads=1 Memory=1460 TmpDisk=17846 Uptime=52
> [2014-04-25T23:01:10.256] error: Unable to register: Unable to contact slurm controller (connect failure)
> [2014-04-25T23:01:20.266] error: Unable to register: Unable to contact slurm controller (connect failure)
> [2014-04-25T23:01:30.277] error: Unable to register: Unable to contact slurm controller (connect failure)
> [2014-04-25T23:01:40.287] error: Unable to register: Unable to contact slurm controller (connect failure)
> [2014-04-25T23:01:50.298] error: Unable to register: Unable to contact slurm controller (connect failure)
> [2014-04-25T23:02:00.309] error: Unable to register: Unable to contact slurm controller (connect failure)
>
> Can somebody tell me what is wrong in my configuration process? and what I
> have to do to solve this problem?
>
> Thank you very much
>
> --

--
Carles Fenoy
