Hi Lucas,

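It seems that your nodes cannot reach your Slurm controller. Do you have
a firewall configured on the compute nodes or on the controller? Ping working
only shows that ICMP gets through, not that the controller's TCP port is
open. Try using telnet from a compute node to check whether you can reach
the controller port.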
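
If telnet is not installed on the nodes, a small Python script can do the
same check. This is only a sketch under a few assumptions: the hostname
"master" is taken from your shell prompt, 6817 is Slurm's default
SlurmctldPort (check the actual value in your slurm.conf), and the function
name can_reach is just illustrative.

```python
import socket


def can_reach(host: str, port: int, timeout: float = 3.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        # create_connection resolves the name and performs the TCP handshake,
        # which is the same thing "telnet host port" tests.
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers connection refused, timeouts, unreachable hosts and
        # name-resolution failures.
        return False


# Example call from a compute node (hostname and port are assumptions):
#   can_reach("master", 6817)
```

If this returns False on a compute node while ping works, a firewall rule or
a wrong port/hostname in slurm.conf is the most likely cause.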

Regards,
Carles Fenoy
Barcelona Supercomputing Center


On Fri, Apr 25, 2014 at 11:47 PM, Lucas St <[email protected]> wrote:

>  Hi again
>
> I have finally installed Slurm, and all daemons are running fine on the
> control machine.
>
> [root@master /]# ps -el | grep slurm
> 5 S  2000  2539     1  0  80   0 -  67803 futex_ ?        00:00:00 slurmdbd
> 5 S  2000  2550     1  0  80   0 - 131352 hrtime ?        00:00:01 slurmctld
>
> I have also installed Slurm on the nodes, and the daemon is running there too:
>
> [root@node_2 ~]# ps -el | grep slurm
> 1 S     0  2240     1  0  80   0 -  28081 inet_c ?        00:00:00 slurmd
>
> However, the state of the nodes keeps changing from 'idle' to 'down':
>
> [root@master /]# sinfo
>      PARTITION AVAIL  TIMELIMIT  NODES  STATE NODELIST
>      debug*       up   infinite      8  down* node_[1-8]
>
> When the nodes are down, I run the command
>      scontrol update nodename=node_2 state=Resume
>
> and the node returns to the 'idle' state, but a few minutes later the state
> changes to 'down' again.
>
> When I check a given node from the master node, I get the following
> information:
>
> [root@master /]# scontrol show node node_2
>     NodeName=node_2 CoresPerSocket=1
>       CPUAlloc=0 CPUErr=0 CPUTot=1 CPULoad=N/A Features=(null)
>       Gres=(null)
>       NodeAddr=node_2 NodeHostName=node_2 Version=(null)
>       RealMemory=1000 AllocMem=0 Sockets=1 Boards=1
>       State=DOWN* ThreadsPerCore=1 TmpDisk=0 Weight=1
>       BootTime=None SlurmdStartTime=None
>       CurrentWatts=0 LowestJoules=0 ConsumedJoules=0
>       ExtSensorsJoules=n/s ExtSensorsWatts=0 ExtSensorsTemp=n/s
>       Reason=Not responding [slurm@2014-04-25T21:50:11]
>
> However, using the ping command, I can reach any node in the cluster.
>
> This is what slurm.log on node_2 contains:
>
> [2014-04-25T23:01:01.224] CPU frequency setting not configured for this
> node
> [2014-04-25T23:01:01.230] slurmd version 14.03.0 started
> [2014-04-25T23:01:01.233] WARNING: We will use a much slower algorithm
> with proctrack/pgid, use Proctracktype=proctrack/linuxproc or some other
> proctrack when using jobacct_gather/linux
> [2014-04-25T23:01:01.246] slurmd started on Fri, 25 Apr 2014 23:01:01 +0200
> [2014-04-25T23:01:01.246] CPUs=1 Boards=1 Sockets=1 Cores=1 Threads=1
> Memory=1460 TmpDisk=17846 Uptime=52
> [2014-04-25T23:01:10.256] error: Unable to register: Unable to contact
> slurm controller (connect failure)
> [2014-04-25T23:01:20.266] error: Unable to register: Unable to contact
> slurm controller (connect failure)
> [2014-04-25T23:01:30.277] error: Unable to register: Unable to contact
> slurm controller (connect failure)
> [2014-04-25T23:01:40.287] error: Unable to register: Unable to contact
> slurm controller (connect failure)
> [2014-04-25T23:01:50.298] error: Unable to register: Unable to contact
> slurm controller (connect failure)
> [2014-04-25T23:02:00.309] error: Unable to register: Unable to contact
> slurm controller (connect failure)
>
>
> Can somebody tell me what is wrong in my configuration, and what I
> have to do to solve this problem?
>
> Thank you very much
>
>


-- 
--
Carles Fenoy
