[slurm-dev] Re: slum in the nodes not working

Fany Pagés Díaz Tue, 22 Dec 2015 10:16:37 -0800

And how can I fix that?


From: Cooper, Trevor [mailto:tcoo...@sdsc.edu] 
Sent: Tuesday, December 22, 2015 12:04
To: slurm-dev
Subject: [slurm-dev] Re: slum in the nodes not working

 

It appears you may be missing a ControlMachine entry in your slurm.conf[1]. 
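
One way to check this is to search the configuration files for the entry. The following is a minimal sketch, not a definitive procedure: find_control_machine is a hypothetical helper name, and the /etc/slurm/slurm.conf path in the usage line is an assumption based on the include paths in the posted file. Because that file includes /etc/slurm/headnode.conf, the entry may be defined there rather than in slurm.conf itself.

```shell
#!/bin/sh
# Hypothetical helper: print every active ControlMachine line found in the
# configuration files passed as arguments.
find_control_machine() {
    # -i: slurm.conf parameter names are case insensitive
    # -H: prefix each match with the file it came from
    # The anchored pattern skips commented-out entries such as "#ControlMachine=".
    grep -iH '^[[:space:]]*ControlMachine=' "$@"
}
```

For example: find_control_machine /etc/slurm/slurm.conf /etc/slurm/headnode.conf (the first path is assumed). No output means no active ControlMachine entry exists in those files.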

 

-- Trevor

 

[1] - http://slurm.schedmd.com/slurm.conf.html

 

On Dec 22, 2015, at 6:25 AM, Fany Pagés Díaz <fpa...@citi.cu> wrote:

 

When I look at slurmctld.log, this is the message I get on each node:

 

  slurmctld: error: this host (compute-0-0) not valid controller (cluster or (null))

 

This is the content of my slurm.conf file. Any ideas? Thanks.

 

SlurmUser=root
SlurmdUser=root
SlurmctldPort=6817
SlurmdPort=6818
AuthType=auth/munge
CryptoType=crypto/munge
StateSaveLocation=/var/spool/slurm.state
SlurmdSpoolDir=/var/spool/slurmd
SwitchType=switch/none
MpiDefault=none
SlurmctldPidFile=/var/run/slurmctld.pid
SlurmdPidFile=/var/run/slurmd.pid
ProctrackType=proctrack/linuxproc
PluginDir=/usr/lib64/slurm
CacheGroups=0
JobCheckpointDir=/var/spool/slurm.checkpoint
#SallocDefaultCommand = "xterm"
GresTypes=gpu
#FirstJobId=
ReturnToService=2
#MaxJobCount=
#PlugStackConfig=
#PropagatePrioProcess=
#PropagateResourceLimits=
#PropagateResourceLimitsExcept=
#Prolog=
#Epilog=
#SrunProlog=
#SrunEpilog=
#TaskProlog=
#TaskEpilog=
TaskPlugin=task/affinity
TrackWCKey=yes
TopologyPlugin=topology/none
#TreeWidth=50
TmpFs=/state/partition1
#UsePAM=
SlurmctldTimeout=300
SlurmdTimeout=40
InactiveLimit=30
MinJobAge=300
KillWait=30
WaitTime=30
SelectType=select/cons_res
SelectTypeParameters=CR_Core_Memory
#DefMemPerCPU=220
#MaxMemPerCPU=300
VSizeFactor=90
FastSchedule=0
PriorityType=priority/multifactor
PriorityDecayHalfLife=14-0
PriorityWeightFairshare=10000
PriorityWeightAge=1000
PriorityWeightPartition=10000
PriorityWeightJobSize=1000
PriorityMaxAge=1-0
PriorityWeightQOS=10000
SlurmctldDebug=3
SlurmctldLogFile=/var/log/slurm/slurmctld.log
SlurmdDebug=3
SlurmdLogFile=/var/log/slurm/slurmd.log
JobCompType=jobcomp/none
JobAcctGatherType=jobacct_gather/linux
JobAcctGatherFrequency=30
NodeName=DEFAULT State=UNKNOWN
NodeName=cluster NodeAddr=10.8.52.254 gres=gpu:2
PartitionName=DEFAULT AllocNodes=cluster State=UP
PartitionName=DEBUG
################ Do not edit below #############################################################
include /etc/slurm/headnode.conf
include /etc/slurm/nodenames.conf
include /etc/slurm/partitions.conf
################################################################################################

 

 

 

From: Eckert, Phil [mailto:ecke...@llnl.gov] 
Sent: Monday, December 21, 2015 16:10
To: slurm-dev
Subject: [slurm-dev] Re: slum in the nodes not working

 

Make sure the slurm.conf file is identical on all nodes.

If slurmctld is running and all the slurmd daemons are running, take a look at 
slurmctld.log; it should provide some clues. If not, you might want to post 
the content of your slurm.conf file.
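
The "identical on all nodes" check can be sketched as a byte-for-byte comparison. This is only an illustration under assumptions: same_conf is a hypothetical helper name, and it expects that a node's copy of slurm.conf has already been copied to the local machine (for example with scp).

```shell
#!/bin/sh
# Hypothetical helper: report whether two copies of slurm.conf have the
# same contents. $1 and $2 are paths to the two local copies.
same_conf() {
    # cmp -s compares the files byte for byte and prints nothing;
    # its exit status is 0 only when the files are identical.
    if cmp -s "$1" "$2"; then
        echo "identical"
    else
        echo "different"
    fi
}
```

A result of "different" for any node means that node's configuration has drifted from the head node's copy and should be resynchronized before the daemons are restarted.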

 

Phil Eckert

LLNL

 

From: Fany Pagés Díaz <fpa...@citi.cu>
Reply-To: slurm-dev <slurm-dev@schedmd.com>
Date: Monday, December 21, 2015 at 12:39 PM
To: slurm-dev <slurm-dev@schedmd.com>
Subject: [slurm-dev] Re: slum in the nodes not working

 

When I started the server, the nodes were down. I started /etc/init.d/slurm on 
the server and it is fine there, but the nodes are still down. I restarted the 
nodes again and nothing changed. Any ideas?

 

From: Carlos Fenoy [mailto:mini...@gmail.com] 
Sent: Monday, December 21, 2015 12:59
To: slurm-dev
Subject: [slurm-dev] Re: slum in the nodes not working

 

You should not start slurmctld on all the nodes, only on the head node of 
the cluster. On the compute nodes, start slurmd with "service slurm start".
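
The rule above can be sketched as a small decision function. This is an illustration only: daemon_for_host is a hypothetical helper name, and the controller name "cluster" in the usage line is taken from the error message quoted below, not from a verified configuration.

```shell
#!/bin/sh
# Hypothetical helper: print which Slurm daemon belongs on a host.
# $1 = short hostname of the machine, $2 = ControlMachine value.
daemon_for_host() {
    host=$1
    controller=$2
    if [ "$host" = "$controller" ]; then
        # Head node: the only place slurmctld should run.
        echo "slurmctld"
    else
        # Compute node: run slurmd only.
        echo "slurmd"
    fi
}
```

For example, daemon_for_host "$(hostname -s)" cluster prints slurmd when run on compute-0-0, which is why running slurmctld -D there is rejected. If the head node is also listed as a NodeName, it may run slurmd in addition to slurmctld.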

 

On Mon, Dec 21, 2015 at 6:27 PM, Fany Pagés Díaz <fpa...@citi.cu> wrote:

I had to shut down my cluster because of power problems, and now Slurm is not 
working. The nodes are down and the Slurm daemons on the nodes fail.
When I run the slurmctld -D command on the nodes, I get the following error:

 

slurmctld: error: this host (compute-0-0) not valid controller (cluster or (null))

 

How can I fix that? Can anyone help me, please?

Ing. Fany Pages Diaz





 

-- 

--
Carles Fenoy
