Is slurmctld running somewhere in your cluster?

If so, on which host?

Once you know that, you can edit your slurm.conf file to add/set that
configuration.

The 'example' slurm.conf file[1] starts like this...

#
# Example slurm.conf file. Please run configurator.html
# (in doc/html) to build a configuration file customized
# for your environment.
#
#
# slurm.conf file generated by configurator.html.
#
# See the slurm.conf man page for more information.
#
ClusterName=linux
ControlMachine=linux0
#
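
In this case, the missing piece would likely look something like the lines below, placed near the top of slurm.conf (the values here are only a guess based on the config quoted further down; use the hostname, and optionally the address, of whichever machine actually runs slurmctld):

ClusterName=cluster
ControlMachine=cluster          # hostname of the machine that runs slurmctld
#ControlAddr=10.8.52.254        # optional, if that hostname does not resolve on the compute nodes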

Then sync your slurm.conf file to all compute nodes (using whatever mechanism 
you use to coordinate your slurm.conf) and restart Slurm on your compute nodes 
so they can connect to the controller.
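
Something along these lines from the controller host should do it (just a sketch, assuming password-less ssh/scp and the /etc/init.d/slurm script mentioned further down in the thread; substitute your own node names and whatever sync tool you normally use):

# push the updated config to a compute node and restart its slurmd
scp /etc/slurm/slurm.conf compute-0-0:/etc/slurm/slurm.conf
ssh compute-0-0 '/etc/init.d/slurm restart'

# restart slurmctld on the controller and confirm the node registers
/etc/init.d/slurm restart
sinfo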

-- Trevor

[1] - https://github.com/SchedMD/slurm/blob/master/etc/slurm.conf.example


> On Dec 22, 2015, at 10:17 AM, Fany Pagés Díaz <fpa...@citi.cu> wrote:
> 
> And how can I fix that?
>  
> From: Cooper, Trevor [mailto:tcoo...@sdsc.edu] 
> Sent: Tuesday, December 22, 2015 12:04
> To: slurm-dev
> Subject: [slurm-dev] Re: slum in the nodes not working
>  
> It appears you may be missing a ControlMachine entry in your slurm.conf[1]. 
>  
> -- Trevor
>  
> [1] - http://slurm.schedmd.com/slurm.conf.html
>  
> On Dec 22, 2015, at 6:25 AM, Fany Pagés Díaz <fpa...@citi.cu> wrote:
>  
> When I look at the slurmctld.log, this is the message I get on each node.
>  
> *slurmctld: error: this host (compute-0-0) not valid controller (cluster or (null))*
>  
> This is the content of my slurm.conf file. Any idea? Thanks
>  
> SlurmUser=root
> SlurmdUser=root
> SlurmctldPort=6817
> SlurmdPort=6818
> AuthType=auth/munge
> CryptoType=crypto/munge
> StateSaveLocation=/var/spool/slurm.state
> SlurmdSpoolDir=/var/spool/slurmd
> SwitchType=switch/none
> MpiDefault=none
> SlurmctldPidFile=/var/run/slurmctld.pid
> SlurmdPidFile=/var/run/slurmd.pid
> ProctrackType=proctrack/linuxproc
> PluginDir=/usr/lib64/slurm
> CacheGroups=0
> JobCheckpointDir=/var/spool/slurm.checkpoint
> #SallocDefaultCommand = "xterm"
> GresTypes=gpu
> #FirstJobId=
> ReturnToService=2
> #MaxJobCount=
> #PlugStackConfig=
> #PropagatePrioProcess=
> #PropagateResourceLimits=
> #PropagateResourceLimitsExcept=
> #Prolog=
> #Epilog=
> #SrunProlog=
> #SrunEpilog=
> #TaskProlog=
> #TaskEpilog=
> TaskPlugin=task/affinity
> TrackWCKey=yes
> TopologyPlugin=topology/none
> #TreeWidth=50
> TmpFs=/state/partition1
> #UsePAM=
> SlurmctldTimeout=300
> SlurmdTimeout=40
> InactiveLimit=30
> MinJobAge=300
> KillWait=30
> WaitTime=30
> SelectType=select/cons_res
> SelectTypeParameters=CR_Core_Memory
> #DefMemPerCPU=220
> #MaxMemPerCPU=300
> VSizeFactor=90
> FastSchedule=0
> PriorityType=priority/multifactor
> PriorityDecayHalfLife=14-0
> PriorityWeightFairshare=10000
> PriorityWeightAge=1000
> PriorityWeightPartition=10000
> PriorityWeightJobSize=1000
> PriorityMaxAge=1-0
> PriorityWeightQOS=10000
> SlurmctldDebug=3
> SlurmctldLogFile=/var/log/slurm/slurmctld.log
> SlurmdDebug=3
> SlurmdLogFile=/var/log/slurm/slurmd.log
> JobCompType=jobcomp/none
> JobAcctGatherType=jobacct_gather/linux
> JobAcctGatherFrequency=30
> NodeName=DEFAULT State=UNKNOWN
> NodeName=cluster NodeAddr=10.8.52.254 gres=gpu:2
> PartitionName=DEFAULT AllocNodes=cluster State=UP
> PartitionName=DEBUG
> ################ Do not edit below #############################################################
> include /etc/slurm/headnode.conf
> include /etc/slurm/nodenames.conf
> include /etc/slurm/partitions.conf
> ################################################################################################
>  
>  
>  
> From: Eckert, Phil [mailto:ecke...@llnl.gov] 
> Sent: Monday, December 21, 2015 16:10
> To: slurm-dev
> Subject: [slurm-dev] Re: slum in the nodes not working
>  
> Make sure the slurm.conf file is identical on all nodes.
> If slurmctld is running, and all the slurmd's are running, take a look at the 
> slurmctld.log; it should provide some clues. If not, you might want to post 
> the content of your slurm.conf file.
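>  
> For instance, a quick sanity check could look like this (a rough sketch; the log paths are whatever SlurmctldLogFile and SlurmdLogFile point to in your slurm.conf):
>  
> # on the head node
> pgrep -l slurmctld
> tail -n 50 /var/log/slurm/slurmctld.log
>  
> # on a compute node
> pgrep -l slurmd
> tail -n 50 /var/log/slurm/slurmd.log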
>  
> Phil Eckert
> LLNL
>  
> From: Fany Pagés Díaz <fpa...@citi.cu>
> Reply-To: slurm-dev <slurm-dev@schedmd.com>
> Date: Monday, December 21, 2015 at 12:39 PM
> To: slurm-dev <slurm-dev@schedmd.com>
> Subject: [slurm-dev] Re: slum in the nodes not working
>  
> When I started the server, the nodes were down. I started /etc/init.d/slurm on 
> the server and it's fine, but the nodes are still down. I restarted the nodes 
> again and nothing changed. Any idea?
>  
> From: Carlos Fenoy [mailto:mini...@gmail.com] 
> Sent: Monday, December 21, 2015 12:59
> To: slurm-dev
> Subject: [slurm-dev] Re: slum in the nodes not working
>  
> You should not start slurmctld on all the nodes, only on the head node of the 
> cluster. On the compute nodes, start slurmd with service slurm start.
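>  
> If in doubt about which daemon belongs on a given host, something like this can help (a sketch; scontrol reads the local slurm.conf to answer):
>  
> scontrol show daemons      # prints slurmctld and/or slurmd for this host
> service slurm start        # then start whatever belongs here
> sinfo                      # from the head node, check that the nodes register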
>  
> On Mon, Dec 21, 2015 at 6:27 PM, Fany Pagés Díaz <fpa...@citi.cu> wrote:
> I had to shut down my cluster because of power problems, and now Slurm is not 
> working. The nodes are down and the Slurm daemons on the nodes fail.
> When I run the slurmctld -D command on the nodes, I get the following error:
>  
> slurmctld: error: this host (compute-0-0) not valid controller (cluster or (null))
>  
> How can I fix that? Can anyone help me, please?
> Ing. Fany Pages Diaz
> 
> 
>  
> -- 
> Carles Fenoy
