And how can I fix that?
From: Cooper, Trevor [mailto:tcoo...@sdsc.edu]
Sent: Tuesday, December 22, 2015 12:04
To: slurm-dev
Subject: [slurm-dev] Re: slum in the nodes not working

It appears you may be missing a ControlMachine entry in your slurm.conf [1].

-- Trevor

[1] - http://slurm.schedmd.com/slurm.conf.html

On Dec 22, 2015, at 6:25 AM, Fany Pagés Díaz <fpa...@citi.cu> wrote:

When I look at slurmctld.log, this is the message that I get on each node:

slurmctld: error: this host (compute-0-0) not valid controller (cluster or (null))

This is my slurm.conf file. Any idea? Thanks.

SlurmUser=root
SlurmdUser=root
SlurmctldPort=6817
SlurmdPort=6818
AuthType=auth/munge
CryptoType=crypto/munge
StateSaveLocation=/var/spool/slurm.state
SlurmdSpoolDir=/var/spool/slurmd
SwitchType=switch/none
MpiDefault=none
SlurmctldPidFile=/var/run/slurmctld.pid
SlurmdPidFile=/var/run/slurmd.pid
ProctrackType=proctrack/linuxproc
PluginDir=/usr/lib64/slurm
CacheGroups=0
JobCheckpointDir=/var/spool/slurm.checkpoint
#SallocDefaultCommand = "xterm"
GresTypes=gpu
#FirstJobId=
ReturnToService=2
#MaxJobCount=
#PlugStackConfig=
#PropagatePrioProcess=
#PropagateResourceLimits=
#PropagateResourceLimitsExcept=
#Prolog=
#Epilog=
#SrunProlog=
#SrunEpilog=
#TaskProlog=
#TaskEpilog=
TaskPlugin=task/affinity
TrackWCKey=yes
TopologyPlugin=topology/none
#TreeWidth=50
TmpFs=/state/partition1
#UsePAM=
SlurmctldTimeout=300
SlurmdTimeout=40
InactiveLimit=30
MinJobAge=300
KillWait=30
WaitTime=30
SelectType=select/cons_res
SelectTypeParameters=CR_Core_Memory
#DefMemPerCPU=220
#MaxMemPerCPU=300
VSizeFactor=90
FastSchedule=0
PriorityType=priority/multifactor
PriorityDecayHalfLife=14-0
PriorityWeightFairshare=10000
PriorityWeightAge=1000
PriorityWeightPartition=10000
PriorityWeightJobSize=1000
PriorityMaxAge=1-0
PriorityWeightQOS=10000
SlurmctldDebug=3
SlurmctldLogFile=/var/log/slurm/slurmctld.log
SlurmdDebug=3
SlurmdLogFile=/var/log/slurm/slurmd.log
JobCompType=jobcomp/none
JobAcctGatherType=jobacct_gather/linux
JobAcctGatherFrequency=30
NodeName=DEFAULT State=UNKNOWN
NodeName=cluster NodeAddr=10.8.52.254 gres=gpu:2
PartitionName=DEFAULT AllocNodes=cluster State=UP
PartitionName=DEBUG
################ Do not edit below #############################################################
include /etc/slurm/headnode.conf
include /etc/slurm/nodenames.conf
include /etc/slurm/partitions.conf
################################################################################################

From: Eckert, Phil [mailto:ecke...@llnl.gov]
Sent: Monday, December 21, 2015 16:10
To: slurm-dev
Subject: [slurm-dev] Re: slum in the nodes not working

Make sure the slurm.conf file is identical on all nodes. If slurmctld is running and all the slurmd daemons are running, take a look at slurmctld.log; it should provide some clues. If not, you might want to post the content of your slurm.conf file.

Phil Eckert
LLNL

From: Fany Pagés Díaz <fpa...@citi.cu>
Reply-To: slurm-dev <slurm-dev@schedmd.com>
Date: Monday, December 21, 2015 at 12:39 PM
To: slurm-dev <slurm-dev@schedmd.com>
Subject: [slurm-dev] Re: slum in the nodes not working

When I started the server, the nodes were down. I started /etc/init.d/slurm on the server and it is fine, but the nodes are still down. I restarted the nodes again and nothing changed. Any idea?

From: Carlos Fenoy [mailto:mini...@gmail.com]
Sent: Monday, December 21, 2015 12:59
To: slurm-dev
Subject: [slurm-dev] Re: slum in the nodes not working

You should not start slurmctld on all the nodes, only on the head node of the cluster. On the compute nodes, start slurmd with

service slurm start

On Mon, Dec 21, 2015 at 6:27 PM, Fany Pagés Díaz <fpa...@citi.cu> wrote:

I had to turn off my cluster because of electrical problems, and now slurm is not working. The nodes are down and the slurm daemons on the nodes fail. When I run the slurmctld -D command on the nodes, I get the following error:

slurmctld: error: this host (compute-0-0) not valid controller (cluster or (null))

How can I fix that? Can anyone help me, please?

Ing. Fany Pages Diaz

--
--
Carles Fenoy
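
For anyone reading this thread later: the error message lists the hostnames that slurmctld accepts as controllers, here "cluster" and "(null)", and refuses to start on compute-0-0 because it is neither. The rough sketch below illustrates that check in Python. It is not Slurm's own code; the function names parse_controllers and may_run_slurmctld are hypothetical, the sample configuration is assumed (the thread never shows /etc/slurm/headnode.conf, where the controller entry presumably lives), and include lines are not followed.

```python
def parse_controllers(conf_text):
    """Collect hostnames from ControlMachine/BackupController lines.

    Simplified illustration: comments are stripped, keys are matched
    case-insensitively, and 'include' directives are NOT followed.
    """
    controllers = []
    for raw in conf_text.splitlines():
        line = raw.split("#", 1)[0].strip()  # drop trailing comments
        if "=" not in line:
            continue
        key, value = line.split("=", 1)
        if key.strip().lower() in ("controlmachine", "backupcontroller"):
            controllers.append(value.strip())
    return controllers


def may_run_slurmctld(hostname, conf_text):
    """True if this host is named as a controller in the given config text."""
    return hostname in parse_controllers(conf_text)


# Assumed content, for illustration only; not taken from the thread.
SAMPLE_CONF = """
ControlMachine=cluster
SlurmctldPort=6817
#BackupController=
"""

if __name__ == "__main__":
    for host in ("cluster", "compute-0-0"):
        daemon = "slurmctld" if may_run_slurmctld(host, SAMPLE_CONF) else "slurmd only"
        print(host, "->", daemon)
```

With the assumed sample, compute-0-0 is not a controller, which matches Carlos's advice: run slurmctld only on the head node and start slurmd on the compute nodes.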