Hello all,

I have two machines running Slurm 14.03.07, called torquepbs and torquepbsno1. slurmctld is running on torquepbs, and slurmd is running on both torquepbs and torquepbsno1. Both machines have the same munge key and the same configuration file (slurm.conf), and the munge daemon is running on both. However, node torquepbs is up while torquepbsno1 is down.
What could be wrong?

slurm.conf:

# slurm.conf file generated by configurator easy.html.
# Put this file on all nodes of your cluster.
# See the slurm.conf man page for more information.
#
ControlMachine=torquepbs
ControlAddr=localhost
#
MailProg=/bin/mail
MpiDefault=none
#MpiParams=ports=#-#
ProctrackType=proctrack/pgid
ReturnToService=1
SlurmctldPidFile=/var/run/slurmctld.pid
#SlurmctldPort=6817
SlurmdPidFile=/var/run/slurmd.pid
#SlurmdPort=6818
SlurmdSpoolDir=/var/spool/slurmd
SlurmUser=slurm
#SlurmdUser=root
StateSaveLocation=/var/spool/slurmctld
SwitchType=switch/none
TaskPlugin=task/none
#
#
# TIMERS
#KillWait=30
#MinJobAge=300
#SlurmctldTimeout=120
#SlurmdTimeout=300
#
#
# SCHEDULING
FastSchedule=1
SchedulerType=sched/backfill
#SchedulerPort=7321
SelectType=select/linear
#
#
# LOGGING AND ACCOUNTING
AccountingStorageType=accounting_storage/none
ClusterName=cluster
#JobAcctGatherFrequency=30
JobAcctGatherType=jobacct_gather/linux
#SlurmctldDebug=3
SlurmctldLogFile=/var/log/slurm-llnl/slurmcltd
#SlurmdDebug=3
SlurmdLogFile=/var/log/slurm-llnl/slurmd
#
#
# COMPUTE NODES
NodeName=torquepbsno1 CPUs=4 RealMemory=2001 Sockets=2 CoresPerSocket=2 ThreadsPerCore=1 State=UNKNOWN
NodeName=torquepbsno2 CPUs=4 RealMemory=2001 Sockets=2 CoresPerSocket=2 ThreadsPerCore=1 State=UNKNOWN
NodeName=torquepbs CPUs=4 RealMemory=2001 Sockets=2 CoresPerSocket=2 ThreadsPerCore=1 State=UNKNOWN
PartitionName=particao1 Nodes=torquepbs,torquepbsno1,torquepbsno2 Default=YES MaxTime=INFINITE State=UP

Thanks in advance.

--
===============
Erica Riello
Computer Engineering Student
PUC-Rio
