Hi, I am running Slurm 14.11.4 on an 800-node RHEL 6.6 general-purpose university cluster. Since upgrading from 14.03.3 we have been seeing the following problem, and I'd appreciate any advice (maybe it's a bug, but maybe I'm missing something obvious).
Occasionally the number of slurmctld threads starts to rise rapidly until it hits the hard-coded limit of 256 and stays there. According to ps the threads are in the futex_ state, and logging stops (nothing out of the ordinary stands out in the log before this happens). Naturally, Slurm clients then start failing with timeout messages, which is not trivial because it causes some not very resilient user pipelines to fail. On one occasion this condition persisted undetected for several hours during the night.

However, there is a simple workaround: send a STOP signal to the slurmctld process, wait a few seconds, then resume it. This clears the logjam. Merely attaching a debugger has the same effect! I feel this must be a clue to the root cause.

I have already tried setting CommitDelay in slurmdbd.conf, increasing MessageTimeout, setting HealthCheckNodeState=CYCLE, and both decreasing and increasing bf_yield_interval/bf_yield_sleep, all without any apparent impact (please see the attached slurm.conf).

Any advice would be gratefully received.

Many thanks -

Stuart

--
Dr. Stuart Rankin
Senior System Administrator
High Performance Computing Service
University of Cambridge
Email: [email protected]
Tel: (+)44 1223 763517

# slurm.conf file generated by configurator.html.
# Put this file on all nodes of your cluster.
# See the slurm.conf man page for more information.
#
ControlMachine=scheduler
#ControlAddr=
#BackupController=
#BackupAddr=
#
AuthType=auth/munge
CacheGroups=0
#CheckpointType=checkpoint/none
CryptoType=crypto/munge
#DisableRootJobs=NO
EnforcePartLimits=YES
Epilog=/var/spool/slurm/check/slurm_epilogue.sh
#EpilogSlurmctld=
FirstJobId=390000
#MaxJobId=999999
#GresTypes=gpu
#GroupUpdateForce=0
#GroupUpdateTime=600
#JobCheckpointDir=/var/slurm/checkpoint
#JobCredentialPrivateKey=
#JobCredentialPublicCertificate=
#JobFileAppend=0
#JobRequeue=1
#JobSubmitPlugins=1
#KillOnBadExit=0
#LaunchType=launch/slurm
#Licenses=foo*4,bar
MailProg=/bin/mail
MaxJobCount=25000
#MaxStepCount=40000
#MaxTasksPerNode=128
MpiDefault=none
MpiParams=ports=12000-12999
#MpiParams=ports=#-#
#PluginDir=
#PlugStackConfig=
PrivateData=accounts
ProctrackType=proctrack/pgid
Prolog=/var/spool/slurm/check/slurm_node_check.sh
PrologFlags=Alloc
#PrologSlurmctld=
#PropagatePrioProcess=0
#PropagateResourceLimits=
#PropagateResourceLimitsExcept=
RebootProgram="/sbin/reboot"
ReturnToService=1
SallocDefaultCommand="srun -n1 -N1 --mem-per-cpu=0 --pty --preserve-env --mpi=none $SHELL"
SlurmctldPidFile=/var/run/slurmctld.pid
SlurmctldPort=6817
SlurmdPidFile=/var/run/slurmd.pid
SlurmdPort=6818
SlurmdSpoolDir=/var/spool/slurm/slurmd
SlurmUser=slurm
#SlurmdUser=root
#SrunEpilog=
#SrunProlog=
StateSaveLocation=/var/spool/slurm/slurmctld
SwitchType=switch/none
#TaskEpilog=
TaskPlugin=task/affinity
TaskPluginParam=Cpusets
#TaskProlog=
#TopologyPlugin=topology/tree
#TmpFs=/tmp
#TrackWCKey=no
#TreeWidth=
#UnkillableStepProgram=
#UsePAM=0
#
#
# TIMERS
#BatchStartTimeout=10
#CompleteWait=0
#EpilogMsgTime=2000
#GetEnvTimeout=2
HealthCheckNodeState=CYCLE
HealthCheckInterval=600
HealthCheckProgram=/var/spool/slurm/check/slurm_node_check.sh
InactiveLimit=0
KillWait=120
MessageTimeout=120
#ResvOverRun=0
MinJobAge=3600
#OverTimeLimit=0
SlurmctldTimeout=120
SlurmdTimeout=300
#UnkillableStepTimeout=60
#VSizeFactor=0
Waittime=0
#
#
# SCHEDULING
#DefMemPerCPU=0
FastSchedule=1
#MaxMemPerCPU=0
#SchedulerRootFilter=1
#SchedulerTimeSlice=30
SchedulerType=sched/backfill
SchedulerParameters=bf_continue,bf_window=6480,bf_resolution=300,bf_yield_interval=1000000,bf_yield_sleep=1000000,bf_max_job_test=1000,bf_max_job_user=2
SchedulerPort=7321
SelectType=select/linear
SelectTypeParameters=CR_Memory
#
#
# JOB PRIORITY
PriorityType=priority/multifactor
PriorityDecayHalfLife=0
#PriorityCalcPeriod=
#PriorityFavorSmall=
PriorityMaxAge=28-0
# Using PriorityDecayHalfLife is preferred:
PriorityUsageResetPeriod=NONE
PriorityWeightAge=1000
PriorityWeightFairshare=1000
#PriorityWeightJobSize=
#PriorityWeightPartition=
PriorityWeightQOS=10000
#
#
# LOGGING AND ACCOUNTING
AccountingStorageEnforce=qos,safe,limits
AccountingStorageHost=scheduler
#AccountingStorageLoc=/var/spool/slurm/slurm-jobacct.log
#AccountingStoragePass=
#AccountingStoragePort=
AccountingStorageType=accounting_storage/slurmdbd
#AccountingStorageUser=
AccountingStoreJobComment=YES
ClusterName=hpcs
#DebugFlags=
JobCompHost=scheduler
JobCompLoc=/var/spool/slurm/slurm-jobcomp.log
#JobCompPass=
#JobCompPort=
JobCompType=jobcomp/filetxt
#JobCompUser=
JobAcctGatherFrequency=task=300,network=300
JobAcctGatherType=jobacct_gather/linux
AcctGatherProfileType=acct_gather_profile/hdf5
AcctGatherInfinibandType=acct_gather_infiniband/ofed
AcctGatherEnergyType=acct_gather_energy/ipmi
AcctGatherNodeFreq=0
SlurmctldDebug=5
SlurmctldLogFile=/var/spool/slurm/slurmctld/slurmctld.log
SlurmdDebug=5
SlurmdLogFile=/var/spool/slurm/slurmd/slurmd.log
#SlurmSchedLogFile=
#SlurmSchedLogLevel=
#
#
# POWER SAVE SUPPORT FOR IDLE NODES (optional)
#SuspendProgram=
#ResumeProgram=
#SuspendTimeout=
ResumeTimeout=600
#ResumeRate=
#SuspendExcNodes=
#SuspendExcParts=
#SuspendRate=
#SuspendTime=
#
#
# COMPUTE NODES
NodeNAME=DEFAULT Sockets=2 ThreadsPerCore=1 State=UNKNOWN
NodeName=sand-[1-7]-[1-80] CPUs=16 CoresPerSocket=8 RealMemory=63900
NodeName=sand-8-[1-40] CPUs=16 CoresPerSocket=8 RealMemory=63900
#NodeNAME=west-[1-8]-[1-16] CPUs=12 CoresPerSocket=6 RealMemory=35700
NodeNAME=tesla[1-128] CPUs=12 CoresPerSocket=6 RealMemory=63900
NodeNAME=stella1 CPUs=24 CoresPerSocket=12 RealMemory=128000
# PARTITIONS
PartitionName=DEFAULT DefaultTime=10:00 MaxTime=36:00:00 Shared=NO State=UP
PartitionName=sandybridge Nodes=sand-[1-7]-[1-80],sand-8-[1-40] Default=YES
#PartitionName=westmere Nodes=west-[1-8]-[1-16] Default=NO
PartitionName=tesla Nodes=tesla[1-128],stella1 Default=NO
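
Until the root cause is found, the STOP/resume workaround described in the message could be automated so that a stall no longer goes unnoticed overnight. What follows is only a rough sketch under several assumptions, not a fix and not anything provided by Slurm. It assumes a Linux /proc filesystem, the SlurmctldPidFile path from the configuration above, and that the total thread count reported by the kernel is a usable proxy for the 256 server-thread limit. All function names, the polling interval, the pause length, and the "three consecutive samples" rule are hypothetical choices made for illustration.

```python
import os
import signal
import time

# Values taken from the post and the attached slurm.conf.
THREAD_LIMIT = 256                      # hard-coded limit mentioned in the post
PID_FILE = "/var/run/slurmctld.pid"     # SlurmctldPidFile


def parse_thread_count(status_text):
    """Extract the 'Threads:' value from the text of /proc/<pid>/status."""
    for line in status_text.splitlines():
        if line.startswith("Threads:"):
            return int(line.split()[1])
    raise ValueError("no 'Threads:' field found")


def next_strikes(strikes, count, limit=THREAD_LIMIT):
    """Count consecutive saturated samples; any healthy sample resets to 0.

    Assumption: the kernel's total thread count is treated as a proxy for
    the server-thread limit, so 'count >= limit' is only a heuristic.
    """
    return strikes + 1 if count >= limit else 0


def read_pid(pid_file=PID_FILE):
    """Read the slurmctld PID from its PID file."""
    with open(pid_file) as fh:
        return int(fh.read().strip())


def thread_count(pid):
    """Return the current number of threads of the given process."""
    with open(f"/proc/{pid}/status") as fh:
        return parse_thread_count(fh.read())


def pause_and_resume(pid, pause_seconds=5):
    """Send SIGSTOP, wait, then always send SIGCONT (the manual workaround)."""
    os.kill(pid, signal.SIGSTOP)
    try:
        time.sleep(pause_seconds)
    finally:
        os.kill(pid, signal.SIGCONT)


def watch(pid_file=PID_FILE, poll_seconds=30, strikes_needed=3):
    """Poll forever; pause/resume slurmctld after repeated saturated samples."""
    strikes = 0
    while True:
        pid = read_pid(pid_file)
        strikes = next_strikes(strikes, thread_count(pid))
        if strikes >= strikes_needed:
            pause_and_resume(pid)
            strikes = 0
        time.sleep(poll_seconds)
```

Calling `watch()` from a process with permission to signal slurmctld (root or the slurm user) would start the loop. Requiring several consecutive saturated samples is meant to avoid pausing the controller during a short, legitimate burst of RPCs; whether that is the right threshold for this cluster is untested.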
