I run a slurmctld and slurmdbd on a Scientific Linux (SL) 5 server and have three SL6 nodes, all running Slurm 14.03.6, with one node also sitting behind a second slurmctld on another cluster. The whole Slurm setup seems to run fine in my tests, even when submitting from one cluster to the other. The slurmctld daemon on the machine that also runs slurmdbd, however, keeps logging
    error: slurm_receive_msg: Zero Bytes were transmitted or received

over and over, spamming /var/log/slurm/slurmctld.log until I run "scontrol setdebug fatal", which silences it. What could be causing these messages when everything (at least everything I test) seems fine? Is it expecting data from long-gone nodes, or something like that? The clocks are synchronized across all nodes and servers, and munged is running everywhere with the same key.

Configuration data as of 2014-08-17T18:38:36
AccountingStorageBackupHost = (null)
AccountingStorageEnforce = none
AccountingStorageHost = server620
AccountingStorageLoc = N/A
AccountingStoragePort = 6819
AccountingStorageType = accounting_storage/slurmdbd
AccountingStorageUser = N/A
AccountingStoreJobComment = YES
AcctGatherEnergyType = acct_gather_energy/none
AcctGatherFilesystemType = acct_gather_filesystem/none
AcctGatherInfinibandType = acct_gather_infiniband/none
AcctGatherNodeFreq = 0 sec
AcctGatherProfileType = acct_gather_profile/none
AuthInfo = (null)
AuthType = auth/munge
BackupAddr = (null)
BackupController = (null)
BatchStartTimeout = 10 sec
BOOT_TIME = 2014-08-17T18:20:51
CacheGroups = 0
CheckpointType = checkpoint/none
ClusterName = testcluster
CompleteWait = 0 sec
ControlAddr = server620
ControlMachine = server620
CoreSpecPlugin = core_spec/none
CryptoType = crypto/munge
DebugFlags = (null)
DefMemPerNode = UNLIMITED
DisableRootJobs = NO
DynAllocPort = 0
EnforcePartLimits = NO
Epilog = (null)
EpilogMsgTime = 2000 usec
EpilogSlurmctld = (null)
ExtSensorsType = ext_sensors/none
ExtSensorsFreq = 0 sec
FastSchedule = 0
FirstJobId = 1
GetEnvTimeout = 2 sec
GresTypes = (null)
GroupUpdateForce = 0
GroupUpdateTime = 600 sec
HASH_VAL = Match
HealthCheckInterval = 0 sec
HealthCheckNodeState = ANY
HealthCheckProgram = (null)
InactiveLimit = 0 sec
JobAcctGatherFrequency = 30
JobAcctGatherType = jobacct_gather/linux
JobAcctGatherParams = (null)
JobCheckpointDir = /var/slurm/checkpoint
JobCompHost = localhost
JobCompLoc = /var/log/slurm/slurm_jobcomploc
JobCompPort = 0
JobCompType = jobcomp/filetxt
JobCompUser = root
JobContainerPlugin = job_container/none
JobCredentialPrivateKey = (null)
JobCredentialPublicCertificate = (null)
JobFileAppend = 0
JobRequeue = 1
JobSubmitPlugins = (null)
KeepAliveTime = SYSTEM_DEFAULT
KillOnBadExit = 0
KillWait = 30 sec
LaunchType = launch/slurm
Licenses = (null)
LicensesUsed = (null)
MailProg = /bin/mail
MaxArraySize = 1001
MaxJobCount = 10000
MaxJobId = 4294901760
MaxMemPerNode = UNLIMITED
MaxStepCount = 40000
MaxTasksPerNode = 128
MessageTimeout = 10 sec
MinJobAge = 300 sec
MpiDefault = none
MpiParams = (null)
NEXT_JOB_ID = 2
OverTimeLimit = 0 min
PluginDir = /usr/local/slurm-sl5/lib/slurm
PlugStackConfig = /usr/local/slurm-etc/plugstack.conf
PreemptMode = GANG,SUSPEND
PreemptType = preempt/partition_prio
PriorityType = priority/basic
PrivateData = none
ProctrackType = proctrack/pgid
Prolog = (null)
PrologSlurmctld = (null)
PrologFlags = (null)
PropagatePrioProcess = 0
PropagateResourceLimits = ALL
PropagateResourceLimitsExcept = (null)
RebootProgram = (null)
ReconfigFlags = (null)
ResumeProgram = (null)
ResumeRate = 300 nodes/min
ResumeTimeout = 60 sec
ResvEpilog = (null)
ResvOverRun = 0 min
ResvProlog = (null)
ReturnToService = 1
SallocDefaultCommand = (null)
SchedulerParameters = (null)
SchedulerPort = 7321
SchedulerRootFilter = 1
SchedulerTimeSlice = 30 sec
SchedulerType = sched/backfill
SelectType = select/linear
SlurmUser = slurm(1283)
SlurmctldDebug = fatal
SlurmctldLogFile = /var/log/slurm/slurmctld.log
SlurmSchedLogFile = (null)
SlurmctldPort = 6817
SlurmctldTimeout = 120 sec
SlurmdDebug = info
SlurmdLogFile = /var/log/slurm/slurmd.log
SlurmdPidFile = /var/run/slurm/slurmd.pid
SlurmdPlugstack = (null)
SlurmdPort = 6818
SlurmdSpoolDir = /var/spool/slurm/slurmd
SlurmdTimeout = 300 sec
SlurmdUser = root(0)
SlurmSchedLogLevel = 0
SlurmctldPidFile = /var/run/slurm/slurmctld.pid
SlurmctldPlugstack = (null)
SLURM_CONF = /usr/local/slurm-etc/slurm.conf
SLURM_VERSION = 14.03.6
SrunEpilog = (null)
SrunProlog = (null)
StateSaveLocation = /var/spool/slurm
SuspendExcNodes = (null)
SuspendExcParts = (null)
SuspendProgram = (null)
SuspendRate = 60 nodes/min
SuspendTime = NONE
SuspendTimeout = 30 sec
SwitchType = switch/none
TaskEpilog = (null)
TaskPlugin = task/none
TaskPluginParam = (null type)
TaskProlog = (null)
TmpFS = /tmp
TopologyPlugin = topology/none
TrackWCKey = 0
TreeWidth = 50
UsePam = 0
UnkillableStepProgram = (null)
UnkillableStepTimeout = 60 sec
VSizeFactor = 0 percent
WaitTime = 0 sec

Thanks,
Gerben
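P.S. In case it matters, this is roughly the munge cross-check I ran (a sketch only; node01..node03 stand in for the real SL6 host names):

```shell
# Sketch of a munge cross-check; node01..node03 are placeholder host names.
# A credential encoded on the server must decode on every node, otherwise
# authenticated Slurm RPCs between the daemons will fail.
hosts="node01 node02 node03"
results=""
for h in $hosts; do
    if munge -n 2>/dev/null | ssh -o BatchMode=yes -o ConnectTimeout=5 "$h" unmunge >/dev/null 2>&1; then
        results="$results $h:OK"
    else
        results="$results $h:FAILED"
    fi
done
echo "munge cross-check:$results"
```

Every node reported OK for me, which is why I believe the keys match.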
