Your configuration seems pretty simple. The "agent thread # timed out"
message means that slurmctld's communications with remote daemons were
failing. I'm not sure what that particular thread was trying to
communicate with, but I'd take a look at the SlurmdLogFile(s) on some
compute nodes. Offhand, I'd guess there is a networking issue. Take a
look at the troubleshooting guide if you haven't already:
http://www.schedmd.com/slurmdocs/troubleshoot.html
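As a starting point, the network check above can be sketched as a quick
script (the NODES list is a placeholder for your own hostnames; slurmctld
contacts slurmd on SlurmdPort, which is 6818 in your config):

```shell
# Rough sketch: verify each compute node answers on the slurmd port.
# Replace the NODES placeholder with your actual node names.

port_open() {
    # Attempt a TCP connect with a 2-second timeout via bash's /dev/tcp.
    timeout 2 bash -c "exec 3<>/dev/tcp/$1/$2" 2>/dev/null
}

for node in ${NODES:-}; do          # e.g. NODES="node01 node02"
    if port_open "$node" 6818; then
        echo "$node: slurmd port reachable"
    else
        echo "$node: slurmd port NOT reachable"
    fi
done
```

You can also tail the SlurmdLogFile (/var/log/slurm/slurmd.log per your
config) on a node around the time of a timeout, and run "scontrol ping"
from a compute node to check the reverse path to slurmctld.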
Quoting Mark Arnold <[email protected]>:
Hello again,
I have one other issue with slurm that I haven't been able to figure out.
Since the upgrade to Red Hat 6, SLURM (version 2.2.7) has stopped responding
at seemingly random times (sometimes several times a day; other times it
will go a few days without issues). When this happens, you get the following:
salloc: error: slurm_receive_msg: Socket timed out on send/recv operation
salloc: error: Failed to allocate resources: Socket timed out on send/recv operation
I have bumped up the verbosity on the head node and I get the following in
the slurmctld log:
[2011-09-07T16:12:56] debug3: agent thread 47448673310464 timed out
[2011-09-07T16:13:26] debug3: agent thread 47448673310464 timed out
and from top:
PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
32360 slurm 20 0 4143m 8484 1948 S 200.1 0.0 6:51.61 slurmctld
So far I've tried various things mentioned in the forums/mailing lists,
but nothing has worked. I bumped MessageTimeout up to 30 seconds,
which helped delay the inevitable, but I have run out of ideas. This cluster
averages around 100 allocations a day, so it's not heavily used. I initially
copied my SLURM config over from Red Hat 5 and just recently tried modifying
the example config with my node and partition schemes, but it stopped working
again within a few minutes.
Any help would be greatly appreciated,
Mark
Below is my current running config.
Configuration data as of 2011-09-07T18:01:31
AccountingStorageBackupHost = (null)
AccountingStorageEnforce = none
AccountingStorageHost = localhost
AccountingStorageLoc = /var/log/slurm_jobacct.log
AccountingStoragePort = 0
AccountingStorageType = accounting_storage/none
AccountingStorageUser = root
AuthType = auth/munge
BackupAddr = (null)
BackupController = (null)
BatchStartTimeout = 10 sec
BOOT_TIME = 2011-09-07T17:58:35
CacheGroups = 1
CheckpointType = checkpoint/none
ClusterName = ri
CompleteWait = 0 sec
ControlAddr = head
ControlMachine = head
CryptoType = crypto/munge
DebugFlags = (null)
DefMemPerCPU = UNLIMITED
DisableRootJobs = NO
EnforcePartLimits = NO
Epilog = /usr/local/slurm/epilog
EpilogMsgTime = 2000 usec
EpilogSlurmctld = (null)
FastSchedule = 1
FirstJobId = 1
GetEnvTimeout = 2 sec
GresTypes = (null)
GroupUpdateForce = 0
GroupUpdateTime = 600 sec
HashVal = Match
HealthCheckInterval = 0 sec
HealthCheckProgram = (null)
InactiveLimit = 600 sec
JobAcctGatherFrequency = 30 sec
JobAcctGatherType = jobacct_gather/none
JobCheckpointDir = /var/slurm/checkpoint
JobCompHost = localhost
JobCompLoc = /var/log/slurm_jobcomp.log
JobCompPort = 0
JobCompType = jobcomp/none
JobCompUser = root
JobCredentialPrivateKey = (null)
JobCredentialPublicCertificate = (null)
JobFileAppend = 0
JobRequeue = 1
JobSubmitPlugins = (null)
KillOnBadExit = 0
KillWait = 30 sec
Licenses = (null)
MailProg = /bin/mail
MaxJobCount = 10000
MaxMemPerCPU = UNLIMITED
MaxTasksPerNode = 128
MessageTimeout = 30 sec
MinJobAge = 300 sec
MpiDefault = none
MpiParams = (null)
NEXT_JOB_ID = 2502
OverTimeLimit = 0 min
PluginDir = /usr/lib64/slurm
PlugStackConfig = /etc/slurm/plugstack.conf
PreemptMode = OFF
PreemptType = preempt/none
PriorityType = priority/basic
PrivateData = none
ProctrackType = proctrack/pgid
Prolog = (null)
PrologSlurmctld = (null)
PropagatePrioProcess = 0
PropagateResourceLimits = ALL
PropagateResourceLimitsExcept = (null)
ResumeProgram = (null)
ResumeRate = 300 nodes/min
ResumeTimeout = 60 sec
ResvOverRun = 0 min
ReturnToService = 2
SallocDefaultCommand = (null)
SchedulerParameters = max_job_bf=15,interval=20
SchedulerPort = 7321
SchedulerRootFilter = 1
SchedulerTimeSlice = 30 sec
SchedulerType = sched/backfill
SelectType = select/linear
SlurmUser = slurm(497)
SlurmctldDebug = 9
SlurmctldLogFile = /var/log/slurm/slurmctld.log
SlurmSchedLogFile = (null)
SlurmctldPort = 6817
SlurmctldTimeout = 120 sec
SlurmdDebug = 3
SlurmdLogFile = /var/log/slurm/slurmd.log
SlurmdPidFile = /var/run/slurmd.pid
SlurmdPort = 6818
SlurmdSpoolDir = /tmp/slurmd
SlurmdTimeout = 300 sec
SlurmdUser = root(0)
SlurmSchedLogLevel = 0
SlurmctldPidFile = /var/run/slurmctld.pid
SLURM_CONF = /etc/slurm/slurm.conf
SLURM_VERSION = 2.2.7
SrunEpilog = (null)
SrunProlog = (null)
StateSaveLocation = /tmp
SuspendExcNodes = (null)
SuspendExcParts = (null)
SuspendProgram = (null)
SuspendRate = 60 nodes/min
SuspendTime = NONE
SuspendTimeout = 30 sec
SwitchType = switch/none
TaskEpilog = (null)
TaskPlugin = task/none
TaskPluginParam = (null type)
TaskProlog = (null)
TmpFS = /tmp
TopologyPlugin = topology/none
TrackWCKey = 0
TreeWidth = 50
UsePam = 0
UnkillableStepProgram = (null)
UnkillableStepTimeout = 60 sec
VSizeFactor = 0 percent
WaitTime = 0 sec
Slurmctld(primary/backup) at head/(NULL) are UP/DOWN