Hello again. I have one other issue with Slurm that I haven't been able to figure out. Since the upgrade to RedHat 6, Slurm (version 2.2.7) has stopped responding at seemingly random times (sometimes several times a day; other times it will go a few days without issues). When this happens you get the following:
salloc: error: slurm_receive_msg: Socket timed out on send/recv operation
salloc: error: Failed to allocate resources: Socket timed out on send/recv operation

I have bumped up the verbosity on the head node, and I get the following in the slurm log:

[2011-09-07T16:12:56] debug3: agent thread 47448673310464 timed out
[2011-09-07T16:13:26] debug3: agent thread 47448673310464 timed out

and from top:

  PID USER   PR NI VIRT  RES  SHR  S  %CPU %MEM   TIME+  COMMAND
32360 slurm  20  0 4143m 8484 1948 S 200.1  0.0  6:51.61 slurmctld

So far I've been trying various things mentioned in the forums/mailing lists, but nothing has worked. I bumped MessageTimeout up to 30 seconds, which helped delay the inevitable, but I have run out of ideas. This cluster averages around 100 allocations a day, so it's not heavily used. I initially copied my Slurm config over from RedHat 5, and just recently tried modifying the example config with my node and partition schemes, but it stopped working again within a few minutes.

Any help would be greatly appreciated,
Mark

Below is my current running config.
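In case it helps correlate the hangs with anything, here is a quick sketch (assuming the log format shown above, and the SlurmctldLogFile path from the config below) that tallies the agent-timeout messages per hour; the function name `agent_timeouts` is just illustrative.

```shell
# Illustrative helper: tally "agent thread ... timed out" messages per hour
# from a slurmctld log supplied on stdin, to see when the hangs cluster.
# Usage: agent_timeouts < /var/log/slurm/slurmctld.log
agent_timeouts() {
  grep 'agent thread .* timed out' \
    | sed 's/^\[\([0-9][0-9-]*T[0-9][0-9]*\):.*/\1/' \
    | sort | uniq -c
}
```

For the two log lines above this prints a single bucket for hour 2011-09-07T16 with a count of 2.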
Configuration data as of 2011-09-07T18:01:31
AccountingStorageBackupHost = (null)
AccountingStorageEnforce = none
AccountingStorageHost = localhost
AccountingStorageLoc = /var/log/slurm_jobacct.log
AccountingStoragePort = 0
AccountingStorageType = accounting_storage/none
AccountingStorageUser = root
AuthType = auth/munge
BackupAddr = (null)
BackupController = (null)
BatchStartTimeout = 10 sec
BOOT_TIME = 2011-09-07T17:58:35
CacheGroups = 1
CheckpointType = checkpoint/none
ClusterName = ri
CompleteWait = 0 sec
ControlAddr = head
ControlMachine = head
CryptoType = crypto/munge
DebugFlags = (null)
DefMemPerCPU = UNLIMITED
DisableRootJobs = NO
EnforcePartLimits = NO
Epilog = /usr/local/slurm/epilog
EpilogMsgTime = 2000 usec
EpilogSlurmctld = (null)
FastSchedule = 1
FirstJobId = 1
GetEnvTimeout = 2 sec
GresTypes = (null)
GroupUpdateForce = 0
GroupUpdateTime = 600 sec
HashVal = Match
HealthCheckInterval = 0 sec
HealthCheckProgram = (null)
InactiveLimit = 600 sec
JobAcctGatherFrequency = 30 sec
JobAcctGatherType = jobacct_gather/none
JobCheckpointDir = /var/slurm/checkpoint
JobCompHost = localhost
JobCompLoc = /var/log/slurm_jobcomp.log
JobCompPort = 0
JobCompType = jobcomp/none
JobCompUser = root
JobCredentialPrivateKey = (null)
JobCredentialPublicCertificate = (null)
JobFileAppend = 0
JobRequeue = 1
JobSubmitPlugins = (null)
KillOnBadExit = 0
KillWait = 30 sec
Licenses = (null)
MailProg = /bin/mail
MaxJobCount = 10000
MaxMemPerCPU = UNLIMITED
MaxTasksPerNode = 128
MessageTimeout = 30 sec
MinJobAge = 300 sec
MpiDefault = none
MpiParams = (null)
NEXT_JOB_ID = 2502
OverTimeLimit = 0 min
PluginDir = /usr/lib64/slurm
PlugStackConfig = /etc/slurm/plugstack.conf
PreemptMode = OFF
PreemptType = preempt/none
PriorityType = priority/basic
PrivateData = none
ProctrackType = proctrack/pgid
Prolog = (null)
PrologSlurmctld = (null)
PropagatePrioProcess = 0
PropagateResourceLimits = ALL
PropagateResourceLimitsExcept = (null)
ResumeProgram = (null)
ResumeRate = 300 nodes/min
ResumeTimeout = 60 sec
ResvOverRun = 0 min
ReturnToService = 2
SallocDefaultCommand = (null)
SchedulerParameters = max_job_bf=15,interval=20
SchedulerPort = 7321
SchedulerRootFilter = 1
SchedulerTimeSlice = 30 sec
SchedulerType = sched/backfill
SelectType = select/linear
SlurmUser = slurm(497)
SlurmctldDebug = 9
SlurmctldLogFile = /var/log/slurm/slurmctld.log
SlurmSchedLogFile = (null)
SlurmctldPort = 6817
SlurmctldTimeout = 120 sec
SlurmdDebug = 3
SlurmdLogFile = /var/log/slurm/slurmd.log
SlurmdPidFile = /var/run/slurmd.pid
SlurmdPort = 6818
SlurmdSpoolDir = /tmp/slurmd
SlurmdTimeout = 300 sec
SlurmdUser = root(0)
SlurmSchedLogLevel = 0
SlurmctldPidFile = /var/run/slurmctld.pid
SLURM_CONF = /etc/slurm/slurm.conf
SLURM_VERSION = 2.2.7
SrunEpilog = (null)
SrunProlog = (null)
StateSaveLocation = /tmp
SuspendExcNodes = (null)
SuspendExcParts = (null)
SuspendProgram = (null)
SuspendRate = 60 nodes/min
SuspendTime = NONE
SuspendTimeout = 30 sec
SwitchType = switch/none
TaskEpilog = (null)
TaskPlugin = task/none
TaskPluginParam = (null type)
TaskProlog = (null)
TmpFS = /tmp
TopologyPlugin = topology/none
TrackWCKey = 0
TreeWidth = 50
UsePam = 0
UnkillableStepProgram = (null)
UnkillableStepTimeout = 60 sec
VSizeFactor = 0 percent
WaitTime = 0 sec
Slurmctld(primary/backup) at head/(NULL) are UP/DOWN
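For anyone re-checking individual settings against a live controller: the dump above is in the `key = value` format produced by `scontrol show config`, so a small helper (the name `conf_val` is purely illustrative) can pull out one parameter's current value, e.g. `scontrol show config | conf_val MessageTimeout`.

```shell
# Illustrative helper: given `scontrol show config` style output on stdin,
# print the value of the named parameter (everything after "key = ").
conf_val() {
  awk -v key="$1" '$1 == key { $1 = ""; $2 = ""; sub(/^ +/, ""); print }'
}
```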
