Hello again,

I have one other issue with slurm that I haven't been able to figure out.
Since the upgrade to Red Hat 6, slurm (version 2.2.7) has stopped responding
at seemingly random times (sometimes several times a day; other times it
will go a few days without issue). When this happens, you get the following:

salloc: error: slurm_receive_msg: Socket timed out on send/recv operation
salloc: error: Failed to allocate resources: Socket timed out on send/recv
operation

I bumped up the verbosity on the head node and get the following in the
slurm log:

[2011-09-07T16:12:56] debug3: agent thread 47448673310464 timed out
[2011-09-07T16:13:26] debug3: agent thread 47448673310464 timed out

and from top:

  PID USER      PR  NI  VIRT  RES  SHR S %CPU %MEM    TIME+  COMMAND

32360 slurm     20   0 4143m 8484 1948 S 200.1  0.0   6:51.61 slurmctld


So far I've tried various things mentioned in the forums and mailing lists,
but nothing has worked. I bumped MessageTimeout up to 30 seconds, which
helped delay the inevitable, but I have run out of ideas. This cluster
averages around 100 allocations a day, so it's not heavily used. I initially
copied my slurm config over from Red Hat 5, and just recently tried starting
over from the example config with my node and partition schemes, but slurm
stopped responding again within a few minutes.
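For reference, these are the sorts of checks I've been running on the head
node (the config dump below is the output of scontrol show config; nothing
here is slurm-specific beyond scontrol itself):

```shell
# Confirm the running controller actually picked up the bumped timeout
# (scontrol queries the live slurmctld, not just the file on disk)
scontrol show config | grep -i MessageTimeout

# Snapshot slurmctld's CPU usage when the hang occurs
top -b -n 1 -p "$(pidof slurmctld)"
```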

Any help would be greatly appreciated,

Mark

Below is my current running config.


Configuration data as of 2011-09-07T18:01:31
AccountingStorageBackupHost = (null)
AccountingStorageEnforce = none
AccountingStorageHost   = localhost
AccountingStorageLoc    = /var/log/slurm_jobacct.log
AccountingStoragePort   = 0
AccountingStorageType   = accounting_storage/none
AccountingStorageUser   = root
AuthType                = auth/munge
BackupAddr              = (null)
BackupController        = (null)
BatchStartTimeout       = 10 sec
BOOT_TIME               = 2011-09-07T17:58:35
CacheGroups             = 1
CheckpointType          = checkpoint/none
ClusterName             = ri
CompleteWait            = 0 sec
ControlAddr             = head
ControlMachine          = head
CryptoType              = crypto/munge
DebugFlags              = (null)
DefMemPerCPU            = UNLIMITED
DisableRootJobs         = NO
EnforcePartLimits       = NO
Epilog                  = /usr/local/slurm/epilog
EpilogMsgTime           = 2000 usec
EpilogSlurmctld         = (null)
FastSchedule            = 1
FirstJobId              = 1
GetEnvTimeout           = 2 sec
GresTypes               = (null)
GroupUpdateForce        = 0
GroupUpdateTime         = 600 sec
HashVal                 = Match
HealthCheckInterval     = 0 sec
HealthCheckProgram      = (null)
InactiveLimit           = 600 sec
JobAcctGatherFrequency  = 30 sec
JobAcctGatherType       = jobacct_gather/none
JobCheckpointDir        = /var/slurm/checkpoint
JobCompHost             = localhost
JobCompLoc              = /var/log/slurm_jobcomp.log
JobCompPort             = 0
JobCompType             = jobcomp/none
JobCompUser             = root
JobCredentialPrivateKey = (null)
JobCredentialPublicCertificate = (null)
JobFileAppend           = 0
JobRequeue              = 1
JobSubmitPlugins        = (null)
KillOnBadExit           = 0
KillWait                = 30 sec
Licenses                = (null)
MailProg                = /bin/mail
MaxJobCount             = 10000
MaxMemPerCPU            = UNLIMITED
MaxTasksPerNode         = 128
MessageTimeout          = 30 sec
MinJobAge               = 300 sec
MpiDefault              = none
MpiParams               = (null)
NEXT_JOB_ID             = 2502
OverTimeLimit           = 0 min
PluginDir               = /usr/lib64/slurm
PlugStackConfig         = /etc/slurm/plugstack.conf
PreemptMode             = OFF
PreemptType             = preempt/none
PriorityType            = priority/basic
PrivateData             = none
ProctrackType           = proctrack/pgid
Prolog                  = (null)
PrologSlurmctld         = (null)
PropagatePrioProcess    = 0
PropagateResourceLimits = ALL
PropagateResourceLimitsExcept = (null)
ResumeProgram           = (null)
ResumeRate              = 300 nodes/min
ResumeTimeout           = 60 sec
ResvOverRun             = 0 min
ReturnToService         = 2
SallocDefaultCommand    = (null)
SchedulerParameters     = max_job_bf=15,interval=20
SchedulerPort           = 7321
SchedulerRootFilter     = 1
SchedulerTimeSlice      = 30 sec
SchedulerType           = sched/backfill
SelectType              = select/linear
SlurmUser               = slurm(497)
SlurmctldDebug          = 9
SlurmctldLogFile        = /var/log/slurm/slurmctld.log
SlurmSchedLogFile       = (null)
SlurmctldPort           = 6817
SlurmctldTimeout        = 120 sec
SlurmdDebug             = 3
SlurmdLogFile           = /var/log/slurm/slurmd.log
SlurmdPidFile           = /var/run/slurmd.pid
SlurmdPort              = 6818
SlurmdSpoolDir          = /tmp/slurmd
SlurmdTimeout           = 300 sec
SlurmdUser              = root(0)
SlurmSchedLogLevel      = 0
SlurmctldPidFile        = /var/run/slurmctld.pid
SLURM_CONF              = /etc/slurm/slurm.conf
SLURM_VERSION           = 2.2.7
SrunEpilog              = (null)
SrunProlog              = (null)
StateSaveLocation       = /tmp
SuspendExcNodes         = (null)
SuspendExcParts         = (null)
SuspendProgram          = (null)
SuspendRate             = 60 nodes/min
SuspendTime             = NONE
SuspendTimeout          = 30 sec
SwitchType              = switch/none
TaskEpilog              = (null)
TaskPlugin              = task/none
TaskPluginParam         = (null type)
TaskProlog              = (null)
TmpFS                   = /tmp
TopologyPlugin          = topology/none
TrackWCKey              = 0
TreeWidth               = 50
UsePam                  = 0
UnkillableStepProgram   = (null)
UnkillableStepTimeout   = 60 sec
VSizeFactor             = 0 percent
WaitTime                = 0 sec

Slurmctld(primary/backup) at head/(NULL) are UP/DOWN
