Hi Tim,
Forgive me if the answers to these are buried in the documentation you sent:
1. What version of MPICH2 are you running? (Is it the same on both
compute nodes?)
2. Does "srun -N2 hostname" work as expected?
Andy
On 07/02/2012 07:27 AM, Tim Butters wrote:
Problem with MPICH2 communication between nodes
Hi,
First of all, thanks in advance for any help, and I apologise if I have
missed something simple with this problem; I can't seem to get things
working properly.
I have slurm installed on a small cluster with 1 control node and 2
compute nodes (each compute node has 48 cores). Everything works fine
for jobs running on just one compute node, but when I run an mpi
(MPICH2) job that spans both compute nodes I get the following error:
$ srun -n49 ./a.out
Fatal error in MPI_Send: Other MPI error, error stack:
MPI_Send(174).....................: MPI_Send(buf=0xb3e078, count=22,
MPI_CHAR, dest=0, tag=50, MPI_COMM_WORLD) failed
MPIDI_CH3I_Progress(150)..........:
MPID_nem_mpich2_blocking_recv(948):
MPID_nem_tcp_connpoll(1709).......: Communication error
srun: error: computenode2: task 48: Exited with exit code 1
Hello from 1 computenode1
Hello from 2 computenode1
............etc.
I get the results from the first compute node ("Hello from 1
computenode1", etc.), then the job seems to hang indefinitely.
The jobs are compiled using mpic++ -L/usr/lib64/slurm -lpmi hello.cc.
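For reference, hello.cc is essentially a standard MPI hello-world: every
rank builds a short greeting and sends it to rank 0 with MPI_Send, and
rank 0 prints the greetings as they arrive. The sketch below is a
reconstruction of that kind of program (the char-buffer send to rank 0
with tag 50 matches the error stack above, but the exact source may
differ slightly):

// hello.cc -- minimal sketch of the program in use here (reconstructed)
#include <mpi.h>
#include <cstdio>
#include <cstring>

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);

    int rank, size;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    char name[MPI_MAX_PROCESSOR_NAME];
    int len;
    MPI_Get_processor_name(name, &len);

    // Each rank formats "Hello from <rank> <hostname>"
    char msg[128];
    snprintf(msg, sizeof(msg), "Hello from %d %s", rank, name);

    if (rank != 0) {
        // Non-root ranks send their greeting to rank 0 (tag 50)
        MPI_Send(msg, (int)strlen(msg) + 1, MPI_CHAR, 0, 50, MPI_COMM_WORLD);
    } else {
        // Rank 0 prints its own greeting, then collects the rest in order
        printf("%s\n", msg);
        for (int src = 1; src < size; ++src) {
            char buf[128];
            MPI_Recv(buf, sizeof(buf), MPI_CHAR, src, 50, MPI_COMM_WORLD,
                     MPI_STATUS_IGNORE);
            printf("%s\n", buf);
        }
    }

    MPI_Finalize();
    return 0;
}

With 49 tasks and 48 cores per node, rank 48 would be the only task on
computenode2, which seems consistent with task 48 being the one whose
MPI_Send fails while rank 0 waits on it.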
If the job is run using sbatch, the error file contains this line:
"srun: error: slurm_send_recv_rc_msg_only_one: Connection timed out"
I have run slurmctld and slurmd in terminals (-vvvvvvvvv) but haven't
been able to find anything useful in the messages. I have attached the
output to this email (controlnode.txt, computenode1.txt and
computenode2.txt). All IP addresses have been replaced with <*.*.*.*>.
I have also attached the log files for slurmctld and slurmd from the
two compute nodes.
Many thanks for your help,
Tim
scontrol show config:
Configuration data as of 2012-07-02T11:41:01
AccountingStorageBackupHost = (null)
AccountingStorageEnforce = none
AccountingStorageHost = localhost
AccountingStorageLoc = /var/log/slurm_jobacct.log
AccountingStoragePort = 0
AccountingStorageType = accounting_storage/none
AccountingStorageUser = root
AccountingStoreJobComment = YES
AuthType = auth/munge
BackupAddr = (null)
BackupController = (null)
BatchStartTimeout = 10 sec
BOOT_TIME = 2012-07-02T11:40:40
CacheGroups = 0
CheckpointType = checkpoint/none
ClusterName = cluster
CompleteWait = 0 sec
ControlAddr = <*.*.*.*>
ControlMachine = controlnode
CryptoType = crypto/munge
DebugFlags = (null)
DefMemPerNode = UNLIMITED
DisableRootJobs = NO
EnforcePartLimits = NO
Epilog = (null)
EpilogMsgTime = 2000 usec
EpilogSlurmctld = (null)
FastSchedule = 1
FirstJobId = 1
GetEnvTimeout = 2 sec
GresTypes = (null)
GroupUpdateForce = 0
GroupUpdateTime = 600 sec
HASH_VAL = Match
HealthCheckInterval = 0 sec
HealthCheckProgram = (null)
InactiveLimit = 0 sec
JobAcctGatherFrequency = 30 sec
JobAcctGatherType = jobacct_gather/none
JobCheckpointDir = /var/slurm/checkpoint
JobCompHost = localhost
JobCompLoc = /var/log/slurm_jobcomp.log
JobCompPort = 0
JobCompType = jobcomp/none
JobCompUser = root
JobCredentialPrivateKey = (null)
JobCredentialPublicCertificate = (null)
JobFileAppend = 0
JobRequeue = 1
JobSubmitPlugins = (null)
KillOnBadExit = 0
KillWait = 30 sec
Licenses = (null)
MailProg = /bin/mail
MaxJobCount = 10000
MaxJobId = 4294901760
MaxMemPerNode = UNLIMITED
MaxStepCount = 40000
MaxTasksPerNode = 128
MessageTimeout = 10 sec
MinJobAge = 300 sec
MpiDefault = none
MpiParams = (null)
NEXT_JOB_ID = 205
OverTimeLimit = 0 min
PluginDir = /usr/lib64/slurm
PlugStackConfig = /etc/slurm/plugstack.conf
PreemptMode = OFF
PreemptType = preempt/none
PriorityType = priority/basic
PrivateData = none
ProctrackType = proctrack/pgid
Prolog = (null)
PrologSlurmctld = (null)
PropagatePrioProcess = 0
PropagateResourceLimits = ALL
PropagateResourceLimitsExcept = (null)
ResumeProgram = (null)
ResumeRate = 300 nodes/min
ResumeTimeout = 60 sec
ResvOverRun = 0 min
ReturnToService = 1
SallocDefaultCommand = (null)
SchedulerParameters = (null)
SchedulerPort = 7321
SchedulerRootFilter = 1
SchedulerTimeSlice = 30 sec
SchedulerType = sched/backfill
SelectType = select/cons_res
SelectTypeParameters = CR_CPU
SlurmUser = slurm(494)
SlurmctldDebug = 3
SlurmctldLogFile = /var/log/slurm/slurmctld
SlurmSchedLogFile = (null)
SlurmctldPort = 6817
SlurmctldTimeout = 120 sec
SlurmdDebug = 3
SlurmdLogFile = /var/log/slurm/slurmd
SlurmdPidFile = /var/run/slurm/slurmd.pid
SlurmdPort = 6818
SlurmdSpoolDir = /var/spool/slurm/slurmd
SlurmdTimeout = 300 sec
SlurmdUser = root(0)
SlurmSchedLogLevel = 0
SlurmctldPidFile = /var/run/slurm/slurmctld.pid
SLURM_CONF = /etc/slurm/slurm.conf
SLURM_VERSION = 2.3.5
SrunEpilog = (null)
SrunProlog = (null)
StateSaveLocation = /var/tmp
SuspendExcNodes = (null)
SuspendExcParts = (null)
SuspendProgram = (null)
SuspendRate = 60 nodes/min
SuspendTime = NONE
SuspendTimeout = 30 sec
SwitchType = switch/none
TaskEpilog = (null)
TaskPlugin = task/none
TaskPluginParam = (null type)
TaskProlog = (null)
TmpFS = /tmp
TopologyPlugin = topology/none
TrackWCKey = 0
TreeWidth = 50
UsePam = 0
UnkillableStepProgram = (null)
UnkillableStepTimeout = 60 sec
VSizeFactor = 0 percent
WaitTime = 0 sec
Slurmctld(primary/backup) at controlnode/(NULL) are UP/DOWN
--
Andy Riebs
Hewlett-Packard Company
High Performance Computing
+1-786-263-9743
My opinions are not necessarily those of HP