Hi Andy,

1. The MPICH2 version is 1.2.1 and is the same across all nodes.
2. Yes, that successfully returns the hostname of each compute node.
Thanks,
Tim

On 2 July 2012 13:29, Andy Riebs <[email protected]> wrote:
> Hi Tim,
>
> Forgive me if the answers to these are buried in the documentation you sent:
>
> 1. What version of MPICH2 are you running? (Is it the same on both control nodes?)
> 2. Does "srun -N2 hostname" work as expected?
>
> Andy
>
> On 07/02/2012 07:27 AM, Tim Butters wrote:
>
> Hi,
>
> First of all, thanks in advance for any help, and I apologise if I have
> missed something simple here; I can't seem to get things working properly.
>
> I have SLURM installed on a small cluster with 1 control node and 2
> compute nodes (each compute node has 48 cores). Everything works fine for
> jobs running on just one compute node, but when I run an MPI (MPICH2) job
> that spans both compute nodes I get the following error:
>
> $ srun -n49 ./a.out
> Fatal error in MPI_Send: Other MPI error, error stack:
> MPI_Send(174).....................: MPI_Send(buf=0xb3e078, count=22,
> MPI_CHAR, dest=0, tag=50, MPI_COMM_WORLD) failed
> MPIDI_CH3I_Progress(150)..........:
> MPID_nem_mpich2_blocking_recv(948):
> MPID_nem_tcp_connpoll(1709).......: Communication error
> srun: error: computenode2: task 48: Exited with exit code 1
> Hello from 1 computenode1
> Hello from 2 computenode1
> ............etc.
>
> I get the results from the first compute node ("Hello from 1
> computenode1"), then it seems to hang indefinitely.
>
> The jobs are compiled using mpic++ -L/usr/lib64/slurm -lpmi hello.cc.
>
> If run using sbatch, the error file contains this line:
> "srun: error: slurm_send_recv_rc_msg_only_one: Connection timed out"
>
> I have run slurmctld and slurmd in terminals (-vvvvvvvvv) but haven't
> been able to find anything useful in the messages. I have attached the
> output to this email (controldnode.txt, computenode1.txt and
> computenode2.txt). All IP addresses have been replaced with <*.*.*.*>. I
> have also attached the log files for slurmctld and slurmd from the two
> compute nodes.
> Many thanks for your help,
>
> Tim
>
> scontrol show config:
> Configuration data as of 2012-07-02T11:41:01
> AccountingStorageBackupHost = (null)
> AccountingStorageEnforce = none
> AccountingStorageHost = localhost
> AccountingStorageLoc = /var/log/slurm_jobacct.log
> AccountingStoragePort = 0
> AccountingStorageType = accounting_storage/none
> AccountingStorageUser = root
> AccountingStoreJobComment = YES
> AuthType = auth/munge
> BackupAddr = (null)
> BackupController = (null)
> BatchStartTimeout = 10 sec
> BOOT_TIME = 2012-07-02T11:40:40
> CacheGroups = 0
> CheckpointType = checkpoint/none
> ClusterName = cluster
> CompleteWait = 0 sec
> ControlAddr = <*.*.*.*>
> ControlMachine = controlnode
> CryptoType = crypto/munge
> DebugFlags = (null)
> DefMemPerNode = UNLIMITED
> DisableRootJobs = NO
> EnforcePartLimits = NO
> Epilog = (null)
> EpilogMsgTime = 2000 usec
> EpilogSlurmctld = (null)
> FastSchedule = 1
> FirstJobId = 1
> GetEnvTimeout = 2 sec
> GresTypes = (null)
> GroupUpdateForce = 0
> GroupUpdateTime = 600 sec
> HASH_VAL = Match
> HealthCheckInterval = 0 sec
> HealthCheckProgram = (null)
> InactiveLimit = 0 sec
> JobAcctGatherFrequency = 30 sec
> JobAcctGatherType = jobacct_gather/none
> JobCheckpointDir = /var/slurm/checkpoint
> JobCompHost = localhost
> JobCompLoc = /var/log/slurm_jobcomp.log
> JobCompPort = 0
> JobCompType = jobcomp/none
> JobCompUser = root
> JobCredentialPrivateKey = (null)
> JobCredentialPublicCertificate = (null)
> JobFileAppend = 0
> JobRequeue = 1
> JobSubmitPlugins = (null)
> KillOnBadExit = 0
> KillWait = 30 sec
> Licenses = (null)
> MailProg = /bin/mail
> MaxJobCount = 10000
> MaxJobId = 4294901760
> MaxMemPerNode = UNLIMITED
> MaxStepCount = 40000
> MaxTasksPerNode = 128
> MessageTimeout = 10 sec
> MinJobAge = 300 sec
> MpiDefault = none
> MpiParams = (null)
> NEXT_JOB_ID = 205
> OverTimeLimit = 0 min
> PluginDir = /usr/lib64/slurm
> PlugStackConfig = /etc/slurm/plugstack.conf
> PreemptMode = OFF
> PreemptType = preempt/none
> PriorityType = priority/basic
> PrivateData = none
> ProctrackType = proctrack/pgid
> Prolog = (null)
> PrologSlurmctld = (null)
> PropagatePrioProcess = 0
> PropagateResourceLimits = ALL
> PropagateResourceLimitsExcept = (null)
> ResumeProgram = (null)
> ResumeRate = 300 nodes/min
> ResumeTimeout = 60 sec
> ResvOverRun = 0 min
> ReturnToService = 1
> SallocDefaultCommand = (null)
> SchedulerParameters = (null)
> SchedulerPort = 7321
> SchedulerRootFilter = 1
> SchedulerTimeSlice = 30 sec
> SchedulerType = sched/backfill
> SelectType = select/cons_res
> SelectTypeParameters = CR_CPU
> SlurmUser = slurm(494)
> SlurmctldDebug = 3
> SlurmctldLogFile = /var/log/slurm/slurmctld
> SlurmSchedLogFile = (null)
> SlurmctldPort = 6817
> SlurmctldTimeout = 120 sec
> SlurmdDebug = 3
> SlurmdLogFile = /var/log/slurm/slurmd
> SlurmdPidFile = /var/run/slurm/slurmd.pid
> SlurmdPort = 6818
> SlurmdSpoolDir = /var/spool/slurm/slurmd
> SlurmdTimeout = 300 sec
> SlurmdUser = root(0)
> SlurmSchedLogLevel = 0
> SlurmctldPidFile = /var/run/slurm/slurmctld.pid
> SLURM_CONF = /etc/slurm/slurm.conf
> SLURM_VERSION = 2.3.5
> SrunEpilog = (null)
> SrunProlog = (null)
> StateSaveLocation = /var/tmp
> SuspendExcNodes = (null)
> SuspendExcParts = (null)
> SuspendProgram = (null)
> SuspendRate = 60 nodes/min
> SuspendTime = NONE
> SuspendTimeout = 30 sec
> SwitchType = switch/none
> TaskEpilog = (null)
> TaskPlugin = task/none
> TaskPluginParam = (null type)
> TaskProlog = (null)
> TmpFS = /tmp
> TopologyPlugin = topology/none
> TrackWCKey = 0
> TreeWidth = 50
> UsePam = 0
> UnkillableStepProgram = (null)
> UnkillableStepTimeout = 60 sec
> VSizeFactor = 0 percent
> WaitTime = 0 sec
>
> Slurmctld(primary/backup) at controlnode/(NULL) are UP/DOWN
>
> --
> Andy Riebs
> Hewlett-Packard Company
> High Performance Computing
> +1-786-263-9743
> My opinions are not necessarily those of HP
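
For reference, here is a minimal sketch of what an MPI "hello" program producing the output and error in the thread could look like. The actual hello.cc is not shown in the emails, so the structure below is an assumption, reconstructed from the reported output ("Hello from <rank> <hostname>") and the error stack (an MPI_Send with tag 50 to rank 0):

```cpp
// Hypothetical reconstruction of hello.cc -- not the poster's actual code.
// Rank 0 prints its own greeting and receives one message from every other
// rank; a cross-node MPI_Send like the one below is what fails with
// "MPID_nem_tcp_connpoll ... Communication error" in the report.
#include <mpi.h>
#include <unistd.h>   // gethostname
#include <cstdio>

int main(int argc, char** argv) {
    MPI_Init(&argc, &argv);

    int rank = 0, size = 0;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    char host[256] = {0};
    gethostname(host, sizeof(host) - 1);

    if (rank == 0) {
        std::printf("Hello from 0 %s\n", host);
        for (int src = 1; src < size; ++src) {
            char buf[256] = {0};
            MPI_Recv(buf, sizeof(buf), MPI_CHAR, src, 50, MPI_COMM_WORLD,
                     MPI_STATUS_IGNORE);
            std::printf("%s\n", buf);
        }
    } else {
        char msg[256] = {0};
        std::snprintf(msg, sizeof(msg), "Hello from %d %s", rank, host);
        // tag 50 and dest=0 match the failing MPI_Send in the error stack
        MPI_Send(msg, sizeof(msg), MPI_CHAR, 0, 50, MPI_COMM_WORLD);
    }

    MPI_Finalize();
    return 0;
}
```

Built and launched as described in the thread (mpic++ -L/usr/lib64/slurm -lpmi hello.cc, then srun -n49 ./a.out), the 49th task lands on the second compute node, so the first send that must cross nodes is where a TCP connectivity problem between the compute nodes would surface.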
