Hi Andy,

1. The MPICH2 version is 1.2.1 and is the same across all nodes.
2. Yes, that successfully returns the hostname of each compute node.
Thanks,
Tim

On 2 July 2012 13:29, Andy Riebs <[email protected]> wrote:
> Hi Tim,
>
> Forgive me if the answers to these are buried in the documentation you sent:
>
> 1. What version of MPICH2 are you running? (Is it the same on both control nodes?)
> 2. Does "srun -N2 hostname" work as expected?
>
> Andy
>
> On 07/02/2012 07:27 AM, Tim Butters wrote:
>
> Hi,
>
> First of all, thanks in advance for any help, and I apologise if I have
> missed something simple here; I can't seem to get things working properly.
>
> I have SLURM installed on a small cluster with 1 control node and 2
> compute nodes (each compute node has 48 cores). Everything works fine for
> jobs running on just one compute node, but when I run an MPI (MPICH2) job
> that spans both compute nodes I get the following error:
>
> $ srun -n49 ./a.out
> Fatal error in MPI_Send: Other MPI error, error stack:
> MPI_Send(174).....................: MPI_Send(buf=0xb3e078, count=22,
> MPI_CHAR, dest=0, tag=50, MPI_COMM_WORLD) failed
> MPIDI_CH3I_Progress(150)..........:
> MPID_nem_mpich2_blocking_recv(948):
> MPID_nem_tcp_connpoll(1709).......: Communication error
> srun: error: computenode2: task 48: Exited with exit code 1
> Hello from 1 computenode1
> Hello from 2 computenode1
> ............etc.
>
> I get the results from the first compute node ("Hello from 1
> computenode1"), then it seems to hang indefinitely.
>
> The jobs are compiled using mpic++ -L/usr/lib64/slurm -lpmi hello.cc.
>
> If run using sbatch, the error file contains this line:
> "srun: error: slurm_send_recv_rc_msg_only_one: Connection timed out"
>
> I have run slurmctld and slurmd in terminals (-vvvvvvvvv) but haven't
> been able to find anything useful in the messages. I have attached the
> output to this email (controldnode.txt, computenode1.txt and
> computenode2.txt). All IP addresses have been replaced with <*.*.*.*>. I
> have also attached the log files for slurmctld and slurmd from the two
> compute nodes.
> Many thanks for your help,
>
> Tim
>
> scontrol show config:
> Configuration data as of 2012-07-02T11:41:01
> AccountingStorageBackupHost = (null)
> AccountingStorageEnforce = none
> AccountingStorageHost = localhost
> AccountingStorageLoc = /var/log/slurm_jobacct.log
> AccountingStoragePort = 0
> AccountingStorageType = accounting_storage/none
> AccountingStorageUser = root
> AccountingStoreJobComment = YES
> AuthType = auth/munge
> BackupAddr = (null)
> BackupController = (null)
> BatchStartTimeout = 10 sec
> BOOT_TIME = 2012-07-02T11:40:40
> CacheGroups = 0
> CheckpointType = checkpoint/none
> ClusterName = cluster
> CompleteWait = 0 sec
> ControlAddr = <*.*.*.*>
> ControlMachine = controlnode
> CryptoType = crypto/munge
> DebugFlags = (null)
> DefMemPerNode = UNLIMITED
> DisableRootJobs = NO
> EnforcePartLimits = NO
> Epilog = (null)
> EpilogMsgTime = 2000 usec
> EpilogSlurmctld = (null)
> FastSchedule = 1
> FirstJobId = 1
> GetEnvTimeout = 2 sec
> GresTypes = (null)
> GroupUpdateForce = 0
> GroupUpdateTime = 600 sec
> HASH_VAL = Match
> HealthCheckInterval = 0 sec
> HealthCheckProgram = (null)
> InactiveLimit = 0 sec
> JobAcctGatherFrequency = 30 sec
> JobAcctGatherType = jobacct_gather/none
> JobCheckpointDir = /var/slurm/checkpoint
> JobCompHost = localhost
> JobCompLoc = /var/log/slurm_jobcomp.log
> JobCompPort = 0
> JobCompType = jobcomp/none
> JobCompUser = root
> JobCredentialPrivateKey = (null)
> JobCredentialPublicCertificate = (null)
> JobFileAppend = 0
> JobRequeue = 1
> JobSubmitPlugins = (null)
> KillOnBadExit = 0
> KillWait = 30 sec
> Licenses = (null)
> MailProg = /bin/mail
> MaxJobCount = 10000
> MaxJobId = 4294901760
> MaxMemPerNode = UNLIMITED
> MaxStepCount = 40000
> MaxTasksPerNode = 128
> MessageTimeout = 10 sec
> MinJobAge = 300 sec
> MpiDefault = none
> MpiParams = (null)
> NEXT_JOB_ID = 205
> OverTimeLimit = 0 min
> PluginDir = /usr/lib64/slurm
> PlugStackConfig = /etc/slurm/plugstack.conf
> PreemptMode = OFF
> PreemptType = preempt/none
> PriorityType = priority/basic
> PrivateData = none
> ProctrackType = proctrack/pgid
> Prolog = (null)
> PrologSlurmctld = (null)
> PropagatePrioProcess = 0
> PropagateResourceLimits = ALL
> PropagateResourceLimitsExcept = (null)
> ResumeProgram = (null)
> ResumeRate = 300 nodes/min
> ResumeTimeout = 60 sec
> ResvOverRun = 0 min
> ReturnToService = 1
> SallocDefaultCommand = (null)
> SchedulerParameters = (null)
> SchedulerPort = 7321
> SchedulerRootFilter = 1
> SchedulerTimeSlice = 30 sec
> SchedulerType = sched/backfill
> SelectType = select/cons_res
> SelectTypeParameters = CR_CPU
> SlurmUser = slurm(494)
> SlurmctldDebug = 3
> SlurmctldLogFile = /var/log/slurm/slurmctld
> SlurmSchedLogFile = (null)
> SlurmctldPort = 6817
> SlurmctldTimeout = 120 sec
> SlurmdDebug = 3
> SlurmdLogFile = /var/log/slurm/slurmd
> SlurmdPidFile = /var/run/slurm/slurmd.pid
> SlurmdPort = 6818
> SlurmdSpoolDir = /var/spool/slurm/slurmd
> SlurmdTimeout = 300 sec
> SlurmdUser = root(0)
> SlurmSchedLogLevel = 0
> SlurmctldPidFile = /var/run/slurm/slurmctld.pid
> SLURM_CONF = /etc/slurm/slurm.conf
> SLURM_VERSION = 2.3.5
> SrunEpilog = (null)
> SrunProlog = (null)
> StateSaveLocation = /var/tmp
> SuspendExcNodes = (null)
> SuspendExcParts = (null)
> SuspendProgram = (null)
> SuspendRate = 60 nodes/min
> SuspendTime = NONE
> SuspendTimeout = 30 sec
> SwitchType = switch/none
> TaskEpilog = (null)
> TaskPlugin = task/none
> TaskPluginParam = (null type)
> TaskProlog = (null)
> TmpFS = /tmp
> TopologyPlugin = topology/none
> TrackWCKey = 0
> TreeWidth = 50
> UsePam = 0
> UnkillableStepProgram = (null)
> UnkillableStepTimeout = 60 sec
> VSizeFactor = 0 percent
> WaitTime = 0 sec
>
> Slurmctld(primary/backup) at controlnode/(NULL) are UP/DOWN
>
> --
> Andy Riebs
> Hewlett-Packard Company
> High Performance Computing
> +1-786-263-9743
> My opinions are not necessarily those of HP
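
For reference, here is a minimal sketch of what an MPI "hello" program producing the output and error in the thread could look like. The actual hello.cc is not shown in the emails, so the structure below is an assumption, reconstructed from the reported output ("Hello from <rank> <hostname>") and the error stack (an MPI_Send with tag 50 to rank 0):

```cpp
// Hypothetical reconstruction of hello.cc -- not the poster's actual code.
// Rank 0 prints its own greeting and receives one message from every other
// rank; a cross-node MPI_Send like the one below is what fails with
// "MPID_nem_tcp_connpoll ... Communication error" in the report.
#include <mpi.h>
#include <unistd.h>   // gethostname
#include <cstdio>

int main(int argc, char** argv) {
    MPI_Init(&argc, &argv);

    int rank = 0, size = 0;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    char host[256] = {0};
    gethostname(host, sizeof(host) - 1);

    if (rank == 0) {
        std::printf("Hello from 0 %s\n", host);
        for (int src = 1; src < size; ++src) {
            char buf[256] = {0};
            MPI_Recv(buf, sizeof(buf), MPI_CHAR, src, 50, MPI_COMM_WORLD,
                     MPI_STATUS_IGNORE);
            std::printf("%s\n", buf);
        }
    } else {
        char msg[256] = {0};
        std::snprintf(msg, sizeof(msg), "Hello from %d %s", rank, host);
        // tag 50 and dest=0 match the failing MPI_Send in the error stack
        MPI_Send(msg, sizeof(msg), MPI_CHAR, 0, 50, MPI_COMM_WORLD);
    }

    MPI_Finalize();
    return 0;
}
```

Built and launched as described in the thread (mpic++ -L/usr/lib64/slurm -lpmi hello.cc, then srun -n49 ./a.out), the 49th task lands on the second compute node, so the first send that must cross nodes is where a TCP connectivity problem between the compute nodes would surface.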
