Dean,

That typically occurs when some nodes have multiple interfaces and several nodes end up with the same IP address on a private/unused interface.
I suggest you explicitly restrict the interfaces Open MPI should be using. For example, you can run
mpirun --mca btl_tcp_if_include eth0 ...
(a fuller command line is at the bottom of this mail).

Cheers,

Gilles

On Fri, Nov 27, 2020 at 7:36 PM CHESTER, DEAN (PGR) via users <users@lists.open-mpi.org> wrote:
>
> Hi,
>
> I am trying to set up some machines with Open MPI connected over Ethernet to
> expand a batch system we already have in use.
>
> This is already controlled with Slurm, and we are able to get a basic MPI
> program running across 2 of the machines, but when I compile and run something
> that actually performs communication, it fails.
>
> Slurm was not configured with PMI/PMI2, so we need to run with mpirun for
> program execution.
>
> Open MPI is installed in my home directory, which is accessible on all of the
> nodes we are trying to run on.
>
> My hello world application gets the world size, rank and hostname and prints
> them. It launches and runs successfully:
>
> Hello world from processor viper-03, rank 0 out of 8 processors
> Hello world from processor viper-03, rank 1 out of 8 processors
> Hello world from processor viper-03, rank 2 out of 8 processors
> Hello world from processor viper-03, rank 3 out of 8 processors
> Hello world from processor viper-04, rank 4 out of 8 processors
> Hello world from processor viper-04, rank 5 out of 8 processors
> Hello world from processor viper-04, rank 6 out of 8 processors
> Hello world from processor viper-04, rank 7 out of 8 processors
>
> I then tried to run the OSU micro-benchmarks, but these fail to run. I get the
> following output:
>
> # OSU MPI Latency Test v5.6.3
> # Size          Latency (us)
> [viper-01:25885] [[21336,0],0] ORTE_ERROR_LOG: Data unpack would read past
> end of buffer in file util/show_help.c at line 507
> --------------------------------------------------------------------------
> WARNING: Open MPI accepted a TCP connection from what appears to be a
> another Open MPI process but cannot find a corresponding process
> entry for that peer.
>
> This attempted connection will be ignored; your MPI job may or may not
> continue properly.
>
>   Local host: viper-02
>   PID:        20406
> --------------------------------------------------------------------------
>
> The machines are firewalled, but ports 9000-9060 are open. I have set the
> following MCA parameters to match the open ports:
>
> btl_tcp_port_min_v4=9000
> btl_tcp_port_range_v4=60
> oob_tcp_dynamic_ipv4_ports=9020
>
> Open MPI 4.0.5 was built with GCC 4.8.5; the only configure option set was the
> installation prefix, $HOME/local/ompi.
>
> What else could be going wrong?
>
> Kind Regards,
>
> Dean
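P.S. Here is a sketch of a fuller command line, assuming eth0 is the name of the interface that carries the cluster traffic on every node (adjust it to whatever name, or subnet, your nodes actually use), reusing the port settings from your mail, and using ./osu_latency as a placeholder for the benchmark binary:

    # restrict both MPI (btl/tcp) and runtime (oob/tcp) traffic to eth0,
    # and keep the TCP ports within the range opened in the firewall
    mpirun --mca btl_tcp_if_include eth0 \
           --mca oob_tcp_if_include eth0 \
           --mca btl_tcp_port_min_v4 9000 \
           --mca btl_tcp_port_range_v4 60 \
           --mca oob_tcp_dynamic_ipv4_ports 9020 \
           -np 2 ./osu_latency

Note osu_latency only runs with exactly two processes, hence -np 2. If the interface names differ between nodes, you can pass a subnet instead of a name, e.g. --mca btl_tcp_if_include 192.168.1.0/24.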