Dean,

That typically occurs when nodes have multiple interfaces, and several
nodes have the same IP address on a private/unused interface.

I suggest you explicitly restrict Open MPI to the interface it should use.
For example:

mpirun --mca btl_tcp_if_include eth0 ...
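
Note the ORTE_ERROR_LOG message comes from the runtime's out-of-band (OOB)
layer rather than the TCP BTL, so it can help to pin both to the same
interface (assuming eth0 is indeed the interface your nodes share):

mpirun --mca btl_tcp_if_include eth0 --mca oob_tcp_if_include eth0 ...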

Cheers,

Gilles

On Fri, Nov 27, 2020 at 7:36 PM CHESTER, DEAN (PGR) via users
<users@lists.open-mpi.org> wrote:
>
> Hi,
>
> I am trying to set up some machines with Open MPI, connected over Ethernet,
> to expand a batch system we already have in use.
>
> This is already controlled with Slurm, and we are able to get a basic MPI
> program running across 2 of the machines, but when I compile and run
> something that actually performs communication, it fails.
>
> Slurm was not configured with PMI/PMI2, so we have to launch programs with
> mpirun.
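>
> Inside a Slurm allocation, the launch is essentially (node counts and the
> binary name here are just illustrative):
>
>   salloc -N 2 -n 8
>   mpirun -np 8 ./hello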
>
> Open MPI is installed in my home directory, which is accessible on all of
> the nodes we are trying to run on.
>
> My hello world application gets the world size, rank, and hostname and
> prints them. It launches and runs successfully.
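>
> The code is essentially the textbook MPI hello world, something like:
>
>   #include <mpi.h>
>   #include <stdio.h>
>
>   int main(int argc, char *argv[]) {
>       int size, rank, len;
>       char name[MPI_MAX_PROCESSOR_NAME];
>
>       MPI_Init(&argc, &argv);
>       MPI_Comm_size(MPI_COMM_WORLD, &size);  /* number of ranks */
>       MPI_Comm_rank(MPI_COMM_WORLD, &rank);  /* this process's rank */
>       MPI_Get_processor_name(name, &len);    /* host name */
>       printf("Hello world from processor %s, rank %d out of %d processors\n",
>              name, rank, size);
>       MPI_Finalize();
>       return 0;
>   }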
>
> Hello world from processor viper-03, rank 0 out of 8 processors
> Hello world from processor viper-03, rank 1 out of 8 processors
> Hello world from processor viper-03, rank 2 out of 8 processors
> Hello world from processor viper-03, rank 3 out of 8 processors
> Hello world from processor viper-04, rank 4 out of 8 processors
> Hello world from processor viper-04, rank 5 out of 8 processors
> Hello world from processor viper-04, rank 6 out of 8 processors
> Hello world from processor viper-04, rank 7 out of 8 processors
>
> I then tried to run the OSU micro-benchmarks, but these fail. I get the
> following output:
>
> # OSU MPI Latency Test v5.6.3
> # Size          Latency (us)
> [viper-01:25885] [[21336,0],0] ORTE_ERROR_LOG: Data unpack would read past 
> end of buffer in file util/show_help.c at line 507
> --------------------------------------------------------------------------
> WARNING: Open MPI accepted a TCP connection from what appears to be a
> another Open MPI process but cannot find a corresponding process
> entry for that peer.
>
> This attempted connection will be ignored; your MPI job may or may not
> continue properly.
>
>   Local host: viper-02
>   PID:        20406
> --------------------------------------------------------------------------
>
> The machines are firewalled, but ports 9000-9060 are open. I have set the
> following MCA parameters to match the open ports:
>
> btl_tcp_port_min_v4=9000
> btl_tcp_port_range_v4=60
> oob_tcp_dynamic_ipv4_ports=9020
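>
> For reference, the equivalent command-line form would be (the BTL then
> uses ports 9000-9059, and the last parameter restricts the runtime's
> out-of-band connections):
>
>   mpirun --mca btl_tcp_port_min_v4 9000 \
>          --mca btl_tcp_port_range_v4 60 \
>          --mca oob_tcp_dynamic_ipv4_ports 9020 ...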
>
> Open MPI 4.0.5 was built with GCC 4.8.5; the only configure option set was
> the installation prefix, $HOME/local/ompi.
>
> What else could be going wrong?
>
> Kind Regards,
>
> Dean
