Re: [OMPI users] Unable to run complicated MPI Program

2020-11-28 Thread CHESTER, DEAN (PGR) via users
Thank you, explicitly setting the interface as shown below has resolved this.

Thanks, 

Dean 

> On 28 Nov 2020, at 10:27, Gilles Gouaillardet via users wrote:
> 
> Dean,
> 
> That typically occurs when some nodes have multiple interfaces, and
> several nodes have a similar IP on a private/unused interface.
> 
> I suggest you explicitly restrict the interface Open MPI should be using.
> For example, you can run
> 
> mpirun --mca btl_tcp_if_include eth0 ...
> 
> Cheers,
> 
> Gilles



Re: [OMPI users] Unable to run complicated MPI Program

2020-11-28 Thread Gilles Gouaillardet via users
Dean,

That typically occurs when some nodes have multiple interfaces, and
several nodes have a similar IP on a private/unused interface.

I suggest you explicitly restrict the interface Open MPI should be using.
For example, you can run

mpirun --mca btl_tcp_if_include eth0 ...
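
One way to check whether this situation applies is to collect the IPv4 addresses of every interface on every node and look for addresses that show up on more than one host. The Python sketch below is only an illustration: the function name `find_shared_addresses`, the interface names, and all addresses are made up for the example and do not come from the cluster discussed in this thread.

```python
from collections import defaultdict


def find_shared_addresses(host_interfaces):
    """Map each address seen on more than one host to its (host, interface) pairs."""
    seen = defaultdict(list)
    for host, interfaces in host_interfaces.items():
        for name, address in interfaces.items():
            seen[address].append((host, name))
    shared = {}
    for address, owners in seen.items():
        # Only report addresses that belong to at least two different hosts.
        if len({host for host, _ in owners}) > 1:
            shared[address] = sorted(owners)
    return shared


# Hypothetical inventory; in practice this would be gathered on each node.
host_interfaces = {
    "viper-01": {"eth0": "10.0.0.1", "virbr0": "192.168.122.1"},
    "viper-02": {"eth0": "10.0.0.2", "virbr0": "192.168.122.1"},
}

for address, owners in sorted(find_shared_addresses(host_interfaces).items()):
    where = ", ".join(f"{host}:{name}" for host, name in owners)
    print(f"{address} appears on {where}")
```

Any interface reported here is a candidate to leave out, which is what restricting `btl_tcp_if_include` to the one interface that is unique per node achieves.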

Cheers,

Gilles



[OMPI users] Unable to run complicated MPI Program

2020-11-27 Thread CHESTER, DEAN (PGR) via users
Hi, 

I am trying to set up some machines, connected via Ethernet, with Open MPI in 
order to expand a batch system we already have in use. 

The system is already controlled by Slurm, and we are able to get a basic MPI 
program running across two of the machines, but when I compile and run 
something that actually performs communication, it fails. 

Slurm was not configured with PMI/PMI2, so we have to launch programs with 
mpirun. 

Open MPI is installed in my home directory, which is accessible on all of the 
nodes we are trying to run on.

My hello world application gets the world size, rank and hostname and prints 
them. It launches and runs successfully.

Hello world from processor viper-03, rank 0 out of 8 processors
Hello world from processor viper-03, rank 1 out of 8 processors
Hello world from processor viper-03, rank 2 out of 8 processors
Hello world from processor viper-03, rank 3 out of 8 processors
Hello world from processor viper-04, rank 4 out of 8 processors
Hello world from processor viper-04, rank 5 out of 8 processors
Hello world from processor viper-04, rank 6 out of 8 processors
Hello world from processor viper-04, rank 7 out of 8 processors 

I then tried to run the OSU micro-benchmarks but these fail to run. I get the 
following output: 

# OSU MPI Latency Test v5.6.3
# Size  Latency (us)
[viper-01:25885] [[21336,0],0] ORTE_ERROR_LOG: Data unpack would read past end of buffer in file util/show_help.c at line 507
--
WARNING: Open MPI accepted a TCP connection from what appears to be a
another Open MPI process but cannot find a corresponding process
entry for that peer.

This attempted connection will be ignored; your MPI job may or may not
continue properly.

  Local host: viper-02
  PID:20406
--

The machines are firewalled, but ports 9000-9060 are open. I have set the 
following MCA parameters to match the open ports:

btl_tcp_port_min_v4=9000
btl_tcp_port_range_v4=60
oob_tcp_dynamic_ipv4_ports=9020
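
As a sanity check on those values, the short Python sketch below works out which ports the settings select and whether they fall inside the open 9000-9060 range. It assumes that `btl_tcp_port_range_v4` is a count of ports starting at `btl_tcp_port_min_v4`; that reading should be confirmed with `ompi_info` for the installed version. The helper names `btl_port_window` and `within` are made up for this example.

```python
def btl_port_window(port_min, port_range):
    """Inclusive (first, last) port, assuming port_range counts ports from port_min."""
    return port_min, port_min + port_range - 1


def within(port, first, last):
    """True if port lies in the inclusive range first..last."""
    return first <= port <= last


FIREWALL_OPEN = (9000, 9060)  # ports opened in the firewall, per the message

btl_first, btl_last = btl_port_window(9000, 60)  # btl_tcp_port_min_v4, btl_tcp_port_range_v4
oob_port = 9020                                  # oob_tcp_dynamic_ipv4_ports

btl_ok = within(btl_first, *FIREWALL_OPEN) and within(btl_last, *FIREWALL_OPEN)
oob_ok = within(oob_port, *FIREWALL_OPEN)

print(f"BTL TCP ports: {btl_first}-{btl_last}, inside open range: {btl_ok}")
print(f"OOB TCP port:  {oob_port}, inside open range: {oob_ok}")
```

Under that assumption both settings stay inside the open range, which suggests the firewall ports themselves are not what is blocking the connection.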

Open MPI 4.0.5 was built with GCC 4.8.5; the only build option set was the 
installation prefix, $HOME/local/ompi.

What else could be going wrong? 

Kind Regards, 

Dean