Re: [OMPI users] Unable to run complicated MPI Program
Thank you, explicitly setting the interface as below has resolved this.

Thanks,

Dean

> On 28 Nov 2020, at 10:27, Gilles Gouaillardet via users wrote:
>
> Dean,
>
> That typically occurs when some nodes have multiple interfaces, and
> several nodes have a similar IP on a private/unused interface.
>
> I suggest you explicitly restrict the interface Open MPI should be using.
> For example, you can
>
>     mpirun --mca btl_tcp_if_include eth0 ...
>
> Cheers,
>
> Gilles
>
> [...]
Re: [OMPI users] Unable to run complicated MPI Program
Dean,

That typically occurs when some nodes have multiple interfaces, and
several nodes have a similar IP on a private/unused interface.

I suggest you explicitly restrict the interface Open MPI should be using.
For example, you can

    mpirun --mca btl_tcp_if_include eth0 ...

Cheers,

Gilles

On Fri, Nov 27, 2020 at 7:36 PM CHESTER, DEAN (PGR) via users wrote:

> Hi,
>
> I am trying to set up some machines with Open MPI connected with ethernet
> to expand a batch system we already have in use.
>
> [...]
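[Editor's note: Gilles's suggestion can be applied per-run on the mpirun command line or persistently in a parameter file. A minimal sketch, assuming the relevant NIC is named eth0 on every node (check with `ip addr`); restricting the OOB layer as well as the BTL is an addition here, not part of the original advice:]

```shell
# Per-run: pin both the TCP BTL (MPI traffic) and the OOB layer
# (runtime wire-up) to one interface. "eth0" is an example name.
mpirun --mca btl_tcp_if_include eth0 \
       --mca oob_tcp_if_include eth0 \
       -np 8 ./osu_latency

# Persistent alternative: put the same parameters in
# $HOME/.openmpi/mca-params.conf (one "key = value" per line):
#   btl_tcp_if_include = eth0
#   oob_tcp_if_include = eth0

# CIDR notation also works, useful when interface names differ
# across nodes:
#   btl_tcp_if_include = 192.168.1.0/24
```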
[OMPI users] Unable to run complicated MPI Program
Hi,

I am trying to set up some machines with Open MPI connected over ethernet to
expand a batch system we already have in use.

This is already controlled with Slurm, and we are able to get a basic MPI
program running across 2 of the machines, but when I compile and run
something that actually performs communication, it fails.

Slurm was not configured with PMI/PMI2, so we have to use mpirun for
program execution.

Open MPI is installed in my home directory, which is accessible on all of
the nodes we are trying to run on.

My hello world application gets the world size, rank and hostname and prints
them. This successfully launches and runs:

Hello world from processor viper-03, rank 0 out of 8 processors
Hello world from processor viper-03, rank 1 out of 8 processors
Hello world from processor viper-03, rank 2 out of 8 processors
Hello world from processor viper-03, rank 3 out of 8 processors
Hello world from processor viper-04, rank 4 out of 8 processors
Hello world from processor viper-04, rank 5 out of 8 processors
Hello world from processor viper-04, rank 6 out of 8 processors
Hello world from processor viper-04, rank 7 out of 8 processors

I then tried to run the OSU micro-benchmarks, but these fail to run. I get
the following output:

# OSU MPI Latency Test v5.6.3
# Size          Latency (us)
[viper-01:25885] [[21336,0],0] ORTE_ERROR_LOG: Data unpack would read past
end of buffer in file util/show_help.c at line 507
--------------------------------------------------------------------------
WARNING: Open MPI accepted a TCP connection from what appears to be a
another Open MPI process but cannot find a corresponding process
entry for that peer.

This attempted connection will be ignored; your MPI job may or may not
continue properly.

  Local host: viper-02
  PID:        20406
--------------------------------------------------------------------------

The machines are firewalled, but ports 9000-9060 are open. I have set the
following MCA parameters to match the open ports:

btl_tcp_port_min_v4=9000
btl_tcp_port_range_v4=60
oob_tcp_dynamic_ipv4_ports=9020

Open MPI 4.0.5 was built with GCC 4.8.5, and only the installation prefix
was set to $HOME/local/ompi.

What else could be going wrong?

Kind Regards,

Dean
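[Editor's note: for readers reproducing this setup, the three parameters above can live in a per-user file so they apply to every run. A sketch, using the values from the message; the semantics described in the comments follow Open MPI 4.x:]

```shell
# $HOME/.openmpi/mca-params.conf

# TCP BTL (MPI point-to-point traffic): first port to try, and how many
# consecutive ports are usable, i.e. min + range => 9000-9059.
btl_tcp_port_min_v4 = 9000
btl_tcp_port_range_v4 = 60

# ORTE out-of-band channel: start searching for a free listening port
# at 9020 (within the firewall's open 9000-9060 window).
oob_tcp_dynamic_ipv4_ports = 9020
```

Note these parameters only constrain which ports Open MPI binds; they do not address the multiple-interface ambiguity that the ORTE_ERROR_LOG / stray-TCP-connection warning points to, which is why pinning the interface resolved the problem.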