Allan, remember that Infiniband is not Ethernet. You don't NEED to set up IPoIB interfaces.

Two diagnostics for you to run, please: ibnetdiscover and ibdiagnet. Please let us have the results of ibnetdiscover.
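Capturing the output of those two tools might look like the following (a sketch, assuming the usual infiniband-diags/ibutils packages are installed on a node attached to the fabric; both typically need root):

    # dump the fabric topology as seen from this node
    sudo ibnetdiscover > ibnetdiscover.out 2>&1
    # run the fabric diagnostic suite
    sudo ibdiagnet > ibdiagnet.out 2>&1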
On 19 May 2017 at 09:25, John Hearns <hear...@googlemail.com> wrote:

> Gilles, Allan,
>
> If the host 'smd' is acting as a cluster head node, it does not have to
> have an Infiniband card itself. So you should be able to run jobs across
> the other nodes, which have QLogic cards. I may have something mixed up
> here; if so, I am sorry.
>
> If you also want to run jobs on the smd host, you should take note of
> what Gilles says. You may be out of luck in that case.
>
> On 19 May 2017 at 09:15, Gilles Gouaillardet <gil...@rist.or.jp> wrote:
>
>> Allan,
>>
>> I just noted smd has a Mellanox card, while the other nodes have QLogic
>> cards. mtl/psm works best for QLogic, while btl/openib (or mtl/mxm)
>> works best for Mellanox, but these are not interoperable. Also, I do
>> not think btl/openib can be used with QLogic cards (please someone
>> correct me if I am wrong).
>>
>> From the logs, I can see that smd (Mellanox) is not even able to use
>> the Infiniband port. If you run with 2 MPI tasks, both run on smd and
>> hence btl/vader is used; that is why it works. If you run with more
>> than 2 MPI tasks, then smd and the other nodes are used, and every MPI
>> task falls back to btl/tcp for inter-node communication.
>>
>> [smd][[41971,1],1][btl_tcp_endpoint.c:803:mca_btl_tcp_endpoint_complete_connect] connect() to 192.168.1.196 failed: No route to host (113)
>>
>> This usually indicates a firewall, but since both ssh and oob/tcp are
>> fine, this puzzles me.
>>
>> What if you
>>
>> mpirun -np 2 --hostfile nodes --mca oob_tcp_if_include 192.168.1.0/24 --mca btl_tcp_if_include 192.168.1.0/24 --mca pml ob1 --mca btl tcp,sm,vader,self ring
>>
>> That should work with no error messages, and then you can try with 12
>> MPI tasks. (Note inter-node MPI communications will use tcp only.)
>>
>> If you want optimal performance, I am afraid you cannot run any MPI
>> task on smd (so mtl/psm can be used). (Btw, make sure PSM support was
>> built into Open MPI.)
>>
>> A suboptimal option is to force MPI communications onto IPoIB with
>>
>> /* make sure all nodes can ping each other via IPoIB first */
>> mpirun --mca oob_tcp_if_include 192.168.1.0/24 --mca btl_tcp_if_include 10.1.0.0/24 --mca pml ob1 --mca btl tcp,sm,vader,self
>>
>> Cheers,
>>
>> Gilles
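Regarding Gilles's "ping each other via IPoIB first" and "make sure PSM support was built in" remarks, a quick sanity sweep might look like this (a sketch; the ib0 addresses are the ones from Allan's overview quoted further down, and the exact ompi_info wording varies between Open MPI versions):

    # ping each ib0 address once, with a 2-second timeout
    for h in 10.1.0.1 10.1.0.2 10.1.0.3 10.1.0.4 10.1.0.5 10.1.0.6; do
        ping -c 1 -W 2 "$h" > /dev/null && echo "$h ok" || echo "$h UNREACHABLE"
    done

    # check whether the psm MTL was built into this Open MPI install
    ompi_info | grep -i psm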
>> On 5/19/2017 3:50 PM, Allan Overstreet wrote:
>>
>>> Gilles,
>>>
>>> On which node is mpirun invoked?
>>>
>>> The mpirun command was invoked on node smd.
>>>
>>> Are you running from a batch manager?
>>>
>>> No.
>>>
>>> Is there any firewall running on your nodes?
>>>
>>> No. CentOS Minimal does not have a firewall installed, and Ubuntu
>>> Mate's firewall is disabled.
>>>
>>> All three of your commands appear to have run successfully. The
>>> outputs of the three commands are attached.
>>>
>>> mpirun -np 2 --hostfile nodes --mca oob_tcp_if_include 192.168.1.0/24 --mca oob_base_verbose 100 true &> cmd1
>>>
>>> mpirun -np 12 --hostfile nodes --mca oob_tcp_if_include 192.168.1.0/24 --mca oob_base_verbose 100 true &> cmd2
>>>
>>> mpirun -np 2 --hostfile nodes --mca oob_tcp_if_include 192.168.1.0/24 --mca oob_base_verbose 100 ring &> cmd3
>>>
>>> If I increase the number of processes in the ring program, mpirun
>>> does not succeed.
>>>
>>> mpirun -np 12 --hostfile nodes --mca oob_tcp_if_include 192.168.1.0/24 --mca oob_base_verbose 100 ring &> cmd4
>>>
>>> On 05/19/2017 02:18 AM, Gilles Gouaillardet wrote:
>>>
>>>> Allan,
>>>>
>>>> - On which node is mpirun invoked?
>>>> - Are you running from a batch manager?
>>>> - Is there any firewall running on your nodes?
>>>>
>>>> The error is likely occurring when wiring up mpirun/orted.
>>>>
>>>> What if you
>>>>
>>>> mpirun -np 2 --hostfile nodes --mca oob_tcp_if_include 192.168.1.0/24 --mca oob_base_verbose 100 true
>>>>
>>>> then (if the previous command worked)
>>>>
>>>> mpirun -np 12 --hostfile nodes --mca oob_tcp_if_include 192.168.1.0/24 --mca oob_base_verbose 100 true
>>>>
>>>> and finally (if both previous commands worked)
>>>>
>>>> mpirun -np 2 --hostfile nodes --mca oob_tcp_if_include 192.168.1.0/24 --mca oob_base_verbose 100 ring
>>>>
>>>> Cheers,
>>>>
>>>> Gilles
>>>>
>>>> On 5/19/2017 3:07 PM, Allan Overstreet wrote:
>>>>
>>>>> I am experiencing many different errors with Open MPI version 2.1.1.
>>>>> I have had a suspicion that this might be related to the way the
>>>>> servers are connected and configured. Regardless, below is an
>>>>> overview of how the servers are configured.
>>>>>
>>>>> HOST    Bond0 IP        Infiniband Card   Ib0 IP     OS
>>>>> smd     192.168.1.200   MHQH29B-XTR       10.1.0.1   Ubuntu Mate
>>>>> sm1     192.168.1.196   QLOGIC QLE7340    10.1.0.2   CentOS 7 Minimal
>>>>> sm2     192.168.1.199   QLOGIC QLE7340    10.1.0.3   CentOS 7 Minimal
>>>>> sm3     192.168.1.203   QLOGIC QLE7340    10.1.0.4   CentOS 7 Minimal
>>>>> sm4     192.168.1.204   QLOGIC QLE7340    10.1.0.5   CentOS 7 Minimal
>>>>> dl580   192.168.1.201   QLOGIC QLE7340    10.1.0.6   CentOS 7 Minimal
>>>>>
>>>>> Each host has dual 1Gb Ethernet ports bonded as bond0 and connected
>>>>> to a Gb Ethernet switch; each ib0 interface is connected to a
>>>>> Voltaire 4036 QDR switch.
>>>>>
>>>>> I have ensured that the Infiniband adapters can ping each other and
>>>>> that every node can ssh into every other node without a password.
>>>>> Every node has the same /etc/hosts file,
>>>>>
>>>>> cat /etc/hosts
>>>>>
>>>>> 127.0.0.1 localhost
>>>>> 192.168.1.200 smd
>>>>> 192.168.1.196 sm1
>>>>> 192.168.1.199 sm2
>>>>> 192.168.1.203 sm3
>>>>> 192.168.1.204 sm4
>>>>> 192.168.1.201 dl580
>>>>>
>>>>> 10.1.0.1 smd-ib
>>>>> 10.1.0.2 sm1-ib
>>>>> 10.1.0.3 sm2-ib
>>>>> 10.1.0.4 sm3-ib
>>>>> 10.1.0.5 sm4-ib
>>>>> 10.1.0.6 dl580-ib
>>>>>
>>>>> I have been using a simple ring test program to test Open MPI. The
>>>>> code for this program is attached.
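The attached ring source is not included in the thread. Assuming it is the classic token-passing example and that the file is named ring.c (both assumptions), building it and giving it a TCP-only shakedown across all twelve slots, along the lines Gilles suggests above, would be roughly:

    # compile with the Open MPI wrapper compiler
    mpicc ring.c -o ring
    # force TCP between nodes to take the Infiniband fabric out of the picture
    mpirun -np 12 --hostfile nodes --mca pml ob1 --mca btl tcp,sm,vader,self ./ring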
>>>>> The hostfile used in all the commands is,
>>>>>
>>>>> cat ./nodes
>>>>>
>>>>> smd slots=2
>>>>> sm1 slots=2
>>>>> sm2 slots=2
>>>>> sm3 slots=2
>>>>> sm4 slots=2
>>>>> dl580 slots=2
>>>>>
>>>>> When running the following command on smd,
>>>>>
>>>>> mpirun -mca btl openib,self -np 2 --hostfile nodes ./ring
>>>>>
>>>>> I obtain the following error,
>>>>>
>>>>> ------------------------------------------------------------
>>>>> A process or daemon was unable to complete a TCP connection
>>>>> to another process:
>>>>>   Local host:  sm1
>>>>>   Remote host: 192.168.1.200
>>>>> This is usually caused by a firewall on the remote host. Please
>>>>> check that any firewall (e.g., iptables) has been disabled and
>>>>> try again.
>>>>> ------------------------------------------------------------
>>>>> --------------------------------------------------------------------------
>>>>> No OpenFabrics connection schemes reported that they were able to be
>>>>> used on a specific port. As such, the openib BTL (OpenFabrics
>>>>> support) will be disabled for this port.
>>>>>
>>>>>   Local host:      smd
>>>>>   Local device:    mlx4_0
>>>>>   Local port:      1
>>>>>   CPCs attempted:  rdmacm, udcm
>>>>> --------------------------------------------------------------------------
>>>>> Process 1 received token -1 from process 0
>>>>> Process 0 received token -1 from process 1
>>>>> [smd:12800] 1 more process has sent help message help-mpi-btl-openib-cpc-base.txt / no cpcs for port
>>>>> [smd:12800] Set MCA parameter "orte_base_help_aggregate" to 0 to see all help / error messages
>>>>>
>>>>> When increasing the number of processes, no program output is
>>>>> produced.
>>>>>
>>>>> mpirun -mca btl openib,self -np 4 --hostfile nodes ./ring
>>>>>
>>>>> ------------------------------------------------------------
>>>>> A process or daemon was unable to complete a TCP connection
>>>>> to another process:
>>>>>   Local host:  sm2
>>>>>   Remote host: 192.168.1.200
>>>>> This is usually caused by a firewall on the remote host. Please
>>>>> check that any firewall (e.g., iptables) has been disabled and
>>>>> try again.
>>>>> ------------------------------------------------------------
>>>>> *** An error occurred in MPI_Init
>>>>> *** on a NULL communicator
>>>>> *** MPI_ERRORS_ARE_FATAL (processes in this communicator will now abort,
>>>>> ***    and potentially your MPI job)
>>>>> *** An error occurred in MPI_Init
>>>>> *** on a NULL communicator
>>>>> *** MPI_ERRORS_ARE_FATAL (processes in this communicator will now abort,
>>>>> ***    and potentially your MPI job)
>>>>> --------------------------------------------------------------------------
>>>>> A requested component was not found, or was unable to be opened. This
>>>>> means that this component is either not installed or is unable to be
>>>>> used on your system (e.g., sometimes this means that shared libraries
>>>>> that the component requires are unable to be found/loaded). Note that
>>>>> Open MPI stopped checking at the first component that it did not find.
>>>>>
>>>>>   Host:       sm1.overst.local
>>>>>   Framework:  btl
>>>>>   Component:  openib
>>>>> --------------------------------------------------------------------------
>>>>> It looks like MPI_INIT failed for some reason; your parallel process is
>>>>> likely to abort. There are many reasons that a parallel process can
>>>>> fail during MPI_INIT; some of which are due to configuration or
>>>>> environment problems. This failure appears to be an internal failure;
>>>>> here's some additional information (which may only be relevant to an
>>>>> Open MPI developer):
>>>>>
>>>>>   mca_bml_base_open() failed
>>>>>   --> Returned "Not found" (-13) instead of "Success" (0)
>>>>> --------------------------------------------------------------------------
>>>>> --------------------------------------------------------------------------
>>>>> No OpenFabrics connection schemes reported that they were able to be
>>>>> used on a specific port. As such, the openib BTL (OpenFabrics
>>>>> support) will be disabled for this port.
>>>>>
>>>>>   Local host:      smd
>>>>>   Local device:    mlx4_0
>>>>>   Local port:      1
>>>>>   CPCs attempted:  rdmacm, udcm
>>>>> --------------------------------------------------------------------------
>>>>> [smd:12953] 1 more process has sent help message help-mca-base.txt / find-available:not-valid
>>>>> [smd:12953] Set MCA parameter "orte_base_help_aggregate" to 0 to see all help / error messages
>>>>> [smd:12953] 1 more process has sent help message help-mpi-runtime.txt / mpi_init:startup:internal-failure
>>>>> [smd:12953] 1 more process has sent help message help-mpi-btl-openib-cpc-base.txt / no cpcs for port
>>>>>
>>>>> Running mpirun from other nodes does not resolve the issue. I have
>>>>> checked that none of the nodes is running a firewall that would be
>>>>> blocking tcp connections.
>>>>>
>>>>> The error with the mlx4_0 adapter is expected, as that port is used
>>>>> as a 10Gb Ethernet adapter to another network. The Infiniband adapter
>>>>> on smd that is being used for QDR Infiniband is mlx4_1.
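Since the openib BTL on smd is evidently probing mlx4_0 (the 10Gb Ethernet port) rather than mlx4_1, one untested idea, sketched here with the stock btl_openib_if_include parameter, is to restrict the BTL to the QDR adapter. Note this would not address the Mellanox/QLogic interoperability issue Gilles describes above:

    # steer the openib BTL on smd to the QDR HCA (a sketch, not verified on this hardware)
    mpirun -np 2 --hostfile nodes --mca btl openib,self \
           --mca btl_openib_if_include mlx4_1 ./ring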
>>>>> Any help would be appreciated.
>>>>>
>>>>> Sincerely,
>>>>>
>>>>> Allan Overstreet