Allan,
remember that InfiniBand is not Ethernet.  You don't NEED to set up IPoIB
interfaces.

Please run these two diagnostics:

ibnetdiscover

ibdiagnet


Please let us have the results of ibnetdiscover.
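
For example (a sketch; run on a node with the InfiniBand diagnostic
tools installed, typically as root):

ibnetdiscover > ibnetdiscover.out
ibdiagnet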




On 19 May 2017 at 09:25, John Hearns <hear...@googlemail.com> wrote:

> Gilles, Allan,
>
> If the host 'smd' is acting as a cluster head node, it does not need to
> have an InfiniBand card.
> So you should be able to run jobs across the other nodes, which have
> QLogic cards.
> I may have something mixed up here; if so, I am sorry.
>
> If you also want to run jobs on the smd host, you should take note of
> what Gilles says.
> You may be out of luck in that case.
>
> On 19 May 2017 at 09:15, Gilles Gouaillardet <gil...@rist.or.jp> wrote:
>
>> Allan,
>>
>>
>> I just noted that smd has a Mellanox card, while the other nodes have QLogic cards.
>>
>> mtl/psm works best for QLogic, while btl/openib (or mtl/mxm) works best
>> for Mellanox,
>>
>> but these are not interoperable. Also, I do not think btl/openib can be
>> used with QLogic cards.
>>
>> (Please someone correct me if I am wrong.)
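>>
>> (To illustrate, psm can be requested explicitly on the QLogic nodes with
>> something like the following; a sketch, assuming a hypothetical hostfile
>> "nodes_qlogic" that excludes smd, and that psm support was built in:
>>
>> mpirun -np 10 --hostfile nodes_qlogic --mca pml cm --mca mtl psm ./ring )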
>>
>>
>> From the logs, I can see that smd (Mellanox) is not even able to use the
>> InfiniBand port.
>>
>> If you run with 2 MPI tasks, both run on smd and hence btl/vader is
>> used; that is why it works.
>>
>> If you run with more than 2 MPI tasks, then smd and the other nodes are
>> used, and every MPI task falls back to btl/tcp for inter-node
>> communication.
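>>
>> (You can confirm which btl each task selects by adding, for example,
>> --mca btl_base_verbose 100 to the mpirun command line.)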
>>
>> [smd][[41971,1],1][btl_tcp_endpoint.c:803:mca_btl_tcp_endpoint_complete_connect]
>> connect() to 192.168.1.196 failed: No route to host (113)
>>
>> This usually indicates a firewall, but since both ssh and oob/tcp are
>> fine, this puzzles me.
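>>
>> (As a quick sanity check on each CentOS node, a sketch:
>>
>> systemctl status firewalld
>> iptables -L -n
>>
>> and on the Ubuntu node: sudo ufw status )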
>>
>>
>> What if you run:
>>
>> mpirun -np 2 --hostfile nodes --mca oob_tcp_if_include 192.168.1.0/24
>> --mca btl_tcp_if_include 192.168.1.0/24 --mca pml ob1 --mca btl
>> tcp,sm,vader,self  ring
>>
>> That should work with no error messages, and then you can try with 12
>> MPI tasks.
>>
>> (Note that inter-node MPI communications will use tcp only.)
>>
>>
>> If you want optimal performance, I am afraid you cannot run any MPI
>> task on smd (so that mtl/psm can be used).
>>
>> (By the way, make sure PSM support was built into Open MPI.)
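>>
>> (For example, ompi_info | grep -i psm should show an mtl psm component
>> if support was built in.)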
>>
>> A suboptimal option is to force MPI communications over IPoIB with:
>>
>> /* make sure all nodes can ping each other via IPoIB first */
>>
>> mpirun --mca oob_tcp_if_include 192.168.1.0/24 --mca btl_tcp_if_include
>> 10.1.0.0/24 --mca pml ob1 --mca btl tcp,sm,vader,self
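>>
>> (To verify the IPoIB ping requirement above, a sketch using the -ib
>> names from your /etc/hosts:
>>
>> for h in smd-ib sm1-ib sm2-ib sm3-ib sm4-ib dl580-ib ; do ping -c 1 $h ; done )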
>>
>>
>>
>> Cheers,
>>
>>
>> Gilles
>>
>>
>> On 5/19/2017 3:50 PM, Allan Overstreet wrote:
>>
>>> Gilles,
>>>
>>> On which node is mpirun invoked?
>>>
>>>     The mpirun command was invoked on node smd.
>>>
>>> Are you running from a batch manager?
>>>
>>>     No.
>>>
>>> Is there any firewall running on your nodes?
>>>
>>>     No. CentOS Minimal does not have a firewall installed, and Ubuntu
>>> Mate's firewall is disabled.
>>>
>>> All three of your commands appear to have run successfully. The
>>> outputs of the three commands are attached.
>>>
>>> mpirun -np 2 --hostfile nodes --mca oob_tcp_if_include 192.168.1.0/24
>>> --mca oob_base_verbose 100 true &> cmd1
>>>
>>> mpirun -np 12 --hostfile nodes --mca oob_tcp_if_include 192.168.1.0/24
>>> --mca oob_base_verbose 100 true &> cmd2
>>>
>>> mpirun -np 2 --hostfile nodes --mca oob_tcp_if_include 192.168.1.0/24
>>> --mca oob_base_verbose 100 ring &> cmd3
>>>
>>> If I increase the number of processes in the ring program, mpirun does
>>> not succeed.
>>>
>>> mpirun -np 12 --hostfile nodes --mca oob_tcp_if_include 192.168.1.0/24
>>> --mca oob_base_verbose 100 ring &> cmd4
>>>
>>>
>>> On 05/19/2017 02:18 AM, Gilles Gouaillardet wrote:
>>>
>>>> Allan,
>>>>
>>>>
>>>> - On which node is mpirun invoked?
>>>>
>>>> - Are you running from a batch manager?
>>>>
>>>> - Is there any firewall running on your nodes?
>>>>
>>>>
>>>> The error is likely occurring when wiring up mpirun/orted.
>>>>
>>>> What if you run:
>>>>
>>>> mpirun -np 2 --hostfile nodes --mca oob_tcp_if_include 192.168.1.0/24
>>>> --mca oob_base_verbose 100 true
>>>>
>>>> then (if the previous command worked)
>>>>
>>>> mpirun -np 12 --hostfile nodes --mca oob_tcp_if_include 192.168.1.0/24
>>>> --mca oob_base_verbose 100 true
>>>>
>>>> and finally (if both previous commands worked)
>>>>
>>>> mpirun -np 2 --hostfile nodes --mca oob_tcp_if_include 192.168.1.0/24
>>>> --mca oob_base_verbose 100 ring
>>>>
>>>>
>>>> Cheers,
>>>>
>>>> Gilles
>>>>
>>>> On 5/19/2017 3:07 PM, Allan Overstreet wrote:
>>>>
>>>>> I am experiencing many different errors with Open MPI version 2.1.1. I
>>>>> have had a suspicion that this might be related to the way the servers
>>>>> are connected and configured. Regardless, below is a summary of how the
>>>>> servers are configured.
>>>>>
>>>>> All six hosts have dual 1Gb bonded Ethernet (bond0) connected to a Gb
>>>>> Ethernet switch, and an InfiniBand card (ib0) connected to a Voltaire
>>>>> 4036 QDR switch:
>>>>>
>>>>> Host     Bond0 IP         Infiniband card   Ib0 IP      OS
>>>>> smd      192.168.1.200    MHQH29B-XTR       10.1.0.1    Ubuntu Mate
>>>>> sm1      192.168.1.196    QLOGIC QLE7340    10.1.0.2    CentOS 7 Minimal
>>>>> sm2      192.168.1.199    QLOGIC QLE7340    10.1.0.3    CentOS 7 Minimal
>>>>> sm3      192.168.1.203    QLOGIC QLE7340    10.1.0.4    CentOS 7 Minimal
>>>>> sm4      192.168.1.204    QLOGIC QLE7340    10.1.0.5    CentOS 7 Minimal
>>>>> dl580    192.168.1.201    QLOGIC QLE7340    10.1.0.6    CentOS 7 Minimal
>>>>>
>>>>> I have ensured that the InfiniBand adapters can ping each other and
>>>>> that every node can ssh into every other node without a password. Every
>>>>> node has the same /etc/hosts file,
>>>>>
>>>>> cat /etc/hosts
>>>>>
>>>>> 127.0.0.1    localhost
>>>>> 192.168.1.200    smd
>>>>> 192.168.1.196    sm1
>>>>> 192.168.1.199    sm2
>>>>> 192.168.1.203    sm3
>>>>> 192.168.1.204    sm4
>>>>> 192.168.1.201    dl580
>>>>>
>>>>> 10.1.0.1    smd-ib
>>>>> 10.1.0.2    sm1-ib
>>>>> 10.1.0.3    sm2-ib
>>>>> 10.1.0.4    sm3-ib
>>>>> 10.1.0.5    sm4-ib
>>>>> 10.1.0.6    dl580-ib
>>>>>
>>>>> I have been using a simple ring test program to test Open MPI. The
>>>>> code for this program is attached.
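>>>>>
>>>>> (For reference, the ring test is along these lines; a minimal sketch
>>>>> that matches the output shown below, the attached code may differ:
>>>>>
>>>>> #include <mpi.h>
>>>>> #include <stdio.h>
>>>>>
>>>>> int main(int argc, char **argv)
>>>>> {
>>>>>     int rank, size, token;
>>>>>     MPI_Init(&argc, &argv);
>>>>>     MPI_Comm_rank(MPI_COMM_WORLD, &rank);
>>>>>     MPI_Comm_size(MPI_COMM_WORLD, &size);
>>>>>     if (rank == 0) {
>>>>>         /* rank 0 starts the token, then waits for it to come around */
>>>>>         token = -1;
>>>>>         MPI_Send(&token, 1, MPI_INT, 1, 0, MPI_COMM_WORLD);
>>>>>         MPI_Recv(&token, 1, MPI_INT, size - 1, 0, MPI_COMM_WORLD,
>>>>>                  MPI_STATUS_IGNORE);
>>>>>         printf("Process %d received token %d from process %d\n",
>>>>>                rank, token, size - 1);
>>>>>     } else {
>>>>>         /* everyone else receives from the left, sends to the right */
>>>>>         MPI_Recv(&token, 1, MPI_INT, rank - 1, 0, MPI_COMM_WORLD,
>>>>>                  MPI_STATUS_IGNORE);
>>>>>         printf("Process %d received token %d from process %d\n",
>>>>>                rank, token, rank - 1);
>>>>>         MPI_Send(&token, 1, MPI_INT, (rank + 1) % size, 0, MPI_COMM_WORLD);
>>>>>     }
>>>>>     MPI_Finalize();
>>>>>     return 0;
>>>>> }
>>>>>
>>>>> It assumes at least two ranks.)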
>>>>>
>>>>> The hostfile used in all the commands is,
>>>>>
>>>>> cat ./nodes
>>>>>
>>>>> smd slots=2
>>>>> sm1 slots=2
>>>>> sm2 slots=2
>>>>> sm3 slots=2
>>>>> sm4 slots=2
>>>>> dl580 slots=2
>>>>>
>>>>> When running the following command on smd,
>>>>>
>>>>> mpirun -mca btl openib,self -np 2 --hostfile nodes ./ring
>>>>>
>>>>> I obtain the following error,
>>>>>
>>>>> ------------------------------------------------------------
>>>>> A process or daemon was unable to complete a TCP connection
>>>>> to another process:
>>>>>   Local host:    sm1
>>>>>   Remote host:   192.168.1.200
>>>>> This is usually caused by a firewall on the remote host. Please
>>>>> check that any firewall (e.g., iptables) has been disabled and
>>>>> try again.
>>>>> ------------------------------------------------------------
>>>>> --------------------------------------------------------------------------
>>>>>
>>>>> No OpenFabrics connection schemes reported that they were able to be
>>>>> used on a specific port.  As such, the openib BTL (OpenFabrics
>>>>> support) will be disabled for this port.
>>>>>
>>>>>   Local host:           smd
>>>>>   Local device:         mlx4_0
>>>>>   Local port:           1
>>>>>   CPCs attempted:       rdmacm, udcm
>>>>> --------------------------------------------------------------------------
>>>>>
>>>>> Process 1 received token -1 from process 0
>>>>> Process 0 received token -1 from process 1
>>>>> [smd:12800] 1 more process has sent help message
>>>>> help-mpi-btl-openib-cpc-base.txt / no cpcs for port
>>>>> [smd:12800] Set MCA parameter "orte_base_help_aggregate" to 0 to see
>>>>> all help / error messages
>>>>>
>>>>> When increasing the number of processes, no program output is produced.
>>>>>
>>>>> mpirun -mca btl openib,self -np 4 --hostfile nodes ./ring
>>>>> ------------------------------------------------------------
>>>>> A process or daemon was unable to complete a TCP connection
>>>>> to another process:
>>>>>   Local host:    sm2
>>>>>   Remote host:   192.168.1.200
>>>>> This is usually caused by a firewall on the remote host. Please
>>>>> check that any firewall (e.g., iptables) has been disabled and
>>>>> try again.
>>>>> ------------------------------------------------------------
>>>>> *** An error occurred in MPI_Init
>>>>> *** on a NULL communicator
>>>>> *** MPI_ERRORS_ARE_FATAL (processes in this communicator will now
>>>>> abort,
>>>>> ***    and potentially your MPI job)
>>>>> *** An error occurred in MPI_Init
>>>>> *** on a NULL communicator
>>>>> *** MPI_ERRORS_ARE_FATAL (processes in this communicator will now
>>>>> abort,
>>>>> ***    and potentially your MPI job)
>>>>> --------------------------------------------------------------------------
>>>>>
>>>>> A requested component was not found, or was unable to be opened. This
>>>>> means that this component is either not installed or is unable to be
>>>>> used on your system (e.g., sometimes this means that shared libraries
>>>>> that the component requires are unable to be found/loaded). Note that
>>>>> Open MPI stopped checking at the first component that it did not find.
>>>>>
>>>>> Host:      sm1.overst.local
>>>>> Framework: btl
>>>>> Component: openib
>>>>> --------------------------------------------------------------------------
>>>>>
>>>>> --------------------------------------------------------------------------
>>>>>
>>>>> It looks like MPI_INIT failed for some reason; your parallel process is
>>>>> likely to abort.  There are many reasons that a parallel process can
>>>>> fail during MPI_INIT; some of which are due to configuration or
>>>>> environment
>>>>> problems.  This failure appears to be an internal failure; here's some
>>>>> additional information (which may only be relevant to an Open MPI
>>>>> developer):
>>>>>
>>>>>   mca_bml_base_open() failed
>>>>>   --> Returned "Not found" (-13) instead of "Success" (0)
>>>>> --------------------------------------------------------------------------
>>>>>
>>>>> --------------------------------------------------------------------------
>>>>>
>>>>> No OpenFabrics connection schemes reported that they were able to be
>>>>> used on a specific port.  As such, the openib BTL (OpenFabrics
>>>>> support) will be disabled for this port.
>>>>>
>>>>>   Local host:           smd
>>>>>   Local device:         mlx4_0
>>>>>   Local port:           1
>>>>>   CPCs attempted:       rdmacm, udcm
>>>>> --------------------------------------------------------------------------
>>>>>
>>>>> [smd:12953] 1 more process has sent help message help-mca-base.txt /
>>>>> find-available:not-valid
>>>>> [smd:12953] Set MCA parameter "orte_base_help_aggregate" to 0 to see
>>>>> all help / error messages
>>>>> [smd:12953] 1 more process has sent help message help-mpi-runtime.txt
>>>>> / mpi_init:startup:internal-failure
>>>>> [smd:12953] 1 more process has sent help message
>>>>> help-mpi-btl-openib-cpc-base.txt / no cpcs for port
>>>>>
>>>>> Running mpirun from other nodes does not resolve the issue. I have
>>>>> checked that none of the nodes is running a firewall that would be 
>>>>> blocking
>>>>> tcp connections.
>>>>>
>>>>> The error with the mlx4_0 adapter is expected, as this is used as a
>>>>> 10Gb Ethernet adapter to another network. The InfiniBand adapter on smd
>>>>> that is being used for QDR InfiniBand is mlx4_1.
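>>>>>
>>>>> (I assume the openib BTL could be pointed at the QDR adapter with
>>>>> something like --mca btl_openib_if_include mlx4_1 on the mpirun command
>>>>> line, although that would not explain the TCP connection failures.)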
>>>>>
>>>>> Any help would be appreciated.
>>>>>
>>>>> Sincerely,
>>>>>
>>>>> Allan Overstreet
>>>>>
>>>>>
>>>>>
>>>>>
>>>>
>>>>
>>>
>>
>
>
_______________________________________________
users mailing list
users@lists.open-mpi.org
https://rfd.newmexicoconsortium.org/mailman/listinfo/users
