Gilles, Allan,

If the host 'smd' is acting as a cluster head node, it does not need to
have an Infiniband card.
So you should be able to run jobs across the other nodes, which have
QLogic cards.
I may have something mixed up here; if so, I am sorry.

If you also want to run jobs on the smd host, you should take note of
what Gilles says.
You may be out of luck in that case.

On 19 May 2017 at 09:15, Gilles Gouaillardet <gil...@rist.or.jp> wrote:

> Allan,
>
>
> I just noted that smd has a Mellanox card, while the other nodes have
> QLogic cards.
>
> mtl/psm works best for QLogic, while btl/openib (or mtl/mxm) works
> best for Mellanox, but these are not interoperable. Also, I do not
> think btl/openib can be used with QLogic cards
>
> (please, someone correct me if I am wrong).
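>
> For what it is worth, on the QLogic-only nodes psm can be requested
> explicitly with something like
>
> mpirun -np 10 --hostfile nodes_no_smd --mca pml cm --mca mtl psm ring
>
> where "nodes_no_smd" is a hypothetical hostfile that leaves smd out.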
>
>
> From the logs, I can see that smd (Mellanox) is not even able to use
> the Infiniband port.
>
> If you run with 2 MPI tasks, both run on smd and hence btl/vader is
> used; that is why it works.
>
> If you run with more than 2 MPI tasks, then smd and the other nodes
> are used, and every MPI task falls back to btl/tcp for inter-node
> communication.
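>
> If you want to confirm which btl each task actually selects, adding
> btl verbosity should print the component selection, e.g.
>
> mpirun -np 12 --hostfile nodes --mca btl_base_verbose 100 ring
>
> (just a diagnostic suggestion on my side)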
>
> [smd][[41971,1],1][btl_tcp_endpoint.c:803:mca_btl_tcp_endpoint_complete_connect]
> connect() to 192.168.1.196 failed: No route to host (113)
>
> This usually indicates a firewall, but since both ssh and oob/tcp are
> fine, this puzzles me.
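>
> One thing worth checking (just a suggestion, this is not from your
> logs) is whether smd really has a route to sm1's bond0 address:
>
> ip route get 192.168.1.196
>
> A "No route to host" from connect() can also come from a down or
> misconfigured bond slave rather than a firewall.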
>
>
> What if you run:
>
> mpirun -np 2 --hostfile nodes --mca oob_tcp_if_include 192.168.1.0/24
> --mca btl_tcp_if_include 192.168.1.0/24 --mca pml ob1 --mca btl
> tcp,sm,vader,self  ring
>
> That should work with no error messages, and then you can try with 12
> MPI tasks.
>
> (Note that inter-node MPI communications will use tcp only.)
>
>
> If you want optimal performance, I am afraid you cannot run any MPI
> task on smd (so that mtl/psm can be used).
>
> (By the way, make sure PSM support was built into Open MPI.)
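>
> One way to check is to look for the psm mtl in the ompi_info output,
> for example
>
> ompi_info | grep psm
>
> (if no "MCA mtl: psm" line shows up, PSM support was not built in)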
>
> A suboptimal option is to force MPI communications over IPoIB with:
>
> /* make sure all nodes can ping each other via IPoIB first */
>
> mpirun --mca oob_tcp_if_include 192.168.1.0/24 --mca btl_tcp_if_include
> 10.1.0.0/24 --mca pml ob1 --mca btl tcp,sm,vader,self
>
>
>
> Cheers,
>
>
> Gilles
>
>
> On 5/19/2017 3:50 PM, Allan Overstreet wrote:
>
>> Gilles,
>>
>> On which node is mpirun invoked ?
>>
>>     The mpirun command was invoked on node smd.
>>
>> Are you running from a batch manager?
>>
>>     No.
>>
>> Is there any firewall running on your nodes ?
>>
>>     No. CentOS Minimal does not have a firewall installed, and Ubuntu
>> Mate's firewall is disabled.
>>
>> All three of your commands appear to have run successfully. The
>> outputs of the three commands are attached.
>>
>> mpirun -np 2 --hostfile nodes --mca oob_tcp_if_include 192.168.1.0/24
>> --mca oob_base_verbose 100 true &> cmd1
>>
>> mpirun -np 12 --hostfile nodes --mca oob_tcp_if_include 192.168.1.0/24
>> --mca oob_base_verbose 100 true &> cmd2
>>
>> mpirun -np 2 --hostfile nodes --mca oob_tcp_if_include 192.168.1.0/24
>> --mca oob_base_verbose 100 ring &> cmd3
>>
>> If I increase the number of processes in the ring program, mpirun
>> does not succeed.
>>
>> mpirun -np 12 --hostfile nodes --mca oob_tcp_if_include 192.168.1.0/24
>> --mca oob_base_verbose 100 ring &> cmd4
>>
>>
>> On 05/19/2017 02:18 AM, Gilles Gouaillardet wrote:
>>
>>> Allan,
>>>
>>>
>>> - on which node is mpirun invoked ?
>>>
>>> - are you running from a batch manager ?
>>>
>>> - is there any firewall running on your nodes ?
>>>
>>>
>>> The error is likely occurring when wiring up mpirun/orted.
>>>
>>> What if you run:
>>>
>>> mpirun -np 2 --hostfile nodes --mca oob_tcp_if_include 192.168.1.0/24
>>> --mca oob_base_verbose 100 true
>>>
>>> then (if the previous command worked)
>>>
>>> mpirun -np 12 --hostfile nodes --mca oob_tcp_if_include 192.168.1.0/24
>>> --mca oob_base_verbose 100 true
>>>
>>> and finally (if both previous commands worked)
>>>
>>> mpirun -np 2 --hostfile nodes --mca oob_tcp_if_include 192.168.1.0/24
>>> --mca oob_base_verbose 100 ring
>>>
>>>
>>> Cheers,
>>>
>>> Gilles
>>>
>>> On 5/19/2017 3:07 PM, Allan Overstreet wrote:
>>>
>>>> I am experiencing many different errors with Open MPI version 2.1.1.
>>>> I have suspected that this might be related to the way the servers
>>>> are connected and configured. Regardless, below is a summary of how
>>>> the servers are configured.
>>>>
>>>> [Diagram: all six hosts connect to a Gb Ethernet switch via dual
>>>> bonded 1Gb Ethernet (Bond0) and to a Voltaire 4036 QDR switch via
>>>> their Infiniband cards.]
>>>>
>>>> HOST: smd
>>>>   Dual 1Gb Ethernet Bonded, Bond0 IP: 192.168.1.200
>>>>   Infiniband Card: MHQH29B-XTR, Ib0 IP: 10.1.0.1
>>>>   OS: Ubuntu Mate
>>>>
>>>> HOST: sm1
>>>>   Dual 1Gb Ethernet Bonded, Bond0 IP: 192.168.1.196
>>>>   Infiniband Card: QLOGIC QLE7340, Ib0 IP: 10.1.0.2
>>>>   OS: CentOS 7 Minimal
>>>>
>>>> HOST: sm2
>>>>   Dual 1Gb Ethernet Bonded, Bond0 IP: 192.168.1.199
>>>>   Infiniband Card: QLOGIC QLE7340, Ib0 IP: 10.1.0.3
>>>>   OS: CentOS 7 Minimal
>>>>
>>>> HOST: sm3
>>>>   Dual 1Gb Ethernet Bonded, Bond0 IP: 192.168.1.203
>>>>   Infiniband Card: QLOGIC QLE7340, Ib0 IP: 10.1.0.4
>>>>   OS: CentOS 7 Minimal
>>>>
>>>> HOST: sm4
>>>>   Dual 1Gb Ethernet Bonded, Bond0 IP: 192.168.1.204
>>>>   Infiniband Card: QLOGIC QLE7340, Ib0 IP: 10.1.0.5
>>>>   OS: CentOS 7 Minimal
>>>>
>>>> HOST: dl580
>>>>   Dual 1Gb Ethernet Bonded, Bond0 IP: 192.168.1.201
>>>>   Infiniband Card: QLOGIC QLE7340, Ib0 IP: 10.1.0.6
>>>>   OS: CentOS 7 Minimal
>>>>
>>>> I have ensured that the Infiniband adapters can ping each other and
>>>> that every node can ssh into every other node without a password.
>>>> Every node has the same /etc/hosts file,
>>>>
>>>> cat /etc/hosts
>>>>
>>>> 127.0.0.1    localhost
>>>> 192.168.1.200    smd
>>>> 192.168.1.196    sm1
>>>> 192.168.1.199    sm2
>>>> 192.168.1.203    sm3
>>>> 192.168.1.204    sm4
>>>> 192.168.1.201    dl580
>>>>
>>>> 10.1.0.1    smd-ib
>>>> 10.1.0.2    sm1-ib
>>>> 10.1.0.3    sm2-ib
>>>> 10.1.0.4    sm3-ib
>>>> 10.1.0.5    sm4-ib
>>>> 10.1.0.6    dl580-ib
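>>>>
>>>> The IPoIB reachability check was essentially the following loop over
>>>> the -ib names above (a sketch; the exact commands may have differed):
>>>>
>>>> for h in smd-ib sm1-ib sm2-ib sm3-ib sm4-ib dl580-ib; do
>>>>     ping -c 1 $h    # one probe per IPoIB hostname
>>>> done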
>>>>
>>>> I have been using a simple ring test program to test Open MPI. The
>>>> code for this program is attached.
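>>>>
>>>> For reference, it is essentially the classic MPI ring example, along
>>>> these lines (a sketch; the attached file may differ):
>>>>
>>>> /* sketch of a typical MPI ring program; may differ from the
>>>>  * attached code */
>>>> #include <mpi.h>
>>>> #include <stdio.h>
>>>>
>>>> int main(int argc, char **argv)
>>>> {
>>>>     int rank, size, token;
>>>>
>>>>     MPI_Init(&argc, &argv);
>>>>     MPI_Comm_rank(MPI_COMM_WORLD, &rank);
>>>>     MPI_Comm_size(MPI_COMM_WORLD, &size);
>>>>
>>>>     if (rank != 0) {
>>>>         /* wait for the token from the previous rank in the ring */
>>>>         MPI_Recv(&token, 1, MPI_INT, rank - 1, 0, MPI_COMM_WORLD,
>>>>                  MPI_STATUS_IGNORE);
>>>>         printf("Process %d received token %d from process %d\n",
>>>>                rank, token, rank - 1);
>>>>     } else {
>>>>         token = -1; /* rank 0 starts the token */
>>>>     }
>>>>
>>>>     /* pass the token to the next rank, wrapping around to rank 0 */
>>>>     MPI_Send(&token, 1, MPI_INT, (rank + 1) % size, 0, MPI_COMM_WORLD);
>>>>
>>>>     if (rank == 0) {
>>>>         MPI_Recv(&token, 1, MPI_INT, size - 1, 0, MPI_COMM_WORLD,
>>>>                  MPI_STATUS_IGNORE);
>>>>         printf("Process %d received token %d from process %d\n",
>>>>                rank, token, size - 1);
>>>>     }
>>>>
>>>>     MPI_Finalize();
>>>>     return 0;
>>>> }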
>>>>
>>>> The hostfile used in all the commands is,
>>>>
>>>> cat ./nodes
>>>>
>>>> smd slots=2
>>>> sm1 slots=2
>>>> sm2 slots=2
>>>> sm3 slots=2
>>>> sm4 slots=2
>>>> dl580 slots=2
>>>>
>>>> When running the following command on smd,
>>>>
>>>> mpirun -mca btl openib,self -np 2 --hostfile nodes ./ring
>>>>
>>>> I obtain the following error,
>>>>
>>>> ------------------------------------------------------------
>>>> A process or daemon was unable to complete a TCP connection
>>>> to another process:
>>>>   Local host:    sm1
>>>>   Remote host:   192.168.1.200
>>>> This is usually caused by a firewall on the remote host. Please
>>>> check that any firewall (e.g., iptables) has been disabled and
>>>> try again.
>>>> ------------------------------------------------------------
>>>> --------------------------------------------------------------------------
>>>>
>>>> No OpenFabrics connection schemes reported that they were able to be
>>>> used on a specific port.  As such, the openib BTL (OpenFabrics
>>>> support) will be disabled for this port.
>>>>
>>>>   Local host:           smd
>>>>   Local device:         mlx4_0
>>>>   Local port:           1
>>>>   CPCs attempted:       rdmacm, udcm
>>>> --------------------------------------------------------------------------
>>>>
>>>> Process 1 received token -1 from process 0
>>>> Process 0 received token -1 from process 1
>>>> [smd:12800] 1 more process has sent help message
>>>> help-mpi-btl-openib-cpc-base.txt / no cpcs for port
>>>> [smd:12800] Set MCA parameter "orte_base_help_aggregate" to 0 to see
>>>> all help / error messages
>>>>
>>>> When increasing the number of processes, no program output is produced.
>>>>
>>>> mpirun -mca btl openib,self -np 4 --hostfile nodes ./ring
>>>> ------------------------------------------------------------
>>>> A process or daemon was unable to complete a TCP connection
>>>> to another process:
>>>>   Local host:    sm2
>>>>   Remote host:   192.168.1.200
>>>> This is usually caused by a firewall on the remote host. Please
>>>> check that any firewall (e.g., iptables) has been disabled and
>>>> try again.
>>>> ------------------------------------------------------------
>>>> *** An error occurred in MPI_Init
>>>> *** on a NULL communicator
>>>> *** MPI_ERRORS_ARE_FATAL (processes in this communicator will now abort,
>>>> ***    and potentially your MPI job)
>>>> *** An error occurred in MPI_Init
>>>> *** on a NULL communicator
>>>> *** MPI_ERRORS_ARE_FATAL (processes in this communicator will now abort,
>>>> ***    and potentially your MPI job)
>>>> --------------------------------------------------------------------------
>>>>
>>>> A requested component was not found, or was unable to be opened. This
>>>> means that this component is either not installed or is unable to be
>>>> used on your system (e.g., sometimes this means that shared libraries
>>>> that the component requires are unable to be found/loaded). Note that
>>>> Open MPI stopped checking at the first component that it did not find.
>>>>
>>>> Host:      sm1.overst.local
>>>> Framework: btl
>>>> Component: openib
>>>> --------------------------------------------------------------------------
>>>>
>>>> --------------------------------------------------------------------------
>>>>
>>>> It looks like MPI_INIT failed for some reason; your parallel process is
>>>> likely to abort.  There are many reasons that a parallel process can
>>>> fail during MPI_INIT; some of which are due to configuration or
>>>> environment
>>>> problems.  This failure appears to be an internal failure; here's some
>>>> additional information (which may only be relevant to an Open MPI
>>>> developer):
>>>>
>>>>   mca_bml_base_open() failed
>>>>   --> Returned "Not found" (-13) instead of "Success" (0)
>>>> --------------------------------------------------------------------------
>>>>
>>>> --------------------------------------------------------------------------
>>>>
>>>> No OpenFabrics connection schemes reported that they were able to be
>>>> used on a specific port.  As such, the openib BTL (OpenFabrics
>>>> support) will be disabled for this port.
>>>>
>>>>   Local host:           smd
>>>>   Local device:         mlx4_0
>>>>   Local port:           1
>>>>   CPCs attempted:       rdmacm, udcm
>>>> --------------------------------------------------------------------------
>>>>
>>>> [smd:12953] 1 more process has sent help message help-mca-base.txt /
>>>> find-available:not-valid
>>>> [smd:12953] Set MCA parameter "orte_base_help_aggregate" to 0 to see
>>>> all help / error messages
>>>> [smd:12953] 1 more process has sent help message help-mpi-runtime.txt /
>>>> mpi_init:startup:internal-failure
>>>> [smd:12953] 1 more process has sent help message
>>>> help-mpi-btl-openib-cpc-base.txt / no cpcs for port
>>>>
>>>> Running mpirun from other nodes does not resolve the issue. I have
>>>> checked that none of the nodes is running a firewall that would block
>>>> TCP connections.
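>>>>
>>>> For example, the firewall check on the CentOS nodes was along these
>>>> lines (a sketch; the exact commands may have varied):
>>>>
>>>> systemctl status firewalld   # expect "not found" or "inactive"
>>>> iptables -L -n               # expect empty ACCEPT chains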
>>>>
>>>> The error with the mlx4_0 adapter is expected, as it is used as a
>>>> 10Gb Ethernet adapter to another network. The adapter on smd that is
>>>> being used for QDR Infiniband is mlx4_1.
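>>>>
>>>> (ibstat can confirm this; e.g. "ibstat mlx4_1" should show the QDR
>>>> port as Active. The actual output is not included here.)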
>>>>
>>>> Any help would be appreciated.
>>>>
>>>> Sincerely,
>>>>
>>>> Allan Overstreet
>>>>
>>>
>>
>