Gus, Gilles and John,

Thanks for the help. Let me first post (below) the output of the suggested
checks of the IB network:
ibdiagnet
ibhosts
ibstat  (on the login node only, for now)

What do you think?
Thanks
--Boris


%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%

-bash-4.1$ ibdiagnet
----------
Load Plugins from:
/usr/share/ibdiagnet2.1.1/plugins/
(You can specify more paths to be looked in with "IBDIAGNET_PLUGINS_PATH"
env variable)

Plugin Name                                   Result     Comment
libibdiagnet_cable_diag_plugin-2.1.1          Succeeded  Plugin loaded
libibdiagnet_phy_diag_plugin-2.1.1            Succeeded  Plugin loaded

---------------------------------------------
Discovery
-E- Failed to initialize

-E- Fabric Discover failed, err=IBDiag initialize wasn't done
-E- Fabric Discover failed, MAD err=Failed to register SMI class

---------------------------------------------
Summary
-I- Stage                     Warnings   Errors     Comment
-I- Discovery                                       NA
-I- Lids Check                                      NA
-I- Links Check                                     NA
-I- Subnet Manager                                  NA
-I- Port Counters                                   NA
-I- Nodes Information                               NA
-I- Speed / Width checks                            NA
-I- Partition Keys                                  NA
-I- Alias GUIDs                                     NA
-I- Temperature Sensing                             NA

-I- You can find detailed errors/warnings in:
/var/tmp/ibdiagnet2/ibdiagnet2.log

-E- A fatal error occurred, exiting...
-bash-4.1$
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
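(Side note on the two failures above: ibdiagnet and ibhosts go through the
user-MAD interface, so they generally need to be run as root with the ib_umad
module loaded, and as far as I know they can only discover a fabric on ports
whose link layer is InfiniBand. A minimal re-check, assuming sudo access on
the login node:)

  # Make sure the user-MAD kernel module is present, then retry discovery as root.
  lsmod | grep ib_umad || sudo modprobe ib_umad
  sudo ibdiagnet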

-bash-4.1$ ibhosts
ibwarn: [168221] mad_rpc_open_port: client_register for mgmt 1 failed
src/ibnetdisc.c:766; can't open MAD port ((null):0)
/usr/sbin/ibnetdiscover: iberror: failed: discover failed
-bash-4.1$

%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
-bash-4.1$ ibstat
CA 'mlx5_0'
        CA type: MT4115
        Number of ports: 1
        Firmware version: 12.17.2020
        Hardware version: 0
        Node GUID: 0x248a0703005abb1c
        System image GUID: 0x248a0703005abb1c
        Port 1:
                State: Active
                Physical state: LinkUp
                Rate: 100
                Base lid: 0
                LMC: 0
                SM lid: 0
                Capability mask: 0x3c010000
                Port GUID: 0x268a07fffe5abb1c
                Link layer: Ethernet
CA 'mlx5_1'
        CA type: MT4115
        Number of ports: 1
        Firmware version: 12.17.2020
        Hardware version: 0
        Node GUID: 0x248a0703005abb1d
        System image GUID: 0x248a0703005abb1c
        Port 1:
                State: Active
                Physical state: LinkUp
                Rate: 100
                Base lid: 0
                LMC: 0
                SM lid: 0
                Capability mask: 0x3c010000
                Port GUID: 0x0000000000000000
                Link layer: Ethernet
CA 'mlx5_2'
        CA type: MT4115
        Number of ports: 1
        Firmware version: 12.17.2020
        Hardware version: 0
        Node GUID: 0x248a0703005abb30
        System image GUID: 0x248a0703005abb30
        Port 1:
                State: Down
                Physical state: Disabled
                Rate: 100
                Base lid: 0
                LMC: 0
                SM lid: 0
                Capability mask: 0x3c010000
                Port GUID: 0x268a07fffe5abb30
                Link layer: Ethernet
CA 'mlx5_3'
        CA type: MT4115
        Number of ports: 1
        Firmware version: 12.17.2020
        Hardware version: 0
        Node GUID: 0x248a0703005abb31
        System image GUID: 0x248a0703005abb30
        Port 1:
                State: Down
                Physical state: Disabled
                Rate: 100
                Base lid: 0
                LMC: 0
                SM lid: 0
                Capability mask: 0x3c010000
                Port GUID: 0x268a07fffe5abb31
                Link layer: Ethernet
-bash-4.1$
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
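(One observation on the ibstat output above: every port reports "Link layer:
Ethernet", which would also explain why the subnet-discovery tools fail on
this host. A quick way to double-check what the verbs layer itself reports,
as a sketch using the standard libibverbs utility:)

  # Summarize device name, port state and link layer as seen by libibverbs.
  ibv_devinfo | grep -E "hca_id|state|link_layer"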

On Fri, Jul 14, 2017 at 12:37 AM, John Hearns via users <
users@lists.open-mpi.org> wrote:

> Boris, as Gilles says - first do some lower-level checkouts of your
> InfiniBand network.
> I suggest running:
> ibdiagnet
> ibhosts
> and then as Gilles says 'ibstat' on each node
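(To run these checks on every compute node in one go, a minimal sketch; the
node names are taken from the original mpiexec line and may need adjusting:)

  # Collect a short ibstat summary from each node over ssh.
  for h in node01 node02 node03 node04 node05; do
      echo "== $h =="
      ssh "$h" 'ibstat | grep -E "^CA |State:|SM lid:|Link layer:"'
  done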
>
>
>
> On 14 July 2017 at 03:58, Gilles Gouaillardet <gil...@rist.or.jp> wrote:
>
>> Boris,
>>
>>
>> Open MPI should automatically detect the InfiniBand hardware, and use
>> openib (and *not* tcp) for inter-node communications
>>
>> and a shared-memory-optimized btl (e.g. sm or vader) for intra-node
>> communications.
>>
>>
>> note that if you pass "-mca btl openib,self", you tell Open MPI to use the
>> openib btl between any pair of tasks,
>>
>> including tasks running on the same node (which is less efficient than
>> using sm or vader)
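(For example, a minimal sketch reusing the binary, hostfile and install prefix
from the original post; "./DoWork" assumes the executable sits in the working
directory:)

  # openib between nodes, vader (shared memory) within a node, self for loopback.
  /usr/local/open-mpi/1.10.7/bin/mpiexec --mca btl openib,vader,self \
      --hostfile hostfile5 -n 200 ./DoWork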
>>
>>
>> at first, I suggest you make sure InfiniBand is up and running on all
>> your nodes.
>>
>> (just run ibstat; at least one port should be listed, its state should be
>> Active, and all nodes should have the same SM lid)
>>
>>
>> then try to run two tasks on two nodes.
>>
>>
>> if this does not work, you can
>>
>> mpirun --mca btl_base_verbose 100 ...
>>
>> and post the logs so we can investigate from there.
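(Concretely, something along these lines; a sketch with the paths and node
names taken from the earlier messages, and an arbitrary log file name:)

  # Two tasks on two nodes, with verbose BTL selection captured for posting.
  /usr/local/open-mpi/1.10.7/bin/mpiexec --mca btl openib,vader,self \
      --mca btl_base_verbose 100 -host node01,node02 -n 2 ./DoWork 2>&1 | tee btl_verbose.log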
>>
>>
>> Cheers,
>>
>>
>> Gilles
>>
>>
>>
>> On 7/14/2017 6:43 AM, Boris M. Vulovic wrote:
>>
>>>
>>> I would like to know how to invoke the InfiniBand hardware on a CentOS 6.x
>>> cluster with Open MPI (static libs.) for running my C++ code. This is how I
>>> compile and run:
>>>
>>> /usr/local/open-mpi/1.10.7/bin/mpic++ -L/usr/local/open-mpi/1.10.7/lib
>>> -Bstatic main.cpp -o DoWork
>>>
>>> /usr/local/open-mpi/1.10.7/bin/mpiexec -mca btl tcp,self --hostfile
>>> hostfile5 -host node01,node02,node03,node04,node05 -n 200 DoWork
>>>
>>> Here, "-mca btl tcp,self" specifies that TCP is used, even though the
>>> cluster has InfiniBand.
>>>
>>> What should be changed in the compile and run commands for InfiniBand
>>> to be used? If I just replace "-mca btl tcp,self" with "-mca btl
>>> openib,self" then I get plenty of errors, with the relevant one saying:
>>>
>>> At least one pair of MPI processes are unable to reach each other for
>>> MPI communications. This means that no Open MPI device has indicated that
>>> it can be used to communicate between these processes. This is an error;
>>> Open MPI requires that all MPI processes be able to reach each other. This
>>> error can sometimes be the result of forgetting to specify the "self" BTL.
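(A quick way to confirm that this Open MPI build actually contains the openib
btl, as a sketch using the same install prefix:)

  # List the BTL components compiled into this Open MPI installation;
  # "openib" must appear here for "-mca btl openib,self" to be usable.
  /usr/local/open-mpi/1.10.7/bin/ompi_info | grep btl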
>>>
>>> Thanks very much!!!
>>>
>>>
>>> Boris
>>>
>>>
>>>
>>>
>>
>
>



-- 

Boris M. Vulovic
_______________________________________________
users mailing list
users@lists.open-mpi.org
https://rfd.newmexicoconsortium.org/mailman/listinfo/users
