Boris,

These logs seem a bit odd to me.
As far as I remember, the state is POLLING when there is no subnet manager,
and when there is one, the state is ACTIVE *and* both the Base and SM LIDs
are non-zero. Your ports report Active, yet both LIDs are zero.
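
If the infiniband-diags tools are installed, you can also do a quick check for a subnet manager with:

sminfo

It should print the LID and GUID of the master SM, or fail with an error if none is reachable (just a sanity check, not a full diagnosis).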

By the way, is IPoIB configured?
If so, can your hosts ping each other over that interface?
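For example (assuming the IPoIB interface is named ib0, adjust the interface name and address to your setup):

ip addr show ib0                          # does the interface exist and have an IP address?
ping -c 3 <IPoIB address of another node>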

I noticed your host has 4 IB ports, but only 2 are active.
You might want to try using a single port at first; for example, you can
mpirun --mca btl_openib_if_include mlx5_0 ...
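For instance, a full command line could look like this (just a sketch, reusing the mpiexec path, binary, and host names from your first mail; adjust to your setup, and start with 2 tasks on 2 nodes):

/usr/local/open-mpi/1.10.7/bin/mpiexec --mca btl openib,vader,self --mca btl_openib_if_include mlx5_0 -np 2 -host node01,node02 DoWork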

Cheers,

Gilles

On Sat, Jul 15, 2017 at 1:34 AM, Boris M. Vulovic
<boris.m.vulo...@gmail.com> wrote:
> Gus, Gilles and John,
>
> Thanks for the help. Let me first post (below) the output from checks of
> the IB network:
> ibdiagnet
> ibhosts
> ibstat  (for login node, for now)
>
> What do you think?
> Thanks
> --Boris
>
>
> %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
>
> -bash-4.1$ ibdiagnet
> ----------
> Load Plugins from:
> /usr/share/ibdiagnet2.1.1/plugins/
> (You can specify more paths to be looked in with "IBDIAGNET_PLUGINS_PATH"
> env variable)
>
> Plugin Name                                   Result     Comment
> libibdiagnet_cable_diag_plugin-2.1.1          Succeeded  Plugin loaded
> libibdiagnet_phy_diag_plugin-2.1.1            Succeeded  Plugin loaded
>
> ---------------------------------------------
> Discovery
> -E- Failed to initialize
>
> -E- Fabric Discover failed, err=IBDiag initialize wasn't done
> -E- Fabric Discover failed, MAD err=Failed to register SMI class
>
> ---------------------------------------------
> Summary
> -I- Stage                     Warnings   Errors     Comment
> -I- Discovery                                       NA
> -I- Lids Check                                      NA
> -I- Links Check                                     NA
> -I- Subnet Manager                                  NA
> -I- Port Counters                                   NA
> -I- Nodes Information                               NA
> -I- Speed / Width checks                            NA
> -I- Partition Keys                                  NA
> -I- Alias GUIDs                                     NA
> -I- Temperature Sensing                             NA
>
> -I- You can find detailed errors/warnings in:
> /var/tmp/ibdiagnet2/ibdiagnet2.log
>
> -E- A fatal error occurred, exiting...
> -bash-4.1$
> %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
>
> -bash-4.1$ ibhosts
> ibwarn: [168221] mad_rpc_open_port: client_register for mgmt 1 failed
> src/ibnetdisc.c:766; can't open MAD port ((null):0)
> /usr/sbin/ibnetdiscover: iberror: failed: discover failed
> -bash-4.1$
>
> %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
> -bash-4.1$ ibstat
> CA 'mlx5_0'
>         CA type: MT4115
>         Number of ports: 1
>         Firmware version: 12.17.2020
>         Hardware version: 0
>         Node GUID: 0x248a0703005abb1c
>         System image GUID: 0x248a0703005abb1c
>         Port 1:
>                 State: Active
>                 Physical state: LinkUp
>                 Rate: 100
>                 Base lid: 0
>                 LMC: 0
>                 SM lid: 0
>                 Capability mask: 0x3c010000
>                 Port GUID: 0x268a07fffe5abb1c
>                 Link layer: Ethernet
> CA 'mlx5_1'
>         CA type: MT4115
>         Number of ports: 1
>         Firmware version: 12.17.2020
>         Hardware version: 0
>         Node GUID: 0x248a0703005abb1d
>         System image GUID: 0x248a0703005abb1c
>         Port 1:
>                 State: Active
>                 Physical state: LinkUp
>                 Rate: 100
>                 Base lid: 0
>                 LMC: 0
>                 SM lid: 0
>                 Capability mask: 0x3c010000
>                 Port GUID: 0x0000000000000000
>                 Link layer: Ethernet
> CA 'mlx5_2'
>         CA type: MT4115
>         Number of ports: 1
>         Firmware version: 12.17.2020
>         Hardware version: 0
>         Node GUID: 0x248a0703005abb30
>         System image GUID: 0x248a0703005abb30
>         Port 1:
>                 State: Down
>                 Physical state: Disabled
>                 Rate: 100
>                 Base lid: 0
>                 LMC: 0
>                 SM lid: 0
>                 Capability mask: 0x3c010000
>                 Port GUID: 0x268a07fffe5abb30
>                 Link layer: Ethernet
> CA 'mlx5_3'
>         CA type: MT4115
>         Number of ports: 1
>         Firmware version: 12.17.2020
>         Hardware version: 0
>         Node GUID: 0x248a0703005abb31
>         System image GUID: 0x248a0703005abb30
>         Port 1:
>                 State: Down
>                 Physical state: Disabled
>                 Rate: 100
>                 Base lid: 0
>                 LMC: 0
>                 SM lid: 0
>                 Capability mask: 0x3c010000
>                 Port GUID: 0x268a07fffe5abb31
>                 Link layer: Ethernet
> -bash-4.1$
> %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
>
> On Fri, Jul 14, 2017 at 12:37 AM, John Hearns via users
> <users@lists.open-mpi.org> wrote:
>>
>> Boris, as Gilles says, first do some lower-level checks of your
>> InfiniBand network.
>> I suggest running:
>> ibdiagnet
>> ibhosts
>> and then, as Gilles says, run 'ibstat' on each node
>>
>>
>>
>> On 14 July 2017 at 03:58, Gilles Gouaillardet <gil...@rist.or.jp> wrote:
>>>
>>> Boris,
>>>
>>>
>>> Open MPI should automatically detect the infiniband hardware, and use
>>> openib (and *not* tcp) for inter node communications
>>>
>>> and a shared memory optimized btl (e.g. sm or vader) for intra node
>>> communications.
>>>
>>>
>>> note if you "-mca btl openib,self", you tell Open MPI to use the openib
>>> btl between any tasks,
>>>
>>> including tasks running on the same node (which is less efficient than
>>> using sm or vader)
>>>
>>>
>>> at first, i suggest you make sure infiniband is up and running on all
>>> your nodes.
>>>
>>> (just run ibstat, at least one port should be listed, state should be
>>> Active, and all nodes should have the same SM lid)
>>>
>>>
>>> then try to run two tasks on two nodes.
>>>
>>>
>>> if this does not work, you can
>>>
>>> mpirun --mca btl_base_verbose 100 ...
>>>
>>> and post the logs so we can investigate from there.
>>>
>>>
>>> Cheers,
>>>
>>>
>>> Gilles
>>>
>>>
>>>
>>> On 7/14/2017 6:43 AM, Boris M. Vulovic wrote:
>>>>
>>>>
>>>> I would like to know how to invoke the InfiniBand hardware on a CentOS 6.x
>>>> cluster with Open MPI (static libs.) for running my C++ code. This is how I
>>>> compile and run:
>>>>
>>>> /usr/local/open-mpi/1.10.7/bin/mpic++ -L/usr/local/open-mpi/1.10.7/lib
>>>> -Bstatic main.cpp -o DoWork
>>>>
>>>> /usr/local/open-mpi/1.10.7/bin/mpiexec -mca btl tcp,self --hostfile
>>>> hostfile5 -host node01,node02,node03,node04,node05 -n 200 DoWork
>>>>
>>>> Here, "*-mca btl tcp,self*" means that *TCP* is used, even though the cluster
>>>> has InfiniBand.
>>>>
>>>> What should be changed in compiling and running commands for InfiniBand
>>>> to be invoked? If I just replace "*-mca btl tcp,self*" with "*-mca btl
>>>> openib,self*" then I get plenty of errors with relevant one saying:
>>>>
>>>> /At least one pair of MPI processes are unable to reach each other for
>>>> MPI communications. This means that no Open MPI device has indicated that
>>>> it can be used to communicate between these processes. This is an error;
>>>> Open MPI requires that all MPI processes be able to reach each other. This
>>>> error can sometimes be the result of forgetting to specify the "self" BTL./
>>>>
>>>> Thanks very much!!!
>>>>
>>>>
>>>> *Boris *
>>>>
> --
>
> Boris M. Vulovic
>
_______________________________________________
users mailing list
users@lists.open-mpi.org
https://rfd.newmexicoconsortium.org/mailman/listinfo/users
