Since these ports are running in Ethernet mode (as opposed to InfiniBand
mode, where IPoIB would provide them), I do not think the interface names
will be of the ibN (ib0, ib1, etc.) format. They are more likely to be of
the form ethN or enPApBsCfD.

It would be best to check with your system administrator, but if
'ethtool' is installed, you can check the line speed of an interface by
running 'ethtool <interface_name>' and looking for the 'Speed:' value.
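
For example, a minimal sketch ('ens2f0' is just a placeholder; substitute
one of your actual interface names):

    $ ethtool ens2f0 | grep -i speed
            Speed: 100000Mb/s

On a 100 GbE link you would expect to see a value like the one above.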

Cheers,
Rusty D.



On Mon, Jul 17, 2017 at 1:06 PM, Gus Correa <g...@ldeo.columbia.edu> wrote:
> Hi Boris
>
> The nodes may have standard Gigabit Ethernet interfaces,
> besides the Infiniband (RoCE).
> You may want to direct OpenMPI to use the Infiniband interfaces,
> not Gigabit Ethernet,
> by adding something like this alongside "--mca btl tcp,vader,self":
>
> "--mca btl_tcp_if_include ib0,ib1"
>
> (Where the interface names ib0,ib1 are just my guess for
> what your nodes may have. Check with your "root" system administrator!)
>
> That option also accepts IP addresses or subnets (in CIDR notation),
> whichever is simpler for you.
> It is explained in more detail in this FAQ:
>
> https://www.open-mpi.org/faq/?category=all#tcp-selection
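>
> For example, a minimal sketch (the subnet, host names, and binary here
> are just placeholders; substitute your own):
>
>   mpirun --mca btl tcp,vader,self \
>          --mca btl_tcp_if_include 192.168.1.0/24 \
>          -np 4 -host node01,node02 ./DoWork
>
> You could equally list interface names there (e.g.
> "--mca btl_tcp_if_include ens2f0") instead of a subnet.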
>
> BTW, some of your questions (and others that you may hit later)
> are covered in the OpenMPI FAQ:
>
> https://www.open-mpi.org/faq/?category=all
>
> I hope this helps,
> Gus Correa
>
>
> On 07/17/2017 12:43 PM, Boris M. Vulovic wrote:
>>
>> Gus, Gilles, Russell, John:
>>
>> Thanks very much for the replies and the help.
>> I got confirmation from the "root" that it is indeed RoCE with 100G.
>>
>> I'll go over the info in the link Russell provided, but I have a quick
>> question: if I run "*mpiexec*" with "*-mca btl tcp,self*", do I get the
>> benefit of *RoCE* (the fastest speed)?
>>
>> I'll go over the details of all the replies and post useful feedback.
>>
>> Thanks very much all!
>>
>> Best,
>>
>> --Boris
>>
>>
>>
>>
>>     On Mon, Jul 17, 2017 at 6:31 AM, Russell Dekema <deke...@umich.edu> wrote:
>>
>>     It looks like you have two dual-port Mellanox VPI cards in this
>>     machine. These cards can be set to run InfiniBand or Ethernet on a
>>     port-by-port basis, and all four of your ports are set to Ethernet
>>     mode. Two of your ports have active 100 gigabit Ethernet links, and
>>     the other two have no link up at all.
>>
>>     With no InfiniBand links on the machine, you will, of course, not be
>>     able to run your OpenMPI job over InfiniBand.
>>
>>     If your machines and network are set up for it, you might be able to
>>     run your job over RoCE (RDMA Over Converged Ethernet) using one or
>>     both of those 100 GbE links. I have never used RoCE myself, but one
>>     starting point for gathering more information on it might be the
>>     following section of the OpenMPI FAQ:
>>
>>     https://www.open-mpi.org/faq/?category=openfabrics#ompi-over-roce
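>>
>>     I have not tried this myself, but as a rough sketch based on that FAQ
>>     section, running over RoCE generally means using the openib BTL
>>     together with the RDMA CM connection manager, for example:
>>
>>         mpirun --mca btl openib,vader,self \
>>                --mca btl_openib_cpc_include rdmacm \
>>                -np 4 -host node01,node02 ./DoWork
>>
>>     (The host names and binary are just the ones from this thread; see
>>     the FAQ for the details that apply to your Open MPI version.)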
>>
>>     Sincerely,
>>     Rusty Dekema
>>     University of Michigan
>>     Advanced Research Computing - Technology Services
>>
>>
>>     On Fri, Jul 14, 2017 at 12:34 PM, Boris M. Vulovic
>>     <boris.m.vulo...@gmail.com> wrote:
>>      > Gus, Gilles and John,
>>      >
>>      > Thanks for the help. Let me first post (below) the output from
>>      > checkouts of the IB network:
>>      > ibdiagnet
>>      > ibhosts
>>      > ibstat  (for login node, for now)
>>      >
>>      > What do you think?
>>      > Thanks
>>      > --Boris
>>      >
>>      >
>>      >
>>
>> %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
>>      >
>>      > -bash-4.1$ ibdiagnet
>>      > ----------
>>      > Load Plugins from:
>>      > /usr/share/ibdiagnet2.1.1/plugins/
>>      > (You can specify more paths to be looked in with
>>     "IBDIAGNET_PLUGINS_PATH"
>>      > env variable)
>>      >
>>      > Plugin Name                                   Result     Comment
>>      > libibdiagnet_cable_diag_plugin-2.1.1          Succeeded  Plugin loaded
>>      > libibdiagnet_phy_diag_plugin-2.1.1            Succeeded  Plugin loaded
>>      >
>>      > ---------------------------------------------
>>      > Discovery
>>      > -E- Failed to initialize
>>      >
>>      > -E- Fabric Discover failed, err=IBDiag initialize wasn't done
>>      > -E- Fabric Discover failed, MAD err=Failed to register SMI class
>>      >
>>      > ---------------------------------------------
>>      > Summary
>>      > -I- Stage                     Warnings   Errors     Comment
>>      > -I- Discovery                                       NA
>>      > -I- Lids Check                                      NA
>>      > -I- Links Check                                     NA
>>      > -I- Subnet Manager                                  NA
>>      > -I- Port Counters                                   NA
>>      > -I- Nodes Information                               NA
>>      > -I- Speed / Width checks                            NA
>>      > -I- Partition Keys                                  NA
>>      > -I- Alias GUIDs                                     NA
>>      > -I- Temperature Sensing                             NA
>>      >
>>      > -I- You can find detailed errors/warnings in:
>>      > /var/tmp/ibdiagnet2/ibdiagnet2.log
>>      >
>>      > -E- A fatal error occurred, exiting...
>>      > -bash-4.1$
>>      >
>>
>> %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
>>      >
>>      > -bash-4.1$ ibhosts
>>      > ibwarn: [168221] mad_rpc_open_port: client_register for mgmt 1 failed
>>      > src/ibnetdisc.c:766; can't open MAD port ((null):0)
>>      > /usr/sbin/ibnetdiscover: iberror: failed: discover failed
>>      > -bash-4.1$
>>      >
>>      >
>>
>> %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
>>      > -bash-4.1$ ibstat
>>      > CA 'mlx5_0'
>>      >         CA type: MT4115
>>      >         Number of ports: 1
>>      >         Firmware version: 12.17.2020
>>      >         Hardware version: 0
>>      >         Node GUID: 0x248a0703005abb1c
>>      >         System image GUID: 0x248a0703005abb1c
>>      >         Port 1:
>>      >                 State: Active
>>      >                 Physical state: LinkUp
>>      >                 Rate: 100
>>      >                 Base lid: 0
>>      >                 LMC: 0
>>      >                 SM lid: 0
>>      >                 Capability mask: 0x3c010000
>>      >                 Port GUID: 0x268a07fffe5abb1c
>>      >                 Link layer: Ethernet
>>      > CA 'mlx5_1'
>>      >         CA type: MT4115
>>      >         Number of ports: 1
>>      >         Firmware version: 12.17.2020
>>      >         Hardware version: 0
>>      >         Node GUID: 0x248a0703005abb1d
>>      >         System image GUID: 0x248a0703005abb1c
>>      >         Port 1:
>>      >                 State: Active
>>      >                 Physical state: LinkUp
>>      >                 Rate: 100
>>      >                 Base lid: 0
>>      >                 LMC: 0
>>      >                 SM lid: 0
>>      >                 Capability mask: 0x3c010000
>>      >                 Port GUID: 0x0000000000000000
>>      >                 Link layer: Ethernet
>>      > CA 'mlx5_2'
>>      >         CA type: MT4115
>>      >         Number of ports: 1
>>      >         Firmware version: 12.17.2020
>>      >         Hardware version: 0
>>      >         Node GUID: 0x248a0703005abb30
>>      >         System image GUID: 0x248a0703005abb30
>>      >         Port 1:
>>      >                 State: Down
>>      >                 Physical state: Disabled
>>      >                 Rate: 100
>>      >                 Base lid: 0
>>      >                 LMC: 0
>>      >                 SM lid: 0
>>      >                 Capability mask: 0x3c010000
>>      >                 Port GUID: 0x268a07fffe5abb30
>>      >                 Link layer: Ethernet
>>      > CA 'mlx5_3'
>>      >         CA type: MT4115
>>      >         Number of ports: 1
>>      >         Firmware version: 12.17.2020
>>      >         Hardware version: 0
>>      >         Node GUID: 0x248a0703005abb31
>>      >         System image GUID: 0x248a0703005abb30
>>      >         Port 1:
>>      >                 State: Down
>>      >                 Physical state: Disabled
>>      >                 Rate: 100
>>      >                 Base lid: 0
>>      >                 LMC: 0
>>      >                 SM lid: 0
>>      >                 Capability mask: 0x3c010000
>>      >                 Port GUID: 0x268a07fffe5abb31
>>      >                 Link layer: Ethernet
>>      > -bash-4.1$
>>      >
>>
>> %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
>>      >
>>      > On Fri, Jul 14, 2017 at 12:37 AM, John Hearns via users
>>      > <users@lists.open-mpi.org> wrote:
>>      >>
>>      >> Boris, as Gilles says - first do some lower-level checkouts of your
>>      >> InfiniBand network.
>>      >> I suggest running:
>>      >> ibdiagnet
>>      >> ibhosts
>>      >> and then as Gilles says 'ibstat' on each node
>>      >>
>>      >>
>>      >>
>>      >> On 14 July 2017 at 03:58, Gilles Gouaillardet <gil...@rist.or.jp> wrote:
>>      >>>
>>      >>> Boris,
>>      >>>
>>      >>>
>>      >>> Open MPI should automatically detect the InfiniBand hardware and
>>      >>> use openib (and *not* tcp) for inter-node communications, and a
>>      >>> shared-memory-optimized btl (e.g. sm or vader) for intra-node
>>      >>> communications.
>>      >>>
>>      >>>
>>      >>> Note that if you use "-mca btl openib,self", you tell Open MPI to
>>      >>> use the openib btl between all tasks, including tasks running on
>>      >>> the same node (which is less efficient than using sm or vader).
>>      >>>
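>>      >>> For instance, a minimal sketch of the usual recommendation (the
>>      >>> host names and binary are just the ones from this thread):
>>      >>>
>>      >>>     mpirun --mca btl openib,vader,self \
>>      >>>            -np 4 -host node01,node02 ./DoWork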
>>      >>>
>>      >>> First, I suggest you make sure InfiniBand is up and running on all
>>      >>> your nodes.
>>      >>>
>>      >>> (Just run ibstat: at least one port should be listed, its state
>>      >>> should be Active, and all nodes should have the same SM lid.)
>>      >>>
>>      >>>
>>      >>> Then try to run two tasks on two nodes.
>>      >>>
>>      >>>
>>      >>> If this does not work, you can run
>>      >>>
>>      >>> mpirun --mca btl_base_verbose 100 ...
>>      >>>
>>      >>> and post the logs so we can investigate from there.
>>      >>>
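>>      >>> For example (again just a sketch; the log file name is arbitrary
>>      >>> and the host names and binary are the ones from this thread):
>>      >>>
>>      >>>     mpirun --mca btl openib,vader,self \
>>      >>>            --mca btl_base_verbose 100 \
>>      >>>            -np 2 -host node01,node02 ./DoWork 2>&1 | tee verbose.log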
>>      >>>
>>      >>> Cheers,
>>      >>>
>>      >>>
>>      >>> Gilles
>>      >>>
>>      >>>
>>      >>>
>>      >>> On 7/14/2017 6:43 AM, Boris M. Vulovic wrote:
>>      >>>>
>>      >>>>
>>      >>>> I would like to know how to invoke the InfiniBand hardware on a
>>      >>>> CentOS 6.x cluster with Open MPI (static libs.) for running my C++
>>      >>>> code. This is how I compile and run:
>>      >>>>
>>      >>>> /usr/local/open-mpi/1.10.7/bin/mpic++ -L/usr/local/open-mpi/1.10.7/lib -Bstatic main.cpp -o DoWork
>>      >>>>
>>      >>>> /usr/local/open-mpi/1.10.7/bin/mpiexec -mca btl tcp,self --hostfile hostfile5 -host node01,node02,node03,node04,node05 -n 200 DoWork
>>      >>>>
>>      >>>> Here, "*-mca btl tcp,self*" means that *TCP* is used, even though
>>      >>>> the cluster has InfiniBand.
>>      >>>>
>>      >>>> What should be changed in the compile and run commands for
>>      >>>> InfiniBand to be used? If I just replace "*-mca btl tcp,self*" with
>>      >>>> "*-mca btl openib,self*" then I get plenty of errors, with the
>>      >>>> relevant one saying:
>>      >>>>
>>      >>>> /At least one pair of MPI processes are unable to reach each other
>>      >>>> for MPI communications. This means that no Open MPI device has
>>      >>>> indicated that it can be used to communicate between these
>>      >>>> processes. This is an error; Open MPI requires that all MPI
>>      >>>> processes be able to reach each other. This error can sometimes be
>>      >>>> the result of forgetting to specify the "self" BTL./
>>      >>>>
>>      >>>> Thanks very much!!!
>>      >>>>
>>      >>>>
>>      >>>> *Boris *
>>      >>>>
>>      >>>>
>>      >>>>
>>      >>>>
>>      >
>>      >
>>      >
>>      >
>>      > --
>>      >
>>      > Boris M. Vulovic
>>      >
>>      >
>>      >
>>
>>
>>
>>
>> --
>>
>> *Boris M. Vulovic*
>>
>>
>>
>>