Boris, do you have a Subnet Manager running on your fabric? I am sorry if there have been other replies to this over the weekend. A couple of quick checks are sketched just below, and two further notes follow after the quoted thread.
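
If you are not sure, these should tell you quickly (this assumes the usual infiniband-diags tools and the opensm package; the service name can vary between distributions, and a hardware subnet manager embedded in a managed switch also counts):

    sminfo                     # queries the subnet manager; it errors out if none is reachable
    service opensmd status     # on CentOS 6, if opensm is meant to run on one of your nodes
    ibstat | grep "SM lid"     # a non-zero SM lid means a subnet manager has configured that port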
On 14 July 2017 at 18:34, Boris M. Vulovic <boris.m.vulo...@gmail.com> wrote:
> Gus, Gilles and John,
>
> Thanks for the help. Let me first post (below) the output from checkouts
> of the IB network:
> ibdiagnet
> ibhosts
> ibstat (for login node, for now)
>
> What do you think?
> Thanks
> --Boris
>
> %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
>
> -bash-4.1$ *ibdiagnet*
> ----------
> Load Plugins from:
> /usr/share/ibdiagnet2.1.1/plugins/
> (You can specify more paths to be looked in with "IBDIAGNET_PLUGINS_PATH"
> env variable)
>
> Plugin Name                            Result     Comment
> libibdiagnet_cable_diag_plugin-2.1.1   Succeeded  Plugin loaded
> libibdiagnet_phy_diag_plugin-2.1.1     Succeeded  Plugin loaded
>
> ---------------------------------------------
> Discovery
> -E- Failed to initialize
>
> -E- Fabric Discover failed, err=IBDiag initialize wasn't done
> -E- Fabric Discover failed, MAD err=Failed to register SMI class
>
> ---------------------------------------------
> Summary
> -I- Stage                     Warnings   Errors     Comment
> -I- Discovery                                       NA
> -I- Lids Check                                      NA
> -I- Links Check                                     NA
> -I- Subnet Manager                                  NA
> -I- Port Counters                                   NA
> -I- Nodes Information                               NA
> -I- Speed / Width checks                            NA
> -I- Partition Keys                                  NA
> -I- Alias GUIDs                                     NA
> -I- Temperature Sensing                             NA
>
> -I- You can find detailed errors/warnings in: /var/tmp/ibdiagnet2/ibdiagnet2.log
>
> -E- A fatal error occurred, exiting...
> -bash-4.1$
>
> %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
>
> -bash-4.1$ *ibhosts*
> ibwarn: [168221] mad_rpc_open_port: client_register for mgmt 1 failed
> src/ibnetdisc.c:766; can't open MAD port ((null):0)
> /usr/sbin/ibnetdiscover: iberror: failed: discover failed
> -bash-4.1$
>
> %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
>
> -bash-4.1$ *ibstat*
> CA 'mlx5_0'
>     CA type: MT4115
>     Number of ports: 1
>     Firmware version: 12.17.2020
>     Hardware version: 0
>     Node GUID: 0x248a0703005abb1c
>     System image GUID: 0x248a0703005abb1c
>     Port 1:
>         State: Active
>         Physical state: LinkUp
>         Rate: 100
>         Base lid: 0
>         LMC: 0
>         SM lid: 0
>         Capability mask: 0x3c010000
>         Port GUID: 0x268a07fffe5abb1c
>         Link layer: Ethernet
> CA 'mlx5_1'
>     CA type: MT4115
>     Number of ports: 1
>     Firmware version: 12.17.2020
>     Hardware version: 0
>     Node GUID: 0x248a0703005abb1d
>     System image GUID: 0x248a0703005abb1c
>     Port 1:
>         State: Active
>         Physical state: LinkUp
>         Rate: 100
>         Base lid: 0
>         LMC: 0
>         SM lid: 0
>         Capability mask: 0x3c010000
>         Port GUID: 0x0000000000000000
>         Link layer: Ethernet
> CA 'mlx5_2'
>     CA type: MT4115
>     Number of ports: 1
>     Firmware version: 12.17.2020
>     Hardware version: 0
>     Node GUID: 0x248a0703005abb30
>     System image GUID: 0x248a0703005abb30
>     Port 1:
>         State: Down
>         Physical state: Disabled
>         Rate: 100
>         Base lid: 0
>         LMC: 0
>         SM lid: 0
>         Capability mask: 0x3c010000
>         Port GUID: 0x268a07fffe5abb30
>         Link layer: Ethernet
> CA 'mlx5_3'
>     CA type: MT4115
>     Number of ports: 1
>     Firmware version: 12.17.2020
>     Hardware version: 0
>     Node GUID: 0x248a0703005abb31
>     System image GUID: 0x248a0703005abb30
>     Port 1:
>         State: Down
>         Physical state: Disabled
>         Rate: 100
>         Base lid: 0
>         LMC: 0
>         SM lid: 0
>         Capability mask: 0x3c010000
>         Port GUID: 0x268a07fffe5abb31
>         Link layer: Ethernet
> -bash-4.1$
>
> %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
>
> On Fri, Jul 14, 2017 at 12:37 AM, John Hearns via users <users@lists.open-mpi.org> wrote:
>
>> Boris, as Gilles says - first do some lower level checkouts of your
>> Infiniband network.
>> I suggest running:
>> ibdiagnet
>> ibhosts
>> and then as Gilles says 'ibstat' on each node
>>
>> On 14 July 2017 at 03:58, Gilles Gouaillardet <gil...@rist.or.jp> wrote:
>>
>>> Boris,
>>>
>>> Open MPI should automatically detect the infiniband hardware, and use
>>> openib (and *not* tcp) for inter node communications,
>>> and a shared memory optimized btl (e.g. sm or vader) for intra node
>>> communications.
>>>
>>> note if you "-mca btl openib,self", you tell Open MPI to use the openib
>>> btl between any tasks,
>>> including tasks running on the same node (which is less efficient than
>>> using sm or vader)
>>>
>>> at first, i suggest you make sure infiniband is up and running on all
>>> your nodes.
>>> (just run ibstat, at least one port should be listed, state should be
>>> Active, and all nodes should have the same SM lid)
>>>
>>> then try to run two tasks on two nodes.
>>>
>>> if this does not work, you can
>>> mpirun --mca btl_base_verbose 100 ...
>>> and post the logs so we can investigate from there.
>>>
>>> Cheers,
>>>
>>> Gilles
>>>
>>> On 7/14/2017 6:43 AM, Boris M. Vulovic wrote:
>>>>
>>>> I would like to know how to invoke InfiniBand hardware on a CentOS 6x
>>>> cluster with OpenMPI (static libs.) for running my C++ code. This is how I
>>>> compile and run:
>>>>
>>>> /usr/local/open-mpi/1.10.7/bin/mpic++ -L/usr/local/open-mpi/1.10.7/lib
>>>> -Bstatic main.cpp -o DoWork
>>>>
>>>> /usr/local/open-mpi/1.10.7/bin/mpiexec -mca btl tcp,self --hostfile
>>>> hostfile5 -host node01,node02,node03,node04,node05 -n 200 DoWork
>>>>
>>>> Here, "*-mca btl tcp,self*" reveals that *TCP* is used, and the cluster
>>>> has InfiniBand.
>>>>
>>>> What should be changed in the compile and run commands for InfiniBand
>>>> to be invoked? If I just replace "*-mca btl tcp,self*" with "*-mca btl
>>>> openib,self*" then I get plenty of errors, with the relevant one saying:
>>>>
>>>> /At least one pair of MPI processes are unable to reach each other for
>>>> MPI communications. This means that no Open MPI device has indicated that
>>>> it can be used to communicate between these processes. This is an error;
>>>> Open MPI requires that all MPI processes be able to reach each other. This
>>>> error can sometimes be the result of forgetting to specify the "self" BTL./
>>>>
>>>> Thanks very much!!!
>>>>
>>>> *Boris*
>>>>
>>>
>>
>
> --
>
> *Boris M. Vulovic*
>
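
Looking at the ibstat output quoted above, one thing stands out: every port reports "Link layer: Ethernet", and the two active ports show "SM lid: 0". That is consistent with the ibdiagnet/ibhosts failures ("Failed to register SMI class" / "can't open MAD port"), and it suggests the ConnectX-4 (MT4115) ports are currently configured for Ethernet rather than InfiniBand. If the cabling and switches really are InfiniBand, the port type can normally be changed with Mellanox's mlxconfig. A rough sketch only - the mst device name below is just an illustration, check what "mst status" prints on your machine, and the card needs a reboot (or driver restart) for the change to take effect:

    ibv_devinfo | grep -E "hca_id|link_layer"    # what the verbs layer sees for each port
    mst start                                    # Mellanox Firmware Tools (MFT)
    mst status                                   # lists the /dev/mst/... device names
    mlxconfig -d /dev/mst/mt4115_pciconf0 query | grep LINK_TYPE
    mlxconfig -d /dev/mst/mt4115_pciconf0 set LINK_TYPE_P1=1    # 1 = InfiniBand, 2 = Ethernet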
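
On the Open MPI side, once ibstat on every node shows an Active port with "Link layer: InfiniBand" and the same non-zero SM lid, something along these lines should exercise the openib BTL (hostfile5, DoWork and the node names are taken from the original post; vader carries the on-node traffic, and btl_base_verbose is the flag Gilles suggested so the logs can be posted here if it still fails):

    /usr/local/open-mpi/1.10.7/bin/mpiexec -mca btl openib,vader,self \
        -mca btl_base_verbose 100 \
        --hostfile hostfile5 -host node01,node02 -np 2 ./DoWork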
_______________________________________________
users mailing list
users@lists.open-mpi.org
https://rfd.newmexicoconsortium.org/mailman/listinfo/users