Boris, these logs seem a bit odd to me. As far as I remember, the state is POLLING when there is no subnet manager, and when there is one, the state is ACTIVE *but* both the Base and SM lid are non-zero.
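That check can be scripted. The helper below is a hypothetical sketch (the function name, sample text, and output format are illustrative, not from the thread): it scans an `ibstat`-style port section and flags a port as usable only when State is Active *and* the SM lid is non-zero.

```shell
# Hypothetical helper: given an ibstat port section, report whether the port
# looks usable for IB (State must be Active AND the SM lid must be non-zero,
# i.e. a subnet manager has assigned lids).
check_port() {
    state=$(printf '%s\n' "$1" | awk -F': ' '/^ *State:/ {print $2; exit}')
    smlid=$(printf '%s\n' "$1" | awk -F': ' '/^ *SM lid:/ {print $2; exit}')
    if [ "$state" = "Active" ] && [ "$smlid" != "0" ]; then
        echo "usable (SM lid $smlid)"
    else
        echo "not usable (state=$state, SM lid=$smlid)"
    fi
}

# Illustrative sample: Active link but SM lid 0, i.e. no subnet manager seen.
sample='    State: Active
    SM lid: 0'
check_port "$sample"    # → not usable (state=Active, SM lid=0)
```

Run against the `ibstat` output posted below, every port would come back "not usable", which matches the SMI errors from ibdiagnet.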
By the way, is IPoIB configured? If yes, can your hosts ping each other over this interface? I noted your host has 4 IB ports, but only 2 are active. You might want to try using one port at first; for example, you can

    mpirun --mca btl_openib_if_include mlx5_0 ...

Cheers,

Gilles

On Sat, Jul 15, 2017 at 1:34 AM, Boris M. Vulovic <boris.m.vulo...@gmail.com> wrote:
> Gus, Gilles and John,
>
> Thanks for the help. Let me first post (below) the output from checks of
> the IB network:
>     ibdiagnet
>     ibhosts
>     ibstat (for the login node, for now)
>
> What do you think?
> Thanks
> --Boris
>
> %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
>
> -bash-4.1$ ibdiagnet
> ----------
> Load Plugins from:
> /usr/share/ibdiagnet2.1.1/plugins/
> (You can specify more paths to be looked in with "IBDIAGNET_PLUGINS_PATH"
> env variable)
>
> Plugin Name                            Result     Comment
> libibdiagnet_cable_diag_plugin-2.1.1   Succeeded  Plugin loaded
> libibdiagnet_phy_diag_plugin-2.1.1     Succeeded  Plugin loaded
>
> ---------------------------------------------
> Discovery
> -E- Failed to initialize
>
> -E- Fabric Discover failed, err=IBDiag initialize wasn't done
> -E- Fabric Discover failed, MAD err=Failed to register SMI class
>
> ---------------------------------------------
> Summary
> -I- Stage                    Warnings   Errors     Comment
> -I- Discovery                                      NA
> -I- Lids Check                                     NA
> -I- Links Check                                    NA
> -I- Subnet Manager                                 NA
> -I- Port Counters                                  NA
> -I- Nodes Information                              NA
> -I- Speed / Width checks                           NA
> -I- Partition Keys                                 NA
> -I- Alias GUIDs                                    NA
> -I- Temperature Sensing                            NA
>
> -I- You can find detailed errors/warnings in:
>     /var/tmp/ibdiagnet2/ibdiagnet2.log
>
> -E- A fatal error occurred, exiting...
> -bash-4.1$
> %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
>
> -bash-4.1$ ibhosts
> ibwarn: [168221] mad_rpc_open_port: client_register for mgmt 1 failed
> src/ibnetdisc.c:766; can't open MAD port ((null):0)
> /usr/sbin/ibnetdiscover: iberror: failed: discover failed
> -bash-4.1$
>
> %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
>
> -bash-4.1$ ibstat
> CA 'mlx5_0'
>     CA type: MT4115
>     Number of ports: 1
>     Firmware version: 12.17.2020
>     Hardware version: 0
>     Node GUID: 0x248a0703005abb1c
>     System image GUID: 0x248a0703005abb1c
>     Port 1:
>         State: Active
>         Physical state: LinkUp
>         Rate: 100
>         Base lid: 0
>         LMC: 0
>         SM lid: 0
>         Capability mask: 0x3c010000
>         Port GUID: 0x268a07fffe5abb1c
>         Link layer: Ethernet
> CA 'mlx5_1'
>     CA type: MT4115
>     Number of ports: 1
>     Firmware version: 12.17.2020
>     Hardware version: 0
>     Node GUID: 0x248a0703005abb1d
>     System image GUID: 0x248a0703005abb1c
>     Port 1:
>         State: Active
>         Physical state: LinkUp
>         Rate: 100
>         Base lid: 0
>         LMC: 0
>         SM lid: 0
>         Capability mask: 0x3c010000
>         Port GUID: 0x0000000000000000
>         Link layer: Ethernet
> CA 'mlx5_2'
>     CA type: MT4115
>     Number of ports: 1
>     Firmware version: 12.17.2020
>     Hardware version: 0
>     Node GUID: 0x248a0703005abb30
>     System image GUID: 0x248a0703005abb30
>     Port 1:
>         State: Down
>         Physical state: Disabled
>         Rate: 100
>         Base lid: 0
>         LMC: 0
>         SM lid: 0
>         Capability mask: 0x3c010000
>         Port GUID: 0x268a07fffe5abb30
>         Link layer: Ethernet
> CA 'mlx5_3'
>     CA type: MT4115
>     Number of ports: 1
>     Firmware version: 12.17.2020
>     Hardware version: 0
>     Node GUID: 0x248a0703005abb31
>     System image GUID: 0x248a0703005abb30
>     Port 1:
>         State: Down
>         Physical state: Disabled
>         Rate: 100
>         Base lid: 0
>         LMC: 0
>         SM lid: 0
>         Capability mask: 0x3c010000
>         Port GUID: 0x268a07fffe5abb31
>         Link layer: Ethernet
> -bash-4.1$
> %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
>
> On Fri, Jul 14, 2017 at 12:37 AM, John Hearns via users
> <users@lists.open-mpi.org> wrote:
>>
>> Boris, as Gilles says - first do some lower-level checks of your
>> InfiniBand network.
>> I suggest running:
>>     ibdiagnet
>>     ibhosts
>> and then, as Gilles says, 'ibstat' on each node.
>>
>> On 14 July 2017 at 03:58, Gilles Gouaillardet <gil...@rist.or.jp> wrote:
>>>
>>> Boris,
>>>
>>> Open MPI should automatically detect the InfiniBand hardware, and use
>>> openib (and *not* tcp) for inter-node communications,
>>> and a shared-memory-optimized btl (e.g. sm or vader) for intra-node
>>> communications.
>>>
>>> Note: if you use "-mca btl openib,self", you tell Open MPI to use the
>>> openib btl between any tasks, including tasks running on the same node
>>> (which is less efficient than using sm or vader).
>>>
>>> At first, I suggest you make sure InfiniBand is up and running on all
>>> your nodes. (Just run ibstat; at least one port should be listed, its
>>> state should be Active, and all nodes should have the same SM lid.)
>>>
>>> Then try to run two tasks on two nodes.
>>>
>>> If this does not work, you can
>>>     mpirun --mca btl_base_verbose 100 ...
>>> and post the logs so we can investigate from there.
>>>
>>> Cheers,
>>>
>>> Gilles
>>>
>>> On 7/14/2017 6:43 AM, Boris M. Vulovic wrote:
>>>>
>>>> I would like to know how to invoke InfiniBand hardware on a CentOS 6.x
>>>> cluster with Open MPI (static libs.) for running my C++ code.
>>>> This is how I compile and run:
>>>>
>>>> /usr/local/open-mpi/1.10.7/bin/mpic++ -L/usr/local/open-mpi/1.10.7/lib
>>>> -Bstatic main.cpp -o DoWork
>>>>
>>>> /usr/local/open-mpi/1.10.7/bin/mpiexec -mca btl tcp,self --hostfile
>>>> hostfile5 -host node01,node02,node03,node04,node05 -n 200 DoWork
>>>>
>>>> Here, "*-mca btl tcp,self*" reveals that *TCP* is used, and the cluster
>>>> has InfiniBand.
>>>>
>>>> What should be changed in the compile and run commands for InfiniBand
>>>> to be invoked? If I just replace "*-mca btl tcp,self*" with "*-mca btl
>>>> openib,self*" then I get plenty of errors, with the relevant one saying:
>>>>
>>>> /At least one pair of MPI processes are unable to reach each other for
>>>> MPI communications. This means that no Open MPI device has indicated
>>>> that it can be used to communicate between these processes. This is an
>>>> error; Open MPI requires that all MPI processes be able to reach each
>>>> other. This error can sometimes be the result of forgetting to specify
>>>> the "self" BTL./
>>>>
>>>> Thanks very much!!!
>>>>
>>>> *Boris*
>>>>
>>>> _______________________________________________
>>>> users mailing list
>>>> users@lists.open-mpi.org
>>>> https://rfd.newmexicoconsortium.org/mailman/listinfo/users
>
> --
> Boris M. Vulovic
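The single-port invocation Gilles suggests can be sketched as a small wrapper. This is a hypothetical illustration, not a command from the thread: the function `build_cmd` is invented, while the flag `--mca btl_openib_if_include`, the HCA name `mlx5_0`, the hostfile `hostfile5`, and the executable `DoWork` all come from the messages above. Note it deliberately passes no `-mca btl` list, so Open MPI can pick openib for inter-node and sm/vader for intra-node traffic.

```shell
# Hypothetical wrapper: compose an mpirun command that leaves BTL selection
# to Open MPI but restricts the openib BTL to a single HCA.
build_cmd() {
    hca=$1; hf=$2; np=$3; exe=$4
    echo "mpirun --mca btl_openib_if_include $hca --hostfile $hf -n $np $exe"
}

build_cmd mlx5_0 hostfile5 200 ./DoWork
# → mpirun --mca btl_openib_if_include mlx5_0 --hostfile hostfile5 -n 200 ./DoWork
```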