Gus, Gilles and John,

Thanks for the help. Let me first post (below) the output from checkouts
of the IB network:

ibdiagnet
ibhosts
ibstat (for the login node, for now)
What do you think?

Thanks,
--Boris

%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%

-bash-4.1$ *ibdiagnet*
----------
Load Plugins from:
/usr/share/ibdiagnet2.1.1/plugins/
(You can specify more paths to be looked in with "IBDIAGNET_PLUGINS_PATH" env variable)

Plugin Name                            Result     Comment
libibdiagnet_cable_diag_plugin-2.1.1   Succeeded  Plugin loaded
libibdiagnet_phy_diag_plugin-2.1.1     Succeeded  Plugin loaded

---------------------------------------------
Discovery
-E- Failed to initialize
-E- Fabric Discover failed, err=IBDiag initialize wasn't done
-E- Fabric Discover failed, MAD err=Failed to register SMI class

---------------------------------------------
Summary
-I- Stage                     Warnings   Errors     Comment
-I- Discovery                                       NA
-I- Lids Check                                      NA
-I- Links Check                                     NA
-I- Subnet Manager                                  NA
-I- Port Counters                                   NA
-I- Nodes Information                               NA
-I- Speed / Width checks                            NA
-I- Partition Keys                                  NA
-I- Alias GUIDs                                     NA
-I- Temperature Sensing                             NA

-I- You can find detailed errors/warnings in: /var/tmp/ibdiagnet2/ibdiagnet2.log

-E- A fatal error occurred, exiting...
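Since ibdiagnet points at its log file for the details, a quick way to pull out only the error lines is a sketch like this (the log path is the one ibdiagnet printed above; the `--` is needed so grep does not treat `-E-` as an option):

```shell
# List only the error (-E-) lines from the ibdiagnet log named above
grep -n -- '-E-' /var/tmp/ibdiagnet2/ibdiagnet2.log
```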
-bash-4.1$
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%

-bash-4.1$ *ibhosts*
ibwarn: [168221] mad_rpc_open_port: client_register for mgmt 1 failed
src/ibnetdisc.c:766; can't open MAD port ((null):0)
/usr/sbin/ibnetdiscover: iberror: failed: discover failed
-bash-4.1$
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%

-bash-4.1$ *ibstat*
CA 'mlx5_0'
        CA type: MT4115
        Number of ports: 1
        Firmware version: 12.17.2020
        Hardware version: 0
        Node GUID: 0x248a0703005abb1c
        System image GUID: 0x248a0703005abb1c
        Port 1:
                State: Active
                Physical state: LinkUp
                Rate: 100
                Base lid: 0
                LMC: 0
                SM lid: 0
                Capability mask: 0x3c010000
                Port GUID: 0x268a07fffe5abb1c
                Link layer: Ethernet
CA 'mlx5_1'
        CA type: MT4115
        Number of ports: 1
        Firmware version: 12.17.2020
        Hardware version: 0
        Node GUID: 0x248a0703005abb1d
        System image GUID: 0x248a0703005abb1c
        Port 1:
                State: Active
                Physical state: LinkUp
                Rate: 100
                Base lid: 0
                LMC: 0
                SM lid: 0
                Capability mask: 0x3c010000
                Port GUID: 0x0000000000000000
                Link layer: Ethernet
CA 'mlx5_2'
        CA type: MT4115
        Number of ports: 1
        Firmware version: 12.17.2020
        Hardware version: 0
        Node GUID: 0x248a0703005abb30
        System image GUID: 0x248a0703005abb30
        Port 1:
                State: Down
                Physical state: Disabled
                Rate: 100
                Base lid: 0
                LMC: 0
                SM lid: 0
                Capability mask: 0x3c010000
                Port GUID: 0x268a07fffe5abb30
                Link layer: Ethernet
CA 'mlx5_3'
        CA type: MT4115
        Number of ports: 1
        Firmware version: 12.17.2020
        Hardware version: 0
        Node GUID: 0x248a0703005abb31
        System image GUID: 0x248a0703005abb30
        Port 1:
                State: Down
                Physical state: Disabled
                Rate: 100
                Base lid: 0
                LMC: 0
                SM lid: 0
                Capability mask: 0x3c010000
                Port GUID: 0x268a07fffe5abb31
                Link layer: Ethernet
-bash-4.1$
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%

On Fri, Jul 14, 2017 at 12:37 AM, John Hearns via users
<users@lists.open-mpi.org> wrote:

> Boris, as Gilles says - first do some lower level
> checkouts of your Infiniband network.
> I suggest running:
> ibdiagnet
> ibhosts
> and then, as Gilles says, 'ibstat' on each node
>
> On 14 July 2017 at 03:58, Gilles Gouaillardet <gil...@rist.or.jp> wrote:
>
>> Boris,
>>
>> Open MPI should automatically detect the infiniband hardware, and use
>> openib (and *not* tcp) for inter node communications,
>> and a shared memory optimized btl (e.g. sm or vader) for intra node
>> communications.
>>
>> Note: if you use "-mca btl openib,self", you tell Open MPI to use the
>> openib btl between any tasks, including tasks running on the same node
>> (which is less efficient than using sm or vader).
>>
>> At first, I suggest you make sure infiniband is up and running on all
>> your nodes. (Just run ibstat: at least one port should be listed, its
>> state should be Active, and all nodes should have the same SM lid.)
>>
>> Then try to run two tasks on two nodes.
>>
>> If this does not work, you can
>>
>> mpirun --mca btl_base_verbose 100 ...
>>
>> and post the logs so we can investigate from there.
>>
>> Cheers,
>>
>> Gilles
>>
>> On 7/14/2017 6:43 AM, Boris M. Vulovic wrote:
>>>
>>> I would like to know how to invoke InfiniBand hardware on a CentOS 6.x
>>> cluster with Open MPI (static libs.) for running my C++ code. This is
>>> how I compile and run:
>>>
>>> /usr/local/open-mpi/1.10.7/bin/mpic++ -L/usr/local/open-mpi/1.10.7/lib
>>> -Bstatic main.cpp -o DoWork
>>>
>>> /usr/local/open-mpi/1.10.7/bin/mpiexec -mca btl tcp,self --hostfile
>>> hostfile5 -host node01,node02,node03,node04,node05 -n 200 DoWork
>>>
>>> Here, "*-mca btl tcp,self*" reveals that *TCP* is used, and the
>>> cluster has InfiniBand.
>>>
>>> What should be changed in the compiling and running commands for
>>> InfiniBand to be invoked?
>>> If I just replace "*-mca btl tcp,self*" with "*-mca btl openib,self*"
>>> then I get plenty of errors, with the relevant one saying:
>>>
>>> /At least one pair of MPI processes are unable to reach each other for
>>> MPI communications. This means that no Open MPI device has indicated
>>> that it can be used to communicate between these processes. This is an
>>> error; Open MPI requires that all MPI processes be able to reach each
>>> other. This error can sometimes be the result of forgetting to specify
>>> the "self" BTL./
>>>
>>> Thanks very much!!!
>>>
>>> *Boris*
>>>
>>> _______________________________________________
>>> users mailing list
>>> users@lists.open-mpi.org
>>> https://rfd.newmexicoconsortium.org/mailman/listinfo/users

--
*Boris M. Vulovic*
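For reference, Gilles's verbose-run suggestion applied to the commands from the original post would look roughly like this. This is a sketch, not a verified fix: the install path, hostfile name, and executable name are reused from the post above, while the btl list "openib,vader,self" is illustrative (Gilles suggested vader or sm for intra-node traffic alongside openib):

```shell
# Sketch: two tasks on the IB fabric with full BTL selection logging,
# capturing the output so it can be posted to the list.
/usr/local/open-mpi/1.10.7/bin/mpiexec \
    --mca btl openib,vader,self \
    --mca btl_base_verbose 100 \
    --hostfile hostfile5 -n 2 DoWork 2>&1 | tee btl_verbose.log
```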
_______________________________________________
users mailing list
users@lists.open-mpi.org
https://rfd.newmexicoconsortium.org/mailman/listinfo/users