Re: [OMPI users] Q: Basic invoking of InfiniBand with OpenMPI
Gus, Gilles and John,

Thanks for the help. Let me first post (below) the output from checks of the IB network:
  ibdiagnet
  ibhosts
  ibstat   (for the login node, for now)

What do you think?

Thanks
--Boris

-bash-4.1$ *ibdiagnet*
-- Load Plugins from: /usr/share/ibdiagnet2.1.1/plugins/
(You can specify more paths to be looked in with "IBDIAGNET_PLUGINS_PATH" env variable)
Plugin Name                              Result     Comment
libibdiagnet_cable_diag_plugin-2.1.1     Succeeded  Plugin loaded
libibdiagnet_phy_diag_plugin-2.1.1       Succeeded  Plugin loaded

- Discovery
-E- Failed to initialize
-E- Fabric Discover failed, err=IBDiag initialize wasn't done
-E- Fabric Discover failed, MAD err=Failed to register SMI class

- Summary
-I- Stage                    Warnings   Errors   Comment
-I- Discovery                                    NA
-I- Lids Check                                   NA
-I- Links Check                                  NA
-I- Subnet Manager                               NA
-I- Port Counters                                NA
-I- Nodes Information                            NA
-I- Speed / Width checks                         NA
-I- Partition Keys                               NA
-I- Alias GUIDs                                  NA
-I- Temperature Sensing                          NA

-I- You can find detailed errors/warnings in: /var/tmp/ibdiagnet2/ibdiagnet2.log
-E- A fatal error occurred, exiting...

-bash-4.1$ *ibhosts*
ibwarn: [168221] mad_rpc_open_port: client_register for mgmt 1 failed
src/ibnetdisc.c:766; can't open MAD port ((null):0)
/usr/sbin/ibnetdiscover: iberror: failed: discover failed

-bash-4.1$ *ibstat*
CA 'mlx5_0'
    CA type: MT4115
    Number of ports: 1
    Firmware version: 12.17.2020
    Hardware version: 0
    Node GUID: 0x248a0703005abb1c
    System image GUID: 0x248a0703005abb1c
    Port 1:
        State: Active
        Physical state: LinkUp
        Rate: 100
        Base lid: 0
        LMC: 0
        SM lid: 0
        Capability mask: 0x3c01
        Port GUID: 0x268a07fffe5abb1c
        Link layer: Ethernet
CA 'mlx5_1'
    CA type: MT4115
    Number of ports: 1
    Firmware version: 12.17.2020
    Hardware version: 0
    Node GUID: 0x248a0703005abb1d
    System image GUID: 0x248a0703005abb1c
    Port 1:
        State: Active
        Physical state: LinkUp
        Rate: 100
        Base lid: 0
        LMC: 0
        SM lid: 0
        Capability mask: 0x3c01
        Port GUID: 0x
        Link layer: Ethernet
CA 'mlx5_2'
    CA type: MT4115
    Number of ports: 1
    Firmware version: 12.17.2020
    Hardware version: 0
    Node GUID: 0x248a0703005abb30
    System image GUID: 0x248a0703005abb30
    Port 1:
        State: Down
        Physical state: Disabled
        Rate: 100
        Base lid: 0
        LMC: 0
        SM lid: 0
        Capability mask: 0x3c01
        Port GUID: 0x268a07fffe5abb30
        Link layer: Ethernet
CA 'mlx5_3'
    CA type: MT4115
    Number of ports: 1
    Firmware version: 12.17.2020
    Hardware version: 0
    Node GUID: 0x248a0703005abb31
    System image GUID: 0x248a0703005abb30
    Port 1:
        State: Down
        Physical state: Disabled
        Rate: 100
        Base lid: 0
        LMC: 0
        SM lid: 0
        Capability mask: 0x3c01
        Port GUID: 0x268a07fffe5abb31
        Link layer: Ethernet
-bash-4.1$


On Fri, Jul 14, 2017 at 12:37 AM, John Hearns via users <users@lists.open-mpi.org> wrote:

> Boris, as Gilles says - first do some lower-level checks of your
> InfiniBand network.
> I suggest running:
>   ibdiagnet
>   ibhosts
> and then, as Gilles says, 'ibstat' on each node
>
> On 14 July 2017 at 03:58, Gilles Gouaillardet wrote:
>
>> Boris,
>>
>> Open MPI should automatically detect the InfiniBand hardware, and use
>> openib (and *not* tcp) for inter-node communications
>> and a shared-memory-optimized btl (e.g. sm or vader) for intra-node
>> communications.
>>
>> note if you "-mca btl openib,self", you tell Open MPI to use the openib
>> btl between any tasks, [...]
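A loop like the sketch below would gather the fields Gilles asked about (port State and SM lid, plus the Link layer shown above) from every compute node rather than just the login node. It is only a sketch: it assumes passwordless ssh and reuses the node01-node05 hostnames from the original mpiexec command, so adjust the list to the real cluster.

  # hedged sketch: collect ibstat state, SM lid and link layer from each node
  # (node names assumed from the mpiexec example; replace with the real ones)
  for h in node01 node02 node03 node04 node05; do
      echo "== $h =="
      ssh "$h" ibstat | grep -E "CA '|State:|SM lid:|Link layer:"
  done

Per Gilles's checklist, at least one port per node should be Active and all nodes should report the same SM lid.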
Re: [OMPI users] Q: Basic invoking of InfiniBand with OpenMPI
Boris, as Gilles says - first do some lower-level checks of your InfiniBand network.
I suggest running:
  ibdiagnet
  ibhosts
and then, as Gilles says, 'ibstat' on each node.


On 14 July 2017 at 03:58, Gilles Gouaillardet wrote:

> Boris,
>
> Open MPI should automatically detect the InfiniBand hardware, and use
> openib (and *not* tcp) for inter-node communications
> and a shared-memory-optimized btl (e.g. sm or vader) for intra-node
> communications.
>
> Note if you "-mca btl openib,self", you tell Open MPI to use the openib
> btl between any tasks, including tasks running on the same node (which is
> less efficient than using sm or vader).
>
> At first, I suggest you make sure InfiniBand is up and running on all your
> nodes. (Just run ibstat; at least one port should be listed, state should
> be Active, and all nodes should have the same SM lid.)
>
> Then try to run two tasks on two nodes.
>
> If this does not work, you can
>   mpirun --mca btl_base_verbose 100 ...
> and post the logs so we can investigate from there.
>
> Cheers,
>
> Gilles
>
>
> On 7/14/2017 6:43 AM, Boris M. Vulovic wrote:
>>
>> I would like to know how to invoke InfiniBand hardware on a CentOS 6.x
>> cluster with OpenMPI (static libs.) for running my C++ code. This is how
>> I compile and run:
>>
>> /usr/local/open-mpi/1.10.7/bin/mpic++ -L/usr/local/open-mpi/1.10.7/lib
>> -Bstatic main.cpp -o DoWork
>>
>> /usr/local/open-mpi/1.10.7/bin/mpiexec -mca btl tcp,self --hostfile
>> hostfile5 -host node01,node02,node03,node04,node05 -n 200 DoWork
>>
>> Here, "-mca btl tcp,self" reveals that TCP is used, and the cluster
>> has InfiniBand.
>>
>> What should be changed in the compile and run commands for InfiniBand
>> to be invoked? If I just replace "-mca btl tcp,self" with
>> "-mca btl openib,self" then I get plenty of errors, with the relevant
>> one saying:
>>
>> "At least one pair of MPI processes are unable to reach each other for
>> MPI communications. This means that no Open MPI device has indicated
>> that it can be used to communicate between these processes. This is an
>> error; Open MPI requires that all MPI processes be able to reach each
>> other. This error can sometimes be the result of forgetting to specify
>> the 'self' BTL."
>>
>> Thanks very much!!!
>>
>> Boris
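Putting Gilles's suggestions together, the run command might look like one of the sketches below. These are assumptions drawn from the thread, not a verified recipe: the install path, hostfile and process count are copied from Boris's original command, and vader/sm are the shared-memory BTLs Gilles mentioned.

  # 1) let Open MPI pick the BTLs itself (openib inter-node, sm/vader intra-node)
  /usr/local/open-mpi/1.10.7/bin/mpiexec --hostfile hostfile5 \
      -host node01,node02,node03,node04,node05 -n 200 DoWork

  # 2) or name the BTLs explicitly, keeping a shared-memory BTL for intra-node traffic
  /usr/local/open-mpi/1.10.7/bin/mpiexec -mca btl openib,vader,self \
      --hostfile hostfile5 -host node01,node02,node03,node04,node05 -n 200 DoWork

  # 3) if that still fails, add the verbose BTL logging Gilles asked for and post the output
  /usr/local/open-mpi/1.10.7/bin/mpiexec -mca btl openib,vader,self -mca btl_base_verbose 100 \
      --hostfile hostfile5 -host node01,node02,node03,node04,node05 -n 200 DoWork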