Short version:
--------------

What you really want is:
  mpirun --mca pml ob1 ...

The "--mca mtl ^psm" way will get the same result, but forcing pml=ob1
is really a slightly better solution (from a semantic perspective).

More detail:
------------

Specifically, there are actually 3 different PMLs (PML = point-to-point
message layer -- it's the layer that effects MPI point-to-point
semantics, and drives an underlying transport layer).  Here's a section
from the README:

- There are three MPI network models available: "ob1", "csum", and
  "cm".  "ob1" and "csum" use BTL ("Byte Transfer Layer") components
  for each supported network.  "cm" uses MTL ("Matching Transport
  Layer") components for each supported network.

  - "ob1" supports a variety of networks that can be used in
    combination with each other (per OS constraints; e.g., there are
    reports that the GM and OpenFabrics kernel drivers do not operate
    well together):

    - OpenFabrics: InfiniBand, iWARP, and RoCE
    - Loopback (send-to-self)
    - Myrinet MX and Open-MX
    - Portals
    - Quadrics Elan
    - Shared memory
    - TCP
    - SCTP
    - uDAPL
    - Windows Verbs

  - "csum" is exactly the same as "ob1", except that it performs
    additional data integrity checks to ensure that the received data
    is intact (vs. trusting the underlying network to deliver the data
    correctly).  csum supports all the same networks as ob1, but there
    is a performance penalty for the additional integrity checks.

  - "cm" supports a smaller number of networks (and they cannot be
    used together), but may provide better overall MPI performance:

    - Myrinet MX and Open-MX
    - InfiniPath PSM
    - Mellanox MXM
    - Portals

  Open MPI will, by default, choose to use "cm" when the InfiniPath
  PSM or the Mellanox MXM MTL can be used.  Otherwise, "ob1" will be
  used and the corresponding BTLs will be selected.  "csum" will never
  be selected by default.

  Users can force the use of ob1 or cm if desired by setting the
  "pml" MCA parameter at run-time:

    shell$ mpirun --mca pml ob1 ...
  or
    shell$ mpirun --mca pml csum ...
  or
    shell$ mpirun --mca pml cm ...
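To see what the selection logic above actually has to choose from on a
given machine, you can ask ompi_info which PML, MTL, and BTL components
your build contains.  A sketch (the component list shown by ompi_info
will vary with how your copy of Open MPI was configured, and the BTL
names in the mpirun line below -- openib, sm, self -- are examples, not
a recommendation):

```shell
# List the point-to-point components compiled into this Open MPI build.
shell$ ompi_info | grep "MCA pml"
shell$ ompi_info | grep "MCA mtl"
shell$ ompi_info | grep "MCA btl"

# Example: force the ob1 PML and name the BTLs it may use explicitly.
shell$ mpirun --mca pml ob1 --mca btl openib,sm,self ...
```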
This means that: if you force ob1 (or csum), then only BTLs will be
used.  If you force cm, then only MTLs will be used.  If you don't
specify which PML to use, then OMPI will prefer cm/MTLs (if it finds
any available MTLs) over ob1/BTLs.


On Oct 15, 2013, at 12:38 PM, Kevin M. Hildebrand <ke...@umd.edu> wrote:

> Ahhh, that's the piece I was missing.  I've been trying to debug
> everything I could think of related to 'btl', and was completely
> unaware that 'mtl' was also a transport.
>
> If I run a job using --mca mtl ^psm, it does indeed run properly
> across all of my nodes.  (Whether or not that's the 'right' thing to
> do is yet to be determined.)
>
> Thanks for your help!
>
> Kevin
>
>
> -----Original Message-----
> From: users [mailto:users-boun...@open-mpi.org] On Behalf Of Dave Love
> Sent: Tuesday, October 15, 2013 10:16 AM
> To: Open MPI Users
> Subject: Re: [OMPI users] Need help running jobs across different IB vendors
>
> "Kevin M. Hildebrand" <ke...@umd.edu> writes:
>
>> Hi, I'm trying to run an OpenMPI 1.6.5 job across a set of nodes,
>> some with Mellanox cards and some with Qlogic cards.
>
> Maybe you shouldn't...  (I'm blessed in one cluster with three
> somewhat incompatible types of QLogic card and a set of Mellanox
> ones, but they're in separate islands, apart from the two different
> SDR ones.)
>
>> I'm getting errors indicating "At least one pair of MPI processes
>> are unable to reach each other for MPI communications".  As far as I
>> can tell all of the nodes are properly configured and able to reach
>> each other, via IP and non-IP connections.
>> I've also discovered that even if I turn off the IB transport via
>> "--mca btl tcp,self" I'm still getting the same issue.
>> The test works fine if I run it confined to hosts with identical IB
>> cards.
>> I'd appreciate some assistance in figuring out what I'm doing wrong.
>
> I assume the QLogic cards are using PSM.
> You'd need to force them to use openib with something like --mca mtl
> ^psm and make sure they have the ipathverbs library available.  You
> probably won't like the resulting performance -- users here noticed
> when one set fell back to openib from psm recently.
> _______________________________________________
> users mailing list
> us...@open-mpi.org
> http://www.open-mpi.org/mailman/listinfo.cgi/users

--
Jeff Squyres
jsquy...@cisco.com
For corporate legal information go to:
http://www.cisco.com/web/about/doing_business/legal/cri/