Re: [OMPI users] MPI_Allreduce hangs

Martin Siegert Wed, 27 Jun 2012 14:25:49 -0400

Hi Jeff,

On Wed, Jun 20, 2012 at 04:16:12PM -0400, Jeff Squyres wrote:
> On Jun 20, 2012, at 3:36 PM, Martin Siegert wrote:
> 
> > by now we know of three programs - dirac, wrf, quantum espresso - that
> > all hang with openmpi-1.4.x (have not yet checked with openmpi-1.6).
> > All of these programs run to completion with the mpiexec commandline
> > argument: --mca btl_openib_flags 305
> > We now set this in the global configuration file openmpi-mca-params.conf.
> > What is the reason that this is not the default in the first place?
> > Are there any negative effects?
> 
> Two things:
> 
> 1. These flags -- 305 (or 0x131 or 0001 0011 0001) translate to telling the 
> openib BTL the following:
> 
> - 1: SEND: meaning that the openib BTL is using send/receive semantics
> - 16: ACK: meaningless with the ob1 PML
> - 32: CHECKSUM: meaningless with the ob1 PML
> - 256: meaningless
> 
> What's meaningful here is what is missing: RDMA PUT and GET.  So all RDMA 
> support is disabled.
> 
> This will work fine, but you may want to increase your 
> btl_openib_eager_limit size (e.g., U. Michigan did the same thing as you 
> -- disabled RDMA -- but increased the eager limit to 64k to get back some of 
> the lost performance).
> 
> 2. We believe that we have *finally* (just recently) fixed this issue in the 
> SVN trunk and upcoming 1.6.1 release.  I have a test pre-release 1.6.1 
> tarball -- would you mind giving it a whirl?
> 
> http://www.open-mpi.org/~jsquyres/unofficial/openmpi-1.6.1ticket3131r26612M.tar.bz2
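
For reference, the decomposition of 305 described above can be checked with a
few lines of Python. This is only a sketch: the names for bits 1, 16, 32 and
256 come from the explanation quoted above, while the PUT = 2 and GET = 4
values are an assumption (they are not stated in this thread), and the helper
name set_bits is made up for illustration.

```python
# Bit values named in the quoted explanation of btl_openib_flags.
NAMED_IN_THREAD = {1: "SEND", 16: "ACK", 32: "CHECKSUM", 256: "(meaningless)"}

# ASSUMPTION: the RDMA bits are PUT = 2 and GET = 4; the thread only says
# that they are missing from 305, not what their values are.
ASSUMED_RDMA = {2: "PUT", 4: "GET"}


def set_bits(flags):
    """Return the individual power-of-two values that are set in flags."""
    return [1 << i for i in range(flags.bit_length()) if flags & (1 << i)]


if __name__ == "__main__":
    flags = 305
    print(hex(flags), format(flags, "012b"))
    for bit in set_bits(flags):
        print(bit, NAMED_IN_THREAD.get(bit, "unknown"))
    for bit, name in ASSUMED_RDMA.items():
        state = "set" if flags & bit else "not set"
        print(name, state)
```

Under that assumption, neither RDMA bit is set in 305, which matches the
statement that all RDMA support is disabled.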


Thanks! I tried this and, indeed, the program (I tested quantum espresso,
pw.x, so far) no longer hangs.

Then I went one step further and benchmarked the following three cases:

1) pw.x compiled with openmpi-1.3.3
2) pw.x compiled with openmpi-1.4.3 and
   btl_openib_flags = 305
   btl_openib_eager_limit = 65536
   in etc/openmpi-mca-params.conf
3) pw.x compiled with openmpi-1.6.1ticket3131r26612M

These are the results, as time in seconds per iteration (smaller is better):
1) 33.11
2) 28.23
3) 34.81

That's rather disappointing, isn't it?
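
To put numbers on the differences, here is a small Python sketch that computes
the relative change between the three timings above. The dictionary keys and
the helper name relative_change are just labels chosen for illustration; the
timings are the ones listed above.

```python
# Time per iteration in seconds, as listed above (smaller is better).
timings = {
    "1.3.3": 33.11,
    "1.4.3 (flags=305, eager_limit=65536)": 28.23,
    "1.6.1 pre-release": 34.81,
}


def relative_change(value, baseline):
    """Percentage change of value relative to baseline (negative = faster)."""
    return (value - baseline) / baseline * 100.0


if __name__ == "__main__":
    baseline = timings["1.3.3"]
    for label, seconds in timings.items():
        print(f"{label}: {relative_change(seconds, baseline):+.1f}% vs 1.3.3")
    slowdown = relative_change(
        timings["1.6.1 pre-release"],
        timings["1.4.3 (flags=305, eager_limit=65536)"],
    )
    print(f"1.6.1 pre-release vs 1.4.3 with RDMA disabled: {slowdown:+.1f}%")
```

In other words, the 1.4.3 run with RDMA disabled is roughly 15% faster than
1.3.3, while the 1.6.1 pre-release is roughly 5% slower than 1.3.3 and roughly
23% slower than the 1.4.3 run.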

Cheers,
Martin
