[OMPI devel] Seeing message failures in OpenMPI 4.0.1 on UCX

2019-04-16 Thread Dave Turner
After installing UCX 1.5.0 and OpenMPI 4.0.1 compiled for UCX and without verbs
(full details below), my NetPIPE benchmark is reporting message failures for some
message sizes above 300 KB.  There are no failures when I benchmark with a non-UCX
(verbs) build of OpenMPI 4.0.1, and no failures when I test the UCX build with
--mca btl tcp,self.  The failures show up in tests over both QDR IB and 40 GbE
networks.

NetPIPE always checks the first and last bytes of each message, but it can also do
a full integrity test (--integrity) that checks every byte; that test shows that
no message is being received at all in the failing cases.
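
To show what that check amounts to, here is a minimal sketch in C (not NetPIPE
code; the 384000-byte size, 3-message count, and fill pattern are illustrative)
that sends patterned messages from rank 0 to rank 1 and verifies every byte on
the receiving side:

/* Minimal sketch (not NetPIPE code) of the kind of every-byte check described
 * above: rank 0 sends patterned buffers to rank 1, which verifies each byte.
 * The 384000-byte size and 3-message count are illustrative only.
 * Run with -np 2 across two hosts. */
#include <mpi.h>
#include <stdio.h>
#include <stdlib.h>

int main(int argc, char **argv)
{
    const int nbytes = 384000;   /* above the ~300 KB point where failures start */
    const int nmsgs  = 3;        /* mirrors the 3-message minimal case */
    int rank, i, j, nbad;
    char *buf = malloc(nbytes);

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    for (i = 0; i < nmsgs; i++) {
        if (rank == 0) {
            for (j = 0; j < nbytes; j++) buf[j] = (char)(i + 1);
            MPI_Send(buf, nbytes, MPI_CHAR, 1, i, MPI_COMM_WORLD);
        } else if (rank == 1) {
            for (j = 0; j < nbytes; j++) buf[j] = 0;   /* clear before receiving */
            MPI_Recv(buf, nbytes, MPI_CHAR, 0, i, MPI_COMM_WORLD,
                     MPI_STATUS_IGNORE);
            for (nbad = 0, j = 0; j < nbytes; j++)
                if (buf[j] != (char)(i + 1)) nbad++;
            printf("message %d: %d bad bytes of %d\n", i, nbad, nbytes);
        }
    }

    free(buf);
    MPI_Finalize();
    return 0;
}

Built with the same mpicc line shown below and launched like the NetPIPE runs,
an unreceived message would show up as a full buffer of bad bytes.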

Details on the system and software installation are below, followed by several
NetPIPE runs illustrating the errors, including a minimal case of 3 ping-pong
messages where the middle one shows failures.  Let me know if there is any more
information you need or any additional tests I can run.

 Dave Turner



CentOS 7 on Intel processors, QDR IB and 40 GbE tests

UCX 1.5.0 installed from the tarball according to the docs on the webpage

OpenMPI-4.0.1 configured for verbs with:

./configure F77=ifort FC=ifort \
    --prefix=/homes/daveturner/libs/openmpi-4.0.1-verbs \
    --enable-mpirun-prefix-by-default --enable-mpi-fortran=all --enable-mpi-cxx \
    --enable-ipv6 --with-verbs --with-slurm --disable-dlopen

OpenMPI-4.0.1 configured for UCX  with:

./configure F77=ifort FC=ifort \
    --prefix=/homes/daveturner/libs/openmpi-4.0.1-ucx \
    --enable-mpirun-prefix-by-default --enable-mpi-fortran=all --enable-mpi-cxx \
    --enable-ipv6 --without-verbs --with-slurm --disable-dlopen \
    --with-ucx=/homes/daveturner/libs/ucx-1.5.0/install

NetPIPE compiled with:

/homes/daveturner/libs/openmpi-4.0.1-ucx/bin/mpicc -g -O3 -Wall -lrt -DMPI \
    ./src/netpipe.c ./src/mpi.c -o NPmpi-4.0.1-ucx -I./src

(http://netpipe.cs.ksu.edu/ compiled with 'make mpi')



**
A normal uni-directional point-to-point test (checking only the first and last
bytes) shows errors for messages over 300 KB.
**

Elf77 /homes/daveturner/libs/openmpi-4.0.1-ucx/bin/mpirun -np 2 --hostfile hf.elf \
      NPmpi-4.0.1-ucx -o np.elf.mpi-4.0.1-ucx-ib --printhostnames
Saving output to np.elf.mpi-4.0.1-ucx-ib

Proc 0 is on host elf77

Proc 1 is on host elf78

  Clock resolution ~   1.000 nsecs  Clock accuracy ~  38.000 nsecs

Start testing with 7 trials for each message size
  1:       1 B    24999 times -->     3.766 Mbps in   2.124 usecs
  2:       2 B   117702 times -->     8.386 Mbps in   1.908 usecs
  3:       3 B   131032 times -->    12.633 Mbps in   1.900 usecs
  4:       4 B   131592 times -->    16.715 Mbps in   1.914 usecs
  5:       6 B   130589 times -->    25.077 Mbps in   1.914 usecs
  6:       8 B   130608 times -->    33.402 Mbps in   1.916 usecs
  7:      12 B   130477 times -->    50.047 Mbps in   1.918 usecs
  8:      13 B   130329 times -->    54.872 Mbps in   1.895 usecs
  9:      16 B   131904 times -->    67.187 Mbps in   1.905 usecs
 10:      19 B   131225 times -->    79.255 Mbps in   1.918 usecs
 11:      21 B   130353 times -->    87.118 Mbps in   1.928 usecs
 12:      24 B   129640 times -->    99.831 Mbps in   1.923 usecs
 13:      27 B   129988 times -->   111.760 Mbps in   1.933 usecs
 14:      29 B   129351 times -->   121.048 Mbps in   1.917 usecs
 15:      32 B   130439 times -->   132.620 Mbps in   1.930 usecs
 16:      35 B   129511 times -->   144.272 Mbps in   1.941 usecs
 17:      45 B   128814 times -->   182.881 Mbps in   1.968 usecs
 18:      48 B   127000 times -->   194.231 Mbps in   1.977 usecs
 19:      51 B   126452 times -->   206.193 Mbps in   1.979 usecs
 20:      61 B   126343 times -->   236.168 Mbps in   2.066 usecs
 21:      64 B   120987 times -->   244.690 Mbps in   2.092 usecs
 22:      67 B   119477 times -->   256.660 Mbps in   2.088 usecs
 23:      93 B   119710 times -->   242.428 Mbps in   3.069 usecs
 24:      96 B    81460 times -->   250.503 Mbps in   3.066 usecs
 25:      99 B    81543 times -->   258.376 Mbps in   3.065 usecs
 26:     125 B    81558 times -->   321.127 Mbps in   3.114 usecs
 27:     128 B    80281 times -->   328.788 Mbps in   3.114 usecs
 28:     131 B    80270 times -->   336.387 Mbps in   3.115 usecs
 29:     189 B    80244 times -->   474.304 Mbps in   3.188 usecs
 30:     192 B    78423 times -->   482.258 Mbps in   3.185 usecs
 31:     195 B    78492 times -->   489.635 Mbps in   3.186 usecs
 32:     253 B    78467 times -->   623.891 Mbps in   3.244 usecs
 33:     256 B    77061 times -->   631.098 Mbps in   3.245 usecs
 34:     259 B    77038 times -->   637.905 Mbps in   3.248 usecs
 35:     381 B    76967 times -->   906.297 Mbps in   3.363 usecs
 36:     384 B    74335 times -->   913.387 Mbps in   3.363 usecs
 37: 

Re: [OMPI devel] MPI Reduce Without a Barrier

2019-04-16 Thread Saliya Ekanayake
Thank you, Nathan. This makes more sense now.

On Tue, Apr 16, 2019 at 6:48 AM Nathan Hjelm  wrote:

> What Ralph said. You just blow memory on a queue that is not recovered in
> the current implementation.
>
> Also, moving to Allreduce will resolve the issue as now every call is
> effectively also a barrier. I have found with some benchmarks and
> collective implementations it can be faster than reduce anyway. That is why
> it might be worth trying.
>
> -Nathan
>
> > On Apr 15, 2019, at 2:33 PM, Saliya Ekanayake  wrote:
> >
> > Thank you, Nathan. Could you elaborate a bit on what happens internally?
> From your answer it seems, the program will still produce the correct
> output at the end but it'll use more resources.
> >
> > On Mon, Apr 15, 2019 at 9:00 AM Nathan Hjelm via devel <
> devel@lists.open-mpi.org> wrote:
> > If you do that it may run out of resources and deadlock or crash. I
> recommend either 1) adding a barrier every 100 iterations, 2) using
> allreduce, or 3) enabling coll/sync (which essentially does 1). Honestly, 2
> is probably the easiest option and, depending on how large you run, may not
> be any slower than 1 or 3.
> >
> > -Nathan
> >
> > > On Apr 15, 2019, at 9:53 AM, Saliya Ekanayake 
> wrote:
> > >
> > > Hi Devs,
> > >
> > > When doing MPI_Reduce in a loop (collecting on Rank 0), is it the
> correct understanding that ranks other than the root (0 in this case) will
> pass the collective as soon as their data is written to MPI buffers, without
> waiting for it all to be received at the root?
> > >
> > > If that's the case, then what would happen (semantically) if we execute
> MPI_Reduce in a loop without a barrier, allowing non-root ranks to hit the
> collective multiple times while the root is still processing an earlier
> reduce? For example, the root can be in the first reduce invocation while
> another rank is in the second reduce invocation.
> > >
> > > Thank you,
> > > Saliya
> > >
> > > --
> > > Saliya Ekanayake, Ph.D
> > > Postdoctoral Scholar
> > > Performance and Algorithms Research (PAR) Group
> > > Lawrence Berkeley National Laboratory
> > > Phone: 510-486-5772
> > >
> >
> >
> > --
> > Saliya Ekanayake, Ph.D
> > Postdoctoral Scholar
> > Performance and Algorithms Research (PAR) Group
> > Lawrence Berkeley National Laboratory
> > Phone: 510-486-5772
> >
>
>

-- 
Saliya Ekanayake, Ph.D
Postdoctoral Scholar
Performance and Algorithms Research (PAR) Group
Lawrence Berkeley National Laboratory
Phone: 510-486-5772
___
devel mailing list
devel@lists.open-mpi.org
https://lists.open-mpi.org/mailman/listinfo/devel

Re: [OMPI devel] MPI Reduce Without a Barrier

2019-04-16 Thread Nathan Hjelm via devel
What Ralph said. You just blow memory on a queue that is not recovered in the 
current implementation.

Also, moving to Allreduce will resolve the issue as now every call is 
effectively also a barrier. I have found with some benchmarks and collective 
implementations it can be faster than reduce anyway. That is why it might be 
worth trying.

-Nathan
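
For anyone following along, here is a minimal sketch (not code from this
thread; the iteration count, buffer size, and barrier interval are
illustrative) of the pattern under discussion and the two workarounds, a
periodic barrier or switching to MPI_Allreduce:

/* Minimal sketch (not code from this thread) of an MPI_Reduce loop.
 * The iteration count, buffer size, and barrier interval of 100 are
 * illustrative only. */
#include <mpi.h>

#define NITER 100000
#define N     1024

int main(int argc, char **argv)
{
    double in[N], out[N];
    int i, j;

    MPI_Init(&argc, &argv);
    for (j = 0; j < N; j++) in[j] = 1.0;

    for (i = 0; i < NITER; i++) {
        /* Unbarriered reduce: non-root ranks return as soon as their data is
         * handed off, so they can race ahead and queue up messages at root. */
        MPI_Reduce(in, out, N, MPI_DOUBLE, MPI_SUM, 0, MPI_COMM_WORLD);

        /* Option 1: an occasional barrier keeps the ranks loosely in step. */
        if (i % 100 == 99)
            MPI_Barrier(MPI_COMM_WORLD);

        /* Option 2: replace the MPI_Reduce above with
         *   MPI_Allreduce(in, out, N, MPI_DOUBLE, MPI_SUM, MPI_COMM_WORLD);
         * which synchronizes on every iteration by itself. */
    }

    MPI_Finalize();
    return 0;
}

The coll/sync component mentioned in the quoted text below takes the first
approach on the application's behalf by injecting the periodic barrier inside
the library.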

> On Apr 15, 2019, at 2:33 PM, Saliya Ekanayake  wrote:
> 
> Thank you, Nathan. Could you elaborate a bit on what happens internally? From 
> your answer it seems, the program will still produce the correct output at 
> the end but it'll use more resources. 
> 
> On Mon, Apr 15, 2019 at 9:00 AM Nathan Hjelm via devel 
>  wrote:
> If you do that it may run out of resources and deadlock or crash. I recommend 
> either 1) adding a barrier every 100 iterations, 2) using allreduce, or 3) 
> enabling coll/sync (which essentially does 1). Honestly, 2 is probably the 
> easiest option and, depending on how large you run, may not be any slower than 
> 1 or 3.
> 
> -Nathan
> 
> > On Apr 15, 2019, at 9:53 AM, Saliya Ekanayake  wrote:
> > 
> > Hi Devs,
> > 
> > When doing MPI_Reduce in a loop (collecting on Rank 0), is it the correct 
> > understanding that ranks other than the root (0 in this case) will pass the 
> > collective as soon as their data is written to MPI buffers, without waiting 
> > for it all to be received at the root?
> > 
> > If that's the case, then what would happen (semantically) if we execute 
> > MPI_Reduce in a loop without a barrier, allowing non-root ranks to hit the 
> > collective multiple times while the root is still processing an earlier 
> > reduce? For example, the root can be in the first reduce invocation while 
> > another rank is in the second reduce invocation.
> > 
> > Thank you,
> > Saliya
> > 
> > -- 
> > Saliya Ekanayake, Ph.D
> > Postdoctoral Scholar
> > Performance and Algorithms Research (PAR) Group
> > Lawrence Berkeley National Laboratory
> > Phone: 510-486-5772
> > 
> 
> 
> -- 
> Saliya Ekanayake, Ph.D
> Postdoctoral Scholar
> Performance and Algorithms Research (PAR) Group
> Lawrence Berkeley National Laboratory
> Phone: 510-486-5772
> 

___
devel mailing list
devel@lists.open-mpi.org
https://lists.open-mpi.org/mailman/listinfo/devel