[OMPI devel] Seeing message failures in OpenMPI 4.0.1 on UCX
After installing UCX 1.5.0 and OpenMPI 4.0.1 compiled for UCX and without verbs (full details below), my NetPIPE benchmark reports message failures for some message sizes above 300 KB. There are no failures when I benchmark with a non-UCX (verbs) build of OpenMPI 4.0.1, and none when I test the UCX build with --mca btl tcp,self. The failures show up on both QDR IB and 40 GbE networks. NetPIPE always tests the first and last bytes of each message, and can run a full integrity test of every byte with --integrity; this shows that in the failing cases no message is being received at all.

Details on the system and software installation are below, followed by several NetPIPE runs illustrating the errors, including a minimal case of 3 ping-pong messages where the middle one fails. Let me know if there is any more information you need, or any additional tests I can run.

     Dave Turner

CentOS 7 on Intel processors, QDR IB and 40 GbE tests

UCX 1.5.0 installed from the tarball according to the docs on the webpage

OpenMPI-4.0.1 configured for verbs with:

    ./configure F77=ifort FC=ifort --prefix=/homes/daveturner/libs/openmpi-4.0.1-verbs --enable-mpirun-prefix-by-default --enable-mpi-fortran=all --enable-mpi-cxx --enable-ipv6 --with-verbs --with-slurm --disable-dlopen

OpenMPI-4.0.1 configured for UCX with:

    ./configure F77=ifort FC=ifort --prefix=/homes/daveturner/libs/openmpi-4.0.1-ucx --enable-mpirun-prefix-by-default --enable-mpi-fortran=all --enable-mpi-cxx --enable-ipv6 --without-verbs --with-slurm --disable-dlopen --with-ucx=/homes/daveturner/libs/ucx-1.5.0/install

NetPIPE compiled with:

    /homes/daveturner/libs/openmpi-4.0.1-ucx/bin/mpicc -g -O3 -Wall -lrt -DMPI ./src/netpipe.c ./src/mpi.c -o NPmpi-4.0.1-ucx -I./src

(http://netpipe.cs.ksu.edu/ compiled with 'make mpi')

** Normal uni-directional point-to-point test shows errors (testing first and last bytes) for messages over 300 KB.
** Elf77
/homes/daveturner/libs/openmpi-4.0.1-ucx/bin/mpirun -np 2 --hostfile hf.elf NPmpi-4.0.1-ucx -o np.elf.mpi-4.0.1-ucx-ib --printhostnames
Saving output to np.elf.mpi-4.0.1-ucx-ib
Proc 0 is on host elf77
Proc 1 is on host elf78
Clock resolution ~ 1.000 nsecs   Clock accuracy ~ 38.000 nsecs
Start testing with 7 trials for each message size

  1:    1 B   24999 times -->    3.766 Mbps in 2.124 usecs
  2:    2 B  117702 times -->    8.386 Mbps in 1.908 usecs
  3:    3 B  131032 times -->   12.633 Mbps in 1.900 usecs
  4:    4 B  131592 times -->   16.715 Mbps in 1.914 usecs
  5:    6 B  130589 times -->   25.077 Mbps in 1.914 usecs
  6:    8 B  130608 times -->   33.402 Mbps in 1.916 usecs
  7:   12 B  130477 times -->   50.047 Mbps in 1.918 usecs
  8:   13 B  130329 times -->   54.872 Mbps in 1.895 usecs
  9:   16 B  131904 times -->   67.187 Mbps in 1.905 usecs
 10:   19 B  131225 times -->   79.255 Mbps in 1.918 usecs
 11:   21 B  130353 times -->   87.118 Mbps in 1.928 usecs
 12:   24 B  129640 times -->   99.831 Mbps in 1.923 usecs
 13:   27 B  129988 times -->  111.760 Mbps in 1.933 usecs
 14:   29 B  129351 times -->  121.048 Mbps in 1.917 usecs
 15:   32 B  130439 times -->  132.620 Mbps in 1.930 usecs
 16:   35 B  129511 times -->  144.272 Mbps in 1.941 usecs
 17:   45 B  128814 times -->  182.881 Mbps in 1.968 usecs
 18:   48 B  127000 times -->  194.231 Mbps in 1.977 usecs
 19:   51 B  126452 times -->  206.193 Mbps in 1.979 usecs
 20:   61 B  126343 times -->  236.168 Mbps in 2.066 usecs
 21:   64 B  120987 times -->  244.690 Mbps in 2.092 usecs
 22:   67 B  119477 times -->  256.660 Mbps in 2.088 usecs
 23:   93 B  119710 times -->  242.428 Mbps in 3.069 usecs
 24:   96 B   81460 times -->  250.503 Mbps in 3.066 usecs
 25:   99 B   81543 times -->  258.376 Mbps in 3.065 usecs
 26:  125 B   81558 times -->  321.127 Mbps in 3.114 usecs
 27:  128 B   80281 times -->  328.788 Mbps in 3.114 usecs
 28:  131 B   80270 times -->  336.387 Mbps in 3.115 usecs
 29:  189 B   80244 times -->  474.304 Mbps in 3.188 usecs
 30:  192 B   78423 times -->  482.258 Mbps in 3.185 usecs
 31:  195 B   78492 times -->  489.635 Mbps in 3.186 usecs
 32:  253 B   78467 times -->  623.891 Mbps in 3.244 usecs
 33:  256 B   77061 times -->  631.098 Mbps in 3.245 usecs
 34:  259 B   77038 times -->  637.905 Mbps in 3.248 usecs
 35:  381 B   76967 times -->  906.297 Mbps in 3.363 usecs
 36:  384 B   74335 times -->  913.387 Mbps in 3.363 usecs
 37:
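P.S. For anyone who wants to reproduce this without building NetPIPE, the sketch below is a minimal stand-in for the failing pattern, assuming nothing beyond the standard MPI C API. It stamps and checks the first and last bytes of each message the way NetPIPE's quick check does; the 384 KB size and the 3 ping-pongs mirror the failing range and the minimal case described above. This is illustrative code, not NetPIPE's actual source.

/* Minimal ping-pong with first/last-byte checking, in the spirit of
 * NetPIPE's quick integrity check.  Illustrative sketch only; the
 * 384 KB size is chosen to sit in the failing >300 KB range. */
#include <mpi.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

int main(int argc, char **argv)
{
    int rank;
    int n = 384 * 1024;                 /* above the ~300 KB failure threshold */

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    char *buf = malloc(n);
    for (int iter = 0; iter < 3; iter++) {   /* 3 ping-pongs, as in the minimal case */
        if (rank == 0) {
            memset(buf, 0, n);
            buf[0] = buf[n - 1] = (char)(iter + 1);   /* stamp first and last bytes */
            MPI_Send(buf, n, MPI_CHAR, 1, 0, MPI_COMM_WORLD);
            MPI_Recv(buf, n, MPI_CHAR, 1, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
        } else if (rank == 1) {
            memset(buf, 0, n);
            MPI_Recv(buf, n, MPI_CHAR, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
            if (buf[0] != (char)(iter + 1) || buf[n - 1] != (char)(iter + 1))
                fprintf(stderr, "iter %d: first/last byte check FAILED\n", iter);
            MPI_Send(buf, n, MPI_CHAR, 0, 0, MPI_COMM_WORLD);
        }
    }
    free(buf);
    MPI_Finalize();
    return 0;
}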
Re: [OMPI devel] MPI Reduce Without a Barrier
Thank you, Nathan. This makes more sense now.

On Tue, Apr 16, 2019 at 6:48 AM Nathan Hjelm wrote:
> What Ralph said. You just blow memory on a queue that is not recovered
> in the current implementation.
>
> Also, moving to Allreduce will resolve the issue, as now every call is
> effectively also a barrier. I have found with some benchmarks and
> collective implementations it can be faster than reduce anyway. That is
> why it might be worth trying.
>
> -Nathan
>
> > On Apr 15, 2019, at 2:33 PM, Saliya Ekanayake wrote:
> >
> > Thank you, Nathan. Could you elaborate a bit on what happens
> > internally? From your answer it seems the program will still produce
> > the correct output at the end, but it'll use more resources.
> >
> > On Mon, Apr 15, 2019 at 9:00 AM Nathan Hjelm via devel
> > <devel@lists.open-mpi.org> wrote:
> > If you do that it may run out of resources and deadlock or crash. I
> > recommend either 1) adding a barrier every 100 iterations, 2) using
> > allreduce, or 3) enabling coll/sync (which essentially does 1).
> > Honestly, 2 is probably the easiest option and, depending on how large
> > you run, may not be any slower than 1 or 3.
> >
> > -Nathan
> >
> > > On Apr 15, 2019, at 9:53 AM, Saliya Ekanayake wrote:
> > >
> > > Hi Devs,
> > >
> > > When doing MPI_Reduce in a loop (collecting on rank 0), is it the
> > > correct understanding that ranks other than the root (0 in this
> > > case) will pass the collective as soon as their data is written to
> > > MPI buffers, without waiting for all of it to be received at the
> > > root?
> > >
> > > If that's the case, then what would happen (semantically) if we
> > > execute MPI_Reduce in a loop without a barrier, allowing non-root
> > > ranks to hit the collective multiple times while the root is still
> > > processing an earlier reduce? For example, the root can be in the
> > > first reduce invocation while another rank is in the second reduce
> > > invocation.
> > >
> > > Thank you,
> > > Saliya
> > >
> > > --
> > > Saliya Ekanayake, Ph.D
> > > Postdoctoral Scholar
> > > Performance and Algorithms Research (PAR) Group
> > > Lawrence Berkeley National Laboratory
> > > Phone: 510-486-5772

--
Saliya Ekanayake, Ph.D
Postdoctoral Scholar
Performance and Algorithms Research (PAR) Group
Lawrence Berkeley National Laboratory
Phone: 510-486-5772
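[For reference, the pattern under discussion in this thread looks like the sketch below, assuming a simple sum reduction; the variable names and iteration count are illustrative. Non-root ranks can return from MPI_Reduce as soon as their contribution is handed to the library, so they may race many iterations ahead of the root, queueing unexpected messages there.]

/* The pattern in question: MPI_Reduce in a loop with no barrier.
 * Illustrative sketch only. */
#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    int rank;
    double local = 1.0, sum = 0.0;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    /* Non-root ranks may complete each call long before the root has
     * processed that iteration, consuming memory at the root. */
    for (int i = 0; i < 1000000; i++)
        MPI_Reduce(&local, &sum, 1, MPI_DOUBLE, MPI_SUM, 0, MPI_COMM_WORLD);

    if (rank == 0)
        printf("final sum = %f\n", sum);

    MPI_Finalize();
    return 0;
}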
Re: [OMPI devel] MPI Reduce Without a Barrier
What Ralph said. You just blow memory on a queue that is not recovered in the current implementation.

Also, moving to Allreduce will resolve the issue, as now every call is effectively also a barrier. I have found with some benchmarks and collective implementations it can be faster than reduce anyway. That is why it might be worth trying.

-Nathan

> On Apr 15, 2019, at 2:33 PM, Saliya Ekanayake wrote:
>
> Thank you, Nathan. Could you elaborate a bit on what happens internally?
> From your answer it seems the program will still produce the correct
> output at the end, but it'll use more resources.
>
> On Mon, Apr 15, 2019 at 9:00 AM Nathan Hjelm via devel
> <devel@lists.open-mpi.org> wrote:
> If you do that it may run out of resources and deadlock or crash. I
> recommend either 1) adding a barrier every 100 iterations, 2) using
> allreduce, or 3) enabling coll/sync (which essentially does 1). Honestly,
> 2 is probably the easiest option and, depending on how large you run, may
> not be any slower than 1 or 3.
>
> -Nathan
>
> > On Apr 15, 2019, at 9:53 AM, Saliya Ekanayake wrote:
> >
> > Hi Devs,
> >
> > When doing MPI_Reduce in a loop (collecting on rank 0), is it the
> > correct understanding that ranks other than the root (0 in this case)
> > will pass the collective as soon as their data is written to MPI
> > buffers, without waiting for all of it to be received at the root?
> >
> > If that's the case, then what would happen (semantically) if we execute
> > MPI_Reduce in a loop without a barrier, allowing non-root ranks to hit
> > the collective multiple times while the root is still processing an
> > earlier reduce? For example, the root can be in the first reduce
> > invocation while another rank is in the second reduce invocation.
> >
> > Thank you,
> > Saliya
> >
> > --
> > Saliya Ekanayake, Ph.D
> > Postdoctoral Scholar
> > Performance and Algorithms Research (PAR) Group
> > Lawrence Berkeley National Laboratory
> > Phone: 510-486-5772
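[A minimal sketch of options 1 and 2 from the recommendation above, assuming a simple sum reduction; the variable names and iteration counts are illustrative, not from the original thread.]

/* Option 1: reduce as before, but drain outstanding traffic with a
 * barrier every 100 iterations so non-root ranks cannot run far ahead.
 * Option 2: use MPI_Allreduce, where every rank needs the result, so
 * each call is effectively also a synchronization point. */
#include <mpi.h>

int main(int argc, char **argv)
{
    double local = 1.0, sum = 0.0;
    const int niters = 1000000;

    MPI_Init(&argc, &argv);

    /* Option 1: periodic barrier */
    for (int i = 0; i < niters; i++) {
        MPI_Reduce(&local, &sum, 1, MPI_DOUBLE, MPI_SUM, 0, MPI_COMM_WORLD);
        if (i % 100 == 99)
            MPI_Barrier(MPI_COMM_WORLD);
    }

    /* Option 2: allreduce; the result lands on every rank */
    for (int i = 0; i < niters; i++)
        MPI_Allreduce(&local, &sum, 1, MPI_DOUBLE, MPI_SUM, MPI_COMM_WORLD);

    MPI_Finalize();
    return 0;
}

[Option 3, coll/sync, is enabled at run time through Open MPI's MCA settings rather than in application code, so it is not shown here.]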