Hello Adam,

IMB had a bug related to Reduce_scatter.
https://github.com/intel/mpi-benchmarks/pull/11

I'm not sure this bug is the cause but you can try the patch.

https://github.com/intel/mpi-benchmarks/commit/841446d8cf4ca1f607c0f24b9a424ee39ee1f569

Thanks,
Takahiro Kawashima,
Fujitsu

> Hello,
>
> When I do a run with OpenMPI v4.0.0 on Infiniband with this command: mpirun
> --mca btl_openib_warn_no_device_params_found 0 --map-by node --mca
> orte_base_help_aggregate 0 --mca btl openib,vader,self --mca pml ob1 --mca
> btl_openib_allow_ib 1 -np 6
> -hostfile /home/aleblanc/ib-mpi-hosts IMB-MPI1
>
> I get this error:
>
> #----------------------------------------------------------------
> # Benchmarking Reduce_scatter
> # #processes = 4
> # ( 2 additional processes waiting in MPI_Barrier)
> #----------------------------------------------------------------
>       #bytes #repetitions  t_min[usec]  t_max[usec]  t_avg[usec]
>            0         1000         0.14         0.15         0.14
>            4         1000         5.00         7.58         6.28
>            8         1000         5.13         7.68         6.41
>           16         1000         5.05         7.74         6.39
>           32         1000         5.43         7.96         6.75
>           64         1000         6.78         8.56         7.69
>          128         1000         7.77         9.55         8.59
>          256         1000         8.28        10.96         9.66
>          512         1000         9.19        12.49        10.85
>         1024         1000        11.78        15.01        13.38
>         2048         1000        17.41        19.51        18.52
>         4096         1000        25.73        28.22        26.89
>         8192         1000        47.75        49.44        48.79
>        16384         1000        81.10        90.15        84.75
>        32768         1000       163.01       178.58       173.19
>        65536          640       315.63       340.51       333.18
>       131072          320       475.48       528.82       510.85
>       262144          160       979.70      1063.81      1035.61
>       524288           80      2070.51      2242.58      2150.15
>      1048576           40      4177.36      4527.25      4431.65
>      2097152           20      8738.08      9340.50      9147.89
> [pandora:04500] *** Process received signal ***
> [pandora:04500] Signal: Segmentation fault (11)
> [pandora:04500] Signal code: Address not mapped (1)
> [pandora:04500] Failing at address: 0x7f310ebffff0
> [pandora:04499] *** Process received signal ***
> [pandora:04499] Signal: Segmentation fault (11)
> [pandora:04499] Signal code: Address not mapped (1)
> [pandora:04499] Failing at address: 0x7f28b11ffff0
> [pandora:04500] [ 0] /usr/lib64/libpthread.so.0(+0xf680)[0x7f3126bef680]
> [pandora:04500] [ 1] /usr/lib64/libc.so.6(+0x14c4a0)[0x7f312695c4a0]
> [pandora:04500] [ 2]
> /opt/openmpi/4.0.0/lib/libopen-pal.so.40(+0x4be55)[0x7f312628be55]
> [pandora:04500] [ 3] [pandora:04499] [ 0]
> /opt/openmpi/4.0.0/lib/libmpi.so.40(ompi_coll_base_reduce_scatter_intra_ring+0x23b)[0x7f3126ea798b]
> [pandora:04500] [ 4] /usr/lib64/libpthread.so.0(+0xf680)[0x7f28c91ef680]
> [pandora:04499] [ 1]
> /opt/openmpi/4.0.0/lib/libmpi.so.40(PMPI_Reduce_scatter+0x1c7)[0x7f3126e7eda7]
> [pandora:04500] [ 5] IMB-MPI1[0x40b83b]
> [pandora:04500] [ 6] IMB-MPI1[0x407155]
> [pandora:04500] [ 7] IMB-MPI1[0x4022ea]
> [pandora:04500] [ 8] /usr/lib64/libc.so.6(+0x14c4a0)[0x7f28c8f5c4a0]
> [pandora:04499] [ 2]
> /usr/lib64/libc.so.6(__libc_start_main+0xf5)[0x7f31268323d5]
> [pandora:04500] [ 9] IMB-MPI1[0x401d49]
> [pandora:04500] *** End of error message ***
> /opt/openmpi/4.0.0/lib/libopen-pal.so.40(+0x4be55)[0x7f28c888be55]
> [pandora:04499] [ 3]
> /opt/openmpi/4.0.0/lib/libmpi.so.40(ompi_coll_base_reduce_scatter_intra_ring+0x23b)[0x7f28c94a798b]
> [pandora:04499] [ 4]
> /opt/openmpi/4.0.0/lib/libmpi.so.40(PMPI_Reduce_scatter+0x1c7)[0x7f28c947eda7]
> [pandora:04499] [ 5] IMB-MPI1[0x40b83b]
> [pandora:04499] [ 6] IMB-MPI1[0x407155]
> [pandora:04499] [ 7] IMB-MPI1[0x4022ea]
> [pandora:04499] [ 8]
> /usr/lib64/libc.so.6(__libc_start_main+0xf5)[0x7f28c8e323d5]
> [pandora:04499] [ 9] IMB-MPI1[0x401d49]
> [pandora:04499] *** End of error message ***
> [phoebe:03779] *** Process received signal ***
> [phoebe:03779] Signal: Segmentation fault (11)
> [phoebe:03779] Signal code: Address not mapped (1)
> [phoebe:03779] Failing at address: 0x7f483d6ffff0
> [phoebe:03779] [ 0] /usr/lib64/libpthread.so.0(+0xf680)[0x7f48556c7680]
> [phoebe:03779] [ 1] /usr/lib64/libc.so.6(+0x14c4a0)[0x7f48554344a0]
> [phoebe:03779] [ 2]
> /opt/openmpi/4.0.0/lib/libopen-pal.so.40(+0x4be55)[0x7f4854d63e55]
> [phoebe:03779] [ 3]
> /opt/openmpi/4.0.0/lib/libmpi.so.40(ompi_coll_base_reduce_scatter_intra_ring+0x23b)[0x7f485597f98b]
> [phoebe:03779] [ 4]
> /opt/openmpi/4.0.0/lib/libmpi.so.40(PMPI_Reduce_scatter+0x1c7)[0x7f4855956da7]
> [phoebe:03779] [ 5] IMB-MPI1[0x40b83b]
> [phoebe:03779] [ 6] IMB-MPI1[0x407155]
> [phoebe:03779] [ 7] IMB-MPI1[0x4022ea]
> [phoebe:03779] [ 8]
> /usr/lib64/libc.so.6(__libc_start_main+0xf5)[0x7f485530a3d5]
> [phoebe:03779] [ 9] IMB-MPI1[0x401d49]
> [phoebe:03779] *** End of error message ***
> --------------------------------------------------------------------------
> Primary job terminated normally, but 1 process returned
> a non-zero exit code. Per user-direction, the job has been aborted.
> --------------------------------------------------------------------------
> --------------------------------------------------------------------------
> mpirun noticed that process rank 1 with PID 3779 on node phoebe-ib exited
> on signal 11 (Segmentation fault).
> --------------------------------------------------------------------------
>
> Also if I reinstall 3.1.2 I do not have this issue at all.
>
> Any thoughts on what could be the issue?
>
> Thanks,
> Adam LeBlanc
_______________________________________________
users mailing list
users@lists.open-mpi.org
https://lists.open-mpi.org/mailman/listinfo/users