Re: [OMPI users] MPI_Reduce_Scatter Segmentation Fault with Intel 2019 Update 1 Compilers on OPA-1
Thanks Mikhail,

You have a good point. With the current semantics used in the IMB benchmark, this cannot be equivalent to an MPI_Reduce() of N bytes followed by an MPI_Scatterv() of N bytes. So this is indeed a semantic question: what should an MPI_Reduce_scatter() of N bytes be equivalent to?

1) MPI_Reduce() of N bytes followed by MPI_Scatterv() in which each task receives N/commsize bytes
2) MPI_Reduce() of N*commsize bytes followed by MPI_Scatterv() in which each task receives N bytes

I honestly have no opinion on that, and as long as there is no memory corruption, I am happy with both options.

Cheers,

Gilles

On 12/5/2018 12:25 PM, Mikhail Kurnosov wrote:
> Hi,
>
> The memory manager of IMB (IMB_mem_manager.c) does not support the
> MPI_Reduce_scatter operation: it allocates a send buffer that is too
> small, sizeof(msg), while the operation requires commsize * sizeof(msg).
> There are two possible solutions:
>
> 1) Fix the computation of recvcounts (as proposed by Gilles)
> 2) Change the send-buffer allocation in the IMB memory manager. That
> approach would be consistent with IMB style (for example, the buffer
> allocation for the MPI_Scatter operation).
>
> WBR, Mikhail Kurnosov
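To make the two options concrete, here is a minimal C sketch (illustrative only, not actual IMB code; fill_recvcounts, N, and commsize are placeholder names) of how the recvcounts array and the required send-buffer size differ between the two interpretations:

#include <stddef.h>

/* Illustrative sketch only -- not actual IMB code.  N is the benchmark
 * message size in bytes, and elements are treated as MPI_BYTE.
 * Returns the number of bytes the send buffer must hold. */
static size_t fill_recvcounts(int option, int N, int commsize, int *recvcounts)
{
    if (option == 1) {
        /* Option 1: the total reduced vector is N bytes;
         * each task receives N/commsize bytes. */
        for (int i = 0; i < commsize; i++)
            recvcounts[i] = N / commsize;
        return (size_t)N;
    }
    /* Option 2: each task's result is N bytes;
     * the total reduced vector is N*commsize bytes. */
    for (int i = 0; i < commsize; i++)
        recvcounts[i] = N;
    return (size_t)N * (size_t)commsize;
}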
Re: [OMPI users] MPI_Reduce_Scatter Segmentation Fault with Intel 2019 Update 1 Compilers on OPA-1
Hi,

The memory manager of IMB (IMB_mem_manager.c) does not support the MPI_Reduce_scatter operation: it allocates a send buffer that is too small, sizeof(msg), while the operation requires commsize * sizeof(msg). There are two possible solutions:

1) Fix the computation of recvcounts (as proposed by Gilles)
2) Change the send-buffer allocation in the IMB memory manager. That approach would be consistent with IMB style (for example, the buffer allocation for the MPI_Scatter operation).

WBR, Mikhail Kurnosov

On 04.12.2018 17:06, Peter Kjellström wrote:
> I've noticed this also when using intel mpi (2018 and 2019u1). I
> classified it as a bug in imb but didn't look too deep (new
> reduce_scatter code).
>
> /Peter K
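A rough sketch of what the second option would look like (hypothetical code, not the actual IMB_mem_manager.c; alloc_reduce_scatter_bufs and msg_size are made-up names):

#include <mpi.h>
#include <stdlib.h>

/* Hypothetical sketch -- not the actual IMB_mem_manager.c code.
 * msg_size is the per-rank message size in bytes (sizeof(msg) above). */
static void alloc_reduce_scatter_bufs(MPI_Comm comm, size_t msg_size,
                                      void **sendbuf, void **recvbuf)
{
    int commsize;
    MPI_Comm_size(comm, &commsize);

    /* Every rank contributes the full reduction vector, so the send
     * buffer must hold commsize * msg_size bytes; allocating only
     * msg_size bytes (the current behavior) lets the collective read
     * past the end of the buffer. */
    *sendbuf = malloc((size_t)commsize * msg_size);

    /* Each rank receives only its own msg_size-byte chunk. */
    *recvbuf = malloc(msg_size);
}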
Re: [OMPI users] MPI_Reduce_Scatter Segmentation Fault with Intel 2019 Update 1 Compilers on OPA-1
Thanks for the report. As far as I am concerned, this is a bug in the IMB benchmark, and I issued a PR to fix it: https://github.com/intel/mpi-benchmarks/pull/11

Meanwhile, you can manually download and apply the patch at https://github.com/intel/mpi-benchmarks/pull/11.patch

Cheers,

Gilles

On 12/4/2018 4:41 AM, Hammond, Simon David via users wrote:
> Hi Open MPI Users,
>
> Just wanted to report a bug we have seen with OpenMPI 3.1.3 and 4.0.0
> when using the Intel 2019 Update 1 compilers on our Skylake/OmniPath-1
> cluster. The bug occurs when running the Github master src_c variant
> of the Intel MPI Benchmarks.
>
> Configuration:
>
> ./configure --prefix=/home/projects/x86-64-skylake/openmpi/3.1.3/intel/19.1.144 \
>   --with-slurm --with-psm2 \
>   CC=/home/projects/x86-64/intel/compilers/2019/compilers_and_libraries_2019.1.144/linux/bin/intel64/icc \
>   CXX=/home/projects/x86-64/intel/compilers/2019/compilers_and_libraries_2019.1.144/linux/bin/intel64/icpc \
>   FC=/home/projects/x86-64/intel/compilers/2019/compilers_and_libraries_2019.1.144/linux/bin/intel64/ifort \
>   --with-zlib=/home/projects/x86-64/zlib/1.2.11 \
>   --with-valgrind=/home/projects/x86-64/valgrind/3.13.0
>
> The operating system is RedHat 7.4, and we utilize a local build of
> GCC 7.2.0 for our Intel compiler (C++) header files. Everything builds
> correctly and passes make check without any issues. We then compile
> IMB and run IMB-MPI1 on 24 nodes and get the following:
>
> #
> # Benchmarking Reduce_scatter
> # #processes = 64
> # ( 1088 additional processes waiting in MPI_Barrier)
> #
>        #bytes #repetitions  t_min[usec]  t_max[usec]  t_avg[usec]
>             0         1000         0.18         0.19         0.18
>             4         1000         7.39        10.37         8.68
>             8         1000         7.84        11.14         9.23
>            16         1000         8.50        12.37        10.14
>            32         1000        10.37        14.66        12.15
>            64         1000        13.76        18.82        16.17
>           128         1000        21.63        27.61        24.87
>           256         1000        39.98        47.27        43.96
>           512         1000        72.93        78.59        75.15
>          1024         1000       147.21       152.98       149.94
>          2048         1000       413.41       426.90       420.15
>          4096         1000       421.28       442.58       434.52
>          8192         1000       418.31       450.20       438.51
>         16384         1000      1082.85      1221.44      1140.92
>         32768         1000      2434.11      2529.90      2476.72
>         65536          640      5469.57      6048.60      5687.08
>        131072          320     11702.94     12435.06     12075.07
>        262144          160     19214.42     20433.83     19883.80
>        524288           80     49462.22     53896.43     52101.56
>       1048576           40    119422.53    131922.20    126920.99
>       2097152           20    256345.97    288185.72    275767.05
>
> [node06:351648] *** Process received signal ***
> [node06:351648] Signal: Segmentation fault (11)
> [node06:351648] Signal code: Invalid permissions (2)
> [node06:351648] Failing at address: 0x7fdb6efc4000
> [node06:351648] [ 0] /lib64/libpthread.so.0(+0xf5e0)[0x7fdb8646c5e0]
> [node06:351648] [ 1] ./IMB-MPI1(__intel_avx_rep_memcpy+0x140)[0x415380]
> [node06:351648] [ 2] /home/projects/x86-64-skylake/openmpi/3.1.3/intel/19.1.144/lib/libopen-pal.so.40(opal_datatype_copy_content_same_ddt+0xca)[0x7fdb858d847a]
> [node06:351648] [ 3] /home/projects/x86-64-skylake/openmpi/3.1.3/intel/19.1.144/lib/libmpi.so.40(ompi_coll_base_reduce_scatter_intra_ring+0x3f9)[0x7fdb86c43b29]
> [node06:351648] [ 4] /home/projects/x86-64-skylake/openmpi/3.1.3/intel/19.1.144/lib/libmpi.so.40(PMPI_Reduce_scatter+0x1d7)[0x7fdb86c1de67]
> [node06:351648] [ 5] ./IMB-MPI1[0x40d624]
> [node06:351648] [ 6] ./IMB-MPI1[0x407d16]
> [node06:351648] [ 7] ./IMB-MPI1[0x403356]
> [node06:351648] [ 8] /lib64/libc.so.6(__libc_start_main+0xf5)[0x7fdb860bbc05]
> [node06:351648] [ 9] ./IMB-MPI1[0x402da9]
> [node06:351648] *** End of error message ***
> [node06:351649] *** Process received signal ***
> [node06:351649] Signal: Segmentation fault (11)
> [node06:351649] Signal code: Invalid permissions (2)
> [node06:351649] Failing at address: 0x7f9b19c6f000
> [node06:351649] [ 0] /lib64/libpthread.so.0(+0xf5e0)[0x7f9b311295e0]
> [node06:351649] [ 1] ./IMB-MPI1(__intel_avx_rep_memcpy+0x140)[0x415380]
> [node06:351649] [ 2] /home/projects/x86-64-skylake/openmpi/3.1.3/intel/19.1.144/lib/libopen-pal.so.40(opal_datatype_copy_content_same_ddt+0xca)[0x7f9b3059547a]
> [node06:351649] [ 3] /home/projects/x86-64-skylake/openmpi/3.1.3/intel/19.1.144/lib/libmpi.so.40(ompi_coll_base_reduce_scatter_intra_ring+0x3f9)[0x7f9b31900b29]
> [node06:351649] [ 4]
Re: [OMPI users] MPI_Reduce_Scatter Segmentation Fault with Intel 2019 Update 1 Compilers on OPA-1
On Tue, 4 Dec 2018 09:15:13 -0500 George Bosilca wrote:
> I'm trying to replicate using the same compiler (icc 2019) on my OSX
> machine over TCP and shared memory, with no luck so far. So the
> segfault is either something specific to OmniPath or to the memcpy
> implementation used on Skylake.

Note that it's imb-2019.1 that is the problem (I think). And I did get it to crash even on a single node (skylake / centos7).

/Peter

--
Sent from my Android device with K-9 Mail. Please excuse my brevity.
Re: [OMPI users] MPI_Reduce_Scatter Segmentation Fault with Intel 2019 Update 1 Compilers on OPA-1
I'm trying to replicate using the same compiler (icc 2019) on my OSX machine over TCP and shared memory, with no luck so far. So the segfault is either something specific to OmniPath or to the memcpy implementation used on Skylake.

I tried to use the trace you sent, more specifically the opal_datatype_copy_content_same_ddt mention, to understand where the segfault happens, but unfortunately there are 3 calls to opal_datatype_copy_content_same_ddt in the reduce_scatter algorithm. Could you please build in debug mode and, if you can replicate the segfault, send me the stack trace?

Thanks,

George.

On Tue, Dec 4, 2018 at 5:07 AM Peter Kjellström wrote:
> On Mon, 3 Dec 2018 19:41:25 + "Hammond, Simon David via users" wrote:
> > Hi Open MPI Users,
> >
> > Just wanted to report a bug we have seen with OpenMPI 3.1.3 and 4.0.0
> > when using the Intel 2019 Update 1 compilers on our Skylake/OmniPath-1
> > cluster. The bug occurs when running the Github master src_c variant
> > of the Intel MPI Benchmarks.
>
> I've noticed this also when using intel mpi (2018 and 2019u1). I
> classified it as a bug in imb but didn't look too deep (new
> reduce_scatter code).
>
> /Peter K
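For anyone trying to replicate outside IMB, a minimal sketch along these lines (illustrative code, not IMB's; the buffer sizes and the MPI_FLOAT/MPI_SUM choice are assumptions) exercises the same reduce_scatter pattern; shrinking the send buffer as noted in the comment should mimic the undersized-buffer behavior discussed above:

#include <mpi.h>
#include <stdio.h>
#include <stdlib.h>

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);

    int commsize, rank;
    MPI_Comm_size(MPI_COMM_WORLD, &commsize);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    const int count = 512 * 1024;      /* floats received per rank (~2 MB) */

    int *recvcounts = malloc(commsize * sizeof(int));
    for (int i = 0; i < commsize; i++)
        recvcounts[i] = count;         /* total vector = commsize * count */

    /* Correct sizing: the send buffer holds the full reduction vector.
     * To mimic the suspected IMB bug, allocate only count * sizeof(float)
     * here instead. */
    float *sendbuf = malloc((size_t)commsize * count * sizeof(float));
    float *recvbuf = malloc(count * sizeof(float));
    for (size_t i = 0; i < (size_t)commsize * count; i++)
        sendbuf[i] = 1.0f;

    MPI_Reduce_scatter(sendbuf, recvbuf, recvcounts, MPI_FLOAT,
                       MPI_SUM, MPI_COMM_WORLD);

    if (rank == 0)
        printf("recvbuf[0] = %f (expected %d)\n", recvbuf[0], commsize);

    free(sendbuf); free(recvbuf); free(recvcounts);
    MPI_Finalize();
    return 0;
}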
Re: [OMPI users] MPI_Reduce_Scatter Segmentation Fault with Intel 2019 Update 1 Compilers on OPA-1
On Mon, 3 Dec 2018 19:41:25 + "Hammond, Simon David via users" wrote:
> Hi Open MPI Users,
>
> Just wanted to report a bug we have seen with OpenMPI 3.1.3 and 4.0.0
> when using the Intel 2019 Update 1 compilers on our Skylake/OmniPath-1
> cluster. The bug occurs when running the Github master src_c variant
> of the Intel MPI Benchmarks.

I've noticed this also when using intel mpi (2018 and 2019u1). I classified it as a bug in imb but didn't look too deep (new reduce_scatter code).

/Peter K

--
Sent from my Android device with K-9 Mail. Please excuse my brevity.