Re: [OMPI users] MPI_Reduce_Scatter Segmentation Fault with Intel 2019 Update 1 Compilers on OPA-1

2018-12-04 Thread Gilles Gouaillardet

Thanks Mikhail,


You have a good point.


With the current semantics used in the IMB benchmark, this cannot be 
equivalent to an MPI_Reduce() of N bytes followed by an MPI_Scatterv() of 
N bytes.


So this is indeed a semantic question: what should an MPI_Reduce_scatter() 
of N bytes be equivalent to?

1) MPI_Reduce() of N bytes followed by MPI_Scatterv() in which each task 
receives N/commsize bytes


2) MPI_Reduce() of N*commsize bytes followed by MPI_Scatterv() in which 
each task receives N bytes.



I honestly have no opinion on that, and as long as there is no memory 
corruption, I am happy with both options.
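
To make option 2 concrete, here is a minimal sketch (illustrative only, not 
IMB or Open MPI code; the helper name, the use of MPI_DOUBLE/MPI_SUM and 
rank 0 as the root are assumptions) of an MPI_Reduce() over count*commsize 
elements followed by an MPI_Scatterv() that hands count elements to each task:

#include <mpi.h>
#include <stdlib.h>

static void reduce_then_scatter(const double *sendbuf, double *recvbuf,
                                int count, MPI_Comm comm)
{
    int rank, commsize;
    MPI_Comm_rank(comm, &rank);
    MPI_Comm_size(comm, &commsize);

    /* Step 1: reduce count * commsize elements onto rank 0. */
    double *tmp = NULL;
    if (rank == 0)
        tmp = malloc((size_t)count * commsize * sizeof(double));
    MPI_Reduce(sendbuf, tmp, count * commsize, MPI_DOUBLE, MPI_SUM, 0, comm);

    /* Step 2: give each task its own count-element slice of the result. */
    int *counts = malloc(commsize * sizeof(int));
    int *displs = malloc(commsize * sizeof(int));
    for (int i = 0; i < commsize; i++) {
        counts[i] = count;
        displs[i] = i * count;
    }
    MPI_Scatterv(tmp, counts, displs, MPI_DOUBLE,
                 recvbuf, count, MPI_DOUBLE, 0, comm);

    free(counts);
    free(displs);
    free(tmp);    /* tmp is NULL on non-root ranks, so this is a no-op there */
}

Option 1 would differ only in that the send buffer holds count elements in 
total and each task receives roughly count/commsize of them.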




Cheers,


Gilles

On 12/5/2018 12:25 PM, Mikhail Kurnosov wrote:

Hi,

The memory manager of IMB (IMB_mem_manager.c) does not support the 
MPI_Reduce_scatter operation. It allocates too small a send buffer: 
sizeof(msg), whereas the operation requires commsize * sizeof(msg).

There are two possible solutions:

1) Fix the computation of recvcounts (as proposed by Gilles)
2) Change the memory allocation for the send buffer in the memory manager 
of IMB. That approach would be consistent with IMB style (for example, the 
buffer allocation for the MPI_Scatter operation).


WBR,
Mikhail Kurnosov
On 04.12.2018 17:06, Peter Kjellström wrote:

On Mon, 3 Dec 2018 19:41:25 +
"Hammond, Simon David via users"  wrote:

 > Hi Open MPI Users,
 >
 > Just wanted to report a bug we have seen with OpenMPI 3.1.3 and 4.0.0
 > when using the Intel 2019 Update 1 compilers on our
 > Skylake/OmniPath-1 cluster. The bug occurs when running the Github
 > master src_c variant of the Intel MPI Benchmarks.

I've noticed this also when using intel mpi (2018 and 2019u1). I
classified it as a bug in imb but didn't look too deep (new
reduce_scatter code).

/Peter K

--
Sent from my Android device with K-9 Mail. Please excuse my brevity.

___
users mailing list
users@lists.open-mpi.org
https://lists.open-mpi.org/mailman/listinfo/users

Re: [OMPI users] MPI_Reduce_Scatter Segmentation Fault with Intel 2019 Update 1 Compilers on OPA-1

2018-12-04 Thread Mikhail Kurnosov

Hi,

The memory manager of IMB (IMB_mem_manager.c) does not support the 
MPI_Reduce_scatter operation. It allocates too small a send buffer: 
sizeof(msg), whereas the operation requires commsize * sizeof(msg).

There are two possible solutions:

1) Fix the computation of recvcounts (as proposed by Gilles)
2) Change the memory allocation for the send buffer in the memory manager 
of IMB. That approach would be consistent with IMB style (for example, the 
buffer allocation for the MPI_Scatter operation).
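
As a rough illustration of the allocation rule behind the crash (a sketch 
only, not the IMB_mem_manager.c code; the function name, variable names and 
the use of MPI_DOUBLE/MPI_SUM are assumptions), MPI_Reduce_scatter reads 
sum(recvcounts) elements from the send buffer on every rank, so with equal 
recvcounts of count elements per rank the send buffer must hold 
count * commsize elements:

#include <mpi.h>
#include <stdlib.h>

static void run_reduce_scatter(int count, MPI_Comm comm)
{
    int commsize;
    MPI_Comm_size(comm, &commsize);

    int *recvcounts = malloc(commsize * sizeof(int));
    for (int i = 0; i < commsize; i++)
        recvcounts[i] = count;           /* every rank receives count elements */

    /* The send buffer must hold sum(recvcounts) = count * commsize elements
     * on every rank; allocating only count elements is exactly the
     * undersized buffer described above. */
    double *sendbuf = calloc((size_t)count * commsize, sizeof(double));
    double *recvbuf = calloc(count, sizeof(double));

    MPI_Reduce_scatter(sendbuf, recvbuf, recvcounts, MPI_DOUBLE,
                       MPI_SUM, comm);

    free(sendbuf);
    free(recvbuf);
    free(recvcounts);
}

Either fix makes the two sides consistent: option 1 shrinks the recvcounts 
so they sum to the buffer that is actually allocated, option 2 grows the 
send buffer to match the recvcounts.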


WBR,
Mikhail Kurnosov
On 04.12.2018 17:06, Peter Kjellström wrote:

On Mon, 3 Dec 2018 19:41:25 +
"Hammond, Simon David via users"  wrote:

 > Hi Open MPI Users,
 >
 > Just wanted to report a bug we have seen with OpenMPI 3.1.3 and 4.0.0
 > when using the Intel 2019 Update 1 compilers on our
 > Skylake/OmniPath-1 cluster. The bug occurs when running the Github
 > master src_c variant of the Intel MPI Benchmarks.

I've noticed this also when using intel mpi (2018 and 2019u1). I
classified it as a bug in imb but didn't look too deep (new
reduce_scatter code).

/Peter K

--
Sent from my Android device with K-9 Mail. Please excuse my brevity.

___
users mailing list
users@lists.open-mpi.org
https://lists.open-mpi.org/mailman/listinfo/users

Re: [OMPI users] MPI_Reduce_Scatter Segmentation Fault with Intel 2019 Update 1 Compilers on OPA-1

2018-12-04 Thread Gilles Gouaillardet

Thanks for the report.


As far as I am concerned, this is a bug in the IMB benchmark, and I 
issued a PR to fix that


https://github.com/intel/mpi-benchmarks/pull/11


Meanwhile, you can manually download and apply the patch at

https://github.com/intel/mpi-benchmarks/pull/11.patch



Cheers,


Gilles


On 12/4/2018 4:41 AM, Hammond, Simon David via users wrote:

Hi Open MPI Users,

Just wanted to report a bug we have seen with OpenMPI 3.1.3 and 4.0.0 when 
using the Intel 2019 Update 1 compilers on our Skylake/OmniPath-1 cluster. The 
bug occurs when running the Github master src_c variant of the Intel MPI 
Benchmarks.

Configuration:

./configure --prefix=/home/projects/x86-64-skylake/openmpi/3.1.3/intel/19.1.144 \
  --with-slurm --with-psm2 \
  CC=/home/projects/x86-64/intel/compilers/2019/compilers_and_libraries_2019.1.144/linux/bin/intel64/icc \
  CXX=/home/projects/x86-64/intel/compilers/2019/compilers_and_libraries_2019.1.144/linux/bin/intel64/icpc \
  FC=/home/projects/x86-64/intel/compilers/2019/compilers_and_libraries_2019.1.144/linux/bin/intel64/ifort \
  --with-zlib=/home/projects/x86-64/zlib/1.2.11 \
  --with-valgrind=/home/projects/x86-64/valgrind/3.13.0

The operating system is RedHat 7.4, and we use a local build of GCC 7.2.0 
to provide the (C++) header files for the Intel compiler. Everything builds 
correctly and passes a make check without any issues.

We then compile IMB and run IMB-MPI1 on 24 nodes and get the following:

#
# Benchmarking Reduce_scatter
# #processes = 64
# ( 1088 additional processes waiting in MPI_Barrier)
#
        #bytes #repetitions  t_min[usec]  t_max[usec]  t_avg[usec]
             0         1000         0.18         0.19         0.18
             4         1000         7.39        10.37         8.68
             8         1000         7.84        11.14         9.23
            16         1000         8.50        12.37        10.14
            32         1000        10.37        14.66        12.15
            64         1000        13.76        18.82        16.17
           128         1000        21.63        27.61        24.87
           256         1000        39.98        47.27        43.96
           512         1000        72.93        78.59        75.15
          1024         1000       147.21       152.98       149.94
          2048         1000       413.41       426.90       420.15
          4096         1000       421.28       442.58       434.52
          8192         1000       418.31       450.20       438.51
         16384         1000      1082.85      1221.44      1140.92
         32768         1000      2434.11      2529.90      2476.72
         65536          640      5469.57      6048.60      5687.08
        131072          320     11702.94     12435.06     12075.07
        262144          160     19214.42     20433.83     19883.80
        524288           80     49462.22     53896.43     52101.56
       1048576           40    119422.53    131922.20    126920.99
       2097152           20    256345.97    288185.72    275767.05
[node06:351648] *** Process received signal ***
[node06:351648] Signal: Segmentation fault (11)
[node06:351648] Signal code: Invalid permissions (2)
[node06:351648] Failing at address: 0x7fdb6efc4000
[node06:351648] [ 0] /lib64/libpthread.so.0(+0xf5e0)[0x7fdb8646c5e0]
[node06:351648] [ 1] ./IMB-MPI1(__intel_avx_rep_memcpy+0x140)[0x415380]
[node06:351648] [ 2] 
/home/projects/x86-64-skylake/openmpi/3.1.3/intel/19.1.144/lib/libopen-pal.so.40(opal_datatype_copy_content_same_ddt+0xca)[0x7fdb858d847a]
[node06:351648] [ 3] 
/home/projects/x86-64-skylake/openmpi/3.1.3/intel/19.1.144/lib/libmpi.so.40(ompi_coll_base_reduce_scatter_intra_ring+0x3f9)[0x7fdb86c43b29]
[node06:351648] [ 4] 
/home/projects/x86-64-skylake/openmpi/3.1.3/intel/19.1.144/lib/libmpi.so.40(PMPI_Reduce_scatter+0x1d7)[0x7fdb86c1de67]
[node06:351648] [ 5] ./IMB-MPI1[0x40d624]
[node06:351648] [ 6] ./IMB-MPI1[0x407d16]
[node06:351648] [ 7] ./IMB-MPI1[0x403356]
[node06:351648] [ 8] /lib64/libc.so.6(__libc_start_main+0xf5)[0x7fdb860bbc05]
[node06:351648] [ 9] ./IMB-MPI1[0x402da9]
[node06:351648] *** End of error message ***
[node06:351649] *** Process received signal ***
[node06:351649] Signal: Segmentation fault (11)
[node06:351649] Signal code: Invalid permissions (2)
[node06:351649] Failing at address: 0x7f9b19c6f000
[node06:351649] [ 0] /lib64/libpthread.so.0(+0xf5e0)[0x7f9b311295e0]
[node06:351649] [ 1] ./IMB-MPI1(__intel_avx_rep_memcpy+0x140)[0x415380]
[node06:351649] [ 2] 
/home/projects/x86-64-skylake/openmpi/3.1.3/intel/19.1.144/lib/libopen-pal.so.40(opal_datatype_copy_content_same_ddt+0xca)[0x7f9b3059547a]
[node06:351649] [ 3] 
/home/projects/x86-64-skylake/openmpi/3.1.3/intel/19.1.144/lib/libmpi.so.40(ompi_coll_base_reduce_scatter_intra_ring+0x3f9)[0x7f9b31900b29]
[node06:351649] [ 4] 

Re: [OMPI users] filesystem-dependent failure building Fortran interfaces

2018-12-04 Thread Jeff Squyres (jsquyres) via users
Hi Dave; thanks for reporting.

Yes, we've fixed this -- it should be included in 4.0.1.

https://github.com/open-mpi/ompi/pull/6121

If you care, you can try the nightly 4.0.x snapshot tarball -- it should 
include this fix:

 https://www.open-mpi.org/nightly/v4.0.x/


> On Dec 4, 2018, at 8:10 AM, Dave Love  wrote:
> 
> If you try to build somewhere out of tree, not in a subdir of the
> source, the Fortran build is likely to fail because mpi-ext-module.F90
> does
> 
>   include 
> '/openmpi-4.0.0/ompi/mpiext/pcollreq/mpif-h/mpiext_pcollreq_mpifh.h'
> 
> and can exceed the fixed line length.  It either needs to add (the
> compiler's equivalent of gfortran's) -ffixed-line-length-none to FFLAGS
> or, I guess, set the include path; the latter may be more robust.
> 
> [The situation arises, for instance, if the source location is
> read-only.  I haven't checked, but I think this was OK in v3.]


-- 
Jeff Squyres
jsquy...@cisco.com

___
users mailing list
users@lists.open-mpi.org
https://lists.open-mpi.org/mailman/listinfo/users


[OMPI users] filesystem-dependent failure building Fortran interfaces

2018-12-04 Thread Dave Love
If you try to build somewhere out of tree, not in a subdir of the
source, the Fortran build is likely to fail because mpi-ext-module.F90
does

   include 
'/openmpi-4.0.0/ompi/mpiext/pcollreq/mpif-h/mpiext_pcollreq_mpifh.h'

and can exceed the fixed line length.  It either needs to add (the
compiler's equivalent of gfortran's) -ffixed-line-length-none to FFLAGS
or, I guess, set the include path; the latter may be more robust.

[The situation arises, for instance, if the source location is
read-only.  I haven't checked, but I think this was OK in v3.]
___
users mailing list
users@lists.open-mpi.org
https://lists.open-mpi.org/mailman/listinfo/users


Re: [OMPI users] MPI_Reduce_Scatter Segmentation Fault with Intel 2019 Update 1 Compilers on OPA-1

2018-12-04 Thread Peter Kjellström
On Tue, 4 Dec 2018 09:15:13 -0500
George Bosilca  wrote:

> I'm trying to replicate using the same compiler (icc 2019) on my OSX
> over TCP and shared memory with no luck so far. So either the
> segfault is something specific to OmniPath or to the memcpy
> implementation used on Skylake.

Note that it's the imb-2019.1 that is the problem (I think). And I did
get it to crash even on a single node (skylake / centos7).

/Peter

-- 
Sent from my Android device with K-9 Mail. Please excuse my brevity.
___
users mailing list
users@lists.open-mpi.org
https://lists.open-mpi.org/mailman/listinfo/users

Re: [OMPI users] MPI_Reduce_Scatter Segmentation Fault with Intel 2019 Update 1 Compilers on OPA-1

2018-12-04 Thread George Bosilca
I'm trying to replicate using the same compiler (icc 2019) on my OSX over
TCP and shared memory with no luck so far. So either the segfault is
something specific to OmniPath or to the memcpy implementation used on
Skylake. I tried to use the trace you sent, more specifically the
opal_datatype_copy_content_same_ddt mention, to understand where the
segfault happens, but unfortunately there are 3 calls to
opal_datatype_copy_content_same_ddt in the reduce_scatter algorithm. Can
you please build in debug mode and, if you can replicate the segfault,
send me the stack trace?

Thanks,
  George.


On Tue, Dec 4, 2018 at 5:07 AM Peter Kjellström  wrote:

> On Mon, 3 Dec 2018 19:41:25 +
> "Hammond, Simon David via users"  wrote:
>
> > Hi Open MPI Users,
> >
> > Just wanted to report a bug we have seen with OpenMPI 3.1.3 and 4.0.0
> > when using the Intel 2019 Update 1 compilers on our
> > Skylake/OmniPath-1 cluster. The bug occurs when running the Github
> > master src_c variant of the Intel MPI Benchmarks.
>
> I've noticed this also when using intel mpi (2018 and 2019u1). I
> classified it as a bug in imb but didn't look too deep (new
> reduce_scatter code).
>
> /Peter K
>
> --
> Sent from my Android device with K-9 Mail. Please excuse my
> brevity.
___
users mailing list
users@lists.open-mpi.org
https://lists.open-mpi.org/mailman/listinfo/users

Re: [OMPI users] MPI_Reduce_Scatter Segmentation Fault with Intel 2019 Update 1 Compilers on OPA-1

2018-12-04 Thread Peter Kjellström
On Mon, 3 Dec 2018 19:41:25 +
"Hammond, Simon David via users"  wrote:

> Hi Open MPI Users,
> 
> Just wanted to report a bug we have seen with OpenMPI 3.1.3 and 4.0.0
> when using the Intel 2019 Update 1 compilers on our
> Skylake/OmniPath-1 cluster. The bug occurs when running the Github
> master src_c variant of the Intel MPI Benchmarks.

I've noticed this also when using intel mpi (2018 and 2019u1). I
classified it as a bug in imb but didn't look too deep (new
reduce_scatter code).

/Peter K

-- 
Sent from my Android device with K-9 Mail. Please excuse my brevity.
___
users mailing list
users@lists.open-mpi.org
https://lists.open-mpi.org/mailman/listinfo/users