Re: [OMPI devel] 3.1.2: Datatype errors and segfault in MPI_Allgatherv

2018-11-02 Thread Gilles Gouaillardet
Thanks Ben! I opened https://github.com/open-mpi/ompi/issues/6016 to track this issue, and wrote a simpler example that demonstrates it. We should follow up there from now on. FWIW, several bug fixes have not been backported into the v3 branches. Note that using the ddt

Re: [OMPI devel] 3.1.2: Datatype errors and segfault in MPI_Allgatherv

2018-11-01 Thread Larry Baker via devel
Things that read like they should be unsigned look suspicious to me:

nbElems -909934592
count -1819869184

Larry Baker
US Geological Survey
650-329-5608
ba...@usgs.gov

> On Nov 1, 2018, at 10:34 PM, Ben Menadue wrote:
>
> Hi,
>
> I haven’t heard back from the user yet, but I just put this
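Both numbers are consistent with 32-bit signed overflow: reinterpreted as unsigned 32-bit values they become 3385032704 and 2475098112, i.e. quantities between 2 GiB and 4 GiB that wrap negative when stored in an int. A minimal sketch (not code from the thread, and assuming the fields in question really are 32-bit signed ints):

#include <inttypes.h>
#include <stdint.h>
#include <stdio.h>

int main(void)
{
    /* Reinterpreting the two negative values from the log as unsigned
     * 32-bit integers gives the magnitudes they may have started from. */
    uint32_t big[] = { 3385032704u, 2475098112u };

    for (int i = 0; i < 2; i++) {
        /* Storing a value above INT32_MAX in a signed 32-bit int wraps
         * it negative on two's-complement platforms (strictly, the
         * conversion is implementation-defined in C). */
        int32_t as_signed = (int32_t) big[i];
        printf("%" PRIu32 " -> %" PRId32 "\n", big[i], as_signed);
    }
    /* Expected output:
     *   3385032704 -> -909934592
     *   2475098112 -> -1819869184   (the values quoted above)         */
    return 0;
}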

Re: [OMPI devel] 3.1.2: Datatype errors and segfault in MPI_Allgatherv

2018-11-01 Thread Ben Menadue
Hi,

I haven’t heard back from the user yet, but I just put this example together which works on 1, 2, and 3 ranks but fails for 4. Unfortunately it needs a fair amount of memory, about 14.3 GB per process, so I was running it with -map-by ppr:1:node. It doesn’t fail with the segfault as the user’s
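The archive truncates the message before the example itself, so the following is only a hypothetical sketch of the shape described (an MPI_Allgatherv over a contiguous derived datatype), not Ben's reproducer; the BLOCK and nblocks constants are placeholders kept small enough to run anywhere, whereas the real test reportedly needs about 14.3 GB per process and four ranks to fail:

#include <mpi.h>
#include <stdlib.h>

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);

    int size;
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    /* Placeholder sizes: one "block" is 1 Mi doubles (8 MiB). */
    const int BLOCK   = 1 << 20;
    const int nblocks = 4;            /* blocks contributed by each rank */

    MPI_Datatype blocktype;
    MPI_Type_contiguous(BLOCK, MPI_DOUBLE, &blocktype);
    MPI_Type_commit(&blocktype);

    double *sendbuf = calloc((size_t)nblocks * BLOCK, sizeof(double));
    double *recvbuf = calloc((size_t)size * nblocks * BLOCK, sizeof(double));
    int *recvcounts = malloc((size_t)size * sizeof(int));
    int *displs     = malloc((size_t)size * sizeof(int));
    for (int i = 0; i < size; i++) {
        recvcounts[i] = nblocks;      /* counts/displs are in units of blocktype */
        displs[i]     = i * nblocks;
    }

    MPI_Allgatherv(sendbuf, nblocks, blocktype,
                   recvbuf, recvcounts, displs, blocktype, MPI_COMM_WORLD);

    MPI_Type_free(&blocktype);
    free(sendbuf); free(recvbuf); free(recvcounts); free(displs);
    MPI_Finalize();
    return 0;
}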

Re: [OMPI devel] 3.1.2: Datatype errors and segfault in MPI_Allgatherv

2018-11-01 Thread Ben Menadue
Hi Gilles,

> On 2 Nov 2018, at 11:03 am, Gilles Gouaillardet wrote:
> I noted the stack traces refer to opal_cuda_memcpy(). Is this issue specific to
> CUDA environments?

No, this is just on normal CPU-only nodes. But memcpy always goes through opal_cuda_memcpy when CUDA support is enabled,

Re: [OMPI devel] 3.1.2: Datatype errors and segfault in MPI_Allgatherv

2018-11-01 Thread Gilles Gouaillardet
Hi Ben,

I noted the stack traces refer to opal_cuda_memcpy(). Is this issue specific to CUDA environments?

The coll/tuned default collective module is known not to work when tasks use matching but different signatures. For example, one task sends one vector of N elements, and the other
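The example is cut off by the archive, but the situation described ("matching but different signatures") can be illustrated with an assumed sketch, not code from the thread: one peer describes the data as a single element of a derived type holding N doubles, the other as N separate MPI_DOUBLEs. The type signatures match, so the program is legal MPI, yet the count/datatype pairs differ; in a collective such as MPI_Allgatherv the analogous case is the send and receive datatypes describing the same signature differently.

#include <mpi.h>

#define N 1024   /* doubles per message (placeholder size) */

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);

    int rank, size;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    static double buf[N];

    MPI_Datatype vec;                  /* one element = N doubles */
    MPI_Type_contiguous(N, MPI_DOUBLE, &vec);
    MPI_Type_commit(&vec);

    if (size >= 2) {
        if (rank == 0) {
            /* Sender: one element of the derived type ...          */
            MPI_Send(buf, 1, vec, 1, 0, MPI_COMM_WORLD);
        } else if (rank == 1) {
            /* ... receiver: N MPI_DOUBLEs.  Same type signature,
             * different count/datatype pair.                        */
            MPI_Recv(buf, N, MPI_DOUBLE, 0, 0, MPI_COMM_WORLD,
                     MPI_STATUS_IGNORE);
        }
    }

    MPI_Type_free(&vec);
    MPI_Finalize();
    return 0;
}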

[OMPI devel] 3.1.2: Datatype errors and segfault in MPI_Allgatherv

2018-11-01 Thread Ben Menadue
Hi,

One of our users is reporting an issue using MPI_Allgatherv with a large derived datatype: it segfaults inside Open MPI. Using a debug build of Open MPI 3.1.2 produces a ton of messages like this before the segfault:

[r3816:50921] ../../../../../opal/datatype/opal_datatype_pack.h:53