Re: [OMPI devel] 3.1.2: Datatype errors and segfault in MPI_Allgatherv

2018-11-02 Thread Gilles Gouaillardet
Thanks Ben! I opened https://github.com/open-mpi/ompi/issues/6016 to track this issue, and wrote a simpler example that demonstrates it. We should follow up there from now on. FWIW, several bug fixes have not been backported into the v3 branches. Note that using the ddt

Re: [OMPI devel] 3.1.2: Datatype errors and segfault in MPI_Allgatherv

2018-11-01 Thread Larry Baker via devel
Things that read like they should be unsigned look suspicious to me: nbElems = -909934592, count = -1819869184. Larry Baker, US Geological Survey, 650-329-5608, ba...@usgs.gov
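(An editorial aside: the negative values above are consistent with a 64-bit element count being truncated to a signed 32-bit int, since anything past INT_MAX wraps around to a negative number. Below is a minimal sanity check in C, assuming two's-complement truncation; 2475098112 is simply -1819869184 + 2^32 and is only a guess at what the original count may have been.)

#include <inttypes.h>
#include <limits.h>
#include <stdio.h>

int main(void)
{
    /* -1819869184 + 2^32 = 2475098112, i.e. a count just over INT_MAX. */
    int64_t true_count = 2475098112LL;        /* plausible original value (assumption) */
    int32_t truncated  = (int32_t)true_count; /* wraps on two's-complement platforms */

    printf("64-bit count: %" PRId64 "\n", true_count);
    printf("as 32-bit:    %" PRId32 "\n", truncated);   /* prints -1819869184 */
    printf("INT_MAX:      %d\n", INT_MAX);
    return 0;
}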

Re: [OMPI devel] 3.1.2: Datatype errors and segfault in MPI_Allgatherv

2018-11-01 Thread Ben Menadue
Hi, I haven’t heard back from the user yet, but I just put this example together, which works on 1, 2, and 3 ranks but fails for 4. Unfortunately it needs a fair amount of memory, about 14.3 GB per process, so I was running it with -map-by ppr:1:node. It doesn’t fail with the segfault as the user’s
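(The example itself is not included in this preview; the sketch below is only a guess at its general shape, assuming an MPI_Allgatherv over MPI_DOUBLE with MPI_IN_PLACE and per-rank counts chosen so that the gathered buffer at 4 ranks is in the ballpark of the 14.3 GB per process quoted above. The counts, the datatype, and the use of MPI_IN_PLACE are all assumptions, not the user's actual code.)

#include <mpi.h>
#include <stdio.h>
#include <stdlib.h>

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);

    int rank, nprocs;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &nprocs);

    /* Assumed per-rank contribution: 480M doubles (~3.6 GB), so the full
     * gathered buffer at 4 ranks is roughly 14-15 GB per process. */
    const long long per_rank = 480000000LL;

    double *buf = malloc((size_t)per_rank * nprocs * sizeof(double));
    int *counts = malloc(nprocs * sizeof(int));
    int *displs = malloc(nprocs * sizeof(int));
    for (int i = 0; i < nprocs; i++) {
        counts[i] = (int)per_rank;            /* element counts still fit in a signed int */
        displs[i] = (int)(i * per_rank);      /* ...but the corresponding byte offsets exceed 2^31 */
    }

    /* Fill our own slice so the result can be spot-checked. */
    for (long long i = 0; i < per_rank; i++)
        buf[(long long)rank * per_rank + i] = (double)rank;

    MPI_Allgatherv(MPI_IN_PLACE, 0, MPI_DATATYPE_NULL,
                   buf, counts, displs, MPI_DOUBLE, MPI_COMM_WORLD);

    if (rank == 0)
        printf("last element (from rank %d): %f\n", nprocs - 1,
               buf[(long long)nprocs * per_rank - 1]);

    free(buf); free(counts); free(displs);
    MPI_Finalize();
    return 0;
}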

Re: [OMPI devel] 3.1.2: Datatype errors and segfault in MPI_Allgatherv

2018-11-01 Thread Ben Menadue
Hi Gilles, > On 2 Nov 2018, at 11:03 am, Gilles Gouaillardet wrote: > I noted the stack traces refer to opal_cuda_memcpy(). Is this issue specific to > CUDA environments? No, this is just on normal CPU-only nodes. But memcpy always goes through opal_cuda_memcpy when CUDA support is enabled,

Re: [OMPI devel] 3.1.2: Datatype errors and segfault in MPI_Allgatherv

2018-11-01 Thread Gilles Gouaillardet
Hi Ben, I noted the stack traces refer to opal_cuda_memcpy(). Is this issue specific to CUDA environments? The coll/tuned default collective module is known not to work when tasks use matching but different signatures. For example, one task sends one vector of N elements, and the other
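(The preview cuts off here. As an editorial illustration, not taken from the thread, of datatypes with matching signatures but different descriptions: both sides below transfer N doubles, so the type signatures match, but one side describes them as a single strided vector type while the other uses a plain count of MPI_DOUBLE. Run with at least 2 ranks.)

#include <mpi.h>
#include <stdio.h>

#define N 1024

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);

    int rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    if (rank == 0) {
        /* Sender: one element of a vector type covering N doubles taken with
         * a stride of 2; the type signature is N x MPI_DOUBLE. */
        double data[2 * N];
        for (int i = 0; i < 2 * N; i++) data[i] = (double)i;

        MPI_Datatype vec;
        MPI_Type_vector(N, 1, 2, MPI_DOUBLE, &vec);
        MPI_Type_commit(&vec);
        MPI_Send(data, 1, vec, 1, 0, MPI_COMM_WORLD);
        MPI_Type_free(&vec);
    } else if (rank == 1) {
        /* Receiver: N individual doubles into a contiguous buffer; same
         * signature, different datatype construction. */
        double recv[N];
        MPI_Recv(recv, N, MPI_DOUBLE, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
        printf("rank 1: recv[1] = %f\n", recv[1]);   /* prints 2.0 */
    }

    MPI_Finalize();
    return 0;
}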