Thanks Ben!

I opened https://github.com/open-mpi/ompi/issues/6016 to track this issue, and wrote a simpler example that reproduces it.

We should follow up there from now on.


FWIW, several bug fixes have not been backported to the v3 branches.

Note that using the actual datatype (ddt) instead of MPI_DATATYPE_NULL could be a good enough workaround for the time being

(and unlike forcing the coll/basic component, performance will be unaffected).
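
For illustration, a minimal sketch of what that looks like (the toy buffer, counts and MPI_FLOAT type here are made up for the sketch and are not the user's code; in the call quoted further down this thread it simply means passing mpitype_vec_nobs instead of MPI_DATATYPE_NULL):

/* Hypothetical minimal sketch of the workaround: with MPI_IN_PLACE the
 * send count and datatype are ignored by the MPI standard, so passing the
 * receive datatype instead of MPI_DATATYPE_NULL is legal and avoids the
 * code path that trips here. */
#include <mpi.h>
#include <stdlib.h>

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);

    int rank, size;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    const int n = 4;                          /* elements contributed per rank */
    float *buf = malloc((size_t)n * size * sizeof(float));
    int *recvcounts = malloc(size * sizeof(int));
    int *displs = malloc(size * sizeof(int));
    for (int i = 0; i < size; i++) {
        recvcounts[i] = n;
        displs[i] = i * n;
    }
    for (int i = 0; i < n; i++)
        buf[rank * n + i] = (float)rank;      /* fill our own slot in place */

    /* workaround: pass the receive datatype (here MPI_FLOAT) on the
     * MPI_IN_PLACE side instead of MPI_DATATYPE_NULL */
    MPI_Allgatherv(MPI_IN_PLACE, 0, MPI_FLOAT,
                   buf, recvcounts, displs, MPI_FLOAT, MPI_COMM_WORLD);

    free(displs);
    free(recvcounts);
    free(buf);
    MPI_Finalize();
    return 0;
}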



Cheers,


Gilles


On 11/2/2018 2:34 PM, Ben Menadue wrote:
Hi,

I haven’t heard back from the user yet, but I just put together this example, which works on 1, 2, and 3 ranks but fails on 4. Unfortunately it needs a fair amount of memory, about 14.3 GB per process, so I was running it with -map-by ppr:1:node.

It doesn’t fail with a segfault the way the user’s code does, but it does abort with SIGABRT:

16:12 bjm900@r4320 MPI_TESTS> mpirun -mca pml ob1 -mca coll ^fca,hcoll -map-by ppr:1:node -np 4 ./a.out
[r4450:11544] ../../../../../opal/datatype/opal_datatype_pack.h:53
Pointer 0x2bb7ceedb010 size 131040 is outside [0x2b9ec63cb010,0x2bad1458b010] for
base ptr 0x2b9ec63cb010 count 1 and data
[r4450:11544] Datatype 0x145fe90[] size 30720000000 align 4 id 0 length 7 used 6 true_lb 0 true_ub 61440000000 (true_extent 61440000000) lb 0 ub 61440000000 (extent 61440000000)
nbElems -909934592 loops 4 flags 104 (committed )-c-----GD--[---][---]
   contain OPAL_FLOAT4:*
--C--------[---][---]    OPAL_LOOP_S 192 times the next 2 elements extent 80000000
--C---P-D--[---][---]    OPAL_FLOAT4 count 20000000 disp 0xaba950000 (46080000000) blen 0 extent 4 (size 80000000)
--C--------[---][---]    OPAL_LOOP_E prev 2 elements first elem displacement 46080000000 size of data 80000000
--C--------[---][---]    OPAL_LOOP_S 192 times the next 2 elements extent 80000000
--C---P-D--[---][---]    OPAL_FLOAT4 count 20000000 disp 0x0 (0) blen 0 extent 4 (size 80000000)
--C--------[---][---]    OPAL_LOOP_E prev 2 elements first elem displacement 0 size of data 80000000
-------G---[---][---]    OPAL_LOOP_E prev 6 elements first elem displacement 46080000000 size of data 655228928
Optimized description
-cC---P-DB-[---][---]     OPAL_UINT1 count -1819869184 disp 0xaba950000 (46080000000) blen 1 extent 1 (size 15360000000)
-cC---P-DB-[---][---]     OPAL_UINT1 count -1819869184 disp 0x0 (0) blen 1 extent 1 (size 15360000000)
-------G---[---][---]    OPAL_LOOP_E prev 2 elements first elem displacement 46080000000
[r4450:11544] *** Process received signal ***
[r4450:11544] Signal: Aborted (6)
[r4450:11544] Signal code:  (-6)

Cheers,
Ben




On 2 Nov 2018, at 12:09 pm, Ben Menadue <ben.mena...@nci.org.au> wrote:

Hi Gilles,

On 2 Nov 2018, at 11:03 am, Gilles Gouaillardet <gil...@rist.or.jp> wrote:

I noted the stack traces refer to opal_cuda_memcpy(). Is this issue specific to CUDA environments?

No, this is just on normal CPU-only nodes. But memcpy always goes through opal_cuda_memcpy when CUDA support is enabled, even if there are no GPUs in use (or indeed, none installed).

The coll/tuned default collective module is known not to work when tasks use matching but different signatures. For example, one task sends one vector of N elements, and the other task receives N elements.

This is the call that triggers it:

ierror = MPI_Allgatherv(MPI_IN_PLACE, 0, MPI_DATATYPE_NULL, S[0], recvcounts, displs, mpitype_vec_nobs, node_comm);

(and changing the source datatype to MPI_BYTE to avoid the NULL handle doesn’t help).
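
As an aside, here is a hypothetical sketch of the matching-but-different-signatures pattern described above (not taken from the user's code): every rank sends N plain ints while the root receives one element of a contiguous type covering N ints per rank, so the type signatures match even though the datatype handles differ.

/* Hypothetical sketch of "matching but different signatures" in a collective:
 * senders use N MPI_INTs, the root receives 1 element of a contiguous type
 * covering N ints per rank.  Valid MPI, but the handles differ. */
#include <mpi.h>
#include <stdlib.h>

int main(int argc, char **argv)
{
    enum { N = 8 };
    int rank, size;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    int sendbuf[N];
    for (int i = 0; i < N; i++)
        sendbuf[i] = rank;

    MPI_Datatype blk;
    MPI_Type_contiguous(N, MPI_INT, &blk);
    MPI_Type_commit(&blk);

    int *recvbuf = NULL;
    if (rank == 0)
        recvbuf = malloc((size_t)size * N * sizeof(int));

    /* senders: N MPI_INTs; root: 1 element of the N-int contiguous type */
    MPI_Gather(sendbuf, N, MPI_INT, recvbuf, 1, blk, 0, MPI_COMM_WORLD);

    MPI_Type_free(&blk);
    free(recvbuf);
    MPI_Finalize();
    return 0;
}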

A workaround worth trying is to
mpirun --mca coll basic ...

Thanks — using --mca coll basic,libnbc fixes it (basic on its own fails because it can’t work out what to use for Iallgather).
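
For reference, the full command line would then look something like this (simply combining the flags quoted earlier in this thread):

mpirun -mca pml ob1 -mca coll basic,libnbc -map-by ppr:1:node -np 4 ./a.out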

Last but not least, could you please post a minimal example (and the number of MPI tasks used) that evidences the issue?

I’m just waiting for the user to get back to me with the okay to share the code. Otherwise, I’ll see what I can put together myself. It works on 42 cores (at 14 per node = 3 nodes) but fails for 43 cores (so 1 rank on the 4th node). The communicator includes 1 rank per node, so it’s going from a three-rank communicator to a four-rank communicator — perhaps the tuned algorithm changes at that point?

Cheers,
Ben

_______________________________________________
devel mailing list
devel@lists.open-mpi.org
https://lists.open-mpi.org/mailman/listinfo/devel


