Thanks again Damian,

I think the root cause is we call mca_topo_base_neighbor_count() instead of
ompi_comm_size() here.
It seems the implicit assumption is one would call
MPI_Ineighbor_alltoallw() on a cartesian communicator ... which is
obviously wrong: it is legit to call MPI_Ialltoallw(), even if the
communicator is cartesian.

You can try the attached patch (for the v4.1 series).

Note this patch breaks the internal ABI, that is why we will likely end up
with a different one for the v4.1 series.


Cheers,

Gilles

On Thu, May 5, 2022 at 1:10 AM Damian Marek via devel <
devel@lists.open-mpi.org> wrote:

> Hello,
>
> I made an example that triggers the issue. I had to get a little creative
> with how to trigger the crash, since it does not appear if the memory
> allocated for the send and recv types happens to be set to 0 (although
> valgrind still reports an invalid read).
>
> Communicator is an intra-communicator with a cartesian topology. It takes
> the first branch into "mca_topo_base_neighbor_count" and sets the wrong
> rcount and scount, at least to my understanding.
>
> Regards,
> Damian
> ------------------------------
> *From:* devel <devel-boun...@lists.open-mpi.org> on behalf of George
> Bosilca via devel <devel@lists.open-mpi.org>
> *Sent:* Wednesday, May 4, 2022 10:20 AM
> *To:* Open MPI Developers <devel@lists.open-mpi.org>
> *Cc:* George Bosilca <bosi...@icl.utk.edu>
> *Subject:* Re: [OMPI devel] Possible Bug / Invalid Read in Ialltoallw
>
> Damien,
>
> As Gilles indicated an example would be great. Meanwhile, as you already
> have access to the root cause with a debugger, can you check what branch of
> the if regarding the communicator type in the
> ompi_coll_base_retain_datatypes_w function is taken. What is the
> communicator type ? Intra or inter ? with or without topology ?
>
> Thanks,
>   George.
>
>
> On Wed, May 4, 2022 at 9:35 AM Gilles Gouaillardet via devel <
> devel@lists.open-mpi.org> wrote:
>
> Damian,
>
> Thanks for the report!
>
> could you please trim your program and share it so I can have a look?
>
>
> Cheers,
>
> Gilles
>
>
> On Wed, May 4, 2022 at 10:27 PM Damian Marek via devel <
> devel@lists.open-mpi.org> wrote:
>
> Hello,
>
> I have been getting intermittent memory corruptions and segmentation
> faults while using Ialltoallw in OpenMPI v4.0.3. Valgrind also reports an
> invalid read in the "ompi_coll_base_retain_datatypes_w" function defined in
> "coll_base_util.c".
>
> Running with a debug build of ompi an assertion fails as well:
>
> base/coll_base_util.c:274: ompi_coll_base_retain_datatypes_w: Assertion
> `OPAL_OBJ_MAGIC_ID == ((opal_object_t *) (stypes[i]))->obj_magic_id' failed.
>
> I think it is related to the fact that I am using a communicator created
> with 2D MPI_Cart_create followed by getting 2 subcommunicators from
> MPI_Cart_sub, in some cases one of the dimensions is 1. In
> "ompi_coll_base_retain_datatypes_w" the neighbour count is used to find
> "rcount" and "scount" at line 267. In my bug case it returns 2 for both,
> but I believe it should be 1 since that is the comm size and the amount of
> memory I have allocated for sendtypes and recvtypes. Then, an invalid read
> happens at 274 and 280.
>
> Regards,
> Damian
>
>
>
>
>
>
>

Attachment: alltoallw.patch
Description: Binary data

Reply via email to