Re: [OMPI devel] Possible Bug / Invalid Read in Ialltoallw

2022-05-05 Thread Gilles Gouaillardet via devel
Thanks again Damian,

I think the root cause is we call mca_topo_base_neighbor_count() instead of
ompi_comm_size() here.
It seems the implicit assumption is one would call
MPI_Ineighbor_alltoallw() on a cartesian communicator ... which is
obviously wrong: it is legit to call MPI_Ialltoallw(), even if the
communicator is cartesian.

You can try the attached patch (for the v4.1 series).

Note this patch breaks the internal ABI, that is why we will likely end up
with a different one for the v4.1 series.


Cheers,

Gilles

On Thu, May 5, 2022 at 1:10 AM Damian Marek via devel <
devel@lists.open-mpi.org> wrote:

> Hello,
>
> I made an example that triggers the issue. I had to get a little creative
> with how to trigger the crash, since it does not appear if the memory
> allocated for the send and recv types happens to be set to 0 (although
> valgrind still reports an invalid read).
>
> Communicator is an intra-communicator with a cartesian topology. It takes
> the first branch into "mca_topo_base_neighbor_count" and sets the wrong
> rcount and scount, at least to my understanding.
>
> Regards,
> Damian
> --
> *From:* devel  on behalf of George
> Bosilca via devel 
> *Sent:* Wednesday, May 4, 2022 10:20 AM
> *To:* Open MPI Developers 
> *Cc:* George Bosilca 
> *Subject:* Re: [OMPI devel] Possible Bug / Invalid Read in Ialltoallw
>
> Damien,
>
> As Gilles indicated an example would be great. Meanwhile, as you already
> have access to the root cause with a debugger, can you check what branch of
> the if regarding the communicator type in the
> ompi_coll_base_retain_datatypes_w function is taken. What is the
> communicator type ? Intra or inter ? with or without topology ?
>
> Thanks,
>   George.
>
>
> On Wed, May 4, 2022 at 9:35 AM Gilles Gouaillardet via devel <
> devel@lists.open-mpi.org> wrote:
>
> Damian,
>
> Thanks for the report!
>
> could you please trim your program and share it so I can have a look?
>
>
> Cheers,
>
> Gilles
>
>
> On Wed, May 4, 2022 at 10:27 PM Damian Marek via devel <
> devel@lists.open-mpi.org> wrote:
>
> Hello,
>
> I have been getting intermittent memory corruptions and segmentation
> faults while using Ialltoallw in OpenMPI v4.0.3. Valgrind also reports an
> invalid read in the "ompi_coll_base_retain_datatypes_w" function defined in
> "coll_base_util.c".
>
> Running with a debug build of ompi an assertion fails as well:
>
> base/coll_base_util.c:274: ompi_coll_base_retain_datatypes_w: Assertion
> `OPAL_OBJ_MAGIC_ID == ((opal_object_t *) (stypes[i]))->obj_magic_id' failed.
>
> I think it is related to the fact that I am using a communicator created
> with 2D MPI_Cart_create followed by getting 2 subcommunicators from
> MPI_Cart_sub, in some cases one of the dimensions is 1. In
> "ompi_coll_base_retain_datatypes_w" the neighbour count is used to find
> "rcount" and "scount" at line 267. In my bug case it returns 2 for both,
> but I believe it should be 1 since that is the comm size and the amount of
> memory I have allocated for sendtypes and recvtypes. Then, an invalid read
> happens at 274 and 280.
>
> Regards,
> Damian
>
>
>
>
>
>
>


alltoallw.patch
Description: Binary data


Re: [OMPI devel] Possible Bug / Invalid Read in Ialltoallw

2022-05-04 Thread Damian Marek via devel
Hello,

I made an example that triggers the issue. I had to get a little creative with 
how to trigger the crash, since it does not appear if the memory allocated for 
the send and recv types happens to be set to 0 (although valgrind still reports 
an invalid read).

Communicator is an intra-communicator with a cartesian topology. It takes the 
first branch into "mca_topo_base_neighbor_count" and sets the wrong rcount and 
scount, at least to my understanding.

Regards,
Damian

From: devel  on behalf of George Bosilca via 
devel 
Sent: Wednesday, May 4, 2022 10:20 AM
To: Open MPI Developers 
Cc: George Bosilca 
Subject: Re: [OMPI devel] Possible Bug / Invalid Read in Ialltoallw

Damien,

As Gilles indicated an example would be great. Meanwhile, as you already have 
access to the root cause with a debugger, can you check what branch of the if 
regarding the communicator type in the ompi_coll_base_retain_datatypes_w 
function is taken. What is the communicator type ? Intra or inter ? with or 
without topology ?

Thanks,
  George.


On Wed, May 4, 2022 at 9:35 AM Gilles Gouaillardet via devel 
mailto:devel@lists.open-mpi.org>> wrote:
Damian,

Thanks for the report!

could you please trim your program and share it so I can have a look?


Cheers,

Gilles


On Wed, May 4, 2022 at 10:27 PM Damian Marek via devel 
mailto:devel@lists.open-mpi.org>> wrote:
Hello,

I have been getting intermittent memory corruptions and segmentation faults 
while using Ialltoallw in OpenMPI v4.0.3. Valgrind also reports an invalid read 
in the "ompi_coll_base_retain_datatypes_w" function defined in 
"coll_base_util.c".

Running with a debug build of ompi an assertion fails as well:
base/coll_base_util.c:274: ompi_coll_base_retain_datatypes_w: Assertion 
`OPAL_OBJ_MAGIC_ID == ((opal_object_t *) (stypes[i]))->obj_magic_id' failed.
I think it is related to the fact that I am using a communicator created with 
2D MPI_Cart_create followed by getting 2 subcommunicators from MPI_Cart_sub, in 
some cases one of the dimensions is 1. In "ompi_coll_base_retain_datatypes_w" 
the neighbour count is used to find "rcount" and "scount" at line 267. In my 
bug case it returns 2 for both, but I believe it should be 1 since that is the 
comm size and the amount of memory I have allocated for sendtypes and 
recvtypes. Then, an invalid read happens at 274 and 280.

Regards,
Damian






#include "mpi.h"
#include 
#include 
#include 
#include 

//Create a temp 2D cartesian communicator, and retrieve subcomms along each dimension.
void subcomm(MPI_Comm comm, MPI_Comm *subcomms) {
int ndims = 2;

MPI_Comm comm_cart;
int nprocs;
std::vector periods(ndims), remdims(ndims), dims(ndims);
for (int i = 0; i < ndims; i++) {
periods[i] = remdims[i] = dims[i] = 0;
}

MPI_Comm_size(comm, );
int error = MPI_Dims_create(nprocs, ndims, dims.data());
if (error != MPI_SUCCESS) {
MPI_Abort(MPI_COMM_WORLD, error);
}

MPI_Cart_create(comm, ndims, dims.data(), periods.data(), 0, _cart);
for (int i = 0; i < ndims; i++) {
remdims[i] = 1;
MPI_Cart_sub(comm_cart, remdims.data(), [i]);
remdims[i] = 0;
}

MPI_Comm_free(_cart);
}

int main(int argc, char *argv[]) {
MPI_Init(, );

MPI_Comm subcomms[2];
subcomm(MPI_COMM_WORLD, subcomms);

int comm_size, world_rank;
MPI_Comm_size(subcomms[1], _size);
MPI_Comm_rank(MPI_COMM_WORLD, _rank);
if (!world_rank) {
int flag;
MPI_Comm_test_inter(subcomms[1], );
if (flag)
std::cout << "Comm is inter-communicator" << std::endl;
else
std::cout << "Comm is intra-communicator" << std::endl;
MPI_Topo_test(subcomms[1], );
switch (flag) {
case MPI_GRAPH:
case MPI_DIST_GRAPH:
case MPI_UNDEFINED:
assert(false);
case MPI_CART:
std::cout << "Comm is cartesian" << std::endl;
}
}

//Really only need an array of length comm_size, but allocate double so memory can be set manually to something other than 0.
MPI_Datatype *stypes = (MPI_Datatype *) malloc(sizeof(MPI_Datatype) *2* comm_size);
MPI_Datatype *rtypes = (MPI_Datatype *) malloc(sizeof(MPI_Datatype) *2* comm_size);
//If memory happens to be initialized to 0, the null check in coll_base_util.c avoids the invalid read.
MPI_Datatype unin;
stypes[1] = unin;
rtypes[1] = unin;
//
int *scounts = (int *) malloc(sizeof(int) * comm_size);
int *sdispls = (int *) malloc(sizeof(int) * comm_size);
int *rcounts = (int *) malloc(sizeof(int) * comm_size);
int *rdispls = (int *) malloc(sizeof(int) * comm_size);
for (int i = 0; i < comm_size; i++) {
scounts[i] = 1;
rcounts[i] = 1;

Re: [OMPI devel] Possible Bug / Invalid Read in Ialltoallw

2022-05-04 Thread George Bosilca via devel
Damien,

As Gilles indicated an example would be great. Meanwhile, as you already
have access to the root cause with a debugger, can you check what branch of
the if regarding the communicator type in the
ompi_coll_base_retain_datatypes_w function is taken. What is the
communicator type ? Intra or inter ? with or without topology ?

Thanks,
  George.


On Wed, May 4, 2022 at 9:35 AM Gilles Gouaillardet via devel <
devel@lists.open-mpi.org> wrote:

> Damian,
>
> Thanks for the report!
>
> could you please trim your program and share it so I can have a look?
>
>
> Cheers,
>
> Gilles
>
>
> On Wed, May 4, 2022 at 10:27 PM Damian Marek via devel <
> devel@lists.open-mpi.org> wrote:
>
>> Hello,
>>
>> I have been getting intermittent memory corruptions and segmentation
>> faults while using Ialltoallw in OpenMPI v4.0.3. Valgrind also reports an
>> invalid read in the "ompi_coll_base_retain_datatypes_w" function defined in
>> "coll_base_util.c".
>>
>> Running with a debug build of ompi an assertion fails as well:
>>
>> base/coll_base_util.c:274: ompi_coll_base_retain_datatypes_w: Assertion
>> `OPAL_OBJ_MAGIC_ID == ((opal_object_t *) (stypes[i]))->obj_magic_id' failed.
>>
>> I think it is related to the fact that I am using a communicator created
>> with 2D MPI_Cart_create followed by getting 2 subcommunicators from
>> MPI_Cart_sub, in some cases one of the dimensions is 1. In
>> "ompi_coll_base_retain_datatypes_w" the neighbour count is used to find
>> "rcount" and "scount" at line 267. In my bug case it returns 2 for both,
>> but I believe it should be 1 since that is the comm size and the amount of
>> memory I have allocated for sendtypes and recvtypes. Then, an invalid read
>> happens at 274 and 280.
>>
>> Regards,
>> Damian
>>
>>
>>
>>
>>
>>
>>


Re: [OMPI devel] Possible Bug / Invalid Read in Ialltoallw

2022-05-04 Thread Gilles Gouaillardet via devel
Damian,

Thanks for the report!

could you please trim your program and share it so I can have a look?


Cheers,

Gilles


On Wed, May 4, 2022 at 10:27 PM Damian Marek via devel <
devel@lists.open-mpi.org> wrote:

> Hello,
>
> I have been getting intermittent memory corruptions and segmentation
> faults while using Ialltoallw in OpenMPI v4.0.3. Valgrind also reports an
> invalid read in the "ompi_coll_base_retain_datatypes_w" function defined in
> "coll_base_util.c".
>
> Running with a debug build of ompi an assertion fails as well:
>
> base/coll_base_util.c:274: ompi_coll_base_retain_datatypes_w: Assertion
> `OPAL_OBJ_MAGIC_ID == ((opal_object_t *) (stypes[i]))->obj_magic_id' failed.
>
> I think it is related to the fact that I am using a communicator created
> with 2D MPI_Cart_create followed by getting 2 subcommunicators from
> MPI_Cart_sub, in some cases one of the dimensions is 1. In
> "ompi_coll_base_retain_datatypes_w" the neighbour count is used to find
> "rcount" and "scount" at line 267. In my bug case it returns 2 for both,
> but I believe it should be 1 since that is the comm size and the amount of
> memory I have allocated for sendtypes and recvtypes. Then, an invalid read
> happens at 274 and 280.
>
> Regards,
> Damian
>
>
>
>
>
>
>