Clyde,

thanks for reporting the issue.

Can you please give the attached patch a try?


Cheers,

Gilles

FWIW, the nbc module was not initially specific to Open MPI, and hence used
standard MPI subroutines.
In this case, we can avoid the issue by calling internal Open MPI
subroutines.
This is an intermediate patch, since similar issues might occur in other
places.


On Fri, Jul 6, 2018 at 11:12 PM Stanfield, Clyde <
clyde.stanfi...@radiantsolutions.com> wrote:

> We are using MPI_Ialltoallv for an image processing algorithm. When doing
> this we pass in an MPI_Type_contiguous with an MPI_Datatype of
> MPI_C_FLOAT_COMPLEX, which ends up being the size of multiple rows of the
> image (based on the number of nodes used for distribution). In addition,
> sendcounts, sdispls, recvcounts, and rdispls all fit within a signed int.
> Usually this works without any issues, but when we lower our number of
> nodes we sometimes see failures.
>
>
>
> What we found is that even though we can fit everything into signed ints,
> line 528 of nbc_internal.h ends up calling malloc with an int that
> appears to be (num_distributed_rows * num_columns *
> sizeof(std::complex<float>)), which in very large cases wraps around to a
> negative value. As a result we see “Error in malloc()” (line 530 of
> nbc_internal.h) throughout our output.
>
>
>
> We can get around this issue by ensuring the sum of our contiguous type
> never exceeds 2GB. However, this was unexpected to us, as our understanding
> was that as long as we can fit all the parts into signed ints we should be
> able to transfer more than 2GB at a time. Is it intended that
> MPI_Ialltoallv requires the underlying data to be less than 2GB, or is this
> an error in how malloc is being called (should it be called with a size_t
> instead of an int)?
>
>
>
> Thanks,
>
> Clyde Stanfield
>
>
>
> _______________________________________________
> users mailing list
> users@lists.open-mpi.org
> https://lists.open-mpi.org/mailman/listinfo/users

Attachment: nbc_copy.diff
Description: Binary data
