Re: [OMPI users] MPI_Ialltoallv

2018-07-09 Thread Stanfield, Clyde
Gilles,

I rebuilt Open MPI with the attached patch and can confirm that it appears to
fix the issue I originally reported.


Clyde Stanfield
Software Engineer
734-480-5100 office
clyde.stanfi...@mdaus.com



From: users [mailto:users-boun...@lists.open-mpi.org] On Behalf Of Gilles 
Gouaillardet
Sent: Friday, July 06, 2018 11:16 AM
To: Open MPI Users 
Subject: Re: [OMPI users] MPI_Ialltoallv

Clyde,

thanks for reporting the issue.

Can you please give the attached patch a try?


Cheers,

Gilles

FWIW, the nbc module was not initially specific to Open MPI, and hence used
standard MPI subroutines.
In this case, we can avoid the issue by calling internal Open MPI subroutines
instead.
This is an intermediate patch, since similar issues might occur in other
places.


On Fri, Jul 6, 2018 at 11:12 PM Stanfield, Clyde
<clyde.stanfi...@radiantsolutions.com> wrote:
We are using MPI_Ialltoallv for an image processing algorithm. When doing this,
we pass in an MPI_Type_contiguous with an MPI_Datatype of MPI_C_FLOAT_COMPLEX,
which ends up being the size of multiple rows of the image (based on the number
of nodes used for distribution). In addition, sendcounts, sdispls, recvcounts,
and rdispls all fit within a signed int. Usually this works without any issues,
but when we lower our number of nodes we sometimes see failures.

What we found is that even though we can fit everything into signed ints, line
528 of nbc_internal.h ends up calling malloc with an int that appears to be
the size of (num_distributed_rows * num_columns * sizeof(std::complex<float>)),
which in very large cases wraps back to negative. As a result we end up seeing
“Error in malloc()” (line 530 of nbc_internal.h) throughout our output.

We can get around this issue by ensuring the total size of our contiguous type
never exceeds 2GB. However, this was unexpected to us, as our understanding was
that as long as we can fit all the parts into signed ints we should be able to
transfer more than 2GB at a time. Is it intended that MPI_Ialltoallv requires
your underlying data to be less than 2GB, or is this an error in how malloc is
being called (should it be called with a size_t instead of an int)?

Thanks,
Clyde Stanfield







Re: [OMPI users] MPI_Ialltoallv

2018-07-06 Thread Stanfield, Clyde
Thanks for the quick feedback. I opened an issue here:
https://github.com/open-mpi/ompi/issues/5383


Clyde Stanfield 
Software Engineer 
734-480-5100 office 
clyde.stanfi...@mdaus.com



 


-Original Message-
From: users [mailto:users-boun...@lists.open-mpi.org] On Behalf Of Nathan Hjelm 
via users
Sent: Friday, July 06, 2018 10:57 AM
To: Open MPI Users 
Cc: Nathan Hjelm 
Subject: Re: [OMPI users] MPI_Ialltoallv

No, that's a bug. Please open an issue on GitHub and we will fix it shortly.

Thanks for reporting this issue.

-Nathan

> On Jul 6, 2018, at 8:08 AM, Stanfield, Clyde 
>  wrote:
> 
> We are using MPI_Ialltoallv for an image processing algorithm. When doing 
> this, we pass in an MPI_Type_contiguous with an MPI_Datatype of 
> MPI_C_FLOAT_COMPLEX, which ends up being the size of multiple rows of the 
> image (based on the number of nodes used for distribution). In addition, 
> sendcounts, sdispls, recvcounts, and rdispls all fit within a signed int. 
> Usually this works without any issues, but when we lower our number of nodes 
> we sometimes see failures.
> 
> What we found is that even though we can fit everything into signed ints, 
> line 528 of nbc_internal.h ends up calling malloc with an int that appears 
> to be the size of (num_distributed_rows * num_columns * 
> sizeof(std::complex<float>)), which in very large cases wraps back to 
> negative. As a result we end up seeing “Error in malloc()” (line 530 of 
> nbc_internal.h) throughout our output.
> 
> We can get around this issue by ensuring the total size of our contiguous 
> type never exceeds 2GB. However, this was unexpected to us, as our 
> understanding was that as long as we can fit all the parts into signed ints 
> we should be able to transfer more than 2GB at a time. Is it intended that 
> MPI_Ialltoallv requires your underlying data to be less than 2GB, or is this 
> an error in how malloc is being called (should it be called with a size_t 
> instead of an int)?
> 
> Thanks,
> Clyde Stanfield
> 
> 
> Clyde Stanfield
> Software Engineer
> 734-480-5100 office
> clyde.stanfi...@mdaus.com
>  
> 
> 
> 
> 

Re: [OMPI users] MPI_Ialltoallv

2018-07-06 Thread Gilles Gouaillardet
Clyde,

thanks for reporting the issue.

Can you please give the attached patch a try?


Cheers,

Gilles

FWIW, the nbc module was not initially specific to Open MPI, and hence used
standard MPI subroutines.
In this case, we can avoid the issue by calling internal Open MPI subroutines
instead.
This is an intermediate patch, since similar issues might occur in other
places.


On Fri, Jul 6, 2018 at 11:12 PM Stanfield, Clyde <
clyde.stanfi...@radiantsolutions.com> wrote:

> We are using MPI_Ialltoallv for an image processing algorithm. When doing
> this, we pass in an MPI_Type_contiguous with an MPI_Datatype of
> MPI_C_FLOAT_COMPLEX, which ends up being the size of multiple rows of the
> image (based on the number of nodes used for distribution). In addition,
> sendcounts, sdispls, recvcounts, and rdispls all fit within a signed int.
> Usually this works without any issues, but when we lower our number of
> nodes we sometimes see failures.
>
> What we found is that even though we can fit everything into signed ints,
> line 528 of nbc_internal.h ends up calling malloc with an int that appears
> to be the size of (num_distributed_rows * num_columns *
> sizeof(std::complex<float>)), which in very large cases wraps back to
> negative. As a result we end up seeing “Error in malloc()” (line 530 of
> nbc_internal.h) throughout our output.
>
> We can get around this issue by ensuring the total size of our contiguous
> type never exceeds 2GB. However, this was unexpected to us, as our
> understanding was that as long as we can fit all the parts into signed
> ints we should be able to transfer more than 2GB at a time. Is it intended
> that MPI_Ialltoallv requires your underlying data to be less than 2GB, or
> is this an error in how malloc is being called (should it be called with a
> size_t instead of an int)?
>
>
>
> Thanks,
>
> Clyde Stanfield
>


nbc_copy.diff
Description: Binary data

Re: [OMPI users] MPI_Ialltoallv

2018-07-06 Thread Nathan Hjelm via users
No, that's a bug. Please open an issue on GitHub and we will fix it shortly.

Thanks for reporting this issue.

-Nathan

> On Jul 6, 2018, at 8:08 AM, Stanfield, Clyde 
>  wrote:
> 
> We are using MPI_Ialltoallv for an image processing algorithm. When doing 
> this, we pass in an MPI_Type_contiguous with an MPI_Datatype of 
> MPI_C_FLOAT_COMPLEX, which ends up being the size of multiple rows of the 
> image (based on the number of nodes used for distribution). In addition, 
> sendcounts, sdispls, recvcounts, and rdispls all fit within a signed int. 
> Usually this works without any issues, but when we lower our number of nodes 
> we sometimes see failures.
> 
> What we found is that even though we can fit everything into signed ints, 
> line 528 of nbc_internal.h ends up calling malloc with an int that appears 
> to be the size of (num_distributed_rows * num_columns * 
> sizeof(std::complex<float>)), which in very large cases wraps back to 
> negative. As a result we end up seeing “Error in malloc()” (line 530 of 
> nbc_internal.h) throughout our output.
> 
> We can get around this issue by ensuring the total size of our contiguous 
> type never exceeds 2GB. However, this was unexpected to us, as our 
> understanding was that as long as we can fit all the parts into signed ints 
> we should be able to transfer more than 2GB at a time. Is it intended that 
> MPI_Ialltoallv requires your underlying data to be less than 2GB, or is this 
> an error in how malloc is being called (should it be called with a size_t 
> instead of an int)?
> 
> Thanks,
> Clyde Stanfield
> 
> 
> Clyde Stanfield
> Software Engineer
> 734-480-5100 office
> clyde.stanfi...@mdaus.com
>  
> 
> 
> 
> 




[OMPI users] MPI_Ialltoallv

2018-07-06 Thread Stanfield, Clyde
We are using MPI_Ialltoallv for an image processing algorithm. When doing this,
we pass in an MPI_Type_contiguous with an MPI_Datatype of MPI_C_FLOAT_COMPLEX,
which ends up being the size of multiple rows of the image (based on the number
of nodes used for distribution). In addition, sendcounts, sdispls, recvcounts,
and rdispls all fit within a signed int. Usually this works without any issues,
but when we lower our number of nodes we sometimes see failures.
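
For reference, here is a minimal sketch of the call pattern described above;
the sizes and variable names are hypothetical, not our actual code:

    // Sketch only: hypothetical sizes and names illustrating the call pattern.
    // One blockType element packs several image rows of complex floats, so the
    // counts and displacements stay small ints even when the underlying byte
    // counts are large. Fewer nodes means more rows per block, hence more bytes.
    #include <mpi.h>
    #include <complex>
    #include <vector>

    int main(int argc, char** argv)
    {
        MPI_Init(&argc, &argv);
        int nranks, rank;
        MPI_Comm_size(MPI_COMM_WORLD, &nranks);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);

        const int rowsPerBlock = 2048;   // hypothetical: grows as nodes shrink
        const int numCols      = 32768;  // hypothetical image width

        // 2048 * 32768 = 67,108,864 elements per block (~512 MiB); the count
        // itself still fits comfortably in a signed int.
        MPI_Datatype blockType;
        MPI_Type_contiguous(rowsPerBlock * numCols, MPI_C_FLOAT_COMPLEX,
                            &blockType);
        MPI_Type_commit(&blockType);

        std::vector<std::complex<float>> sendBuf(
            (size_t)rowsPerBlock * numCols * nranks);
        std::vector<std::complex<float>> recvBuf(sendBuf.size());

        // All counts and displacements are small ints (one block per rank).
        std::vector<int> sendCounts(nranks, 1), sendDispls(nranks);
        std::vector<int> recvCounts(nranks, 1), recvDispls(nranks);
        for (int ii = 0; ii < nranks; ++ii)
            sendDispls[ii] = recvDispls[ii] = ii; // in units of blockType

        MPI_Request request;
        MPI_Ialltoallv(sendBuf.data(), sendCounts.data(), sendDispls.data(),
                       blockType,
                       recvBuf.data(), recvCounts.data(), recvDispls.data(),
                       blockType, MPI_COMM_WORLD, &request);
        MPI_Wait(&request, MPI_STATUS_IGNORE);

        MPI_Type_free(&blockType);
        MPI_Finalize();
        return 0;
    }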

What we found is that even though we can fit everything into signed ints, line
528 of nbc_internal.h ends up calling malloc with an int that appears to be
the size of (num_distributed_rows * num_columns * sizeof(std::complex<float>)),
which in very large cases wraps back to negative. As a result we end up seeing
"Error in malloc()" (line 530 of nbc_internal.h) throughout our output.
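
With the hypothetical sizes from the sketch above (and four ranks), the
arithmetic that goes wrong looks like this:

    // Hypothetical numbers, only to show the 32-bit wrap.
    // 2048 rows/block * 4 blocks * 32768 columns = 268,435,456 elements,
    // and 268,435,456 * 8 bytes = 2,147,483,648 bytes, one more than
    // INT_MAX (2,147,483,647).
    const int elements = 2048 * 4 * 32768;          // fits in a signed int
    const int badSpan  = elements * 8;              // signed overflow: in
                                                    // practice wraps to
                                                    // -2,147,483,648
    const long long ok = 8LL * elements;            // 2,147,483,648, as intended
    // Signed overflow is formally undefined behavior; in practice the wrapped
    // negative value is what ends up being handed to malloc(), which fails.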

We can get around this issue by ensuring the total size of our contiguous type
never exceeds 2GB. However, this was unexpected to us, as our understanding was
that as long as we can fit all the parts into signed ints we should be able to
transfer more than 2GB at a time. Is it intended that MPI_Ialltoallv requires
your underlying data to be less than 2GB, or is this an error in how malloc is
being called (should it be called with a size_t instead of an int)?
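
For illustration (this is a sketch, not the actual nbc_internal.h code), the
kind of change we are asking about would widen the size computation before
calling malloc; blockType and count are hypothetical names:

    // Sketch only, not the actual Open MPI source. blockType and count stand
    // in for the datatype and the summed element count at the allocation site.
    #include <mpi.h>
    #include <cstdlib>

    void* allocateSpan(MPI_Datatype blockType, int count)
    {
        int typeSize;
        MPI_Type_size(blockType, &typeSize);   // MPI_Type_size reports an int

        // int span = typeSize * count;        // can wrap negative past 2 GB
        size_t span = (size_t)typeSize * (size_t)count;  // widened: no wrap
        return std::malloc(span);              // caller still checks for NULL
    }

    // MPI-3 also offers MPI_Type_size_x(), which reports an MPI_Count and
    // avoids the int limit on the size of a single datatype as well.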

Thanks,
Clyde Stanfield


Clyde Stanfield
Software Engineer
734-480-5100 office
clyde.stanfi...@mdaus.com







___
users mailing list
users@lists.open-mpi.org
https://lists.open-mpi.org/mailman/listinfo/users