Re: [OMPI devel] Hanging tests

2016-09-06 Thread George Bosilca
I can make MPI_Issend_rtoa deadlock with vader and sm.

  George.


On Tue, Sep 6, 2016 at 12:06 PM, r...@open-mpi.org wrote:

> FWIW: those tests hang for me with TCP (I don’t have openib on my
> cluster). I’ll check it with your change as well
>
>
> On Sep 6, 2016, at 1:29 AM, Gilles Gouaillardet wrote:
>
> Ralph,
>
>
> this looks like another hang :-(
>
>
> I ran MPI_Issend_rtoa_c on 32 tasks (2 nodes, 2 sockets per node, 8 cores
> per socket) with InfiniBand,
>
> and I always observe the same hang at the same place.
>
>
> Surprisingly, I do not get any hang if I blacklist the openib btl.
>
>
> The patch below can be used to avoid the hang with InfiniBand, or for
> debugging purposes.
>
> The hang occurs in communicator 6, and if I skip the tests on
> communicator 2, no hang happens.
>
> The hang occurs on an intercomm:
>
> task 0 (from MPI_COMM_WORLD) has rank 0 in group A of the intercomm
>
> task 1 (from MPI_COMM_WORLD) has rank 0 in group B of the intercomm
>
> task 0 does an MPI_Issend to task 1, task 1 does an MPI_Irecv from task 0,
> and then both hang in MPI_Wait().
>
> Surprisingly, tasks 0 and 1 run on the same node, so it is very puzzling
> that the hang only occurs with the openib btl,
>
> since vader should be used here.
>
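> For reference, a minimal standalone sketch of that communication pattern
> (building the intercomm with MPI_Comm_split + MPI_Intercomm_create over two
> halves of MPI_COMM_WORLD, not the actual MPITEST harness) would look like
> this; with a working btl it completes immediately:
>
> #include <mpi.h>
>
> int main(int argc, char *argv[]) {
>     int rank, size, b = 0;
>     MPI_Comm half, inter;
>     MPI_Request req;
>
>     MPI_Init(&argc, &argv);
>     MPI_Comm_rank(MPI_COMM_WORLD, &rank);
>     MPI_Comm_size(MPI_COMM_WORLD, &size);
>     if (2 > size) MPI_Abort(MPI_COMM_WORLD, 1);
>
>     /* even world ranks form group A, odd world ranks form group B */
>     MPI_Comm_split(MPI_COMM_WORLD, rank % 2, rank, &half);
>     /* remote leader is world rank 1 for group A, world rank 0 for group B */
>     MPI_Intercomm_create(half, 0, MPI_COMM_WORLD, (rank % 2) ? 0 : 1, 0, &inter);
>
>     if (0 == rank) {            /* rank 0 of group A */
>         b = 1;
>         MPI_Issend(&b, 1, MPI_INT, 0, 0, inter, &req);
>         MPI_Wait(&req, MPI_STATUS_IGNORE);
>     } else if (1 == rank) {     /* rank 0 of group B */
>         MPI_Irecv(&b, 1, MPI_INT, 0, 0, inter, &req);
>         MPI_Wait(&req, MPI_STATUS_IGNORE);
>     }
>
>     MPI_Comm_free(&inter);
>     MPI_Comm_free(&half);
>     MPI_Finalize();
>     return 0;
> }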
>
> diff --git a/intel_tests/src/MPI_Issend_rtoa_c.c b/intel_tests/src/MPI_Issend_rtoa_c.c
> index 8b26f84..b9a704b 100644
> --- a/intel_tests/src/MPI_Issend_rtoa_c.c
> +++ b/intel_tests/src/MPI_Issend_rtoa_c.c
> @@ -173,8 +177,9 @@ int main(int argc, char *argv[])
>
>      for (comm_count = 0; comm_count < MPITEST_num_comm_sizes();
>           comm_count++) {
>          comm_index = MPITEST_get_comm_index(comm_count);
>          comm_type = MPITEST_get_comm_type(comm_count);
> +        if (2 == comm_count) continue;
>
>          /*
> @@ -312,6 +330,9 @@ int main(int argc, char *argv[])
>           * left sub-communicator
>           */
>
> +        if (6 == comm_count && 12 == length_count && MPITEST_current_rank < 2) {
> +            /* insert a breakpoint here */
> +        }
>           * Reset a bunch of variables that will be set when we get our
>
>
>
> As a side note, which is very unlikely to be related to this issue, I
> noticed the following program works fine,
>
> though it is reasonable to expect a hang.
>
> The root cause is that MPI_Send uses the eager protocol, and though the
> communicators used by MPI_Send and MPI_Recv are different, they have the
> same (recycled) CID.
>
> FWIW, the test also completes with MPICH.
>
>
> If not already done, should we provide an option not to recycle CIDs?
>
> Or flush unexpected/unmatched messages when a communicator is freed?
>
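> (To illustrate the second option at user level only: one could drain
> unmatched messages with MPI_Iprobe before the free, as in the sketch below.
> MPI_Iprobe only sees messages that have already arrived locally, so this is
> an illustration of the idea rather than a complete fix.)
>
> #include <stdlib.h>
> #include <mpi.h>
>
> /* drain any messages still pending on comm, then free it */
> void drain_and_free(MPI_Comm *comm) {
>     int flag = 1;
>     MPI_Status st;
>
>     while (flag) {
>         MPI_Iprobe(MPI_ANY_SOURCE, MPI_ANY_TAG, *comm, &flag, &st);
>         if (flag) {
>             int count;
>             char *buf;
>             /* receive as raw bytes just to discard the payload;
>              * strict type matching would need the sender's datatype */
>             MPI_Get_count(&st, MPI_BYTE, &count);
>             buf = (char *)malloc(count > 0 ? count : 1);
>             MPI_Recv(buf, count, MPI_BYTE, st.MPI_SOURCE, st.MPI_TAG,
>                      *comm, MPI_STATUS_IGNORE);
>             free(buf);
>         }
>     }
>     MPI_Comm_free(comm);
> }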
>
> Cheers,
>
>
> Gilles
>
>
> #include <stdio.h>
> #include <mpi.h>
>
> /* send a message (eager mode) in a communicator, and then
>  * receive it in another communicator, but with the same CID
>  */
> int main(int argc, char *argv[]) {
>     int rank, size;
>     int b;
>     MPI_Comm comm;
>
>     MPI_Init(&argc, &argv);
>     MPI_Comm_rank(MPI_COMM_WORLD, &rank);
>     MPI_Comm_size(MPI_COMM_WORLD, &size);
>     if (2 > size) MPI_Abort(MPI_COMM_WORLD, 1);
>
>     MPI_Comm_dup(MPI_COMM_WORLD, &comm);
>     if (0 == rank) {
>         b = 0xbeef;   /* arbitrary sentinel value */
>         MPI_Send(&b, 1, MPI_INT, 1, 0, comm);
>     }
>     MPI_Comm_free(&comm);
>
>     MPI_Comm_dup(MPI_COMM_WORLD, &comm);
>     if (1 == rank) {
>         b = 0;
>         MPI_Recv(&b, 1, MPI_INT, 0, 0, comm, MPI_STATUS_IGNORE);
>         if (0xbeef != b) MPI_Abort(MPI_COMM_WORLD, 2);
>     }
>     MPI_Comm_free(&comm);
>
>     MPI_Finalize();
>
>     return 0;
> }
>
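> The reproducer can be built and run with the usual Open MPI wrappers, e.g.
> (any file name will do):
>
>     mpicc cid_reuse.c -o cid_reuse
>     mpirun -np 2 ./cid_reuse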
>
> On 9/6/2016 12:03 AM, Gilles Gouaillardet wrote:
>
> OK, I will double check tomorrow that this is the very same hang I fixed
> earlier.
>
> Cheers,
>
> Gilles
>
> On Monday, September 5, 2016, r...@open-mpi.org wrote:
>
>> I was just looking at the overnight MTT report, and these were present
>> going back a long way in both branches. They are in the Intel test suite.
>>
>> If you have already addressed them, then thanks!
>>
>> > On Sep 5, 2016, at 7:48 AM, Gilles Gouaillardet <gilles.gouaillar...@gmail.com> wrote:
>> >
>> > Ralph,
>> >
>> > I fixed a hang earlier today in master, and the PR for v2.x is at
>> > https://github.com/open-mpi/ompi-release/pull/1368
>> >
>> > Can you please make sure you are running the latest master?
>> >
>> > Which test suite do these tests come from?
>> > I will have a look tomorrow if the hang is still there
>> >
>> > Cheers,
>> >
>> > Gilles
>> >
>> > r...@open-mpi.org wrote:
>> >> Hey folks
>> >>
>> >> All of the tests that involve either ISsend_ator, SSend_ator,
>> >> ISsend_rtoa, or SSend_rtoa are hanging on master and v2.x. Does anyone know
>> >> what these tests do, and why we never seem to pass them?
>> >>
>> >> Do we care?
>> >> Ralph
>> >>
>> >> ___
>> >> devel mailing list
>> >> devel@lists.open-mpi.org
>> >> 

Re: [OMPI devel] Hanging tests

2016-09-06 Thread r...@open-mpi.org
FWIW: those tests hang for me with TCP (I don’t have openib on my cluster). 
I’ll check it with your change as well

