Re: [OMPI devel] btl/vader: osu_bibw hangs when the number of execution loops is increased

2017-12-11 Thread DERBEY, NADIA
Hi Nathan,

Actually, PR #3846 was enough to fix my hang: that fix went into 2.0.4 and we 
are still based on Open MPI 2.0.2...

FYI, I also tested intra-node osu/pt2pt and osu/collective with PR #4569 on top 
of PR #3846: I didn't notice any regression.

Regards



Re: [OMPI devel] btl/vader: osu_bibw hangs when the number of execution loops is increased

2017-12-11 Thread DERBEY, NADIA
Thanks Nathan, will keep you informed.

Regards


Re: [OMPI devel] btl/vader: osu_bibw hangs when the number of execution loops is increased

2017-12-05 Thread Nathan Hjelm
Should be fixed by PR #4569 (https://github.com/open-mpi/ompi/pull/4569). 
Please test and let me know.

-Nathan


[OMPI devel] btl/vader: osu_bibw hangs when the number of execution loops is increased

2017-12-01 Thread DERBEY, NADIA
Hi,

Our validation team detected a hang when running the osu_bibw 
micro-benchmark from the OMB 5.3 suite on Open MPI 2.0.2 (note that the 
same hang appears with Open MPI 3.0).
The hang occurs when running osu_bibw on a single node (vader BTL) with 
the options "-x 100 -i 1000".
The -x option sets the warm-up loop count.
The -i option sets the measured loop count.
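
For reference, the failing case is a plain two-rank, single-node run; the 
launcher arguments and benchmark path below are illustrative, not taken 
from our validation scripts:

    mpirun -np 2 ./mpi/pt2pt/osu_bibw -x 100 -i 1000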

For each exchanged message size, osu_bibw loops over the following 
sequence on both ranks (see the C sketch after the table below):
    . posts 64 non-blocking sends
    . posts 64 non-blocking receives
    . waits for all the send requests to complete
    . waits for all the receive requests to complete

The loop size is the sum of
    . options.skip (warm up phase that can be changed with the -x option)
    . options.loop (actually measured loop that can be changed with the 
-i option).

The default values are the following:

+==============+======+======+
| message size | skip | loop |
|==============+======+======|
|        <= 8K |   10 |  100 |
|        >  8K |    2 |   20 |
+==============+======+======+
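
To make the pattern concrete, here is a minimal, self-contained C sketch 
of that inner loop. This is not the OMB source: timing, result reporting 
and per-size buffer handling are omitted, and the window size, tag and 
variable names are only illustrative.

    /* sketch of the osu_bibw exchange pattern for one message size */
    #include <mpi.h>
    #include <stdlib.h>

    #define WINDOW 64                        /* requests posted per iteration */

    int main(int argc, char **argv)
    {
        int rank, peer, size = 8192;         /* one message size of the sweep */
        int skip = 100, loop = 1000;         /* the "-x 100 -i 1000" case     */
        MPI_Request sreq[WINDOW], rreq[WINDOW];

        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        peer = 1 - rank;                     /* exactly two ranks assumed     */

        char *s_buf = malloc(size), *r_buf = malloc(size);

        for (int i = 0; i < skip + loop; i++) {
            /* both ranks post a full window of sends, then of receives */
            for (int j = 0; j < WINDOW; j++)
                MPI_Isend(s_buf, size, MPI_CHAR, peer, 100, MPI_COMM_WORLD, &sreq[j]);
            for (int j = 0; j < WINDOW; j++)
                MPI_Irecv(r_buf, size, MPI_CHAR, peer, 100, MPI_COMM_WORLD, &rreq[j]);
            /* wait for the send window, then for the receive window */
            MPI_Waitall(WINDOW, sreq, MPI_STATUSES_IGNORE);
            MPI_Waitall(WINDOW, rreq, MPI_STATUSES_IGNORE);
        }

        free(s_buf); free(r_buf);
        MPI_Finalize();
        return 0;
    }

Each iteration keeps up to 64 sends and 64 receives in flight per rank, 
which is where the fragment demand discussed below comes from.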

As said above, the test hangs when moving to more aggressive loop 
values: 100 for skip and 1000 for loop.

mca_btl_vader_frag_alloc() calls opal_free_list_get() to get a fragment 
from the appropriate free list.
If there are no free fragments left, opal_free_list_get() calls 
opal_free_list_grow(), which in turn calls mca_btl_vader_frag_init() 
(the initialization routine for the vader BTL fragments).
This routine checks whether there is enough space left in the mapped 
memory segment for the requested fragment size (current offset + 
fragment size should be <= segment size), and it makes 
opal_free_list_grow() fail if the shared-memory segment is exhausted.
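
As an illustration only (this is not the actual btl_vader_frag.c code; 
the argument names below are assumptions, not the real structure 
members), the check amounts to:

    #include <stdbool.h>
    #include <stddef.h>

    /* can another fragment be carved out of the mapped shared-memory segment? */
    static bool frag_fits(size_t segment_offset,   /* space already handed out  */
                          size_t frag_size,        /* fragment being requested  */
                          size_t segment_size)     /* total mapped segment size */
    {
        /* when this returns false, opal_free_list_grow() fails and
         * opal_free_list_get() hands a NULL fragment back to the caller */
        return segment_offset + frag_size <= segment_size;
    }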

As soon as we begin exhausting the segment, the two ranks get out of 
sync and the test quickly hangs. To avoid this hang, I found two 
possible solutions:

1) Change the vader BTL segment size: I set it to 4 GB. To be able to 
do this, I had to change the type parameter in the MCA parameter 
registration to MCA_BASE_VAR_TYPE_SIZE_T.
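
With the parameter registered as a size_t (the original type cannot 
represent 4 GB, which is why the change was needed), the larger segment 
can then be requested at run time. Assuming the parameter keeps its 
usual btl_vader_segment_size name, something like:

    mpirun -np 2 --mca btl_vader_segment_size 4294967296 ./osu_bibw -x 100 -i 1000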

2) Replace the call to opal_free_list_get() with a call to 
opal_free_list_wait() in mca_btl_vader_frag_alloc(). This also makes 
the micro-benchmark run to completion.
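
The behavioural difference, as I understand it, is that 
opal_free_list_get() returns NULL as soon as the list is empty and 
cannot grow, whereas opal_free_list_wait() keeps progressing until an 
item is returned to the list. A minimal sketch of the change (simplified, 
not the actual btl_vader_frag.h code; it only compiles inside the Open 
MPI source tree):

    #include "opal/class/opal_free_list.h"
    #include "btl_vader_frag.h"

    static inline mca_btl_vader_frag_t *frag_alloc_sketch (opal_free_list_t *list)
    {
        /* current code: may return NULL once the shared-memory
         * segment backing the free list is exhausted */
        /* opal_free_list_item_t *item = opal_free_list_get (list); */

        /* proposed change (#2): block/progress until a fragment is
         * given back to the list by a completed operation */
        opal_free_list_item_t *item = opal_free_list_wait (list);

        return (mca_btl_vader_frag_t *) item;
    }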

So my question is: which would be the better approach (#1 or #2)? And 
the question behind it: what is the reason for favoring 
opal_free_list_get() over opal_free_list_wait()?

Thanks

-- 
Nadia Derbey - B1-387
HPC R - MPI
Tel: +33 4 76 29 77 62
nadia.der...@atos.net
1 Rue de Provence BP 208
38130 Echirolles Cedex, France
www.atos.com
___
devel mailing list
devel@lists.open-mpi.org
https://lists.open-mpi.org/mailman/listinfo/devel