Re: [OMPI devel] osu_mbw_mr error

2014-11-06 Thread Ralph Castain
> On Nov 6, 2014, at 1:39 PM, Nathan Hjelm wrote: > > On Thu, Nov 06, 2014 at 04:29:44PM -0500, Joshua Ladd wrote: >> On Thursday, November 6, 2014, Nathan Hjelm wrote: >> >> On Thu, Nov 06, 2014 at 04:06:23PM -0500, Joshua Ladd wrote: >>> Nathan, >>> Has this bug always been present

Re: [OMPI devel] osu_mbw_mr error

2014-11-06 Thread Nathan Hjelm
On Thu, Nov 06, 2014 at 04:29:44PM -0500, Joshua Ladd wrote: >On Thursday, November 6, 2014, Nathan Hjelm wrote: > > On Thu, Nov 06, 2014 at 04:06:23PM -0500, Joshua Ladd wrote: > >Nathan, > >Has this bug always been present in OpenIB or is this a recent > addition

Re: [OMPI devel] osu_mbw_mr error

2014-11-06 Thread Joshua Ladd
On Thursday, November 6, 2014, Nathan Hjelm wrote: > On Thu, Nov 06, 2014 at 04:06:23PM -0500, Joshua Ladd wrote: > >Nathan, > >Has this bug always been present in OpenIB or is this a recent > addition? > >If this is regression, I would also be inclined to say that this is a > > The b

Re: [OMPI devel] osu_mbw_mr error

2014-11-06 Thread Nathan Hjelm
On Thu, Nov 06, 2014 at 04:06:23PM -0500, Joshua Ladd wrote:
> Nathan,
> Has this bug always been present in OpenIB or is this a recent addition?
> If this is regression, I would also be inclined to say that this is a

The bug is as old as the message coalescing feature in the openib btl.

[OMPI devel] osu_mbw_mr error

2014-11-06 Thread Joshua Ladd
Nathan, Has this bug always been present in OpenIB, or is this a recent addition? If this is a regression, I would also be inclined to say that this is a blocker for 1.8.4. This is a SIGNIFICANT bug. Both Howard and I were quite surprised that all the while this code has been in use at LANL in produc

Re: [OMPI devel] osu_mbw_mr error

2014-11-04 Thread Joshua Ladd
Thanks, Nathan. After a bit more investigation yesterday, this was our conclusion too: it is a longstanding bug in the OpenIB BTL, and we just happened to start triggering the broken flow with some recent changes made to the default max_lmc parameter. Let us know if you need anything from our end. Jos

Re: [OMPI devel] osu_mbw_mr error

2014-11-03 Thread Nathan Hjelm
I see the problem. The openib btl does not properly handle the following call sequence (this is an openib btl bug IMHO):

    btl_sendi (..., &descriptor);
    btl_free (..., descriptor);

The bug is in the message coalescing code and it looks like extra logic needs to be added to the openib btl's btl_fre

Re: [OMPI devel] osu_mbw_mr error

2014-11-03 Thread Ralph Castain
Can you please let me know when you fix this? I intend to release 1.8.4 by the end of the week. Since Mellanox is the only member with IB, you folks have been maintaining this BTL.

> On Nov 3, 2014, at 6:26 AM, Alina Sklarevich wrote:
>
> Hi,
>
> On 1.8.4rc1 we observe the following asser

[OMPI devel] osu_mbw_mr error

2014-11-03 Thread Alina Sklarevich
Hi, On 1.8.4rc1 we observe the following assert in the osu_mbw_mr test when using the openib BTL. When compiled in production mode (i.e. no --enable-debug) the test simply hangs. When using either the tcp BTL or the cm PML, the benchmark completes without error. The command line to reproduce th