Re: [OMPI devel] Time to remove the openib btl? (Was: Users/Eager RDMA causing slow osu_bibw with 3.0.0)

2018-04-11 Thread Pavel Shamis
Personally, I haven't tried iWARP; I just don't have this HW. As far as I
can remember, one of the "special" requirements for iWARP NICs is RDMACM.
UCT does have support for RDMACM, but somebody has to try it and see
whether there are any other gaps.
P.
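
A quick way to check what a given UCX build exposes is the ucx_info utility that
ships with UCX; whether RDMACM and the expected iWARP devices show up depends on
the UCX version and how it was configured, so treat this as an illustrative check
rather than a definitive test:

  # Dump the transports and devices this UCX build detects; look for rdmacm
  # and for the RNIC devices you expect to use.
  ucx_info -d | grep -iE 'transport|device'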

On Fri, Apr 6, 2018 at 8:17 AM, Jeff Squyres (jsquyres) wrote:

> Does UCT support iWARP?

Re: [OMPI devel] Time to remove the openib btl? (Was: Users/Eager RDMA causing slow osu_bibw with 3.0.0)

2018-04-06 Thread Jeff Squyres (jsquyres)
Does UCT support iWARP?


Re: [OMPI devel] Time to remove the openib btl? (Was: Users/Eager RDMA causing slow osu_bibw with 3.0.0)

2018-04-06 Thread Jeff Squyres (jsquyres)
That would be sweet.  Are you aiming for v4.0.0, perchance?

I.e., should we have "remove openib BTL / add uct BTL" to the target feature 
list for v4.0.0?

(I did laugh at the Kylo name, but I really, really don't want to keep
propagating the idea of meaningless names for BTLs -- especially since BTL names,
more so than most other component names, are visible to the user.  ...unless
someone wants to finally finish the ideas and implement the whole
network-transport-name system that we've talked about for a few years... :-) )


Re: [OMPI devel] Time to remove the openib btl? (Was: Users/Eager RDMA causing slow osu_bibw with 3.0.0)

2018-04-05 Thread Nathan Hjelm

As soon as the uct btl is in place I have no use for the openib btl myself. It
is a pain to maintain and I do not have the time to mess with it. From an OSC
perspective the uct btl offers much better performance with none of the
headache.

-Nathan


Re: [OMPI devel] Time to remove the openib btl? (Was: Users/Eager RDMA causing slow osu_bibw with 3.0.0)

2018-04-05 Thread Thananon Patinyasakdikul
Just more information to help with the decision:

I am working on Nathan’s uct btl to make it work with ob1 and InfiniBand. So
this could be a replacement for openib, and honestly we should totally call this
new uct btl Kylo.

Arm
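
For anyone who wants to experiment once that work lands, forcing ob1 onto the new
component would presumably look something like the following; the component name
uct comes from Arm's description above, while the surrounding options (the self
and vader BTLs, the host list) are only illustrative:

  # Hypothetical run once the uct btl is available: force the ob1 PML and
  # restrict the BTL list to self, shared memory (vader), and uct.
  mpirun --mca pml ob1 --mca btl self,vader,uct -map-by ppr:1:node -np 2 -H r6,r7 ./osu_bw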


[OMPI devel] Time to remove the openib btl? (Was: Users/Eager RDMA causing slow osu_bibw with 3.0.0)

2018-04-05 Thread Jeff Squyres (jsquyres)
Below is an email exchange from the users mailing list.

I'm moving this over to devel to talk among the developer community.

Multiple times recently on the users list, we've told people with problems with 
the openib BTL that they should be using UCX (per Mellanox's publicly-stated 
support positions).

Is it time to deprecate / print warning messages / remove the openib BTL?



Begin forwarded message:

From: Nathan Hjelm
Subject: Re: [OMPI users] Eager RDMA causing slow osu_bibw with 3.0.0
Date: April 5, 2018 at 12:48:08 PM EDT
To: Open MPI Users
Cc: Open MPI Users
Reply-To: Open MPI Users


Honestly, this is a configuration issue with the openib btl. There is no reason
to keep eager RDMA, nor is there a reason to pipeline RDMA. I haven't found an
app where either of these "features" helps with InfiniBand. You have the right
idea with the parameter changes, but Howard is correct: for Mellanox the future
is UCX, not verbs. I would try it and see if it works for you, but if it doesn't,
I would set those two parameters in your /etc/openmpi-mca-params.conf and run
like that.

-Nathan
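
The two parameters Nathan refers to are the ones used in Ben's runs below; making
them persistent would look roughly like this in /etc/openmpi-mca-params.conf (the
values are taken from the mpirun command lines in this thread and are shown only
as an illustration):

  # /etc/openmpi-mca-params.conf -- disable eager RDMA and push the RDMA
  # pipeline threshold up to 4 MB, per the workaround discussed in this thread.
  btl_openib_use_eager_rdma = 0
  btl_openib_min_rdma_pipeline_size = 4194304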

On Apr 05, 2018, at 01:18 AM, Ben Menadue wrote:

Hi,

Another interesting point. I noticed that the last two message sizes tested 
(2MB and 4MB) are lower than expected for both osu_bw and osu_bibw. Increasing 
the minimum size to use the RDMA pipeline to above these sizes brings those two 
data-points up to scratch for both benchmarks:

3.0.0, osu_bw, no rdma for large messages

> mpirun -mca btl_openib_min_rdma_pipeline_size 4194304 -map-by ppr:1:node -np 
> 2 -H r6,r7 ./osu_bw -m 2097152:4194304
# OSU MPI Bi-Directional Bandwidth Test v5.4.0
# Size  Bandwidth (MB/s)
2097152  6133.22
4194304  6054.06

3.0.0, osu_bibw, eager rdma disabled, no rdma for large messages

> mpirun -mca btl_openib_min_rdma_pipeline_size 4194304 -mca 
> btl_openib_use_eager_rdma 0 -map-by ppr:1:node -np 2 -H r6,r7 ./osu_bibw -m 
> 2097152:4194304
# OSU MPI Bi-Directional Bandwidth Test v5.4.0
# Size  Bandwidth (MB/s)
2097152 11397.85
4194304 11389.64

This makes me think something odd is going on in the RDMA pipeline.

Cheers,
Ben



On 5 Apr 2018, at 5:03 pm, Ben Menadue wrote:
Hi,

We’ve just been running some OSU benchmarks with OpenMPI 3.0.0 and noticed that 
osu_bibw gives nowhere near the bandwidth I’d expect (this is on FDR IB). 
However, osu_bw is fine.

If I disable eager RDMA, then osu_bibw gives the expected numbers. Similarly, 
if I increase the number of eager RDMA buffers, it gives the expected results.
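
For reference, the first of those two changes is the btl_openib_use_eager_rdma
flag shown in the runs quoted above; the name of the parameter that controls the
number of eager RDMA buffers is not shown anywhere in this thread, so the sketch
below simply uses ompi_info to look it up rather than guessing at it:

  # Same flag as in the osu_bibw runs quoted above: disable eager RDMA for one run.
  mpirun --mca btl_openib_use_eager_rdma 0 -map-by ppr:1:node -np 2 -H r6,r7 ./osu_bibw
  # List the openib BTL's eager-RDMA-related parameters (including the
  # buffer-count knob) to find the exact name to raise.
  ompi_info --param btl openib --level 9 | grep eager_rdma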

OpenMPI 1.10.7 gives consistent, reasonable numbers with default settings, but
they’re not as good as 3.0.0 (when tuned) for large buffers. The same option
changes make no difference to the performance for 1.10.7.

I was wondering if anyone else has noticed anything similar, and if this is 
unexpected, if anyone has a suggestion on how to investigate further?

Thanks,
Ben


Here are the numbers:

3.0.0, osu_bw, default settings

> mpirun -map-by ppr:1:node -np 2 -H r6,r7 ./osu_bw
# OSU MPI Bandwidth Test v5.4.0
# Size      Bandwidth (MB/s)
1                       1.13
2                       2.29
4                       4.63
8                       9.21
16                     18.18
32                     36.46
64                     69.95
128                   128.55
256                   250.74
512                   451.54
1024                  829.44
2048                 1475.87
4096                 2119.99
8192                 3452.37
16384                2866.51
32768                4048.17
65536                5030.54
131072               5573.81
262144               5861.61
524288               6015.15
1048576              6099.46
2097152               989.82
4194304               989.81

3.0.0, osu_bibw, default settings

> mpirun -map-by ppr:1:node -np 2 -H r6,r7 ./osu_bibw
# OSU MPI Bi-Directional Bandwidth Test v5.4.0
# Size      Bandwidth (MB/s)
1                       0.00
2                       0.01
4                       0.01
8                       0.02
16                      0.04
32                      0.09
64                      0.16
128                   135.30
256                   265.35
512                   499.92
1024                  949.22
2048                 1440.27
4096                 1960.09
8192                 3166.97
16384                 127.62
32768                 165.12
65536                 312.80
131072               1120.03
262144               4724.01
524288               4545.93
1048576              5186.51
2097152               989.84
4194304               989.88

3.0.0, osu_bibw, eager