Re: [OMPI users] An equivalent to btl_openib_include_if when MXM over Infiniband ?

2016-08-22 Thread Audet, Martin
Hi Devendar,

Thanks again for your answer.

I searched a little and found that UD stands for "Unreliable Datagram" while
RC stands for "Reliable Connected" transport. I also found another one called
DC, for "Dynamically Connected", which is not supported on our HCA.

Do you know what the basic difference between them is?

I didn't find any information about this.

Which one is used by btl=openib (ibverbs)? Is it RC?

Also, are they all standard, or are some of them supported only by Mellanox?

I will try to convince the admin of the system I'm using to increase the
maximal shared segment size (SHMMAX). I guess what we have (i.e. 32 MB) is the
default. But I didn't find any document suggesting that we should increase
SHMMAX to help MXM. This is a bit odd; if it's important, it should at least
be mentioned in the Mellanox documentation.
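
If it helps, what I plan to ask for is something along these lines (just a
sketch; the 256 MB value is an example I picked, not a Mellanox
recommendation):

  # as root, raise the maximum SysV shared memory segment size (example: 256 MB)
  sysctl -w kernel.shmmax=268435456
  # make the change persistent across reboots
  echo "kernel.shmmax = 268435456" >> /etc/sysctl.conf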

I will certainly check the message rate benchmark osu_mbw_mr to see if its
results are improved by MXM.
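
I plan to run it roughly like this (a sketch; the rank counts and the path to
the OSU binary are just examples):

  # 16 ranks per node on 2 nodes, so sender/receiver pairs cross the network
  mpirun -np 32 -npernode 16 --mca pml yalla \
         -x MXM_RDMA_PORTS=mlx4_0:1 ./osu_mbw_mr
  # same measurement with the plain openib btl for comparison
  mpirun -np 32 -npernode 16 --mca pml ob1 --mca btl openib,vader,self \
         --mca btl_openib_include_if mlx4_0 ./osu_mbw_mr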

After looking at the MPI performance results published at your URL (e.g.
latencies around 1 us in native mode), I'm more and more convinced that our
results are suboptimal.

And after seeing the impact of SR-IOV shown at that URL, I suspect more and
more that our mediocre latency is caused by this mechanism.

But our cluster is different: SR-IOV is not used in the context of virtual
machines running under a host VMM; it is used with Linux LXC containers.


Martin Audet


> Hi Martin
>
> The MXM default transport is UD (MXM_TLS=ud,shm,self), which is scalable
> when running large applications.  RC (MXM_TLS=rc,shm,self) is recommended
> for microbenchmarks and very small scale applications.
>
> Yes, the max seg size setting is too small.
>
> Did you check any message rate benchmarks (like osu_mbw_mr) with MXM?
>
> A virtualization environment will have some overhead.  See a performance
> comparison with MVAPICH here:
> http://mvapich.cse.ohio-state.edu/performance/v-pt_to_pt/




Re: [OMPI users] An equivalent to btl_openib_include_if when MXM over Infiniband ?

2016-08-19 Thread Deva
Hi Martin

The MXM default transport is UD (MXM_TLS=ud,shm,self), which is scalable when
running large applications.  RC (MXM_TLS=rc,shm,self) is recommended for
microbenchmarks and very small scale applications.

Yes, the max seg size setting is too small.

Did you check any message rate benchmarks (like osu_mbw_mr) with MXM?

A virtualization environment will have some overhead.  See a performance
comparison with MVAPICH here:
http://mvapich.cse.ohio-state.edu/performance/v-pt_to_pt/





On Fri, Aug 19, 2016 at 3:11 PM, Audet, Martin wrote:

> Setting MXM_TLS=rc,shm,self does improve the speed of MXM (both latency
> and bandwidth):
> [...]



-- 


-Devendar

Re: [OMPI users] An equivalent to btl_openib_include_if when MXM over Infiniband ?

2016-08-19 Thread Audet, Martin
Hi Devendar,

Thank you for your answer.

Setting MXM_TLS=rc,shm,self does improve the speed of MXM (both latency and 
bandwidth):

 without MXM_TLS

 comm     lat_min   bw_max     bw_max
          pingpong  pingpong   sendrecv
          (us)      (MB/s)     (MB/s)
 -------------------------------------
 openib   1.79      5827.93    11552.4
 mxm      2.23      5191.77     8201.76
 yalla    2.18      5200.55     8109.48


 with MXM_TLS=rc,shm,self

 comm     lat_min   bw_max     bw_max
          pingpong  pingpong   sendrecv
          (us)      (MB/s)     (MB/s)
 -------------------------------------
 openib   1.79      6021.83    11529
 mxm      1.78      5936.92    11168.5
 yalla    1.78      5944.86    11375


Note 1: MXM_RDMA_PORTS=mlx4_0:1 and the MCA parameter
btl_openib_include_if=mlx4_0 were set in both cases.
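
In other words, the mxm and yalla runs with MXM_TLS were launched with command
lines roughly like this (a sketch; the host names and the IMB binary path are
placeholders):

  mpirun -np 2 -H node01,node02 --mca pml yalla --mca btl vader,self \
         -x MXM_RDMA_PORTS=mlx4_0:1 -x MXM_TLS=rc,shm,self \
         ./IMB-MPI1 PingPong Sendrecv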

Note 2: The reported bandwidths are not very accurate; results can easily vary
by 7% from one run to another.

We see that the performance of MXM is now very similar to the performance of 
openib for these IMB tests.

However, an error is now reported a few times when MXM_TLS is set:

sys.c:468  MXM  ERROR A new segment was to be created and size < SHMMIN or
size > SHMMAX, or the new segment was to be created. A segment with given key
existed, but size is greater than the size of that segment. Please check
limits by 'ipcs -l'.

"ipcs -l" reports among other things that:

  max seg size (kbytes) = 32768

By the way, is that too small?


Now if we run /opt/mellanox/mxm/mxm_perftest we get:

                                        without   with
                                        MXM_TLS   MXM_TLS
                                        -------   -------
  avg send_lat (us)                     1.626     1.321

  avg send_bw -s 400 (MB/s)             5219.51   5514.04
  avg bidir send_bw -s 400 -b (MB/s)    5283.13   5514.45

Note: the -b option for bidirectional bandwidth doesn't seem to affect the result.
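
Roughly, the invocation looks like this (a sketch from memory; the exact
option syntax may differ and "server-node" is a placeholder host name):

  # on the server node (MXM_TLS omitted for the "without" column)
  export MXM_TLS=rc,shm,self
  mxm_perftest

  # on the client node
  export MXM_TLS=rc,shm,self
  mxm_perftest server-node -t send_lat
  mxm_perftest server-node -t send_bw -s 400
  mxm_perftest server-node -t send_bw -s 400 -b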

Again, it is an improvement in terms of both latency and bandwidth.

However, a warning is reported on the server side when MXM_TLS is set and the
send_lat test is run:

 icb_ep.c:287   MXM  WARN  The min value for CIB_RX_QUEUE_LEN is 2048.

Note: setting the undocumented env variable MXM_CIB_RX_QUEUE_LEN=2048 removes
the warning but doesn't affect the send latency.


 * * *

So now the results are better: MXM performs as well as the regular openib in
terms of latency and bandwidth (I didn't check the overlap capability though).
But I'm not really impressed. I was expecting MXM (especially when used by
yalla) to be a little better than openib. Also, the latency of openib, mxm and
yalla, at 1.8 us, seems too high. With a configuration like ours, we should
get something closer to 1 us.

Does anyone have an idea?

Don't forget that this cluster uses LXC containers with SR-IOV enabled for the 
Infiniband adapter.

Martin Audet


> Hi Martin,
>
> Can you check if it is any better with  "-x MXM_TLS=rc,shm,self" ?
>
> -Devendar



Re: [OMPI users] An equivalent to btl_openib_include_if when MXM over Infiniband ?

2016-08-19 Thread Deva
Hi Martin,

Can you check if it is any better with  "-x MXM_TLS=rc,shm,self" ?

-Devendar


On Tue, Aug 16, 2016 at 11:28 AM, Audet, Martin wrote:

> Thanks for your reply. I did try setting MXM_RDMA_PORTS=mlx4_0:1 for all
> my MPI processes and it did improve performance, but the performance I
> obtain isn't completely satisfying.
> [...]



-- 


-Devendar

Re: [OMPI users] An equivalent to btl_openib_include_if when MXM over Infiniband ?

2016-08-18 Thread Dave Love
"Audet, Martin"  writes:

> Hi Josh,
>
> Thanks for your reply. I did try setting MXM_RDMA_PORTS=mlx4_0:1 for all my
> MPI processes and it did improve performance, but the performance I obtain
> isn't completely satisfying.

I raised the issue of MXM hurting p2p latency here a while ago, but
don't have a solution.  Mellanox were here last week and promised to
address that, but I haven't heard back.  I get the impression this stuff
isn't widely used, and since it's proprietary, unlike PSM, we can't
really investigate.


Re: [OMPI users] An equivalent to btl_openib_include_if when MXM over Infiniband ?

2016-08-16 Thread Audet, Martin
Hi Josh,

Thanks for your reply. I did try setting MXM_RDMA_PORTS=mlx4_0:1 for all my
MPI processes and it did improve performance, but the performance I obtain
isn't completely satisfying.

When I run the IMB 4.1 pingpong and sendrecv benchmarks between two nodes
using Open MPI 1.10.3, I get:

 without MXM_RDMA_PORTS

   comm     lat_min   bw_max     bw_max
            pingpong  pingpong   sendrecv
            (us)      (MB/s)     (MB/s)
   -------------------------------------
   openib   1.79      5947.07    11534
   mxm      2.51      5166.96     8079.18
   yalla    2.47      5167.29     8278.15


 with MXM_RDMA_PORTS=mlx4_0:1

   comm     lat_min   bw_max     bw_max
            pingpong  pingpong   sendrecv
            (us)      (MB/s)     (MB/s)
   -------------------------------------
   openib   1.79      5827.93    11552.4
   mxm      2.23      5191.77     8201.76
   yalla    2.18      5200.55     8109.48


openib means: pml=ob1        btl=openib,vader,self   btl_openib_include_if=mlx4_0
mxm    means: pml=cm,ob1     mtl=mxm   btl=vader,self
yalla  means: pml=yalla,ob1  btl=vader,self
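
Concretely, the three configurations correspond to command lines along these
lines (a sketch; the host names and the path to the IMB binary are
placeholders):

  # openib
  mpirun -np 2 -H node01,node02 --mca pml ob1 \
         --mca btl openib,vader,self --mca btl_openib_include_if mlx4_0 \
         ./IMB-MPI1 PingPong Sendrecv
  # mxm
  mpirun -np 2 -H node01,node02 --mca pml cm,ob1 --mca mtl mxm \
         --mca btl vader,self -x MXM_RDMA_PORTS=mlx4_0:1 \
         ./IMB-MPI1 PingPong Sendrecv
  # yalla
  mpirun -np 2 -H node01,node02 --mca pml yalla,ob1 --mca btl vader,self \
         -x MXM_RDMA_PORTS=mlx4_0:1 ./IMB-MPI1 PingPong Sendrecv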

lspci reports for our FDR InfiniBand HCA:
  Infiniband controller: Mellanox Technologies MT27500 Family [ConnectX-3]

and 16 lines like:
  Infiniband controller: Mellanox Technologies MT27500/MT27520 Family
  [ConnectX-3/ConnectX-3 Pro Virtual Function]

The nodes have two octa-core Xeon E5-2650 v2 (Ivy Bridge-EP, 2.67 GHz) sockets.

ofed_info reports that mxm version is 3.4.3cce223-0.32200

As you can see, the results are not very good. I would expect mxm and yalla to
perform better than openib in terms of both latency and bandwidth (note: the
sendrecv bandwidth is full duplex). I would also expect the yalla latency to
be around 1.1 us, as shown here:
https://www.open-mpi.org/papers/sc-2014/Open-MPI-SC14-BOF.pdf (page 33).

I also ran mxm_perftest (located in /opt/mellanox/bin) and it reports the
following latency between two nodes:

 without MXM_RDMA_PORTS              1.92 us
 with    MXM_RDMA_PORTS=mlx4_0:1     1.65 us
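
(Roughly: mxm_perftest is started with no argument on the server node and
pointed at that node from the client; the option syntax below is from memory
and "server-node" is a placeholder.)

  # server node
  MXM_RDMA_PORTS=mlx4_0:1 mxm_perftest
  # client node
  MXM_RDMA_PORTS=mlx4_0:1 mxm_perftest server-node -t send_lat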

Again, I think we can expect better latency with our configuration; 1.65 us is
not a very good result.

Note however that the 0.27 us reduction in raw mxm latency (1.92 - 1.65 =
0.27) corresponds to the reduction in the Open MPI latencies observed above
with mxm (2.51 - 2.23 = 0.28) and yalla (2.47 - 2.18 = 0.29).

Another detail: everything runs inside LXC containers, and SR-IOV is probably
used.

Does anyone have any idea what's wrong with our cluster?

Martin Audet


> Hi, Martin
>
> The environment variable:
>
> MXM_RDMA_PORTS=device:port
>
> is what you're looking for. You can specify a device/port pair on your OMPI
> command line like:
>
> mpirun -np 2 ... -x MXM_RDMA_PORTS=mlx4_0:1 ...
>
>
> Best,
>
> Josh


Re: [OMPI users] An equivalent to btl_openib_include_if when MXM over Infiniband ?

2016-08-15 Thread Joshua Ladd
Hi, Martin

The environment variable:

MXM_RDMA_PORTS=device:port

is what you're looking for. You can specify a device/port pair on your OMPI
command line like:

mpirun -np 2 ... -x MXM_RDMA_PORTS=mlx4_0:1 ...


Best,

Josh

On Fri, Aug 12, 2016 at 5:03 PM, Audet, Martin wrote:

> Is there an equivalent to the MCA parameter btl_openib_include_if when
> using MXM over Infiniband (e.g. either (pml=cm  mtl=mxm) or (pml=yalla))?
> [...]

[OMPI users] An equivalent to btl_openib_include_if when MXM over Infiniband ?

2016-08-12 Thread Audet, Martin
Hi OMPI_Users && OMPI_Developers,

Is there an equivalent to the MCA parameter btl_openib_include_if when using
MXM over Infiniband (e.g. either (pml=cm  mtl=mxm) or (pml=yalla))?

I ask this question because I'm working on a cluster where LXC containers are
used on the compute nodes (with SR-IOV, I think) and multiple mlx4 interfaces
are reported by lstopo (e.g. mlx4_0, mlx4_1, ..., mlx4_16) even though a
single physical Mellanox ConnectX-3 HCA is present per node.

I found that when I use the plain openib btl (e.g. (pml=ob1  btl=openib)), it
is much faster if I specify the MCA parameter btl_openib_include_if=mlx4_0 to
force Open MPI to use a single interface. By doing that, the latency is lower
and the bandwidth higher. I guess it is because otherwise Open MPI makes a
mess of things by trying to use all the "virtual" interfaces at once.
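
Concretely, I mean something like this (a sketch; the executable name is a
placeholder):

  mpirun -np 2 --mca pml ob1 --mca btl openib,vader,self \
         --mca btl_openib_include_if mlx4_0 ./my_mpi_app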

However, we all know that MXM is better than plain openib since it allows the
HCAs to perform message matching, transfer messages in the background and
provide communication progress.

So in this case, is there a way to use only mlx4_0?

I mean when using the mxm mtl (pml=cm  mtl=mxm) or, preferably, when using it
more directly via the yalla pml (pml=yalla).

Note that I'm using Open MPI 1.10.3, which I compiled myself, for now, but I
could use Open MPI 2.0 instead if necessary.

Thanks,

Martin Audet

___
users mailing list
users@lists.open-mpi.org
https://rfd.newmexicoconsortium.org/mailman/listinfo/users