Re: [OMPI users] An equivalent to btl_openib_include_if when MXM over Infiniband ?
Hi Devendar,

Thank you again for your answer.

I searched a little and found that UD stands for "Unreliable Datagram" while RC is the "Reliable Connected" transport mechanism. I also found another one, DC for "Dynamically Connected", which is not supported on our HCA. Do you know what the basic difference between them is? I didn't find any information about this. Which one is used by btl=openib (ibverbs)? Is it RC? Also, are they all standard, or are some of them supported only by Mellanox?

I will try to convince the admin of the system I'm using to increase the maximal shared segment size (SHMMAX). I guess what we have (e.g. 32 MB) is the default. But I didn't find any document suggesting that we should increase SHMMAX to help MXM. This is a bit odd; if it's important, it should be mentioned at least in the Mellanox documentation.

I will certainly check the message rate benchmark osu_mbw_mr to see whether its results are improved by MXM.

After looking at the MPI performance results published at the URL you gave (e.g. latencies around 1 us in native mode), I'm more and more convinced that our results are suboptimal. And after seeing the impact of SR-IOV shown at that URL, I suspect more and more that our mediocre latency is caused by this mechanism. But our cluster is different: SR-IOV is not used for virtual machines running under a host VMM; it is used with Linux LXC containers.

Martin Audet

> Hi Martin
>
> MXM default transport is UD (MXM_TLS=ud,shm,self), which is scalable when
> running with large applications. RC (MXM_TLS=rc,shm,self) is recommended
> for microbenchmarks and very small scale applications.
>
> Yes, the max seg size setting is too small.
>
> Did you check any message rate benchmarks (like osu_mbw_mr) with MXM?
>
> A virtualization env will have some overhead. See a performance comparison
> with MVAPICH here:
> http://mvapich.cse.ohio-state.edu/performance/v-pt_to_pt/
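A minimal sketch of how the SHMMAX check/increase and the osu_mbw_mr run discussed above could look; the 64 MB value, the --map-by option and the path to the OSU benchmark binary are illustrative assumptions, not recommendations from this thread:

  # current System V shared memory limits ("max seg size" is SHMMAX):
  ipcs -l
  sysctl kernel.shmmax

  # raise SHMMAX (value in bytes; 64 MB is only an example):
  sudo sysctl -w kernel.shmmax=67108864
  # persist it by adding "kernel.shmmax = 67108864" to /etc/sysctl.conf

  # message rate benchmark over MXM, one process per node:
  mpirun -np 2 --map-by node -mca pml cm -mca mtl mxm \
         -x MXM_TLS=rc,shm,self ./osu_mbw_mr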
Re: [OMPI users] An equivalent to btl_openib_include_if when MXM over Infiniband ?
Hi Martin

MXM default transport is UD (MXM_TLS=ud,shm,self), which is scalable when running with large applications. RC (MXM_TLS=rc,shm,self) is recommended for microbenchmarks and very small scale applications.

Yes, the max seg size setting is too small.

Did you check any message rate benchmarks (like osu_mbw_mr) with MXM?

A virtualization env will have some overhead. See a performance comparison with MVAPICH here:
http://mvapich.cse.ohio-state.edu/performance/v-pt_to_pt/

On Fri, Aug 19, 2016 at 3:11 PM, Audet, Martin wrote:

> Hi Devendar,
>
> Thank you for your answer.
>
> Setting MXM_TLS=rc,shm,self does improve the speed of MXM (both latency
> and bandwidth):
>
> without MXM_TLS
>
>   comm     lat_min    bw_max     bw_max
>            pingpong   pingpong   sendrecv
>            (us)       (MB/s)     (MB/s)
>   ---------------------------------------
>   openib   1.79       5827.93    11552.4
>   mxm      2.23       5191.77    8201.76
>   yalla    2.18       5200.55    8109.48
>
> with MXM_TLS=rc,shm,self
>
>   comm     lat_min    bw_max     bw_max
>            pingpong   pingpong   sendrecv
>            (us)       (MB/s)     (MB/s)
>   ---------------------------------------
>   openib   1.79       6021.83    11529
>   mxm      1.78       5936.92    11168.5
>   yalla    1.78       5944.86    11375
>
> Note 1: MXM_RDMA_PORTS=mlx4_0:1 and the MCA parameter
> btl_openib_include_if=mlx4_0 were set in both cases.
>
> Note 2: The bandwidths reported are not very accurate. Bandwidth results
> can easily vary by 7% from one run to another.
>
> We see that the performance of MXM is now very similar to the performance
> of openib for these IMB tests.
>
> However, an error is now reported a few times when MXM_TLS is set:
>
>   sys.c:468 MXM ERROR A new segment was to be created and size < SHMMIN
>   or size > SHMMAX, or the new segment was to be created. A segment with
>   given key existed, but size is greater than the size of that segment.
>   Please check limits by 'ipcs -l'.
>
> "ipcs -l" reports, among other things, that:
>
>   max seg size (kbytes) = 32768
>
> By the way, is that too small?
>
> Now if we run /opt/mellanox/mxm/mxm_perftest we get:
>
>                                         without    with
>                                         MXM_TLS    MXM_TLS
>
>   avg send_lat (us)                     1.626      1.321
>   avg send_bw -s 400 (MB/s)             5219.51    5514.04
>   avg bidir send_bw -s 400 -b (MB/s)    5283.13    5514.45
>
> Note: the -b for bidirectional bandwidth doesn't seem to affect the result.
>
> Again, it is an improvement both in terms of latency and bandwidth.
>
> However, a warning is reported on the server side when MXM_TLS is set and
> the send_lat test is run:
>
>   icb_ep.c:287 MXM WARN The min value for CIB_RX_QUEUE_LEN is 2048.
>
> Note: setting the undocumented env variable MXM_CIB_RX_QUEUE_LEN=2048
> removes the warning but doesn't affect the send latency.
>
> * * *
>
> So now the results are better: MXM performs as well as the regular openib
> in terms of latency and bandwidth (I didn't check the overlap capability
> though). But I'm not really impressed. I was expecting MXM (especially
> when used by yalla) to be a little better than openib. Also, the latency
> of openib, mxm and yalla, at 1.8 us, seems too high. With a configuration
> like ours, we should get something closer to 1 us.
>
> Does anyone have an idea?
>
> Don't forget that this cluster uses LXC containers with SR-IOV enabled for
> the Infiniband adapter.
>
> Martin Audet
>
> > Hi Martin,
> >
> > Can you check if it is any better with "-x MXM_TLS=rc,shm,self" ?
> >
> > -Devendar

--
-Devendar
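To make the UD/RC distinction above concrete, a sketch of how the transport could be selected per run from the mpirun command line; the application name, process counts and --map-by option are placeholders:

  # default (UD) transport, the scalable choice for large jobs:
  mpirun -np 1024 -mca pml yalla ./my_app

  # RC transport, as recommended above for micro-benchmarks / small scale:
  mpirun -np 2 --map-by node -mca pml yalla \
         -x MXM_TLS=rc,shm,self ./IMB-MPI1 PingPong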
Re: [OMPI users] An equivalent to btl_openib_include_if when MXM over Infiniband ?
Hi Devendar,

Thank you for your answer.

Setting MXM_TLS=rc,shm,self does improve the speed of MXM (both latency and bandwidth):

without MXM_TLS

  comm     lat_min    bw_max     bw_max
           pingpong   pingpong   sendrecv
           (us)       (MB/s)     (MB/s)
  ---------------------------------------
  openib   1.79       5827.93    11552.4
  mxm      2.23       5191.77    8201.76
  yalla    2.18       5200.55    8109.48

with MXM_TLS=rc,shm,self

  comm     lat_min    bw_max     bw_max
           pingpong   pingpong   sendrecv
           (us)       (MB/s)     (MB/s)
  ---------------------------------------
  openib   1.79       6021.83    11529
  mxm      1.78       5936.92    11168.5
  yalla    1.78       5944.86    11375

Note 1: MXM_RDMA_PORTS=mlx4_0:1 and the MCA parameter btl_openib_include_if=mlx4_0 were set in both cases.

Note 2: The bandwidths reported are not very accurate. Bandwidth results can easily vary by 7% from one run to another.

We see that the performance of MXM is now very similar to the performance of openib for these IMB tests.

However, an error is now reported a few times when MXM_TLS is set:

  sys.c:468 MXM ERROR A new segment was to be created and size < SHMMIN or size > SHMMAX, or the new segment was to be created. A segment with given key existed, but size is greater than the size of that segment. Please check limits by 'ipcs -l'.

"ipcs -l" reports, among other things, that:

  max seg size (kbytes) = 32768

By the way, is that too small?

Now if we run /opt/mellanox/mxm/mxm_perftest we get:

                                        without    with
                                        MXM_TLS    MXM_TLS

  avg send_lat (us)                     1.626      1.321
  avg send_bw -s 400 (MB/s)             5219.51    5514.04
  avg bidir send_bw -s 400 -b (MB/s)    5283.13    5514.45

Note: the -b for bidirectional bandwidth doesn't seem to affect the result.

Again, it is an improvement both in terms of latency and bandwidth.

However, a warning is reported on the server side when MXM_TLS is set and the send_lat test is run:

  icb_ep.c:287 MXM WARN The min value for CIB_RX_QUEUE_LEN is 2048.

Note: setting the undocumented env variable MXM_CIB_RX_QUEUE_LEN=2048 removes the warning but doesn't affect the send latency.

* * *

So now the results are better: MXM performs as well as the regular openib in terms of latency and bandwidth (I didn't check the overlap capability though). But I'm not really impressed. I was expecting MXM (especially when used by yalla) to be a little better than openib. Also, the latency of openib, mxm and yalla, at 1.8 us, seems too high. With a configuration like ours, we should get something closer to 1 us.

Does anyone have an idea?

Don't forget that this cluster uses LXC containers with SR-IOV enabled for the Infiniband adapter.

Martin Audet

> Hi Martin,
>
> Can you check if it is any better with "-x MXM_TLS=rc,shm,self" ?
>
> -Devendar
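Putting the settings from Note 1 together, a sketch of what the yalla run behind the table above could look like; the process count, --map-by option and IMB binary name are assumptions:

  mpirun -np 2 --map-by node \
         -mca pml yalla,ob1 -mca btl vader,self \
         -x MXM_RDMA_PORTS=mlx4_0:1 \
         -x MXM_TLS=rc,shm,self \
         ./IMB-MPI1 PingPong Sendrecv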
Re: [OMPI users] An equivalent to btl_openib_include_if when MXM over Infiniband ?
Hi Martin,

Can you check if it is any better with "-x MXM_TLS=rc,shm,self" ?

-Devendar

On Tue, Aug 16, 2016 at 11:28 AM, Audet, Martin wrote:

> Hi Josh,
>
> Thanks for your reply. I did try setting MXM_RDMA_PORTS=mlx4_0:1 for all
> my MPI processes and it did improve performance, but the performance I
> obtain isn't completely satisfying.
>
> When I use the IMB 4.1 pingpong and sendrecv benchmarks between two nodes
> with Open MPI 1.10.3, I get:
>
> without MXM_RDMA_PORTS
>
>   comm     lat_min    bw_max     bw_max
>            pingpong   pingpong   sendrecv
>            (us)       (MB/s)     (MB/s)
>   ---------------------------------------
>   openib   1.79       5947.07    11534
>   mxm      2.51       5166.96    8079.18
>   yalla    2.47       5167.29    8278.15
>
> with MXM_RDMA_PORTS=mlx4_0:1
>
>   comm     lat_min    bw_max     bw_max
>            pingpong   pingpong   sendrecv
>            (us)       (MB/s)     (MB/s)
>   ---------------------------------------
>   openib   1.79       5827.93    11552.4
>   mxm      2.23       5191.77    8201.76
>   yalla    2.18       5200.55    8109.48
>
> openib means: pml=ob1       btl=openib,vader,self   btl_openib_include_if=mlx4_0
> mxm    means: pml=cm,ob1    mtl=mxm                 btl=vader,self
> yalla  means: pml=yalla,ob1 btl=vader,self
>
> lspci reports for our FDR Infiniband HCA:
>
>   Infiniband controller: Mellanox Technologies MT27500 Family [ConnectX-3]
>
> and 16 lines like:
>
>   Infiniband controller: Mellanox Technologies MT27500/MT27520 Family
>   [ConnectX-3/ConnectX-3 Pro Virtual Function]
>
> The nodes use two octa-core Xeon E5-2650v2 Ivybridge-EP 2.67 GHz sockets.
>
> ofed_info reports that the mxm version is 3.4.3cce223-0.32200.
>
> As you can see, the results are not very good. I would expect mxm and
> yalla to perform better than openib both in terms of latency and bandwidth
> (note: the sendrecv bandwidth is full duplex). I would expect the yalla
> latency to be around 1.1 us, as shown here
> https://www.open-mpi.org/papers/sc-2014/Open-MPI-SC14-BOF.pdf (page 33).
>
> I also ran mxm_perftest (located in /opt/mellanox/bin) and it reports the
> following latency between two nodes:
>
>   without MXM_RDMA_PORTS             1.92 us
>   with    MXM_RDMA_PORTS=mlx4_0:1    1.65 us
>
> Again, I think we can expect a better latency with our configuration;
> 1.65 us is not a very good result.
>
> Note however that the 0.27 us (1.92 - 1.65 = 0.27) reduction in raw mxm
> latency corresponds to the reduction in the Open MPI latencies observed
> above with mxm (2.51 - 2.23 = 0.28) and yalla (2.47 - 2.18 = 0.29).
>
> Another detail: everything is run inside LXC containers. Also, SR-IOV is
> probably used.
>
> Does anyone have any idea what's wrong with our cluster?
>
> Martin Audet
>
> > Hi, Martin
> >
> > The environment variable:
> >
> >   MXM_RDMA_PORTS=device:port
> >
> > is what you're looking for. You can specify a device/port pair on your
> > OMPI command line like:
> >
> >   mpirun -np 2 ... -x MXM_RDMA_PORTS=mlx4_0:1 ...
> >
> > Best,
> >
> > Josh
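A sketch of how one could confirm which PML the MXM-based runs actually select; the exact verbose output format varies between Open MPI versions, and the application name is a placeholder:

  # list the PML and MTL components available in this build:
  ompi_info | grep -E ' pml| mtl'

  # print PML selection details at run time (look for the "selected" lines):
  mpirun -np 2 -mca pml yalla,ob1 -mca pml_base_verbose 10 ./my_app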
Re: [OMPI users] An equivalent to btl_openib_include_if when MXM over Infiniband ?
"Audet, Martin"writes: > Hi Josh, > > Thanks for your reply. I did try setting MXM_RDMA_PORTS=mlx4_0:1 for all my > MPI processes > and it did improve performance but the performance I obtain isn't completely > satisfying. I raised the issue of MXM hurting p2p latency here a while ago, but don't have a solution. Mellanox were here last week and promised to address that, but I haven't heard back. I get the impression this stuff isn't widely used, and since it's proprietary, unlike PSM, we can't really investigate. ___ users mailing list users@lists.open-mpi.org https://rfd.newmexicoconsortium.org/mailman/listinfo/users
Re: [OMPI users] An equivalent to btl_openib_include_if when MXM over Infiniband ?
Hi Josh,

Thanks for your reply. I did try setting MXM_RDMA_PORTS=mlx4_0:1 for all my MPI processes and it did improve performance, but the performance I obtain isn't completely satisfying.

When I use the IMB 4.1 pingpong and sendrecv benchmarks between two nodes with Open MPI 1.10.3, I get:

without MXM_RDMA_PORTS

  comm     lat_min    bw_max     bw_max
           pingpong   pingpong   sendrecv
           (us)       (MB/s)     (MB/s)
  ---------------------------------------
  openib   1.79       5947.07    11534
  mxm      2.51       5166.96    8079.18
  yalla    2.47       5167.29    8278.15

with MXM_RDMA_PORTS=mlx4_0:1

  comm     lat_min    bw_max     bw_max
           pingpong   pingpong   sendrecv
           (us)       (MB/s)     (MB/s)
  ---------------------------------------
  openib   1.79       5827.93    11552.4
  mxm      2.23       5191.77    8201.76
  yalla    2.18       5200.55    8109.48

openib means: pml=ob1       btl=openib,vader,self   btl_openib_include_if=mlx4_0
mxm    means: pml=cm,ob1    mtl=mxm                 btl=vader,self
yalla  means: pml=yalla,ob1 btl=vader,self

lspci reports for our FDR Infiniband HCA:

  Infiniband controller: Mellanox Technologies MT27500 Family [ConnectX-3]

and 16 lines like:

  Infiniband controller: Mellanox Technologies MT27500/MT27520 Family [ConnectX-3/ConnectX-3 Pro Virtual Function]

The nodes use two octa-core Xeon E5-2650v2 Ivybridge-EP 2.67 GHz sockets.

ofed_info reports that the mxm version is 3.4.3cce223-0.32200.

As you can see, the results are not very good. I would expect mxm and yalla to perform better than openib both in terms of latency and bandwidth (note: the sendrecv bandwidth is full duplex). I would expect the yalla latency to be around 1.1 us, as shown here https://www.open-mpi.org/papers/sc-2014/Open-MPI-SC14-BOF.pdf (page 33).

I also ran mxm_perftest (located in /opt/mellanox/bin) and it reports the following latency between two nodes:

  without MXM_RDMA_PORTS             1.92 us
  with    MXM_RDMA_PORTS=mlx4_0:1    1.65 us

Again, I think we can expect a better latency with our configuration; 1.65 us is not a very good result.

Note however that the 0.27 us (1.92 - 1.65 = 0.27) reduction in raw mxm latency corresponds to the reduction in the Open MPI latencies observed above with mxm (2.51 - 2.23 = 0.28) and yalla (2.47 - 2.18 = 0.29).

Another detail: everything is run inside LXC containers. Also, SR-IOV is probably used.

Does anyone have any idea what's wrong with our cluster?

Martin Audet

> Hi, Martin
>
> The environment variable:
>
>   MXM_RDMA_PORTS=device:port
>
> is what you're looking for. You can specify a device/port pair on your OMPI
> command line like:
>
>   mpirun -np 2 ... -x MXM_RDMA_PORTS=mlx4_0:1 ...
>
> Best,
>
> Josh
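For clarity, a sketch of how the three configurations listed above could be expressed as mpirun command lines; the --map-by option and the IMB-MPI1 binary name are assumptions:

  # openib: ob1 PML over the openib BTL, restricted to the physical function
  mpirun -np 2 --map-by node -mca pml ob1 -mca btl openib,vader,self \
         -mca btl_openib_include_if mlx4_0 ./IMB-MPI1 PingPong Sendrecv

  # mxm: cm PML with the mxm MTL
  mpirun -np 2 --map-by node -mca pml cm,ob1 -mca mtl mxm -mca btl vader,self \
         -x MXM_RDMA_PORTS=mlx4_0:1 ./IMB-MPI1 PingPong Sendrecv

  # yalla: yalla PML (uses MXM directly)
  mpirun -np 2 --map-by node -mca pml yalla,ob1 -mca btl vader,self \
         -x MXM_RDMA_PORTS=mlx4_0:1 ./IMB-MPI1 PingPong Sendrecv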
Re: [OMPI users] An equivalent to btl_openib_include_if when MXM over Infiniband ?
Hi, Martin

The environment variable:

  MXM_RDMA_PORTS=device:port

is what you're looking for. You can specify a device/port pair on your OMPI command line like:

  mpirun -np 2 ... -x MXM_RDMA_PORTS=mlx4_0:1 ...

Best,

Josh

On Fri, Aug 12, 2016 at 5:03 PM, Audet, Martin wrote:

> Hi OMPI_Users && OMPI_Developers,
>
> Is there an equivalent to the MCA parameter btl_openib_include_if when
> using MXM over Infiniband (e.g. either (pml=cm mtl=mxm) or (pml=yalla))?
>
> I ask this question because I'm working on a cluster where LXC containers
> are used on the compute nodes (with SR-IOV, I think) and multiple mlx4
> interfaces are reported by lstopo (e.g. mlx4_0, mlx4_1, ..., mlx4_16) even
> though a single physical Mellanox ConnectX-3 HCA is present per node.
>
> I found that when I use the plain openib btl (e.g. (pml=ob1 btl=openib)),
> it is much faster if I specify the MCA parameter
> btl_openib_include_if=mlx4_0 to force Open MPI to use a single interface.
> By doing that, the latency is lower and the bandwidth higher. I guess it
> is because otherwise Open MPI gets confused by trying to use all "virtual"
> interfaces at once.
>
> However, we all know that MXM is better than plain openib since it allows
> the HCA to perform message matching, transfer messages in the background
> and provide communication progress.
>
> So in this case, is there a way to use only mlx4_0? I mean when using the
> mxm mtl (pml=cm mtl=mxm) or, preferably, when using it more directly via
> the yalla pml (pml=yalla).
>
> Note that I'm using Open MPI 1.10.3, which I compiled myself, for now, but
> I could use Open MPI 2.0 instead if necessary.
>
> Thanks,
>
> Martin Audet
[OMPI users] An equivalent to btl_openib_include_if when MXM over Infiniband ?
Hi OMPI_Users && OMPI_Developers,

Is there an equivalent to the MCA parameter btl_openib_include_if when using MXM over Infiniband (e.g. either (pml=cm mtl=mxm) or (pml=yalla))?

I ask this question because I'm working on a cluster where LXC containers are used on the compute nodes (with SR-IOV, I think) and multiple mlx4 interfaces are reported by lstopo (e.g. mlx4_0, mlx4_1, ..., mlx4_16) even though a single physical Mellanox ConnectX-3 HCA is present per node.

I found that when I use the plain openib btl (e.g. (pml=ob1 btl=openib)), it is much faster if I specify the MCA parameter btl_openib_include_if=mlx4_0 to force Open MPI to use a single interface. By doing that, the latency is lower and the bandwidth higher. I guess it is because otherwise Open MPI gets confused by trying to use all "virtual" interfaces at once.

However, we all know that MXM is better than plain openib since it allows the HCA to perform message matching, transfer messages in the background and provide communication progress.

So in this case, is there a way to use only mlx4_0? I mean when using the mxm mtl (pml=cm mtl=mxm) or, preferably, when using it more directly via the yalla pml (pml=yalla).

Note that I'm using Open MPI 1.10.3, which I compiled myself, for now, but I could use Open MPI 2.0 instead if necessary.

Thanks,

Martin Audet
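A sketch of how the devices visible inside a container could be listed, and of the openib workaround described above; the application name is a placeholder and the grep fields assume typical ibv_devinfo output:

  # list the HCA devices (physical function plus SR-IOV virtual functions):
  ibstat -l
  ibv_devinfo | grep -E 'hca_id|state|link_layer'

  # plain openib restricted to the physical function, as described above:
  mpirun -np 2 -mca pml ob1 -mca btl openib,vader,self \
         -mca btl_openib_include_if mlx4_0 ./my_app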