Re: [OMPI users] Using OSU benchmarks for checking Infiniband network
Hi George, Maybe I did not explain properly the point I wanted to have clarified. What I was trying to say is that I had the impression that MPI_T was developed for tuning the internal MPI parameters for a given underlying network fabric. Furthermore, it is mainly used by MPI developers for debugging / improvement, right? Or is it meant for users too? If that is the case, how can I, as a user, make use of the MPI_T interface to find out whether the fabric itself (the system) has a problem, and not the MPI implementation on top of it?

Cheers,
Denis
Re: [OMPI users] Using OSU benchmarks for checking Infiniband network
I am not sure I understand the comment about MPI_T. Each network card has internal counters that can be gathered by any process on the node. Similarly, some information is available from the switches, but I always assumed that information is aggregated across all ongoing jobs. Still, by merging the switch-level information with the MPI-level information, the necessary trend can be highlighted.

George.
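For reference, the per-NIC counters mentioned here can also be read directly from user space, independently of any MPI library, which is one way to check the fabric itself. A minimal sketch (the HCA name mlx5_0 and port 1 are assumptions; use whatever devices appear under /sys/class/infiniband/ on the nodes):

cat /sys/class/infiniband/mlx5_0/ports/1/counters/port_xmit_data
cat /sys/class/infiniband/mlx5_0/ports/1/counters/port_rcv_data
cat /sys/class/infiniband/mlx5_0/ports/1/counters/symbol_error
perfquery -x    # extended port counters, if infiniband-diags is installed

Sampling these before and after a benchmark run gives a fabric-level view of traffic and error counts that does not depend on the MPI implementation.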
Re: [OMPI users] Using OSU benchmarks for checking Infiniband network
Maybe I am wrong, but MPI_T seems to be aimed at internal Open MPI parameters, right?

So by what kind of magic can a tool like OSU INAM get information from the network fabric, and even from the switches, related to a particular MPI job? There should be more information gathered in the background.
Re: [OMPI users] Using OSU benchmarks for checking Infiniband network
Collecting data during execution is possible in OMPI either with an external tool, such as mpiP, or the internal infrastructure, SPC. Take a look at ./examples/spc_example.c or ./test/spc/spc_test.c to see how to use this.

George.
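Open MPI's SPC counters are also exported through the standard MPI_T performance-variable (pvar) interface, so a user-level program can read them without modifying the library. Below is a minimal sketch, not taken from spc_example.c; it assumes Open MPI was built with SPC support and that the counters are enabled via the corresponding mpi_spc MCA parameters, and it assumes the SPC pvars can be recognized by the substring "spc" in their names. It lists every matching unsigned-long-long pvar and prints its value.

#include <mpi.h>
#include <stdio.h>
#include <string.h>

int main(int argc, char **argv)
{
    int provided, num_pvar, i, rank;

    MPI_Init(&argc, &argv);
    MPI_T_init_thread(MPI_THREAD_SINGLE, &provided);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    /* ... the application's normal MPI communication happens here ... */

    MPI_T_pvar_session session;
    MPI_T_pvar_session_create(&session);
    MPI_T_pvar_get_num(&num_pvar);

    for (i = 0; i < num_pvar; i++) {
        char name[256], desc[256];
        int name_len = sizeof(name), desc_len = sizeof(desc);
        int verbosity, var_class, bind, readonly, continuous, atomic, count;
        MPI_Datatype dt;
        MPI_T_enum enumtype;

        if (MPI_T_pvar_get_info(i, name, &name_len, &verbosity, &var_class,
                                &dt, &enumtype, desc, &desc_len, &bind,
                                &readonly, &continuous, &atomic) != MPI_SUCCESS)
            continue;

        /* Assumption: the SPC counters show up as unsigned long long pvars
           whose names contain "spc"; skip everything else. */
        if (strstr(name, "spc") == NULL || dt != MPI_UNSIGNED_LONG_LONG)
            continue;

        /* Bind the handle: counters are usually not tied to an object, but
           handle a per-communicator binding just in case. */
        void *obj = NULL;
        MPI_Comm comm = MPI_COMM_WORLD;
        if (bind == MPI_T_BIND_MPI_COMM)
            obj = &comm;
        else if (bind != MPI_T_BIND_NO_OBJECT)
            continue;

        MPI_T_pvar_handle handle;
        unsigned long long value = 0;
        MPI_T_pvar_handle_alloc(session, i, obj, &handle, &count);
        if (count == 1) {
            if (!continuous)
                MPI_T_pvar_start(session, handle);
            MPI_T_pvar_read(session, handle, &value);
            if (rank == 0)
                printf("%s = %llu\n", name, value);
        }
        MPI_T_pvar_handle_free(session, &handle);
    }

    MPI_T_pvar_session_free(&session);
    MPI_T_finalize();
    MPI_Finalize();
    return 0;
}

Because these are plain MPI_T pvars, the same loop also works for any other performance variable an MPI implementation chooses to expose.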
Re: [OMPI users] Using OSU benchmarks for checking Infiniband network
I have seen in the OSU INAM paper:

"While we chose MVAPICH2 for implementing our designs, any MPI runtime (e.g.: OpenMPI [12]) can be modified to perform similar data collection and transmission."

But I do not know what is meant by a "modified" OpenMPI.

Cheers,
Denis
> > > > https://www.open-mpi.org/faq/?category=tuning#selecting-components > > > > However, this may have changed lately: > > https://www.open-mpi.org/faq/?category=tcp#tcp-auto-disable > > 2) Maximum locked memory used by IB and their system limit. Start > > here: > > > https://www.open-mpi.org/faq/?category=openfabrics#limiting-registered-memory-usage > > 3) The eager vs. rendezvous message size threshold. I wonder if it may > > sit right where you see the latency spike. > > https://www.open-mpi.org/faq/?category=all#ib-locked-pages-user > &
Re: [OMPI users] Using OSU benchmarks for checking Infiniband network
I am not aware of anything similar in Open MPI. Maybe OSU-INAM can work with other MPI implementations? Would be worth investigating...

Joseph
Re: [OMPI users] Using OSU benchmarks for checking Infiniband network
Hi Joseph,

Looking at MVAPICH I noticed that this MPI implementation provides an InfiniBand Network Analysis and Profiling Tool: OSU-INAM.

Is there something equivalent for Open MPI?

Best,
Denis
Re: [OMPI users] Using OSU benchmarks for checking Infiniband network
Hi,

I do not have much experience with MVAPICH. Since we work with Singularity containers, I can create a container, install this version and compare.

Cheers,
Denis
Re: [OMPI users] Using OSU benchmarks for checking Infiniband network
Hi Denis,

Sorry if I missed it in your previous messages but could you also try running a different MPI implementation (MVAPICH) to see whether Open MPI is at fault or the system is somehow to blame for it?

Thanks
Joseph
Re: [OMPI users] Using OSU benchmarks for checking Infiniband network
Sorry, it is float, and the size output by the benchmark is in bytes:
https://github.com/ROCm-Developer-Tools/OSU_Microbenchmarks/blob/master/mpi/collective/osu_allreduce.c

The data cache is 32 KB per core:
https://www.cpu-world.com/CPUs/Zen/AMD-EPYC%207571.html

One send array of 16 KB and one receive array of 16 KB should fill this if it is used in that manner. Similar behavior is obtained on an Intel Xeon Platinum 8142M (also 32 KB L1 data cache per core) with OpenMPI 4.0.2:
http://hidl.cse.ohio-state.edu/static/media/talks/slide/AWS_SC19_Talk_V2.pdf
Re: [OMPI users] Using OSU benchmarks for checking Infiniband network
Hi,

Thanks a lot for all the info, very interesting! We basically use AMD EPYC processors:

vendor_id      : AuthenticAMD
cpu family     : 23
model          : 1
model name     : AMD EPYC 7551 32-Core Processor
stepping       : 2
microcode      : 0x8001250
cpu MHz        : 2000.000
cache size     : 512 KB
physical id    : 1
siblings       : 64
core id        : 31
cpu cores      : 32
apicid         : 127
initial apicid : 127
fpu            : yes
fpu_exception  : yes
cpuid level    : 13
wp             : yes

The number of cores can depend on the node, though (32/64). So according to your calculation, the message of 16384 bytes should fit in? By the way, is it 16384 bytes or 16384 double precision = 16384*8 bytes?

Best,
Denis
Re: [OMPI users] Using OSU benchmarks for checking Infiniband network
On 2/8/22 11:06 AM, Bertini, Denis Dr. via users wrote:
> Hi
> Thanks for all this information!
> But I have to confess that in this multi-tuning-parameter space I got somehow lost.
> Furthermore, it sometimes mixes user-space and kernel-space.
> I only have the possibility to act on the user space.

Ok. If you are doing the test to check your system, you should only tune for typical applications, rather than for one function call with a specific message size. As you can change the OpenMPI default settings for the algorithm used to do all reduce, you may wish to run tests and choose a setting that will work well for most of your users. You may also wish to upgrade to OpenMPI 4.1 as default, so perhaps do tests on that version.

> 1) So I have on the system max locked memory:
>    - ulimit -l unlimited (default)
>    and I do not see any warnings/errors related to that when launching MPI.
>
> 2) I tried different algorithms for the MPI_Allreduce op., all showing a drop in bw for size=16384

The drops are of different magnitude depending on the algorithm used; the default gives a worst-case latency of 54970.96 us and round robin gives a worst-case latency of 4992.04 us for a size of 16384. It may be helpful to indicate what hardware you are using, both for the chip (cache sizes will be important) and the interconnect. Perhaps try the test on 2 or 4 nodes as well.

> 4) I disabled openIB (no RDMA) and used only TCP, and I noticed the same behaviour.

This suggests it is the chip, rather than the interconnect.

> 3) I realized that increasing the so-called warm-up parameter in the OSU benchmark (argument -x, 200 as default) reduces the discrepancy.
>    On the contrary, a lower threshold (-x 10) can increase this BW discrepancy up to a factor 300 at message size 16384 compared to message size 8192, for example.
>    So does it mean that there are some caching effects in the internode communication?

Probably. If you are using AMD 7551P nodes, these have 96K L1 cache per core. A message of 16384 double precision uses 132K so will not fit in L1 cache, and a message of 8192 uses 66K and will fit in L1 cache. Perhaps try the same test on Intel Xeon e52680 nodes or 6248r nodes.

Some relevant studies are:
Zhong, Cao, Bosilca and Dongarra, "Using long vector extensions for MPI reductions", https://doi.org/10.1016/j.parco.2021.102871
Hashmi, Chakraborty, Bayatpour, Subramoni and Panda, "Designing Shared Address Space MPI libraries in the Many-core Era", https://jahanzeb-hashmi.github.io/files/talks/ipdps18.pdf
Saini, Mehrotra, Taylor, Shende and Biswas, "Performance Analysis of Scientific and Engineering Applications Using MPInside and TAU", https://ntrs.nasa.gov/api/citations/20100038444/downloads/20100038444.pdf

The second study by Hashmi et al. focuses on inter-node communication, but has a nice performance model that demonstrates understanding of the communication pattern. For typical use of MPI on a particular cluster, such a detailed understanding is likely not necessary. These studies do also collect hardware performance information.

> From my experience, tuning parameters is a time-consuming and cumbersome task.
> Could it also be that the problem is not really in the Open MPI implementation but in the system?

The default OpenMPI parameters may need to be adjusted for a good user experience on your system, but demanding users will probably do this for their specific applications. By changing the algorithm used for all reduce, you got a factor of 10 improvement in the benchmark for a size of 16384.
Perhaps determine which MPI calls are used most often on your cluster, and provide a guide as to how OpenMPI can be tuned for these. Alternatively, if you have a set of heavily used applications, profile them to determine the most used MPI calls and then set defaults that would improve application performance. Do also check whether there are any performance measurements available from your InfiniBand switch provider that will allow checking of correct functionality at the single-switch level.
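For concreteness, the algorithm comparison described above can be forced from the command line. This is a sketch only, using the coll_tuned parameters reported by ompi_info further down in the thread, and assuming coll_tuned_use_dynamic_rules must be set for the forced value to take effect:

mpirun -np 200 --map-by node \
    --mca coll_tuned_use_dynamic_rules 1 \
    --mca coll_tuned_allreduce_algorithm 4 \
    ./osu_allreduce -f

Here 4 selects the ring algorithm from the list printed by ompi_info --all; values 0-6 cycle through the other implementations.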
Re: [OMPI users] Using OSU benchmarks for checking Infiniband network
Hi,

Thanks for all this information! But I have to confess that in this multi-tuning-parameter space I got somehow lost. Furthermore, it sometimes mixes user-space and kernel-space, and I only have the possibility to act on the user space.

1) So I have on the system max locked memory: ulimit -l unlimited (default), and I do not see any warnings/errors related to that when launching MPI.

2) I tried different algorithms for the MPI_Allreduce op., all showing a drop in bw for size=16384.

4) I disabled openIB (no RDMA) and used only TCP, and I noticed the same behaviour.

3) I realized that increasing the so-called warm-up parameter in the OSU benchmark (argument -x, 200 as default) reduces the discrepancy. On the contrary, a lower threshold (-x 10) can increase this BW discrepancy up to a factor 300 at message size 16384 compared to message size 8192, for example. So does it mean that there are some caching effects in the internode communication?

From my experience, tuning parameters is a time-consuming and cumbersome task. Could it also be that the problem is not really in the Open MPI implementation but in the system?

Best,
Denis
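The warm-up effect described in point 3) can be probed directly from the benchmark command line. A sketch, assuming the standard OSU micro-benchmark flags (-x warm-up iterations, -i measured iterations, -f full min/max output) and the one-rank-per-node layout used in the original runs:

mpirun -np 200 --map-by node ./osu_allreduce -f -x 200 -i 1000
mpirun -np 200 --map-by node ./osu_allreduce -f -x 10  -i 1000

Comparing the two runs shows how much of the 16384-byte outlier is a first-touch/caching effect rather than a steady-state property of the network.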
Re: [OMPI users] Using OSU benchmarks for checking Infiniband network
This may have changed since, but these used to be relevant points. Overall, the Open MPI FAQ has lots of good suggestions:
https://www.open-mpi.org/faq/
some specific to performance tuning:
https://www.open-mpi.org/faq/?category=tuning
https://www.open-mpi.org/faq/?category=openfabrics

1) Make sure you are not using the Ethernet TCP/IP, which is widely available in compute nodes:
mpirun --mca btl self,sm,openib ...
https://www.open-mpi.org/faq/?category=tuning#selecting-components
However, this may have changed lately:
https://www.open-mpi.org/faq/?category=tcp#tcp-auto-disable

2) Maximum locked memory used by IB and the system limit. Start here:
https://www.open-mpi.org/faq/?category=openfabrics#limiting-registered-memory-usage

3) The eager vs. rendezvous message size threshold. I wonder if it may sit right where you see the latency spike.
https://www.open-mpi.org/faq/?category=all#ib-locked-pages-user

4) Processor and memory locality/affinity and binding (please check the current options and syntax):
https://www.open-mpi.org/faq/?category=tuning#using-paffinity-v1.4
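As a hedged sketch of points 1) and 3) on Open MPI 3.1.x (where the shared-memory BTL is typically "vader" rather than the older "sm"; the eager-limit value below is purely illustrative, not a recommended setting):

  # force InfiniBand + shared memory (no TCP), print the BTL selection,
  # and experiment with the eager/rendezvous switch-over size
  mpirun --mca btl self,vader,openib \
         --mca btl_base_verbose 30 \
         --mca btl_openib_eager_limit 65536 \
         --bind-to core --map-by ppr:1:node \
         ./osu_allreduce -f

If the default openib eager limit happens to fall between 8192 and 16384 bytes, moving it and re-measuring would test point 3) directly.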
Re: [OMPI users] Using OSU benchmarks for checking Infiniband network
Hi

I changed the algorithm used to the ring algorithm (4), for example, and the scan changed to:

# OSU MPI Allreduce Latency Test v5.7.1
# Size      Avg Latency(us)   Min Latency(us)   Max Latency(us)   Iterations
4                 59.39             51.04             65.36            1
8                109.13             90.14            126.32            1
16               253.26             60.89            290.31            1
32                75.04             54.53             83.28            1
64                96.40             59.73            111.45            1
128               67.86             59.73             76.44            1
256               76.32             67.33             85.18            1
512              129.93             85.76            170.31            1
1024             168.51            129.15            194.68            1
2048             136.17            110.09            156.94            1
4096             173.59            130.76            199.21            1
8192             236.05            170.77            269.98            1
16384           4212.65           3627.71           4992.04            1
32768           1243.05           1205.11           1276.11            1
65536           1464.50           1364.76           1531.48            1
131072          1558.71           1454.52           1632.91            1
262144          1681.58           1609.15           1745.44            1
524288          2305.73           2178.17           2402.69            1
1048576         3389.83           3220.44           3517.61            1

Would this mean that the first results were linked to the algorithm used by default in Open MPI (0 = ignore)? Do you know which algorithm this (0 = ignore) corresponds to? I still see the wall at message size 16384, though ...

Best
Denis
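For reference, "0 = ignore" does not name a specific algorithm: it leaves the choice to the tuned coll component's built-in decision rules, which switch algorithms at run time based on message and communicator size. A minimal sketch of pinning the ring algorithm explicitly, assuming the tuned component is the one selected:

  # enable per-collective overrides, then pin Allreduce to the ring algorithm (4)
  mpirun --mca coll_tuned_use_dynamic_rules 1 \
         --mca coll_tuned_allreduce_algorithm 4 \
         ./osu_allreduce -f

Note that coll_tuned_allreduce_algorithm only takes effect when coll_tuned_use_dynamic_rules is also set.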
Re: [OMPI users] Using OSU benchmarks for checking Infiniband network
Hi,

I ran the allgather benchmark and got these values, which also show a step-wise performance drop as a function of message size. Would this be linked to the underlying algorithm used for the collective operation?

# OSU MPI Allgather Latency Test v5.7.1
# Size      Avg Latency(us)
1                 70.36
2                 47.01
4                 72.42
8                 49.62
16                57.93
32                50.11
64                57.29
128               74.05
256              454.41
512              544.04
1024             580.96
2048             711.40
4096             905.14
8192            2002.32
16384           2652.59
32768           4034.35
65536           6816.29
131072         14280.11
262144         28451.46
524288         54719.41
1048576       106607.19

I use srun and not mpirun; how do I activate the flag for verbosity in that case?

Best
Denis
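When the job is started with srun, mpirun's command-line --mca options are not available, but Open MPI also reads every MCA parameter from environment variables of the form OMPI_MCA_<parameter>. A sketch (the Slurm options only mirror the 200-node, 1-process-per-node setup and are illustrative):

  # MCA parameters can be passed through the environment when using srun
  export OMPI_MCA_coll_base_verbose=10     # verbose output from the collective framework
  export OMPI_MCA_btl_base_verbose=30      # verbose output from the byte-transfer layer
  srun -N 200 --ntasks-per-node=1 ./osu_allgather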
Re: [OMPI users] Using OSU benchmarks for checking Infiniband network
Following https://www.open-mpi.org/doc/v3.1/man1/mpirun.1.php

mpirun --verbose --display-map

Have you tried newer Open MPI versions?

Do you get similar behavior for the osu_reduce and osu_gather benchmarks?

Typically internal buffer sizes as well as your hardware will affect performance. Can you give specifications similar to what is available at:
http://mvapich.cse.ohio-state.edu/performance/collectives/
where the operating system, switch, node type and memory are indicated?

If you need good performance, you may want to also specify the algorithm used. You can find some of the parameters you can tune using:

ompi_info --all

A particularly helpful parameter is:

MCA coll tuned: parameter "coll_tuned_allreduce_algorithm" (current value: "ignore", data source: default, level: 5 tuner/detail, type: int)
    Which allreduce algorithm is used. Can be locked down to any of: 0 ignore, 1 basic linear, 2 nonoverlapping (tuned reduce + tuned bcast), 3 recursive doubling, 4 ring, 5 segmented ring
    Valid values: 0:"ignore", 1:"basic_linear", 2:"nonoverlapping", 3:"recursive_doubling", 4:"ring", 5:"segmented_ring", 6:"rabenseifner"
MCA coll tuned: parameter "coll_tuned_allreduce_algorithm_segmentsize" (current value: "0", data source: default, level: 5 tuner/detail, type: int)

For Open MPI 4.0, there is a tuning program [2] that might also be helpful.

[1] https://stackoverflow.com/questions/36635061/how-to-check-which-mca-parameters-are-used-in-openmpi
[2] https://github.com/open-mpi/ompi-collectives-tuning
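The coll_tuned parameters sit at level 5, so they are hidden at ompi_info's default level; rather than scanning the full --all dump, one can (as a sketch) query just the tuned collective component:

  # list only the tuned collective component's parameters, down to level 5
  ompi_info --param coll tuned --level 5 | grep allreduce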
Re: [OMPI users] Using OSU benchmarks for checking Infiniband network
Hi

When I repeat the test I always get the huge discrepancy at message size 16384.

Maybe there is a way to run MPI in verbose mode in order to investigate this behaviour further?

Best
Denis
Re: [OMPI users] Using OSU benchmarks for checking Infiniband network
Hi,

Do you get similar results when you repeat the test? Another job could have interfered with your run.

Benson
[OMPI users] Using OSU benchmarks for checking Infiniband network
Hi

I am using the OSU micro-benchmarks compiled with Open MPI 3.1.6 in order to check/benchmark the InfiniBand network for our cluster.

For that I use the collective all_reduce benchmark and run over 200 nodes, using 1 process per node.

And these are the results I obtained:

# OSU MPI Allreduce Latency Test v5.7.1
# Size      Avg Latency(us)   Min Latency(us)   Max Latency(us)   Iterations
4                114.65             83.22            147.98          1000
8                133.85            106.47            164.93          1000
16               116.41             87.57            150.58          1000
32               112.17             93.25            130.23          1000
64               106.85             81.93            134.74          1000
128              117.53             87.50            152.27          1000
256              143.08            115.63            173.97          1000
512              130.34            100.20            167.56          1000
1024             155.67            111.29            188.20          1000
2048             151.82            116.03            198.19          1000
4096             159.11            122.09            199.24          1000
8192             176.74            143.54            221.98          1000
16384          48862.85          39270.21          54970.96          1000
32768           2737.37           2614.60           2802.68          1000
65536           2723.15           2585.62           2813.65          1000

Could someone explain to me what is happening for message size = 16384? One can notice a huge latency (~ 300 times larger) compared to message size = 8192. I do not really understand what could create such an increase in the latency.

The reason I use the OSU micro-benchmarks is that we sporadically experience a drop in the bandwidth for typical collective operations such as MPI_Reduce in our cluster, which is difficult to understand.

I would be grateful if somebody could share their expertise on such a problem with me.

Best,
Denis

-
Denis Bertini
Abteilung: CIT
Ort: SB3 2.265a

Tel: +49 6159 71 2240
Fax: +49 6159 71 2986
E-Mail: d.bert...@gsi.de

GSI Helmholtzzentrum für Schwerionenforschung GmbH
Planckstraße 1, 64291 Darmstadt, Germany, www.gsi.de

Commercial Register / Handelsregister: Amtsgericht Darmstadt, HRB 1528
Managing Directors / Geschäftsführung:
Professor Dr. Paolo Giubellino, Dr. Ulrich Breuer, Jörg Blaurock
Chairman of the GSI Supervisory Board / Vorsitzender des GSI-Aufsichtsrats:
Ministerialdirigent Dr. Volkmar Dietz