Re: [OMPI users] Performance question about OpenMPI and MVAPICH2 on IB
Hi Craig, Terry, Neeraj, list

Craig: A fellow here runs WRF.  I grepped the code and there are plenty of
collectives there: MPI_[All]Gather[v], MPI_[All]Reduce, etc.  Domain
decomposition codes like WRF, MITgcm, and other atmosphere and ocean codes
have point-to-point communication to exchange subdomain boundaries, but also
collective operations to calculate sums, etc., in the various types of PDE
(matrix) solvers that require global information.

Terry: On the MITgcm, the apparent culprit is MPI_Allreduce, which seems to be
bad on **small** messages (rather than big ones).  This is the same behavior
pattern that was reported here in May, regarding MPI_Alltoall, by Roman
Martonak, a list subscriber using a computational chemistry package on an IB
cluster:

http://www.open-mpi.org/community/lists/users/2009/07/10045.php
http://www.open-mpi.org/community/lists/users/2009/05/9419.php

At that point Pavel Shamis, Peter Kjellstrom, and others gave very good
suggestions, but they focused only on MPI_Alltoall; no other collectives were
considered.

All: Any insights on how to tune MPI_Allreduce?  Maybe a hint on the other
collectives also?  Is there a benchmark tool available that one can use to
find the sweet spot of each collective?

Many thanks,
Gus Correa
---------------------------------------------------------------------
Gustavo Correa
Lamont-Doherty Earth Observatory - Columbia University
Palisades, NY, 10964-8000 - USA
---------------------------------------------------------------------

nee...@crlindia.com wrote:

Hi Terry,

    I had tested mostly MPI_Bcast, MPI_Reduce, and MPI_Gather collectives
with openmpi-1.3 and the hierarchical option enabled.  In all of these I
found results slower than the regular tuned collectives.

    We have HP blades with Intel Clovertown processors (two quad-cores)
connected with a DDR InfiniBand Clos network.  Results were tested on 12-16
nodes with 8 MPI processes per node.

Regards

Neeraj Chourasia (MTS)
Computational Research Laboratories Ltd.
(A wholly Owned Subsidiary of TATA SONS Ltd)
B-101, ICC Trade Towers, Senapati Bapat Road
Pune 411016 (Mah) INDIA
(O) +91-20-6620 9863 (Fax) +91-20-6620 9862
M: +91.9225520634

*Terry Dontje <terry.don...@sun.com>*
Sent by: users-boun...@open-mpi.org
08/07/2009 05:15 PM
Please respond to Open MPI Users <us...@open-mpi.org>
To: us...@open-mpi.org
Subject: Re: [OMPI users] Performance question about OpenMPI and MVAPICH2 on IB

Hi Neeraj,

Were there specific collectives that were slower?  Also, what kind of cluster
were you running on?  How many nodes and cores per node?

thanks,

--td

> Message: 3
> Date: Fri, 7 Aug 2009 16:51:05 +0530
> From: nee...@crlindia.com
> Subject: Re: [OMPI users] Performance question about OpenMPI and
>          MVAPICH2 on IB
> To: Open MPI Users <us...@open-mpi.org>
>
> Hi Terry,
>
>     I feel the hierarchical collectives are slower compared to the tuned
> ones.  I had done some benchmarking in the past, specific to collectives,
> and this is what I found.
>
> Regards
>
> Neeraj Chourasia (MTS)
> Computational Research Laboratories Ltd.
> (A wholly Owned Subsidiary of TATA SONS Ltd)
> B-101, ICC Trade Towers, Senapati Bapat Road
> Pune 411016 (Mah) INDIA
> (O) +91-20-6620 9863 (Fax) +91-20-6620 9862
> M: +91.9225520634
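On the benchmark question, one common approach is to sweep message sizes with
a collectives benchmark while forcing different algorithms in Open MPI's
"tuned" component.  The commands below are only a sketch: they assume the
Intel MPI Benchmarks (IMB-MPI1) are built against the Open MPI installation
under test and that 128 ranks on the 8-core nodes described in this thread
are used; the list of valid algorithm ids should be checked with
"ompi_info --param coll tuned".

    # sweep Allreduce message sizes with the default algorithm selection
    mpirun -np 128 --mca btl openib,sm,self ./IMB-MPI1 -npmin 128 Allreduce

    # repeat while forcing a specific allreduce algorithm (e.g. 3),
    # to find the sweet spot for this fabric and job size
    mpirun -np 128 --mca btl openib,sm,self \
           --mca coll_tuned_use_dynamic_rules 1 \
           --mca coll_tuned_allreduce_algorithm 3 \
           ./IMB-MPI1 -npmin 128 Allreduce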
Re: [OMPI users] Performance question about OpenMPI and MVAPICH2 on IB
Terry Dontje wrote:
> Craig,
>
> Did your affinity script bind the processes per socket or linearly to
> cores?  If the former, you'll want to look at using rankfiles and placing
> the ranks based on sockets.  We have found this especially useful if you
> are not running fully subscribed on your machines.
>
> Also, if you think the main issue is collectives performance, you may want
> to try using the hierarchical and SM collectives.  However, be forewarned
> that we are right now trying to pound out some errors with these modules.
> To enable them you add the following parameters:
> "--mca coll_hierarch_priority 100 --mca coll_sm_priority 100".
> We would be very interested in any results you get (failures, improvements,
> non-improvements).

Adding these two options causes the code to segfault at startup.

Craig

> thanks,
>
> --td
>
>> Message: 4
>> Date: Thu, 06 Aug 2009 17:03:08 -0600
>> From: Craig Tierney <craig.tier...@noaa.gov>
>> Subject: Re: [OMPI users] Performance question about OpenMPI and
>>          MVAPICH2 on IB
>> To: Open MPI Users <us...@open-mpi.org>
>>
>> A followup
>>
>> Part of the problem was affinity.  I had written a script to do processor
>> and memory affinity (which works fine with MVAPICH2).  It is an idea that
>> I got from TACC.  However, the script didn't seem to work correctly with
>> OpenMPI (or I still have bugs).
>>
>> Setting --mca mpi_paffinity_alone 1 made things better.  However, the
>> performance is still not as good:
>>
>> Cores   Mvapich2   Openmpi
>> --------------------------
>>     8      17.3      17.3
>>    16      31.7      31.5
>>    32      62.9      62.8
>>    64     110.8     108.0
>>   128     219.2     201.4
>>   256     384.5     342.7
>>   512     687.2     537.6
>>
>> The performance numbers are GFlops (so larger is better).
>>
>> The first few numbers show that the executable is the right speed.  I
>> verified that IB is being used by running OMB and checking latency and
>> bandwidth.  Those numbers are what I expect (3 GB/s, 1.5 µs for QDR).
>>
>> However, the OpenMPI version is not scaling as well.  Any ideas on why
>> that might be the case?
>>
>> Thanks,
>> Craig
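For the rankfile suggestion, a minimal sketch for one of the dual-socket
quad-core nodes in this thread might look like the following.  The host name
"node01" and the socket:core numbering are placeholders, and the exact slot
syntax should be checked against the Open MPI 1.3 rankfile documentation:

    # my_rankfile: spread four ranks round-robin across the two sockets of node01
    rank 0=node01 slot=0:0
    rank 1=node01 slot=1:0
    rank 2=node01 slot=0:1
    rank 3=node01 slot=1:1

    mpirun -np 4 -rf my_rankfile ./wrf.exe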
Re: [OMPI users] Performance question about OpenMPI and MVAPICH2 on IB
nee...@crlindia.com wrote:
> Hi Craig,
>
>     How was the nodefile selected for execution?  Was it provided by the
> scheduler (LSF/SGE/PBS), or did you give it manually?  With WRF, we
> observed that giving sequential nodes (blades in the same order as in the
> enclosure) gave us some performance benefit.
>
> Regards
>
> Neeraj Chourasia (MTS)
> Computational Research Laboratories Ltd.
> (A wholly Owned Subsidiary of TATA SONS Ltd)
> B-101, ICC Trade Towers, Senapati Bapat Road
> Pune 411016 (Mah) INDIA
> (O) +91-20-6620 9863 (Fax) +91-20-6620 9862
> M: +91.9225520634

I figured this might be the case.  Right now the batch system is giving the
nodes to the application.  They are not sorted, and I have considered doing
that.  I have also launched numerous cases of one problem size, and I don't
get that much variation in run time -- not enough to explain the differences
between the MPI stacks.

Craig

*Craig Tierney <craig.tier...@noaa.gov>*
Sent by: users-boun...@open-mpi.org
08/07/2009 04:43 AM
Please respond to Open MPI Users <us...@open-mpi.org>
To: Open MPI Users <us...@open-mpi.org>
Subject: Re: [OMPI users] Performance question about OpenMPI and MVAPICH2 on IB

Gus Correa wrote:
> Hi Craig, list
>
> I suppose WRF uses MPI collective calls (MPI_Reduce, MPI_Bcast,
> MPI_Alltoall etc), just like the climate models we run here do.
> A recursive grep on the source code will tell.

I will check this out.  I am not the WRF expert, but I was under the
impression that most weather models do nearest-neighbor communication, not
collectives.

> If that is the case, you may need to tune the collectives dynamically.
> We are experimenting with tuned collectives here also.
>
> Specifically, we had a scaling problem with the MITgcm (also running on an
> IB cluster) that is probably due to collectives.
> Similar problems were reported on this list before, with computational
> chemistry software.  See these threads:
> http://www.open-mpi.org/community/lists/users/2009/07/10045.php
> http://www.open-mpi.org/community/lists/users/2009/05/9419.php
>
> If WRF outputs timing information, particularly the time spent on MPI
> routines, you may also want to compare how the OpenMPI and MVAPICH versions
> fare w.r.t. MPI collectives.
>
> I hope this helps.

I will look into this.  Thanks for the ideas.

Craig

> Gus Correa
> ---------------------------------------------------------------------
> Gustavo Correa
> Lamont-Doherty Earth Observatory - Columbia University
> Palisades, NY, 10964-8000 - USA
> ---------------------------------------------------------------------
>
> Craig Tierney wrote:
>> I am running openmpi-1.3.3 on my cluster, which is using OFED-1.4.1 for
>> Infiniband support.  I am comparing performance between this version of
>> OpenMPI and Mvapich2, and seeing a very large difference in performance.
>>
>> The code I am testing is WRF v3.0.1.  I am running the 12km benchmark.
>>
>> The two builds are the exact same code and configuration files.  All I did
>> differently was use modules to switch versions of MPI, and recompile the
>> code.
>>
>> Performance:
>>
>> Cores   Mvapich2   Openmpi
>> --------------------------
>>     8      17.3      13.9
>>    16      31.7      25.9
>>    32      62.9      51.6
>>    64     110.8      92.8
>>   128     219.2     189.4
>>   256     384.5     317.8
>>   512     687.2     516.7
>>
>> The performance numbers are GFlops (so larger is better).
>>
>> I am calling openmpi as:
>>
>> /opt/openmpi/1.3.3-intel/bin/mpirun --mca plm_rsh_disable_qrsh 1 \
>>   --mca btl openib,sm,self \
>>   -machinefile /tmp/6026489.1.qntest.q/machines -x LD_LIBRARY_PATH \
>>   -np $NSLOTS /home/ctierney/bin/noaa_affinity ./wrf.exe
>>
>> So,
>>
>> Is this expected?  Are there some common-sense optimizations to use?
>> Is there a way to verify that I am really using the IB?  When I try:
>>
>> -mca bta ^tcp,openib,sm,self
>>
>> I get the errors:
>> --------------------------------------------------------------------------
>> No available btl components were found!
>>
>> This means that there are no components of this type installed on your
>> system or all the components reported that they could not be used.
>>
>> This is a fatal error; your MPI process is likely to abort.  Check the
>> output of the "ompi_info" command and ensure that components of this
>> type are available on your system.  You may also wish to check the
>> value of the "component_path" MCA parameter and ensure that it has at
>> least one directory that contains valid MCA components.
>> --------------------------------------------------------------------------
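If node ordering does turn out to matter, one low-effort experiment is to
sort the scheduler-provided machinefile before handing it to mpirun.  This is
only a sketch: whether a plain lexical sort actually reproduces the enclosure
ordering depends entirely on the site's host-naming scheme.

    # hypothetical extra step in the job script, reusing the SGE machinefile
    sort /tmp/6026489.1.qntest.q/machines > machines.sorted
    mpirun -machinefile ./machines.sorted -np $NSLOTS \
           /home/ctierney/bin/noaa_affinity ./wrf.exe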
Re: [OMPI users] Performance question about OpenMPI and MVAPICH2 on IB
Terry Dontje wrote:
> Craig,
>
> Did your affinity script bind the processes per socket or linearly to
> cores?  If the former, you'll want to look at using rankfiles and placing
> the ranks based on sockets.  We have found this especially useful if you
> are not running fully subscribed on your machines.

The script binds them to sockets and also binds memory per node.  It is smart
enough that if the machine_file does not use all the cores (because the user
reordered them), the script will lay out the tasks evenly between the two
sockets.

> Also, if you think the main issue is collectives performance, you may want
> to try using the hierarchical and SM collectives.  However, be forewarned
> that we are right now trying to pound out some errors with these modules.
> To enable them you add the following parameters:
> "--mca coll_hierarch_priority 100 --mca coll_sm_priority 100".
> We would be very interested in any results you get (failures, improvements,
> non-improvements).

I don't know why it is slow.  OpenMPI is so flexible in how the stack can be
tuned.  But I also have hundreds of users running dozens of major codes, and
what I need is a set of options that 'just work' in most cases.

I will try the above options and get back to you.

Craig

> thanks,
>
> --td
>
>> Message: 4
>> Date: Thu, 06 Aug 2009 17:03:08 -0600
>> From: Craig Tierney <craig.tier...@noaa.gov>
>> Subject: Re: [OMPI users] Performance question about OpenMPI and
>>          MVAPICH2 on IB
>> To: Open MPI Users <us...@open-mpi.org>
>>
>> A followup
>>
>> Part of the problem was affinity.  I had written a script to do processor
>> and memory affinity (which works fine with MVAPICH2).  It is an idea that
>> I got from TACC.  However, the script didn't seem to work correctly with
>> OpenMPI (or I still have bugs).
>>
>> Setting --mca mpi_paffinity_alone 1 made things better.  However, the
>> performance is still not as good:
>>
>> Cores   Mvapich2   Openmpi
>> --------------------------
>>     8      17.3      17.3
>>    16      31.7      31.5
>>    32      62.9      62.8
>>    64     110.8     108.0
>>   128     219.2     201.4
>>   256     384.5     342.7
>>   512     687.2     537.6
>>
>> The performance numbers are GFlops (so larger is better).
>>
>> The first few numbers show that the executable is the right speed.  I
>> verified that IB is being used by running OMB and checking latency and
>> bandwidth.  Those numbers are what I expect (3 GB/s, 1.5 µs for QDR).
>>
>> However, the OpenMPI version is not scaling as well.  Any ideas on why
>> that might be the case?
>>
>> Thanks,
>> Craig
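For reference, a stripped-down version of the kind of binding wrapper being
discussed might look like the sketch below.  It is not Craig's noaa_affinity
script; it assumes a two-socket node, that numactl is installed, and that
Open MPI exports OMPI_COMM_WORLD_LOCAL_RANK to each rank (worth confirming
with "env | grep OMPI" under the 1.3.3 build in question).

    #!/bin/bash
    # bind_socket.sh: bind each local rank to one socket and its local memory,
    # round-robin across the two sockets
    lrank=${OMPI_COMM_WORLD_LOCAL_RANK:-0}
    socket=$(( lrank % 2 ))
    exec numactl --cpunodebind=$socket --membind=$socket "$@"

It would be used in place of the affinity helper in the mpirun line, e.g.
"mpirun -np $NSLOTS ./bind_socket.sh ./wrf.exe".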
Re: [OMPI users] Performance question about OpenMPI and MVAPICH2 on IB
Hi Terry,

    I feel the hierarchical collectives are slower compared to the tuned
ones.  I had done some benchmarking in the past, specific to collectives, and
this is what I found.

Regards

Neeraj Chourasia (MTS)
Computational Research Laboratories Ltd.
(A wholly Owned Subsidiary of TATA SONS Ltd)
B-101, ICC Trade Towers, Senapati Bapat Road
Pune 411016 (Mah) INDIA
(O) +91-20-6620 9863 (Fax) +91-20-6620 9862
M: +91.9225520634

Terry Dontje <terry.don...@sun.com>
Sent by: users-boun...@open-mpi.org
08/07/2009 04:35 PM
Please respond to Open MPI Users <us...@open-mpi.org>
To: us...@open-mpi.org
Subject: Re: [OMPI users] Performance question about OpenMPI and MVAPICH2 on IB

Craig,

Did your affinity script bind the processes per socket or linearly to cores?
If the former, you'll want to look at using rankfiles and placing the ranks
based on sockets.  We have found this especially useful if you are not
running fully subscribed on your machines.

Also, if you think the main issue is collectives performance, you may want to
try using the hierarchical and SM collectives.  However, be forewarned that
we are right now trying to pound out some errors with these modules.  To
enable them you add the following parameters:
"--mca coll_hierarch_priority 100 --mca coll_sm_priority 100".
We would be very interested in any results you get (failures, improvements,
non-improvements).

thanks,

--td

> Message: 4
> Date: Thu, 06 Aug 2009 17:03:08 -0600
> From: Craig Tierney <craig.tier...@noaa.gov>
> Subject: Re: [OMPI users] Performance question about OpenMPI and
>          MVAPICH2 on IB
> To: Open MPI Users <us...@open-mpi.org>
>
> A followup
>
> Part of the problem was affinity.  I had written a script to do processor
> and memory affinity (which works fine with MVAPICH2).  It is an idea that
> I got from TACC.  However, the script didn't seem to work correctly with
> OpenMPI (or I still have bugs).
>
> Setting --mca mpi_paffinity_alone 1 made things better.  However, the
> performance is still not as good:
>
> Cores   Mvapich2   Openmpi
> --------------------------
>     8      17.3      17.3
>    16      31.7      31.5
>    32      62.9      62.8
>    64     110.8     108.0
>   128     219.2     201.4
>   256     384.5     342.7
>   512     687.2     537.6
>
> The performance numbers are GFlops (so larger is better).
>
> The first few numbers show that the executable is the right speed.  I
> verified that IB is being used by running OMB and checking latency and
> bandwidth.  Those numbers are what I expect (3 GB/s, 1.5 µs for QDR).
>
> However, the OpenMPI version is not scaling as well.  Any ideas on why
> that might be the case?
>
> Thanks,
> Craig
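To see which collective components and priorities are actually in play on a
given build (and therefore whether "hierarch", "sm", or "tuned" wins the
selection), the component parameters can be listed with ompi_info; this is a
standard query with no site-specific assumptions:

    ompi_info --param coll tuned    | grep priority
    ompi_info --param coll hierarch | grep priority
    ompi_info --param coll sm       | grep priority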
Re: [OMPI users] Performance question about OpenMPI and MVAPICH2 on IB
Hi Craig,

    WRF has a pattern of talking to nearest neighbours like p+1, p-1, p+3 and
p-3, where p is the particular process.  But in addition to that, it also
uses collective calls like MPI_Bcast, MPI_Alltoallv, MPI_Allgather,
MPI_Gather, MPI_Gatherv, and MPI_Scatterv.

    Apparently the openmpi-1.3 series is no better than the 1.2 series in
terms of collectives.  But a lot of parameters have been added to tune
collectives, such as the dynamic rules file option, which overrides Open
MPI's default selection of algorithm for a particular collective operation.

    Since collectives depend heavily on your network architecture and message
size, I would suggest first fine-tuning your collectives on your network
fabric before running any scientific application.

Regards

Neeraj Chourasia (MTS)
Computational Research Laboratories Ltd.
(A wholly Owned Subsidiary of TATA SONS Ltd)
B-101, ICC Trade Towers, Senapati Bapat Road
Pune 411016 (Mah) INDIA
(O) +91-20-6620 9863 (Fax) +91-20-6620 9862
M: +91.9225520634

Craig Tierney <craig.tier...@noaa.gov>
Sent by: users-boun...@open-mpi.org
08/07/2009 04:43 AM
Please respond to Open MPI Users <us...@open-mpi.org>
To: Open MPI Users <us...@open-mpi.org>
Subject: Re: [OMPI users] Performance question about OpenMPI and MVAPICH2 on IB

Gus Correa wrote:
> Hi Craig, list
>
> I suppose WRF uses MPI collective calls (MPI_Reduce, MPI_Bcast,
> MPI_Alltoall etc), just like the climate models we run here do.
> A recursive grep on the source code will tell.

I will check this out.  I am not the WRF expert, but I was under the
impression that most weather models do nearest-neighbor communication, not
collectives.

> If that is the case, you may need to tune the collectives dynamically.
> We are experimenting with tuned collectives here also.
>
> Specifically, we had a scaling problem with the MITgcm (also running on an
> IB cluster) that is probably due to collectives.
> Similar problems were reported on this list before, with computational
> chemistry software.  See these threads:
> http://www.open-mpi.org/community/lists/users/2009/07/10045.php
> http://www.open-mpi.org/community/lists/users/2009/05/9419.php
>
> If WRF outputs timing information, particularly the time spent on MPI
> routines, you may also want to compare how the OpenMPI and MVAPICH versions
> fare w.r.t. MPI collectives.
>
> I hope this helps.

I will look into this.  Thanks for the ideas.

Craig

> Gus Correa
> ---------------------------------------------------------------------
> Gustavo Correa
> Lamont-Doherty Earth Observatory - Columbia University
> Palisades, NY, 10964-8000 - USA
> ---------------------------------------------------------------------
>
> Craig Tierney wrote:
>> I am running openmpi-1.3.3 on my cluster, which is using OFED-1.4.1 for
>> Infiniband support.  I am comparing performance between this version of
>> OpenMPI and Mvapich2, and seeing a very large difference in performance.
>>
>> The code I am testing is WRF v3.0.1.  I am running the 12km benchmark.
>>
>> The two builds are the exact same code and configuration files.  All I did
>> differently was use modules to switch versions of MPI, and recompile the
>> code.
>>
>> Performance:
>>
>> Cores   Mvapich2   Openmpi
>> --------------------------
>>     8      17.3      13.9
>>    16      31.7      25.9
>>    32      62.9      51.6
>>    64     110.8      92.8
>>   128     219.2     189.4
>>   256     384.5     317.8
>>   512     687.2     516.7
>>
>> The performance numbers are GFlops (so larger is better).
>>
>> I am calling openmpi as:
>>
>> /opt/openmpi/1.3.3-intel/bin/mpirun --mca plm_rsh_disable_qrsh 1 \
>>   --mca btl openib,sm,self \
>>   -machinefile /tmp/6026489.1.qntest.q/machines -x LD_LIBRARY_PATH \
>>   -np $NSLOTS /home/ctierney/bin/noaa_affinity ./wrf.exe
>>
>> So,
>>
>> Is this expected?  Are there some common-sense optimizations to use?
>> Is there a way to verify that I am really using the IB?  When I try:
>>
>> -mca bta ^tcp,openib,sm,self
>>
>> I get the errors:
>> --------------------------------------------------------------------------
>> No available btl components were found!
>>
>> This means that there are no components of this type installed on your
>> system or all the components reported that they could not be used.
>>
>> This is a fatal error; your MPI process is likely to abort.  Check the
>> output of the "ompi_info" command and ensure that components of this
>> type are available on your system.  You may also wish to check the
>> value of the "component_path" MCA parameter and ensure that it has at
>> least one directory that contains valid MCA components.
>> --------------------------------------------------------------------------
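As a concrete starting point for the tuning Neeraj describes, the relevant
knobs live in Open MPI's "tuned" collective component and can be set on the
mpirun command line.  Treat the lines below as a sketch rather than a recipe:
the rules file name is a placeholder, its format is described in the Open MPI
documentation, and algorithm ids should be verified with
"ompi_info --param coll tuned".

    # point the tuned component at a user-supplied decision table
    mpirun --mca coll_tuned_use_dynamic_rules 1 \
           --mca coll_tuned_dynamic_rules_filename /path/to/coll_rules.conf \
           -np $NSLOTS ./wrf.exe

    # or override a single collective's algorithm directly (here bcast)
    mpirun --mca coll_tuned_use_dynamic_rules 1 \
           --mca coll_tuned_bcast_algorithm 6 \
           -np $NSLOTS ./wrf.exe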
Re: [OMPI users] Performance question about OpenMPI and MVAPICH2 on IB
Craig,

Let me look at your script, if you'd like...  I may be able to help there.
I've also been seeing some "interesting" results for WRF on OpenMPI, and we
may want to see if we're taking complementary approaches...

gerry

Craig Tierney wrote:

A followup

Part of the problem was affinity.  I had written a script to do processor and
memory affinity (which works fine with MVAPICH2).  It is an idea that I got
from TACC.  However, the script didn't seem to work correctly with OpenMPI
(or I still have bugs).

Setting --mca mpi_paffinity_alone 1 made things better.  However, the
performance is still not as good:

Cores   Mvapich2   Openmpi
--------------------------
    8      17.3      17.3
   16      31.7      31.5
   32      62.9      62.8
   64     110.8     108.0
  128     219.2     201.4
  256     384.5     342.7
  512     687.2     537.6

The performance numbers are GFlops (so larger is better).

The first few numbers show that the executable is the right speed.  I
verified that IB is being used by running OMB and checking latency and
bandwidth.  Those numbers are what I expect (3 GB/s, 1.5 µs for QDR).

However, the OpenMPI version is not scaling as well.  Any ideas on why that
might be the case?

Thanks,
Craig

Craig Tierney wrote:

I am running openmpi-1.3.3 on my cluster, which is using OFED-1.4.1 for
Infiniband support.  I am comparing performance between this version of
OpenMPI and Mvapich2, and seeing a very large difference in performance.

The code I am testing is WRF v3.0.1.  I am running the 12km benchmark.

The two builds are the exact same code and configuration files.  All I did
differently was use modules to switch versions of MPI, and recompile the
code.

Performance:

Cores   Mvapich2   Openmpi
--------------------------
    8      17.3      13.9
   16      31.7      25.9
   32      62.9      51.6
   64     110.8      92.8
  128     219.2     189.4
  256     384.5     317.8
  512     687.2     516.7

The performance numbers are GFlops (so larger is better).

I am calling openmpi as:

/opt/openmpi/1.3.3-intel/bin/mpirun --mca plm_rsh_disable_qrsh 1 \
  --mca btl openib,sm,self \
  -machinefile /tmp/6026489.1.qntest.q/machines -x LD_LIBRARY_PATH \
  -np $NSLOTS /home/ctierney/bin/noaa_affinity ./wrf.exe

So,

Is this expected?  Are there some common-sense optimizations to use?
Is there a way to verify that I am really using the IB?  When I try:

-mca bta ^tcp,openib,sm,self

I get the errors:
--------------------------------------------------------------------------
No available btl components were found!

This means that there are no components of this type installed on your
system or all the components reported that they could not be used.

This is a fatal error; your MPI process is likely to abort.  Check the
output of the "ompi_info" command and ensure that components of this
type are available on your system.  You may also wish to check the
value of the "component_path" MCA parameter and ensure that it has at
least one directory that contains valid MCA components.
--------------------------------------------------------------------------

But ompi_info is telling me that I have openib support:

    MCA btl: openib (MCA v2.0, API v2.0, Component v1.3.3)

Note, I did rebuild OFED and put it in a different directory and did not
rebuild OpenMPI.  However, since ompi_info isn't complaining and the
libraries are available, I am thinking that it isn't a problem.  I could be
wrong.

Thanks,
Craig

--
Gerry Creager -- gerry.crea...@tamu.edu
Texas Mesonet -- AATLT, Texas A&M University
Cell: 979.229.5301    Office: 979.458.4020    FAX: 979.862.3983
Office: 1700 Research Parkway Ste 160, TAMU, College Station, TX 77843
Re: [OMPI users] Performance question about OpenMPI and MVAPICH2 on IB
Gus Correa wrote:
> Hi Craig, list
>
> I suppose WRF uses MPI collective calls (MPI_Reduce, MPI_Bcast,
> MPI_Alltoall etc), just like the climate models we run here do.
> A recursive grep on the source code will tell.

I will check this out.  I am not the WRF expert, but I was under the
impression that most weather models do nearest-neighbor communication, not
collectives.

> If that is the case, you may need to tune the collectives dynamically.
> We are experimenting with tuned collectives here also.
>
> Specifically, we had a scaling problem with the MITgcm (also running on an
> IB cluster) that is probably due to collectives.
> Similar problems were reported on this list before, with computational
> chemistry software.  See these threads:
> http://www.open-mpi.org/community/lists/users/2009/07/10045.php
> http://www.open-mpi.org/community/lists/users/2009/05/9419.php
>
> If WRF outputs timing information, particularly the time spent on MPI
> routines, you may also want to compare how the OpenMPI and MVAPICH versions
> fare w.r.t. MPI collectives.
>
> I hope this helps.

I will look into this.  Thanks for the ideas.

Craig

> Gus Correa
> ---------------------------------------------------------------------
> Gustavo Correa
> Lamont-Doherty Earth Observatory - Columbia University
> Palisades, NY, 10964-8000 - USA
> ---------------------------------------------------------------------
>
> Craig Tierney wrote:
>> I am running openmpi-1.3.3 on my cluster, which is using OFED-1.4.1 for
>> Infiniband support.  I am comparing performance between this version of
>> OpenMPI and Mvapich2, and seeing a very large difference in performance.
>>
>> The code I am testing is WRF v3.0.1.  I am running the 12km benchmark.
>>
>> The two builds are the exact same code and configuration files.  All I did
>> differently was use modules to switch versions of MPI, and recompile the
>> code.
>>
>> Performance:
>>
>> Cores   Mvapich2   Openmpi
>> --------------------------
>>     8      17.3      13.9
>>    16      31.7      25.9
>>    32      62.9      51.6
>>    64     110.8      92.8
>>   128     219.2     189.4
>>   256     384.5     317.8
>>   512     687.2     516.7
>>
>> The performance numbers are GFlops (so larger is better).
>>
>> I am calling openmpi as:
>>
>> /opt/openmpi/1.3.3-intel/bin/mpirun --mca plm_rsh_disable_qrsh 1 \
>>   --mca btl openib,sm,self \
>>   -machinefile /tmp/6026489.1.qntest.q/machines -x LD_LIBRARY_PATH \
>>   -np $NSLOTS /home/ctierney/bin/noaa_affinity ./wrf.exe
>>
>> So,
>>
>> Is this expected?  Are there some common-sense optimizations to use?
>> Is there a way to verify that I am really using the IB?  When I try:
>>
>> -mca bta ^tcp,openib,sm,self
>>
>> I get the errors:
>> --------------------------------------------------------------------------
>> No available btl components were found!
>>
>> This means that there are no components of this type installed on your
>> system or all the components reported that they could not be used.
>>
>> This is a fatal error; your MPI process is likely to abort.  Check the
>> output of the "ompi_info" command and ensure that components of this
>> type are available on your system.  You may also wish to check the
>> value of the "component_path" MCA parameter and ensure that it has at
>> least one directory that contains valid MCA components.
>> --------------------------------------------------------------------------
>>
>> But ompi_info is telling me that I have openib support:
>>
>>     MCA btl: openib (MCA v2.0, API v2.0, Component v1.3.3)
>>
>> Note, I did rebuild OFED and put it in a different directory and did not
>> rebuild OpenMPI.  However, since ompi_info isn't complaining and the
>> libraries are available, I am thinking that it isn't a problem.  I could
>> be wrong.
>>
>> Thanks,
>> Craig

--
Craig Tierney (craig.tier...@noaa.gov)
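The recursive grep Gus suggests is quick to do.  A sketch (the "WRFV3"
directory name stands in for whatever the local WRF source tree is called,
and the pattern list is not exhaustive):

    # list WRF source files that call collectives, case-insensitively
    grep -rlIiE 'mpi_(allreduce|alltoall|allgather|gather|scatter|bcast|reduce)' WRFV3/ | sort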
Re: [OMPI users] Performance question about OpenMPI and MVAPICH2 on IB
A followup

Part of the problem was affinity.  I had written a script to do processor and
memory affinity (which works fine with MVAPICH2).  It is an idea that I got
from TACC.  However, the script didn't seem to work correctly with OpenMPI
(or I still have bugs).

Setting --mca mpi_paffinity_alone 1 made things better.  However, the
performance is still not as good:

Cores   Mvapich2   Openmpi
--------------------------
    8      17.3      17.3
   16      31.7      31.5
   32      62.9      62.8
   64     110.8     108.0
  128     219.2     201.4
  256     384.5     342.7
  512     687.2     537.6

The performance numbers are GFlops (so larger is better).

The first few numbers show that the executable is the right speed.  I
verified that IB is being used by running OMB and checking latency and
bandwidth.  Those numbers are what I expect (3 GB/s, 1.5 µs for QDR).

However, the OpenMPI version is not scaling as well.  Any ideas on why that
might be the case?

Thanks,
Craig

Craig Tierney wrote:
> I am running openmpi-1.3.3 on my cluster, which is using OFED-1.4.1 for
> Infiniband support.  I am comparing performance between this version of
> OpenMPI and Mvapich2, and seeing a very large difference in performance.
>
> The code I am testing is WRF v3.0.1.  I am running the 12km benchmark.
>
> The two builds are the exact same code and configuration files.  All I did
> differently was use modules to switch versions of MPI, and recompile the
> code.
>
> Performance:
>
> Cores   Mvapich2   Openmpi
> --------------------------
>     8      17.3      13.9
>    16      31.7      25.9
>    32      62.9      51.6
>    64     110.8      92.8
>   128     219.2     189.4
>   256     384.5     317.8
>   512     687.2     516.7
>
> The performance numbers are GFlops (so larger is better).
>
> I am calling openmpi as:
>
> /opt/openmpi/1.3.3-intel/bin/mpirun --mca plm_rsh_disable_qrsh 1 \
>   --mca btl openib,sm,self \
>   -machinefile /tmp/6026489.1.qntest.q/machines -x LD_LIBRARY_PATH \
>   -np $NSLOTS /home/ctierney/bin/noaa_affinity ./wrf.exe
>
> So,
>
> Is this expected?  Are there some common-sense optimizations to use?
> Is there a way to verify that I am really using the IB?  When I try:
>
> -mca bta ^tcp,openib,sm,self
>
> I get the errors:
> --------------------------------------------------------------------------
> No available btl components were found!
>
> This means that there are no components of this type installed on your
> system or all the components reported that they could not be used.
>
> This is a fatal error; your MPI process is likely to abort.  Check the
> output of the "ompi_info" command and ensure that components of this
> type are available on your system.  You may also wish to check the
> value of the "component_path" MCA parameter and ensure that it has at
> least one directory that contains valid MCA components.
> --------------------------------------------------------------------------
>
> But ompi_info is telling me that I have openib support:
>
>     MCA btl: openib (MCA v2.0, API v2.0, Component v1.3.3)
>
> Note, I did rebuild OFED and put it in a different directory and did not
> rebuild OpenMPI.  However, since ompi_info isn't complaining and the
> libraries are available, I am thinking that it isn't a problem.  I could
> be wrong.
>
> Thanks,
> Craig

--
Craig Tierney (craig.tier...@noaa.gov)
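One quick sanity check that mpi_paffinity_alone (or an affinity wrapper)
really pinned the ranks is to inspect the affinity masks of the running
processes on a compute node.  This assumes a Linux node with taskset
available and that the binary is named wrf.exe:

    # on one compute node, while the job is running
    for p in $(pgrep -f wrf.exe); do taskset -pc $p; done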
Re: [OMPI users] Performance question about OpenMPI and MVAPICH2 on IB
Hi Craig, list

I suppose WRF uses MPI collective calls (MPI_Reduce, MPI_Bcast, MPI_Alltoall
etc), just like the climate models we run here do.
A recursive grep on the source code will tell.

If that is the case, you may need to tune the collectives dynamically.
We are experimenting with tuned collectives here also.

Specifically, we had a scaling problem with the MITgcm (also running on an IB
cluster) that is probably due to collectives.
Similar problems were reported on this list before, with computational
chemistry software.  See these threads:
http://www.open-mpi.org/community/lists/users/2009/07/10045.php
http://www.open-mpi.org/community/lists/users/2009/05/9419.php

If WRF outputs timing information, particularly the time spent on MPI
routines, you may also want to compare how the OpenMPI and MVAPICH versions
fare w.r.t. MPI collectives.

I hope this helps.

Gus Correa
---------------------------------------------------------------------
Gustavo Correa
Lamont-Doherty Earth Observatory - Columbia University
Palisades, NY, 10964-8000 - USA
---------------------------------------------------------------------

Craig Tierney wrote:

I am running openmpi-1.3.3 on my cluster, which is using OFED-1.4.1 for
Infiniband support.  I am comparing performance between this version of
OpenMPI and Mvapich2, and seeing a very large difference in performance.

The code I am testing is WRF v3.0.1.  I am running the 12km benchmark.

The two builds are the exact same code and configuration files.  All I did
differently was use modules to switch versions of MPI, and recompile the
code.

Performance:

Cores   Mvapich2   Openmpi
--------------------------
    8      17.3      13.9
   16      31.7      25.9
   32      62.9      51.6
   64     110.8      92.8
  128     219.2     189.4
  256     384.5     317.8
  512     687.2     516.7

The performance numbers are GFlops (so larger is better).

I am calling openmpi as:

/opt/openmpi/1.3.3-intel/bin/mpirun --mca plm_rsh_disable_qrsh 1 \
  --mca btl openib,sm,self \
  -machinefile /tmp/6026489.1.qntest.q/machines -x LD_LIBRARY_PATH \
  -np $NSLOTS /home/ctierney/bin/noaa_affinity ./wrf.exe

So,

Is this expected?  Are there some common-sense optimizations to use?
Is there a way to verify that I am really using the IB?  When I try:

-mca bta ^tcp,openib,sm,self

I get the errors:
--------------------------------------------------------------------------
No available btl components were found!

This means that there are no components of this type installed on your
system or all the components reported that they could not be used.

This is a fatal error; your MPI process is likely to abort.  Check the
output of the "ompi_info" command and ensure that components of this
type are available on your system.  You may also wish to check the
value of the "component_path" MCA parameter and ensure that it has at
least one directory that contains valid MCA components.
--------------------------------------------------------------------------

But ompi_info is telling me that I have openib support:

    MCA btl: openib (MCA v2.0, API v2.0, Component v1.3.3)

Note, I did rebuild OFED and put it in a different directory and did not
rebuild OpenMPI.  However, since ompi_info isn't complaining and the
libraries are available, I am thinking that it isn't a problem.  I could be
wrong.

Thanks,
Craig
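On the "am I really using the IB" question, two checks that do not require
rebuilding anything (this is only a sketch: btl_base_verbose is a standard
Open MPI MCA parameter and perfquery ships with the OFED diagnostics, but the
exact output varies by version, and osu_bw here stands in for any small MPI
test program such as the OMB binaries already used in this thread):

    # make the BTL selection visible at startup; openib should be reported
    mpirun -np 2 --mca btl openib,sm,self --mca btl_base_verbose 30 ./osu_bw

    # or compare the local HCA port data counters before and after a run
    perfquery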