Re: [OMPI users] Performance question about OpenMPI and MVAPICH2 on IB

2009-08-07 Thread Gus Correa

Hi Craig, Terry, Neeraj, list

Craig:  A fellow here runs WRF.
I grepped the code, and there are plenty of collectives in it:
MPI_[All]Gather[v], MPI_[All]Reduce, etc.
Domain-decomposition codes like WRF, MITgcm, and other atmosphere
and ocean models have point-to-point communication to exchange
subdomain boundaries, but also collective operations to compute
sums, etc., in the various types of PDE (matrix) solvers that require
global information.

Terry: On the MITgcm, the apparent culprit is MPI_Allreduce,
which seems to be bad on **small** messages (rather than big ones).
This is the same behavior pattern that was reported here in May,
regarding MPI_Alltoall, by Roman Martonak, a list subscriber using a 
computational chemistry package on an IB cluster:


http://www.open-mpi.org/community/lists/users/2009/07/10045.php
http://www.open-mpi.org/community/lists/users/2009/05/9419.php

At that point Pavel Shamis, Peter Kjellstrom, and others gave
very good suggestions, but they were only focused on MPI_Alltoall.
No other collectives were considered.

All:  Any insights on how to tune MPI_Allreduce?
Maybe a hint on the other collectives also?
Any benchmark tool available that one can use to find the
sweet spot of each collective?

Many thanks,
Gus Correa
-
Gustavo Correa
Lamont-Doherty Earth Observatory - Columbia University
Palisades, NY, 10964-8000 - USA
-




nee...@crlindia.com wrote:

Hi Terry,
   
I had tested mostly MPI_Bcast, MPI_Reduce, and MPI_Gather types of
collectives with openmpi-1.3 and the hierarchical option enabled.  In all
these, I found the results slower than the regular tuned collectives.


We have HP blades with Intel Clovertown processors (two quad-core
processors per node), connected with a DDR InfiniBand Clos network.

Results were measured on 12-16 nodes with 8 MPI processes per node.


Regards

Neeraj Chourasia (MTS)
Computational Research Laboratories Ltd.
(A wholly Owned Subsidiary of TATA SONS Ltd)
B-101, ICC Trade Towers, Senapati Bapat Road
Pune 411016 (Mah) INDIA
(O) +91-20-6620 9863  (Fax) +91-20-6620 9862
M: +91.9225520634



*Terry Dontje <terry.don...@sun.com>*
Sent by: users-boun...@open-mpi.org

08/07/2009 05:15 PM
Please respond to
Open MPI Users <us...@open-mpi.org>



To
us...@open-mpi.org
cc
    
Subject
Re: [OMPI users] Performance question about OpenMPI and MVAPICH2 on IB

Hi Neeraj,

Were there specific collectives that were slower?  Also what kind of 
cluster were you running on?  How many nodes and cores per node?


thanks,

--td
 > Message: 3
 > Date: Fri, 7 Aug 2009 16:51:05 +0530
 > From: nee...@crlindia.com
 > Subject: Re: [OMPI users] Performance question about OpenMPI and
 >  MVAPICH2     on IB
 > To: Open MPI Users <us...@open-mpi.org>
 > Cc: us...@open-mpi.org, users-boun...@open-mpi.org
 > Message-ID:
 > 
 <of62a95e62.d6758124-on6525760b.003e2874-6525760b.003e1...@crlindia.com>
 >  
 > Content-Type: text/plain; charset="us-ascii"

 >
 > Hi Terry,
 >
 > I feel the hierarchical collectives are slower compared to the tuned
 > ones. I had done some benchmarking in the past specific to collectives,
 > and this is what I feel based on my observations.
 >
 > Regards
 >
 > Neeraj Chourasia (MTS)
 > Computational Research Laboratories Ltd.
 > (A wholly Owned Subsidiary of TATA SONS Ltd)
 > B-101, ICC Trade Towers, Senapati Bapat Road
 > Pune 411016 (Mah) INDIA
 > (O) +91-20-6620 9863  (Fax) +91-20-6620 9862
 > M: +91.9225520634





Re: [OMPI users] Performance question about OpenMPI and MVAPICH2 on IB

2009-08-07 Thread Craig Tierney

Terry Dontje wrote:

Craig,

Did your affinity script bind the processes per socket or linearly to 
cores?  If the former you'll want to look at using rankfiles and place 
the ranks based on sockets.  We have found this especially useful if 
you are not running fully subscribed on your machines.


Also, if you think the main issue is collectives performance you may 
want to try using the hierarchical and SM collectives.  However, be 
forewarned we are right now trying to pound out some errors with these 
modules.  To enable them you add the following parameters "--mca 
coll_hierarch_priority 100 --mca coll_sm_priority 100".  We would be 
very interested in any results you get (failures, improvements, 
non-improvements).




Adding these two options causes the code to segfault at startup.

Craig





thanks,

--td


Message: 4
Date: Thu, 06 Aug 2009 17:03:08 -0600
From: Craig Tierney <craig.tier...@noaa.gov>
Subject: Re: [OMPI users] Performance question about OpenMPI and
    MVAPICH2 on IB
To: Open MPI Users <us...@open-mpi.org>
Message-ID: <4a7b612c.8070...@noaa.gov>
Content-Type: text/plain; charset=ISO-8859-1

A followup

Part of the problem was affinity.  I had written a script to do processor
and memory affinity (which works fine with MVAPICH2).  It is an
idea that I got from TACC.  However, the script didn't seem to
work correctly with OpenMPI (or I still have bugs).

Setting --mca mpi_paffinity_alone 1 made things better.  However,
the performance is still not as good:

Cores   Mvapich2   Openmpi
---------------------------
    8       17.3      17.3
   16       31.7      31.5
   32       62.9      62.8
   64      110.8     108.0
  128      219.2     201.4
  256      384.5     342.7
  512      687.2     537.6

The performance number is GFlops (so larger is better).

The first few numbers show that the executable is the right
speed.  I verified that IB is being used by using OMB and
checking latency and bandwidth.  Those numbers are what I
expect (3 GB/s, 1.5 µs for QDR).

However, the Openmpi version is not scaling as well.  Any
ideas on why that might be the case?

Thanks,
Craig







Re: [OMPI users] Performance question about OpenMPI and MVAPICH2 on IB

2009-08-07 Thread Craig Tierney

nee...@crlindia.com wrote:

Hi Craig,

How was the nodefile selected for execution?  Was it provided
by a scheduler, say LSF/SGE/PBS, or did you give it manually?
With WRF, we observed that giving sequential nodes (blades in the
same order as in the enclosure) gave us some performance benefit.


Regards



I figured this might be the case.  Right now the batch system
is giving the nodes to the application.  They are not sorted,
and I have considered doing that.  I have also launched numerous
cases of one problem size, and I don't see enough variation
in run time to explain the differences between the MPI stacks.

Craig





Neeraj Chourasia (MTS)
Computational Research Laboratories Ltd.
(A wholly Owned Subsidiary of TATA SONS Ltd)
B-101, ICC Trade Towers, Senapati Bapat Road
Pune 411016 (Mah) INDIA
(O) +91-20-6620 9863  (Fax) +91-20-6620 9862
M: +91.9225520634



*Craig Tierney <craig.tier...@noaa.gov>*
Sent by: users-boun...@open-mpi.org

08/07/2009 04:43 AM
Please respond to
Open MPI Users <us...@open-mpi.org>



To
Open MPI Users <us...@open-mpi.org>
cc
        
Subject
Re: [OMPI users] Performance question about OpenMPI and MVAPICH2 on IB


Gus Correa wrote:
 > Hi Craig, list
 >
 > I suppose WRF uses MPI collective calls (MPI_Reduce,
 > MPI_Bcast, MPI_Alltoall etc),
 > just like the climate models we run here do.
 > A recursive grep on the source code will tell.
 >

I will check this out.  I am not the WRF expert, but
I was under the impression that most weather models use
nearest-neighbor communication, not collectives.


 > If that is the case, you may need to tune the collectives dynamically.
 > We are experimenting with tuned collectives here also.
 >
 > Specifically, we had a scaling problem with the MITgcm
 > (also running on an IB cluster)
 > that is probably due to collectives.
 > Similar problems were reported on this list before,
 > with computational chemistry software.
 > See these threads:
 > http://www.open-mpi.org/community/lists/users/2009/07/10045.php
 > http://www.open-mpi.org/community/lists/users/2009/05/9419.php
 >
 > If WRF outputs timing information, particularly the time spent on MPI
 > routines, you may also want to compare how the OpenMPI and
 > MVAPICH versions fare w.r.t. MPI collectives.
 >
 > I hope this helps.
 >

I will look into this.  Thanks for the ideas.

Craig



 > Gus Correa
 > -
 > Gustavo Correa
 > Lamont-Doherty Earth Observatory - Columbia University
 > Palisades, NY, 10964-8000 - USA
 > -
 >
 >
 >
 > Craig Tierney wrote:
 >> I am running openmpi-1.3.3 on my cluster which is using
 >> OFED-1.4.1 for Infiniband support.  I am comparing performance
 >> between this version of OpenMPI and Mvapich2, and seeing a
 >> very large difference in performance.
 >>
 >> The code I am testing is WRF v3.0.1.  I am running the
 >> 12km benchmark.
 >>
 >> The two builds are the exact same codes and configuration
 >> files.  All I did different was use modules to switch versions
 >> of MPI, and recompiled the code.
 >>
 >> Performance:
 >>
 >> Cores   Mvapich2   Openmpi
 >> ---------------------------
 >>     8       17.3      13.9
 >>    16       31.7      25.9
 >>    32       62.9      51.6
 >>    64      110.8      92.8
 >>   128      219.2     189.4
 >>   256      384.5     317.8
 >>   512      687.2     516.7
 >>
 >> The performance number is GFlops (so larger is better).
 >>
 >> I am calling openmpi as:
 >>
 >> /opt/openmpi/1.3.3-intel/bin/mpirun  --mca plm_rsh_disable_qrsh 1
 >> --mca btl openib,sm,self \
 >> -machinefile /tmp/6026489.1.qntest.q/machines -x LD_LIBRARY_PATH -np
 >> $NSLOTS /home/ctierney/bin/noaa_affinity ./wrf.exe
 >>
 >> So,
 >>
 >> Is this expected?  Are some common sense optimizations to use?
 >> Is there a way to verify that I am really using the IB?  When
 >> I try:
 >>
 >> -mca bta ^tcp,openib,sm,self
 >>
 >> I get the errors:
 >> --------------------------------------------------------------------------
 >>
 >> No available btl components were found!
 >>
 >> This means that there are no components of this type installed on your
 >> system or all the components reported that they could not be used.
 >>
 >> This is a fatal error; your MPI process is likely to abort.  Check the
 >> output of the "ompi_info" command and ensure that components of this
 >> type are available on your system.  You may also wish to check the
 >

Re: [OMPI users] Performance question about OpenMPI and MVAPICH2 on IB

2009-08-07 Thread Craig Tierney

Terry Dontje wrote:

Craig,

Did your affinity script bind the processes per socket or linearly to 
cores?  If the former you'll want to look at using rankfiles and place 
the ranks based on sockets.  We have found this especially useful if 
you are not running fully subscribed on your machines.




The script binds them to sockets and also binds memory per node.
It is smart enough that if the machine_file does not use all
the cores (because the user reordered them) then the script will
lay out the tasks evenly between the two sockets.
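
One quick sanity check I can run on a node while a job is up (a sketch
only; wrf.exe as the process name and these tools being installed on the
compute nodes are assumptions):

   # print the CPU list each wrf.exe rank is currently pinned to
   for pid in $(pgrep wrf.exe); do taskset -cp $pid; done
   # print the socket/memory layout those masks should line up with
   numactl --hardware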

Also, if you think the main issue is collectives performance you may 
want to try using the hierarchical and SM collectives.  However, be 
forewarned we are right now trying to pound out some errors with these 
modules.  To enable them you add the following parameters "--mca 
coll_hierarch_priority 100 --mca coll_sm_priority 100".  We would be 
very interested in any results you get (failures, improvements, 
non-improvements).




I don't know why it is slow.  OpenMPI is so flexible in how the
stack can be tuned.  But I also have hundreds of users running dozens
of major codes, and what I need is a set of options that 'just work'
in most cases.

I will try the above options and get back to you.

Craig





thanks,

--td


Message: 4
Date: Thu, 06 Aug 2009 17:03:08 -0600
From: Craig Tierney <craig.tier...@noaa.gov>
Subject: Re: [OMPI users] Performance question about OpenMPI and
    MVAPICH2 on IB
To: Open MPI Users <us...@open-mpi.org>
Message-ID: <4a7b612c.8070...@noaa.gov>
Content-Type: text/plain; charset=ISO-8859-1

A followup

Part of the problem was affinity.  I had written a script to do processor
and memory affinity (which works fine with MVAPICH2).  It is an
idea that I got from TACC.  However, the script didn't seem to
work correctly with OpenMPI (or I still have bugs).

Setting --mca mpi_paffinity_alone 1 made things better.  However,
the performance is still not as good:

Cores   Mvapich2   Openmpi
---------------------------
    8       17.3      17.3
   16       31.7      31.5
   32       62.9      62.8
   64      110.8     108.0
  128      219.2     201.4
  256      384.5     342.7
  512      687.2     537.6

The performance number is GFlops (so larger is better).

The first few numbers show that the executable is the right
speed.  I verified that IB is being used by using OMB and
checking latency and bandwidth.  Those numbers are what I
expect (3 GB/s, 1.5 µs for QDR).

However, the Openmpi version is not scaling as well.  Any
ideas on why that might be the case?

Thanks,
Craig







Re: [OMPI users] Performance question about OpenMPI and MVAPICH2 on IB

2009-08-07 Thread neeraj
Hi Terry,

I feel the hierarchical collectives are slower compared to the tuned
ones. I had done some benchmarking in the past specific to collectives,
and this is what I feel based on my observations.

Regards

Neeraj Chourasia (MTS)
Computational Research Laboratories Ltd.
(A wholly Owned Subsidiary of TATA SONS Ltd)
B-101, ICC Trade Towers, Senapati Bapat Road
Pune 411016 (Mah) INDIA
(O) +91-20-6620 9863  (Fax) +91-20-6620 9862
M: +91.9225520634




Terry Dontje <terry.don...@sun.com> 
Sent by: users-boun...@open-mpi.org
08/07/2009 04:35 PM
Please respond to
Open MPI Users <us...@open-mpi.org>


To
us...@open-mpi.org
cc

Subject
Re: [OMPI users] Performance question about OpenMPI and MVAPICH2 on IB

Craig,

Did your affinity script bind the processes per socket or linearly to 
cores?  If the former you'll want to look at using rankfiles and place the 
ranks based on sockets.  We have found this especially useful if you are 
not running fully subscribed on your machines.

Also, if you think the main issue is collectives performance you may want 
to try using the hierarchical and SM collectives.  However, be forewarned 
we are right now trying to pound out some errors with these modules.  To 
enable them you add the following parameters "--mca coll_hierarch_priority 
100 --mca coll_sm_priority 100".  We would be very interested in any 
results you get (failures, improvements, non-improvements).
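
For concreteness, a per-socket rankfile could look like the sketch below
(hostnames are placeholders and the slot syntax should be checked against
the mpirun man page for your build); it can be combined with the
collective settings above:

   # rankfile: even ranks on socket 0, odd ranks on socket 1
   rank 0=node01 slot=0:0
   rank 1=node01 slot=1:0
   rank 2=node02 slot=0:0
   rank 3=node02 slot=1:0

   # hypothetical launch combining the rankfile with the hierarch/sm collectives
   mpirun -np 4 -rf ./rankfile \
          --mca coll_hierarch_priority 100 --mca coll_sm_priority 100 ./wrf.exe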

thanks,

--td

> Message: 4
> Date: Thu, 06 Aug 2009 17:03:08 -0600
> From: Craig Tierney <craig.tier...@noaa.gov>
> Subject: Re: [OMPI users] Performance question about OpenMPI and
>MVAPICH2 on IB
> To: Open MPI Users <us...@open-mpi.org>
> Message-ID: <4a7b612c.8070...@noaa.gov>
> Content-Type: text/plain; charset=ISO-8859-1
>
> A followup
>
> Part of the problem was affinity.  I had written a script to do processor
> and memory affinity (which works fine with MVAPICH2).  It is an
> idea that I got from TACC.  However, the script didn't seem to
> work correctly with OpenMPI (or I still have bugs).
>
> Setting --mca mpi_paffinity_alone 1 made things better.  However,
> the performance is still not as good:
>
> Cores   Mvapich2   Openmpi
> ---------------------------
>     8       17.3      17.3
>    16       31.7      31.5
>    32       62.9      62.8
>    64      110.8     108.0
>   128      219.2     201.4
>   256      384.5     342.7
>   512      687.2     537.6
>
> The performance number is GFlops (so larger is better).
>
> The first few numbers show that the executable is the right
> speed.  I verified that IB is being used by using OMB and
> checking latency and bandwidth.  Those numbers are what I
> expect (3 GB/s, 1.5 µs for QDR).
>
> However, the Openmpi version is not scaling as well.  Any
> ideas on why that might be the case?
>
> Thanks,
> Craig



Re: [OMPI users] Performance question about OpenMPI and MVAPICH2 on IB

2009-08-07 Thread neeraj
Hi Craig,

WRF has a pattern of talking to nearest neighbours such as p+1, p-1,
p+3 and p-3, where p is the particular process.  But in addition to that,
it also uses collective calls such as MPI_Bcast, MPI_Alltoallv,
MPI_Allgather, MPI_Gather, MPI_Gatherv, and MPI_Scatterv.

Apparently the openmpi-1.3 series is not better in terms of
collectives compared to the 1.2 series.  But a lot of parameters have
been added to tune the collectives, such as a dynamic-rules file option
that overrides OpenMPI's default selection of algorithm for a particular
collective operation.

Since collectives depend heavily on your network architecture and
message size, I would suggest first fine-tuning the collectives on your
network fabric before running any scientific application.
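
As a sketch (parameter names as I recall them for the tuned component
in 1.3; please verify them with ompi_info on your build), the
dynamic-rules route looks roughly like this:

   # list the algorithms and knobs the tuned component exposes
   ompi_info --param coll tuned
   # point the tuned component at a hand-written rules file
   mpirun --mca coll_tuned_use_dynamic_rules 1 \
          --mca coll_tuned_dynamic_rules_filename ./coll_rules.conf ...

where coll_rules.conf maps communicator and message sizes to a chosen
algorithm for each collective.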

Regards

Neeraj Chourasia (MTS)
Computational Research Laboratories Ltd.
(A wholly Owned Subsidiary of TATA SONS Ltd)
B-101, ICC Trade Towers, Senapati Bapat Road
Pune 411016 (Mah) INDIA
(O) +91-20-6620 9863  (Fax) +91-20-6620 9862
M: +91.9225520634




Craig Tierney <craig.tier...@noaa.gov> 
Sent by: users-boun...@open-mpi.org
08/07/2009 04:43 AM
Please respond to
Open MPI Users <us...@open-mpi.org>


To
Open MPI Users <us...@open-mpi.org>
cc

Subject
Re: [OMPI users] Performance question about OpenMPI and MVAPICH2 on IB






Gus Correa wrote:
> Hi Craig, list
> 
> I suppose WRF uses MPI collective calls (MPI_Reduce,
> MPI_Bcast, MPI_Alltoall etc),
> just like the climate models we run here do.
> A recursive grep on the source code will tell.
> 

I will check this out.  I am not the WRF expert, but
I was under the impression that most weather models use
nearest-neighbor communication, not collectives.


> If that is the case, you may need to tune the collectives dynamically.
> We are experimenting with tuned collectives here also.
> 
> Specifically, we had a scaling problem with the MITgcm
> (also running on an IB cluster)
> that is probably due to collectives.
> Similar problems were reported on this list before,
> with computational chemistry software.
> See these threads:
> http://www.open-mpi.org/community/lists/users/2009/07/10045.php
> http://www.open-mpi.org/community/lists/users/2009/05/9419.php
> 
> If WRF outputs timing information, particularly the time spent on MPI
> routines, you may also want to compare how the OpenMPI and
> MVAPICH versions fare w.r.t. MPI collectives.
> 
> I hope this helps.
> 

I will look into this.  Thanks for the ideas.

Craig



> Gus Correa
> -
> Gustavo Correa
> Lamont-Doherty Earth Observatory - Columbia University
> Palisades, NY, 10964-8000 - USA
> -
> 
> 
> 
> Craig Tierney wrote:
>> I am running openmpi-1.3.3 on my cluster which is using
>> OFED-1.4.1 for Infiniband support.  I am comparing performance
>> between this version of OpenMPI and Mvapich2, and seeing a
>> very large difference in performance.
>>
>> The code I am testing is WRF v3.0.1.  I am running the
>> 12km benchmark.
>>
>> The two builds are the exact same codes and configuration
>> files.  All I did different was use modules to switch versions
>> of MPI, and recompiled the code.
>>
>> Performance:
>>
 >> Cores   Mvapich2   Openmpi
 >> ---------------------------
 >>     8       17.3      13.9
 >>    16       31.7      25.9
 >>    32       62.9      51.6
 >>    64      110.8      92.8
 >>   128      219.2     189.4
 >>   256      384.5     317.8
 >>   512      687.2     516.7
>>
>> The performance number is GFlops (so larger is better).
>>
>> I am calling openmpi as:
>>
>> /opt/openmpi/1.3.3-intel/bin/mpirun  --mca plm_rsh_disable_qrsh 1
>> --mca btl openib,sm,self \
>> -machinefile /tmp/6026489.1.qntest.q/machines -x LD_LIBRARY_PATH -np
>> $NSLOTS /home/ctierney/bin/noaa_affinity ./wrf.exe
>>
>> So,
>>
>> Is this expected?  Are some common sense optimizations to use?
>> Is there a way to verify that I am really using the IB?  When
>> I try:
>>
>> -mca bta ^tcp,openib,sm,self
>>
>> I get the errors:
>> --------------------------------------------------------------------------
>>
>> No available btl components were found!
>>
>> This means that there are no components of this type installed on your
>> system or all the components reported that they could not be used.
>>
>> This is a fatal error; your MPI process is likely to abort.  Check the
>> output of the "ompi_info" command and ensure that components of this
>> type are available on your s

Re: [OMPI users] Performance question about OpenMPI and MVAPICH2 on IB

2009-08-06 Thread Gerry Creager

Craig,

Let me look at your script, if you'd like... I may be able to help 
there.  I've also been seeing some "interesting" results for WRF on 
OpenMPI, and we may want to see if we're taking complementary approaches...


gerry

Craig Tierney wrote:

A followup

Part of the problem was affinity.  I had written a script to do processor
and memory affinity (which works fine with MVAPICH2).  It is an
idea that I got from TACC.  However, the script didn't seem to
work correctly with OpenMPI (or I still have bugs).

Setting --mca mpi_paffinity_alone 1 made things better.  However,
the performance is still not as good:

Cores   Mvapich2   Openmpi
---------------------------
    8       17.3      17.3
   16       31.7      31.5
   32       62.9      62.8
   64      110.8     108.0
  128      219.2     201.4
  256      384.5     342.7
  512      687.2     537.6

The performance number is GFlops (so larger is better).

The first few numbers show that the executable is the right
speed.  I verified that IB is being used by using OMB and
checking latency and bandwidth.  Those numbers are what I
expect (3 GB/s, 1.5 µs for QDR).

However, the Openmpi version is not scaling as well.  Any
ideas on why that might be the case?

Thanks,
Craig


Craig Tierney wrote:

I am running openmpi-1.3.3 on my cluster which is using
OFED-1.4.1 for Infiniband support.  I am comparing performance
between this version of OpenMPI and Mvapich2, and seeing a
very large difference in performance.

The code I am testing is WRF v3.0.1.  I am running the
12km benchmark.

The two builds are the exact same codes and configuration
files.  All I did different was use modules to switch versions
of MPI, and recompiled the code.

Performance:

Cores   Mvapich2   Openmpi
---------------------------
    8       17.3      13.9
   16       31.7      25.9
   32       62.9      51.6
   64      110.8      92.8
  128      219.2     189.4
  256      384.5     317.8
  512      687.2     516.7

The performance number is GFlops (so larger is better).

I am calling openmpi as:

/opt/openmpi/1.3.3-intel/bin/mpirun  --mca plm_rsh_disable_qrsh 1 --mca btl 
openib,sm,self \
-machinefile /tmp/6026489.1.qntest.q/machines -x LD_LIBRARY_PATH -np $NSLOTS 
/home/ctierney/bin/noaa_affinity ./wrf.exe

So,

Is this expected?  Are some common sense optimizations to use?
Is there a way to verify that I am really using the IB?  When
I try:

-mca bta ^tcp,openib,sm,self

I get the errors:
--
No available btl components were found!

This means that there are no components of this type installed on your
system or all the components reported that they could not be used.

This is a fatal error; your MPI process is likely to abort.  Check the
output of the "ompi_info" command and ensure that components of this
type are available on your system.  You may also wish to check the
value of the "component_path" MCA parameter and ensure that it has at
least one directory that contains valid MCA components.
--

But ompi_info is telling me that I have openib support:

   MCA btl: openib (MCA v2.0, API v2.0, Component v1.3.3)

Note, I did rebuild OFED and put it in a different directory
and did not rebuild OpenMPI.  However, since ompi_info isn't
complaining and the libraries are available, I am thinking that
it isn't a problem.  I could be wrong.

Thanks,
Craig





--
Gerry Creager -- gerry.crea...@tamu.edu
Texas Mesonet -- AATLT, Texas A&M University
Cell: 979.229.5301 Office: 979.458.4020 FAX: 979.862.3983
Office: 1700 Research Parkway Ste 160, TAMU, College Station, TX 77843


Re: [OMPI users] Performance question about OpenMPI and MVAPICH2 on IB

2009-08-06 Thread Craig Tierney
Gus Correa wrote:
> Hi Craig, list
> 
> I suppose WRF uses MPI collective calls (MPI_Reduce,
> MPI_Bcast, MPI_Alltoall etc),
> just like the climate models we run here do.
> A recursive grep on the source code will tell.
> 

I will check this out.  I am not the WRF expert, but
I was under the impression that most weather models use
nearest-neighbor communication, not collectives.


> If that is the case, you may need to tune the collectives dynamically.
> We are experimenting with tuned collectives here also.
> 
> Specifically, we had a scaling problem with the MITgcm
> (also running on an IB cluster)
> that is probably due to collectives.
> Similar problems were reported on this list before,
> with computational chemistry software.
> See these threads:
> http://www.open-mpi.org/community/lists/users/2009/07/10045.php
> http://www.open-mpi.org/community/lists/users/2009/05/9419.php
> 
> If WRF outputs timing information, particularly the time spent on MPI
> routines, you may also want to compare how the OpenMPI and
> MVAPICH versions fare w.r.t. MPI collectives.
> 
> I hope this helps.
> 

I will look into this.  Thanks for the ideas.

Craig



> Gus Correa
> -
> Gustavo Correa
> Lamont-Doherty Earth Observatory - Columbia University
> Palisades, NY, 10964-8000 - USA
> -
> 
> 
> 
> Craig Tierney wrote:
>> I am running openmpi-1.3.3 on my cluster which is using
>> OFED-1.4.1 for Infiniband support.  I am comparing performance
>> between this version of OpenMPI and Mvapich2, and seeing a
>> very large difference in performance.
>>
>> The code I am testing is WRF v3.0.1.  I am running the
>> 12km benchmark.
>>
>> The two builds are the exact same codes and configuration
>> files.  All I did different was use modules to switch versions
>> of MPI, and recompiled the code.
>>
>> Performance:
>>
>> Cores   Mvapich2   Openmpi
>> ---------------------------
>>     8       17.3      13.9
>>    16       31.7      25.9
>>    32       62.9      51.6
>>    64      110.8      92.8
>>   128      219.2     189.4
>>   256      384.5     317.8
>>   512      687.2     516.7
>>
>> The performance number is GFlops (so larger is better).
>>
>> I am calling openmpi as:
>>
>> /opt/openmpi/1.3.3-intel/bin/mpirun  --mca plm_rsh_disable_qrsh 1
>> --mca btl openib,sm,self \
>> -machinefile /tmp/6026489.1.qntest.q/machines -x LD_LIBRARY_PATH -np
>> $NSLOTS /home/ctierney/bin/noaa_affinity ./wrf.exe
>>
>> So,
>>
>> Is this expected?  Are some common sense optimizations to use?
>> Is there a way to verify that I am really using the IB?  When
>> I try:
>>
>> -mca bta ^tcp,openib,sm,self
>>
>> I get the errors:
>> --
>>
>> No available btl components were found!
>>
>> This means that there are no components of this type installed on your
>> system or all the components reported that they could not be used.
>>
>> This is a fatal error; your MPI process is likely to abort.  Check the
>> output of the "ompi_info" command and ensure that components of this
>> type are available on your system.  You may also wish to check the
>> value of the "component_path" MCA parameter and ensure that it has at
>> least one directory that contains valid MCA components.
>> --
>>
>>
>> But ompi_info is telling me that I have openib support:
>>
>>MCA btl: openib (MCA v2.0, API v2.0, Component v1.3.3)
>>
>> Note, I did rebuild OFED and put it in a different directory
>> and did not rebuild OpenMPI.  However, since ompi_info isn't
>> complaining and the libraries are available, I am thinking that
>> it isn't a problem.  I could be wrong.
>>
>> Thanks,
>> Craig
> 
> 


-- 
Craig Tierney (craig.tier...@noaa.gov)


Re: [OMPI users] Performance question about OpenMPI and MVAPICH2 on IB

2009-08-06 Thread Craig Tierney
A followup

Part of the problem was affinity.  I had written a script to do processor
and memory affinity (which works fine with MVAPICH2).  It is an
idea that I got from TACC.  However, the script didn't seem to
work correctly with OpenMPI (or I still have bugs).

Setting --mca mpi_paffinity_alone 1 made things better.  However,
the performance is still not as good:

Cores   Mvapich2   Openmpi
---------------------------
    8       17.3      17.3
   16       31.7      31.5
   32       62.9      62.8
   64      110.8     108.0
  128      219.2     201.4
  256      384.5     342.7
  512      687.2     537.6

The performance number is GFlops (so larger is better).

The first few numbers show that the executable is the right
speed.  I verified that IB is being used by using OMB and
checking latency and bandwidth.  Those numbers are what I
expect (3 GB/s, 1.5 µs for QDR).

However, the Openmpi version is not scaling as well.  Any
ideas on why that might be the case?

Thanks,
Craig


Craig Tierney wrote:
> I am running openmpi-1.3.3 on my cluster which is using
> OFED-1.4.1 for Infiniband support.  I am comparing performance
> between this version of OpenMPI and Mvapich2, and seeing a
> very large difference in performance.
> 
> The code I am testing is WRF v3.0.1.  I am running the
> 12km benchmark.
> 
> The two builds are the exact same codes and configuration
> files.  All I did different was use modules to switch versions
> of MPI, and recompiled the code.
> 
> Performance:
> 
> Cores   Mvapich2   Openmpi
> ---------------------------
>     8       17.3      13.9
>    16       31.7      25.9
>    32       62.9      51.6
>    64      110.8      92.8
>   128      219.2     189.4
>   256      384.5     317.8
>   512      687.2     516.7
> 
> The performance number is GFlops (so larger is better).
> 
> I am calling openmpi as:
> 
> /opt/openmpi/1.3.3-intel/bin/mpirun  --mca plm_rsh_disable_qrsh 1 --mca btl 
> openib,sm,self \
> -machinefile /tmp/6026489.1.qntest.q/machines -x LD_LIBRARY_PATH -np $NSLOTS 
> /home/ctierney/bin/noaa_affinity ./wrf.exe
> 
> So,
> 
> Is this expected?  Are some common sense optimizations to use?
> Is there a way to verify that I am really using the IB?  When
> I try:
> 
> -mca bta ^tcp,openib,sm,self
> 
> I get the errors:
> --
> No available btl components were found!
> 
> This means that there are no components of this type installed on your
> system or all the components reported that they could not be used.
> 
> This is a fatal error; your MPI process is likely to abort.  Check the
> output of the "ompi_info" command and ensure that components of this
> type are available on your system.  You may also wish to check the
> value of the "component_path" MCA parameter and ensure that it has at
> least one directory that contains valid MCA components.
> --
> 
> But ompi_info is telling me that I have openib support:
> 
>MCA btl: openib (MCA v2.0, API v2.0, Component v1.3.3)
> 
> Note, I did rebuild OFED and put it in a different directory
> and did not rebuild OpenMPI.  However, since ompi_info isn't
> complaining and the libraries are available, I am thinking that
> it isn't a problem.  I could be wrong.
> 
> Thanks,
> Craig


-- 
Craig Tierney (craig.tier...@noaa.gov)


Re: [OMPI users] Performance question about OpenMPI and MVAPICH2 on IB

2009-08-06 Thread Gus Correa

Hi Craig, list

I suppose WRF uses MPI collective calls (MPI_Reduce,
MPI_Bcast, MPI_Alltoall etc),
just like the climate models we run here do.
A recursive grep on the source code will tell.

If that is the case, you may need to tune the collectives dynamically.
We are experimenting with tuned collectives here also.

Specifically, we had a scaling problem with the MITgcm
(also running on an IB cluster)
that is probably due to collectives.
Similar problems were reported on this list before,
with computational chemistry software.
See these threads:
http://www.open-mpi.org/community/lists/users/2009/07/10045.php
http://www.open-mpi.org/community/lists/users/2009/05/9419.php

If WRF outputs timing information, particularly the time spent on MPI
routines, you may also want to compare how the OpenMPI and
MVAPICH versions fare w.r.t. MPI collectives.

I hope this helps.

Gus Correa
-
Gustavo Correa
Lamont-Doherty Earth Observatory - Columbia University
Palisades, NY, 10964-8000 - USA
-



Craig Tierney wrote:

I am running openmpi-1.3.3 on my cluster which is using
OFED-1.4.1 for Infiniband support.  I am comparing performance
between this version of OpenMPI and Mvapich2, and seeing a
very large difference in performance.

The code I am testing is WRF v3.0.1.  I am running the
12km benchmark.

The two builds are the exact same codes and configuration
files.  All I did different was use modules to switch versions
of MPI, and recompiled the code.

Performance:

Cores   Mvapich2   Openmpi
---------------------------
    8       17.3      13.9
   16       31.7      25.9
   32       62.9      51.6
   64      110.8      92.8
  128      219.2     189.4
  256      384.5     317.8
  512      687.2     516.7

The performance number is GFlops (so larger is better).

I am calling openmpi as:

/opt/openmpi/1.3.3-intel/bin/mpirun  --mca plm_rsh_disable_qrsh 1 --mca btl 
openib,sm,self \
-machinefile /tmp/6026489.1.qntest.q/machines -x LD_LIBRARY_PATH -np $NSLOTS 
/home/ctierney/bin/noaa_affinity ./wrf.exe

So,

Is this expected?  Are some common sense optimizations to use?
Is there a way to verify that I am really using the IB?  When
I try:

-mca bta ^tcp,openib,sm,self

I get the errors:
--
No available btl components were found!

This means that there are no components of this type installed on your
system or all the components reported that they could not be used.

This is a fatal error; your MPI process is likely to abort.  Check the
output of the "ompi_info" command and ensure that components of this
type are available on your system.  You may also wish to check the
value of the "component_path" MCA parameter and ensure that it has at
least one directory that contains valid MCA components.
--

But ompi_info is telling me that I have openib support:

   MCA btl: openib (MCA v2.0, API v2.0, Component v1.3.3)

Note, I did rebuild OFED and put it in a different directory
and did not rebuild OpenMPI.  However, since ompi_info isn't
complaining and the libraries are available, I am thinking that
it isn't a problem.  I could be wrong.

Thanks,
Craig