Re: [X10-users] performance issue with matrix vector multiply program

Christoph Pospiech Fri, 10 Sep 2010 03:28:37 -0700

On Friday, September 10, 2010 04:26:23 am David P Grove wrote:
> Christoph Pospiech <christoph.pospi...@de.ibm.com> wrote 
> on 09/09/2010 07:04:10 AM:
> > $ mpiexec -np 1 ./matmul 500 10 1
> > ****************************************************
> > * X10 test program for matrix vector multiplication
> > ****************************************************
> > 
> > matrix size = 500
> > loop count  = 10
> > places      = 1
> > axis for parallelization = 1
> > 
> > The time is 1.162122911000552 s
> > 
> > This narrows the gap to 2.10 orders of magnitude.
> 
> Hi,
> 
>         My guess is that the overhead is coming from the "for ((i,j) in
> (A|p))" idiom in the for loop.  We haven't had a chance to finish all the
> optimizer work that is needed to make that efficient yet in the general
> case.  If the compiler knows statically that the Region R in the for ((i,j)
> in R) { ... } construct has the rect property, then we generate a proper
> nested counted for loop and get good performance.  If we can't prove the
> region is rect (or it isn't rect), then we have to fall back to calling
> the region's iterator to get the Point objects one at a time, and this is
> much, much slower.
> 
>         Looking at your profile, it appears that a large chunk of the time
> is going into the Iterator operations and the allocation of objects (which
> I'm speculating are Points that are only being created because that's what
> the Iterator returns).  That's why I think it's the (A|p) idiom that is
> hurting you.  We don't have quite enough static information to apply the
> for loop optimization.
> 
>         In X10 2.0.6, pretty much your only chance of getting good
> performance from moderately clean code here is to do something slightly
> sleazy and "help" the compiler by exploiting the fact that you actually do
> have RectRegions (even though the compiler can't prove that
> statically).
> 
>         To show this, I wrote the little program appended below.  I see
> about 25x performance difference between the slow, pure loop, and the
> faster but slightly sleazy loop.
> 
>         I will have to take a look at the Dist and DistArray implementation
> to see if there is some way we can restructure the code for 2.1.0 such that
> this becomes possible without requiring a sleazy cast by the user.
> 
> hope this helps,
> 
> --dave


Dave,

I can confirm that your guess is correct: I added your suggestions to my code 
and reran it:
$ mpiexec -np 1 ./matmul 500 10 1
****************************************************
* X10 test program for matrix vector multiplication
****************************************************

matrix size = 500
loop count  = 10
places      = 1
axis for parallelization = 1

The time is 0.076593629000854 s

The performance gap to FORTRAN has narrowed to 0.92 orders of magnitude. This 
decrease in gap width translates to a speed up by a factor of 
10**(2.10 - 0.92) = 15.14, which is a major jump ahead.
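
As a cross-check on that factor, the arithmetic can be redone directly from the two wall-clock times reported in this thread. The snippet below is only a sketch in Python (it is not part of matmul), and the variable names are made up for illustration.

```python
import math

# Wall-clock times printed by the two runs quoted in this thread
# (mpiexec -np 1 ./matmul 500 10 1), in seconds.
t_before = 1.162122911000552   # run with the generic (A|p) iteration
t_after = 0.076593629000854    # run with the suggested amendments

# Speedup derived from the rounded gaps to FORTRAN
# (2.10 and 0.92 orders of magnitude), as in the text above.
speedup_from_gaps = 10 ** (2.10 - 0.92)

# Speedup taken directly from the two measured times.
speedup_from_times = t_before / t_after

print(f"from rounded gaps  : {speedup_from_gaps:.2f}")
print(f"from measured times: {speedup_from_times:.2f}")
print(f"log10 of measured speedup: {math.log10(speedup_from_times):.2f}")
```

Both routes land at a factor of roughly 15; the small difference between them comes from rounding the gaps to two decimals before exponentiating.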

The flat profile has now changed to the following, and I am not sure how many 
further hints can be deduced from a subroutine profile like that.

Flat profile:

Each sample counts as 0.01 seconds.
  %   cumulative   self              self     total
 time   seconds   seconds    calls  ms/call  ms/call  name
 77.78      0.07     0.07                             matmul__closure__15::apply()
 22.22      0.09     0.02        1    20.00    20.00  x10_array_DistArray__closure__0<double>::apply()

[... many further entries with 0% time ... ]
[... no wonder, 22.22 + 77.78 = 100.00 ...]

matmul__closure__15 is the same code snippet as before (with the amendments 
suggested by you and Josh Milthorpe).
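
For readers who have not seen the code, the two loop shapes Dave describes can be sketched as follows. This is a hypothetical illustration in Python, not the X10 source of matmul: the names Point, iterate_points, matvec_generic and matvec_counted are invented here, and Python will not show the same performance gap. It only shows the structural difference between pulling one index object at a time from an iterator and running plain nested counted loops.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Point:
    """Stand-in for an index object handed out by a region iterator."""
    i: int
    j: int


def iterate_points(n_rows, n_cols):
    # Generic path: one freshly allocated Point per matrix element.
    for i in range(n_rows):
        for j in range(n_cols):
            yield Point(i, j)


def matvec_generic(a, x):
    # Matrix-vector product driven by the Point iterator.
    y = [0.0] * len(a)
    for p in iterate_points(len(a), len(x)):
        y[p.i] += a[p.i][p.j] * x[p.j]
    return y


def matvec_counted(a, x):
    # Rectangular path: nested counted loops, no index objects at all.
    y = [0.0] * len(a)
    for i in range(len(a)):
        for j in range(len(x)):
            y[i] += a[i][j] * x[j]
    return y


a = [[1.0, 2.0], [3.0, 4.0]]
x = [1.0, 1.0]
print(matvec_generic(a, x), matvec_counted(a, x))
```

Both functions compute the same result; the second form is the one the compiler can only generate when it knows the region is rectangular.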

The other entry appears to come from the X10 DistArray implementation. 
The call tree from gprof reads as follows. Does this tell you anything?

index % time    self  children    called     name
-----------------------------------------------
                0.02    0.00       1/1           x10_lang_PlaceLocalHandle__closure__2<x10aux::ref<x10::array::DistArray__LocalState<double> > >::apply() [3]
[2]     22.2    0.02    0.00       1         x10_array_DistArray__closure__0<double>::apply() [2]
                0.00    0.00  500001/514061      x10::lang::Iterator<x10aux::ref<x10::array::Point> >::itable<x10::lang::Reference>* x10aux::findITable<x10::lang::Iterator<x10aux::ref<x10::array::Point> > >(x10aux::itable_entry*) [20]
                0.00    0.00  250000/250000      matmul__closure__1::_getITables() [21]
                0.00    0.00  250000/250000      matmul__closure__1::apply(x10aux::ref<x10::array::Point>) [22]
                0.00    0.00       2/344         x10aux::alloc_internal(unsigned int, bool) [31]
                0.00    0.00       1/7           x10aux::RuntimeType const* x10aux::getRTT<x10::lang::Fun_0_1<x10aux::ref<x10::array::Point>, double> >() [50]
-----------------------------------------------
                                                 <spontaneous>
[3]     22.2    0.00    0.02                 x10_lang_PlaceLocalHandle__closure__2<x10aux::ref<x10::array::DistArray__LocalState<double> > >::apply() [3]
                0.02    0.00       1/1           x10_array_DistArray__closure__0<double>::apply() [2]
                0.00    0.00      12/12          x10_array_DistArray__closure__2<double>::_getITables() [42]
                0.00    0.00      12/344         x10aux::alloc_internal(unsigned int, bool) [31]
                0.00    0.00      11/11          x10_array_DistArray__closure__2<double>::apply() [47]
                0.00    0.00       1/1           x10_array_DistArray__closure__0<double>::_getITables() [78]
                0.00    0.00       1/2           x10::array::DistArray__LocalState<double>::getRTT() [72]
-----------------------------------------------
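
One possible reading of the call counts under entry [2] (an assumption, not verified against the X10 runtime source): 250000 is 500 x 500, i.e. one closure call per matrix element, and 500001 is what a hasNext()/next() style loop over those elements produces if every interface call costs one findITable lookup. The Python sketch below only models that counting; CountingPointIterator and drive are hypothetical names, not X10 APIs.

```python
class CountingPointIterator:
    """Java-style iterator over an n x n index space that counts its calls."""

    def __init__(self, n):
        self.n = n
        self.pos = 0
        self.calls = 0  # total has_next() + next() invocations

    def has_next(self):
        self.calls += 1
        return self.pos < self.n * self.n

    def next(self):
        self.calls += 1
        i, j = divmod(self.pos, self.n)
        self.pos += 1
        return (i, j)


def drive(n):
    # Walk the whole index space the way a generic for-loop would.
    it = CountingPointIterator(n)
    visited = 0
    while it.has_next():
        it.next()
        visited += 1
    return visited, it.calls


visited, calls = drive(500)
print(visited, calls)
```

For n = 500 this visits 250000 points with 500001 iterator calls (250001 has_next plus 250000 next), which matches the numbers in the call tree. If that reading is right, the array initialization still goes through the generic Point iterator.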

-- 

Kind regards

Dr. Christoph Pospiech
High Performance & Parallel Computing
Phone:  +49-351 86269826
Mobile: +49-171-765 5871
E-Mail: christoph.pospi...@de.ibm.com
-------------------------------------
IBM Deutschland GmbH
Chairman of the Supervisory Board: Erich Clementi 
Management: Martin Jetter (Chairman), 
Reinhard Reschke, Christoph Grandpierre, 
Klaus Lintelmann, Michael Diemer, Martina Koederitz 
Registered office: Ehningen / Court of registration: Amtsgericht Stuttgart, HRB 14562 
WEEE Reg. No. DE 99369940

