Hi Christoph,

this type:

x10::array::DistArray<x10aux::ref<x10::array::Array<double>  >  >

looks like the type of v_dst and v_src and _not_ of A.  This suggests
that for some reason the evaluations of v_dst(pt) and v_src(pt) are not
being reused between iterations of the inner for loop.  Can you pull them
out of the loop, i.e.

val p:Place = v_src.dist()(pt);
val dst = v_dst(pt);
val src = v_src(pt);
for ( (i,j) in (A|p) ) {
     dst(i) += A(i,j)*src(j);
}

and report whether this makes a difference?
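
For what it's worth, the call counts in your profile are consistent with
this.  If I am counting right, with a 500x500 matrix and 10 timing
iterations the body of the inner loop runs

    500 * 500 * 10 = 2,500,000

times, and evaluating both v_dst(pt) and v_src(pt) on every iteration gives
roughly 5,000,000 calls, which matches the 4999823 calls to
DistArray<ref<Array<double> > >::apply() that gprof attributes to
matmul__closure__15::apply().  A(i,j) goes through a different
instantiation (DistArray<double>), so it is not included in that count.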

Cheers,

Josh

On 09/09/10 20:13, Christoph Pospiech wrote:
> Hi,
>
> I wrote a small X10 test program which runs an ordinary matrix vector multiply
> in a timing loop, and I am currently stuck assessing the performance.
>
> Running with Eclipse X10DT 2.0.6 and the C++ backend on Linux x86 with one
> place, I am getting the following.
> ****************************************************
> * X10 test program for matrix vector multiplication
> ****************************************************
>
> matrix size = 500
> loop count  = 10
> places      = 1
> axis for parallelization = 1
>
> The time is 7.22883933400044 s
>
> This has to be compared to an equivalent Fortran program.
> $ make clean;make
> mpif77 -O3  -c -o mmp.o mmp.f
> Linking mmp ...
> done
> $ mpiexec -np 1 ./mmp
>   The wall clock time [s] was   9.17305762413889170E-003
>
> The difference is 2.9 orders of magnitude. Clearly, the X10 program has a
> performance issue.
>
> OK, there are hints at the following URL.
> http://x10.codehaus.org/Performance+Tuning+an+X10+Application
>
> I compiled my own MPI runtime and have currently ended up with the following.
> $ x10c++ -x10rt mpipg -O -NO_CHECKS -o matmul ../src/matmul.x10
> $ mpiexec -np 1 ./matmul 500 10 1
> ****************************************************
> * X10 test program for matrix vector multiplication
> ****************************************************
>
> matrix size = 500
> loop count  = 10
> places      = 1
> axis for parallelization = 1
>
> The time is 1.652489519001392 s
>
> Still a gap to Fortran performance by 2.25 orders of magnitude.
>
> The "pg" in "-x10rt mpipg" has the following significance.
> $ cat /opt/sw/X10_compiler/etc/x10rt_mpipg.properties
> CXX=mpicxx
> CXXFLAGS=-pg -g -O3
> LDFLAGS=-pg -g -O3
> LDLIBS=-lx10rt_mpi
>
> So I can now look at a gmon.out file and see the following.
> Flat profile:
>
> Each sample counts as 0.01 seconds.
>    %   cumulative   self              self     total
>   time   seconds   seconds    calls  ms/call  ms/call  name
>   58.06      0.18     0.18  4999853     0.00     0.00  x10::array::DistArray<x10aux::ref<x10::array::Array<double> > >::apply(x10aux::ref<x10::array::Point>)
>   29.03      0.27     0.09                             matmul__closure__15::apply()
>    6.45      0.29     0.02        1    20.00    20.96  x10_array_DistArray__closure__0<double>::apply()
>    3.23      0.30     0.01  5513706     0.00     0.00  x10::lang::Iterator<x10aux::ref<x10::array::Point> >::itable<x10::lang::Reference>* x10aux::findITable<x10::lang::Iterator<x10aux::ref<x10::array::Point> > >(x10aux::itable_entry*)
> matmul__closure__15::apply() is the calling parent of the hot spot. The call
> graph profile looks like this.
>
> granularity: each sample hit covers 4 byte(s) for 3.23% of 0.31 seconds
>
> index % time    self  children    called     name
>                                                   <spontaneous>
> [1]     90.6    0.09    0.19                 matmul__closure__15::apply() [1]
>                  0.18    0.00 4999823/4999853     x10::array::DistArray<x10aux::ref<x10::array::Array<double> > >::apply(x10aux::ref<x10::array::Point>) [2]
>                  0.01    0.00 4999820/5513706     x10::lang::Iterator<x10aux::ref<x10::array::Point> >::itable<x10::lang::Reference>* x10aux::findITable<x10::lang::Iterator<x10aux::ref<x10::array::Point> > >(x10aux::itable_entry*) [5]
>                  0.00    0.00      10/12          x10::array::DistArray<double>::__bar(x10::lang::Place) [10]
> -----------------------------------------------
>                  0.00    0.00      10/4999853     matmul__closure__17::apply() [24]
>                  0.00    0.00      20/4999853     matmul__closure__13::apply() [49]
>                  0.18    0.00 4999823/4999853     matmul__closure__15::apply() [1]
> [2]     58.1    0.18    0.00 4999853         x10::array::DistArray<x10aux::ref<x10::array::Array<double> > >::apply(x10aux::ref<x10::array::Point>) [2]
> -----------------------------------------------
>
> matmul__closure__15::apply() can be identified as the following code snippet,
> which is actually the heart of the matrix vector multiply.
>
>               /**
>                * Next do the local part of the
>                * matrix multiply.
>                */
>               finish ateach (pt in v_src ) {
>                       val p:Place = v_src.dist()(pt);
>                       for ( (i,j) in (A|p) ) {
>                               v_dst(pt)(i) += A(i,j)*v_src(pt)(j);
>                       }
>                       if (debug) {
>                               val v_src_str = "v_src("+p.id()+")";
>                               prettyPrintArray1D(v_src_str, v_src(pt));
>                               val v_dst_str = "v_dst("+p.id()+")";
>                               prettyPrintArray1D(v_dst_str, v_dst(pt));
>                       }
>               }
> where
>          static type Array1D = Array[Double]{rank==1};
>
>          global val v_dst: DistArray[Array1D]{rank==1};
>          global val v_src: DistArray[Array1D]{rank==1};
>          global val A: DistArray[Double]{rank==2};
>
> - the region for all objects of type Array1D is [0..vsize-1],
> - the region for v_src and v_dst is [0..number_of_places-1],
> - the distribution for v_src and v_dst maps exactly one point to each place.
> - the region for A is [0..vsize-1,0..vsize-1].
> - in all of the above, number_of_places == 1.
> - in all of the above, debug:Boolean == false.
>
> Am I correct that 58% of the time is spent in
> x10::array::DistArray...::apply(...Point), which I interpret as the evaluation
> of A(i,j) (and perhaps also of v_src(pt) and v_dst(pt))? And that each of these
> is a function call, adding up to 4999823 calls?
>
> That seems like a lot of CPU cycles just to get the matrix value A(i,j). Perhaps
> this can be inlined? How?
> And where are all the rest of the cycles that add up to the current performance
> gap of 2.25 orders of magnitude?
>    

