Hi, I wrote a small X10 test program that runs an ordinary matrix-vector multiply in a timing loop, and I am currently stuck assessing its performance.
Running with Eclipse X10DT 2.0.6 and the C++ backend on Linux x86 with one place, I am getting the following.

****************************************************
* X10 test program for matrix vector multiplication
****************************************************
matrix size = 500
loop count = 10
places = 1
axis for parallelization = 1
The time is 7.22883933400044 s

This has to be compared to an equivalent Fortran program.

$ make clean;make
mpif77 -O3 -c -o mmp.o mmp.f
Linking mmp ... done
$ mpiexec -np 1 ./mmp
 The wall clock time [s] was   9.17305762413889170E-003

The difference is 2.9 orders of magnitude. Clearly, the X10 program has a performance issue.

OK, there are hints at the following URL.

http://x10.codehaus.org/Performance+Tuning+an+X10+Application

I compiled my own MPI runtime and currently ended up with the following.

$ x10c++ -x10rt mpipg -O -NO_CHECKS -o matmul ../src/matmul.x10
$ mpiexec -np 1 ./matmul 500 10 1
****************************************************
* X10 test program for matrix vector multiplication
****************************************************
matrix size = 500
loop count = 10
places = 1
axis for parallelization = 1
The time is 1.652489519001392 s

That still leaves a gap of 2.25 orders of magnitude to the Fortran performance.

The "pg" in "-x10rt mpipg" has the following significance.

$ cat /opt/sw/X10_compiler/etc/x10rt_mpipg.properties
CXX=mpicxx
CXXFLAGS=-pg -g -O3
LDFLAGS=-pg -g -O3
LDLIBS=-lx10rt_mpi
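Running the instrumented binary then writes a gmon.out file next to it; the listings that follow are ordinary gprof output, obtained roughly like this (invocation approximate, output trimmed to the relevant entries):

$ gprof ./matmul gmon.out    # exact flags may differ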
So I can now look at the gmon.out data and see the following.

Flat profile:

Each sample counts as 0.01 seconds.
  %   cumulative   self              self     total
 time   seconds   seconds    calls  ms/call  ms/call  name
 58.06      0.18     0.18  4999853     0.00     0.00  x10::array::DistArray<x10aux::ref<x10::array::Array<double> > >::apply(x10aux::ref<x10::array::Point>)
 29.03      0.27     0.09                             matmul__closure__15::apply()
  6.45      0.29     0.02        1    20.00    20.96  x10_array_DistArray__closure__0<double>::apply()
  3.23      0.30     0.01  5513706     0.00     0.00  x10::lang::Iterator<x10aux::ref<x10::array::Point> >::itable<x10::lang::Reference>* x10aux::findITable<x10::lang::Iterator<x10aux::ref<x10::array::Point> > >(x10aux::itable_entry*)

matmul__closure__15::apply() is the calling parent of the hot spot. The call graph profile looks like this.

granularity: each sample hit covers 4 byte(s) for 3.23% of 0.31 seconds

index % time    self  children    called     name
                                                 <spontaneous>
[1]     90.6    0.09    0.19                 matmul__closure__15::apply() [1]
                0.18    0.00 4999823/4999853     x10::array::DistArray<x10aux::ref<x10::array::Array<double> > >::apply(x10aux::ref<x10::array::Point>) [2]
                0.01    0.00 4999820/5513706     x10::lang::Iterator<x10aux::ref<x10::array::Point> >::itable<x10::lang::Reference>* x10aux::findITable<x10::lang::Iterator<x10aux::ref<x10::array::Point> > >(x10aux::itable_entry*) [5]
                0.00    0.00      10/12          x10::array::DistArray<double>::__bar(x10::lang::Place) [10]
-----------------------------------------------
                0.00    0.00      10/4999853     matmul__closure__17::apply() [24]
                0.00    0.00      20/4999853     matmul__closure__13::apply() [49]
                0.18    0.00 4999823/4999853     matmul__closure__15::apply() [1]
[2]     58.1    0.18    0.00  4999853        x10::array::DistArray<x10aux::ref<x10::array::Array<double> > >::apply(x10aux::ref<x10::array::Point>) [2]
-----------------------------------------------

matmul__closure__15::apply() can be identified as the following code snippet, which is the heart of the matrix-vector multiply.

/**
 * Next do the local part of the
 * matrix multiply.
 */
finish ateach (pt in v_src) {
    val p:Place = v_src.dist()(pt);
    for ( (i,j) in (A|p) ) {
        v_dst(pt)(i) += A(i,j)*v_src(pt)(j);
    }
    if (debug) {
        val v_src_str = "v_src("+p.id()+")";
        prettyPrintArray1D(v_src_str, v_src(pt));
        val v_dst_str = "v_dst("+p.id()+")";
        prettyPrintArray1D(v_dst_str, v_dst(pt));
    }
}

where

static type Array1D = Array[Double]{rank==1};
global val v_dst: DistArray[Array1D]{rank==1};
global val v_src: DistArray[Array1D]{rank==1};
global val A: DistArray[Double]{rank==2};

- the region for all objects of type Array1D is [0..vsize-1],
- the region for v_src and v_dst is [0..number_of_places-1],
- the distribution for v_src and v_dst maps exactly one point to each place,
- the region for A is [0..vsize-1,0..vsize-1],
- in all of the above, number_of_places == 1,
- in all of the above, debug:Boolean == false.

Am I correct that 58% of the time is spent in x10::array::DistArray...::apply(...Point), which I interpret as the evaluation of A(i,j) (and perhaps also of v_src(pt) and v_dst(pt))? And that each of these evaluations is a function call, adding up to the 4999823 calls above? That seems like a lot of CPU cycles just to fetch the matrix value A(i,j). Perhaps this can be inlined? How?

And where do all the remaining cycles go that add up to the current performance gap of 2.25 orders of magnitude?

--
Kind regards

Dr. Christoph Pospiech
High Performance & Parallel Computing

Phone: +49-351 86269826
Mobile: +49-171-765 5871
E-Mail: christoph.pospi...@de.ibm.com
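P.S. For what it is worth, the only workaround I have sketched so far is to hoist the v_src(pt) and v_dst(pt) lookups out of the inner loop; A(i,j) remains a DistArray access, so I doubt this alone closes the gap. This is an untested sketch, not yet timed:

finish ateach (pt in v_src) {
    val p:Place = v_src.dist()(pt);
    // untested sketch: evaluate the per-place DistArray lookups once per
    // place instead of once per (i,j) iteration as in the code above
    val src = v_src(pt);
    val dst = v_dst(pt);
    for ( (i,j) in (A|p) ) {
        dst(i) += A(i,j)*src(j);
    }
}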