Hi Christoph, this type:
x10::array::DistArray<x10aux::ref<x10::array::Array<double> > >
looks like the type of v_dst and v_src and _not_ of A. This suggests that
for some reason the evaluations of v_dst(pt) and v_src(pt) are not reused
between iterations of the for loop. Can you pull them out of the loop, i.e.

    val p:Place = v_src.dist()(pt);
    val dst = v_dst(pt);
    val src = v_src(pt);
    for ( (i,j) in (A|p) ) {
        dst(i) += A(i,j)*src(j);
    }

and report whether this makes a difference?
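In context, the whole hot loop would then look something like this (an
untested sketch pieced together from your snippet below, with the debug
block omitted):

    // Sketch only: same ateach as in the original program, but the
    // per-place array lookups are hoisted out of the (i,j) loop, so
    // DistArray::apply(Point) runs once per pt instead of twice per
    // matrix element.
    finish ateach (pt in v_src) {
        val p:Place = v_src.dist()(pt);
        val dst = v_dst(pt);   // hoisted out of the inner loop
        val src = v_src(pt);   // hoisted out of the inner loop
        for ( (i,j) in (A|p) ) {
            dst(i) += A(i,j)*src(j);
        }
    }

For scale: two such lookups per inner iteration over a 500 x 500 matrix,
repeated 10 times, is about 5,000,000 calls, which lines up with the
4999853 calls to DistArray::apply(Point) in your flat profile. After
hoisting, dst and src are evaluated only once per pt rather than once per
(i,j).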
Cheers,
Josh

On 09/09/10 20:13, Christoph Pospiech wrote:
> Hi,
>
> I wrote a small X10 test program which runs an ordinary matrix vector
> multiply in a timing loop, and I am currently stuck assessing the
> performance.
>
> Running with Eclipse X10DT 2.0.6 and the C++ backend on Linux x86 with
> one place, I am getting the following.
>
> ****************************************************
> * X10 test program for matrix vector multiplication
> ****************************************************
>
> matrix size = 500
> loop count = 10
> places = 1
> axis for parallelization = 1
>
> The time is 7.22883933400044 s
>
> This has to be compared to an equivalent Fortran program.
>
> $ make clean; make
> mpif77 -O3 -c -o mmp.o mmp.f
> Linking mmp ...
> done
> $ mpiexec -np 1 ./mmp
> The wall clock time [s] was 9.17305762413889170E-003
>
> The difference is 2.9 orders of magnitude. Clearly, the X10 program has
> a performance issue.
>
> OK, there are the hints on the following URL.
> http://x10.codehaus.org/Performance+Tuning+an+X10+Application
>
> I compiled my own MPI runtime and currently ended up with the following.
>
> $ x10c++ -x10rt mpipg -O -NO_CHECKS -o matmul ../src/matmul.x10
> $ mpiexec -np 1 ./matmul 500 10 1
> ****************************************************
> * X10 test program for matrix vector multiplication
> ****************************************************
>
> matrix size = 500
> loop count = 10
> places = 1
> axis for parallelization = 1
>
> The time is 1.652489519001392 s
>
> Still a gap to Fortran performance by 2.25 orders of magnitude.
>
> The "pg" in "-x10rt mpipg" has the following significance.
>
> $ cat /opt/sw/X10_compiler/etc/x10rt_mpipg.properties
> CXX=mpicxx
> CXXFLAGS=-pg -g -O3
> LDFLAGS=-pg -g -O3
> LDLIBS=-lx10rt_mpi
>
> So I can now look at a gmon.out file and see the following.
>
> Flat profile:
>
> Each sample counts as 0.01 seconds.
>   %   cumulative   self              self     total
>  time   seconds   seconds    calls  ms/call  ms/call  name
>  58.06      0.18      0.18  4999853     0.00     0.00  x10::array::DistArray<x10aux::ref<x10::array::Array<double> > >::apply(x10aux::ref<x10::array::Point>)
>  29.03      0.27      0.09                             matmul__closure__15::apply()
>   6.45      0.29      0.02        1    20.00    20.96  x10_array_DistArray__closure__0<double>::apply()
>   3.23      0.30      0.01  5513706     0.00     0.00  x10::lang::Iterator<x10aux::ref<x10::array::Point> >::itable<x10::lang::Reference>* x10aux::findITable<x10::lang::Iterator<x10aux::ref<x10::array::Point> > >(x10aux::itable_entry*)
>
> matmul__closure__15::apply() is the calling parent to the hot spot. The
> call graph profile looks like this.
>
> granularity: each sample hit covers 4 byte(s) for 3.23% of 0.31 seconds
>
> index  % time    self  children    called          name
>                                                    <spontaneous>
> [1]      90.6    0.09      0.19                    matmul__closure__15::apply() [1]
>                  0.18      0.00  4999823/4999853       x10::array::DistArray<x10aux::ref<x10::array::Array<double> > >::apply(x10aux::ref<x10::array::Point>) [2]
>                  0.01      0.00  4999820/5513706       x10::lang::Iterator<x10aux::ref<x10::array::Point> >::itable<x10::lang::Reference>* x10aux::findITable<x10::lang::Iterator<x10aux::ref<x10::array::Point> > >(x10aux::itable_entry*) [5]
>                  0.00      0.00       10/12            x10::array::DistArray<double>::__bar(x10::lang::Place) [10]
> -----------------------------------------------
>                  0.00      0.00       10/4999853       matmul__closure__17::apply() [24]
>                  0.00      0.00       20/4999853       matmul__closure__13::apply() [49]
>                  0.18      0.00  4999823/4999853       matmul__closure__15::apply() [1]
> [2]      58.1    0.18      0.00  4999853            x10::array::DistArray<x10aux::ref<x10::array::Array<double> > >::apply(x10aux::ref<x10::array::Point>) [2]
> -----------------------------------------------
>
> matmul__closure__15::apply() can be identified as the following code
> snippet, actually the heart of the matrix vector multiply.
>
>     /**
>      * Next do the local part of the
>      * matrix multiply.
>      */
>     finish ateach (pt in v_src) {
>         val p:Place = v_src.dist()(pt);
>         for ( (i,j) in (A|p) ) {
>             v_dst(pt)(i) += A(i,j)*v_src(pt)(j);
>         }
>         if (debug) {
>             val v_src_str = "v_src("+p.id()+")";
>             prettyPrintArray1D(v_src_str, v_src(pt));
>             val v_dst_str = "v_dst("+p.id()+")";
>             prettyPrintArray1D(v_dst_str, v_dst(pt));
>         }
>     }
>
> where
>
>     static type Array1D = Array[Double]{rank==1};
>
>     global val v_dst: DistArray[Array1D]{rank==1};
>     global val v_src: DistArray[Array1D]{rank==1};
>     global val A: DistArray[Double]{rank==2};
>
> - the region for all objects of type Array1D is [0..vsize-1],
> - the region for v_src and v_dst is [0..number_of_places-1],
> - the distribution for v_src and v_dst maps exactly one point to each place,
> - the region for A is [0..vsize-1, 0..vsize-1],
> - in all of the above, number_of_places == 1,
> - in all of the above, debug:Boolean == false.
>
> Am I correct that 58% of the time is spent in
> x10::array::DistArray...::apply(...Point), which I interpret as the
> evaluation of A(i,j) (and perhaps also of v_src(pt) and v_dst(pt))? And
> that each of these is a function call, adding up to 4999823 calls?
>
> That seems a lot of CPU cycles just to get the matrix value A(i,j).
> Perhaps this can be inlined? How?
>
> And where are all the rest of the cycles that add up to the performance
> gap of currently 2.25 orders of magnitude?