Hi, I wrote a small X10 test program that runs an ordinary matrix-vector multiply in a timing loop, and I am currently stuck assessing its performance.
Running with Eclipse X10DT 2.0.6 and the C++ backend on Linux x86 with one place, I am getting the following.

****************************************************
* X10 test program for matrix vector multiplication
****************************************************
matrix size = 500
loop count = 10
places = 1
axis for parallelization = 1
The time is 7.22883933400044 s

This has to be compared to an equivalent Fortran program.

$ make clean;make
mpif77 -O3 -c -o mmp.o mmp.f
Linking mmp ... done
$ mpiexec -np 1 ./mmp
 The wall clock time [s] was   9.17305762413889170E-003

The difference is 2.9 orders of magnitude. Clearly, the X10 program has a performance issue.

OK, there are hints at the following URL.

http://x10.codehaus.org/Performance+Tuning+an+X10+Application

I compiled my own MPI runtime and currently ended up with the following.

$ x10c++ -x10rt mpipg -O -NO_CHECKS -o matmul ../src/matmul.x10
$ mpiexec -np 1 ./matmul 500 10 1
****************************************************
* X10 test program for matrix vector multiplication
****************************************************
matrix size = 500
loop count = 10
places = 1
axis for parallelization = 1
The time is 1.652489519001392 s

That still leaves a gap of 2.25 orders of magnitude to the Fortran performance.

The "pg" in "-x10rt mpipg" has the following significance.

$ cat /opt/sw/X10_compiler/etc/x10rt_mpipg.properties
CXX=mpicxx
CXXFLAGS=-pg -g -O3
LDFLAGS=-pg -g -O3
LDLIBS=-lx10rt_mpi
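Running the instrumented binary then writes a gmon.out file next to it; the listings that follow are ordinary gprof output, obtained roughly like this (invocation approximate, output trimmed to the relevant entries):

$ gprof ./matmul gmon.out    # exact flags may differ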
So I can now look at the gmon.out data and see the following.

Flat profile:

Each sample counts as 0.01 seconds.
  %   cumulative   self              self     total
 time   seconds   seconds    calls  ms/call  ms/call  name
 58.06      0.18     0.18  4999853     0.00     0.00  x10::array::DistArray<x10aux::ref<x10::array::Array<double> > >::apply(x10aux::ref<x10::array::Point>)
 29.03      0.27     0.09                             matmul__closure__15::apply()
  6.45      0.29     0.02        1    20.00    20.96  x10_array_DistArray__closure__0<double>::apply()
  3.23      0.30     0.01  5513706     0.00     0.00  x10::lang::Iterator<x10aux::ref<x10::array::Point> >::itable<x10::lang::Reference>* x10aux::findITable<x10::lang::Iterator<x10aux::ref<x10::array::Point> > >(x10aux::itable_entry*)

matmul__closure__15::apply() is the calling parent of the hot spot. The call graph profile looks like this.

granularity: each sample hit covers 4 byte(s) for 3.23% of 0.31 seconds

index % time    self  children    called     name
                                                 <spontaneous>
[1]     90.6    0.09    0.19                 matmul__closure__15::apply() [1]
                0.18    0.00 4999823/4999853     x10::array::DistArray<x10aux::ref<x10::array::Array<double> > >::apply(x10aux::ref<x10::array::Point>) [2]
                0.01    0.00 4999820/5513706     x10::lang::Iterator<x10aux::ref<x10::array::Point> >::itable<x10::lang::Reference>* x10aux::findITable<x10::lang::Iterator<x10aux::ref<x10::array::Point> > >(x10aux::itable_entry*) [5]
                0.00    0.00      10/12          x10::array::DistArray<double>::__bar(x10::lang::Place) [10]
-----------------------------------------------
                0.00    0.00      10/4999853     matmul__closure__17::apply() [24]
                0.00    0.00      20/4999853     matmul__closure__13::apply() [49]
                0.18    0.00 4999823/4999853     matmul__closure__15::apply() [1]
[2]     58.1    0.18    0.00  4999853        x10::array::DistArray<x10aux::ref<x10::array::Array<double> > >::apply(x10aux::ref<x10::array::Point>) [2]
-----------------------------------------------

matmul__closure__15::apply() can be identified as the following code snippet, which is the heart of the matrix-vector multiply.

/**
 * Next do the local part of the
 * matrix multiply.
 */
finish ateach (pt in v_src) {
    val p:Place = v_src.dist()(pt);
    for ( (i,j) in (A|p) ) {
        v_dst(pt)(i) += A(i,j)*v_src(pt)(j);
    }
    if (debug) {
        val v_src_str = "v_src("+p.id()+")";
        prettyPrintArray1D(v_src_str, v_src(pt));
        val v_dst_str = "v_dst("+p.id()+")";
        prettyPrintArray1D(v_dst_str, v_dst(pt));
    }
}

where

static type Array1D = Array[Double]{rank==1};
global val v_dst: DistArray[Array1D]{rank==1};
global val v_src: DistArray[Array1D]{rank==1};
global val A: DistArray[Double]{rank==2};

- the region for all objects of type Array1D is [0..vsize-1],
- the region for v_src and v_dst is [0..number_of_places-1],
- the distribution for v_src and v_dst maps exactly one point to each place,
- the region for A is [0..vsize-1,0..vsize-1],
- in all of the above, number_of_places == 1,
- in all of the above, debug:Boolean == false.

Am I correct that 58% of the time is spent in x10::array::DistArray...::apply(...Point), which I interpret as the evaluation of A(i,j) (and perhaps also of v_src(pt) and v_dst(pt))? And that each of these evaluations is a function call, adding up to the 4999823 calls above? That seems like a lot of CPU cycles just to fetch the matrix value A(i,j). Perhaps this can be inlined? How?

And where do all the remaining cycles go that add up to the current performance gap of 2.25 orders of magnitude?

--
Kind regards

Dr. Christoph Pospiech
High Performance & Parallel Computing

Phone: +49-351 86269826
Mobile: +49-171-765 5871
E-Mail: christoph.pospi...@de.ibm.com
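P.S. For what it is worth, the only workaround I have sketched so far is to hoist the v_src(pt) and v_dst(pt) lookups out of the inner loop; A(i,j) remains a DistArray access, so I doubt this alone closes the gap. This is an untested sketch, not yet timed:

finish ateach (pt in v_src) {
    val p:Place = v_src.dist()(pt);
    // untested sketch: evaluate the per-place DistArray lookups once per
    // place instead of once per (i,j) iteration as in the code above
    val src = v_src(pt);
    val dst = v_dst(pt);
    for ( (i,j) in (A|p) ) {
        dst(i) += A(i,j)*src(j);
    }
}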