On Friday, September 10, 2010 04:26:23 am David P Grove wrote:
> Christoph Pospiech <christoph.pospi...@de.ibm.com> wrote on 09/09/2010 07:04:10 AM:
> > $ mpiexec -np 1 ./matmul 500 10 1
> > ****************************************************
> > * X10 test program for matrix vector multiplication
> > ****************************************************
> >
> > matrix size = 500
> > loop count = 10
> > places = 1
> > axis for parallelization = 1
> >
> > The time is 1.162122911000552 s
> >
> > This narrows the gap to 2.10 orders of magnitude.
>
> Hi,
>
> My guess is that the overhead is coming from the "for ((i,j) in (A|p))"
> idiom in the for loop. We haven't had a chance to finish all the
> optimizer work that is needed to make that efficient yet in the general
> case. If the compiler knows statically that the Region R in the
> for ((i,j) in R) { ... } construct has the rect property, then we generate
> a proper nested counted for loop and get good performance. If we can't
> prove the region is rect (or it isn't rect), then we have to fall back to
> calling the region's iterator to get the Point objects one at a time, and
> this is much, much slower.
>
> Looking at your profile, it appears that a large chunk of the time is
> going into the Iterator operations and the allocation of objects (which
> I'm speculating are Points that are only being created because that's what
> the Iterator returns). That's why I think it's the (A|p) idiom that is
> hurting you. We don't have quite enough static information to apply the
> for loop optimization.
>
> In X10 2.0.6, pretty much your only chance of getting good performance
> from moderately clean code here is to do something slightly sleazy and
> "help" the compiler by exploiting the fact that you actually do have
> RectRegions (even though the compiler can't prove that statically).
>
> To show this, I wrote the little program appended below. I see about a
> 25x performance difference between the slow, pure loop and the faster but
> slightly sleazy loop.
>
> I will have to take a look at the Dist and DistArray implementation to
> see if there is some way we can restructure the code for 2.1.0 such that
> this becomes possible without requiring a sleazy cast by the user.
>
> Hope this helps,
>
> --dave
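
A rough illustration of the two loop shapes described in the quoted message, written in Python rather than X10. This is not Dave's program. Point, region_points, matvec_iterator and matvec_counted are hypothetical names made up for this sketch, and Python timings say nothing about the X10 numbers. It only shows why the general path allocates one object per matrix element while the rect path needs none.

```python
from dataclasses import dataclass
from typing import Iterator, List


@dataclass(frozen=True)
class Point:
    """Hypothetical stand-in for an X10 Point: one object per index pair."""
    i: int
    j: int


def region_points(rows: int, cols: int) -> Iterator[Point]:
    """General path: the region hands out its index pairs one Point at a time."""
    for i in range(rows):
        for j in range(cols):
            yield Point(i, j)  # a fresh object for every matrix element


def matvec_iterator(a: List[List[float]], x: List[float]) -> List[float]:
    """y = A*x by walking the region's iterator (the slow shape)."""
    rows, cols = len(a), len(x)
    y = [0.0] * rows
    for p in region_points(rows, cols):
        y[p.i] += a[p.i][p.j] * x[p.j]
    return y


def matvec_counted(a: List[List[float]], x: List[float]) -> List[float]:
    """y = A*x with a nested counted loop (what a rect region permits)."""
    rows, cols = len(a), len(x)
    y = [0.0] * rows
    for i in range(rows):
        for j in range(cols):
            y[i] += a[i][j] * x[j]
    return y


if __name__ == "__main__":
    a = [[1.0, 2.0], [3.0, 4.0]]
    x = [1.0, 1.0]
    print(matvec_iterator(a, x))
    print(matvec_counted(a, x))
```

Both functions visit the elements in the same order and return the same vector. For a 500 x 500 matrix the first one creates 250000 Point objects per multiplication, the second one none.
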

Dave,

I can confirm that your guess is correct. I added your suggestions to my
code and reran it.

$ mpiexec -np 1 ./matmul 500 10 1
****************************************************
* X10 test program for matrix vector multiplication
****************************************************

matrix size = 500
loop count = 10
places = 1
axis for parallelization = 1

The time is 0.076593629000854 s

The performance gap to FORTRAN has narrowed to 0.92 orders of magnitude.
This decrease in gap width translates to a speedup by a factor of
10**(2.10 - 0.92) = 15.14, which is a major jump ahead. (The raw timings
give 1.1621 / 0.0766, or about 15.2.)

The flat profile has now changed to the following, and I am not sure how
many further hints can be deduced from a subroutine profile like that.

Flat profile:

Each sample counts as 0.01 seconds.
  %   cumulative   self              self     total
 time   seconds   seconds    calls  ms/call  ms/call  name
 77.78      0.07     0.07                             matmul__closure__15::apply()
 22.22      0.09     0.02        1    20.00    20.00  x10_array_DistArray__closure__0<double>::apply()
[... many further entries with 0% time ... ]
[... no wonder, 22.22 + 77.78 = 100.00 ...]

matmul__closure__15 is the same code snippet as before (with the
amendments suggested by you and Josh Milthorpe). The other entry appears
to come from the X10 DistArray implementation.

The call tree from gprof reads like this. Does this tell you anything?

index % time    self  children    called     name
-----------------------------------------------
                0.02    0.00       1/1           x10_lang_PlaceLocalHandle__closure__2<x10aux::ref<x10::array::DistArray__LocalState<double> > >::apply() [3]
[2]     22.2    0.02    0.00       1         x10_array_DistArray__closure__0<double>::apply() [2]
                0.00    0.00  500001/514061      x10::lang::Iterator<x10aux::ref<x10::array::Point> >::itable<x10::lang::Reference>* x10aux::findITable<x10::lang::Iterator<x10aux::ref<x10::array::Point> > >(x10aux::itable_entry*) [20]
                0.00    0.00  250000/250000      matmul__closure__1::_getITables() [21]
                0.00    0.00  250000/250000      matmul__closure__1::apply(x10aux::ref<x10::array::Point>) [22]
                0.00    0.00       2/344         x10aux::alloc_internal(unsigned int, bool) [31]
                0.00    0.00       1/7           x10aux::RuntimeType const* x10aux::getRTT<x10::lang::Fun_0_1<x10aux::ref<x10::array::Point>, double> >() [50]
-----------------------------------------------
                                                 <spontaneous>
[3]     22.2    0.00    0.02                 x10_lang_PlaceLocalHandle__closure__2<x10aux::ref<x10::array::DistArray__LocalState<double> > >::apply() [3]
                0.02    0.00       1/1           x10_array_DistArray__closure__0<double>::apply() [2]
                0.00    0.00      12/12          x10_array_DistArray__closure__2<double>::_getITables() [42]
                0.00    0.00      12/344         x10aux::alloc_internal(unsigned int, bool) [31]
                0.00    0.00      11/11          x10_array_DistArray__closure__2<double>::apply() [47]
                0.00    0.00       1/1           x10_array_DistArray__closure__0<double>::_getITables() [78]
                0.00    0.00       1/2           x10::array::DistArray__LocalState<double>::getRTT() [72]
-----------------------------------------------

--
Kind regards
Dr.
Christoph Pospiech
High Performance & Parallel Computing
Phone: +49-351 86269826
Mobile: +49-171-765 5871
E-Mail: christoph.pospi...@de.ibm.com
IBM Deutschland GmbH