Christoph Pospiech <christoph.pospi...@de.ibm.com> wrote on 09/09/2010 07:04:10 AM:

> $ mpiexec -np 1 ./matmul 500 10 1
> ****************************************************
> * X10 test program for matrix vector multiplication
> ****************************************************
>
> matrix size = 500
> loop count = 10
> places = 1
> axis for parallelization = 1
>
> The time is 1.162122911000552 s
>
> This narrows the gap to 2.10 orders of magnitude.
>
Hi,

My guess is that the overhead is coming from the "for ((i,j) in (A|p))" idiom in the for loop. We haven't yet had a chance to finish all the optimizer work that is needed to make that efficient in the general case.

If the compiler knows statically that the Region R in the for ((i,j) in R) { ... } construct has the rect property, then we generate a proper nested counted for loop and get good performance. If we can't prove the region is rect (or it isn't rect), then we have to fall back to calling the region's iterator to get the Point objects one at a time, and this is much, much slower.

Looking at your profile, it appears that a large chunk of the time is going into the Iterator operations and the allocation of objects (which I'm speculating are Points that are only being created because that's what the Iterator returns). That's why I think it's the (A|p) idiom that is hurting you: we don't have quite enough static information to apply the for loop optimization.

In X10 2.0.6, pretty much your only chance of getting good performance from moderately clean code here is to do something slightly sleazy and "help" the compiler by exploiting the fact that you actually should have RectRegions (even though the compiler can't prove that statically). To show this, I wrote the little program appended below. I see about a 25x performance difference between the slow, pure loop and the faster but slightly sleazy loop.

I will have to take a look at the Dist and DistArray implementation to see if there is some way we can restructure the code for 2.1.0 such that this becomes possible without requiring a sleazy cast by the user.
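For readers who want to see the two loop shapes side by side outside of X10, here is a rough Java analogue. It is only a sketch and not what the X10 compiler actually emits: the Point class, the regionIterator helper, and the flat float[] storage are all hypothetical stand-ins chosen for illustration. The first method walks the index space through an iterator that allocates one object per element (the general path); the second uses the nested counted loops that become possible once the region is known to be rectangular.

```java
import java.util.Iterator;

public class LoopShapes {
    // Hypothetical stand-in for an X10 Point: one heap object per index pair.
    static final class Point {
        final int i, j;
        Point(int i, int j) { this.i = i; this.j = j; }
    }

    // General path: an iterator that allocates a new Point for every element.
    static Iterator<Point> regionIterator(final int rows, final int cols) {
        return new Iterator<Point>() {
            private int k = 0;
            public boolean hasNext() { return k < rows * cols; }
            public Point next() {
                Point p = new Point(k / cols, k % cols);
                k++;
                return p;
            }
        };
    }

    // Sums the array by pulling Points out of the iterator one at a time.
    static float sumViaIterator(float[] data, int rows, int cols) {
        float tmp = 0;
        for (Iterator<Point> it = regionIterator(rows, cols); it.hasNext(); ) {
            Point p = it.next();
            tmp += data[p.i * cols + p.j];
        }
        return tmp;
    }

    // Sums the same array with plain nested counted loops; no allocation.
    static float sumViaCountedLoops(float[] data, int rows, int cols) {
        float tmp = 0;
        for (int i = 0; i < rows; i++) {
            for (int j = 0; j < cols; j++) {
                tmp += data[i * cols + j];
            }
        }
        return tmp;
    }

    public static void main(String[] args) {
        int rows = 1000, cols = 1000;
        float[] data = new float[rows * cols];
        java.util.Arrays.fill(data, 1.0f);

        long start = System.nanoTime();
        float a = sumViaIterator(data, rows, cols);
        long stop = System.nanoTime();
        float b = sumViaCountedLoops(data, rows, cols);
        long stop2 = System.nanoTime();

        System.out.println("Iterator loop " + (stop - start) / 1e9 + " (sum " + a + ")");
        System.out.println("Counted loop  " + (stop2 - stop) / 1e9 + " (sum " + b + ")");
    }
}
```

Both methods visit the same elements in the same order, so they compute the same sum; only the per-element overhead differs. A JIT may optimize away some of the allocation in the Java version, so the timings it prints should not be read as a prediction of the X10 numbers.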
hope this helps,

--dave

[dgr...@wannalancit tests]$ ../bin/x10c++ -O -NO_CHECKS LoopTest.x10
[dgr...@wannalancit tests]$ mpirun -n 2 a.out
Slow loop 0.145475895
Fast loop 0.006407003

public class LoopTest {
    public static def main(Rail[String]) {
        val r = [1..1000,1..1000] as Region;
        val d = Dist.makeBlock(r);
        val da = DistArray.make[float](d);

        val start = System.nanoTime();
        var tmp:float = 0;
        for ((i,j) in (da|here)) {
            tmp += da(i,j);
        }
        val stop = System.nanoTime();

        val dHere = d | here;
        val rhere = dHere.region as RectRegion; // This is a sleazy trick
        val start2 = System.nanoTime();
        for ((i,j) in rhere) {
            tmp += da(i,j);
        }
        val stop2 = System.nanoTime();

        Console.OUT.println("Slow loop "+(((stop-start) as double)/1e9));
        Console.OUT.println("Fast loop "+(((stop2-start2) as double)/1e9));
    }
}

_______________________________________________
X10-users mailing list
X10-users@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/x10-users