Christoph Pospiech <christoph.pospi...@de.ibm.com> wrote on 09/09/2010
07:04:10 AM:
> $ mpiexec -np 1 ./matmul 500 10 1
> ****************************************************
> * X10 test program for matrix vector multiplication
> ****************************************************
>
> matrix size = 500
> loop count  = 10
> places      = 1
> axis for parallelization = 1
>
> The time is 1.162122911000552 s
>
> This narrows the gap to 2.10 orders of magnitude.
>

Hi,

        My guess is that the overhead is coming from the "for ((i,j) in (A|p))"
idiom in the for loop.  We haven't had a chance to finish all the
optimizer work that is needed to make that efficient yet in the general
case.  If the compiler knows statically that the Region R in the for ((i,j)
in R) { ... } construct has the rect property, then we generate a proper
nested counted for loop and get good performance.  If we can't prove the
region is rect (or it isn't rect), then we have to fall back to calling
the region's iterator to get the Point objects one at a time, and this is
much, much slower.
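
        To make the fast path concrete, here is a rough sketch (not actual
compiler output) of what the rect case boils down to for a rank-2 region:
two counted loops over the bounds, with no Point objects created.  It
assumes that Region exposes min(i) and max(i) as the inclusive bounds on
axis i; the class name RectLoopSketch is made up for illustration.

```x10
public class RectLoopSketch {

  public static def main(Rail[String]) {
    val r = [1..1000,1..1000] as Region;
    var count:int = 0;
    // ASSUMPTION: r.min(k) / r.max(k) return the inclusive lower and
    // upper bound of the region along axis k.
    for (var i:int = r.min(0); i <= r.max(0); i++) {
      for (var j:int = r.min(1); j <= r.max(1); j++) {
        // The body sees plain ints i and j; nothing is allocated here.
        count++;
      }
    }
    Console.OUT.println("visited " + count);
  }
}
```

The slow path, by contrast, asks the region's iterator for one Point per
element and then pulls i and j out of it, which is where the allocation
cost comes from.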

        Looking at your profile, it appears that a large chunk of the time is
going into the Iterator operations and the allocation of objects (which I'm
speculating are Points that are only being created because that's what the
Iterator returns).  That's why I think it's the (A|p) idiom that is hurting
you.  We don't have quite enough static information to apply the for loop
optimization.

        In X10 2.0.6, pretty much your only chance of getting good
performance from moderately clean code here is to do something slightly
sleazy and "help" the compiler by exploiting the fact that you actually do
have RectRegions (even though the compiler can't prove that
statically).

        To show this, I wrote the little program appended below.  I see about
a 25x performance difference between the slow, pure loop and the faster but
slightly sleazy loop.

        I will have to take a look at the Dist and DistArray implementation
to see if there is some way we can restructure the code for 2.1.0 such that
this becomes possible without requiring a sleazy cast by the user.

hope this helps,

--dave


[dgr...@wannalancit tests]$ ../bin/x10c++ -O -NO_CHECKS LoopTest.x10
[dgr...@wannalancit tests]$ mpirun -n 2  a.out
Slow loop 0.145475895
Fast loop 0.006407003


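public class LoopTest {

  public static def main(Rail[String]) {
    val r = [1..1000,1..1000] as Region;
    val d = Dist.makeBlock(r);
    val da = DistArray.make[float](d);

    // Slow loop: the compiler can't prove that (da|here) is rect, so it
    // goes through the region's iterator and gets one Point per element.
    val start = System.nanoTime();
    var tmp:float = 0;
    for ((i,j) in (da|here)) {
        tmp += da(i,j);
    }
    val stop = System.nanoTime();

    // Fast loop: cast the local region to RectRegion so the compiler can
    // generate a nested counted for loop.  The cast setup is not timed.
    val dHere = d | here;
    val rhere = dHere.region as RectRegion; // This is a sleazy trick
    val start2 = System.nanoTime();
    for ((i,j) in rhere) {
        tmp += da(i,j);
    }
    val stop2 = System.nanoTime();

    // Report elapsed times in seconds.
    Console.OUT.println("Slow loop "+(((stop-start) as double)/1e9));
    Console.OUT.println("Fast loop "+(((stop2-start2) as double)/1e9));
  }
}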