[X10-users] Question X10 performance

Tom Maes Fri, 31 May 2013 18:36:07 -0700

Hello,

For a task we had to implement the Conjugate Gradient Method in a parallel 
language. For this task, I have chosen to implement it in X10, which means I’m 
quite new to the language.
Before I ask my question, I will first give a short explanation about some code 
and the environment.


Code:

The main method:

public static def main(Array[String]):void {
  val size : Int = 10000;
  val matDist = Dist.makeBlock((1..size) * (1..size));
  val vecDist = Dist.makeBlock(1..size);
  val randomGenerator : Random = new Random();
  val a : DistArray[Double] = DistArray.make[Double](matDist, (Point(2)) => 
randomGenerator.nextDouble() * randomGenerator.nextInt());
  val x : DistArray[Double] = DistArray.make[Double](vecDist, (Point(1)) => 
randomGenerator.nextDouble());
  val b : DistArray[Double] = DistArray.make[Double](vecDist, (Point(1)) => 
randomGenerator.nextDouble() * randomGenerator.nextInt());
  val r : DistArray[Double] = DistArray.make[Double](vecDist, (Point(1)) => 
0.0);
  val w : DistArray[Double] = DistArray.make[Double](vecDist, (Point(1)) => 
0.0);
  val p : DistArray[Double] = DistArray.make[Double](vecDist, (Point(1)) => 
0.0);
  val result : DistArray[Double] = DistArray.make[Double](vecDist, (Point(1)) 
=> 0.0);
 
  var alpha : Double = 0.0;
  var beta : Double = 0.0;
 
  var pk : Double = 0.0;
  var pk_1 : Double = 0.0;

  // r = b - A . x
  multiplication_matrix_row(a,b,result);
  subtract(b,result,r);
  // p = r
  copy(p,r);
  // p0 = rT . r
  pk = multiplication_row_column(r, r);
 
  for(val i in 1..100){
    // w = A . p
    multiplication_matrix_row(a,p,w);
    // alpha = p(k)²/(pT . w)
    alpha = (pk * pk) / multiplication_row_column(p, w);
    // x = x + alpha . p
    multiplication_number_row(alpha, p, result);
    addition(x, result);
    // r = r - alpha . w
    multiplication_number_row(alpha, w, result);
    subtract(r, result, r);
    // p(k+1) = rT . r
    pk_1 = multiplication_row_column(r, r);
    // beta = p(k)²/p(k+1)²
    beta = (pk * pk) / (pk_1 * pk_1);
    // p = r + beta . p
    multiplication_number_row(beta, p, p);
    addition(r, p);
    pk = pk_1;
  }

}

I’m using the distributed arrays, and for A, B and X, I’m initializing them 
with a random value.
All other functions are part of the complete algorithm.

I’m not going to display all functions, as it would be irrelevant, but their 
structure looks the same as the following method:

static def multiplication_matrix_row(matrix : DistArray[Double], row : 
DistArray[Double], result : DistArray[Double]):void {
    finish for(val place in Place.places()) 
        async at(place) {
            for(val [i] in row.dist|here) {
                result(i) = 0;
            }
            for(val [i] in row.dist|here) {
                var value : Double = 0;
                for(val [j] in row.dist|here) {
                    value += matrix(i,j) * row(i);
                }
                val endValue = value;
                at(result.dist(i)) atomic {
                  result(i) += endValue;
                }
            }
        }
  }


I compiled my file called main.x10 with the c++ back-end as: 

x10c++ –O –NO_CHECKS –o ConjugateGradientMethod main.x10

For the execution of the program, I’m running it on: 

https://cc.ulb.ac.be/hydra/hydra_specs.php

Using the specs: 

X10_NPLACES=X runx10 ConjugateGradientMethod

With X being the number of places (of course).

The question:

One part of the task was to benchmark this algorithm and to have a look at the 
results. When executing this algorithm for one place, I received an average 
time of +/- 30 minutes.
However, when I perform the above code for 20 places, I receive an average time 
of 19 seconds, which means a speedup of 91x.

Therefore I’m wondering what is wrong with my execution or code, as I can’t 
figure it out. 
My first idea was that the ‘finish’ was wrong and needed to be a ‘clocked 
finish’, but as far as I know, the ‘finish’ makes sure that all spawned tasks 
will synchronize before going to the next method (as also explained on page 192 
in spec v2.3) and therefore no clock would be needed.
My next taught would maybe be an overhead for 1 place, but I can’t believe it 
would that large.

Could someone explain me what I’m doing wrong?

With kind regards,

Tom

------------------------------------------------------------------------------
Get 100% visibility into Java/.NET code with AppDynamics Lite
It's a free troubleshooting tool designed for production
Get down to code-level detail for bottlenecks, with <2% overhead.
Download for free and get started troubleshooting in minutes.
http://p.sf.net/sfu/appdyn_d2d_ap2

_______________________________________________
X10-users mailing list
X10-users@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/x10-users

[X10-users] Question X10 performance

Reply via email to