Hi Tom - I see that others have commented on your original question, but I
thought I'd add a note on the way you are executing your program.

Since you're using native X10, you can run your binary directly, like this:
"X10_NPLACES=X ./ConjugateGradientMethod", without the need for runx10.
Also, since you're not specifying the locations for these places, they are
all running on the local host, which only has 16 cores.  You'll want to
specify X10_HOSTFILE or X10_HOSTLIST to spread them across the cluster.

http://www.x10-lang.org/documentation/practical-x10-programming/x10rt-implementations.html


   - Ben



From:   Tom Maes <tomma...@hotmail.com>
To:     <x10-users@lists.sourceforge.net>,
Date:   05/31/2013 09:35 PM
Subject:        [X10-users] Question X10 performance



Hello,

For a task we had to implement the Conjugate Gradient Method in a parallel
language. For this task, I have chosen to implement it in X10, which means
I’m quite new to the language.
Before I ask my question, I will first give a short explanation about some
code and the environment.

Code:

The main method:

public static def main(Array[String]):void {
  val size : Int = 10000;
  val matDist = Dist.makeBlock((1..size) * (1..size));
  val vecDist = Dist.makeBlock(1..size);
  val randomGenerator : Random = new Random();
  val a : DistArray[Double] = DistArray.make[Double](matDist, (Point(2)) =>
randomGenerator.nextDouble() * randomGenerator.nextInt());
  val x : DistArray[Double] = DistArray.make[Double](vecDist, (Point(1)) =>
randomGenerator.nextDouble());
  val b : DistArray[Double] = DistArray.make[Double](vecDist, (Point(1)) =>
randomGenerator.nextDouble() * randomGenerator.nextInt());
  val r : DistArray[Double] = DistArray.make[Double](vecDist, (Point(1)) =>
0.0);
  val w : DistArray[Double] = DistArray.make[Double](vecDist, (Point(1)) =>
0.0);
  val p : DistArray[Double] = DistArray.make[Double](vecDist, (Point(1)) =>
0.0);
  val result : DistArray[Double] = DistArray.make[Double](vecDist, (Point
(1)) => 0.0);
  var alpha : Double = 0.0;
  var beta : Double = 0.0;
  var pk : Double = 0.0;
  var pk_1 : Double = 0.0;

  // r = b - A . x
  multiplication_matrix_row(a,b,result);
  subtract(b,result,r);
  // p = r
  copy(p,r);
  // p0 = rT . r
  pk = multiplication_row_column(r, r);
  for(val i in 1..100){
    // w = A . p
    multiplication_matrix_row(a,p,w);
    // alpha = p(k)²/(pT . w)
    alpha = (pk * pk) / multiplication_row_column(p, w);
    // x = x + alpha . p
    multiplication_number_row(alpha, p, result);
    addition(x, result);
    // r = r - alpha . w
    multiplication_number_row(alpha, w, result);
    subtract(r, result, r);
    // p(k+1) = rT . r
    pk_1 = multiplication_row_column(r, r);
    // beta = p(k)²/p(k+1)²
    beta = (pk * pk) / (pk_1 * pk_1);
    // p = r + beta . p
    multiplication_number_row(beta, p, p);
    addition(r, p);
    pk = pk_1;
  }

}

I’m using the distributed arrays, and for A, B and X, I’m initializing them
with a random value.
All other functions are part of the complete algorithm.

I’m not going to display all functions, as it would be irrelevant, but
their structure looks the same as the following method:

static def multiplication_matrix_row(matrix : DistArray[Double], row :
DistArray[Double], result : DistArray[Double]):void {
    finish for(val place in Place.places())
        async at(place) {
            for(val [i] in row.dist|here) {
                result(i) = 0;
            }
            for(val [i] in row.dist|here) {
                var value : Double = 0;
                for(val [j] in row.dist|here) {
                    value += matrix(i,j) * row(i);
                }
                val endValue = value;
                at(result.dist(i)) atomic {
                  result(i) += endValue;
                }
            }
        }
  }


I compiled my file called main.x10 with the c++ back-end as:

x10c++ –O –NO_CHECKS –o ConjugateGradientMethod main.x10

For the execution of the program, I’m running it on:

https://cc.ulb.ac.be/hydra/hydra_specs.php

Using the specs:

X10_NPLACES=X runx10 ConjugateGradientMethod

With X being the number of places (of course).

The question:

One part of the task was to benchmark this algorithm and to have a look at
the results. When executing this algorithm for one place, I received an
average time of +/- 30 minutes.
However, when I perform the above code for 20 places, I receive an average
time of 19 seconds, which means a speedup of 91x.

Therefore I’m wondering what is wrong with my execution or code, as I can’t
figure it out.
My first idea was that the ‘finish’ was wrong and needed to be a ‘clocked
finish’, but as far as I know, the ‘finish’ makes sure that all spawned
tasks will synchronize before going to the next method (as also explained
on page 192 in spec v2.3) and therefore no clock would be needed.
My next taught would maybe be an overhead for 1 place, but I can’t believe
it would that large.

Could someone explain me what I’m doing wrong?

With kind regards,

Tom
------------------------------------------------------------------------------

Get 100% visibility into Java/.NET code with AppDynamics Lite
It's a free troubleshooting tool designed for production
Get down to code-level detail for bottlenecks, with <2% overhead.
Download for free and get started troubleshooting in minutes.
http://p.sf.net/sfu/appdyn_d2d_ap2
_______________________________________________
X10-users mailing list
X10-users@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/x10-users

<<inline: graycol.gif>>

------------------------------------------------------------------------------
Get 100% visibility into Java/.NET code with AppDynamics Lite
It's a free troubleshooting tool designed for production
Get down to code-level detail for bottlenecks, with <2% overhead.
Download for free and get started troubleshooting in minutes.
http://p.sf.net/sfu/appdyn_d2d_ap2
_______________________________________________
X10-users mailing list
X10-users@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/x10-users

Reply via email to