Tom --

I would start by checking that you are getting correct results for 1 place and multiple places.

Once your code is as simple as you can make it (i.e directly expresses the logic of the computation) and correct for multiple places should you then turn to look at how to improve performance. (There are many places in your code where performance can be improved through local rewrites, but important to check correctness first.)

Best,
Vijay

On 6/1/13 9:21 AM, Josh Milthorpe wrote:
Hi Tom,

the superlinear speedup you're seeing is surprising, and does suggest something is broken in the code for single-place execution.

Have you profiled the code (e.g. using gperftools, x10c++ -gpt) to confirm that the time is being spent in the methods that you expect, like multiplication_matrix_row?

Assuming that it is, I notice that your code uses 'atomic' to update each element of the result vector. 'atomic' is currently implemented using a place-wide lock and may limit scalability for multiple threads within a place.
http://x10-lang.org/documentation/practical-x10-programming/performance-tuning.html
How many threads are you using per place?  (X10_NTHREADS)

If it's at all possible, you should try to avoid the use of 'atomic' (just as you should try to avoid using locks in other programming models). For example, organise the code so that each element of an array is written to by only one activity. If I read multiplication_matrix_row correctly, it looks like this may already be the case - can you confirm?

Cheers,

Josh
------------------------------------------
Josh Milthorpe

Postdoctoral Fellow, Research School of Computer Science
Australian National University, Building 108
Canberra, ACT 0200
Australia
Phone:  + 61 (0)2 61254478
Mobile: + 61 (0)407 940743
E-mail:josh.miltho...@anu.edu.au
Web:http://cs.anu.edu.au/~Josh.Milthorpe/ On 01/06/13 09:52, Tom Maes wrote:
Hello,
For a task we had to implement the Conjugate Gradient Method in a parallel language. For this task, I have chosen to implement it in X10, which means I'm quite new to the language. Before I ask my question, I will first give a short explanation about some code and the environment.
Code:
The main method:
public static def main(Array[String]):void {
  val size : Int = 10000;
  val matDist = Dist.makeBlock((1..size) * (1..size));
  val vecDist = Dist.makeBlock(1..size);
  val randomGenerator : Random = new Random();
val a : DistArray[Double] = DistArray.make[Double](matDist, (Point(2)) => randomGenerator.nextDouble() * randomGenerator.nextInt()); val x : DistArray[Double] = DistArray.make[Double](vecDist, (Point(1)) => randomGenerator.nextDouble()); val b : DistArray[Double] = DistArray.make[Double](vecDist, (Point(1)) => randomGenerator.nextDouble() * randomGenerator.nextInt()); val r : DistArray[Double] = DistArray.make[Double](vecDist, (Point(1)) => 0.0); val w : DistArray[Double] = DistArray.make[Double](vecDist, (Point(1)) => 0.0); val p : DistArray[Double] = DistArray.make[Double](vecDist, (Point(1)) => 0.0); val result : DistArray[Double] = DistArray.make[Double](vecDist, (Point(1)) => 0.0);
  var alpha : Double = 0.0;
  var beta : Double = 0.0;
  var pk : Double = 0.0;
  var pk_1 : Double = 0.0;
  // r = b - A . x
  multiplication_matrix_row(a,b,result);
  subtract(b,result,r);
  // p = r
  copy(p,r);
  // p0 = rT . r
  pk = multiplication_row_column(r, r);
  for(val i in 1..100){
    // w = A . p
    multiplication_matrix_row(a,p,w);
    // alpha = p(k)²/(pT . w)
    alpha = (pk * pk) / multiplication_row_column(p, w);
    // x = x + alpha . p
    multiplication_number_row(alpha, p, result);
    addition(x, result);
    // r = r - alpha . w
    multiplication_number_row(alpha, w, result);
    subtract(r, result, r);
    // p(k+1) = rT . r
    pk_1 = multiplication_row_column(r, r);
    // beta = p(k)²/p(k+1)²
    beta = (pk * pk) / (pk_1 * pk_1);
    // p = r + beta . p
    multiplication_number_row(beta, p, p);
    addition(r, p);
    pk = pk_1;
  }
}
I'm using the distributed arrays, and for A, B and X, I'm initializing them with a random value.
All other functions are part of the complete algorithm.
I'm not going to display all functions, as it would be irrelevant, but their structure looks the same as the following method: static def multiplication_matrix_row(matrix : DistArray[Double], row : DistArray[Double], result : DistArray[Double]):void {
    finish for(val place in Place.places())
        async at(place) {
            for(val [i] in row.dist|here) {
                result(i) = 0;
            }
            for(val [i] in row.dist|here) {
                var value : Double = 0;
                for(val [j] in row.dist|here) {
                    value += matrix(i,j) * row(i);
                }
                val endValue = value;
                at(result.dist(i)) atomic {
                  result(i) += endValue;
                }
            }
        }
  }
I compiled my file called main.x10 with the c++ back-end as:
x10c++ --O --NO_CHECKS --o ConjugateGradientMethod main.x10
For the execution of the program, I'm running it on:
https://cc.ulb.ac.be/hydra/hydra_specs.php
Using the specs:
X10_NPLACES=X runx10 ConjugateGradientMethod

With X being the number of places (of course).
The question:
One part of the task was to benchmark this algorithm and to have a look at the results. When executing this algorithm for one place, I received an average time of +/- 30 minutes. However, when I perform the above code for 20 places, I receive an average time of 19 seconds, which means a speedup of 91x. Therefore I'm wondering what is wrong with my execution or code, as I can't figure it out. My first idea was that the 'finish' was wrong and needed to be a 'clocked finish', but as far as I know, the 'finish' makes sure that all spawned tasks will synchronize before going to the next method (as also explained on page 192 in spec v2.3) and therefore no clock would be needed. My next taught would maybe be an overhead for 1 place, but I can't believe it would that large.
Could someone explain me what I'm doing wrong?
With kind regards,
Tom


------------------------------------------------------------------------------
Get 100% visibility into Java/.NET code with AppDynamics Lite
It's a free troubleshooting tool designed for production
Get down to code-level detail for bottlenecks, with <2% overhead.
Download for free and get started troubleshooting in minutes.
http://p.sf.net/sfu/appdyn_d2d_ap2


_______________________________________________
X10-users mailing list
X10-users@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/x10-users



------------------------------------------------------------------------------
Get 100% visibility into Java/.NET code with AppDynamics Lite
It's a free troubleshooting tool designed for production
Get down to code-level detail for bottlenecks, with <2% overhead.
Download for free and get started troubleshooting in minutes.
http://p.sf.net/sfu/appdyn_d2d_ap2


_______________________________________________
X10-users mailing list
X10-users@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/x10-users

------------------------------------------------------------------------------
Get 100% visibility into Java/.NET code with AppDynamics Lite
It's a free troubleshooting tool designed for production
Get down to code-level detail for bottlenecks, with <2% overhead.
Download for free and get started troubleshooting in minutes.
http://p.sf.net/sfu/appdyn_d2d_ap2
_______________________________________________
X10-users mailing list
X10-users@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/x10-users

Reply via email to