Re: [X10-users] Question X10 performance

Vijay Saraswat Sat, 01 Jun 2013 08:01:58 -0700

Tom --

I would start by checking that you are getting correct results for 1place and multiple places.

Once your code is as simple as you can make it (i.e directly expressesthe logic of the computation) and correct for multiple places should youthen turn to look at how to improve performance. (There are many placesin your code where performance can be improved through local rewrites,but important to check correctness first.)


Best,
Vijay

On 6/1/13 9:21 AM, Josh Milthorpe wrote:

Hi Tom,
the superlinear speedup you're seeing is surprising, and does suggestsomething is broken in the code for single-place execution.
Have you profiled the code (e.g. using gperftools, x10c++ -gpt) toconfirm that the time is being spent in the methods that you expect,like multiplication_matrix_row?
Assuming that it is, I notice that your code uses 'atomic' to updateeach element of the result vector. 'atomic' is currently implementedusing a place-wide lock and may limit scalability for multiple threadswithin a place.
http://x10-lang.org/documentation/practical-x10-programming/performance-tuning.html
How many threads are you using per place?  (X10_NTHREADS)
If it's at all possible, you should try to avoid the use of 'atomic'(just as you should try to avoid using locks in other programmingmodels). For example, organise the code so that each element of anarray is written to by only one activity. If I readmultiplication_matrix_row correctly, it looks like this may already bethe case - can you confirm?
Cheers,

Josh
------------------------------------------
Josh Milthorpe

Postdoctoral Fellow, Research School of Computer Science
Australian National University, Building 108
Canberra, ACT 0200
Australia
Phone:  + 61 (0)2 61254478
Mobile: + 61 (0)407 940743
E-mail:josh.miltho...@anu.edu.au
Web:http://cs.anu.edu.au/~Josh.Milthorpe/On 01/06/13 09:52, Tom Maes wrote:
Hello,
For a task we had to implement the Conjugate Gradient Method in aparallel language. For this task, I have chosen to implement it inX10, which means I'm quite new to the language.Before I ask my question, I will first give a short explanation aboutsome code and the environment.
Code:
The main method:
public static def main(Array[String]):void {
  val size : Int = 10000;
  val matDist = Dist.makeBlock((1..size) * (1..size));
  val vecDist = Dist.makeBlock(1..size);
  val randomGenerator : Random = new Random();
val a : DistArray[Double] = DistArray.make[Double](matDist,(Point(2)) => randomGenerator.nextDouble() * randomGenerator.nextInt());val x : DistArray[Double] = DistArray.make[Double](vecDist,(Point(1)) => randomGenerator.nextDouble());val b : DistArray[Double] = DistArray.make[Double](vecDist,(Point(1)) => randomGenerator.nextDouble() * randomGenerator.nextInt());val r : DistArray[Double] = DistArray.make[Double](vecDist,(Point(1)) => 0.0);val w : DistArray[Double] = DistArray.make[Double](vecDist,(Point(1)) => 0.0);val p : DistArray[Double] = DistArray.make[Double](vecDist,(Point(1)) => 0.0);val result : DistArray[Double] = DistArray.make[Double](vecDist,(Point(1)) => 0.0);
  var alpha : Double = 0.0;
  var beta : Double = 0.0;
  var pk : Double = 0.0;
  var pk_1 : Double = 0.0;
  // r = b - A . x
  multiplication_matrix_row(a,b,result);
  subtract(b,result,r);
  // p = r
  copy(p,r);
  // p0 = rT . r
  pk = multiplication_row_column(r, r);
  for(val i in 1..100){
    // w = A . p
    multiplication_matrix_row(a,p,w);
    // alpha = p(k)²/(pT . w)
    alpha = (pk * pk) / multiplication_row_column(p, w);
    // x = x + alpha . p
    multiplication_number_row(alpha, p, result);
    addition(x, result);
    // r = r - alpha . w
    multiplication_number_row(alpha, w, result);
    subtract(r, result, r);
    // p(k+1) = rT . r
    pk_1 = multiplication_row_column(r, r);
    // beta = p(k)²/p(k+1)²
    beta = (pk * pk) / (pk_1 * pk_1);
    // p = r + beta . p
    multiplication_number_row(beta, p, p);
    addition(r, p);
    pk = pk_1;
  }
}
I'm using the distributed arrays, and for A, B and X, I'minitializing them with a random value.
All other functions are part of the complete algorithm.
I'm not going to display all functions, as it would be irrelevant,but their structure looks the same as the following method:static def multiplication_matrix_row(matrix : DistArray[Double], row: DistArray[Double], result : DistArray[Double]):void {
    finish for(val place in Place.places())
        async at(place) {
            for(val [i] in row.dist|here) {
                result(i) = 0;
            }
            for(val [i] in row.dist|here) {
                var value : Double = 0;
                for(val [j] in row.dist|here) {
                    value += matrix(i,j) * row(i);
                }
                val endValue = value;
                at(result.dist(i)) atomic {
                  result(i) += endValue;
                }
            }
        }
  }
I compiled my file called main.x10 with the c++ back-end as:
x10c++ --O --NO_CHECKS --o ConjugateGradientMethod main.x10
For the execution of the program, I'm running it on:
https://cc.ulb.ac.be/hydra/hydra_specs.php
Using the specs:
X10_NPLACES=X runx10 ConjugateGradientMethod

With X being the number of places (of course).
The question:
One part of the task was to benchmark this algorithm and to have alook at the results. When executing this algorithm for one place, Ireceived an average time of +/- 30 minutes.However, when I perform the above code for 20 places, I receive anaverage time of 19 seconds, which means a speedup of 91x.Therefore I'm wondering what is wrong with my execution or code, as Ican't figure it out.My first idea was that the 'finish' was wrong and needed to be a'clocked finish', but as far as I know, the 'finish' makes sure thatall spawned tasks will synchronize before going to the next method(as also explained on page 192 in spec v2.3) and therefore no clockwould be needed.My next taught would maybe be an overhead for 1 place, but I can'tbelieve it would that large.
Could someone explain me what I'm doing wrong?
With kind regards,
Tom


------------------------------------------------------------------------------
Get 100% visibility into Java/.NET code with AppDynamics Lite
It's a free troubleshooting tool designed for production
Get down to code-level detail for bottlenecks, with <2% overhead.
Download for free and get started troubleshooting in minutes.
http://p.sf.net/sfu/appdyn_d2d_ap2


_______________________________________________
X10-users mailing list
X10-users@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/x10-users
------------------------------------------------------------------------------
Get 100% visibility into Java/.NET code with AppDynamics Lite
It's a free troubleshooting tool designed for production
Get down to code-level detail for bottlenecks, with <2% overhead.
Download for free and get started troubleshooting in minutes.
http://p.sf.net/sfu/appdyn_d2d_ap2


_______________________________________________
X10-users mailing list
X10-users@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/x10-users

------------------------------------------------------------------------------
Get 100% visibility into Java/.NET code with AppDynamics Lite
It's a free troubleshooting tool designed for production
Get down to code-level detail for bottlenecks, with <2% overhead.
Download for free and get started troubleshooting in minutes.
http://p.sf.net/sfu/appdyn_d2d_ap2

_______________________________________________
X10-users mailing list
X10-users@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/x10-users

Re: [X10-users] Question X10 performance

Reply via email to