Hi Tom - I see that others have commented on your original question, but I thought I'd add a note on the way you are executing your program.
Since you're using native X10, you can run your binary directly, like this: "X10_NPLACES=X ./ConjugateGradientMethod", without the need for runx10. Also, since you're not specifying locations for those places, they are all running on the local host, which has only 16 cores. You'll want to set X10_HOSTFILE or X10_HOSTLIST to spread them across the cluster; see http://www.x10-lang.org/documentation/practical-x10-programming/x10rt-implementations.html (a minimal sketch of this is included at the end of this message).

- Ben

From: Tom Maes <tomma...@hotmail.com>
To: <x10-users@lists.sourceforge.net>
Date: 05/31/2013 09:35 PM
Subject: [X10-users] Question X10 performance

Hello,

For a task we had to implement the Conjugate Gradient Method in a parallel language. I chose to implement it in X10, which means I'm quite new to the language. Before I ask my question, I will first give a short explanation of the code and the environment.

Code:

The main method:

public static def main(Array[String]):void {
    val size : Int = 10000;
    val matDist = Dist.makeBlock((1..size) * (1..size));
    val vecDist = Dist.makeBlock(1..size);
    val randomGenerator : Random = new Random();

    val a : DistArray[Double] = DistArray.make[Double](matDist,
        (Point(2)) => randomGenerator.nextDouble() * randomGenerator.nextInt());
    val x : DistArray[Double] = DistArray.make[Double](vecDist,
        (Point(1)) => randomGenerator.nextDouble());
    val b : DistArray[Double] = DistArray.make[Double](vecDist,
        (Point(1)) => randomGenerator.nextDouble() * randomGenerator.nextInt());
    val r : DistArray[Double] = DistArray.make[Double](vecDist, (Point(1)) => 0.0);
    val w : DistArray[Double] = DistArray.make[Double](vecDist, (Point(1)) => 0.0);
    val p : DistArray[Double] = DistArray.make[Double](vecDist, (Point(1)) => 0.0);
    val result : DistArray[Double] = DistArray.make[Double](vecDist, (Point(1)) => 0.0);

    var alpha : Double = 0.0;
    var beta : Double = 0.0;
    var pk : Double = 0.0;
    var pk_1 : Double = 0.0;

    // r = b - A . x
    multiplication_matrix_row(a, b, result);
    subtract(b, result, r);
    // p = r
    copy(p, r);
    // p0 = rT . r
    pk = multiplication_row_column(r, r);

    for (val i in 1..100) {
        // w = A . p
        multiplication_matrix_row(a, p, w);
        // alpha = p(k)²/(pT . w)
        alpha = (pk * pk) / multiplication_row_column(p, w);
        // x = x + alpha . p
        multiplication_number_row(alpha, p, result);
        addition(x, result);
        // r = r - alpha . w
        multiplication_number_row(alpha, w, result);
        subtract(r, result, r);
        // p(k+1) = rT . r
        pk_1 = multiplication_row_column(r, r);
        // beta = p(k)²/p(k+1)²
        beta = (pk * pk) / (pk_1 * pk_1);
        // p = r + beta . p
        multiplication_number_row(beta, p, p);
        addition(r, p);
        pk = pk_1;
    }
}

I'm using distributed arrays, and A, B and X are initialized with random values. All the other functions are part of the complete algorithm.
I'm not going to show all of the functions, as that would be irrelevant here, but they all have the same structure as the following method:

static def multiplication_matrix_row(matrix : DistArray[Double], row : DistArray[Double], result : DistArray[Double]):void {
    finish for (val place in Place.places()) async at (place) {
        for (val [i] in row.dist|here) {
            result(i) = 0;
        }
        for (val [i] in row.dist|here) {
            var value : Double = 0;
            for (val [j] in row.dist|here) {
                value += matrix(i,j) * row(i);
            }
            val endValue = value;
            at (result.dist(i)) atomic {
                result(i) += endValue;
            }
        }
    }
}

I compiled my file, main.x10, with the C++ backend:

    x10c++ -O -NO_CHECKS -o ConjugateGradientMethod main.x10

I'm running the program on the Hydra cluster (specs: https://cc.ulb.ac.be/hydra/hydra_specs.php), executing it as:

    X10_NPLACES=X runx10 ConjugateGradientMethod

with X being the number of places (of course).

The question:

One part of the task was to benchmark this algorithm and look at the results. When I execute it with one place, I get an average time of +/- 30 minutes. However, when I run the same code with 20 places, I get an average time of 19 seconds, which means a speedup of 91x. So I'm wondering what is wrong with my execution or my code, as I can't figure it out. My first idea was that the 'finish' was wrong and needed to be a 'clocked finish', but as far as I know, 'finish' makes sure that all spawned tasks synchronize before the program moves on to the next method (as also explained on page 192 of the v2.3 spec), so no clock should be needed. My next thought was that there might be some overhead for a single place, but I can't believe it would be that large. Could someone explain to me what I'm doing wrong?

With kind regards,
Tom
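For reference, a minimal sketch of the synchronization pattern under discussion: a finish blocks until every activity spawned in its body has terminated, including activities that were shipped to other places with async at, so no clock is needed for this structure. The method name and the doubling loop below are only placeholders for illustration.

// a minimal sketch: the finish waits for all activities spawned in its body,
// including the ones running at remote places
static def scaleEverywhere(data : DistArray[Double]) : void {
    finish for (val place in Place.places()) async at (place) {
        for (val [i] in data.dist|here) {
            data(i) = data(i) * 2.0; // purely local work at this place
        }
    }
    // control reaches this point only after every place has finished its loop
}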
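To make the execution setup mentioned at the top of this message concrete, here is a minimal sketch of spreading places across hosts with the sockets runtime. The file name hosts.txt and the node names are placeholders; substitute the actual Hydra node names.

    # hosts.txt -- one hostname per line; if there are more places than hosts,
    # the list is reused from the top (node names below are placeholders)
    node01
    node02
    node03
    node04

    # run 20 places spread over the hosts listed in hosts.txt
    X10_NPLACES=20 X10_HOSTFILE=hosts.txt ./ConjugateGradientMethod

    # or, equivalently, pass the hosts inline as a comma-separated list
    X10_NPLACES=20 X10_HOSTLIST=node01,node02,node03,node04 ./ConjugateGradientMethod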
_______________________________________________
X10-users mailing list
X10-users@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/x10-users