Adding to what Josh said (1): If you really need to calculate the N^2
forces then you are probably better off simply replicating the N bodies
in each place. At the beginning of each round, each place should
broadcast the current values of its N/P bodies to all the others.
After this, each place can perform its calculations locally and update
the position for the N/P bodies it owns. Now you get to choose whether
you want two-level parallelism, i.e., whether you should have multiple
workers per place, each worker responsible for updating N/(P*W) bodies,
or a single worker per place. The total amount of communication will be
less if you have fewer places.
So the only advantage of running over many places is to get more cores
to attack the problem. You should be able to use some of the collective
libraries to do an all-to-all exchange.
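To make that concrete, here is a rough, self-contained sketch of the round
structure. It replicates a single Double per body instead of your Body
objects, and every name in it (ReplicateSketch, owned, replica, NB) is a
placeholder rather than existing API; depending on your X10 version you may
need to swap Rail for Array and adjust Int/Long index types.

import x10.array.DistArray;
import x10.io.Console;

// Sketch: broadcast each place's chunk at the start of a round, then compute locally.
public class ReplicateSketch {
    public static def main(args: Rail[String]) {
        val NB = 1000;                         // total number of bodies
        val nb = NB as Long;
        val R: Region(1) = 1..NB;
        val D = Dist.makeBlock(R);             // each place owns ~NB/P bodies
        val U = Dist.makeUnique();             // one point per place
        // authoritative value for each body, stored at its home place
        val owned = DistArray.make[Double](D, (p:Point(D.rank)) => 0.0);
        // a full copy of all NB values, kept at every place
        val replica = DistArray.make[Rail[Double]](U, (p:Point(U.rank)) => new Rail[Double](nb));
        for (round in 1..3) {
            // 1. broadcast: each place packs its owned chunk into two Rails
            //    and ships it to every place in one message per destination
            finish ateach (pt in U) {
                val local = D | here;          // the region this place owns
                val n = local.size() as Long;
                val idx = new Rail[Long](n);
                val v = new Rail[Double](n);
                var k: Long = 0;
                for (p in local) {
                    idx(k) = (p(0) - 1) as Long;   // region starts at 1
                    v(k) = owned(p);
                    k++;
                }
                finish for (q in Place.places()) at (q) async {
                    val copy = replica(here.id);   // this destination's replica
                    for (var j: Long = 0; j < n; j++) copy(idx(j)) = v(j);
                }
            }
            // 2. compute: purely local; each place updates only the bodies it
            //    owns, reading every interaction partner from its local replica
            finish ateach (pt in U) {
                val copy = replica(here.id);
                for (p in (D | here)) async {
                    val i = (p(0) - 1) as Long;
                    var acc: Double = 0.0;
                    for (var j: Long = 0; j < nb; j++)
                        if (j != i) acc += copy(j);    // stand-in for a force term
                    owned(p) = owned(p) + acc * 1.0e-6; // stand-in for an update
                }
            }
        }
        Console.OUT.println("rounds complete");
    }
}

A collective all-to-all from the library would replace the hand-written
broadcast loop, but the shape of each round stays the same.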
With these changes you should be able to get good speedups: roughly
linear, except for the cost of the broadcasts. You may also need to
increase N; you are not going to get speedups by parallelizing across
nodes if your total sequential time is in microseconds (you didn't
provide the unit in your performance numbers).
On 6/17/13 8:18 PM, Josh Milthorpe wrote:
Hi Konstantina,
the core of the problem seems to be this:
finish ateach(b in bodies){
    bi: Body = bodies(b);
    finish ateach(p in bodies){
This will create N^2 activities in the inner loop, of which (N^2 *
(P-1) / P) will execute remotely (i.e., at a different place from the
home place of 'bi'). The 'ateach' statement helps a bit by batching
all remote activities for a place into a single message, so the actual
activity structure is:
finish for (place in D.places()) at (place) async {
    for (b in D|here) async {
        bi: Body = bodies(b);
        finish for (place in D.places()) at (place) async {
            for (p in D|here) async {
From this it is apparent that for each body, (P-1) remote messages
will be sent, for a total of N*(P-1) messages per force calculation.
For example, with N = 10000 bodies on P = 10 places, that is already
90,000 messages per step.
I would suggest two possible changes to improve performance:
1. Reduce the number of remote activities. Consider copying all
bodies from one place to another in a single message, rather than one
at a time. This will improve scaling to larger numbers of places.
2. Reduce the total number of activities. Currently each activity is
very small, on the order of 20 floating-point instructions. The
performance tuning guide mentions "In the fork-join based runtime,
each async created results in somewhere around 50 or 100 machine
instructions of overhead (allocation of a task object, pushing the
task object onto a deque, and so forth)."
http://x10-lang.org/documentation/practical-x10-programming/performance-tuning.html
You might choose to only create N activities (one for each body), each
of which calculates all (N-1) interactions with other bodies.
Assuming that N is still much greater than the available hardware
parallelism, this will improve the baseline performance of the code on
a single place.
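As a rough, self-contained illustration of point 2 (one activity per body
rather than one per interaction), reduced to a single place and a single
coordinate; the names n, mass, posx, velx and dt are placeholders only, and
on older X10 releases you may need Array instead of Rail:

import x10.io.Console;

// Sketch: N activities, each computing all (N-1) interactions for one body.
public class PerBodySketch {
    public static def main(args: Rail[String]) {
        val n = 1000;
        val dt = 0.1;
        val mass = new Rail[Double](n, (i:Long) => 1.0);
        val posx = new Rail[Double](n, (i:Long) => i as Double);
        val velx = new Rail[Double](n);            // zero-initialised
        finish for (i in 0..(n - 1)) async {       // n activities, not n^2
            var ax: Double = 0.0;
            for (j in 0..(n - 1)) {
                if (i == j) continue;
                val dx = posx(i) - posx(j);
                val d2 = dx * dx;
                if (d2 != 0.0) ax -= dx * mass(j) * dt / (d2 * Math.sqrt(d2));
            }
            velx(i) = velx(i) + ax;                // only activity i writes velx(i)
        }
        Console.OUT.println("velx(0) = " + velx(0));
    }
}

Combining this with the bulk copy from point 1 gives the structure used in
the ElectrostaticDirectMethod example linked below.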
With regard to the problem of getting zero for the energy: variables
are copied by value within the context of an 'at' statement (which
includes 'ateach'). Therefore assigning to 'e' within the ateach has
no effect on the original field. For this kind of global reduction, X10 provides a
'collecting finish', which for a sum would look something like:
val e = finish(SumReducer()) {
    ateach(...) {
        var partialEnergy:Double = ...;
        offer partialEnergy;
    }
};

static struct SumReducer implements Reducible[Double] {
    public def zero() = 0.0;
    public operator this(a:Double, b:Double) = (a + b);
}
If you'd like to see an example N-body code that uses all of the
techniques described above, there is a simple force/energy calculation
for non-bonded electrostatic interactions at:
http://sourceforge.net/p/anuchem/code/ci/default/tree/anu-chem/src/au/edu/anu/chem/mm/ElectrostaticDirectMethod.x10
(N.B. this code implements two alternative communication patterns,
which makes it more complicated than it needs to be. Focus on the
'asyncComms' version.)
Cheers,
Josh
------------------------------------------
Josh Milthorpe
Postdoctoral Fellow, Research School of Computer Science
Australian National University, Building 108
Canberra, ACT 0200
Australia
Phone: + 61 (0)2 61254478
Mobile: + 61 (0)407 940743
E-mail: josh.miltho...@anu.edu.au
Web: http://cs.anu.edu.au/~Josh.Milthorpe/
On 18/06/13 04:49, Konstantina Panagiotopoulou wrote:
Hello all,
I am trying to implement a parallel version of the N-body problem.
I have tried compiling with both the Java and C++ backends, with the
compiler optimisations -O and -NO_CHECKS proposed on the webpage, but
even after all of this the sequential version is always faster.
I use distributed arrays of objects with block distribution and a
cluster of 10 nodes for testing. I have tried different combinations
of the number of hosts, NPLACES and NTHREADS, and see no real
difference in performance.
On the other hand, the results seem correct, and from printing here.id
inside the ateach statements I can see that the workload is balanced
over all nodes.
Does the distributed array of objects add that much overhead? I read
somewhere in the documentation that heavily object-oriented
applications do not benefit much from performance tuning... :(
Could someone please take a look at the code in case I am missing
something and give me a hint on improving performance?
Below is the code:
-----------------------------------------------------
//Class UniBlock implements the universe in which the bodies are initialised and interact with each other
import x10.util.Random;
import Body;
import x10.io.Printer;
import x10.io.Console;
import x10.lang.Math;
import x10.array.DistArray;
import x10.util.*;

public class UniBlock {
    public var dimx: Double;
    public var dimy: Double;
    public var numBodies: Int;
    public var R: Region(1);
    public val D: Dist;
    public val bodies: DistArray[Body](D);
    public var px: Double;
    public var py: Double;
    public var pz: Double;
    public var e: Double;

    //Constructor
    public def this(dimx: Double, dimy: Double, numBodies: Int) {
        this.dimx = dimx;
        this.dimy = dimy;
        this.numBodies = numBodies;
        this.R = 1..numBodies;
        this.D = Dist.makeBlock(R);
        this.bodies = DistArray.make[Body](D, (p:Point(D.rank)) => new Body());
        this.px = 0.0;
        this.py = 0.0;
        this.pz = 0.0;
        Console.OUT.println("universe created");
        Console.OUT.println("dimensions x: "+dimx+" y: "+dimy+" bodies: "+numBodies);
    }

    public def printBodies() {
        Console.OUT.println("bodies----------------------");
        finish ateach(d in bodies) {
            bodies(d).printBody();
        }
    }
    //The intensive parallel calculation is performed here.
    public def Advance(val dt: Double) {
        finish ateach(p in bodies) {
            px += bodies(p).velx * bodies(p).mass;
            py += bodies(p).vely * bodies(p).mass;
            pz += bodies(p).velz * bodies(p).mass;
        }
        finish ateach(b in bodies) {
            bi: Body = bodies(b);
            finish ateach(p in bodies) {
                bj: Body = bodies(p);
                var dx: Double = bi.posx - bj.posx;
                var dy: Double = bi.posy - bj.posy;
                var dz: Double = bi.posz - bj.posz;
                var d2: Double = dx*dx + dy*dy + dz*dz;
                if (d2 != 0.0) {
                    var mag: Double = dt / (d2*Math.sqrt(d2));
                    bi.velx -= dx * bj.mass * mag;
                    bj.velx += dx * bi.mass * mag;
                    bi.vely -= dy * bj.mass * mag;
                    bj.vely += dy * bi.mass * mag;
                    bi.velz -= dz * bj.mass * mag;
                    bj.velz += dz * bi.mass * mag;
                }
            }
        }
        finish ateach(d in bodies) {
            bodies(d).posx += dt * bodies(d).velx;
            bodies(d).posy += dt * bodies(d).vely;
            bodies(d).posz += dt * bodies(d).velz;
        }
    }
    //Calculation of the energy produced
    public def Energy(): Double {
        finish ateach(d in bodies) {
            val bi: Body = bodies(d);
            e += 0.5 * bi.mass * (bi.velx*bi.velx + bi.vely*bi.vely + bi.velz*bi.velz);
            finish ateach(b in bodies) {
                val bj: Body = bodies(b);
                val dx: Double = bi.posx - bj.posx;
                val dy: Double = bi.posy - bj.posy;
                val dz: Double = bi.posz - bj.posz;
                e -= (bi.mass*bj.mass) / Math.sqrt(dx*dx + dy*dy + dz*dz);
            }
        }
        return e;
    }
}
------------------------------------------------------------------------------
//GenBlock is a generator class to initialise and call functions of the UniBlock class
import UniBlock;
import Body;
import x10.util.Timer;
import x10.io.Printer;
import x10.io.Console;
import x10.util.*;
class GenBlock {
    var bodies: Int;
    static val x: Double = 3.5;
    static val y: Double = 4.2;
    static val dt: Double = 0.1;
    static val numIterations: Int = 3;

    public def run(bodies: Int) {
        val uni = new UniBlock(x, y, bodies);
        Console.OUT.println("==== Initial positions ====");
        uni.printBodies();
        var e: Double = 1.1;
        val t = new Timer();
        val start: Long = t.milliTime();
        for (var i: Int = 1; i <= numIterations; i++) {
            uni.Advance(dt);
            e = uni.Energy();
        }
        val stop: Long = t.milliTime();
        Console.OUT.println("Total energy : "+e);
        Console.OUT.println("==== Final positions ====");
        uni.printBodies();
        val total: Long = stop - start;
        Console.OUT.println("");
        Console.OUT.println("Time elapsed: "+((total as Double)/1e9));
    }

    public static def main(args: Rail[String]) {
        val bod = Int.parse(args(0));
        new GenBlock().run(bod);
    }
}
--------------------
Finally, the Body class is trivial. It consists of the Body constructor:
def this(mass: Double, posx: Double, posy: Double, posz: Double,
velx: Double, vely: Double, velz: Double)
and some get/set and print methods.
-----------------------------
As you can see, I call the methods Advance and Energy (here for 3
iterations) and I am timing only this part.
Example average time measurements:
#bodies = 1000    seq: 3.1E-7      par: 6.299E-6
#bodies = 10000   seq: 2.0492E-5   par: 2.5103E-4
#bodies = 12000   seq: 2.8669E-5   par: 3.48155E-4
Another question is for the energy function:
Currently I use public var e: Double; which gives me a final result of
0.0, although when printing all intermediate values I can see that
the calculation is done properly.
When I try to declare it inside the function body I get a warning that
it needs to be a pre-initialised val because it is accessed in many
places. But then again, it gives a zero result.
How should I handle this?
Regards,
Konstantina