Adding to what Josh said in (1): If you really need to calculate all N^2 forces, then you are probably better off simply replicating the N bodies at each place. At the beginning of each round, each place should broadcast the current values of the N/P bodies it owns to all the others.

After this, each place can perform its calculations locally and update the positions of the N/P bodies it owns. Now you get to choose whether you want two-level parallelism, i.e. whether you should have multiple workers per place, each worker responsible for updating N/(P*W) bodies, or a single worker per place. The total amount of communication will be less if you have fewer places.

So the only advantage of running over many places is to get more cores to attack the problem. You should be able to use one of the collective libraries to do the all-to-all exchange.
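In code, the round structure might look something like this (a sketch only: 'packOwnedBodies', 'mergeIntoReplica' and 'updateOwnedBodies' are made-up helper names, dt is the time step, and a Team collective could replace the hand-written loop over destinations):

    finish for (place in Place.places()) at (place) async {
        // pack the N/P bodies owned here into one value; vals captured by an
        // 'at' are copied along with it, so each destination gets one message
        val owned = packOwnedBodies();                 // made-up helper
        for (dest in Place.places()) {
            if (dest != here) at (dest) async {
                mergeIntoReplica(owned);               // made-up helper
            }
        }
    }
    // every place now holds a copy of all N bodies, so the force update is
    // purely local; with W workers per place this is the two-level version
    finish for (place in Place.places()) at (place) async {
        val W = 4;                                     // e.g. one worker per core
        finish for (w in 0..(W-1)) async {
            // made-up helper: worker w updates every W-th locally owned body
            updateOwnedBodies(w, W, dt);
        }
    }

That is P-1 messages per place per round, so P*(P-1) in total, independent of N.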

With these changes you should be able to get good speedups, roughly linear except for the cost of the broadcasts. You may also need to increase N: you are not going to get speedups by parallelizing across nodes if your total sequential time is measured in microseconds (you didn't give the units for your performance numbers).

On 6/17/13 8:18 PM, Josh Milthorpe wrote:
Hi Konstantina,

the core of the problem seems to be this:

    finish ateach(b in bodies){
        bi: Body = bodies(b);
        finish ateach(p in bodies){

This will create N^2 activities in the inner loop, of which (N^2 * (P-1) / P) will execute remotely (i.e. at a different place to the home place of 'bi'). The 'ateach' statement helps a bit by batching all remote activities for a place into a single message, so the actual activity structure is:

    finish for (place in D.places()) at (place) async {
        for (b in D|here) async {
            bi: Body = bodies(b);
            finish for (place in D.places()) at (place) async {
                for (p in D|here) async {

From this it is apparent that for each body, (P-1) remote messages will be sent, for a total of N*(P-1) messages per call to Advance. For example, with N = 10,000 bodies and P = 10 places, that is 90,000 messages for a single force calculation.

I would suggest two possible changes to improve performance:

1. Reduce the number of remote activities. Consider copying all bodies from one place to another in a single message, rather than one at a time. This will improve scaling to larger numbers of places (see the sketch after point 2 below).

2. Reduce the total number of activities. Currently each activity is very small, on the order of 20 floating-point instructions. The performance tuning guide mentions "In the fork-join based runtime, each async created results in somewhere around 50 or 100 machine instructions of overhead (allocation of a task object, pushing the task object onto a deque, and so forth)."
http://x10-lang.org/documentation/practical-x10-programming/performance-tuning.html
You might choose to only create N activities (one for each body), each of which calculates all (N-1) interactions with other bodies. Assuming that N is still much greater than the available hardware parallelism, this will improve the baseline performance of the code on a single place.
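Putting 1. and 2. together, the force loop might look something like the following sketch, assuming each place has already received a local copy 'allBodies' of every body's mass and position for this time step (that name and the copy mechanism are not from your code; they stand for whatever batched exchange you use for point 1):

    finish for (place in D.places()) at (place) async {
        for (b in D|here) async {            // one activity per locally owned body
            val bi = bodies(b);
            // every interaction for bi is computed against the local copy,
            // so no remote activities are created inside this loop
            for (j in 0..(allBodies.size - 1)) {
                val bj = allBodies(j);
                val dx = bi.posx - bj.posx;
                val dy = bi.posy - bj.posy;
                val dz = bi.posz - bj.posz;
                val d2 = dx*dx + dy*dy + dz*dz;
                if (d2 != 0.0) {
                    val mag = dt / (d2 * Math.sqrt(d2));
                    // each activity writes only to the body it owns; the
                    // symmetric update to bj happens in bj's own activity
                    bi.velx -= dx * bj.mass * mag;
                    bi.vely -= dy * bj.mass * mag;
                    bi.velz -= dz * bj.mass * mag;
                }
            }
        }
    }

This reduces the number of activities from N^2 to N, and the communication to one batched message per pair of places per time step.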

With regard to the problem of getting zero for the energy: variables are copied by value within the context of an 'at' statement (which includes 'ateach'). Therefore assigning to 'e' within the ateach updates a copy at the remote place and has no effect on the original. For this kind of global reduction, X10 provides a 'collecting finish', which for a sum would look something like:

    val e = finish(SumReducer()) {
        ateach(...) {
            var partialEnergy:Double = ...;
            offer partialEnergy;
        }
    };

    static struct SumReducer implements Reducible[Double] {
        public def zero() = 0.0;
        public operator this(a:Double, b:Double) = (a + b);
    }

If you'd like to see an example N-body code that uses all of the techniques described above, there is a simple force/energy calculation for non-bonded electrostatic interactions at:

http://sourceforge.net/p/anuchem/code/ci/default/tree/anu-chem/src/au/edu/anu/chem/mm/ElectrostaticDirectMethod.x10

(N.B. this code implements two alternative communication patterns, which makes it more complicated than it needs to be. Focus on the 'asyncComms' version.)

Cheers,

Josh
------------------------------------------
Josh Milthorpe

Postdoctoral Fellow, Research School of Computer Science
Australian National University, Building 108
Canberra, ACT 0200
Australia
Phone:  + 61 (0)2 61254478
Mobile: + 61 (0)407 940743
E-mail: josh.miltho...@anu.edu.au
Web: http://cs.anu.edu.au/~Josh.Milthorpe/


On 18/06/13 04:49, Konstantina Panagiotopoulou wrote:
Hello all,

I am trying to implement a parallel version of the N body problem.

I have tried compiling with both the Java and C++ backends and with the compiler optimisations -O -NO_CHECKS proposed on the webpage, but even after all of this the sequential version is always faster.

I use distributed arrays of objects with a block distribution and a cluster of 10 nodes for testing. I have tried different combinations of the number of hosts, NPLACES and NTHREADS, and there is no real difference in performance. On the other hand, the results seem correct, and from printing here.id's inside the ateach statements I can see that the workload is balanced over all nodes.

Does the distributed array of objects really add that much overhead? I read somewhere in the documentation that heavily object-oriented applications do not benefit much from performance tuning ... :(


Could someone please take a look at the code in case I am missing something, and give me a hint on how to improve performance?

Below is the code:

-----------------------------------------------------
//Class UniBlock implements the universe in which the bodies are initialised and interact with each other
import x10.util.Random;
import Body;
import x10.io.Printer;
import x10.io.Console;
import x10.lang.Math;
import x10.array.DistArray;
import x10.util.*;

public class UniBlock {

    public var dimx: double;
    public var dimy: double;
    public var numBodies: Int;

    public var R: Region(1);
    public val D: Dist;
    public val bodies: DistArray[Body](D);
    public var px: double;
    public var py: double;
    public var pz: double;

    public var e: double;

    //Constructor
    public def this(dimx: double, dimy: double, numBodies: Int) {
        this.dimx = dimx;
        this.dimy = dimy;
        this.numBodies = numBodies;

        this.R = 1..numBodies;
        this.D = Dist.makeBlock(R);

        this.bodies = DistArray.make[Body](D, (p:Point(D.rank)) => new Body());

        this.px = 0.0;
        this.py = 0.0;
        this.pz = 0.0;

        Console.OUT.println("universe created");
        Console.OUT.println("dimensions x: "+dimx+" y: "+dimy+" bodies: "+numBodies);
    }

    public def printBodies() {
        Console.OUT.println("bodies----------------------");

        finish ateach(d in bodies) {
            bodies(d).printBody();
        }
    }

    //The intensive parallel calculation is performed here.
    public def Advance(val dt: double) {
        finish ateach(p in bodies) {
            px += bodies(p).velx * bodies(p).mass;
            py += bodies(p).vely * bodies(p).mass;
            pz += bodies(p).velz * bodies(p).mass;
        }

        finish ateach(b in bodies) {
            bi: Body = bodies(b);
            finish ateach(p in bodies) {
                bj: Body = bodies(p);
                var dx: double = bi.posx - bj.posx;
                var dy: double = bi.posy - bj.posy;
                var dz: double = bi.posz - bj.posz;
                var d2: double = dx*dx + dy*dy + dz*dz;
                if (d2 != 0.0) {
                    var mag: double = dt / (d2*Math.sqrt(d2));

                    bi.velx -= dx * bj.mass * mag;
                    bj.velx += dx * bi.mass * mag;
                    bi.vely -= dy * bj.mass * mag;
                    bj.vely += dy * bi.mass * mag;

                    bi.velz -= dz * bj.mass * mag;
                    bj.velz += dz * bi.mass * mag;
                }
            }
        }

        finish ateach(d in bodies) {
            bodies(d).posx += dt * bodies(d).velx;
            bodies(d).posy += dt * bodies(d).vely;
            bodies(d).posz += dt * bodies(d).velz;
        }
    }


    //Calculation of the energy produced
    public def Energy(): double {
        finish ateach(d in bodies) {
            val bi: Body = bodies(d);
            e += 0.5 * bi.mass * (bi.velx*bi.velx + bi.vely*bi.vely + bi.velz*bi.velz);

            finish ateach(b in bodies) {
                val bj: Body = bodies(b);

                val dx: double = bi.posx - bj.posx;
                val dy: double = bi.posy - bj.posy;
                val dz: double = bi.posz - bj.posz;
                e -= (bi.mass*bj.mass) / Math.sqrt(dx*dx + dy*dy + dz*dz);
            }
        }
        return e;
    }
}

------------------------------------------------------------------------------
//GenBlock is a generator class to initialise and call functions of the UniBlock class

import UniBlock;
import Body;

import x10.util.Timer;
import x10.io.Printer;
import x10.io.Console;
import x10.util.*;

class GenBlock {

    var bodies: Int;
    static val x: double = 3.5;
    static val y: double = 4.2;

    static val dt: double = 0.1;

    static val numIterations: Int = 3;

    public def run(bodies: Int) {
        val uni = new UniBlock(x, y, bodies);
        Console.OUT.println("==== Initial positions ====");
        uni.printBodies();
        var e: double = 1.1;
        val t = new Timer();
        val start: long = t.milliTime();
        for (var i: Int = 1; i <= numIterations; i++) {
            uni.Advance(dt);
            e = uni.Energy();
        }
        val stop: long = t.milliTime();
        Console.OUT.println("Total energy : "+e);
        Console.OUT.println("==== Final positions ====");
        uni.printBodies();

        val total: long = stop - start;
        Console.OUT.println("");
        Console.OUT.println("Time elapsed: "+((total as double)/1e9));
    }

    public static def main(args: Rail[String]) {
        val bod = Int.parse(args(0));
        new GenBlock().run(bod);
    }
}

--------------------
Finally, the Body class is trivial. It comprises the Body constructor:

def this(mass: double, posx: double, posy: double, posz: double, velx: double, vely: double, velz : double)

and some get/set and print methods.
-----------------------------
As you can see, I call the methods Advance and Energy (here for 3 iterations) and I am timing only this part.
Example average time measurements:

#bodies = 1000   seq: 3.1E-7     par: 6.299E-6
#bodies = 10000  seq: 2.0492E-5  par: 2.5103E-4
#bodies = 12000  seq: 2.8669E-5  par: 3.48155E-4



Another question is about the energy function:
Currently I use 'public var e: double;', which gives me a final result of 0.0, although when printing all the intermediate values I can see that the calculation is done properly. When I try to declare it inside the function body instead, I get a warning that it needs to be a pre-initialized val because it is accessed at multiple places. But then again, it gives a zero result.
How should I handle this?

Regards,
Konstantina


