Adding to what Josh said (1): If you really need to calculate the N^2
forces then you are probably better off simply replicating the N bodies
in each place. At the beginning of each round, each place should
broadcast the current values of its N/P bodies to all the others.
After this, each place can perform its calculations locally and update
the position for the N/P bodies it owns. Now you get to choose whether
you want two-level parallelism, i.e., whether you should have multiple
workers per place, each worker responsible for updating N/(P*W) bodies,
or a single worker per place. The total amount of communication will be
less if you have fewer places.
So the only advantage of running over many places is to get more cores
to attack the problem. You should be able to use some of the collective
libraries to do an all-to-all exchange.
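To make that concrete, here is a rough, self-contained sketch of the round
structure. It replicates a single Double per body instead of your Body
objects, and every name in it (ReplicateSketch, owned, replica, NB) is a
placeholder rather than existing API; depending on your X10 version you may
need to swap Rail for Array and adjust Int/Long index types.

import x10.array.DistArray;
import x10.io.Console;

// Sketch: broadcast each place's chunk at the start of a round, then compute locally.
public class ReplicateSketch {
    public static def main(args: Rail[String]) {
        val NB = 1000;                         // total number of bodies
        val nb = NB as Long;
        val R: Region(1) = 1..NB;
        val D = Dist.makeBlock(R);             // each place owns ~NB/P bodies
        val U = Dist.makeUnique();             // one point per place
        // authoritative value for each body, stored at its home place
        val owned = DistArray.make[Double](D, (p:Point(D.rank)) => 0.0);
        // a full copy of all NB values, kept at every place
        val replica = DistArray.make[Rail[Double]](U, (p:Point(U.rank)) => new Rail[Double](nb));
        for (round in 1..3) {
            // 1. broadcast: each place packs its owned chunk into two Rails
            //    and ships it to every place in one message per destination
            finish ateach (pt in U) {
                val local = D | here;          // the region this place owns
                val n = local.size() as Long;
                val idx = new Rail[Long](n);
                val v = new Rail[Double](n);
                var k: Long = 0;
                for (p in local) {
                    idx(k) = (p(0) - 1) as Long;   // region starts at 1
                    v(k) = owned(p);
                    k++;
                }
                finish for (q in Place.places()) at (q) async {
                    val copy = replica(here.id);   // this destination's replica
                    for (var j: Long = 0; j < n; j++) copy(idx(j)) = v(j);
                }
            }
            // 2. compute: purely local; each place updates only the bodies it
            //    owns, reading every interaction partner from its local replica
            finish ateach (pt in U) {
                val copy = replica(here.id);
                for (p in (D | here)) async {
                    val i = (p(0) - 1) as Long;
                    var acc: Double = 0.0;
                    for (var j: Long = 0; j < nb; j++)
                        if (j != i) acc += copy(j);    // stand-in for a force term
                    owned(p) = owned(p) + acc * 1.0e-6; // stand-in for an update
                }
            }
        }
        Console.OUT.println("rounds complete");
    }
}

A collective all-to-all from the library would replace the hand-written
broadcast loop, but the shape of each round stays the same.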
With these changes you should be able to get good speedups: roughly
linear, except for the cost of the broadcasts. You may also need to
increase N; you are not going to get speedups by parallelizing across
nodes if your total sequential time is in microseconds (you didn't
provide the unit in your performance numbers).
On 6/17/13 8:18 PM, Josh Milthorpe wrote:
Hi Konstantina,
the core of the problem seems to be this:
finish ateach(b in bodies){
    bi: Body = bodies(b);
    finish ateach(p in bodies){
This will create N^2 activities in the inner loop, of which (N^2 *
(P-1) / P) will execute remotely (i.e., at a different place from the
home place of 'bi'). The 'ateach' statement helps a bit by batching
all remote activities for a place into a single message, so the actual
activity structure is:
finish for (place in D.places()) at (place) async {
    for (b in D|here) async {
        bi: Body = bodies(b);
        finish for (place in D.places()) at (place) async {
            for (p in D|here) async {
From this it is apparent that for each body, (P-1) remote messages
will be sent, for a total of N*(P-1) messages per force calculation.
For example, with N = 10000 bodies on P = 10 places, that is already
90,000 messages per step.
I would suggest two possible changes to improve performance:
1. Reduce the number of remote activities. Consider copying all
bodies from one place to another in a single message, rather than one
at a time. This will improve scaling to larger numbers of places.
2. Reduce the total number of activities. Currently each activity is
very small, on the order of 20 floating-point instructions. The
performance tuning guide mentions "In the fork-join based runtime,
each async created results in somewhere around 50 or 100 machine
instructions of overhead (allocation of a task object, pushing the
task object onto a deque, and so forth)."
http://x10-lang.org/documentation/practical-x10-programming/performance-tuning.html
You might choose to only create N activities (one for each body), each
of which calculates all (N-1) interactions with other bodies.
Assuming that N is still much greater than the available hardware
parallelism, this will improve the baseline performance of the code on
a single place.
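As a rough, self-contained illustration of point 2 (one activity per body
rather than one per interaction), reduced to a single place and a single
coordinate; the names n, mass, posx, velx and dt are placeholders only, and
on older X10 releases you may need Array instead of Rail:

import x10.io.Console;

// Sketch: N activities, each computing all (N-1) interactions for one body.
public class PerBodySketch {
    public static def main(args: Rail[String]) {
        val n = 1000;
        val dt = 0.1;
        val mass = new Rail[Double](n, (i:Long) => 1.0);
        val posx = new Rail[Double](n, (i:Long) => i as Double);
        val velx = new Rail[Double](n);            // zero-initialised
        finish for (i in 0..(n - 1)) async {       // n activities, not n^2
            var ax: Double = 0.0;
            for (j in 0..(n - 1)) {
                if (i == j) continue;
                val dx = posx(i) - posx(j);
                val d2 = dx * dx;
                if (d2 != 0.0) ax -= dx * mass(j) * dt / (d2 * Math.sqrt(d2));
            }
            velx(i) = velx(i) + ax;                // only activity i writes velx(i)
        }
        Console.OUT.println("velx(0) = " + velx(0));
    }
}

Combining this with the bulk copy from point 1 gives the structure used in
the ElectrostaticDirectMethod example linked below.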
With regard to the problem of getting zero for the energy: variables
are copied by value within the context of an 'at' statement (which
includes 'ateach'). Therefore assigning to 'e' within the ateach has
no effect on the original field. For this kind of global reduction, X10 provides a
'collecting finish', which for a sum would look something like:
val e = finish(SumReducer()) {
    ateach(...) {
        var partialEnergy:Double = ...;
        offer partialEnergy;
    }
};

static struct SumReducer implements Reducible[Double] {
    public def zero() = 0.0;
    public operator this(a:Double, b:Double) = (a + b);
}
If you'd like to see an example N-body code that uses all of the
techniques described above, there is a simple force/energy calculation
for non-bonded electrostatic interactions at:
http://sourceforge.net/p/anuchem/code/ci/default/tree/anu-chem/src/au/edu/anu/chem/mm/ElectrostaticDirectMethod.x10
(N.B. this code implements two alternative communication patterns,
which makes it more complicated than it needs to be. Focus on the
'asyncComms' version.)
Cheers,
Josh
------------------------------------------
Josh Milthorpe
Postdoctoral Fellow, Research School of Computer Science
Australian National University, Building 108
Canberra, ACT 0200
Australia
Phone: + 61 (0)2 61254478
Mobile: + 61 (0)407 940743
E-mail: josh.miltho...@anu.edu.au
Web: http://cs.anu.edu.au/~Josh.Milthorpe/
On 18/06/13 04:49, Konstantina Panagiotopoulou wrote:
Hello all,
I am trying to implement a parallel version of the N-body problem.
I have tried compiling with both the Java and C++ backends, with the
compiler optimisations -O and -NO_CHECKS proposed on the webpage, but
even after all of this the sequential version is always faster.
I use distributed arrays of objects with block distribution and a
cluster of 10 nodes for testing. I have tried different combinations
of the number of hosts, NPLACES and NTHREADS, and see no real
difference in performance.
On the other hand, the results seem correct, and from printing here.id
inside the ateach statements I can see that the workload is balanced
over all nodes.
Does the distributed array of objects add that much overhead? I read
somewhere in the documentation that heavily object-oriented
applications do not benefit much from performance tuning... :(
Could someone please take a look at the code in case I am missing
something and give me a hint on improving performance?
Below is the code:
-----------------------------------------------------
//Class UniBlock implements the universe in which the bodies are initialised and interact with each other
import x10.util.Random;
import Body;
import x10.io.Printer;
import x10.io.Console;
import x10.lang.Math;
import x10.array.DistArray;
import x10.util.*;

public class UniBlock {
    public var dimx: Double;
    public var dimy: Double;
    public var numBodies: Int;
    public var R: Region(1);
    public val D: Dist;
    public val bodies: DistArray[Body](D);
    public var px: Double;
    public var py: Double;
    public var pz: Double;
    public var e: Double;

    //Constructor
    public def this(dimx: Double, dimy: Double, numBodies: Int) {
        this.dimx = dimx;
        this.dimy = dimy;
        this.numBodies = numBodies;
        this.R = 1..numBodies;
        this.D = Dist.makeBlock(R);
        this.bodies = DistArray.make[Body](D, (p:Point(D.rank)) => new Body());
        this.px = 0.0;
        this.py = 0.0;
        this.pz = 0.0;
        Console.OUT.println("universe created");
        Console.OUT.println("dimensions x: "+dimx+" y: "+dimy+" bodies: "+numBodies);
    }

    public def printBodies() {
        Console.OUT.println("bodies----------------------");
        finish ateach(d in bodies) {
            bodies(d).printBody();
        }
    }
    //The intensive parallel calculation is performed here.
    public def Advance(val dt: Double) {
        finish ateach(p in bodies) {
            px += bodies(p).velx * bodies(p).mass;
            py += bodies(p).vely * bodies(p).mass;
            pz += bodies(p).velz * bodies(p).mass;
        }
        finish ateach(b in bodies) {
            bi: Body = bodies(b);
            finish ateach(p in bodies) {
                bj: Body = bodies(p);
                var dx: Double = bi.posx - bj.posx;
                var dy: Double = bi.posy - bj.posy;
                var dz: Double = bi.posz - bj.posz;
                var d2: Double = dx*dx + dy*dy + dz*dz;
                if (d2 != 0.0) {
                    var mag: Double = dt / (d2*Math.sqrt(d2));
                    bi.velx -= dx * bj.mass * mag;
                    bj.velx += dx * bi.mass * mag;
                    bi.vely -= dy * bj.mass * mag;
                    bj.vely += dy * bi.mass * mag;
                    bi.velz -= dz * bj.mass * mag;
                    bj.velz += dz * bi.mass * mag;
                }
            }
        }
        finish ateach(d in bodies) {
            bodies(d).posx += dt * bodies(d).velx;
            bodies(d).posy += dt * bodies(d).vely;
            bodies(d).posz += dt * bodies(d).velz;
        }
    }
    //Calculation of the energy produced
    public def Energy(): Double {
        finish ateach(d in bodies) {
            val bi: Body = bodies(d);
            e += 0.5 * bi.mass * (bi.velx*bi.velx + bi.vely*bi.vely + bi.velz*bi.velz);
            finish ateach(b in bodies) {
                val bj: Body = bodies(b);
                val dx: Double = bi.posx - bj.posx;
                val dy: Double = bi.posy - bj.posy;
                val dz: Double = bi.posz - bj.posz;
                e -= (bi.mass*bj.mass) / Math.sqrt(dx*dx + dy*dy + dz*dz);
            }
        }
        return e;
    }
}
------------------------------------------------------------------------------
//GenBlock is a generator class to initialise and call functions of the UniBlock class
import UniBlock;
import Body;
import x10.util.Timer;
import x10.io.Printer;
import x10.io.Console;
import x10.util.*;
class GenBlock {
    var bodies: Int;
    static val x: Double = 3.5;
    static val y: Double = 4.2;
    static val dt: Double = 0.1;
    static val numIterations: Int = 3;

    public def run(bodies: Int) {
        val uni = new UniBlock(x, y, bodies);
        Console.OUT.println("==== Initial positions ====");
        uni.printBodies();
        var e: Double = 1.1;
        val t = new Timer();
        val start: Long = t.milliTime();
        for (var i: Int = 1; i <= numIterations; i++) {
            uni.Advance(dt);
            e = uni.Energy();
        }
        val stop: Long = t.milliTime();
        Console.OUT.println("Total energy : "+e);
        Console.OUT.println("==== Final positions ====");
        uni.printBodies();
        val total: Long = stop - start;
        Console.OUT.println("");
        Console.OUT.println("Time elapsed: "+((total as Double)/1e9));
    }

    public static def main(args: Rail[String]) {
        val bod = Int.parse(args(0));
        new GenBlock().run(bod);
    }
}
--------------------
Finally, the Body class is trivial. It consists of the Body constructor:
def this(mass: Double, posx: Double, posy: Double, posz: Double,
velx: Double, vely: Double, velz: Double)
and some get/set and print methods.
-----------------------------
As you can see, I call the methods Advance and Energy (here for 3
iterations) and I am timing only this part.
Example average time measurements:
#bodies = 1000    seq: 3.1E-7      par: 6.299E-6
#bodies = 10000   seq: 2.0492E-5   par: 2.5103E-4
#bodies = 12000   seq: 2.8669E-5   par: 3.48155E-4
Another question is for the energy function:
Currently I use public var e: Double; which gives me a final result of
0.0, although when printing all intermediate values I can see that
the calculation is done properly.
When I try to declare it inside the function body I get a warning that
it needs to be a pre-initialised val because it is accessed in many
places. But then again, it gives a zero result.
How should I handle this?
Regards,
Konstantina