Hi Marko, I suspect the source of the error with X10_NTHREADS>1 is that you are integrating the equations of motion (the loop over k) for each particle immediately after calculating the forces (the loop over j), but there is no barrier between the two phases. Therefore individual particle positions may be updated before they are used to calculate the forces on other particles.
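The effect is easy to reproduce even sequentially: if each body integrates its position as soon as its own force is known, later bodies see a mixture of old and new positions. Here is a minimal sketch (in Java rather than X10, since the ordering issue is language-independent; the class and method names, the 1-D force law, and the two-body setup are all mine, not from your program) comparing the fused loop against a properly phased version:

```java
// Demonstrates why positions must not be updated before ALL forces
// have been computed. Two point masses on a line, one Euler step.
public class PhaseRace {
    static final double DT = 0.1;

    // Correct: phase 1 computes every acceleration from the OLD
    // positions, phase 2 integrates.
    static double[] twoPhase(double[] pos, double[] vel, double[] mass) {
        int n = pos.length;
        double[] acc = new double[n];
        for (int i = 0; i < n; i++)            // phase 1: forces
            for (int j = 0; j < n; j++)
                if (i != j) {
                    double dr = pos[j] - pos[i];
                    acc[i] += mass[j] * Math.signum(dr) / (dr * dr);
                }
        double[] newPos = pos.clone();
        for (int i = 0; i < n; i++)            // phase 2: integrate
            newPos[i] += vel[i] * DT + 0.5 * acc[i] * DT * DT;
        return newPos;
    }

    // Buggy: body i is moved immediately, so body i+1 computes its
    // force from an already-updated position.
    static double[] fused(double[] pos, double[] vel, double[] mass) {
        int n = pos.length;
        double[] p = pos.clone();
        for (int i = 0; i < n; i++) {
            double acc = 0;
            for (int j = 0; j < n; j++)
                if (i != j) {
                    double dr = p[j] - p[i];   // may read a NEW position
                    acc += mass[j] * Math.signum(dr) / (dr * dr);
                }
            p[i] += vel[i] * DT + 0.5 * acc * DT * DT;
        }
        return p;
    }

    public static void main(String[] args) {
        double[] pos = {0.0, 1.0}, vel = {0.0, 0.0}, mass = {1.0, 1.0};
        double[] good = twoPhase(pos, vel, mass);
        double[] bad  = fused(pos, vel, mass);
        // Body 0 steps identically in both versions, but body 1 already
        // sees body 0's new position, so its step disagrees.
        System.out.println(good[1] == bad[1]);  // prints false
    }
}
```

With X10_NTHREADS=1 you were effectively getting a fixed interleaving that happened to look right; with more threads the interleaving varies, which is why the wrong answers only appeared then.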
Probably the simplest fix is to restructure into two separate phases:

    finish ateach (p in dist) {
        // compute forces
    }
    finish ateach (p in dist) {
        // update positions and velocities
    }

A design point: your program currently broadcasts all particle data to each place, and gathers the updated data back to place 0 one particle at a time, in every phase. It would be much more efficient to store the particle data permanently in DistArrays and update it at each place. For the force calculations, only position and mass are needed, so there is no reason to transfer velocities and accelerations between places.

A minor performance improvement: where array variables are one-dimensional and zero-based (as yours are), they should be declared as Rail[T] rather than Array[T](1). This allows the compiler to generate much more efficient indexing code.

Cheers,
Josh

Benjamin W Herta wrote:
> You may find this document useful.
>
> (See attached file: X10PerformanceModel.pdf)
>
> - Ben
>
> From: Marko Kobal <marko.ko...@arctur.si>
> To: Benjamin W Herta/Fishkill/IBM@IBMUS
> Cc: "x10-users@lists.sourceforge.net" <x10-users@lists.sourceforge.net>
> Date: 08/10/2011 13:25
> Subject: RE: [X10-users] running on multiple places (cluster) with sockets RT implementation
>
> Hi,
>
> Ben, thanks for the hints.
>
> When X10_HOSTFILE (or X10_HOSTLIST) is populated with one entry per core (that is, in my case, 12 entries per node), the processes are properly distributed among the nodes.
>
> I've also tested the MPI version, where MPI is configured to run over InfiniBand, and it runs significantly faster (thanks to InfiniBand's low latency).
>
> I've also tried using X10_NTHREADS > 1, and yes, I get the results faster; however, the results are not correct. I guess I'll have to look into the code... I've put the code of the N-body problem (http://en.wikipedia.org/wiki/N-body_problem) below (the code is from my colleague...) ...
> if somebody finds any error in the code regarding the usage of async, I'll be very glad ;) !
>
> import x10.io.Console;
> import x10.io.File;
> import x10.io.FileReader;
> import x10.io.FileWriter;
> import x10.util.Random;
> import x10.compiler.Native;
>
> public class nbody {
>     public static def main(args : Array[String](1)) {
>         val fileName = args.size > 0 ? args(0) : "problem1";
>
>         var nTimeSteps : Int;
>         var eps : Double;
>         var tol : Double;
>         var bodyMass : Array[Double](1);
>         var bodyPos : Array[Double](1);
>         var bodyVel : Array[Double](1);
>         var bodyAcc : Array[Double](1);
>
>         ////////// READ FILE //////////
>         var f : File = new File(fileName);
>         val reader = f.openRead();
>
>         val nBodies = Int.parse(reader.readLine().trim());
>         nTimeSteps = Int.parse(reader.readLine().trim());
>         val dTime = Double.parse(reader.readLine().trim());
>         eps = Double.parse(reader.readLine().trim());
>         val epssq = eps*eps;
>         tol = Double.parse(reader.readLine().trim());
>
>         Console.OUT.println(nBodies);
>
>         bodyMass = new Array[Double](nBodies);
>         bodyPos = new Array[Double](nBodies*3);
>         bodyVel = new Array[Double](nBodies*3);
>         bodyAcc = new Array[Double](nBodies*3, 0.0);
>
>         for (var i : Int = 0; i < nBodies; i++) {
>             val line = reader.readLine().trim();
>             val value = line.split(" ");
>             bodyMass(i)    = Double.parse(value(0).toString());
>             bodyPos(3*i+0) = Double.parse(value(1).toString());
>             bodyPos(3*i+1) = Double.parse(value(2).toString());
>             bodyPos(3*i+2) = Double.parse(value(3).toString());
>             bodyVel(3*i+0) = Double.parse(value(4).toString());
>             bodyVel(3*i+1) = Double.parse(value(5).toString());
>             bodyVel(3*i+2) = Double.parse(value(6).toString());
>         }
>
>         if (reader != null) reader.close();
>         ////////// END READ FILE //////////
>
>         val dist = Dist.makeBlock(bodyMass.region);
>         val remBodyPos = new RemoteArray(bodyPos);
>         val remBodyVel = new RemoteArray(bodyVel);
>         val remBodyAcc = new RemoteArray(bodyAcc);
>
>         val bodyPoss = bodyPos;
>         val bodyVell = bodyVel;
>         val bodyAccc = bodyAcc;
>         val bodyMasss = bodyMass;
>
>         for (var step : Int = 0; step < nTimeSteps; step++) {
>             finish ateach (p in dist) {
>                 //Console.OUT.println("Befor Place:" + here.id + " Item:"+p+" Value:" + bodyPoss(p));
>                 //finish async { //computeForce(p);
>                 var dr : Array[Double](1) = new Array[Double](3);
>                 var drsq : Double;
>                 var idr : Double;
>                 var scale : Double;
>
>                 for (var k : Int = 0; k < 3; k++) {
>                     bodyAccc(3*p+k) = 0.0;
>                 }
>
>                 finish for (var j : Int = 0; j < nBodies; j++) {
>                     val jj = j;
>                     async {
>                         for (var k : Int = 0; k < 3; k++) {
>                             dr(k) = bodyPoss(3*jj+k) - bodyPoss(3*p+k);
>                         }
>                         drsq = dr(0)*dr(0) + dr(1)*dr(1) + dr(2)*dr(2) + epssq;
>                         idr = 1/Math.sqrt(drsq);
>                         scale = bodyMasss(jj)*idr*idr*idr;
>                         for (var k : Int = 0; k < 3; k++) {
>                             bodyAccc(3*p+k) += scale*dr(k);
>                         }
>                     }
>                 }
>                 //}
>                 //finish async { //advanceBody(p)
>                 for (var k : Int = 0; k < 3; k++) {
>                     bodyPoss(3*p+k) += bodyVell(3*p+k)*dTime + 0.5*bodyAccc(3*p+k)*dTime*dTime;
>                     bodyVell(3*p+k) += bodyAccc(3*p+k)*dTime;
>                 }
>                 //}
>
>                 //Console.OUT.println("After Place:" + here.id + " Item:"+p+" Value:" + bodyPoss(p));
>                 finish Array.asyncCopy(bodyPoss, 3*p, remBodyPos, 3*p, 3);
>                 finish Array.asyncCopy(bodyVell, 3*p, remBodyVel, 3*p, 3);
>                 finish Array.asyncCopy(bodyAccc, 3*p, remBodyAcc, 3*p, 3);
>                 //Console.OUT.println("aCopy Place:" + here.id + " Item:"+p+" Value:" + bodyPoss(p));
>             }
>             //for (var i : Int = 0; i < bodyPos.size; i++) Console.OUT.print(bodyPos(i)+", ");
>             //Console.OUT.println();
>         }
>
>         //// WRITE OUTPUT //
>         f = new File(fileName + ".out");
>         val writer = f.printer();
>
>         try {
>             for (var i : Int = 0; i < nBodies; i++) {
>                 writer.printf("%+1.6E\t%+1.6E\t%+1.6E\n", bodyPos(3*i+0), bodyPos(3*i+1), bodyPos(3*i+2));
>             }
>             writer.flush();
>         }
>         catch (ioe : x10.io.IOException) {
>             Console.OUT.println("IO error");
>         }
>         finally {
>             if (writer != null) writer.close();
>         }
>         Console.OUT.println("-------------");
>     }
> }
>
> Kind regards, Marko Kobal
>
> -----Original Message-----
> From: Benjamin W Herta [mailto:bhe...@us.ibm.com]
> Sent: Tuesday, August 09, 2011 7:23 PM
> To: Marko Kobal
> Cc: x10-users@lists.sourceforge.net
> Subject: Re: [X10-users] running on multiple places (cluster) with sockets RT implementation
>
> Hi Marko - unfortunately, attachments are also removed on this mailing list. You can send an email directly to me if you would like me to look at it.
>
> You can specify which host each place is located on most easily by using X10_HOSTFILE instead of X10_HOSTLIST. If you are running with 48 places, then you can create a text file with 48 lines, with a hostname on each line. Place 0 will run on the host specified on the first line, place 1 on the second line, and so on. This can also be done with a hostlist, but it makes for a very long command line.
>
> Both the hostfile and hostlist wrap, so when you specify only 4 nodes as per your email below, you should be getting 12 places on each (node001 should have places 0, 4, 8, 12, etc.).
>
> Depending on your program, you may get better performance running with 4 places instead of 48, using async to increase the parallelism within each place. You may also want to explicitly set the X10_NTHREADS environment variable to 1 if you're using 48 places, or 12 if using 4 places. Others on this mailing list may have additional comments on this.
>
> - Ben
>
> > From: Marko Kobal <marko.ko...@arctur.si>
> > To: "x10-users@lists.sourceforge.net" <x10-users@lists.sourceforge.net>
> > Date: 08/09/2011 12:35
> > Subject: Re: [X10-users] running on multiple places (cluster) with sockets RT implementation
> >
> > Hi,
> >
> > Me again ;)
> >
> > I have an example (N-body) written to execute in parallel. It works just fine and scales well on more cores.
> > However, when I try to run it on more than one machine, specifically on 4 nodes, I can see that X10 does not distribute the load properly across the nodes.
> >
> > I have nodes with 2 Intel processors, 6 cores each; that makes 12 cores per node, 48 cores across 4 nodes:
> >
> > export X10_HOSTLIST=node001,node002,node003,node004
> > export X10_NPLACES=48
> >
> > I compiled for the sockets RT implementation:
> >
> > # x10c++ -x10rt sockets -o nbody.parallel.sockets nbody.parallel.x10
> >
> > When I execute the program, I can see that processes are spawned across the 4 nodes; however, the load is not distributed evenly. On some nodes there are more than 12 processes running, on some fewer than 12. This is obviously not good, as some nodes are overloaded (so processing is not optimal) and some are underloaded (which is not what one would wish for). See the screenshot from my monitoring software:
> >
> > (sorry, the picture was embedded, which is obviously not supported by the mailing list; I've put it into an attachment now)
> >
> > The usage for X10Launcher says: X10Launcher [-np NUM_OF_PLACES] [-hostlist HOST1,HOST2,ETC] [-hostfile FILENAME] COMMAND_TO_LAUNCH [ARG1 ARG2 ...], so there is no parameter to set "processes per node". I would expect something similar to the "-perhost" parameter in the MPI world. Is there any way to achieve this with the X10 sockets RT implementation?
> >
> > Thanks for help!
> >
> > Kind regards, Marko Kobal

_______________________________________________
X10-users mailing list
X10-users@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/x10-users