Hmmm, that's a pretty cool way to solve this problem - I'm glad to get an extra pair of eyes on it! I had assumed that the 'at' block would just run the computation at the node in question and was a relatively lightweight operation.
I'll run some numbers on these and see what I come up with. In some scenarios Josh's code is a bit faster, but I'm re-running both versions with X10_NTHREADS=1, since I didn't know how to set it before. I saw in the documentation that it could be set, but wasn't sure whether that was a compile-time or run-time setting or how exactly to do it, so Josh's email on that helped a lot. Thanks for the help, everyone!

On Sat, 2011-02-12 at 16:13 -0500, Dave Cunningham wrote:
> OK, the X10 version had a different communication pattern to the MPI
> version. In particular, it was recreating the ring every iteration, which
> would have created contention at place zero.
>
> It is not necessary to write 'recv' in X10, since X10 does not require
> communication to be written in a 2-sided style like MPI. I'm not sure what
> the barriers were for, since the messages alone should be enough for global
> synchronisation.
>
> It is possible to write it much more elegantly, as shown below. This is
> because the receiver of one message sends the next message, so there is a
> nesting of messages if we are to use the lexically scoped finish / async /
> at constructs. Note that it is necessary to use async at (p) rather than
> just at (p), because otherwise there is a danger of blowing the stack. This
> is because the async decouples the nesting of messages from the nesting of
> stack frames. This is similar to tail call optimisation.
>
> Also note that the current X10 runtime busy-waits, so you should avoid
> running an X10 program such that X10_NPLACES * X10_NTHREADS is greater than
> the total number of cores you have at each host. Since this is not a
> multicore program, you should set X10_NTHREADS=1.
>
> import x10.lang.Math;
> import x10.util.Timer;
>
> public class Ring {
>
>     static val NUM_MESSAGES = 10000;
>
>     // A global data structure with one integer cell per place
>     static A = PlaceLocalHandle.make[Cell[Long]](Dist.makeUnique(),
>         ()=>new Cell[Long](-1));
>
>     public static def send (msg:Long, depth:Int) {
>         A()() = msg;
>         if (depth==0) return;
>         async at (here.next()) send(msg, depth-1);
>     }
>
>     public static def main(args:Array[String](1)) {
>
>         val startTime = Timer.milliTime();
>         finish send(42L, NUM_MESSAGES * Place.MAX_PLACES);
>         val endTime = Timer.milliTime();
>
>         val totalTime = (endTime - startTime) / 1000.0;
>
>         Console.OUT.printf("It took %f seconds\n", totalTime);
>     }
> }
>
> On Fri, Feb 11, 2011 at 10:00 PM, Josh Milthorpe <josh.miltho...@anu.edu.au> wrote:
>
>> Hi Chris,
>>
>> can you confirm that you are using 64 X10 places, running on two nodes,
>> e.g.
>>
>>     mpiexec -n 64 -host host1,host2 a.out
>>
>> What value are you using for X10_NTHREADS? (The default is 2.)
>>
>> Can you get performance results for the X10 version with different
>> numbers of threads, i.e.
>>
>>     mpiexec -x X10_NTHREADS=1 -n 64 ...
>>     mpiexec -x X10_NTHREADS=2 -n 64 ...
>>     mpiexec -x X10_NTHREADS=4 -n 64 ...
>>
>> On our cluster of quad-core Linux x86-64 nodes, running with four
>> threads is approximately 100 times slower than with one!
>>
>> Cheers,
>>
>> Josh
>>
>> On 12/02/11 06:15, Chris Bunch wrote:
>>> Certainly! Here's a link:
>>>
>>> http://pastebin.com/3cHa5M2R
>>>
>>> Thanks for the help! I appreciate it a lot!
>>>
>>> On Fri, 2011-02-11 at 14:05 -0500, Dave Cunningham wrote:
>>>> Can you show us the MPI code so we can verify it's doing the same
>>>> thing as the X10 code?
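Dave's point about async at versus plain at is worth dwelling on: a plain at is a synchronous place shift, so the caller's frame stays live until the remote body returns, and a long chain of hops nests frames across places; the async decouples the two, as he says. A minimal sketch of the contrast (not from the thread - the class and method names are invented for illustration):

---
// Hypothetical names, for illustration only.
public class HopSketch {

    // Plain 'at': the caller blocks until the remote body returns, so each
    // hop adds another frame to a chain of nested place shifts.
    public static def hopBlocking(depth:Int) {
        if (depth == 0) return;
        at (here.next()) hopBlocking(depth - 1);
    }

    // 'async at': the caller returns immediately; the remote activity carries
    // the chain forward, so frame nesting is decoupled from message nesting.
    public static def hopAsync(depth:Int) {
        if (depth == 0) return;
        async at (here.next()) hopAsync(depth - 1);
    }

    public static def main(args:Array[String](1)) {
        finish hopAsync(1000); // finish waits for the whole chain of hops
    }
}
---

Dave's Ring class above uses the second shape, plus a per-place Cell to record the message.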
>>>>
>>>> On Fri, Feb 11, 2011 at 1:07 PM, Chris Bunch <c...@cs.ucsb.edu> wrote:
>>>>
>>>>> Hi Josh,
>>>>> Thanks for the quick response! This code is definitely quite an
>>>>> improvement over the last one, and improves my running time from 10000
>>>>> seconds to 1500 seconds. Unfortunately, it's still much slower than my
>>>>> UPC and MPI codes, which are coming in at 5 seconds. I'm compiling my
>>>>> X10 code with -O -NO_CHECKS. Any other ideas?
>>>>>
>>>>> Thanks again!
>>>>>
>>>>> On Fri, 2011-02-11 at 12:25 +1100, Josh Milthorpe wrote:
>>>>>> Sorry Chris! I just realised my "simplification" means it's no longer
>>>>>> a ring :-)
>>>>>>
>>>>>> The correct code for the main method is:
>>>>>>
>>>>>>     for (var index : Int = 0; index < NUM_MESSAGES; index++) {
>>>>>>         val i = index;
>>>>>>         finish for (p in Place.places()) async at (p) {
>>>>>>             if (p.id == 0) {
>>>>>>                 Ring.send(here.next(), i);
>>>>>>                 Ring.recv(i);
>>>>>>             } else {
>>>>>>                 Ring.recv(i);
>>>>>>                 Ring.send(here.next(), i);
>>>>>>             }
>>>>>>         }
>>>>>>     }
>>>>>>
>>>>>> Josh Milthorpe wrote:
>>>>>>> Hi Chris,
>>>>>>>
>>>>>>> this is a nice test of X10 primitives and communications.
>>>>>>>
>>>>>>> When I profile your code on multiple places on a single computer, I
>>>>>>> see almost all the runtime is spent in "busy waiting" - presumably,
>>>>>>> threads at the receiving node waiting for the sending node to
>>>>>>> complete. There is more information on the busy waiting problem in
>>>>>>> http://jira.codehaus.org/browse/XTENLANG-1012
>>>>>>>
>>>>>>> I'm guessing that the "sleep" is not an essential part of your
>>>>>>> benchmark. If that's right, I would say that this is a perfect test
>>>>>>> for conditional atomic blocks (section 14.7.2 of the language
>>>>>>> specification). You can replace the body of recv(...) with
>>>>>>>
>>>>>>>     when (A(here.id) == value);
>>>>>>>
>>>>>>> This simply waits for the value to be set, and then returns.
>>>>>>>
>>>>>>> Sadly, this won't work by itself. With the current version of X10,
>>>>>>> the blocked thread is never woken up again to check the condition.
>>>>>>> Thus we have a deadlock - see
>>>>>>> http://jira.codehaus.org/browse/XTENLANG-1660 for more information.
>>>>>>>
>>>>>>> There is an easy way to avoid this deadlock using an (unconditional)
>>>>>>> atomic block in send(...) as follows:
>>>>>>>
>>>>>>>     at (target) {
>>>>>>>         atomic A(here.id) = value;
>>>>>>>     }
>>>>>>>
>>>>>>> On exit of this atomic block at the receiving place, the runtime
>>>>>>> checks whether there are other threads waiting, and if so wakes them
>>>>>>> up. So the blocked thread will see that the condition is now true,
>>>>>>> and continue.
>>>>>>>
>>>>>>> These changes improved the performance of your code by over three
>>>>>>> orders of magnitude on my platform. Please let me know whether they
>>>>>>> work for you.
>>>>>>>
>>>>>>> As an aside, you can use the Place.next() method to simplify the code
>>>>>>> dramatically. A full version is below.
>>>>>>>
>>>>>>> Cheers,
>>>>>>>
>>>>>>> Josh
>>>>>>>
>>>>>>> ---
>>>>>>>     public static def send(target:Place, value:Int) {
>>>>>>>         at (target) {
>>>>>>>             atomic A(here.id) = value;
>>>>>>>         }
>>>>>>>     }
>>>>>>>
>>>>>>>     public static def recv(value:Int) {
>>>>>>>         when (A(here.id) == value);
>>>>>>>     }
>>>>>>>
>>>>>>>     public static def main(args:Array[String](1)) {
>>>>>>>         val startTime = Timer.milliTime();
>>>>>>>
>>>>>>>         for (var index : Int = 0; index < NUM_MESSAGES; index++) {
>>>>>>>             val i = index;
>>>>>>>             finish for (p in Place.places()) async at (p) {
>>>>>>>                 Ring.send(here.next(), i);
>>>>>>>                 Ring.recv(i);
>>>>>>>             }
>>>>>>>         }
>>>>>>>
>>>>>>>         val endTime = Timer.milliTime();
>>>>>>>         val totalTime = (endTime - startTime) / 1000.0;
>>>>>>>
>>>>>>>         Console.OUT.printf("It took %f seconds\n", totalTime);
>>>>>>>     }
>>>>>>> ---
>>>>>>>
>>>>>>> Chris Bunch wrote:
>>>>>>>> Hi all,
>>>>>>>> I've been working on a small thread ring benchmark in X10 and have
>>>>>>>> codes written in MPI, UPC, and X10 thus far. Unfortunately, my X10
>>>>>>>> code is considerably slower than the others (two orders of magnitude
>>>>>>>> slower) and I'm not entirely sure why. Essentially, each process just
>>>>>>>> sends a message to the next process, and the final process sends the
>>>>>>>> message to the first process (forming a ring).
>>>>>>>>
>>>>>>>> I'd like to make the code comparable to MPI and UPC and would love a
>>>>>>>> separate pair of eyes to look it over - it's less than 100 lines of
>>>>>>>> code, so it's not that long. I've posted the code here for those who
>>>>>>>> are interested:
>>>>>>>>
>>>>>>>> http://pastebin.com/dYPCwh4G
>>>>>>>>
>>>>>>>> I know everyone is busy, but any help is much appreciated! I know X10
>>>>>>>> can pass around 100 messages between 64 processors on two nodes with
>>>>>>>> the MPI backend faster than 9700 seconds (the MPI version is doing it
>>>>>>>> in 4 seconds), but I'm just not sure what I'm doing wrong.
>>>>>>>>
>>>>>>>> Thanks!
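Josh's when / atomic pairing is the piece that is easiest to get wrong, so here is a minimal single-place sketch of the same hand-off with the ring stripped away. The class, field, and values are invented for illustration; the behaviour it leans on is the one Josh describes - an activity blocked in when is only re-checked when an atomic block completes at that place - subject to the XTENLANG-1660 caveat above.

---
// Illustrative only; not part of the benchmark code.
public class WhenAtomicSketch {

    // One flag cell; -1 means "no message yet".
    static val flag = new Cell[Int](-1);

    public static def main(args:Array[String](1)) {
        finish {
            async {
                // Consumer: block until the flag holds 42. A bare 'when' is
                // only re-evaluated after some atomic block exits here.
                when (flag() == 42);
                Console.OUT.println("received 42");
            }
            // Producer: the atomic block both publishes the value and wakes
            // any activities blocked in 'when' at this place.
            atomic flag() = 42;
        }
    }
}
---

In the ring code, the producer side of this is the remote atomic A(here.id) = value inside send(), and the consumer side is the when inside recv().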