Hi Marco,

A few questions and suggestions about your benchmark code:

Why measure the time over so few runs?  With the number of runs in the
attached output (500), the total execution time for the timed portion is
around 0.05-0.1 seconds.  This is probably too short a run to get a
reliable measurement.  I would suggest aiming for an execution time of at
least 0.5 seconds (10000 runs or so) - maybe longer if the network is
involved.
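
As a rough way to pick RUNS, the arithmetic can be sketched as below. This is only an illustration in Java, not part of Test.x10; the class and method names are made up, and it assumes the per-run cost seen in a short trial stays constant for longer runs.

```java
public class RunCountEstimate {
    /**
     * Smallest number of runs whose total time reaches targetNanos,
     * assuming the average per-run cost of a short trial stays constant.
     * Assumes trialNanos >= trialRuns, so the per-run cost is at least 1 ns.
     */
    static long runsForTarget(long trialNanos, long trialRuns, long targetNanos) {
        long perRunNanos = trialNanos / trialRuns;            // average cost of one run
        return (targetNanos + perRunNanos - 1) / perRunNanos; // ceiling division
    }

    public static void main(String[] args) {
        long target = 500_000_000L; // 0.5 seconds, in nanoseconds
        // The attached output showed 500 runs taking roughly 0.05 to 0.1 seconds.
        System.out.println(runsForTarget(50_000_000L, 500, target));
        System.out.println(runsForTarget(100_000_000L, 500, target));
    }
}
```

These figures are lower bounds taken from the original timings; a larger RUNS such as 10000 leaves headroom in case the per-run cost turns out lower than in the trial.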

What is the purpose of these blocks in the benchmark code?

               at (target) async
               {
                   sleep(SLEEP);
               }

These blocks would seem to double the number of (counted) activities, but
it's not clear whether they were meant to be included in the timing.  More
importantly, they tie up worker threads at the receiving place, meaning
there is a limit to the number of remote async activities (RUNS) that can
be processed.  Is this an important feature of the benchmark?  (I doubt a
real X10 program would have large numbers of calls to sleep() from user
code.)

Also, in the version of the code you sent, asyncAt() is used for both
tests, and asyncUncountedAt() is never called - I presume this is a typo
not present in the version you tested?

Finally, I suspect the timer calls are at too low a level of nesting to be
accurate.  The benchmark code calls System.nanoTime() twice per run.  This
means that a significant amount of time may be attributable to the overhead
of the timer function itself.  I would suggest hoisting the timer calls
outside of the loop over RUNS, so the timer is only called twice per
execution.
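
To illustrate the difference, here is a minimal sketch of the two timing styles. It is written in Java rather than X10 (System.nanoTime() has the same name there), and it is not the code from Test.x10: the class name, the method names and the empty placeholder workload are all assumptions made for the example.

```java
public class TimerHoisting {
    /** Times every run separately: 2 * runs calls to System.nanoTime(). */
    static long timePerRun(int runs, Runnable work) {
        long total = 0;
        for (int i = 0; i < runs; i++) {
            long start = System.nanoTime();
            work.run();
            total += System.nanoTime() - start; // timer overhead is paid on every run
        }
        return total;
    }

    /** Times the whole loop: exactly two calls to System.nanoTime(). */
    static long timeHoisted(int runs, Runnable work) {
        long start = System.nanoTime();
        for (int i = 0; i < runs; i++) {
            work.run();
        }
        return System.nanoTime() - start;
    }

    public static void main(String[] args) {
        int runs = 10000;
        Runnable work = () -> { }; // placeholder for the operation being measured
        System.out.println("per-run timing: " + timePerRun(runs, work) + " ns total");
        System.out.println("hoisted timing: " + timeHoisted(runs, work) + " ns total");
    }
}
```

With a cheap operation, the first style mostly measures the timer itself, which is why I would divide a single hoisted measurement by RUNS instead.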

If I modify your benchmark code accordingly and use RUNS=10000, the
repeatable result is that Uncounted async is faster over both Sockets and
MPI (roughly 1.8-2.9x over Sockets and 1.2-2.1x over MPI in the runs
below; see attached version of the code).  Can you confirm on your test
machines?

(See attached file: Test.x10)

$ x10c++ -O -NO_CHECKS Test.x10
$ X10_NPLACES=2 ./a.out 3 10000 1000 2000
[MASTER Place(0)]: starting test...
--------------------------------------------------------------------------------
[MASTER Place(0)]: starting execution 0
[MASTER Place(0)]: starting asyncAt test... done!
[MASTER Place(0)]: sleeping 2000 ms to let all open activities finish.
[MASTER Place(0)]: starting asyncUncountedAt test... done!
[MASTER Place(0)]: sleeping 2000 ms to let all open activities finish.
[MASTER Place(0)]: test statistics:
[MASTER Place(0)]: asyncAt: 10000 runs in 0.515896001 seconds -> 51589.0 ns per run.
[MASTER Place(0)]: asyncUncountedAt: 10000 runs in 0.177069871 seconds -> 17706.0 ns per run.
[MASTER Place(0)]: execution 0 done.
--------------------------------------------------------------------------------
[MASTER Place(0)]: starting execution 1
[MASTER Place(0)]: starting asyncAt test... done!
[MASTER Place(0)]: sleeping 2000 ms to let all open activities finish.
[MASTER Place(0)]: starting asyncUncountedAt test... done!
[MASTER Place(0)]: sleeping 2000 ms to let all open activities finish.
[MASTER Place(0)]: test statistics:
[MASTER Place(0)]: asyncAt: 10000 runs in 0.461985174 seconds -> 46198.0 ns per run.
[MASTER Place(0)]: asyncUncountedAt: 10000 runs in 0.199475774 seconds -> 19947.0 ns per run.
[MASTER Place(0)]: execution 1 done.
--------------------------------------------------------------------------------
[MASTER Place(0)]: starting execution 2
[MASTER Place(0)]: starting asyncAt test... done!
[MASTER Place(0)]: sleeping 2000 ms to let all open activities finish.
[MASTER Place(0)]: starting asyncUncountedAt test... done!
[MASTER Place(0)]: sleeping 2000 ms to let all open activities finish.
[MASTER Place(0)]: test statistics:
[MASTER Place(0)]: asyncAt: 10000 runs in 0.485613161 seconds -> 48561.0 ns per run.
[MASTER Place(0)]: asyncUncountedAt: 10000 runs in 0.269239342 seconds -> 26923.0 ns per run.
[MASTER Place(0)]: execution 2 done.
$
$ x10c++ -x10rt mpi -O -NO_CHECKS Test.x10
$ mpiexec -n 2 ./a.out 3 10000 1000 2000
[MASTER Place(0)]: starting test...
--------------------------------------------------------------------------------
[MASTER Place(0)]: starting execution 0
[MASTER Place(0)]: starting asyncAt test... done!
[MASTER Place(0)]: sleeping 2000 ms to let all open activities finish.
[MASTER Place(0)]: starting asyncUncountedAt test... done!
[MASTER Place(0)]: sleeping 2000 ms to let all open activities finish.
[MASTER Place(0)]: test statistics:
[MASTER Place(0)]: asyncAt: 10000 runs in 0.070323175 seconds -> 7032.0 ns per run.
[MASTER Place(0)]: asyncUncountedAt: 10000 runs in 0.059020186 seconds -> 5902.0 ns per run.
[MASTER Place(0)]: execution 0 done.
--------------------------------------------------------------------------------
[MASTER Place(0)]: starting execution 1
[MASTER Place(0)]: starting asyncAt test... done!
[MASTER Place(0)]: sleeping 2000 ms to let all open activities finish.
[MASTER Place(0)]: starting asyncUncountedAt test... done!
[MASTER Place(0)]: sleeping 2000 ms to let all open activities finish.
[MASTER Place(0)]: test statistics:
[MASTER Place(0)]: asyncAt: 10000 runs in 0.071089413 seconds -> 7108.0 ns per run.
[MASTER Place(0)]: asyncUncountedAt: 10000 runs in 0.03705698 seconds -> 3705.0 ns per run.
[MASTER Place(0)]: execution 1 done.
--------------------------------------------------------------------------------
[MASTER Place(0)]: starting execution 2
[MASTER Place(0)]: starting asyncAt test... done!
[MASTER Place(0)]: sleeping 2000 ms to let all open activities finish.
[MASTER Place(0)]: starting asyncUncountedAt test... done!
[MASTER Place(0)]: sleeping 2000 ms to let all open activities finish.
[MASTER Place(0)]: test statistics:
[MASTER Place(0)]: asyncAt: 10000 runs in 0.079891188 seconds -> 7989.0 ns per run.
[MASTER Place(0)]: asyncUncountedAt: 10000 runs in 0.038455636 seconds -> 3845.0 ns per run.
[MASTER Place(0)]: execution 2 done.

Cheers,
                                                                                
   
Josh Milthorpe
Post Doctoral Researcher
Cognitive Systems: Learning to Reason
IBM Research
Phone: 1-914-945-2209
E-mail: jjmil...@us.ibm.com
1101 Kitchawan Rd
Yorktown Heights, NY 10598
United States

From:   Marco Bungart <m.bung...@gmx.net>
To:     Mailing list for users of the X10 programming language
            <x10-users@lists.sourceforge.net>
Date:   07/26/2016 09:59 AM
Subject:        [X10-users] Performance: at (...) async vs. at (...) @Uncounted async



Hi all,

we monitored one of our programs and observed a strange behaviour: the
compiler-annotation @Uncounted seems to slow down program execution. I
constructed a sample program (attached) to simulate the behaviour
(please ignore any kind of synchronization problem, I tried to keep the
test as simple as possible).

I compiled the attached program with the current git-version of X10. The
tests were conducted with the default (sockets) and mpi (OpenMPI
1.7.1ULFM) RT implementations. Environment variables were set to
X10_NPLACES=2 and X10_NTHREADS=1. The program output is in the
corresponding attachments.

Bottom line: with the socket implementation, at (...) @Uncounted async {
... } seems to be slower than or equally fast as a "normal" at (...)
async { ... }. With the MPI implementation, execution times are as
expected (@Uncounted is faster than its counterpart). I was able to
observe this behaviour on two different machines (both linux, one Ubuntu
16.04.1 LTS, the other a CentOS Linux release 7.2.1511). Is there
something wrong within the sockets implementation or is this behaviour
expected? If this is expected behaviour, could someone tell me why, with
sockets RT, an at @Uncounted async is slower?

Thanks and cheers,
Marco
[attachment "Test.x10" deleted by Joshua J Milthorpe/Watson/IBM]
[attachment "mpi.txt" deleted by Joshua J Milthorpe/Watson/IBM] [attachment
"sockets.txt" deleted by Joshua J Milthorpe/Watson/IBM]

X10-users mailing list
X10-users@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/x10-users

Attachment: Test.x10
Description: Binary data
