Hi. There can be many reasons, but most likely you are creating far more activities than the system has parallelism for: in add2(), foreach spawns a separate activity for every iteration, while only two worker threads are available. See the following ICS 2009 paper for an explanation and a way out:
Chunking Parallel Loops in the Presence of Synchronization. By Jun Shirako, Jisheng Zhao, V. Krishna Nandivada, Vivek Sarkar.

Warm regards,
Krishna.

From: "Liu, Xing" <xing....@gatech.edu>
To: x10-users <x10-users@lists.sourceforge.net>
Date: 04/13/2010 10:24 PM
Subject: [X10-users] Performance of foreach

Hi,

We used the following code to test the performance of foreach. add1() is sequential code. In add2(), we use foreach and let X10 partition the workload. In add3(), we partition the workload ourselves.
We use the C++ backend and run the code as

    # env X10_NTHREADS=2 runx10 ./Test_foreach

The performance:

    time of add1() = 32.3 ms
    time of add2() = 3277.98 ms
    time of add3() = 18.33 ms

It is surprising that add2() is about 100 times slower than add1(). Does anyone know the reason?

Thanks.

    // Test_foreach.x10

    // Sequential baseline.
    def add1() {
        for ((i) in 0..size-1) {
            data(i) += 5;
        }
    }

    // One activity per element.
    def add2() {
        finish foreach ((i) in 0..size-1) {
            data(i) += 5;
        }
    }

    // One activity per chunk; covers every element only if size is
    // divisible by numThreads.
    def add3() {
        var numThreads: Int = 2;
        val mySize = size/numThreads;
        finish foreach ((p) in 0..numThreads-1) {
            for ((i) in p*mySize..(p+1)*mySize-1) {
                data(i) += 5;
            }
        }
    }
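
As a rough illustration of the chunking idea behind add3() (one task per chunk rather than one per element), here is a minimal sketch. It is written in Python rather than X10 and is not the algorithm from the ICS paper. The names chunk_bounds and add_chunked are hypothetical, and the remainder handling for sizes that are not divisible by the number of chunks is an assumption that goes beyond the original add3(). It only shows the partitioning; it says nothing about X10 timings.

```python
from concurrent.futures import ThreadPoolExecutor


def chunk_bounds(size, num_chunks):
    """Split range(size) into num_chunks contiguous half-open [lo, hi) ranges.

    Integer arithmetic spreads any remainder across the chunks, so every
    index is covered exactly once even when size % num_chunks != 0.
    """
    return [(p * size // num_chunks, (p + 1) * size // num_chunks)
            for p in range(num_chunks)]


def add_chunked(data, num_chunks=2, delta=5):
    """Add delta to every element using one task per chunk, not per element."""
    def work(bounds):
        lo, hi = bounds
        for i in range(lo, hi):
            data[i] += delta

    with ThreadPoolExecutor(max_workers=num_chunks) as pool:
        # list(...) waits for all chunks and re-raises any worker exception,
        # loosely playing the role of X10's `finish`.
        list(pool.map(work, chunk_bounds(len(data), num_chunks)))
    return data
```

With num_chunks=2, add_chunked creates two tasks regardless of len(data), which mirrors add3(). The per-element equivalent of add2() would create len(data) tasks, and that is where the scheduling overhead comes from.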