[Bug libgomp/79784] Synchronization overhead is thrashing on Aarch64

2017-03-02 Thread cbz at baozis dot org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=79784

--- Comment #10 from Chen Baozi  ---
I have attached the testcase I used to benchmark OpenMP synchronization on
AArch64, which is extracted from the EPCC OpenMP micro-benchmark suite.

The operating system I use is Ubuntu 16.04 with a 4.4.0 kernel. The hardware
is an experimental 16-core AArch64 platform: there are 4 clusters of CPU
cores interconnected through the L3 cache, each containing 4 cores. The
thrashing seems to be more severe when the threads are packed into a single
cluster; e.g., 4 threads distributed across 4 different clusters perform much
better than 4 threads placed in one cluster.
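
For reference, the comparison boils down to timing a tight barrier loop and
running it under different thread placements. Below is a minimal sketch along
those lines (not the EPCC code itself; the cluster/core numbering used in the
example invocations is an assumption and has to be adjusted to the real
topology):

 #include <omp.h>
 #include <stdio.h>

 #define ITERS 10000

 int main (void)
 {
   /* Warm up the thread team so thread creation is not measured.  */
 #pragma omp parallel
   { }

   double t0 = omp_get_wtime ();
 #pragma omp parallel
   {
     for (int i = 0; i < ITERS; i++)
       {
 #pragma omp barrier
       }
   }
   double t1 = omp_get_wtime ();

   printf ("%d threads: %.3f us per barrier\n",
           omp_get_max_threads (), (t1 - t0) * 1e6 / ITERS);
   return 0;
 }

Built with gcc -fopenmp, running it as
OMP_NUM_THREADS=4 GOMP_CPU_AFFINITY="0 1 2 3" ./barrier (one cluster, if
cores 0-3 share a cluster) versus
OMP_NUM_THREADS=4 GOMP_CPU_AFFINITY="0 4 8 12" ./barrier (one core per
cluster) should reproduce the difference described above.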

[Bug libgomp/79784] Synchronization overhead is thrashing on Aarch64

2017-03-01 Thread cbz at baozis dot org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=79784

--- Comment #7 from Chen Baozi  ---
Created attachment 40867
  --> https://gcc.gnu.org/bugzilla/attachment.cgi?id=40867&action=edit
synchronization test case

[Bug libgomp/79784] New: Synchronization overhead is thrashing on Aarch64

2017-03-01 Thread cbz at baozis dot org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=79784

Bug ID: 79784
   Summary: Synchronization overhead is thrashing on Aarch64
   Product: gcc
   Version: 7.0.1
Status: UNCONFIRMED
  Severity: normal
  Priority: P3
 Component: libgomp
  Assignee: unassigned at gcc dot gnu.org
  Reporter: cbz at baozis dot org
CC: jakub at gcc dot gnu.org
  Target Milestone: ---

I have recently been running several programs (caffe, mxnet, openblas,
blis...) on aarch64, and I found a performance regression when libgomp (the
GCC implementation of OpenMP) is used and OMP_NUM_THREADS is set to >2.
Almost half of the execution time is consumed either in
gomp_barrier_wait_end() or gomp_team_barrier_wait_end(). I then ran the EPCC
OpenMP micro-benchmark suite to measure the overhead of GOMP's
synchronization mechanism on AArch64, and it looks pretty bad. The PARALLEL
overhead varies from ~1ms to ~2000ms.
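
For reference, the EPCC "PARALLEL" figure is essentially the cost of opening
and closing an empty parallel region, averaged over many repetitions. A
minimal sketch of that kind of measurement (not the actual EPCC code):

 #include <omp.h>
 #include <stdio.h>

 #define REPS 1000

 int main (void)
 {
   /* Warm up the thread team once so thread creation is not measured.  */
 #pragma omp parallel
   { }

   double t0 = omp_get_wtime ();
   for (int i = 0; i < REPS; i++)
     {
 #pragma omp parallel
       { }                        /* fork + join barrier, nothing else */
     }
   double t1 = omp_get_wtime ();

   printf ("PARALLEL overhead: %.3f us per region (%d threads)\n",
           (t1 - t0) * 1e6 / REPS, omp_get_max_threads ());
   return 0;
 }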

I used Linux perf to analyze the hot spots of the program and found that
most of the execution time is spent in the barrier loop waiting for the other
threads to synchronize. In gomp_barrier_wait_end(), it is the following
section:

 do
   do_wait ((int *) &bar->generation, state);
 while (__atomic_load_n (&bar->generation, MEMMODEL_ACQUIRE) == state);
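
For context, do_wait() on Linux follows a spin-then-futex pattern (see
libgomp/config/linux/wait.h): each thread spins on the generation word for a
bounded number of iterations (tunable through GOMP_SPINCOUNT) and only then
sleeps on a futex. The following is a self-contained sketch of that pattern,
not the libgomp code itself; the spin count and the surrounding test program
are made up for illustration:

 #define _GNU_SOURCE
 #include <linux/futex.h>
 #include <sys/syscall.h>
 #include <unistd.h>
 #include <pthread.h>
 #include <stdatomic.h>
 #include <stdio.h>

 static atomic_int generation = 0;

 static void futex_wait (atomic_int *addr, int val)
 {
   /* Sleep until woken, or return at once if *addr no longer equals val.  */
   syscall (SYS_futex, (int *) addr, FUTEX_WAIT_PRIVATE, val, NULL, NULL, 0);
 }

 static void futex_wake (atomic_int *addr, int count)
 {
   syscall (SYS_futex, (int *) addr, FUTEX_WAKE_PRIVATE, count, NULL, NULL, 0);
 }

 /* Spin for a bounded number of iterations; if the value still has not
    changed, fall back to sleeping on the futex.  Every relaxed load in the
    spin phase touches the shared cache line holding the generation word,
    which is where contention between cores shows up.  */
 static void do_wait_sketch (atomic_int *addr, int val)
 {
   for (long i = 0; i < 100000; i++)   /* arbitrary spin count */
     if (atomic_load_explicit (addr, memory_order_relaxed) != val)
       return;
   futex_wait (addr, val);
 }

 static void *waiter (void *arg)
 {
   (void) arg;
   int state = 0;
   do
     do_wait_sketch (&generation, state);
   while (atomic_load_explicit (&generation, memory_order_acquire) == state);
   puts ("waiter released");
   return NULL;
 }

 int main (void)
 {
   pthread_t t;
   pthread_create (&t, NULL, waiter, NULL);
   sleep (1);                       /* let the waiter spin for a while */
   atomic_fetch_add_explicit (&generation, 1, memory_order_release);
   futex_wake (&generation, 1);     /* release it, as the last thread does */
   pthread_join (t, NULL);
   return 0;
 }

(Build with gcc -pthread.)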

I'm not quite sure whether it is a known issue on Aarch64. If so, is there any
way to fix it?