https://gcc.gnu.org/bugzilla/show_bug.cgi?id=79784
Bug ID: 79784
Summary: Synchronization overhead is thrashing on Aarch64
Product: gcc
Version: 7.0.1
Status: UNCONFIRMED
Severity: normal
Priority: P3
Component: libgomp
Assignee: unassigned at gcc dot gnu.org
Reporter: cbz at baozis dot org
CC: jakub at gcc dot gnu.org
Target Milestone: ---

I have recently been running several programs (caffe, mxnet, openblas,
blis...) on aarch64, and I found a performance regression when libgomp (the
GCC implementation of OpenMP) is used and OMP_NUM_THREADS is set to a value
greater than 2. Almost half of the execution time is spent in either
gomp_barrier_wait_end() or gomp_team_barrier_wait_end().

I then ran the EPCC OpenMP micro-benchmark suite to measure the overhead of
GOMP's synchronization mechanisms on Aarch64, and it looks pretty bad: the
PARALLEL overhead varies from ~1ms to ~2000ms.

I used Linux perf to analyze the program's hot spots, and found that most of
the execution time is spent in the barrier loop that waits for the other
threads to synchronize. In gomp_barrier_wait_end(), it is the following
section:

  do
    do_wait ((int *) &bar->generation, state);
  while (__atomic_load_n (&bar->generation, MEMMODEL_ACQUIRE) == state);

I'm not sure whether this is a known issue on Aarch64. If so, is there any
way to fix it?
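For readers unfamiliar with the pattern, the loop quoted above is the waiting side of a generation-counter barrier: each thread records the current generation, and every thread except the last one to arrive re-checks the counter until it changes. The following is only a minimal sketch of that idea using C11 atomics and pthreads. It is not libgomp's code. All `demo_*` names and `run_demo` are hypothetical, and `demo_wait` is an assumed stand-in (bounded spin, then `sched_yield`) for the real `do_wait`, whose implementation is not reproduced here.

```c
#include <assert.h>
#include <pthread.h>
#include <sched.h>
#include <stdatomic.h>

/* Hypothetical barrier; field names are illustrative, not libgomp's. */
typedef struct {
    atomic_uint generation;  /* bumped once per completed barrier */
    atomic_uint arrived;     /* threads that have reached the barrier */
    unsigned total;          /* number of participating threads */
} demo_barrier_t;

static void demo_barrier_init(demo_barrier_t *bar, unsigned total) {
    atomic_init(&bar->generation, 0);
    atomic_init(&bar->arrived, 0);
    bar->total = total;
}

/* Assumed stand-in for do_wait: spin a bounded number of times while
   *addr still equals val, then give up the CPU once and return. */
static void demo_wait(atomic_uint *addr, unsigned val) {
    for (int i = 0; i < 1000; i++)
        if (atomic_load_explicit(addr, memory_order_relaxed) != val)
            return;
    sched_yield();
}

static void demo_barrier_wait(demo_barrier_t *bar) {
    unsigned state = atomic_load_explicit(&bar->generation, memory_order_acquire);
    unsigned n = atomic_fetch_add_explicit(&bar->arrived, 1, memory_order_acq_rel) + 1;
    if (n == bar->total) {
        /* Last arrival: reset the count, then release the waiters. */
        atomic_store_explicit(&bar->arrived, 0, memory_order_relaxed);
        atomic_fetch_add_explicit(&bar->generation, 1, memory_order_release);
        return;
    }
    /* Same shape as the loop quoted in the report. */
    do
        demo_wait(&bar->generation, state);
    while (atomic_load_explicit(&bar->generation, memory_order_acquire) == state);
}

typedef struct {
    demo_barrier_t *bar;
    atomic_uint *counter;
    atomic_uint *errors;
    unsigned nthreads, rounds;
} demo_arg_t;

static void *demo_worker(void *p) {
    demo_arg_t *a = p;
    for (unsigned r = 0; r < a->rounds; r++) {
        atomic_fetch_add(a->counter, 1);
        demo_barrier_wait(a->bar);
        /* After the barrier, every thread's increment must be visible. */
        if (atomic_load(a->counter) != (r + 1) * a->nthreads)
            atomic_fetch_add(a->errors, 1);
        demo_barrier_wait(a->bar);  /* keep rounds from overlapping */
    }
    return NULL;
}

/* Runs nthreads workers for the given number of rounds and returns how
   many times a thread saw an incomplete count after the barrier. */
static unsigned run_demo(unsigned nthreads, unsigned rounds) {
    demo_barrier_t bar;
    atomic_uint counter, errors;
    pthread_t tid[64];
    assert(nthreads >= 1 && nthreads <= 64);
    demo_barrier_init(&bar, nthreads);
    atomic_init(&counter, 0);
    atomic_init(&errors, 0);
    demo_arg_t arg = { &bar, &counter, &errors, nthreads, rounds };
    for (unsigned i = 0; i < nthreads; i++)
        pthread_create(&tid[i], NULL, demo_worker, &arg);
    for (unsigned i = 0; i < nthreads; i++)
        pthread_join(tid[i], NULL);
    return atomic_load(&errors);
}
```

Calling `run_demo(4, 200)` from a test program built with `-pthread` exercises the barrier 400 times per thread. The point of the sketch is that every non-final thread stays in the `do ... while` loop until the last thread bumps `generation`, so time attributed to that loop in a profile is time spent waiting for the slowest thread, however the wait itself is implemented.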