[Bug middle-end/81818] aarch64 uses 2-3x memory and 2x time of arm at -Os, -O2, -O3

2017-08-17 Thread andrewm.roberts at sky dot com
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=81818

--- Comment #11 from Andrew Roberts  ---
Created attachment 41992
  --> https://gcc.gnu.org/bugzilla/attachment.cgi?id=41992&action=edit
gcc-7.2.0 -fmem-report output for arm, aarch64, and x86-64

Output for gcc 7.2.0 with -fmem-report (as gcc-7.2.0-fmem-report.tar.bz2).

g++ -Ox -fmem-report -c testmap.cpp
where -Ox is one of: -O0, -O1, -O2, -O3, or -O1 -fgcse

This is across: x64 (x86-64), arm, aarch64-rpi3 (aarch64).
The two Raspberry Pi 3 systems are identical hardware; one runs a 32-bit OS and
the other a 64-bit OS (Arch Linux ARM).

The files are named: gcc-7.2.0-[arch]-[opt].txt.

2017-08-17 Thread andrewm.roberts at sky dot com

--- Comment #10 from Andrew Roberts  ---
I've attached the output for gcc 7.2.0 with -fmem-report (as
gcc-7.2.0-fmem-report.tar.bz2).

g++ -Ox -fmem-report -c testmap.cpp
where -Ox is one of: -O0, -O1, -O2, -O3, or -O1 -fgcse

This is across: x64 (x86-64), arm, aarch64-rpi3 (aarch64).
The two Raspberry Pi 3 systems are identical hardware; one runs a 32-bit OS and
the other a 64-bit OS (Arch Linux ARM).

The files are named: gcc-7.2.0-[arch]-[opt].txt.
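
The set of runs described above could be enumerated with a short script. This
is only a sketch: the exact [arch] and [opt] spellings used in the attachment's
file names are not given in this comment, so the labels below (for example
"O1-fgcse") are assumptions for illustration.

```python
from itertools import product

# Architectures as named in the comment.
ARCHES = ["x64", "arm", "aarch64-rpi3"]

# Optimization settings from the comment. The dictionary keys are used in the
# file name and are a hypothetical spelling, not taken from the attachment.
OPT_SETTINGS = {
    "O0": ["-O0"],
    "O1": ["-O1"],
    "O2": ["-O2"],
    "O3": ["-O3"],
    "O1-fgcse": ["-O1", "-fgcse"],
}


def report_name(arch, opt_label):
    """File name following the gcc-7.2.0-[arch]-[opt].txt pattern."""
    return f"gcc-7.2.0-{arch}-{opt_label}.txt"


def compile_command(flags):
    """Argument list for one -fmem-report run of the test program."""
    return ["g++", *flags, "-fmem-report", "-c", "testmap.cpp"]


def build_matrix():
    """Yield (command, report file) for every arch/optimization pair."""
    for arch, (label, flags) in product(ARCHES, OPT_SETTINGS.items()):
        yield compile_command(flags), report_name(arch, label)


if __name__ == "__main__":
    for cmd, out in build_matrix():
        # -fmem-report writes its statistics to stderr, hence the 2> redirect.
        print(" ".join(cmd), "2>", out)
```

Each architecture's commands would of course be run on its own machine; the
script only lists them.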

The original issue was a large increase in memory usage on aarch64 compared
with arm at -O2 and above, so I compared -O1 with -O2 in these reports.

There seem to be leaks in the Bitmaps:

              Total      Memory     Percentage
              Memory     Leaked     Leaked
arm -O1:      54067992   10582346   19.57%
arm -O2:      43536148   15595746   35.82%
aarch64 -O1:  39788848    9005047   22.63%
aarch64 -O2:  74521688   42694630   57.29% <= big increase on aarch64 at -O2
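
The percentages in the table can be re-derived from the two raw columns. The
sketch below does that and also computes how much the leaked figure grows
from -O1 to -O2 on each target. The numbers are copied from the table; the
function and variable names are only illustrative.

```python
# (total bitmap memory, memory leaked) as listed in the table.
BITMAP_STATS = {
    "arm -O1": (54067992, 10582346),
    "arm -O2": (43536148, 15595746),
    "aarch64 -O1": (39788848, 9005047),
    "aarch64 -O2": (74521688, 42694630),
}


def percent_leaked(total, leaked):
    """Leaked memory as a percentage of total bitmap memory."""
    return 100.0 * leaked / total


def leak_growth(arch):
    """Ratio of leaked memory at -O2 to leaked memory at -O1 for one target."""
    leaked_o1 = BITMAP_STATS[f"{arch} -O1"][1]
    leaked_o2 = BITMAP_STATS[f"{arch} -O2"][1]
    return leaked_o2 / leaked_o1


if __name__ == "__main__":
    for label, (total, leaked) in BITMAP_STATS.items():
        print(f"{label:12s} {percent_leaked(total, leaked):6.2f}%")
    for arch in ("arm", "aarch64"):
        print(f"{arch}: leaked grows {leak_growth(arch):.2f}x from -O1 to -O2")
```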

47% of the leaks at -O2 on aarch64 are in:
df-problems.c:1912 (df_mir_alloc)  543920:  0.7%  202813600  10167911: 23.8%  0  0  heap
df-problems.c:1913 (df_mir_alloc)  544080:  0.7%  202798720  10167165: 23.8%  0  0  heap

32% of the leaks at -O2 on x86-64 are in the same place, so I guess this is a
64-bit code path.

I don't see anything else that stands out as differing between arm and aarch64
as they move from -O1 to -O2. There are plenty of other leaks, though I have no
idea how significant they are.

The arm gcc is configured with:
/usr/local/gcc/bin/g++ -v
Using built-in specs.
COLLECT_GCC=/usr/local/gcc/bin/g++
COLLECT_LTO_WRAPPER=/usr/local/gcc-7.2.0/libexec/gcc/armv7l-unknown-linux-gnueabihf/7.2.0/lto-wrapper
Target: armv7l-unknown-linux-gnueabihf
Configured with: ../gcc-7.2.0/configure --prefix=/usr/local/gcc-7.2.0
--program-suffix= --disable-werror --enable-shared --enable-threads=posix
--enable-checking=release --with-system-zlib --enable-__cxa_atexit
--disable-libunwind-exceptions --enable-gnu-unique-object
--enable-linker-build-id --with-linker-hash-style=gnu --enable-plugin
--enable-gnu-indirect-function --enable-lto --with-isl
--enable-languages=c,c++,fortran --disable-libgcj --enable-clocale=gnu
--disable-libstdcxx-pch --enable-install-libiberty --disable-multilib
--disable-libssp --host=armv7l-unknown-linux-gnueabihf
--build=armv7l-unknown-linux-gnueabihf --with-arch=armv7-a --with-float=hard
--with-fpu=vfpv3-d16 --disable-bootstrap --enable-gather-detailed-mem-stats
Thread model: posix
gcc version 7.2.0 (GCC)

The aarch64 gcc is configured with:
/usr/local/gcc/bin/g++ -v
Using built-in specs.
COLLECT_GCC=/usr/local/gcc/bin/g++
COLLECT_LTO_WRAPPER=/usr/local/gcc-7.2.0/libexec/gcc/aarch64-unknown-linux-gnu/7.2.0/lto-wrapper
Target: aarch64-unknown-linux-gnu
Configured with: ../gcc-7.2.0/configure --prefix=/usr/local/gcc-7.2.0
--program-suffix= --disable-werror --enable-shared --enable-threads=posix
--enable-checking=release --with-system-zlib --enable-__cxa_atexit
--disable-libunwind-exceptions --enable-gnu-unique-object
--enable-linker-build-id --with-linker-hash-style=gnu --enable-plugin
--enable-gnu-indirect-function --enable-lto --with-isl
--enable-languages=c,c++,fortran --disable-libgcj --enable-clocale=gnu
--disable-libstdcxx-pch --enable-install-libiberty --disable-multilib
--enable-shared --enable-clocale=gnu --with-arch-directory=aarch64
--enable-multiarch --disable-libssp --host=aarch64-unknown-linux-gnu
--build=aarch64-unknown-linux-gnu --with-arch=armv8-a --disable-bootstrap
--enable-gather-detailed-mem-stats
Thread model: posix
gcc version 7.2.0 (GCC)

The x86-64 gcc is configured with:
/usr/local/gcc/bin/g++ -v
Using built-in specs.
COLLECT_GCC=/usr/local/gcc/bin/g++
COLLECT_LTO_WRAPPER=/usr/local/gcc-7.2.0/libexec/gcc/x86_64-unknown-linux-gnu/7.2.0/lto-wrapper
Target: x86_64-unknown-linux-gnu
Configured with: ../gcc-7.2.0/configure --prefix=/usr/local/gcc-7.2.0
--program-suffix= --disable-werror --enable-shared --enable-threads=posix
--enable-checking=release --with-system-zlib --enable-__cxa_atexit
--disable-libunwind-exceptions --enable-gnu-unique-object
--enable-linker-build-id --with-linker-hash-style=gnu --enable-plugin
--enable-initfini-array --enable-gnu-indirect-function --with-isl
--enable-languages=c,c++,fortran,lto --disable-libgcj --enable-lto
--enable-multilib --with-tune=generic --with-arch_32=i686
--host=x86_64-unknown-linux-gnu --build=x86_64-unknown-linux-gnu
--with-ld=/usr/local/bin/ld --with-gnu-ld --with-as=/usr/local/bin/as
--with-gnu-as --disable-bootstrap --enable-gather-detailed-mem-stats
Thread model: posix
gcc version 7.2.0 (GCC)

2017-08-16 Thread rearnsha at gcc dot gnu.org

--- Comment #9 from Richard Earnshaw  ---
(In reply to Andrew Roberts from comment #8)
> I've tried building gcc-8-20170806 and gcc-8-20170813 with
> --enable-gather-detailed-mem-stats
> 
> This fails on x86-64, arm and aarch64 with the same error.
> 
> Shall I file a separate bug report for gcc-8?
> 

Yes please.  One bug report per issue.

2017-08-16 Thread andrewm.roberts at sky dot com

--- Comment #8 from Andrew Roberts  ---
I've tried building gcc-8-20170806 and gcc-8-20170813 with
--enable-gather-detailed-mem-stats

This fails on x86-64, arm and aarch64 with the same error.

The recently released 7.2.0 builds OK on x86-64 at least; I'm still testing the
rest.

Shall I file a separate bug report for gcc-8?

The error is:
/home/aroberts/gcc/gcc-build/./gcc/xgcc -B/home/aroberts/gcc/gcc-build/./gcc/
-xc -nostdinc /dev/null -S -o /dev/null
-fself-test=../../gcc-8.0.0/gcc/testsuite/selftests
xgcc: internal compiler error: Segmentation fault (program cc1)
Please submit a full bug report,
with preprocessed source if appropriate.
See  for instructions.
make[2]: *** [Makefile:1952: s-selftest-c] Error 4
rm fsf-funding.pod gcov.pod gpl.pod cpp.pod gfdl.pod gcc.pod gcov-dump.pod
gfortran.pod gcov-tool.pod
make[2]: Leaving directory '/home/aroberts/gcc/gcc-build/gcc'
make[1]: *** [Makefile:4305: all-gcc] Error 2
make[1]: Leaving directory '/home/aroberts/gcc/gcc-build'
make: *** [Makefile:918: all] Error 2

2017-08-16 Thread andrewm.roberts at sky dot com

--- Comment #7 from Andrew Roberts  ---
I'll try the memory testing on both arm and aarch64.

I've also tried -fopt-info-all-optall. I was hoping this would provide some
info on what was happening, but it only seems to produce output at -O3.

2017-08-16 Thread andrewm.roberts at sky dot com

--- Comment #6 from Andrew Roberts  ---
Looks like this info got purged by the bugzilla failure, so I re-posted it; the
text and measurements are the same as in my 2017-08-13 comment.

2017-08-16 Thread rguenth at gcc dot gnu.org

--- Comment #5 from Richard Biener  ---
Looking for memory leaks in the backend might be interesting.  Note you can
build GCC with --enable-gather-detailed-mem-stats and use -fmem-report to get
an idea where memory is allocated (and freed).

2017-08-13 Thread andrewm.roberts at sky dot com

--- Comment #5 from Andrew Roberts  ---
Ok, I've done some more digging. 

Looking at the optimization options enabled by -O2 vs -O1, I built the test
program at -O1 and enabled each optimization in turn, on both ARM and AARCH64.

It looks like -fgcse is using the most memory of all the optimizations.
On ARM "-O1 -fgcse" is using MORE memory than "-O2". 

This suggests to me that on ARM the gcse optimization is not being run at -O2,
perhaps because of some cost/benefit analysis, whereas on AARCH64 it is. Is
there any way to get some information out of gcc to confirm this?

On AARCH64 -fgcse results in a huge compile time increase because the
additional memory usage causes massive swapping. ARM compile time increased by
14%, but AARCH64 compile time increased by 400%. When there is enough RAM to
avoid swapping, -fgcse looks OK (2GB on odroid-c2).

Tested using: gcc version 8.0.0 20170806 (experimental) (GCC) on a
Raspberry Pi 3 with 1GB RAM (both armv7l and aarch64).

For ARM:

Optimization Level: -O1 -falign-functions
Time=1:20.76 Mem=320040 PageFaults=0
Optimization Level: -O1 -falign-jumps
Time=1:21.10 Mem=319940 PageFaults=0
Optimization Level: -O1 -falign-labels
Time=1:21.00 Mem=320028 PageFaults=0
Optimization Level: -O1 -falign-loops
Time=1:20.62 Mem=320028 PageFaults=0
Optimization Level: -O1 -fcaller-saves
Time=1:20.45 Mem=319884 PageFaults=0
Optimization Level: -O1 -fcode-hoisting
Time=1:22.01 Mem=320832 PageFaults=0
Optimization Level: -O1 -fcrossjumping
Time=1:21.28 Mem=320164 PageFaults=0
Optimization Level: -O1 -fcse-follow-jumps
Time=1:20.47 Mem=32 PageFaults=0
Optimization Level: -O1 -fdevirtualize
Time=1:42.07 Mem=320032 PageFaults=0
Optimization Level: -O1 -fdevirtualize-speculatively
Time=1:20.44 Mem=320008 PageFaults=0
Optimization Level: -O1 -fexpensive-optimizations
Time=1:22.92 Mem=321752 PageFaults=0
Optimization Level: -O1 -fgcse
Time=1:34.12 Mem=556640 PageFaults=0 <
Optimization Level: -O1 -fhoist-adjacent-loads
Time=1:20.45 Mem=319940 PageFaults=0
Optimization Level: -O1 -findirect-inlining
Time=1:21.31 Mem=320020 PageFaults=0
Optimization Level: -O1 -finline-small-functions
Time=1:32.36 Mem=319992 PageFaults=0
Optimization Level: -O1 -fipa-bit-cp
Time=1:21.13 Mem=320008 PageFaults=0
Optimization Level: -O1 -fipa-cp
Time=1:19.94 Mem=322140 PageFaults=0
Optimization Level: -O1 -fipa-icf
Time=1:21.50 Mem=319940 PageFaults=0
Optimization Level: -O1 -fipa-icf-functions
Time=1:20.93 Mem=320060 PageFaults=0
Optimization Level: -O1 -fipa-icf-variables
Time=1:20.48 Mem=320044 PageFaults=0
Optimization Level: -O1 -fipa-ra
Time=1:20.58 Mem=320284 PageFaults=0
Optimization Level: -O1 -fipa-sra
Time=1:12.69 Mem=310648 PageFaults=0
Optimization Level: -O1 -fipa-vrp
Time=1:20.45 Mem=319836 PageFaults=0
Optimization Level: -O1 -fisolate-erroneous-paths-dereference
Time=1:20.61 Mem=320024 PageFaults=0
Optimization Level: -O1 -flra-remat
Time=1:20.56 Mem=319944 PageFaults=0
Optimization Level: -O1 -foptimize-sibling-calls
Time=1:20.69 Mem=320012 PageFaults=0
Optimization Level: -O1 -foptimize-strlen
Time=1:21.10 Mem=320024 PageFaults=0
Optimization Level: -O1 -fpartial-inlining
Time=1:21.19 Mem=319888 PageFaults=0
Optimization Level: -O1 -fpeephole2
Time=1:20.75 Mem=319888 PageFaults=0
Optimization Level: -O1 -freorder-functions
Time=1:20.63 Mem=319884 PageFaults=0
Optimization Level: -O1 -frerun-cse-after-loop
Time=1:21.96 Mem=320984 PageFaults=0
Optimization Level: -O1 -fschedule-insns2
Time=1:24.68 Mem=343916 PageFaults=0
Optimization Level: -O1 -fschedule-insns
Time=1:52.77 Mem=324696 PageFaults=0
Optimization Level: -O1 -fstore-merging
Time=1:20.47 Mem=320208 PageFaults=0
Optimization Level: -O1 -fstrict-aliasing
Time=1:20.86 Mem=319880 PageFaults=0
Optimization Level: -O1 -fthread-jumps
Time=1:20.31 Mem=319900 PageFaults=0
Optimization Level: -O1 -ftree-pre
Time=1:21.38 Mem=320696 PageFaults=0
Optimization Level: -O1 -ftree-switch-conversion
Time=1:20.51 Mem=320004 PageFaults=0
Optimization Level: -O1 -ftree-tail-merge
Time=1:21.13 Mem=320040 PageFaults=0
Optimization Level: -O1 -ftree-vrp
Time=1:21.01 Mem=323032 PageFaults=0

For AARCH64:

Optimization Level: -O1 -falign-functions
Time=2:22.49 Mem=393844 PageFaults=150
Optimization Level: -O1 -falign-jumps
Time=2:20.70 Mem=393952 PageFaults=0
Optimization Level: -O1 -falign-labels
Time=2:21.09 Mem=393880 PageFaults=0
Optimization Level: -O1 -falign-loops
Time=2:20.68 Mem=393956 PageFaults=0
Optimization Level: -O1 -fcaller-saves
Time=2:20.98 Mem=393968 PageFaults=0
Optimization Level: -O1 -fcode-hoisting
Time=2:22.60 Mem=395656 PageFaults=0
Optimization Level: -O1 -fcrossjumping
Time=2:21.69 Mem=393956 PageFaults=0
Optimization Level: -O1 -fcse-follow-jumps
Time=2:21.12 Mem=393968 PageFaults=0
Optimization Level: -O1 -fdevirtualize
Time=2:58.68 Mem=393412 PageFaults=0
Optimization Level: -O1 -fdevirtualize-speculatively
Time=2:20.83 Mem=393968 PageFaults=0
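
Listings in this two-line format are easier to compare once parsed. The sketch
below reads "Optimization Level:" / "Time=... Mem=... PageFaults=..." pairs
and flags entries whose Mem value is well above the median. The units of Mem
are not stated in this comment, so the code treats it as an opaque number, and
the 1.2 threshold is an arbitrary assumption.

```python
import re

# One entry: a flags line followed by a measurements line.
ENTRY_RE = re.compile(
    r"Optimization Level: (?P<flags>.+?)\s*\n"
    r"Time=(?P<min>\d+):(?P<sec>\d+\.\d+) Mem=(?P<mem>\d+) PageFaults=(?P<pf>\d+)"
)


def parse_log(text):
    """Return one dict per complete entry found in the log text."""
    rows = []
    for m in ENTRY_RE.finditer(text):
        rows.append({
            "flags": m["flags"],
            "seconds": int(m["min"]) * 60 + float(m["sec"]),
            "mem": int(m["mem"]),
            "page_faults": int(m["pf"]),
        })
    return rows


def outliers(rows, factor=1.2):
    """Rows whose Mem exceeds `factor` times the median Mem of all rows."""
    mems = sorted(r["mem"] for r in rows)
    median = mems[len(mems) // 2]
    return [r for r in rows if r["mem"] > factor * median]


# Three ARM entries copied from the listing above.
SAMPLE = """\
Optimization Level: -O1 -fexpensive-optimizations
Time=1:22.92 Mem=321752 PageFaults=0
Optimization Level: -O1 -fgcse
Time=1:34.12 Mem=556640 PageFaults=0 <
Optimization Level: -O1 -fhoist-adjacent-loads
Time=1:20.45 Mem=319940 PageFaults=0
"""

if __name__ == "__main__":
    for row in outliers(parse_log(SAMPLE)):
        print(row["flags"], row["mem"], f"{row['seconds']:.2f}s")
```

On the three sample entries only "-O1 -fgcse" is flagged, which matches the
entry marked with "<" in the ARM listing.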

2017-08-11 Thread andrewm.roberts at sky dot com

--- Comment #4 from Andrew Roberts  ---
Looking at --param ggc-min-expand and --param ggc-min-heapsize

For gcc 8.0.0:
on arm with 1Gb RAM:
GGC heuristics: --param ggc-min-expand=93 --param ggc-min-heapsize=119808
on aarch64 with 1Gb RAM:
GGC heuristics: --param ggc-min-expand=88 --param ggc-min-heapsize=109859

So these are already slightly lower on aarch64 than on arm (presumably because
less RAM is free after kernel usage: 789M on aarch64 vs 889M on arm).
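
For reference, the GCC manual documents the default for ggc-min-expand as
30% + 70% * (RAM/1GB), capped at 100%, and the default for ggc-min-heapsize as
RAM/8 with a lower bound of 4096 and an upper bound of 131072 (resource limits
can lower it further). The sketch below applies those documented formulas. The
RAM totals are not measured values: they are inferred by multiplying the
reported ggc-min-heapsize by 8, which is an assumption, and resource limits are
ignored.

```python
KB_PER_GB = 1024 * 1024


def ggc_min_expand(ram_kb):
    """Documented default: 30% + 70% * (RAM/1GB), at most 100%, truncated."""
    fraction = min(ram_kb / KB_PER_GB, 1.0)
    return int(30 + 70 * fraction)


def ggc_min_heapsize(ram_kb):
    """Documented default: RAM/8 in kB, clamped to [4096, 131072].

    RLIMIT_RSS, RLIMIT_DATA and RLIMIT_AS are ignored in this sketch.
    """
    return max(4096, min(ram_kb // 8, 131072))


# (ggc-min-expand, ggc-min-heapsize) as reported by gcc 8.0.0 above.
REPORTED = {"arm": (93, 119808), "aarch64": (88, 109859)}

if __name__ == "__main__":
    for arch, (expand, heapsize) in REPORTED.items():
        ram_kb = heapsize * 8  # assumption: heapsize came from RAM/8
        print(arch, ram_kb, ggc_min_expand(ram_kb), ggc_min_heapsize(ram_kb))
```

Under that assumption both reported pairs are reproduced, which would suggest
the difference comes from the amount of RAM GCC sees on each system.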

Looking at individual optimizations:

As -O2 uses much more memory than -O1, I worked out which optimizations differ
between them and tried building at -O2 with each of these disabled, one at a
time. The most that disabling any single optimization reduced the memory
footprint by was 4%, so there is no smoking gun there. The optimizations
enabled at -O2 are the same for arm and aarch64.
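
The one-at-a-time experiment can be sketched as a generator of command lines.
The flag list here is a small illustrative subset of -O2 flags that appear
elsewhere in this thread, not the full set that was tested, and the source
file name testmap.cpp is taken from a later comment, so both are assumptions.

```python
# Illustrative subset of flags enabled at -O2 but not at -O1.
O2_ONLY_FLAGS = [
    "-fgcse",
    "-fschedule-insns",
    "-fschedule-insns2",
    "-ftree-pre",
    "-ftree-vrp",
]


def disable(flag):
    """Turn an enabling flag such as -fgcse into its -fno-gcse form."""
    if not flag.startswith("-f") or flag.startswith("-fno-"):
        raise ValueError(f"not an enabling -f flag: {flag}")
    return "-fno-" + flag[2:]


def commands(flags, source="testmap.cpp"):
    """One -O2 compile command per flag, with just that flag disabled."""
    return [["g++", "-O2", disable(f), "-c", source] for f in flags]


if __name__ == "__main__":
    for cmd in commands(O2_ONLY_FLAGS):
        print(" ".join(cmd))
```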