http://gcc.gnu.org/bugzilla/show_bug.cgi?id=46763

           Summary: gcc 4.5: missed optimization: copy global to local,
                    prefetch
           Product: gcc
           Version: 4.5.1
            Status: UNCONFIRMED
          Severity: normal
          Priority: P3
         Component: tree-optimization
        AssignedTo: unassig...@gcc.gnu.org
        ReportedBy: edwinto...@gmail.com


Created attachment 22601
  --> http://gcc.gnu.org/bugzilla/attachment.cgi?id=22601
gy.i.bz2

I made a simple change to OCaml's GC: copy a global to a local variable
(restoring the global before calling any external function), and add a
prefetchnta.
The global-to-local change is worth a ~4% speedup, the prefetchnta alone ~8%,
and both together ~10%.
I would expect GCC to do this optimization by itself (at least the
global-to-register one).
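
To illustrate the kind of transformation meant here, below is a minimal sketch,
not the code from the attachment. The names sweep_hp, on_large_block,
sweep_global and sweep_local, the heap layout (one header word holding the
payload size in words), and the choice of prefetch address are all assumptions
made for the example.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical global sweep cursor, standing in for caml_gc_sweep_hp. */
size_t *sweep_hp;

/* Hypothetical external function that is allowed to read the global cursor. */
size_t *hp_seen_by_callback;
void on_large_block(void) { hp_seen_by_callback = sweep_hp; }

/* Baseline: the cursor lives in the global for the whole loop. */
size_t sweep_global(size_t *limit)
{
    size_t words = 0;
    while (sweep_hp < limit) {
        size_t n = *sweep_hp;            /* header: payload size in words */
        if (n > 8)
            on_large_block();
        words += n;
        sweep_hp += n + 1;               /* skip header + payload */
    }
    assert(sweep_hp == limit);           /* well-formed heap ends at limit */
    return words;
}

/* Manual variant: cursor kept in a local, global restored around calls. */
size_t sweep_local(size_t *limit)
{
    size_t *hp = sweep_hp;               /* copy global to local */
    size_t words = 0;
    while (hp < limit) {
        size_t n = *hp;
        /* rw = 0 (read), locality = 0 (non-temporal hint). Prefetching the
           next header is an assumption of this sketch. */
        __builtin_prefetch(hp + n + 1, 0, 0);
        if (n > 8) {
            sweep_hp = hp;               /* restore before the external call */
            on_large_block();
            hp = sweep_hp;               /* reload in case the callee moved it */
        }
        words += n;
        hp += n + 1;
    }
    assert(hp == limit);
    sweep_hp = hp;                       /* write the cursor back once */
    return words;
}
```

Both functions compute the same result and leave sweep_hp in the same state;
the second only makes explicit what I would hope the compiler could do when the
global is not otherwise observed inside the loop.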

Attached is a testcase to show the missed optimization, the relevant function
is sweep_slice (and its manually optimized variants sweep_slice2, ...):
$ gcc-4.5 gy.i -O2 -lm
$ ./a.out
             default: 1.325195s ( 100.0%)
            glob2loc: 1.268875s ( 95.8% +- 1.024%)
         prefetchnta: 1.207342s ( 91.1% +- 0.4986%)
            prefetch: 1.277638s ( 96.4% +- 0.1179%)
glob2loc+prefetchnta: 1.199906s ( 90.5% +- 0.3629%)


default is the original function (sweep_slice); glob2loc is my manual
global-to-local optimization of caml_gc_sweep_hp; prefetchnta and prefetch each
add a __builtin_prefetch call (the non-temporal prefetch works very well here);
the last one combines both manual optimizations, resulting in a 9.5% speedup.

The attached testcase is quite large because I dumped the sizes of all objects
from the GC to get a realistic GC run; I also included all functions needed for
the GC to run.

gcc-4.5 and gcc-4.4 both miss this optimization; I didn't try older versions.
BTW, OCaml compiles with just -O -fno-defer-pop instead of -O2, but -O vs. -O2
doesn't make much difference on this testcase, so I used -O2.

$ gcc-4.5 -v
Using built-in specs.
COLLECT_GCC=gcc-4.5
COLLECT_LTO_WRAPPER=/usr/lib/gcc/x86_64-linux-gnu/4.5.1/lto-wrapper
Target: x86_64-linux-gnu
Configured with: ../src/configure -v --with-pkgversion='Debian 4.5.1-11'
--with-bugurl=file:///usr/share/doc/gcc-4.5/README.Bugs
--enable-languages=c,c++,fortran,objc,obj-c++ --prefix=/usr
--program-suffix=-4.5 --enable-shared --enable-multiarch
--enable-linker-build-id --with-system-zlib --libexecdir=/usr/lib
--without-included-gettext --enable-threads=posix
--with-gxx-include-dir=/usr/include/c++/4.5 --libdir=/usr/lib --enable-nls
--enable-clocale=gnu --enable-libstdcxx-debug --enable-libstdcxx-time=yes
--enable-plugin --enable-gold --enable-ld=default --with-plugin-ld=ld.gold
--enable-objc-gc --with-arch-32=i586 --with-tune=generic
--enable-checking=release --build=x86_64-linux-gnu --host=x86_64-linux-gnu
--target=x86_64-linux-gnu
Thread model: posix
gcc version 4.5.1 (Debian 4.5.1-11)

CPU: AMD Phenom(tm) II X6 1090T Processor
uname -a: Linux debian 2.6.36-phenom #107 SMP PREEMPT Sat Oct 23 10:30:01 EEST
2010 x86_64 GNU/Linux
