Benchmarking the attached file (preprocessed source), I found that gcc-4.4 is somewhat slower (about 4% in my case) than gcc-4.3 on x86_64 when tuned for core2:
Code compiled with -O3 -march=core2

Versions:

g++-4.3 -v
Using built-in specs.
Target: x86_64-linux-gnu
Configured with: ../src/configure -v --with-pkgversion='Ubuntu 4.3.2-0ubuntu3' --with-bugurl=file:///usr/share/doc/gcc-4.3/README.Bugs --enable-languages=c,c++,fortran,objc,obj-c++ --prefix=/usr --enable-shared --with-system-zlib --libexecdir=/usr/lib --without-included-gettext --enable-threads=posix --enable-nls --with-gxx-include-dir=/usr/include/c++/4.3 --program-suffix=-4.3 --enable-clocale=gnu --enable-libstdcxx-debug --enable-objc-gc --enable-mpfr --enable-checking=release --build=x86_64-linux-gnu --host=x86_64-linux-gnu --target=x86_64-linux-gnu
Thread model: posix
gcc version 4.3.2 (Ubuntu 4.3.2-0ubuntu3)

g++-4.4 -v
Using built-in specs.
Target: x86_64-linux-gnu
Configured with: ../gcc-4.4-20080815/configure --enable-languages=c,c++,fortran,objc,obj-c++ --enable-shared --with-system-zlib --enable-mpfr --enable-checking=release --build=x86_64-linux-gnu --host=x86_64-linux-gnu --target=x86_64-linux-gnu --without-included-gettext --enable-threads=posix --enable-nls --with-gxx-include-dir=/usr/local/include/c++/4.4 --program-suffix=-4.4 --enable-clocale=gnu --enable-libstdcxx-debug --enable-objc-gc
Thread model: posix
gcc version 4.4.0 20080815 (experimental) (GCC)

gcc-4.3 produces:

0000000000401040 <loop(nova::biquad<float, float, false, true>&, float*, float*, int)>:
  401040:  ff c9                    dec    %ecx
  401042:  f3 0f 10 6f 18           movss  0x18(%rdi),%xmm5
  401047:  f3 0f 10 67 14           movss  0x14(%rdi),%xmm4
  40104c:  31 c0                    xor    %eax,%eax
  40104e:  f3 0f 10 35 6a 0d 00     movss  0xd6a(%rip),%xmm6        # 401dc0 <boost::array<float, 3ul>::operator[](unsigned long)::__PRETTY_FUNCTION__+0x60>
  401055:  00
  401056:  48 8d 0c 8d 04 00 00     lea    0x4(,%rcx,4),%rcx
  40105d:  00
  40105e:  66 90                    xchg   %ax,%ax

while gcc-4.4 produces:

0000000000400fe0 <loop(nova::biquad<float, float, false, true>&, float*, float*, int)>:
  400fe0:  49 89 d0                 mov    %rdx,%r8
  400fe3:  f3 0f 10 67 14           movss  0x14(%rdi),%xmm4
  400fe8:  8b 47 18                 mov    0x18(%rdi),%eax
  400feb:  66 0f 7e e2              movd   %xmm4,%edx
  400fef:  48 c1 e0 20              shl    $0x20,%rax
  400ff3:  89 d2                    mov    %edx,%edx
  400ff5:  ff c9                    dec    %ecx
  400ff7:  48 09 d0                 or     %rdx,%rax
  400ffa:  f3 0f 10 35 3e 0d 00     movss  0xd3e(%rip),%xmm6        # 401d40 <boost::array<float, 3ul>::operator[](unsigned long)::__PRETTY_FUNCTION__+0x60>
  401001:  00
  401002:  48 c1 e8 20              shr    $0x20,%rax
  401006:  48 8d 14 8d 04 00 00     lea    0x4(,%rcx,4),%rdx
  40100d:  00
  40100e:  66 0f 6e e8              movd   %eax,%xmm5
  401012:  31 c0                    xor    %eax,%eax

The rest of the code is equivalent.

I am not really familiar with x86_64 assembly, but what gcc-4.4 emits as

  mov    0x18(%rdi),%eax
  movd   %eax,%xmm5

has been realized by gcc-4.3 as the single instruction

  movss  0x18(%rdi),%xmm5

and in the rest of the gcc-4.3 code, registers seem to be allocated and reused in a more efficient way.


-- 
           Summary: [4.4 regression] speed regression
           Product: gcc
           Version: 4.4.0
            Status: UNCONFIRMED
          Severity: normal
          Priority: P3
         Component: target
        AssignedTo: unassigned at gcc dot gnu dot org
        ReportedBy: tim at klingt dot org


http://gcc.gnu.org/bugzilla/show_bug.cgi?id=37437