https://gcc.gnu.org/bugzilla/show_bug.cgi?id=65363
            Bug ID: 65363
           Summary: trivial redundant code reordering makes code less optimal
           Product: gcc
           Version: 5.0
            Status: UNCONFIRMED
          Severity: minor
          Priority: P3
         Component: tree-optimization
          Assignee: unassigned at gcc dot gnu.org
          Reporter: vries at gcc dot gnu.org

Consider this test-case test.c (based on PR65270 comment 27/28):
...
#define N 100000

struct a
{
  int a[N];
};

typedef struct a misaligned_t __attribute__ ((aligned (8)));
typedef struct a aligned_t __attribute__ ((aligned (32)));

static void
__attribute__ ((noinline))
__attribute__ ((noclone))
__attribute__ ((used))
t (void *a, aligned_t *d)
{
  int v, v2;
  int i;
  for (i=0; i < N; i++)
    {
#if REORDER
      v2 = ((misaligned_t *)a)->a[i];
      v = ((aligned_t *)a)->a[i];
#else
      v = ((aligned_t *)a)->a[i];
      v2 = ((misaligned_t *)a)->a[i];
#endif
      d->a[i] += v + v2;
    }
}

aligned_t aa;
aligned_t d;

int
main (void)
{
  t (&aa, &d);
  return 0;
}
...

Changing the order of loads in the loop body results in different instructions
(and I assume the unaligned one (movdqu) is more expensive than the aligned one
(movdqa)):
...
$ n=0; gcc -O2 -ftree-vectorize test.c -DREORDER=$n -S -o $n
$ n=1; gcc -O2 -ftree-vectorize test.c -DREORDER=$n -S -o $n
$ diff -u 0 1
--- 0   2015-03-09 15:46:41.395919753 +0100
+++ 1   2015-03-09 15:46:43.747919840 +0100
@@ -19,7 +19,7 @@
        .p2align 4,,10
        .p2align 3
 .L4:
-       movdqa  (%rdi,%rax), %xmm0
+       movdqu  (%rdi,%rax), %xmm0
        paddd   %xmm0, %xmm0
        paddd   (%rsi,%rax), %xmm0
        movaps  %xmm0, (%rsi,%rax)
...

The two loads are redundant, and fre is the pass that picks the first one and
eliminates the second one.

I'm not sure though whether you want to fix this particular example in fre.
Perhaps you want to propagate alignment before doing fre.

OTOH, fre does not take the cost of the value producing statements into account
when determining which to choose as representative and which to eliminate.
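
For illustration only, the idea of making the alignment known before fre picks a representative can be approximated at the source level with GCC's `__builtin_assume_aligned`. The sketch below is an assumption-laden variant of the test-case, not a proposed fix: the function name `t_assume_aligned` is made up, N is reduced, and it assumes the caller really passes a 32-byte-aligned pointer. Whether the vectorizer then emits movdqa regardless of load order has not been verified here.

```c
#define N 1024

struct a
{
  int a[N];
};

typedef struct a aligned_t __attribute__ ((aligned (32)));

/* Hypothetical variant of t () from the test-case.  The alignment is
   stated once on the pointer itself, so it no longer depends on which
   of the two redundant loads is kept as the representative.  */
void
t_assume_aligned (void *a, aligned_t *d)
{
  /* __builtin_assume_aligned returns its first argument; the compiler
     may assume the result is aligned to at least 32 bytes.  Passing a
     pointer that is not so aligned is undefined behavior.  */
  const struct a *p = __builtin_assume_aligned (a, 32);
  int i;

  for (i = 0; i < N; i++)
    {
      /* Same two redundant loads as in the original loop body.  */
      int v = p->a[i];
      int v2 = p->a[i];
      d->a[i] += v + v2;
    }
}
```

Calling it as `t_assume_aligned (&aa, &d)` with both objects declared as `aligned_t` satisfies the alignment assumption. The semantics are unchanged from the original t (): each element of d is increased by twice the corresponding element of the source.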