gcc (GCC) 4.4.1 20090725 (Red Hat 4.4.1-2)

The testcase (built with -Wall -O3):

#include <math.h>

void MulPi(float * __attribute__((aligned(16))) i,
           float * __attribute__((aligned(16))) f, int n)
{
    for (int j = 0; j < n; j++)
        f[j] = (float) M_PI * i[j];
}

produces the following for the vectorized version of the loop:

.L7:
        movaps  %xmm1, %xmm0            # zero XMM0
        incl    %ecx
        movlps  (%rdi,%rax), %xmm0      # load the low half into XMM0
        movhps  8(%rdi,%rax), %xmm0     # load the high half into XMM0
        mulps   %xmm2, %xmm0            # multiply by pi
        movaps  %xmm0, (%rsi,%rax)      # store to memory
        addq    $16, %rax
        cmpl    %r8d, %ecx
        jb      .L7

--
           Summary: vector loads are unnecessarily split into high and low
                    loads
           Product: gcc
           Version: 4.4.1
            Status: UNCONFIRMED
          Severity: normal
          Priority: P3
         Component: tree-optimization
        AssignedTo: unassigned at gcc dot gnu dot org
        ReportedBy: nmiell at comcast dot net
 GCC build triplet: x86_64-linux-gnu
  GCC host triplet: x86_64-linux-gnu
GCC target triplet: x86_64-linux-gnu

http://gcc.gnu.org/bugzilla/show_bug.cgi?id=41464