[Bug target/100077] New: x86: by-value floating point array in struct - xmm regs spilling to stack

2021-04-14 Thread michaeljclark at mac dot com via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=100077

Bug ID: 100077
   Summary: x86: by-value floating point array in struct - xmm
regs spilling to stack
   Product: gcc
   Version: 10.3.0
Status: UNCONFIRMED
  Severity: normal
  Priority: P3
 Component: target
  Assignee: unassigned at gcc dot gnu.org
  Reporter: michaeljclark at mac dot com
  Target Milestone: ---

Hi,

I am compiling a vec3 cross product using struct by-value arguments with
MSVC, Clang and GCC. GCC goes through memory on the stack; the operands
are passed by value, so I can't use restrict. The same happens with -O2
and -Os. I vaguely remember seeing this a couple of times, but I searched
for an existing report and couldn't find a duplicate.

Link comparing the three compilers: https://godbolt.org/z/YWWfYxbM3

MSVC:  /O2 /fp:fast /arch:AVX2
Clang: -Os -mavx -x c
GCC: -Os -mavx -x c

--- BEGIN EXAMPLE ---

struct vec3a { float v[3]; };
typedef struct vec3a vec3a;

vec3a vec3f_cross_0(vec3a v1, vec3a v2)
{
    vec3a dest = {
        v1.v[1]*v2.v[2]-v1.v[2]*v2.v[1],
        v1.v[2]*v2.v[0]-v1.v[0]*v2.v[2],
        v1.v[0]*v2.v[1]-v1.v[1]*v2.v[0]
    };
    return dest;
}

struct vec3f { float x, y, z; };
typedef struct vec3f vec3f;

vec3f vec3f_cross_1(vec3f v1, vec3f v2)
{
    vec3f dest = {
        v1.y*v2.z-v1.z*v2.y,
        v1.z*v2.x-v1.x*v2.z,
        v1.x*v2.y-v1.y*v2.x
    };
    return dest;
}

void vec3f_cross_2(float dest[3], float v1[3], float v2[3])
{
    dest[0]=v1[1]*v2[2]-v1[2]*v2[1];
    dest[1]=v1[2]*v2[0]-v1[0]*v2[2];
    dest[2]=v1[0]*v2[1]-v1[1]*v2[0];
}

--- END EXAMPLE ---
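
For comparison (not part of the original report), here is a minimal sketch of a
register-only cross product written with SSE intrinsics; it assumes the vector
is widened to a 4-float __m128 with an unused fourth lane. This is roughly the
shuffle/multiply/subtract shape one would hope the struct-by-value versions
could lower to without spilling to the stack.

#include <xmmintrin.h>

/* Hypothetical illustration, not from the bug report: cross(a,b) is computed
   as a*b.yzx - a.yzx*b, then shuffled back by .yzx; the fourth lane is unused. */
static inline __m128 vec3f_cross_sse(__m128 a, __m128 b)
{
    __m128 a_yzx = _mm_shuffle_ps(a, a, _MM_SHUFFLE(3, 0, 2, 1));
    __m128 b_yzx = _mm_shuffle_ps(b, b, _MM_SHUFFLE(3, 0, 2, 1));
    __m128 c = _mm_sub_ps(_mm_mul_ps(a, b_yzx), _mm_mul_ps(a_yzx, b));
    return _mm_shuffle_ps(c, c, _MM_SHUFFLE(3, 0, 2, 1));
}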

[Bug target/70053] Returning a struct of _Decimal128 values generates extraneous stores and loads

2021-01-30 Thread michaeljclark at mac dot com via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=70053

Michael Clark  changed:

   What|Removed |Added

 CC||michaeljclark at mac dot com

--- Comment #10 from Michael Clark  ---
Another data point: I am seeing something similar on x86-64. The SysV
x86-64 ABI specifies that _Decimal128 is passed in xmm registers, so I
believe the stack stores here are redundant.

; cat > dec1.c << EOF
_Decimal128 add_d(_Decimal128 a, _Decimal128 b) { return a + b; }
EOF
; gcc -O2 -S -masm=intel dec1.c 
; cat dec1.s
add_d:
.LFB0:
        .cfi_startproc
        endbr64
        sub     rsp, 40
        .cfi_def_cfa_offset 48
        movaps  XMMWORD PTR [rsp], xmm0
        movaps  XMMWORD PTR 16[rsp], xmm1
        call    __bid_addtd3@PLT
        movaps  XMMWORD PTR [rsp], xmm0
        add     rsp, 40
        .cfi_def_cfa_offset 8
        ret
        .cfi_endproc
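
For reference, the bug title concerns returning a struct of _Decimal128 values;
the original testcase is not quoted here, but a minimal sketch of that shape
(my own, hypothetical reconstruction) is:

/* Hypothetical sketch of the struct-return shape named in the bug title; not
   the original 70053 testcase. Compile with e.g. gcc -O2 -S to inspect the
   stores and loads emitted around the __bid_* libcalls. */
typedef struct { _Decimal128 a, b; } dpair;

dpair add_pair(dpair x, dpair y)
{
    return (dpair){ x.a + y.a, x.b + y.b };
}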

[Bug target/96201] New: x86 movsd/movsq string instructions and alignment inference

2020-07-14 Thread michaeljclark at mac dot com
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=96201

Bug ID: 96201
   Summary: x86 movsd/movsq string instructions and alignment
inference
   Product: gcc
   Version: 10.1.1
Status: UNCONFIRMED
  Severity: normal
  Priority: P3
 Component: target
  Assignee: unassigned at gcc dot gnu.org
  Reporter: michaeljclark at mac dot com
  Target Milestone: ---

Taking the time to record some observations and extract minimal test code for
alignment inference and x86 string instruction selection.

GCC 9 and GCC 10 do not generate x86 string instructions in some cases,
apparently because the compiler believes the addresses are not aligned.

GCC 10 appears to have an additional issue whereby x86 string instructions are
not selected unless the address is aligned to twice the natural alignment.

Two observations:

* (GCC 9/10) integer alignment is not inferred from expressions such as x & ~3
* (GCC 10) __builtin_assume_aligned appears to require double the alignment

The double alignment issue was observed with both int/movsd and long/movsq:
GCC 10 will only generate movsd or movsq if the asserted alignment is double
the type's natural alignment. The test case here is for int; a sketch of the
long/movsq analogue appears after the assembly output below.


--- BEGIN SAMPLE CODE ---

void f1(long d, long s, unsigned n)
{
    int *sn = (int*)( (long)(s) & ~3l );
    int *dn = (int*)( (long)(d) & ~3l );
    int *de = (int*)( (long)(d + n) & ~3l );

    while (dn < de) *dn++ = *sn++;
}

void f2(long d, long s, unsigned n)
{
    int *sn = (int*)( (long)(s) & ~7l );
    int *dn = (int*)( (long)(d) & ~7l );
    int *de = (int*)( (long)(d + n) & ~7l );

    while (dn < de) *dn++ = *sn++;
}

void f3(long d, long s, unsigned n)
{
    int *sn = __builtin_assume_aligned( (int*)( (long)(s) & ~3l ), 4 );
    int *dn = __builtin_assume_aligned( (int*)( (long)(d) & ~3l ), 4 );
    int *de = __builtin_assume_aligned( (int*)( (long)(d + n) & ~3l ), 4 );

    while (dn < de) *dn++ = *sn++;
}

void f4(long d, long s, unsigned n)
{
    int *sn = __builtin_assume_aligned( (int*)( (long)(s) & ~3l ), 8 );
    int *dn = __builtin_assume_aligned( (int*)( (long)(d) & ~3l ), 8 );
    int *de = __builtin_assume_aligned( (int*)( (long)(d + n) & ~3l ), 8 );

    while (dn < de) *dn++ = *sn++;
}

--- END SAMPLE CODE ---


GCC 9 generates this for f1 and f2, and GCC 10 generates this for f1, f2 and f3:

.Ln:
        leaq    (%rax,%rsi), %rcx
        movq    %rax, %rdx
        addq    $4, %rax
        movl    (%rcx), %ecx
        movl    %ecx, (%rdx)
        cmpq    %rax, %rdi
        ja      .Ln

GCC 9 generates this for f3 and f4, and GCC 10 generates this only for f4:

.Ln:
        movsl
        cmpq    %rdi, %rdx
        ja      .Ln
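
For completeness, a sketch (not part of the original report) of the long/movsq
analogue mentioned above, following the same pattern as f3/f4; under the same
behaviour GCC 10 would only be expected to emit movsq when the asserted
alignment is 16 rather than the natural 8.

/* Hypothetical long/movsq analogue of f3/f4 above; not from the original
   report. g3 asserts the natural alignment (8), g4 asserts double (16). */
void g3(long d, long s, unsigned n)
{
    long *sn = __builtin_assume_aligned( (long*)( (long)(s) & ~7l ), 8 );
    long *dn = __builtin_assume_aligned( (long*)( (long)(d) & ~7l ), 8 );
    long *de = __builtin_assume_aligned( (long*)( (long)(d + n) & ~7l ), 8 );

    while (dn < de) *dn++ = *sn++;
}

void g4(long d, long s, unsigned n)
{
    long *sn = __builtin_assume_aligned( (long*)( (long)(s) & ~7l ), 16 );
    long *dn = __builtin_assume_aligned( (long*)( (long)(d) & ~7l ), 16 );
    long *de = __builtin_assume_aligned( (long*)( (long)(d + n) & ~7l ), 16 );

    while (dn < de) *dn++ = *sn++;
}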

[Bug target/95251] New: x86 code size expansion inserting field into a union

2020-05-20 Thread michaeljclark at mac dot com
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=95251

Bug ID: 95251
   Summary: x86 code size expansion inserting field into a union
   Product: gcc
   Version: 10.1.1
Status: UNCONFIRMED
  Severity: normal
  Priority: P3
 Component: target
  Assignee: unassigned at gcc dot gnu.org
  Reporter: michaeljclark at mac dot com
  Target Milestone: ---

While testing code on Godbolt I came across some pathological code amplification
when SSE is enabled for field insertion into a structure containing a union.

Here is the Godbolt link: https://godbolt.org/z/z_RpFt

Compiler flags: gcc -Os --save-temps -march=ivybridge -c x7b00.c

The function `x7b00`, inserts into the structure via char fields and it has a
voluminous translation (30 instructions).  The functionally equivalent `xyb87`
inserts into the structure via an 64-bit integer and it translates simply (5
instructions). `x`, `a7x` and `x7bcd` are for comparison.

Not adding  -march=ivybridge improves the code size but it is still nowhere
near optimal. `xyb87` serves as a reference for near optimal translation. It
seemed worthy of filing a bug due to the observed code amplification factor
(6X).

Can the backend choose the non-SSE code generation if it is more efficient?


--- CODE SNIPPET BEGINS ---

typedef unsigned long long u64;
typedef char u8;

typedef struct mr
{
    union {
        u64 y;
        struct {
            u8 a,b,c,d;
        } i;
    } u;
    u64 x;
} mr;

u64 x(mr mr) { return mr.x; }
mr a7x(u64 x) { return (mr) { .u = { .i = { 7,0,0,0 } }, .x = x }; }
mr x7bcd(u64 x,u8 b,u8 c,u8 d) { return (mr) {.u={.i={7,b,c,d }}, .x=x }; }
mr xyb87(u64 x, u8 b) { return (mr) {.u={ .y =(u64)b << 8|7},.x=x }; }
mr x7b00(u64 x, u8 b) { return (mr) {.u={ .i ={7,b,0,0}}, .x=x }; }


--- EXPECTED OUTPUT ---

        .cfi_startproc
        endbr64
        movsbq  %sil, %rax
        movq    %rdi, %rdx
        salq    $8, %rax
        orq     $7, %rax
        ret
        .cfi_endproc


--- OBSERVED OUTPUT ---

        .cfi_startproc
        endbr64
        pushq   %rbp
        .cfi_def_cfa_offset 16
        .cfi_offset 6, -16
        movq    %rdi, %r8
        xorl    %eax, %eax
        movl    $6, %ecx
        movq    %rsp, %rbp
        .cfi_def_cfa_register 6
        andq    $-32, %rsp
        leaq    -32(%rsp), %rdi
        rep stosb
        movq    $0, -48(%rsp)
        movabsq $281474976710655, %rax
        movq    $0, -40(%rsp)
        movq    -48(%rsp), %rdx
        andq    -32(%rsp), %rax
        movzwl  %dx, %edx
        salq    $16, %rax
        orq     %rax, %rdx
        movq    %rdx, -48(%rsp)
        movb    $7, -48(%rsp)
        vmovdqa -48(%rsp), %xmm1
        vpinsrb $1, %esi, %xmm1, %xmm0
        vmovaps %xmm0, -48(%rsp)
        movq    -48(%rsp), %rax
        movq    %r8, -40(%rsp)
        movq    -40(%rsp), %rdx
        leave
        .cfi_def_cfa 7, 8
        ret
        .cfi_endproc
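
As an aside, the claim that `x7b00` and `xyb87` are functionally equivalent
rests on little-endian layout: bytes {7, b, 0, 0} written through `u.i` overlay
the low 32 bits of `u.y`, giving ((u64)b << 8) | 7 when the remaining bytes are
zero. A small self-contained check of that reasoning (my own sketch, not from
the report):

/* Hypothetical check, not from the original report: on little-endian x86-64,
   overlaying the bytes {7, b, 0, 0} onto a zeroed 64-bit value yields
   ((u64)b << 8) | 7, here with b = 0x2a. */
#include <assert.h>
#include <string.h>

int main(void)
{
    unsigned long long y = 0;
    unsigned char bytes[4] = { 7, 0x2a, 0, 0 };
    memcpy(&y, bytes, sizeof bytes);   /* overlay the low four bytes */
    assert(y == ((unsigned long long)0x2a << 8 | 7));
    return 0;
}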

[Bug target/82261] x86: missing peephole for SHLD / SHRD

2020-05-18 Thread michaeljclark at mac dot com
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=82261

Michael Clark  changed:

   What|Removed |Added

 CC||michaeljclark at mac dot com

--- Comment #2 from Michael Clark  ---
Just refreshing this issue. I found it while testing some code-gen on Godbolt:

- https://godbolt.org/z/uXGxZ9

I noticed that Haswell code-gen uses SHRX/SHLX, but I think -Os and pre-Haswell
targets would benefit from this peephole if it is not complex to add. Note that
Clang prefers SHLD/SHRD over the SHRX+SHLX pair regardless of the -march flavor.
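
For reference, SHLD/SHRD implement the classic double-word shift idiom, where a
wider value held in two 64-bit halves is shifted and the vacated bits of one
half are filled from the other. A minimal sketch of that pattern (not the
testcase from the Godbolt link; valid for shift counts 1..63):

/* Sketch of the double-word shift idiom that maps onto shld/shrd; not the
   original testcase. shl128_hi computes the high half of (hi:lo) << k and
   shr128_lo the low half of (hi:lo) >> k, each ideally a single shld/shrd. */
unsigned long long shl128_hi(unsigned long long lo, unsigned long long hi, unsigned k)
{
    return (hi << k) | (lo >> (64 - k));
}

unsigned long long shr128_lo(unsigned long long lo, unsigned long long hi, unsigned k)
{
    return (lo >> k) | (hi << (64 - k));
}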