[Bug rtl-optimization/74585] powerpc64: Very poor code generation for homogeneous vector aggregates passed in registers

2016-08-16 Thread wschmidt at gcc dot gnu.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=74585 --- Comment #14 from Bill Schmidt --- (In reply to Richard Biener from comment #13) > > You mean stores like the following? > > (insn 13 12 14 2 (set (mem/c:V4SI (plus:DI (reg/f:DI 150 virtual-stack-vars) > (const_int 112

[Bug rtl-optimization/74585] powerpc64: Very poor code generation for homogeneous vector aggregates passed in registers

2016-08-16 Thread rguenth at gcc dot gnu.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=74585 --- Comment #13 from Richard Biener --- (In reply to Bill Schmidt from comment #11)
> With the original test case, -mcpu=power8 is problematic because of the use
> of the "swapping stores," whose RHS is a vec_select rather than a register
> or

[Bug rtl-optimization/74585] powerpc64: Very poor code generation for homogeneous vector aggregates passed in registers

2016-08-15 Thread wschmidt at gcc dot gnu.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=74585 --- Comment #12 from Bill Schmidt --- The rest of the ugly code (once you ignore the loads/stores) is horrible choices of register allocation. Need to understand why we're not making use of the high floating-point registers; too much copying

[Bug rtl-optimization/74585] powerpc64: Very poor code generation for homogeneous vector aggregates passed in registers

2016-08-15 Thread wschmidt at gcc dot gnu.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=74585 --- Comment #11 from Bill Schmidt --- With the original test case, -mcpu=power8 is problematic because of the use of the "swapping stores," whose RHS is a vec_select rather than a register or subreg. This prevents us from saving the RHS of the

[Bug rtl-optimization/74585] powerpc64: Very poor code generation for homogeneous vector aggregates passed in registers

2016-08-12 Thread wschmidt at gcc dot gnu.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=74585 --- Comment #10 from Bill Schmidt --- The dse pass is responsible for removing all the unnecessary stack activity. I think that we are probably confusing it because the stores are full vector stores, but the loads are vector element loads of

[Bug rtl-optimization/74585] powerpc64: Very poor code generation for homogeneous vector aggregates passed in registers

2016-08-12 Thread wschmidt at gcc dot gnu.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=74585 --- Comment #9 from Bill Schmidt --- We do optimize things well for the following:
  typedef struct {
    __vector double vx0;
    __vector double vx1;
    __vector double vx2;
    __vector double vx3;
  } vdoublex8_t;
  vdoublex8_t test_vecd8_rotate_left

[Bug rtl-optimization/74585] powerpc64: Very poor code generation for homogeneous vector aggregates passed in registers

2016-08-12 Thread wschmidt at gcc dot gnu.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=74585 --- Comment #8 from Bill Schmidt --- FYI, adding -mcpu=power9 to the options makes it much easier to read the RTL, as it gets rid of the extra vector swaps needed for POWER8.

[Bug rtl-optimization/74585] powerpc64: Very poor code generation for homogeneous vector aggregates passed in registers

2016-08-12 Thread wschmidt at gcc dot gnu.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=74585 Bill Schmidt changed:
  What      |Removed            |Added
  Component |tree-optimization  |rtl-optimization
  Summary   |SRA