[Bug tree-optimization/116583] vectorizable_slp_permutation cannot handle even/odd extract from VLA vector

2024-09-20 Thread tnfchris at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=116583

--- Comment #3 from Tamar Christina  ---
(In reply to Richard Biener from comment #2)
> Another example this shows is for gcc.dg/vect/slp-42.c - we definitely can
> do the interleaving scheme as non-SLP vectorization shows.
> 
> gcc.dg/vect/slp-42.c also shows we're not yet "lowering" all SLP load
> permutes.
> The original SLP attempt still has
> 
>node 0x45d5050 (max_nunits=4, refcnt=2) vector([4,4]) int
>op template: _2 = q[_1];
> stmt 0 _2 = q[_1];
> stmt 1 _8 = q[_7];
> stmt 2 _14 = q[_13];
> stmt 3 _20 = q[_19];
> load permutation { 0 1 2 3 }
>node 0x45d50e8 (max_nunits=4, refcnt=2) vector([4,4]) int
>op template: _4 = q[_3];
> stmt 0 _4 = q[_3];
> stmt 1 _10 = q[_9];
> stmt 2 _16 = q[_15];
> stmt 3 _22 = q[_21];
> load permutation { 4 5 6 7 }
> 
> instead of a single contiguous load and two VEC_PERM_EXPR nodes to extract
> the lo/hi parts (which is also extract even/odd, but with a larger mode
> encompassing 4 elements).
> 
> I'd say for VLA operation this is one of the major blockers for all-SLP.

I'll take a look once I finish the early break transition, if Richard hasn't yet :)
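
For reference, a sketch of the loop shape under discussion (modelled loosely on
gcc.dg/vect/slp-42.c; illustrative, not the exact testcase):

void f (int *__restrict p, int *__restrict q, int n)
{
  for (int i = 0; i < n; i++)
    {
      /* Two SLP load nodes with load permutations { 0 1 2 3 } and
         { 4 5 6 7 }; one contiguous load of 8 elements plus two
         VEC_PERM_EXPRs extracting the lo/hi halves would also work
         for VLA vectors.  */
      p[4 * i + 0] = q[8 * i + 0] + q[8 * i + 4];
      p[4 * i + 1] = q[8 * i + 1] + q[8 * i + 5];
      p[4 * i + 2] = q[8 * i + 2] + q[8 * i + 6];
      p[4 * i + 3] = q[8 * i + 3] + q[8 * i + 7];
    }
}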

[Bug tree-optimization/116684] [vectorization][x86-64] dot_16x1x16_uint8_int8_int32 could be better optimized

2024-09-11 Thread tnfchris at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=116684

Tamar Christina  changed:

   What|Removed |Added

 CC||victorldn at gcc dot gnu.org

--- Comment #2 from Tamar Christina  ---
(In reply to ktkachov from comment #1)
> Indeed. Curiously, for aarch64 at -O2 GCC is smart enough to recognise a
> USDOT instruction but at -O3 (-mcpu=neoverse-v2) it all gets synthesised

Looks like SLP discovery fails to notice it's a reduction; we do have code to
find reductions with SLP, but it seems that the issue here is that the store is
used to start the discovery.

The same happens with a normal dotprod:

#include <stdint.h>

void
dot_16x1x16_uint8_int8_int32(
   uint8_t data[restrict 4],
   uint8_t kernel[restrict 16][4],
   int32_t output[restrict 16])
{
  for (int i = 0; i < 16; i++)
for (int k = 0; k < 4; k++)
  output[i] += data[k] * kernel[i][k];
}

> The O3 version does fully unroll the loop so it's probably better but maybe
> it could do a better job of using USDOT?

Yeah, we could get the same effect by implementing the
vect_recog_widen_sum_pattern using dotprod accumulating into a zero register,
and then combine should be able to do the right things.

Victor had a patch at some point I think...

But the real fix is teaching SLP discovery that there's a reduction here.
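
For reference, a scalar sketch of what one lane of such a dot-product reduction
computes (illustrative only; usdot_lane is a hypothetical name, not a GCC API):

#include <stdint.h>

/* Each 32-bit lane of a dot-product instruction accumulates four
   byte-wide products; for USDOT the first operand is unsigned and the
   second signed.  */
int32_t usdot_lane (int32_t acc, const uint8_t u[4], const int8_t s[4])
{
  for (int k = 0; k < 4; k++)
    acc += (int32_t) u[k] * (int32_t) s[k];
  return acc;
}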

[Bug target/116667] New: missing superfluous zero-extends of SVE values

2024-09-10 Thread tnfchris at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=116667

Bug ID: 116667
   Summary: missing superfluous zero-extends of SVE values
   Product: gcc
   Version: 15.0
Status: UNCONFIRMED
  Keywords: missed-optimization
  Severity: normal
  Priority: P3
 Component: target
  Assignee: unassigned at gcc dot gnu.org
  Reporter: tnfchris at gcc dot gnu.org
  Target Milestone: ---
Target: aarch64*

We've recently started vectorizing functions such as:

void
decode (unsigned char * restrict h, unsigned char * restrict p4,
unsigned char * restrict p6, int f, int b, char * restrict e,
char * restrict a, char * restrict i)
{
int j = b % 8;
for (int k = 0; k < 2; ++k)
{
p4[k] = i[a[k]] | e[k] << j;
h[k] = p6[k] = a[k];
}
}

due to the vectorizer now correctly eliding one of the loads, making it
profitable.  With -O3 -march=armv9-a this now vectorizes and generates:

decode:
ptrue   p7.s, vl2
ptrue   p6.b, all
ld1b    z31.s, p7/z, [x6]
ld1b    z28.s, p7/z, [x5]
and w4, w4, 7
movprfx z0, z31
uxtb    z0.s, p6/m, z31.s
mov z30.s, w4
ld1b    z29.s, p7/z, [x7, z0.s, uxtw]
lslr    z30.s, p6/m, z30.s, z28.s
orr z30.d, z30.d, z29.d
st1b    z30.s, p7, [x1]
st1b    z31.s, p7, [x2]
st1b    z31.s, p7, [x0]
ret

whereas we used to generate:

decode:
ptrue   p7.s, vl2
and w4, w4, 7
ld1b    z0.s, p7/z, [x6]
ld1b    z28.s, p7/z, [x5]
ld1b    z29.s, p7/z, [x7, z0.s, uxtw]
ld1b    z31.s, p7/z, [x6]
mov z30.s, w4
ptrue   p6.b, all
lslr    z30.s, p6/m, z30.s, z28.s
orr z30.d, z30.d, z29.d
st1b    z30.s, p7, [x1]
st1b    z31.s, p7, [x2]
st1b    z31.s, p7, [x0]
ret

This is great; however, we're let down by RTL optimization.

There are a couple of weird things here.
Cleaning up the sequence a bit, the problematic parts are:

ptrue   p7.s, vl2
ptrue   p6.b, all
ld1b    z31.s, p7/z, [x6]
movprfx z0, z31
uxtb    z0.s, p6/m, z31.s
ld1b    z29.s, p7/z, [x7, z0.s, uxtw]

It zero-extends the same value in z31 three times.  In the old code we actually
loaded the same value twice, both zero-extended and not zero-extended.

The RTL for the z31 + extend is

(insn 15 13 16 2 (set (reg:VNx4QI 110 [ vect__3.6 ])
(unspec:VNx4QI [
(subreg:VNx4BI (reg:VNx16BI 120) 0)
(mem:VNx4QI (reg/v/f:DI 117 [ a ]) [0  S[4, 4] A8])
] UNSPEC_LD1_SVE)) "/app/example.c":9:24 5683
{maskloadvnx4qivnx4bi}
 (expr_list:REG_DEAD (reg/v/f:DI 117 [ a ])
(expr_list:REG_EQUAL (unspec:VNx4QI [
(const_vector:VNx4BI [
(const_int 1 [0x1]) repeated x2
repeat [
(const_int 0 [0])
(const_int 0 [0])
]
])
(mem:VNx4QI (reg/v/f:DI 117 [ a ]) [0  S[4, 4] A8])
] UNSPEC_LD1_SVE)
(nil)))
(insn 16 15 17 2 (set (reg:VNx16BI 122)
(const_vector:VNx16BI repeat [
(const_int 1 [0x1])
])) 5658 {*aarch64_sve_movvnx16bi}
 (nil))
(insn 17 16 20 2 (set (reg:VNx4SI 121 [ vect_patt_59.7_52 ])
(unspec:VNx4SI [
(subreg:VNx4BI (reg:VNx16BI 122) 0)
(zero_extend:VNx4SI (reg:VNx4QI 110 [ vect__3.6 ]))
] UNSPEC_PRED_X)) 6943 {*zero_extendvnx4qivnx4si2}
 (expr_list:REG_EQUAL (zero_extend:VNx4SI (reg:VNx4QI 110 [ vect__3.6 ]))
(nil)))

But combine refuses to merge the zero extend into the load:

deferring rescan insn with uid = 15.
allowing combination of insns 15 and 17
original costs 4 + 4 = 8
replacement costs 4 + 4 = 8
i2 didn't change, not doing this

and instead copies it into the gather load, leaving insn 17 alone, presumably
because of the predicate.  So it looks like a bug in our backend costing: the
widening load is definitely cheaper than a load + extend.

However, I'm not sure, as the line "i2 didn't change, not doing this" seems to
indicate that it wasn't rejected because of cost?

In the codegen there's a peculiarity: while the two loads

ld1b    z31.s, p7/z, [x6]
ld1b    z28.s, p7/z, [x5]

are both widening loads, they aren't modelled the same:

ld1b    z31.s, p7/z, [x6]   // 15 [c=4 l=4]  maskloadvnx4qivnx4bi
ld1b    z28.s, p7/z, [x5]   // 50 [c=4 l=4]
aarch64_load_zero_extendvnx4sivnx4qi

This is because the RTL pattern seems to want to keep the same number of
elements as the input vector size. So it ends up with a gather an

[Bug tree-optimization/115130] [meta-bug] early break vectorization

2024-09-10 Thread tnfchris at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=115130
Bug 115130 depends on bug 115866, which changed state.

Bug 115866 Summary: missed optimization vectorizing switch statements.
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=115866

   What|Removed |Added

 Status|RESOLVED|REOPENED
 Resolution|FIXED   |---

[Bug tree-optimization/115866] missed optimization vectorizing switch statements.

2024-09-10 Thread tnfchris at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=115866

Tamar Christina  changed:

   What|Removed |Added

 Resolution|FIXED   |---
 Status|RESOLVED|REOPENED

--- Comment #7 from Tamar Christina  ---
The testcase I posted above is still not lowered by ifcvt:

https://godbolt.org/z/44vd76eKx

short a[100];

int foo(int n, int counter)
{
   for (int i = 0; i < n; i++)
 {
if (a[i] == 1 || a[i] == 2 || a[i] == 7 || a[i] == 4)
  return 1;
 }
return 0;
}

still produces:

   [local count: 114863530]:
  if (n_8(D) > 0)
goto ; [94.50%]
  else
goto ; [5.50%]

   [local count: 108546036]:

   [local count: 1014686025]:
  # i_2 = PHI 
  _10 = a[i_2];
  switch (_10)  [94.50%], case 1 ... 2:  [5.50%], case 4:
 [5.50%], case 7:  [5.50%]>

   [local count: 55807731]:
:
  goto ; [100.00%]

with last night's compiler.

[Bug tree-optimization/53947] [meta-bug] vectorizer missed-optimizations

2024-09-10 Thread tnfchris at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=53947
Bug 53947 depends on bug 115866, which changed state.

Bug 115866 Summary: missed optimization vectorizing switch statements.
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=115866

   What|Removed |Added

 Status|RESOLVED|REOPENED
 Resolution|FIXED   |---

[Bug tree-optimization/116628] [15 Regression] ICE in vect_analyze_loop_1 on aarch64 with -Ofast in TSVC

2024-09-06 Thread tnfchris at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=116628

Tamar Christina  changed:

   What|Removed |Added

 Status|ASSIGNED|RESOLVED
 Resolution|--- |FIXED

--- Comment #8 from Tamar Christina  ---
Fixed, thanks for the report!

[Bug tree-optimization/116628] [15 Regression] ICE in vect_analyze_loop_1 on aarch64 with -Ofast in TSVC

2024-09-06 Thread tnfchris at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=116628

Tamar Christina  changed:

   What|Removed |Added

   Assignee|unassigned at gcc dot gnu.org  |tnfchris at gcc dot 
gnu.org
 Status|NEW |ASSIGNED

--- Comment #6 from Tamar Christina  ---
I'll take this one then.

[Bug tree-optimization/116628] [15 Regression] ICE in vect_analyze_loop_1 on aarch64 with -Ofast in TSVC

2024-09-06 Thread tnfchris at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=116628

--- Comment #5 from Tamar Christina  ---
(In reply to Richard Biener from comment #4)
> Confirmed.  The ICE means we've "fatally" failed to analyze an epilogue
> which we do not expect.
> 
> t.c:4:21: note:   worklist: examine stmt: .MASK_STORE (&a[e_10], 8B, _9 !=
> 0, _1);
> t.c:4:21: note:   vect_is_simple_use: operand _9 != 0, type of def: unknown
> t.c:4:21: missed:   Unsupported pattern.
> 
> possibly the embedded _9 != 0 is the problem?
> 
> t.c:4:21: note:   vect_recog_bool_pattern: detected: _ifc__24 = _9 ? _1 :
> _ifc__22;
> t.c:4:21: note:   bool pattern recognized: patt_8 = _9 != 0 ? _1 : _ifc__22;
> t.c:4:21: note:   vect_recog_cond_store_pattern: detected: a[e_10] =
> _ifc__24;
> t.c:4:21: note:   cond_store pattern recognized: .MASK_STORE (&a[e_10], 8B,
> _9 != 0, _1);

Hmmm if so https://gcc.gnu.org/pipermail/gcc-patches/2024-September/662146.html
should fix it?

[Bug tree-optimization/116628] [15 Regression] ICE in vect_analyze_loop_1 on aarch64 with -Ofast in TSVC

2024-09-06 Thread tnfchris at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=116628

--- Comment #3 from Tamar Christina  ---
Still seems to ICE after that commit on last night's trunk

https://godbolt.org/z/GnYT7Kx46

[Bug tree-optimization/116577] [15 Regression] tonto in SPECCPU 2006 ICEs in vect_lower_load_permutations

2024-09-03 Thread tnfchris at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=116577

--- Comment #3 from Tamar Christina  ---
The reproducer should be saved with the extension .f90.

[Bug tree-optimization/116577] [15 Regression] tonto in SPECCPU 2006 ICEs in vect_lower_load_permutations

2024-09-03 Thread tnfchris at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=116577

--- Comment #2 from Tamar Christina  ---
---
module type
   type a
   complex(kind(1.0d0)) j
   real(kind(1.0d0)) k
   real(kind(1.0d0)) l
   end type
   contains
   subroutine b(c,g)
type(a), dimension(:) :: c
 target c
 type(a), dimension(:), target :: g
 type(a), pointer :: d,h
 do i=1,e
   h => c(i)
   d  => g(i)
   h%j  = d%j
   h%l  = d%l
   h%k = f
 end do
end
   end

Compiling with -mcpu=neoverse-v1 -Ofast reproduces the ICE.

[Bug tree-optimization/116575] [15 Regression] blender in SPEC2017 ICE in vect_analyze_slp

2024-09-02 Thread tnfchris at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=116575

--- Comment #1 from Tamar Christina  ---
---
int a;
float *b, *c;
void d() {
  char *e;
  for (; a; a++, b += 4, c += 4)
if (*e++) {
  float *f = c;
  f[0] = b[0];
  f[1] = b[1];
  f[2] = b[2];
  f[3] = b[3];
}
}

Compiling with -mcpu=neoverse-v1 -Ofast reproduces the ICE.

[Bug tree-optimization/116577] New: [15 Regression] tonto in SPECCPU 2006 ICEs in vect_lower_load_permutations

2024-09-02 Thread tnfchris at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=116577

Bug ID: 116577
   Summary: [15 Regression] tonto in SPECCPU 2006 ICEs in
vect_lower_load_permutations
   Product: gcc
   Version: 15.0
Status: UNCONFIRMED
  Keywords: ice-on-valid-code
  Severity: normal
  Priority: P3
 Component: tree-optimization
  Assignee: unassigned at gcc dot gnu.org
  Reporter: tnfchris at gcc dot gnu.org
  Target Milestone: ---
Target: aarch64*

with -mcpu=neoverse-v1 -Ofast -flto tonto ICEs with

crystal.fppized.f90:1795:3: internal compiler error: Segmentation fault
 1795 |function d_chi2(p) result(res)
  |   ^
0x1c0a0f7 internal_error(char const*, ...)
   
/opt/buildAgent/work/5c94c4ced6ebfcd0/gcc/diagnostic-global-context.cc:492
0xcff233 crash_signal
/opt/buildAgent/work/5c94c4ced6ebfcd0/gcc/toplev.cc:321
0xfa9db8 vect_lower_load_permutations
/opt/buildAgent/work/5c94c4ced6ebfcd0/gcc/tree-vect-slp.cc:4354
0xfae8c3 vect_lower_load_permutations
/opt/buildAgent/work/5c94c4ced6ebfcd0/gcc/tree-vect-slp.cc:4509
0xfae8c3 vect_analyze_slp(vec_info*, unsigned int)
/opt/buildAgent/work/5c94c4ced6ebfcd0/gcc/tree-vect-slp.cc:4777
0xf83a6b vect_analyze_loop_2
/opt/buildAgent/work/5c94c4ced6ebfcd0/gcc/tree-vect-loop.cc:2862
0xf85123 vect_analyze_loop_1
/opt/buildAgent/work/5c94c4ced6ebfcd0/gcc/tree-vect-loop.cc:3409
0xf85857 vect_analyze_loop(loop*, vec_info_shared*)
/opt/buildAgent/work/5c94c4ced6ebfcd0/gcc/tree-vect-loop.cc:3567
0xfc3cef try_vectorize_loop_1
/opt/buildAgent/work/5c94c4ced6ebfcd0/gcc/tree-vectorizer.cc:1068
0xfc3cef try_vectorize_loop
/opt/buildAgent/work/5c94c4ced6ebfcd0/gcc/tree-vectorizer.cc:1184
0xfc4223 execute
/opt/buildAgent/work/5c94c4ced6ebfcd0/gcc/tree-vectorizer.cc:1300

Running the reducer and bisecting.

[Bug middle-end/116575] New: [15 Regression] blender in SPEC2017 ICE in vect_analyze_slp

2024-09-02 Thread tnfchris at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=116575

Bug ID: 116575
   Summary: [15 Regression] blender in SPEC2017 ICE in
vect_analyze_slp
   Product: gcc
   Version: 15.0
Status: UNCONFIRMED
  Keywords: ice-on-valid-code
  Severity: normal
  Priority: P3
 Component: middle-end
  Assignee: unassigned at gcc dot gnu.org
  Reporter: tnfchris at gcc dot gnu.org
  Target Milestone: ---
Target: aarch64*

Blender from spec2017 ICEs when compiled with -Ofast -flto -mcpu=neoverse-v1
with

during GIMPLE pass: vect
blender/source/blender/editors/object/object_bake_api.c: In function
'write_internal_bake_pixels':
blender/source/blender/editors/object/object_bake_api.c:173:13: internal
compiler error: in vect_analyze_slp, at tree-vect-slp.cc:4765
  173 | static bool write_internal_bake_pixels(
  | ^
0x1c0a0f7 internal_error(char const*, ...)
   
/opt/buildAgent/work/5c94c4ced6ebfcd0/gcc/diagnostic-global-context.cc:492
0x7bb0c7 fancy_abort(char const*, int, char const*)
/opt/buildAgent/work/5c94c4ced6ebfcd0/gcc/diagnostic.cc:1658
0xfaf1bb vect_analyze_slp(vec_info*, unsigned int)
/opt/buildAgent/work/5c94c4ced6ebfcd0/gcc/tree-vect-slp.cc:4765
0xf83a6b vect_analyze_loop_2
/opt/buildAgent/work/5c94c4ced6ebfcd0/gcc/tree-vect-loop.cc:2862
0xf85123 vect_analyze_loop_1
/opt/buildAgent/work/5c94c4ced6ebfcd0/gcc/tree-vect-loop.cc:3409
0xf85857 vect_analyze_loop(loop*, vec_info_shared*)
/opt/buildAgent/work/5c94c4ced6ebfcd0/gcc/tree-vect-loop.cc:3567
0xfc3cef try_vectorize_loop_1
/opt/buildAgent/work/5c94c4ced6ebfcd0/gcc/tree-vectorizer.cc:1068
0xfc3cef try_vectorize_loop
/opt/buildAgent/work/5c94c4ced6ebfcd0/gcc/tree-vectorizer.cc:1184
0xfc4223 execute
/opt/buildAgent/work/5c94c4ced6ebfcd0/gcc/tree-vectorizer.cc:1300

Creating a reducer and bisecting.

[Bug tree-optimization/36010] Loop interchange not performed

2024-09-02 Thread tnfchris at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=36010

Tamar Christina  changed:

   What|Removed |Added

 CC||tnfchris at gcc dot gnu.org

--- Comment #7 from Tamar Christina  ---
It looks like today at -Ofast this is due to the full unrolling:

https://godbolt.org/z/3K4hPbWfG

i.e. at -Ofast we fail due to the inner loop being fully unrolled.

Would it make sense to perform loop distribution before cunrolli?

In principle it should make any potential vectorization and SLP simpler, no?

[Bug rtl-optimization/116541] [14/15 Regression] Inefficient missing use of reg+reg addressing modes

2024-09-02 Thread tnfchris at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=116541

Tamar Christina  changed:

   What|Removed |Added

 CC||wilco at gcc dot gnu.org
 Ever confirmed|0   |1
   Last reconfirmed||2024-09-02
 Status|UNCONFIRMED |NEW

--- Comment #1 from Tamar Christina  ---
(In reply to ktkachov from comment #0)
> I don't know if Tamar's pending IVOPTs fix this but filing it here just in
> case

No, this isn't an IVopts problem.  The loop has only one IV expression;
however, the way we expand the address breaks the addressing mode.

in GCC 13 we had:

(insn 19 18 20 4 (set (reg:DI 114)
(high:DI (const:DI (plus:DI (symbol_ref:DI ("c") [flags 0x82] 
)
(const_int 4 [0x4]) "/app/example.c":15:49 -1
 (nil))
(insn 20 19 21 4 (set (reg:DI 113)
(lo_sum:DI (reg:DI 114)
(const:DI (plus:DI (symbol_ref:DI ("c") [flags 0x82]  )
(const_int 4 [0x4]) "/app/example.c":15:49 -1
 (expr_list:REG_EQUAL (const:DI (plus:DI (symbol_ref:DI ("c") [flags 0x82] 
)
(const_int 4 [0x4])))
(nil)))
(insn 21 20 22 4 (set (reg/f:DI 112)
(plus:DI (reg:DI 100 [ ivtmp.21 ])
(reg:DI 113))) "/app/example.c":15:49 -1
 (nil))
(insn 22 21 23 4 (set (reg:SF 116)
(mem:SF (reg/f:DI 112) [1 MEM[(float *)&c + 4B + ivtmp.21_32 * 1]+0 S4
A32])) "/app/example.c":15:46 -1
 (nil))

and in GCC 14:

(insn 19 18 20 4 (set (reg:DI 123)
(high:DI (symbol_ref:DI ("c") [flags 0x82]  ))) "/app/example.c":15:49 -1
 (nil))
(insn 20 19 21 4 (set (reg:DI 122)
(lo_sum:DI (reg:DI 123)
(symbol_ref:DI ("c") [flags 0x82]  )))
"/app/example.c":15:49 -1
 (expr_list:REG_EQUAL (symbol_ref:DI ("c") [flags 0x82]  )
(nil)))
(insn 21 20 22 4 (set (reg/f:DI 121)
(plus:DI (reg:DI 101 [ ivtmp.22 ])
(reg:DI 122))) "/app/example.c":15:49 -1
 (nil))
(insn 22 21 23 4 (set (reg:SF 125)
(mem:SF (plus:DI (reg/f:DI 121)
(const_int 4 [0x4])) [1 MEM[(float *)&c + 4B + ivtmp.22_1 *
1]+0 S4 A32])) "/app/example.c":15:46 -1
 (nil))

i.e. the offset is now in the memory access rather than in the address
calculation.  This means that the base is updated rather than the offset, as
the offset is now a constant.
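
An illustrative reconstruction of the kind of access involved (the original
example.c isn't attached; the names here are hypothetical):

extern float c[];  /* hypothetical stand-in for the "c" symbol_ref */

float sum (long n)
{
  float s = 0.0f;
  for (long i = 0; i < n; i++)
    /* GCC 13 keeps the +4 in the address calculation (reg+reg access);
       GCC 14 folds it into the memory access as a constant offset.  */
    s += c[i + 1];
  return s;
}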

This is due to:

commit db4e496aadf1d7ab1c5af24410394d1551ddd3f0
Author: Wilco Dijkstra 
Date:   Tue Jan 16 16:27:02 2024 +

AArch64: Reassociate CONST in address expressions

GCC tends to optimistically create CONST of globals with an immediate
offset.
However it is almost always better to CSE addresses of globals and add
immediate
offsets separately (the offset could be merged later in single-use cases).
Splitting CONST expressions with an index in aarch64_legitimize_address
fixes
part of PR112573.

gcc/ChangeLog:
PR target/112573
* config/aarch64/aarch64.cc (aarch64_legitimize_address):
Reassociate
badly formed CONST expressions.

gcc/testsuite/ChangeLog:
PR target/112573
* gcc.target/aarch64/pr112573.c: Add new test.

Confirmed.

[Bug tree-optimization/116520] Multiple condition lead to missing vectorization due to missing early break

2024-08-29 Thread tnfchris at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=116520

--- Comment #4 from Tamar Christina  ---
(In reply to Tamar Christina from comment #3)
> (In reply to Richard Biener from comment #2)
> > The issue seems to be that if-conversion isn't done:
>
> I wonder if this transformation is really beneficial on modern cpus though.
> Seems like the compares are independent so the entire thing executes quite
> parallel?

and by this I mean the vector result from vectorizing the unreassociated code;
the scalar is obviously still a long dependency chain.

[Bug tree-optimization/116520] Multiple condition lead to missing vectorization due to missing early break

2024-08-29 Thread tnfchris at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=116520

--- Comment #3 from Tamar Christina  ---
(In reply to Richard Biener from comment #2)
> The issue seems to be that if-conversion isn't done:
> 
> Can not ifcvt due to multiple exits
> 
> maybe my patched dev tree arrives with a different CFG here (no switches
> into ifcvt).  I don't think if-conversion was adjusted when the vectorizer
> gained early exit vectorization 

It was adjusted only to deal with bitfield lowering, which was one of the
scenarios we had on the roadmap.

> - if-conversion shouldn't for example
> predicate with the exit condition and it should leave those conditions
> and exits around.

How would if-cvt recover what reassoc did here though?

   [local count: 1044213920]:
  # s_15 = PHI 
  _1 = *s_15;
  if (_1 > 63)
goto ; [50.00%]
  else
goto ; [50.00%]

   [local count: 522106960]:
  goto ; [100.00%]

   [local count: 522106960]:
  _14 = (int) _1;
  _17 = 9223372036854785024 >> _14;
  _18 = _17 & 1;
  _19 = _18 == 0;
  _12 = ~_19;

   [local count: 1044213920]:
  # prephitmp_4 = PHI <_12(4), 0(11)>
  _10 = _1 == 92;
  _13 = prephitmp_4 | _10;
  if (_13 != 0)
goto ; [8.03%]
  else
goto ; [91.97%]

is the relevant block; wouldn't BB4 need to be fully predicated to be able to
vectorize this?  That also pushes this loop to only be vectorizable when fully
masked, whereas the original input doesn't require masking.

As a side note, it looks like it's reassoc that's transforming and merging the
conditions, i.e. https://godbolt.org/z/9Kfafvava
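
At source level, the merged form corresponds to something like this (an
illustrative reconstruction from the GIMPLE above; is_special is a hypothetical
name):

int is_special (unsigned char c)
{
  /* 9223372036854785024 == 0x8000000000002400: bits 10, 13 and 63.  */
  int in_mask = c <= 63 && ((9223372036854785024ULL >> c) & 1);
  return in_mask || c == 92;
}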

I wonder if this transformation is really beneficial on modern cpus though.
Seems like the compares are independent so the entire thing executes quite
parallel?

[Bug tree-optimization/116463] [15 Regression] fast-math-complex-mls-{double,float}.c fail after r15-3087-gb07f8a301158e5

2024-08-28 Thread tnfchris at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=116463

--- Comment #11 from Tamar Christina  ---
(In reply to Richard Biener from comment #6)
> I think
> 
>   a - ((b * -c) + (d * -e))  ->  a + (b * c) + (d * e)
> 
> is a good simplification to be made, but it's difficult to do this with
> canonicalization only.  Like a * -b -> -(a * b) as the negate might
> combine with both other negates down and upstream.  But for
> a*-b + c * -d it might be more obvious to turn that into
> -a*b - c*d.

Yeah, my expectation was that this would be an easier transform, avoiding
the sharing problem we discussed before, and that indeed the transform
looks at the entire chain rather than just transforming a * -b.

a*-b + c * -d -> -a*b - c*d

has the property of still maintaining the FMS and FMNS chains and can
get further simplified in the above case.

> 
> Maybe reassoc can be of help here - IIRC it turns b * -c into
> b * c * -1, undistribute_ops_list might get that.

Hmm, I see, but don't we have a higher chance that folding will just
fold it back into the multiply?

For this to work we'd have to do

  (b * -c) + (d * -e) -> -(b * c + d * e)

in one transformation no? since I'd imagine

  (b * c * -1) + (d * e * -1)

would just be undone by match.pd?

> 
> Note one issue is that complex lowering leaves around dead stmts,
> confusing reassoc and forwprop, in particular
> 
> -  _10 = COMPLEX_EXPR <_18, _6>;
> 
> stay around until reassoc.  scheduling dce for testing shows reassoc
> does something.
> 
> It's update_complex_assignment who replaces existing complex
> stmts with COMPLEX_EXPRs, we should possibly resort do
> simple_dce_from_worklist
> to clean those.  Let me try to do that.

Thanks!

[Bug tree-optimization/116463] [15 Regression] fast-math-complex-mls-{double,float}.c fail after r15-3087-gb07f8a301158e5

2024-08-22 Thread tnfchris at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=116463

--- Comment #5 from Tamar Christina  ---
Yeah, this is because they generate different gimple sequences and thus
different SLP trees.
The core of the problem is that there's no canonical form here, plus a missing
gimple simplification rule:

  _33 = IMAGPART_EXPR <*_3> + ((REALPART_EXPR <*_5> * IMAGPART_EXPR <*_7>) +
(IMAGPART_EXPR <*_5> * REALPART_EXPR <*_7>));
vs
  _37 = IMAGPART_EXPR <*_3> - ((REALPART_EXPR <*_5> * -IMAGPART_EXPR <*_7>) +
(IMAGPART_EXPR <*_5> * -REALPART_EXPR <*_7>));

i.e. a - ((b * -c) + (d * -e)) == a + (b * c) + (d * e)

So probably in match.pd we should fold _37 into _33, which is a simpler form of
the same thing, and it's better on scalar as well.
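
Written out as scalar code, the proposed fold is (a sketch; the function names
are illustrative):

double before (double a, double b, double c, double d, double e)
{
  return a - ((b * -c) + (d * -e));
}

double after (double a, double b, double c, double d, double e)
{
  /* Same value: the negations cancel against the outer subtraction.  */
  return a + (b * c) + (d * e);
}

The second form is strictly simpler and, as noted, better on scalar too.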

It would be better to finally introduce a vectorizer canonical form, for
instance the real part generates:

  _36 = (_31 - _30) + REALPART_EXPR <*_3>;
vs
  _32 = REALPART_EXPR <*_3> + (_26 - _27);

and this is already an additional thing to check, so it would be better if SLP
build always put complex parts consistently on one side of commutative
operations so we don't have to swap operands to check.

In any case, I have some patches in this area and can take a look when I'm
back, but I think the new expression should be simplified back into the old one.

[Bug tree-optimization/116409] [15 Regression] Recent phiopt change causing ICE with sqrt and -frounding-math

2024-08-19 Thread tnfchris at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=116409

Tamar Christina  changed:

   What|Removed |Added

 CC||tnfchris at gcc dot gnu.org
 Target|riscv64-linux-gnu   |riscv64-linux-gnu,
   ||aarch64-linux-gnu

--- Comment #7 from Tamar Christina  ---
I see the same error in povray on SPECCPU 2006 and 2017.

[Bug target/116229] [15 Regression] wrong code at -Ofast aarch64 due to missing fneg to generate 0x8000000000000000

2024-08-08 Thread tnfchris at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=116229

Tamar Christina  changed:

   What|Removed |Added

 Resolution|--- |FIXED
 Status|ASSIGNED|RESOLVED

--- Comment #5 from Tamar Christina  ---
Fixed, thanks for the report!

[Bug target/116229] [15 Regression] wrong code at -Ofast aarch64 due to missing fneg to generate 0x8000000000000000

2024-08-05 Thread tnfchris at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=116229

Tamar Christina  changed:

   What|Removed |Added

   Assignee|unassigned at gcc dot gnu.org  |tnfchris at gcc dot 
gnu.org
 Status|NEW |ASSIGNED

--- Comment #3 from Tamar Christina  ---
(In reply to Andrew Pinski from comment #2)
>   /* For Advanced SIMD we can create an integer with only the top bit set
>  using fneg (0.0f).  */
> 
> is wrong in aarch64_maybe_generate_simd_constant.
> 
> it should use either an unspec here or an XOR instead of fneg here I think
> especially for -ffast-math reasons.

XOR would defeat the point of the optimization. The original expression is fine
but relied on nothing in the late pipeline being able to fold the zero constant
back in.
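
For context, the trick at source level (a minimal sketch, not the backend
code): negating a floating-point zero flips only the sign bit.

#include <stdint.h>
#include <string.h>

/* Call with z == 0.0: a single fneg flips only the sign bit, yielding
   0x8000000000000000 without a literal-pool load.  */
uint64_t top_bit_only (double z)
{
  double n = -z;
  uint64_t r;
  memcpy (&r, &n, sizeof r);
  return r;
}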

It was for this reason that we explicitly forced it to a separate register.
Late combine is just doing something not possible before. I'll fix it.

[Bug libstdc++/116140] [15 Regression] 5-35% slowdown of 483.xalancbmk and 523.xalancbmk_r since r15-2356-ge69456ff9a54ba

2024-08-01 Thread tnfchris at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=116140

--- Comment #4 from Tamar Christina  ---
It looks like it's because the old unrolled code for the pointer version did a
subtract and used the difference to optimize the IV check away to every 4
elements.  This explains the increase in instruction count.

I hadn't noticed it during benchmarking because on aarch64 the non-pointer
version got recovered with cbz.

This should be fixable while still being vectorizable with

#pragma GCC unroll 4

on the loop.  The generated code looks good, but it looks like the pragma is
being dropped when used in the template.
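
A minimal sketch of the shape involved (find_sketch is a hypothetical
stand-in, not the actual libstdc++ code):

/* The pragma is honoured on a plain loop but appears to be dropped once
   the loop sits inside a template.  */
template <typename It, typename T>
It find_sketch (It first, It last, const T &val)
{
  #pragma GCC unroll 4
  for (; first != last; ++first)
    if (*first == val)
      return first;
  return last;
}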

I'm away for a few days so Alex is looking into it.

[Bug libstdc++/116140] [15 Regression] 5-35% slowdown of 483.xalancbmk and 523.xalancbmk_r since r15-2356-ge69456ff9a54ba

2024-08-01 Thread tnfchris at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=116140

--- Comment #3 from Tamar Christina  ---
(In reply to Jan Hubicka from comment #2)
> Looking at the change, I do not see how that could disable inlining. It
> should only reduce size of the function size estimates in the heuristics.
> 
> I think it is more likely loop optimization doing something crazy.  But we
> need to figure out what really changed in the codegen.

Yes, looking at the change: since the loop is now smaller, it gets inlined into
the callers.  So I guess something goes wrong after that.  Trying something now...

[Bug fortran/90608] Inline non-scalar minloc/maxloc calls

2024-08-01 Thread tnfchris at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=90608

--- Comment #24 from Tamar Christina  ---
(In reply to Mikael Morin from comment #23)
> (In reply to Mikael Morin from comment #21)
> > 
> > (...) and should be able to submit the first
> > series (inline minloc without dim argument) this week.
> > 
> I missed the "this week" mark (again), but I've finally submitted the
> patches:
> https://gcc.gnu.org/pipermail/gcc-patches/2024-July/658909.html

Thank you! and thanks for the clear patch! it gives a starting point if
I have to inline simpler intrinsics in the future :)

much appreciated!

[Bug target/115974] sat_add, etc. vector patterns not done for aarch64 (non-sve)

2024-07-31 Thread tnfchris at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=115974

Tamar Christina  changed:

   What|Removed |Added

 CC||tnfchris at gcc dot gnu.org
 Status|NEW |ASSIGNED
   Assignee|unassigned at gcc dot gnu.org  |tnfchris at gcc dot 
gnu.org

--- Comment #5 from Tamar Christina  ---
I'll assign to myself for now.

[Bug target/116145] Suboptimal SVE immediate synthesis

2024-07-31 Thread tnfchris at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=116145

--- Comment #5 from Tamar Christina  ---
(In reply to ktkachov from comment #4)
> Intersting, thanks for the background. The bigger issue I was seeing was
> with a string-matching loop like https://godbolt.org/z/E7b13915E where the
> constant pool load is a reasonable codegen decision, but unfortunately every
> iteration of the loop reloads the constant which would hurt in a tight inner
> loop.
> So perhaps my problem is that the constant-pool loads are not being
> considered loop invariant, or something is sinking them erroneously

Ah, yeah, that's definitely a bug.  It looks like fwprop is pushing the constant
vector initialization into an UNSPEC, which LIM doesn't know is invariant and so
can't pull it out.

We also don't do so in GIMPLE because the operation isn't lowered.

  pc_15 = svmatch_u8 (pg_12, arr1_14, { 63, 92, 13, 10, ... });

so the constant is never removed from the instruction in GIMPLE.

Should probably look at whether we really need an UNSPEC there.
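
A minimal sketch of the loop shape in question (assumes SVE2 for svmatch_u8;
count_special is a hypothetical function, with the needle constants taken from
the dump above):

#include <arm_sve.h>
#include <stdint.h>

uint64_t count_special (const uint8_t *s, uint64_t n)
{
  /* Loop-invariant: should be hoisted, but once folded into the match
     UNSPEC, LIM can no longer see that it is.  */
  svuint8_t needle = svdupq_n_u8 (63, 92, 13, 10, 63, 92, 13, 10,
                                  63, 92, 13, 10, 63, 92, 13, 10);
  uint64_t count = 0;
  for (uint64_t i = 0; i < n; i += svcntb ())
    {
      svbool_t pg = svwhilelt_b8_u64 (i, n);
      svuint8_t data = svld1_u8 (pg, s + i);
      svbool_t hit = svmatch_u8 (pg, data, needle);
      count += svcntp_b8 (pg, hit);
    }
  return count;
}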

[Bug target/116145] Suboptimal SVE immediate synthesis

2024-07-31 Thread tnfchris at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=116145

Tamar Christina  changed:

   What|Removed |Added

 CC||tnfchris at gcc dot gnu.org

--- Comment #3 from Tamar Christina  ---
We've looked at this type of constant initialization in the past, and even
though the LLVM version looks faster, it isn't in practice.

If you look at the Software Optimization Guides, SVE cores don't handle MOV/MOVK
pairs specially anymore.
So here the sequence is a three-instruction dependency chain with very low
throughput:

mov w8, #23615
movk w8, #2573, lsl #16
mov z0.s, w8
ret

vs

ptrue   p3.b, all
adrp    x0, .LC0
add x0, x0, :lo12:.LC0
ld1rw   z0.s, p3/z, [x0]
ret

which is also a three-instruction dependency chain, but loads have higher
throughput than register transfers, and the latency difference is hidden.
In most real code you'd also have shared the anchor and ptrue, or, if in a loop,
the ptrue and the adrp would have been floated out.

Benchmarking has shown that there's no real performance difference between
these two when there's a single constant.  When there is more than one constant,
the load variant wins by a large margin, as the SVE mov serializes the
construction of all constants.

The concern here is that, because of this serialization, constant
rematerialization inside loops would become slower.
So I don't believe the LLVM sequence is beneficial to implement.

That said, when we looked at this we did come to the conclusion that we can use
SVE's ORR and other immediate instructions to construct more immediate
sequences on the SIMD side itself.  That way we avoid the transfer.

[Bug libstdc++/116140] [15 Regression] 5-35% slowdown of 483.xalancbmk and 523.xalancbmk_r since r15-2356-ge69456ff9a54ba

2024-07-30 Thread tnfchris at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=116140

--- Comment #1 from Tamar Christina  ---
Yeah, we've noticed it as well.

The weird thing is that the dynamic instruction count went up by a lot.

So it looks like some inlining or something did not happen.

[Bug target/116074] [15 regression] ICE when building harfbuzz-9.0.0 on arm64 (related_int_vector_mode, at stor-layout.cc:581)

2024-07-26 Thread tnfchris at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=116074

Tamar Christina  changed:

   What|Removed |Added

 Status|ASSIGNED|RESOLVED
 Resolution|--- |FIXED

--- Comment #11 from Tamar Christina  ---
Fixed, thanks for report.

[Bug tree-optimization/53947] [meta-bug] vectorizer missed-optimizations

2024-07-26 Thread tnfchris at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=53947
Bug 53947 depends on bug 116074, which changed state.

Bug 116074 Summary: [15 regression] ICE when building harfbuzz-9.0.0 on arm64 
(related_int_vector_mode, at stor-layout.cc:581)
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=116074

   What|Removed |Added

 Status|ASSIGNED|RESOLVED
 Resolution|--- |FIXED

[Bug tree-optimization/116074] [15 regression] ICE when building harfbuzz-9.0.0 on arm64 (related_int_vector_mode, at stor-layout.cc:581)

2024-07-25 Thread tnfchris at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=116074

--- Comment #8 from Tamar Christina  ---
Going with a backend fix instead.

[Bug tree-optimization/116074] [15 regression] ICE when building harfbuzz-9.0.0 on arm64 (related_int_vector_mode, at stor-layout.cc:581)

2024-07-25 Thread tnfchris at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=116074

--- Comment #7 from Tamar Christina  ---
The backend is returning TImode for get_vectype_for_scalar_type for historical
reasons where large integer modes were considered struct types and this vector
modes.

However they're not modes the vectorizer can use but the backend hook
!targetm.vectorize.get_mask_mode (vecmode).exists (&mask_mode) ICEs because
it's
not a valid vector mode.

I don't think the target hook should ICE, and I don't see how other usages of
the hook do any additional checking.

diff --git a/gcc/tree-vect-patterns.cc b/gcc/tree-vect-patterns.cc
index 53af5e38b53..b68aea925a4 100644
--- a/gcc/tree-vect-patterns.cc
+++ b/gcc/tree-vect-patterns.cc
@@ -6638,6 +6638,7 @@ vect_recog_cond_store_pattern (vec_info *vinfo,
   machine_mode mask_mode;
   machine_mode vecmode = TYPE_MODE (vectype);
   if (targetm.vectorize.conditional_operation_is_expensive (IFN_MASK_STORE)
+  || !VECTOR_MODE_P (vecmode)
   || !targetm.vectorize.get_mask_mode (vecmode).exists (&mask_mode)
   || !can_vec_mask_load_store_p (vecmode, mask_mode, false))
 return NULL;

This fixes the issue, but I would have expected get_mask_mode to just return
false here.

[Bug tree-optimization/116074] [15 regression] ICE when building harfbuzz-9.0.0 on arm64 (related_int_vector_mode, at stor-layout.cc:581)

2024-07-24 Thread tnfchris at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=116074

Tamar Christina  changed:

   What|Removed |Added

   Last reconfirmed||2024-07-25
   Assignee|unassigned at gcc dot gnu.org  |tnfchris at gcc dot 
gnu.org
 Ever confirmed|0   |1
 Status|UNCONFIRMED |ASSIGNED

--- Comment #6 from Tamar Christina  ---
mine.

[Bug fortran/90608] Inline non-scalar minloc/maxloc calls

2024-07-24 Thread tnfchris at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=90608

--- Comment #22 from Tamar Christina  ---
(In reply to Mikael Morin from comment #21)
> (In reply to Tamar Christina from comment #20)
> > Hi Mikael,
> > 
> > I did regression testing on x86_64 and AArch64 and only found one test-ism.
> > 
> > I think I understand most of the patch to be able to deal with any fallout,
> > would it be ok if I fix the test-ism and submit the patch on your behalf?
> > 
> > It would be a shame to let it bitrot.
> > 
> 
> Sorry. In the last days, I submitted a few small minloc-related patches
> found while working on this PR, and should be able to submit the first
> series (inline minloc without dim argument) this week.

Ah ok, I'll wait for you then, thanks!

> 
> You can submit on my behalf if you prefer; it would definitely accelerate
> progress on this topic.
> 
> What do you mean by test-ism?


I think this was just me; I had tested the minloc patch on top of some
additional changes to IVopts that mostly help Fortran code.

At the time, gfortran.dg/maxloc_bounds_[4-7].f90 started failing and I had
assumed that it had to do with the inlining.

But it looks like that was a bug in my IVopts patch, as they're no longer
failing with the new patches.

[Bug ipa/106783] [12/13/14/15 Regression] ICE in ipa-modref.cc:analyze_function since r12-5247-ga34edf9a3e907de2

2024-07-23 Thread tnfchris at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=106783

--- Comment #8 from Tamar Christina  ---
(In reply to Jan Hubicka from comment #6)
> The problem is that n/=0 is undefined behavior (so we can optimize out call
> to function doing divide by zero), while __builtin_trap is observable and we
> do not optimize out code paths that may trip to it.
> 

Hmm, I hit this today with:

void foo1 (char *restrict a, int *restrict c, int n, int stride)
{
  if (stride <= 1)
return;
  for (int i = 0; i < 9; i++)
{
  int res = c[i];
  c[i] = a[i] ? res : 9;
}
}

compiled with -Ofast -march=armv9-a -fdump-tree-modref.

At least this variant has no builtin traps (nor UB).

See https://godbolt.org/z/1h83rasns

[Bug fortran/90608] Inline non-scalar minloc/maxloc calls

2024-07-22 Thread tnfchris at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=90608

--- Comment #20 from Tamar Christina  ---
Hi Mikael,

I did regression testing on x86_64 and AArch64 and only found one test-ism.

I think I understand most of the patch to be able to deal with any fallout,
would it be ok if I fix the test-ism and submit the patch on your behalf?

It would be a shame to let it bitrot.

Thanks,
Tamar

[Bug tree-optimization/115531] vectorizer generates inefficient code for masked conditional update loops

2024-07-22 Thread tnfchris at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=115531

Tamar Christina  changed:

   What|Removed |Added

 Resolution|--- |FIXED
 Status|ASSIGNED|RESOLVED

--- Comment #7 from Tamar Christina  ---
Fixed on trunk

[Bug tree-optimization/53947] [meta-bug] vectorizer missed-optimizations

2024-07-22 Thread tnfchris at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=53947
Bug 53947 depends on bug 115531, which changed state.

Bug 115531 Summary: vectorizer generates inefficient code for masked 
conditional update loops
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=115531

   What|Removed |Added

 Status|ASSIGNED|RESOLVED
 Resolution|--- |FIXED

[Bug tree-optimization/115936] [15 Regression] GCN vs. ivopts: replace constant_multiple_of with aff_combination_constant_multiple_p [PR114932]

2024-07-17 Thread tnfchris at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=115936

Tamar Christina  changed:

   What|Removed |Added

 Resolution|--- |FIXED
 Status|ASSIGNED|RESOLVED

--- Comment #8 from Tamar Christina  ---
Fixed, thanks for the report.  Bug is latent on branches so won't backport for
now.

[Bug target/115936] [15 Regression] GCN vs. ivopts: replace constant_multiple_of with aff_combination_constant_multiple_p [PR114932]

2024-07-15 Thread tnfchris at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=115936

--- Comment #6 from Tamar Christina  ---
(In reply to Richard Biener from comment #3)
> iv->step should never be a pointer type

This is created by SCEV.

simple_iv_with_niters, in the case where no CHREC is found, creates an IV with
base == ev, step == 0;

however, in this case EV is a POINTER_PLUS_EXPR and so the type is a pointer.
It ends up creating an unusable expression.

Following the remaining code, it looks like this should be:

diff --git a/gcc/tree-scalar-evolution.cc b/gcc/tree-scalar-evolution.cc
index 5aa95a2497a..abb2bad7773 100644
--- a/gcc/tree-scalar-evolution.cc
+++ b/gcc/tree-scalar-evolution.cc
@@ -3243,7 +3243,11 @@ simple_iv_with_niters (class loop *wrto_loop, class loop
*use_loop,
   if (tree_does_not_contain_chrecs (ev))
 {
   iv->base = ev;
-  iv->step = build_int_cst (TREE_TYPE (ev), 0);
+  tree ev_type = TREE_TYPE (ev);
+  if (POINTER_TYPE_P (ev_type))
+   ev_type = sizetype;
+
+  iv->step = build_int_cst (ev_type, 0);
   iv->no_overflow = true;
   return true;
 }

So I think there are two bugs here.  I also think the IVopts one is a bug, as
it's clearly changing the type and introducing a mismatch there too.

I'm regression testing both changes.

[Bug target/115934] [15 Regression] nvptx vs. ivopts: replace constant_multiple_of with aff_combination_constant_multiple_p [PR114932]

2024-07-15 Thread tnfchris at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=115934

--- Comment #7 from Tamar Christina  ---
(In reply to Thomas Schwinge from comment #6)
> Tamar, Richard, thanks for having a look.
> 
> (In reply to Tamar Christina from comment #4)
> > This one looks a bit like costing, [...]
> 
> I see.  So we (I) shall later re-visit this PR in context of
>  001101dad3b2$ef215730$cd640590$@nextmovesoftware.com> "[nvptx PATCH]
> Implement rtx_costs target hook for nvptx backend", and, if necessary,
> follow-up work:
> 

I believe so; it looks like try_improve_iv_set does nothing for nvptx because
it tries to look for TARGET_ADDRESS_COST and, in its absence, tries to use
TARGET_RTX_COSTS, both of which are missing.

Because of this it can't compare the different IVs and the costs all end up
being the same.

So basically it just ends up picking the first one from the list, which in this
case just happens to be worse off. 

> > I don't however see an implementation of TARGET_ADDRESS_COST for the target.

[Bug target/115936] [15 Regression] GCN vs. ivopts: replace constant_multiple_of with aff_combination_constant_multiple_p [PR114932]

2024-07-15 Thread tnfchris at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=115936

--- Comment #4 from Tamar Christina  ---
(In reply to Richard Biener from comment #3)
> iv->step should never be a pointer type

That's what I initially thought too.  My suspicion is that there is some code
that tries to create the 0 offset.

I'll try to track down where the IV is created.

0 + 0B is a weird candidate either way.

[Bug target/115934] [15 Regression] nvptx vs. ivopts: replace constant_multiple_of with aff_combination_constant_multiple_p [PR114932]

2024-07-15 Thread tnfchris at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=115934

--- Comment #4 from Tamar Christina  ---
This one looks a bit like a costing issue.

before the patch IVopts had:

:
inv_expr 1: -element_7(D)
inv_expr 2: (signed int) rite_5(D) - (signed int) element_7(D)

and after the patch it generates a few more alternatives:


:
inv_expr 1: -element_7(D)
inv_expr 2: ((signed int) left_4(D) + (signed int) rite_5(D)) - (signed
int) element_7(D)
inv_expr 3: (signed int) left_4(D) + (signed int) rite_5(D)
inv_expr 4: (signed int) rite_5(D) - (signed int) element_7(D)
inv_expr 5: ((signed int) rite_5(D) - (signed int) element_7(D)) + (signed
int) left_4(D)
inv_expr 6: ((signed int) rite_5(D) + (signed int) element_7(D)) + (signed
int) left_4(D)
inv_expr 7: ((signed int) left_4(D) - (signed int) element_7(D)) + (signed
int) rite_5(D)

Before it decided it needed two separate IVs to satisfy these invariants:

Initial set of candidates:
  cost: 122 (complexity 0)
  reg_cost: 114
  cand_cost: 8
  cand_group_cost: 0 (complexity 0)
  candidates: 7, 9
   group:0 --> iv_cand:7, cost=(0,0)
   group:1 --> iv_cand:9, cost=(0,0)
   group:2 --> iv_cand:9, cost=(0,0)
   group:3 --> iv_cand:7, cost=(0,0)
  invariant variables: 1
  invariant expressions: 1

Original cost 122 (complexity 0)

Final cost 122 (complexity 0)

Selected IV set for loop 1 at
../gcc-dsg/gcc/testsuite/gcc.dg/tree-ssa/pr43378.c:7, 10 avg niters, 2 IVs:
Candidate 7:
  Var befor: left_14
  Var after: left_10
  Incr POS: orig biv
  IV struct:
Type:   int
Base:   left_4(D)
Step:   element_7(D)
Biv:N
Overflowness wrto loop niter:   Overflow
Candidate 9:
  Depend on inv.exprs: 1
  Var befor: rite_15
  Var after: rite_8
  Incr POS: orig biv
  IV struct:
Type:   int
Base:   rite_5(D)
Step:   -element_7(D)
Biv:N
Overflowness wrto loop niter:   Overflow


with the patch it decided it only needed the one IV:

Initial set of candidates:
  cost: 109 (complexity 0)
  reg_cost: 97
  cand_cost: 4
  cand_group_cost: 8 (complexity 0)
  candidates: 9
   group:0 --> iv_cand:9, cost=(4,0)
   group:1 --> iv_cand:9, cost=(0,0)
   group:2 --> iv_cand:9, cost=(0,0)
   group:3 --> iv_cand:9, cost=(4,0)
  invariant variables: 1
  invariant expressions: 1, 3

Initial set of candidates:
  cost: 109 (complexity 0)
  reg_cost: 97
  cand_cost: 4
  cand_group_cost: 8 (complexity 0)
  candidates: 9
   group:0 --> iv_cand:9, cost=(4,0)
   group:1 --> iv_cand:9, cost=(0,0)
   group:2 --> iv_cand:9, cost=(0,0)
   group:3 --> iv_cand:9, cost=(4,0)
  invariant variables: 1
  invariant expressions: 1, 3

Original cost 109 (complexity 0)

Final cost 109 (complexity 0)

Selected IV set for loop 1 at
../gcc-dsg/gcc/testsuite/gcc.dg/tree-ssa/pr43378.c:7, 10 avg niters, 1 IVs:
Candidate 9:
  Depend on inv.exprs: 1
  Var befor: rite_15
  Var after: rite_8
  Incr POS: orig biv
  IV struct:
Type:   int
Base:   rite_5(D)
Step:   -element_7(D)
Biv:N
Overflowness wrto loop niter:   Overflow

It realizes it can satisfy both IVs using one candidate and picks it because it
thinks the costs are much lower.
I don't, however, see an implementation of TARGET_ADDRESS_COST for the target.

On AArch64 for instance this is rejected by costing because the combined IV
requires more registers:

Initial set of candidates:
  cost: 17 (complexity 0)
  reg_cost: 5
  cand_cost: 4
  cand_group_cost: 8 (complexity 0)
  candidates: 9
   group:0 --> iv_cand:9, cost=(4,0)
   group:1 --> iv_cand:9, cost=(0,0)
   group:2 --> iv_cand:9, cost=(0,0)
   group:3 --> iv_cand:9, cost=(4,0)
  invariant variables: 1
  invariant expressions: 1, 3

Improved to:
  cost: 14 (complexity 0)
  reg_cost: 6
  cand_cost: 8
  cand_group_cost: 0 (complexity 0)
  candidates: 7, 9
   group:0 --> iv_cand:7, cost=(0,0)
   group:1 --> iv_cand:9, cost=(0,0)
   group:2 --> iv_cand:9, cost=(0,0)
   group:3 --> iv_cand:7, cost=(0,0)
  invariant variables: 1
  invariant expressions: 1

Original cost 14 (complexity 0)

Final cost 14 (complexity 0)

Selected IV set for loop 1 at /app/example.c:4, 10 avg niters, 2 IVs:

I'll have to take a look at what happens when a target has no cost model for
IVopts.

[Bug target/115936] [15 Regression] GCN vs. ivopts: replace constant_multiple_of with aff_combination_constant_multiple_p [PR114932]

2024-07-15 Thread tnfchris at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=115936

Tamar Christina  changed:

   What|Removed |Added

   Target Milestone|--- |15.0

--- Comment #2 from Tamar Christina  ---
Looks like IVopts has generated invalid GIMPLE:

ivtmp.39_65 = ivtmp.39_59 + 0B;

where the IVs are DI mode and the offset is a pointer.
This comes from this weird candidate:

Candidate 8:
  Var befor: ivtmp.39_59
  Var after: ivtmp.39_65
  Incr POS: before exit test
  IV struct:
Type:   sizetype
Base:   0
Step:   0B
Biv:N
Overflowness wrto loop niter:   No-overflow

Looks like this invalid candidate was always generated, but was not selected
before, as the old constant_multiple_of bailed out due to operand_equal_p
constraining the type of the arguments.

The question is why this invalid candidate was generated at all, and that's due
to:

  /* Record common candidate with initial value zero.  */
  basetype = TREE_TYPE (iv->base);
  if (POINTER_TYPE_P (basetype))
basetype = sizetype;
  record_common_cand (data, build_int_cst (basetype, 0), iv->step, use);

which, for the case where the type is a pointer, changes the base but not the
step.  This makes base + step no longer valid GIMPLE.

So I believe the fix is:

diff --git a/gcc/tree-ssa-loop-ivopts.cc b/gcc/tree-ssa-loop-ivopts.cc
index 5fc188ae3f8..d590d6a9b78 100644
--- a/gcc/tree-ssa-loop-ivopts.cc
+++ b/gcc/tree-ssa-loop-ivopts.cc
@@ -3547,7 +3547,8 @@ add_iv_candidate_for_use (struct ivopts_data *data,
struct iv_use *use)
   basetype = TREE_TYPE (iv->base);
   if (POINTER_TYPE_P (basetype))
 basetype = sizetype;
-  record_common_cand (data, build_int_cst (basetype, 0), iv->step, use);
+  record_common_cand (data, build_int_cst (basetype, 0),
+ fold_convert (basetype, iv->step), use);


which fixes the ICE. Will regtest and submit.

[Bug target/115936] [15 Regression] GCN vs. ivopts: replace constant_multiple_of with aff_combination_constant_multiple_p [PR114932]

2024-07-15 Thread tnfchris at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=115936

Tamar Christina  changed:

   What|Removed |Added

 Ever confirmed|0   |1
 CC||tnfchris at gcc dot gnu.org
 Status|UNCONFIRMED |ASSIGNED
   Last reconfirmed||2024-07-15
   Assignee|unassigned at gcc dot gnu.org  |tnfchris at gcc dot 
gnu.org

--- Comment #1 from Tamar Christina  ---
Odd thing to ICE on; mine.

[Bug target/115934] [15 Regression] nvptx vs. ivopts: replace constant_multiple_of with aff_combination_constant_multiple_p [PR114932]

2024-07-15 Thread tnfchris at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=115934

Tamar Christina  changed:

   What|Removed |Added

   Assignee|unassigned at gcc dot gnu.org  |tnfchris at gcc dot 
gnu.org
 Status|UNCONFIRMED |ASSIGNED
   Last reconfirmed||2024-07-15
 Ever confirmed|0   |1

--- Comment #3 from Tamar Christina  ---
mine.

[Bug target/115934] [15 Regression] nvptx vs. ivopts: replace constant_multiple_of with aff_combination_constant_multiple_p [PR114932]

2024-07-15 Thread tnfchris at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=115934

--- Comment #1 from Tamar Christina  ---
Hi, thanks for the report.  Could you tell me a target triple I can use for
nvptx?

[Bug tree-optimization/115866] missed optimization vectorizing switch statements.

2024-07-11 Thread tnfchris at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=115866

Tamar Christina  changed:

   What|Removed |Added

   Assignee|unassigned at gcc dot gnu.org  |tnfchris at gcc dot 
gnu.org

--- Comment #2 from Tamar Christina  ---
mine.

[Bug tree-optimization/115866] New: missed optimization vectorizing switch statements.

2024-07-10 Thread tnfchris at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=115866

Bug ID: 115866
   Summary: missed optimization vectorizing switch statements.
   Product: gcc
   Version: 15.0
Status: UNCONFIRMED
  Keywords: missed-optimization
  Severity: normal
  Priority: P3
 Component: tree-optimization
  Assignee: unassigned at gcc dot gnu.org
  Reporter: tnfchris at gcc dot gnu.org
Blocks: 53947, 115130
  Target Milestone: ---

The following example:

short a[100];

int foo(int n, int counter)
{
   for (int i = 0; i < n; i++)
 {
if (a[i] == 1 || a[i] == 2 || a[i] == 7 || a[i] == 4)
  return 1;
 }
return 0;
}

fails to vectorize at -O3 -march=armv9-a because in GIMPLE the if is rewritten
into a switch statement:

   [local count: 114863530]:
  if (n_6(D) > 0)
goto ; [94.50%]
  else
goto ; [5.50%]

   [local count: 108546036]:

   [local count: 1014686025]:
  # i_3 = PHI 
  _1 = a[i_3];
  switch (_1)  [94.50%], case 1 ... 2:  [5.50%], case 4:
 [5.50%], case 7:  [5.50%]>

   [local count: 55807731]:
:
  goto ; [100.00%]

   [local count: 958878295]:
:
  i_8 = i_3 + 1;
  if (n_6(D) > i_8)
goto ; [94.50%]
  else
goto ; [5.50%]

   [local count: 52738306]:
  goto ; [100.00%]

   [local count: 906139989]:
  goto ; [100.00%]

   [local count: 6317494]:

   [local count: 59055800]:

   [local count: 114863531]:
  # _5 = PHI <1(9), 0(8)>
:
  return _5;

However, such switch statements, where all the entries lead to the same
destination, are easy to vectorize.  In SVE we have the MATCH instruction that
can be used here, and for other targets we can duplicate the constants and
lower the switch to a series of compares and ORRs.
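
For the non-SVE case, a sketch of the if-converted form the switch could be
lowered to (illustrative; foo_lowered is a hypothetical name with the same
semantics as the loop above):

short a[100];  /* same array as in the example above */

int foo_lowered (int n)
{
  for (int i = 0; i < n; i++)
    {
      int v = a[i];
      /* Duplicate the constants and OR the compare results together.  */
      if ((v == 1) | (v == 2) | (v == 7) | (v == 4))
        return 1;
    }
  return 0;
}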

This is similar to what's done when the values don't fit inside a switch:

short a[100];

int foo(int n, int counter, short x, short b, short c)
{
   for (int i = 0; i < n; i++)
 {
if (a[i] == x || a[i] == b || a[i] == 7 || a[i] == c)
  return 1;
 }
return 0;
}

is vectorized as:

.L4:
incb    x5
incw    z30.s, all, mul #2
cmp w6, w1
bcc .L15
.L6:
ld1h    z31.h, p7/z, [x5]
cmpeq   p15.h, p7/z, z31.h, z27.h
cmpeq   p11.h, p7/z, z31.h, z28.h
cmpeq   p14.h, p7/z, z31.h, #7
cmpeq   p12.h, p7/z, z31.h, z29.h
orr p15.b, p7/z, p15.b, p11.b
orr p14.b, p7/z, p14.b, p12.b
inch    x1
orr p15.b, p7/z, p15.b, p14.b
ptest   p13, p15.b
b.none  .L4
umov    w1, v30.s[0]
.L3:
sxtw    x1, w1
b   .L7

which is great! But it should be an SVE MATCH instruction as well.

This kind of code shows up through the use of std::find_if as well:

#include <vector>
#include <algorithm>
using namespace std;

bool pred(int i) { return i == 1 || i == 2 || i == 7 || i == 4; }

int foo(vector<int> vec)
{
vector<int>::iterator it;

it = find_if(vec.begin(), vec.end(), pred);

return *it;
}

and once the unrolled loop is gone we should be able to vectorize it.


Referenced Bugs:

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=53947
[Bug 53947] [meta-bug] vectorizer missed-optimizations
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=115130
[Bug 115130] [meta-bug] early break vectorization

[Bug libstdc++/115799] ranges::find's optimized branching for memchr is not quite right

2024-07-06 Thread tnfchris at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=115799

Tamar Christina  changed:

   What|Removed |Added

 CC||tnfchris at gcc dot gnu.org

--- Comment #2 from Tamar Christina  ---
I also get a related ICE:

/opt/buildAgent/temp/buildTmp/toolchain/include/c++/15.0.0/bits/stl_algo.h:3873:38:
error: no match for 'operator+' (operand types are
'std::_Rb_tree_const_iterator' and 'long int')
 3873 |   return __first + ((const char*)__p1 - (const
char*)__p0);
  | 
^

[Bug tree-optimization/104265] Missed vectorization in 526.blender_r

2024-07-05 Thread tnfchris at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=104265

--- Comment #5 from Tamar Christina  ---
Also, for fully masked architectures, instead of recreating the vectors we can
just mask out the irrelevant values.

But we should still order the exits based on complexity.

[Bug tree-optimization/104265] Missed vectorization in 526.blender_r

2024-07-05 Thread tnfchris at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=104265

--- Comment #4 from Tamar Christina  ---
(In reply to Richard Biener from comment #3)
> Note the SLP discovery opportunity is from the "reduction" PHI to the
> return which merges control flow to a zero/one flag.

Right, so I get what you mean here, so in

   [local count: 308696474]:
  _52 = t2x_61 < 0.0;
  _53 = t2y_63 < 0.0;
  _54 = _52 | _53;
  _66 = t2z_65 < 0.0;
  _67 = _54 | _66;
  if (_67 != 0)
goto ; [51.40%]
  else
goto ; [48.60%]

   [local count: 158662579]:
  goto ; [100.00%]

   [local count: 150033894]:
  _55 = isec_58(D)->dist;
  _68 = _55 < t1y_62;
  _69 = _55 < t1x_60;
  _70 = _68 | _69;
  _71 = _55 < t1z_64;
  _72 = _70 | _71;
  _73 = ~_72;
  _74 = (int) _73;

   [local count: 1073741824]:
  # _56 = PHI <0(8), _74(6)>
  return _56;

we start at _56 and follow the preds up.  The interesting bit here though is
that the values being compared aren't sequential in memory.

So:

  if (t1x > isec->dist || t1y > isec->dist || t1z > isec->dist) return 0;

  float t1x = (bb[isec->bv_index[0]] - isec->start[0]) * isec->idot_axis[0];
  float t1y = (bb[isec->bv_index[2]] - isec->start[1]) * isec->idot_axis[1];
  float t1z = (bb[isec->bv_index[4]] - isec->start[2]) * isec->idot_axis[2];

but then in:

  if (t1x > t2y  || t2x < t1y  || t1x > t2z || t2x < t1z || t1y > t2z || t2y <
t1z) return 0;

we need a replicated t1x and {t2x, t2x, t2y}.

It looks like the ICX code does indeed rebuild/shuffle the vector at every
exit.
ICX does a better job than GCC here; it does a nice trick: the key is that it
also re-ordered the exits based on the complexity of the shuffle.

movsxd  rax, dword ptr [rdi + 56]
vmovsd  xmm1, qword ptr [rdi]   # xmm1 = mem[0],zero
vmovsd  xmm2, qword ptr [rdi + 76]  # xmm2 = mem[0],zero
movsxd  rcx, dword ptr [rdi + 64]
vmovss  xmm0, dword ptr [rsi + 4*rax]   # xmm0 = mem[0],zero,zero,zero
vinsertps   xmm0, xmm0, dword ptr [rsi + 4*rcx], 16 # xmm0 =
xmm0[0],mem[0],xmm0[2,3]
vsubps  xmm0, xmm0, xmm1
vmulps  xmm0, xmm0, xmm2
vxorps  xmm3, xmm3, xmm3
vcmpltpsxmm3, xmm0, xmm3

i.e. the exit:

  if (t2x < 0.0f || t2y < 0.0f || t2z < 0.0f) return 0;

was made the first exit so it doesn't perform the complicated shuffles if it
doesn't need to.

So it looks like SLP scheduling should take complexity into account?  This
will become interesting with costing as well.

[Bug fortran/90608] Inline non-scalar minloc/maxloc calls

2024-07-05 Thread tnfchris at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=90608

--- Comment #19 from Tamar Christina  ---
Hi Mikael,

It looks like the last version of your patch already gets the calls inlined at
the call sites we cared about.

Would it be possible for you to upstream it?

[Bug c++/115623] ICE: Segmentation fault in finish_for_cond with novector and almost infinite loop

2024-07-04 Thread tnfchris at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=115623

Tamar Christina  changed:

   What|Removed |Added

 Resolution|--- |FIXED
 Status|ASSIGNED|RESOLVED

--- Comment #7 from Tamar Christina  ---
Fixed in master and GCC-14, thanks for the report!

[Bug tree-optimization/115629] Inefficient if-convert of masked conditionals

2024-07-01 Thread tnfchris at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=115629

--- Comment #6 from Tamar Christina  ---
(In reply to rguent...@suse.de from comment #5)
> > In this case, the second load is conditional on the first load mask,  which
> > means it's already done an AND.
> > And crucially inverting it means you also inverted both conditions.
> > 
> > So there are some superflous masking operations happening.  But I guess 
> > that's
> > a separate bug.  Shall I just add some tests here and close it and open a 
> > new
> > PR?
> 
> Not sure if that helps - do we fully understand this is a separate issue and
> not related to how we if-convert?
> 

if-convert looks ok to me:

   [local count: 955630226]:
  # i_28 = PHI 
  _1 = (long unsigned int) i_28;
  _2 = _1 * 4;
  _3 = a_16(D) + _2;
  _4 = *_3;
  _31 = _4 != 0;
  _55 = _54 + _2;
  _6 = (int *) _55;
  _56 = ~_31;
  _7 = .MASK_LOAD (_6, 32B, _56);
  _22 = _7 == 0;
  _34 = _22 & _56;
  _58 = _57 + _2;
  _9 = (int *) _58;
  iftmp.0_19 = .MASK_LOAD (_9, 32B, _34);
  _61 = _4 | _7;
  _35 = _61 != 0;
  _60 = _59 + _2;
  _8 = (int *) _60;
  iftmp.0_21 = .MASK_LOAD (_8, 32B, _35);
  iftmp.0_12 = _34 ? iftmp.0_19 : iftmp.0_21;
  _10 = res_23(D) + _2;
  *_10 = iftmp.0_12;
  i_25 = i_28 + 1;
  if (n_15(D) > i_25)
goto ; [89.00%]
  else
goto ; [11.00%]

I think what's missing here is that

  _7 = .MASK_LOAD (_6, 32B, _56);
  _22 = _7 == 0;
  _34 = _22 & _56;
  iftmp.0_19 = .MASK_LOAD (_9, 32B, _34);

in itself produces an AND, namely (_7 && _34) && _56, where _56 is the loop
mask.

My (probably poor) understanding is that the mask tracking in the vectorizer
attempts to avoid keeping masks and their inverses live at the same time,
but that this code doesn't track masks introduced by nested
MASK_LOADs.  At least, that's my naive interpretation.

[Bug libstdc++/88545] std::find compile to memchr in trivial random access cases (patch)

2024-07-01 Thread tnfchris at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=88545

--- Comment #12 from Tamar Christina  ---
I had a bug in the benchmark: I forgot to run it under taskset.

These are the correct ones:

++---+-+-+
| NEEDLE | scalar 1x | vect| memchr  |
++---+-+-+
| 1  | -0.14%| 174.95% | 373.69% |
| 0  | 0.00% | -90.60% | -95.21% |
| 100| 0.03% | -80.28% | -80.39% |
| 1000   | 0.00% | -89.46% | -94.06% |
| 1  | 0.00% | -90.33% | -95.19% |
| -1 | 0.00% | -90.60% | -95.21% |
++---+-+-+

This shows that on modern cores the unrolled scalar code has no benefit, so we
should just remove it.

It also shows that memchr is universally faster and that for the rest the
vectorizer does a pretty good job.  We'll get some additional speedups there
soon as well but memchr should still win as it's hand tuned.

So I think for 1-byte elements we should use memchr, and for the rest remove
the unrolled code and let the vectorizer handle it.
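Roughly what I mean for the 1-byte case (a sketch only, not the actual
libstdc++ change; find_byte is a hypothetical name):

#include <string.h>

const char *
find_byte (const char *first, const char *last, char value)
{
  /* memchr is hand tuned, so defer to it for 1-byte elements; for wider
     element types drop the manual unrolling and use a plain loop that
     the vectorizer can pick up.  */
  const void *p = memchr (first, value, (size_t) (last - first));
  return p ? (const char *) p : last;
}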

[Bug tree-optimization/115629] Inefficient if-convert of masked conditionals

2024-07-01 Thread tnfchris at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=115629

--- Comment #4 from Tamar Christina  ---
(In reply to Richard Biener from comment #3)
> So we now tail-merge the two b[i] loading blocks.  Can you check SVE
> code-gen with this?  If that fixes the PR consider adding a SVE testcase.

Thanks, the codegen is much better now, but shows some other missing mask
tracking in the vectorizer.

Atm we generate:

.L3:
ld1wz31.s, p6/z, [x0, x6, lsl 2] <-- load a
cmpeq   p7.s, p6/z, z31.s, #0<-- a == 0, !a
ld1wz0.s, p7/z, [x2, x6, lsl 2]  <-- load c conditionally on !a
cmpeq   p7.s, p7/z, z0.s, #0 <-- !a && !c
orr z0.d, z31.d, z0.d<-- a || c
ld1wz29.s, p7/z, [x3, x6, lsl 2] <--- load d where !a && !c
cmpne   p5.s, p6/z, z0.s, #0 <--- (a || c) & loop_mask
and p7.b, p6/z, p7.b, p7.b   <--- ((!a && !c) && (!a && !c)) &
loop_mask 
ld1wz30.s, p5/z, [x1, x6, lsl 2] <-- load b conditionally on (a ||
c)
sel z30.s, p7, z29.s, z30.s  <-- select (!a && !c, d, b)
st1wz30.s, p6, [x4, x6, lsl 2]
add x6, x6, x7
whilelo p6.s, w6, w5
b.any   .L3

which corresponds to:

  # loop_mask_63 = PHI 
  vect__4.10_64 = .MASK_LOAD (vectp_a.8_53, 32B, loop_mask_63);
  mask__31.11_66 = vect__4.10_64 != { 0, ... };
  mask__56.12_67 = ~mask__31.11_66;
  vec_mask_and_70 = mask__56.12_67 & loop_mask_63;
  vect__7.15_71 = .MASK_LOAD (vectp_c.13_68, 32B, vec_mask_and_70);
  mask__22.16_73 = vect__7.15_71 == { 0, ... };
  mask__34.17_75 = vec_mask_and_70 & mask__22.16_73;
  vect_iftmp.20_78 = .MASK_LOAD (vectp_d.18_76, 32B, mask__34.17_75);
  vect__61.21_79 = vect__4.10_64 | vect__7.15_71;
  mask__35.22_81 = vect__61.21_79 != { 0, ... };
  vec_mask_and_84 = mask__35.22_81 & loop_mask_63;
  vect_iftmp.25_85 = .MASK_LOAD (vectp_b.23_82, 32B, vec_mask_and_84);
  _86 = mask__34.17_75 & loop_mask_63;
  vect_iftmp.26_87 = VEC_COND_EXPR <_86, vect_iftmp.20_78, vect_iftmp.25_85>;
  .MASK_STORE (vectp_res.27_88, 32B, loop_mask_63, vect_iftmp.26_87);

It looks like what's missing here is that the mask tracking doesn't know that
other masked operations naturally perform an AND when combined.  We do some of
this in the backend, but I feel it may be better to do it in the vectorizer.

In this case, the second load is conditional on the first load mask,  which
means it's already done an AND.
And crucially inverting it means you also inverted both conditions.

So there are some superfluous masking operations happening.  But I guess
that's a separate bug.  Shall I just add some tests here, close it, and open a
new PR?
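To make the implicit AND concrete, an intrinsics-level sketch (illustrative C
using ACLE SVE intrinsics, not vectorizer output):

#include <arm_sve.h>

svint32_t
load_chain (svbool_t loop_mask, const int32_t *a, const int32_t *c)
{
  svint32_t va = svld1_s32 (loop_mask, a);
  /* Inactive lanes of a governed compare are false, so pa0 is already a
     subset of loop_mask; a further AND with loop_mask is redundant.  */
  svbool_t pa0 = svcmpeq_n_s32 (loop_mask, va, 0);
  /* Governing the next load with pa0 performs the combined condition
     loop_mask & (va == 0) for free.  */
  return svld1_s32 (pa0, c);
}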

[Bug libstdc++/88545] std::find compile to memchr in trivial random access cases (patch)

2024-07-01 Thread tnfchris at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=88545

--- Comment #11 from Tamar Christina  ---
(In reply to Jonathan Wakely from comment #9)
> Patch posted: https://gcc.gnu.org/pipermail/gcc-patches/2024-June/653731.html
> 
> Rerunning benchmarks with this patch would be very welcome.

OK, I have tested the patches on AArch64 and also compared against our
vectorized cases.

I tested 5 needle positions (where the element it's searching for is found):

1
0
100
1000
1
-1

where 0 means never found, and -1 means find last:

Results are:

++---+-+-+
| NEEDLE | scalar 1x | vect| memchr  |
++---+-+-+
| 1  | -11.67%   | -14.95% | -12.92% |
| 0  | 137.48%   | -82.31% | -83.36% |
| 100| 3.75% | 17.06%  | 8.02%   |
| 1000   | -10.34%   | -10.83% | 0.29%   |
| 1  | -3.25%| -4.97%  | -5.19%  |
| -1 | 10.28%| -31.17% | -33.93% |
++---+-+-+

So it looks like, staying with scalar, the unrolling still has a positive
effect; calling memchr helps on longer searches but hurts on shorter ones.

The vector loop as currently vectorized has about 10% of unneeded overhead,
which we will be working on this year.  But otherwise it's also a significant
win for longer searches.

So perhaps an idea is to use memchr for bytes, for everything else remove the
unrolled code and let the vectorizer take care of it, and if that fails let the
RTL or tree unroller do it?

for completeness, my benchmark was:

#include <algorithm>
#include <cstddef>
#include <fstream>
#include <sstream>
#include <string>

#ifndef NEEDLE
#define NEEDLE 50
#endif

#ifndef ITERS
#define ITERS 1000
#endif

__attribute__((noipa))
bool
doIt (const char* s, char v, size_t len)
{
  const char* l = s + len;
  const char* r = std::find (s, l, v);
  return (r != l);
}

int main ()
{
  std::ifstream t("find.data");
  std::stringstream buffer;
  buffer << t.rdbuf();
  std::string content = buffer.str();
  if (NEEDLE > 0)
content[NEEDLE-1] = '|';
  else if (NEEDLE < 0)
content[content.length()-1] = '|';

  bool found = false;
  for (int i = 0; i < ITERS; i++)
 found = found | doIt (content.c_str (), '|', content.length ());

  return found;
}

[Bug tree-optimization/115120] Bad interaction between ivcanon and early break vectorization

2024-06-25 Thread tnfchris at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=115120

--- Comment #5 from Tamar Christina  ---
considering ivopts bails out on doloop prediction for multiple exits anyway,
what do you think about:

diff --git a/gcc/tree-ssa-loop-ivcanon.cc b/gcc/tree-ssa-loop-ivcanon.cc
index 5ef24a91917..d1b25ad7dea 100644
--- a/gcc/tree-ssa-loop-ivcanon.cc
+++ b/gcc/tree-ssa-loop-ivcanon.cc
@@ -1319,7 +1319,8 @@ canonicalize_loop_induction_variables (class loop *loop,

   if (create_iv
   && niter && !chrec_contains_undetermined (niter)
-  && exit && just_once_each_iteration_p (loop, exit->src))
+  && exit && just_once_each_iteration_p (loop, exit->src)
+  && (single_dom_exit (loop) || targetm.predict_doloop_p (loop)))
 {
   tree iv_niter = niter;
   if (may_be_zero)

richi?

[Bug tree-optimization/115629] New: Inefficient if-convert of masked conditionals

2024-06-24 Thread tnfchris at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=115629

Bug ID: 115629
   Summary: Inefficient if-convert of masked conditionals
   Product: gcc
   Version: 15.0
Status: UNCONFIRMED
  Keywords: missed-optimization
  Severity: normal
  Priority: P3
 Component: tree-optimization
  Assignee: unassigned at gcc dot gnu.org
  Reporter: tnfchris at gcc dot gnu.org
  Target Milestone: ---

With cases such as:

void foo1 (int *restrict a, int *restrict b, int *restrict c,
  int *restrict d, int *restrict res, int n)
{
for (int i = 0; i < n; i++)
  res[i] = a[i] ? b[i] : (c[i] ? b[i] : d[i]);
}

we generate:

foo1:
cmp w5, 0
ble .L1
mov x6, 0
whilelo p7.s, wzr, w5
ptrue   p3.b, all
.L3:
ld1wz31.s, p7/z, [x0, x6, lsl 2]
cmpeq   p6.s, p7/z, z31.s, #0
cmpne   p5.s, p7/z, z31.s, #0
ld1wz0.s, p6/z, [x2, x6, lsl 2]
ld1wz30.s, p5/z, [x1, x6, lsl 2]
cmpne   p15.s, p3/z, z0.s, #0
orr z0.d, z31.d, z0.d
and p6.b, p15/z, p6.b, p6.b
cmpeq   p4.s, p7/z, z0.s, #0
ld1wz28.s, p6/z, [x1, x6, lsl 2]
ld1wz29.s, p4/z, [x3, x6, lsl 2]
sel z29.s, p15, z28.s, z29.s
sel z29.s, p5, z30.s, z29.s
st1wz29.s, p7, [x4, x6, lsl 2]
incwx6
whilelo p7.s, w6, w5
b.any   .L3

where b is loaded twice: once with the mask a[i] != 0, and once with a[i] == 0
&& c[i] != 0.  Clearly we don't need the second compare nor the second load.
This loop can be handled optimally as:

foo1:
cmp w5, 0
ble .L1
mov x6, 0
cntwx7
whilelo p7.s, wzr, w5
.p2align 5,,15
.L3:
ld1wz1.s, p7/z, [x0, x6, lsl 2]
ld1wz0.s, p7/z, [x2, x6, lsl 2]
orr z0.d, z1.d, z0.d
cmpne   p6.s, p7/z, z0.s, #0
cmpeq   p5.s, p7/z, z0.s, #0
ld1wz31.s, p6/z, [x1, x6, lsl 2]
ld1wz30.s, p5/z, [x3, x6, lsl 2]
sel z30.s, p6, z31.s, z30.s
st1wz30.s, p7, [x4, x6, lsl 2]
add x6, x6, x7
whilelo p7.s, w6, w5
b.any   .L3
.L1:
ret

i.e. transform a ? b : (c ? b : d) into (a || c) ? b : d.

This transform is actually also beneficial for scalar, but that's not the case
when one of the conditions has to be inverted, i.e. cases 2 to 4 below are
beneficial for vector masked operations but not for scalar (a quick sanity
check of the identities follows the examples):

/* Convert a ? b : (c ? b : d) into (a || c) ? b : d.  */
void foo1 (int *restrict a, int *restrict b, int *restrict c,
  int *restrict d, int *restrict res, int n)
{
for (int i = 0; i < n; i++)
  res[i] = a[i] ? b[i] : (c[i] ? b[i] : d[i]);
}

/* Convert a ? (c ? b : d) : b into (-a || c) ? b : d.  */
void foo2 (int *restrict a, int *restrict b, int *restrict c,
  int *restrict d, int *restrict res, int n)
{
for (int i = 0; i < n; i++)
  res[i] = a[i] ? (c[i] ? b[i] : d[i]) : b[i];
}

/* Convert a ? (c ? d :b) : b into (-a || -c) ? b : d.  */
void foo3 (int *restrict a, int *restrict b, int *restrict c,
  int *restrict d, int *restrict res, int n)
{
for (int i = 0; i < n; i++)
  res[i] = a[i] ? (c[i] ? d[i] : b[i]) : b[i];
}

/* Convert a ? b : (c ? d : b) into (a || -c) ? b : d.  */
void foo4 (int *restrict a, int *restrict b, int *restrict c,
  int *restrict d, int *restrict res, int n)
{
for (int i = 0; i < n; i++)
  res[i] = a[i] ? b[i] : (c[i] ? d[i] : b[i]);
}
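For completeness, a quick scalar sanity check of the four rewrites (plain C,
exhaustively checking the two condition bits; illustrative only):

#include <assert.h>

int main (void)
{
  const int b = 2, d = 3;
  for (int a = 0; a <= 1; a++)
    for (int c = 0; c <= 1; c++)
      {
        assert ((a ? b : (c ? b : d)) == ((a || c) ? b : d));   /* foo1 */
        assert ((a ? (c ? b : d) : b) == ((!a || c) ? b : d));  /* foo2 */
        assert ((a ? (c ? d : b) : b) == ((!a || !c) ? b : d)); /* foo3 */
        assert ((a ? b : (c ? d : b)) == ((a || !c) ? b : d));  /* foo4 */
      }
  return 0;
}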

For instance, case 3 is currently vectorized as:

foo3:
cmp w5, 0
ble .L10
mov x6, 0
whilelo p7.s, wzr, w5
ptrue   p3.b, all
.L12:
ld1wz1.s, p7/z, [x0, x6, lsl 2]
cmpeq   p5.s, p7/z, z1.s, #0
cmpne   p6.s, p7/z, z1.s, #0
ld1wz29.s, p5/z, [x1, x6, lsl 2]
ld1wz0.s, p6/z, [x2, x6, lsl 2]
cmpne   p15.s, p3/z, z0.s, #0
cmpeq   p4.s, p6/z, z0.s, #0
and p5.b, p15/z, p6.b, p6.b
ld1wz30.s, p4/z, [x1, x6, lsl 2]
ld1wz31.s, p5/z, [x3, x6, lsl 2]
sel z30.s, p15, z31.s, z30.s
sel z30.s, p6, z30.s, z29.s
st1wz30.s, p7, [x4, x6, lsl 2]
incwx6
whilelo p7.s, w6, w5
b.any   .L12

but can be

foo3:
cmp w5, 0
ble .L10
mov x6, 0
cntwx7
whilelo p6.s, wzr, w5
ptrue   p5.b, all
.p2align 5,,15
.L12:
ld1wz29.s, p6/z, [x0, x6, lsl 2]
ld1wz28.s, p6/z, [x2, x6, lsl 2]
cmpeq   p15.s, p5/z, z29.s, #0
cmpeq   p14.s, p5/z, z28.s, #0
orr p15.b, p5/z, p15.b, p14.b
and p4.b, p6/z, p15.b, p15.b
bic p7.b, p5/z, p6.b, p15.b
ld1wz31.s, p4/z, [x1, x6, lsl 2]
ld1wz30.s, p7/z, [x3, x6, lsl 2]
  

[Bug tree-optimization/115531] vectorizer generates inefficient code for masked conditional update loops

2024-06-24 Thread tnfchris at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=115531

Tamar Christina  changed:

   What|Removed |Added

   Assignee|unassigned at gcc dot gnu.org  |tnfchris at gcc dot 
gnu.org
 Status|NEW |ASSIGNED

[Bug c++/115623] ICE: Segmentation fault in finish_for_cond with novector and almost infinite loop

2024-06-24 Thread tnfchris at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=115623

--- Comment #4 from Tamar Christina  ---
novect3.c: In function 'void f(char*, int)':
novect3.c:4:9: error: missing loop condition in loop with 'GCC novector' pragma
before ';' token
4 |   for (;;i++)
  | 

should do it; will send a patch later today.

[Bug c++/115623] ICE: Segmentation fault in finish_for_cond with novector and almost infinite loop

2024-06-24 Thread tnfchris at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=115623

Tamar Christina  changed:

   What|Removed |Added

 Status|NEW |ASSIGNED
   Assignee|unassigned at gcc dot gnu.org  |tnfchris at gcc dot 
gnu.org

--- Comment #3 from Tamar Christina  ---
It looks like cp_parser_c_for is missing the handling for novector.

Mine.

[Bug tree-optimization/115120] Bad interaction between ivcanon and early break vectorization

2024-06-24 Thread tnfchris at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=115120

--- Comment #4 from Tamar Christina  ---
You asked why this doesn't happen with a normal vector loop Richi.

For a normal loop when IVcannon adds the downward counting loop there are two
main differences.

1. For a single exit loop, the downward IV is the main IV, which we ignore as
the vectorizer replaces the loop exit condition with a bound iteration check.

2. When we peel, the main loop has a known iteration count.  So the starting
downward IV for the scalar loop is a known constant.  That means we statically
compute the start of the IV.  As such there's no data-flow for this downwards
counting IV from the main loop into the scalar loop.

i.e. in this loop:

   [local count: 1063004408]:
  # i_8 = PHI 
  # ivtmp_2 = PHI 
  res[i_8] = i_8;
  i_5 = i_8 + 1;
  ivtmp_1 = ivtmp_2 - 1;
  if (ivtmp_1 != 0)
goto ; [98.99%]
  else
goto ; [1.01%]

   [local count: 1052266995]:
  goto ; [100.00%]

when we vectorize the final loop looks like:

   [local count: 1063004408]:
  # i_8 = PHI 
  # ivtmp_2 = PHI 
  # vect_vec_iv_.6_19 = PHI <_20(5), { 0, 1, 2, 3 }(2)>
  # vectp_res.7_21 = PHI 
  # ivtmp_24 = PHI 
  _20 = vect_vec_iv_.6_19 + { 4, 4, 4, 4 };
  MEM  [(int *)vectp_res.7_21] = vect_vec_iv_.6_19;
  i_5 = i_8 + 1;
  ivtmp_1 = ivtmp_2 - 1;
  vectp_res.7_22 = vectp_res.7_21 + 16;
  ivtmp_25 = ivtmp_24 + 1;
  if (ivtmp_25 < 271)
goto ; [98.99%]
  else
goto ; [1.01%]

   [local count: 1052266995]:
  goto ; [100.00%]

   [local count: 10737416]:
  # i_16 = PHI 
  # ivtmp_17 = PHI 

   [local count: 32212248]:
  # i_7 = PHI 
  # ivtmp_11 = PHI 
  res[i_7] = i_7;
  i_13 = i_7 + 1;
  ivtmp_14 = ivtmp_11 - 1;
  if (ivtmp_14 != 0)
goto ; [66.67%]
  else
goto ; [33.33%]

   [local count: 21474835]:
  goto ; [100.00%]

For the vector code neither assumption holds anymore.

1.  The vectorizer may pick an exit other than the one with the downwards
counting IV, in particular if the early exit has a known iteration count lower
than the main exit.

2.  Because we don't know which exit the loop takes, we can't tell how many
iterations you have to do at a minimum for the scalar loop.  We only know the
maximum.  As such the reduction into the second loop is:

   [local count: 58465242]:
  # vect_vec_iv_.6_30 = PHI 
  # vect_vec_iv_.7_35 = PHI 
  _36 = BIT_FIELD_REF ;
  ivtmp_26 = _36;
  _31 = BIT_FIELD_REF ;
  i_25 = _31;
  goto ; [100.00%]

   [local count: 214528238]:
  # i_3 = PHI 
  # ivtmp_17 = PHI 

Since we don't know the iteration count we require both IVs to be live.  The
downcounting IV is live because the scalar loop needs a starting point, and the
incrementing IV is live due to addressing mode usages.

This means neither can be removed.

In the single exit case, the downward IV is only used for loop control:

   [local count: 32212248]:
  # i_7 = PHI 
  # ivtmp_11 = PHI 
  res[i_7] = i_7;
  i_13 = i_7 + 1;
  ivtmp_14 = ivtmp_11 - 1;
  if (ivtmp_14 != 0)
goto ; [66.67%]
  else
goto ; [33.33%]

   [local count: 21474835]:
  goto ; [100.00%]

and so IVopts rewrites the addressing mode usages of `i` into

   [local count: 32212248]:
  # ivtmp.12_2 = PHI 
  _5 = (unsigned int) ivtmp.12_2;
  i_7 = (int) _5;
  MEM[(int *)&res + ivtmp.12_2 * 4] = i_7;
  ivtmp.12_8 = ivtmp.12_2 + 1;
  if (ivtmp.12_8 != 1087)
goto ; [66.67%]
  else
goto ; [33.33%]

   [local count: 21474835]:
  goto ; [100.00%]

and rewrites the loop back into an incrementing loop.  This also happens for
the early exit loop, which is why the scalar code doesn't have the double IVs.

But in the vector loop we have this issue because the second IV needs to stay
live.

We might be able to rewrite the vector IVs in IVopts as you say; however, not
only does IVopts not rewrite vector IVs, it also doesn't rewrite multiple exit
loops in general.

It has two checks:

  /* Make sure that the loop iterates till the loop bound is hit, as otherwise
 the calculation of the BOUND could overflow, making the comparison
 invalid.  */
  if (!data->loop_single_exit_p)
return false;

and it seems to lose a lot of information when niter_for_single_dom_exit (..)
is null.  It seems that in order for this to work correctly IVopts needs to
know which exit we've chosen in the vectorizer, i.e. I think it would have
issues with a PEELED loop.

We also have the problem where both IVs are required:

int arr[1024];
int f()
{
int i;
for (i = 0; i < 1024; i++)
  if (arr[i] == 42)
return i;
return *(arr + i);
}

but with the downward counting IV enabled, we get a much more complicated
latch.

> Note this isn't really because of IVCANON but because the IV is live.  
> IVCANON adds a downward counting IV historically to enable RTL doloop 
> transforms.

IVopts currently has:

  /* Similar to doloop_optimize, check iteration description to know it's
 suitable or not.  Keep it as simple as possible, feel free to extend it
 if you find any multiple exits cases matter.  */
  edge e

[Bug middle-end/115597] [15 Regression] vectorizer takes 20+ h compiling 510.parest in SPECCPU2017 since g:46bb4ce4d30ab749d40f6f4cef6f1fb7c7813452

2024-06-23 Thread tnfchris at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=115597

--- Comment #4 from Tamar Christina  ---
(In reply to Richard Biener from comment #2)
> Ah, I feared this would happen - this case seems to be because of a lot of
> VEC_PERM nodes(?) which are not handled by the CSE process as well as the
> two-operator nodes which lack SLP_TREE_SCALAR_STMTS (we'd need NULL elements
> there, something I need to add anyway).
> 
> The bst_map deals as "visited" map, but nodes not handled there would need
> a "visited" set (but as said above, the plan is to reduce that set to zero).
> 

Ah I see, that makes sense.

> I'll see to reproduce to confirm.  Usually a two-operator node shouldn't
> be too bad since the next non-two-operator one will serve as 'visited' point
> but in this graph we have several adjacent two-operator nodes without any
> intermediate node handled by the bst-map processing code.  I can't reproduce
> with -Ofast -march=znver2 though.
> 

Yeah I forgot to mention I could only reproduce it with LTO and a recent glibc.

Thanks for the fix!

[Bug middle-end/115597] [15 Regression] vectorizer takes 20+ h compiling 510.parest in SPECCPU2017 since g:46bb4ce4d30ab749d40f6f4cef6f1fb7c7813452

2024-06-23 Thread tnfchris at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=115597

--- Comment #3 from Tamar Christina  ---
> 
> Can you check whether that fixes the issue?
> 
> diff --git a/gcc/tree-vect-slp.cc b/gcc/tree-vect-slp.cc
> index 9465d94de1a..212d5f97f7d 100644
> --- a/gcc/tree-vect-slp.cc
> +++ b/gcc/tree-vect-slp.cc
> @@ -6085,7 +6085,6 @@ static void
>  vect_cse_slp_nodes (scalar_stmts_to_slp_tree_map_t *bst_map, slp_tree& node)
>  {
>if (SLP_TREE_DEF_TYPE (node) == vect_internal_def
> -  && SLP_TREE_CODE (node) != VEC_PERM_EXPR
>/* Besides some VEC_PERM_EXPR, two-operator nodes also
>  lack scalar stmts and thus CSE doesn't work via bst_map.  Ideally
>  we'd have sth that works for all internal and external nodes.  */

Yeah, that seems to do it; I can compile SPECFP again.

[Bug middle-end/115597] New: [15 Regression] vectorizer takes 20+ h compiling 510.parest in SPECCPU2017 since g:46bb4ce4d30ab749d40f6f4cef6f1fb7c7813452

2024-06-23 Thread tnfchris at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=115597

Bug ID: 115597
   Summary: [15 Regression] vectorizer takes 20+ h compiling
510.parest in SPECCPU2017 since
g:46bb4ce4d30ab749d40f6f4cef6f1fb7c7813452
   Product: gcc
   Version: 15.0
Status: UNCONFIRMED
  Keywords: compile-time-hog
  Severity: normal
  Priority: P3
 Component: middle-end
  Assignee: unassigned at gcc dot gnu.org
  Reporter: tnfchris at gcc dot gnu.org
CC: rguenth at gcc dot gnu.org
  Target Milestone: ---
Target: aarch64*

Created attachment 58496
  --> https://gcc.gnu.org/bugzilla/attachment.cgi?id=58496&action=edit
slp dump graph

Since:

commit 46bb4ce4d30ab749d40f6f4cef6f1fb7c7813452 (HEAD)
Author: Richard Biener 
Date:   Wed Jun 19 12:57:27 2024 +0200

tree-optimization/114413 - SLP CSE after permute optimization

We currently fail to re-CSE SLP nodes after optimizing permutes
which results in off cost estimates.  For gcc.dg/vect/bb-slp-32.c
this shows in not re-using the SLP node with the load and arithmetic
for both the store and the reduction.  The following implements
CSE by re-bst-mapping nodes as finalization part of vect_optimize_slp.

I've tried to make the CSE part of permute materialization but it
isn't a very good fit there.  I've not bothered to implement something
more complete, also handling external defs or defs without
SLP_TREE_SCALAR_STMTS.

I realize this might result in more BB SLP which in turn might slow
down code given costing for BB SLP is difficult (even that we now
vectorize gcc.dg/vect/bb-slp-32.c on x86_64 might be not a good idea).
This is nevertheless feeding more accurate info to costing which is
good.

PR tree-optimization/114413
* tree-vect-slp.cc (release_scalar_stmts_to_slp_tree_map):
New function, split out from ...
(vect_analyze_slp): ... here.  Call it.
(vect_cse_slp_nodes): New function.
(vect_optimize_slp): Call it.

* gcc.dg/vect/bb-slp-32.c: Expect CSE and vectorization on x86.

Compilation takes an extremely long time in 510.parest_r.

The problem seems to be that vect_cse_slp_nodes visits the same nodes twice.
It looks like the function has no visited set, and the hot loop in parest (when
vectorizable thanks to libmvec) has many TWO_OPERANDS nodes and one of them is
rooted at the top level.

vect_cse_slp_nodes seems to skip VEC_PERM_EXPR but not its children; as such
it ends up visiting the same subgraphs multiple times.  The graph in parest has
so many TWO_OPERAND nodes that essentially compilation never finishes.

I believe this function needs a visited node set.
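The shape of the fix, as a self-contained illustration (the names here are
illustrative, not the actual GCC internals; in the vectorizer this would of
course be a hash set rather than a linear scan):

#include <stddef.h>

struct node { struct node *child[2]; };

/* Walk a DAG visiting each node once: without the VISITED guard, shared
   subgraphs (here standing in for the subgraphs hanging off VEC_PERM_EXPR
   nodes) get re-walked once per path, which is what makes compile time
   explode.  */
static void
walk (struct node *n, struct node *visited[], size_t *n_visited)
{
  if (!n)
    return;
  for (size_t i = 0; i < *n_visited; i++)
    if (visited[i] == n)
      return;                      /* already processed this subgraph */
  visited[(*n_visited)++] = n;     /* mark before recursing */
  walk (n->child[0], visited, n_visited);
  walk (n->child[1], visited, n_visited);
}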

example call graph:

#334 0x018a1e14 in vect_cse_slp_nodes (bst_map=0x41627a0,
node=@0x4132f40: 0x3df2ec0) at
/opt/buildAgent/work/5c94c4ced6ebfcd0/gcc/tree-vect-slp.cc:6111
#335 0x018a1e14 in vect_cse_slp_nodes (bst_map=0x41627a0,
node=@0x41321a0: 0x3df2b90) at
/opt/buildAgent/work/5c94c4ced6ebfcd0/gcc/tree-vect-slp.cc:6111
#336 0x018a1e14 in vect_cse_slp_nodes (bst_map=0x41627a0,
node=@0x4130b00: 0x3df2860) at
/opt/buildAgent/work/5c94c4ced6ebfcd0/gcc/tree-vect-slp.cc:6111
#337 0x018a1e14 in vect_cse_slp_nodes (bst_map=0x41627a0,
node=@0x41348a0: 0x3df2530) at
/opt/buildAgent/work/5c94c4ced6ebfcd0/gcc/tree-vect-slp.cc:6111
#338 0x018a1e14 in vect_cse_slp_nodes (bst_map=0x41627a0,
node=@0x3b8b0d0: 0x3df2310) at
/opt/buildAgent/work/5c94c4ced6ebfcd0/gcc/tree-vect-slp.cc:6111
#339 0x018a1e14 in vect_cse_slp_nodes (bst_map=0x41627a0,
node=@0x41348f0: 0x3dee928) at
/opt/buildAgent/work/5c94c4ced6ebfcd0/gcc/tree-vect-slp.cc:6111
#340 0x018a1e14 in vect_cse_slp_nodes (bst_map=0x41627a0,
node=@0x4134500: 0x3dee460) at
/opt/buildAgent/work/5c94c4ced6ebfcd0/gcc/tree-vect-slp.cc:6111
#341 0x018a1e14 in vect_cse_slp_nodes (bst_map=0x41627a0,
node=@0x3c14600: 0x3ded690) at
/opt/buildAgent/work/5c94c4ced6ebfcd0/gcc/tree-vect-slp.cc:6111
#342 0x018a1e14 in vect_cse_slp_nodes (bst_map=0x41627a0,
node=@0x3ca75f0: 0x3de7910) at
/opt/buildAgent/work/5c94c4ced6ebfcd0/gcc/tree-vect-slp.cc:6111
#343 0x018a1e14 in vect_cse_slp_nodes (bst_map=0x41627a0,
node=@0x3e28590: 0x3de8768) at
/opt/buildAgent/work/5c94c4ced6ebfcd0/gcc/tree-vect-slp.cc:6111
#344 0x018a1e14 in vect_cse_slp_nodes (bst_map=0x41627a0,
node=@0x3c2e4b8: 0x3de7778) at
/opt/buildAgent/work/5c94c4ced6ebfcd0/gcc/tree-vect-slp.cc:6111
#345 0x018a1e14 in vect_cse_slp_nodes (bst_map=0x41627a0,
node=@0x3da5e58: 0x3de7dd8) at
/opt/buildAgent/work/5c94c4ced6ebfcd0/gcc/tree-vect-slp.cc:6111
#346 0x018a1e14 in vect_cse_slp_nodes (bst_map=0x41627a0,
node=@0x41d0770: 0x3de7f70) at
/opt/buildAgent/work/5c94c4ced6ebfcd0/gcc/tree-vect-slp.cc:6111
#347 0x01

[Bug middle-end/115534] intermediate stack use not eliminated

2024-06-18 Thread tnfchris at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=115534

--- Comment #5 from Tamar Christina  ---
(In reply to Andrew Pinski from comment #4)
> This might be improved by
> https://gcc.gnu.org/pipermail/gcc-patches/2024-June/654819.html . Or it
> might be the case the vectorizer case needs to be improved afterwards. But I
> think that is the infrastructure for fixing this issue.

Yeah Richard pointed me to this today as well. The vectorizer case is a bit
unique because the vectorizer has packed scalar values in two vector registers.

So yeah, I think it's likely some work will be needed afterwards, but we'll
see after the fsra patch lands :)

[Bug tree-optimization/115537] [15 Regression] vectorizable_reduction ICEs after g:d66b820f392aa9a7c34d3cddaf3d7c73bf23f82d

2024-06-18 Thread tnfchris at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=115537

--- Comment #5 from Tamar Christina  ---
Thanks for the fix!

I think the testcase needs SVE enabled to ICE, no?
Shouldn't that be -mcpu=neoverse-v1 rather than -mcpu=neoverse-n1?

[Bug middle-end/115534] intermediate stack use not eliminated

2024-06-18 Thread tnfchris at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=115534

--- Comment #2 from Tamar Christina  ---
(In reply to Andrew Pinski from comment #1)
> I suspect there is a dup of this already. See the bug which I made this one
> blocking for a list of related bugs.

Most of the other bugs relate to the argument expansions; this one, however,
shouldn't need the intermediate stack regardless of the expansion itself.

I think there are various other ways the operation could have been kept in a
gimple register.

[Bug tree-optimization/115537] New: [15 Regression] vectorizable_reduction ICEs after g:d66b820f392aa9a7c34d3cddaf3d7c73bf23f82d

2024-06-18 Thread tnfchris at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=115537

Bug ID: 115537
   Summary: [15 Regression] vectorizable_reduction ICEs after
g:d66b820f392aa9a7c34d3cddaf3d7c73bf23f82d
   Product: gcc
   Version: 15.0
Status: UNCONFIRMED
  Keywords: ice-on-valid-code
  Severity: normal
  Priority: P3
 Component: tree-optimization
  Assignee: unassigned at gcc dot gnu.org
  Reporter: tnfchris at gcc dot gnu.org
CC: rguenth at gcc dot gnu.org
  Target Milestone: ---

testcase:

---
char *a;
int b;
void c() {
  int d = 0, e = 0, f;
  for (; f; ++f)
if (a[f] == 5)
  ;
else if (a[f])
  e = 1;
else
  d = 1;
  if (d)
if (e)
  b = 0;
}
---

compiled with -mcpu=neoverse-v1 -O3 produces the following ICE:

during GIMPLE pass: vect
pngrtran.i: In function 'c':
pngrtran.i:3:6: internal compiler error: in vectorizable_reduction, at
tree-vect-loop.cc:8335
3 | void c() {
  |  ^
0xff74ff vectorizable_reduction(_loop_vec_info*, _stmt_vec_info*, _slp_tree*,
_slp_instance*, vec*)
/opt/buildAgent/work/5c94c4ced6ebfcd0/gcc/tree-vect-loop.cc:8335
0x1b598f7 vect_analyze_stmt(vec_info*, _stmt_vec_info*, bool*, _slp_tree*,
_slp_instance*, vec*)
/opt/buildAgent/work/5c94c4ced6ebfcd0/gcc/tree-vect-stmts.cc:13353
0x10225df vect_slp_analyze_node_operations_1
/opt/buildAgent/work/5c94c4ced6ebfcd0/gcc/tree-vect-slp.cc:6457
0x10225df vect_slp_analyze_node_operations
/opt/buildAgent/work/5c94c4ced6ebfcd0/gcc/tree-vect-slp.cc:6656
0x102253f vect_slp_analyze_node_operations
/opt/buildAgent/work/5c94c4ced6ebfcd0/gcc/tree-vect-slp.cc:6635
0x1023ec3 vect_slp_analyze_operations(vec_info*)
/opt/buildAgent/work/5c94c4ced6ebfcd0/gcc/tree-vect-slp.cc:7053
0xff816f vect_analyze_loop_2
/opt/buildAgent/work/5c94c4ced6ebfcd0/gcc/tree-vect-loop.cc:2953
0xff9fb7 vect_analyze_loop_1
/opt/buildAgent/work/5c94c4ced6ebfcd0/gcc/tree-vect-loop.cc:3484
0xffa6f7 vect_analyze_loop(loop*, vec_info_shared*)
/opt/buildAgent/work/5c94c4ced6ebfcd0/gcc/tree-vect-loop.cc:3642
0x1035547 try_vectorize_loop_1
/opt/buildAgent/work/5c94c4ced6ebfcd0/gcc/tree-vectorizer.cc:1067
0x1035547 try_vectorize_loop
/opt/buildAgent/work/5c94c4ced6ebfcd0/gcc/tree-vectorizer.cc:1183
0x1035a5b execute
/opt/buildAgent/work/5c94c4ced6ebfcd0/gcc/tree-vectorizer.cc:1299
Please submit a full bug report, with preprocessed source (by using
-freport-bug).
Please include the complete backtrace with any bug report.
See <https://gcc.gnu.org/bugs/> for instructions.

after:

commit d66b820f392aa9a7c34d3cddaf3d7c73bf23f82d
Author: Richard Biener 
Date:   Thu Jun 13 14:42:25 2024 +0200

Support single def-use cycle optimization for SLP reduction vectorization

We can at least mimic single def-use cycle optimization when doing
single-lane SLP reductions and that's required to avoid regressing
compared to non-SLP.

* tree-vect-loop.cc (vectorizable_reduction): Allow
single-def-use cycles with SLP.
(vect_transform_reduction): Handle SLP single def-use cycles.
(vect_transform_cycle_phi): Likewise.

* gcc.dg/vect/slp-reduc-12.c: New testcase.

 gcc/testsuite/gcc.dg/vect/slp-reduc-12.c | 18 +
 gcc/tree-vect-loop.cc| 45 +++-
 2 files changed, 45 insertions(+), 18 deletions(-)
 create mode 100644 gcc/testsuite/gcc.dg/vect/slp-reduc-12.c
bisect run success

looks like it's hitting the assert:

  gcc_assert (op.code != COND_EXPR);

[Bug tree-optimization/115534] New: intermediate stack use not eliminated

2024-06-18 Thread tnfchris at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=115534

Bug ID: 115534
   Summary: intermediate stack use not eliminated
   Product: gcc
   Version: 15.0
Status: UNCONFIRMED
  Keywords: missed-optimization
  Severity: normal
  Priority: P3
 Component: tree-optimization
  Assignee: unassigned at gcc dot gnu.org
  Reporter: tnfchris at gcc dot gnu.org
  Target Milestone: ---

Consider the following example:

#include <stdint.h>

typedef struct _pixel_t
{
  double red, green, blue, opacity;
} pixel_t;

typedef struct _PixelPacket
{
  unsigned short blue, green, red, opacity;
} PixelPacket;

pixel_t f (unsigned height, unsigned width, unsigned virt_width,
   uint8_t *restrict k, const PixelPacket *restrict k_pixels)
{
pixel_t result = {};
for (unsigned u=0; u < (width & -4); u++, k--) {
result.red += (*k)*k_pixels[u].red;
result.green   += (*k)*k_pixels[u].green;
result.blue+= (*k)*k_pixels[u].blue;
result.opacity += (*k)*k_pixels[u].opacity;
k_pixels += virt_width;
}
return result;
}

---

compiled with -O3 vectorizes well, but the epilogue code is very
inefficient:

faddv29.2d, v29.2d, v30.2d
faddv28.2d, v28.2d, v31.2d
cmp w5, w1
bhi .L3
mov v31.16b, v28.16b
ins v31.d[1], v29.d[1]
ins v29.d[1], v28.d[1]
stp q31, q29, [sp, 32]
ldp d0, d1, [sp, 32]
ldp d2, d3, [sp, 48]
add sp, sp, 64
ret
.L4:
moviv29.2d, 0
mov v31.16b, v29.16b
stp q31, q29, [sp, 32]
ldp d0, d1, [sp, 32]
ldp d2, d3, [sp, 48]
add sp, sp, 64
ret

i.e. it goes through the stack to create the return registers.  This is
because at the gimple level we still have the store:

   [local count: 105119324]:
  _33 = VEC_PERM_EXPR ;
  _31 = VEC_PERM_EXPR ;

   [local count: 118111600]:
  # vect_result_red_64.18_28 = PHI <_33(5), { 0.0, 0.0 }(2)>
  # vect_result_red_64.18_105 = PHI <_31(5), { 0.0, 0.0 }(2)>
  MEM  [(double *)&D.4535] = vect_result_red_64.18_28;
  MEM  [(double *)&D.4535 + 16B] = vect_result_red_64.18_105;
  return D.4535;

clang is able to generate much better code here:

faddv0.2d, v0.2d, v1.2d
faddv2.2d, v2.2d, v3.2d
b.ne.LBB0_2
.LBB0_3:
mov d1, v2.d[1]
mov d3, v0.d[1]
ret

The vectorized code gets reg-alloc'ed so that d0 and d2 are already in the
right registers at the end of the vector loop, and the epilogue only has to
split the registers up to get d1 and d3.

I think we would generate the same if we were to elide the intermediate stack
store.

See https://godbolt.org/z/ocqchWWs5

[Bug tree-optimization/115531] vectorizer generates inefficient code for masked conditional update loops

2024-06-17 Thread tnfchris at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=115531

--- Comment #3 from Tamar Christina  ---
(In reply to Andrew Pinski from comment #1)
> I suspect PR 20999 would fix this ...
> but we have to be careful since without masked stores, you could still
> vectorize this unlike the transformed version.
> 
> Maybe ifcvt can produce a masked store version if this pattern ...

Doing so during ifcvt forces you to commit to a masked operation, so you lose
the ability to not vectorize for non-fully-masked architectures.

So it's too early.  A vector pattern doesn't have this problem.  This question
was mostly about to what degree the vectorizer supports MASK_STORE as an
input.  vect_get_vector_types_for_stmt seems to have support for it, so it
looks like it may work.

[Bug tree-optimization/115531] New: vectorizer generates inefficient code for masked conditional update loops

2024-06-17 Thread tnfchris at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=115531

Bug ID: 115531
   Summary: vectorizer generates inefficient code for masked
conditional update loops
   Product: gcc
   Version: 15.0
Status: UNCONFIRMED
  Keywords: missed-optimization
  Severity: normal
  Priority: P3
 Component: tree-optimization
  Assignee: unassigned at gcc dot gnu.org
  Reporter: tnfchris at gcc dot gnu.org
  Target Milestone: ---

The following code:

void __attribute__((noipa))
foo (char *restrict a, int *restrict b, int *restrict c, int n, int stride)
{
  if (stride <= 1)
return;

  for (int i = 0; i < n; i++)
{
  int res = c[i];
  int t = b[i+stride];
  if (a[i] != 0)
res = t;
  c[i] = res;
}
}

generates at -O3 -g0 -mcpu=generic+sve:

.L3:
ld1bz29.s, p7/z, [x0, x5]
ld1wz31.s, p7/z, [x2, x5, lsl 2]
ld1wz30.s, p7/z, [x1, x5, lsl 2]
cmpne   p15.b, p6/z, z29.b, #0
sel z30.s, p15, z30.s, z31.s
st1wz30.s, p7, [x2, x5, lsl 2]
add x5, x5, x4
whilelo p7.s, w5, w3
b.any   .L3
.L1:

and makes vectorization unprofitable until very high iteration counts of n.
This is because the vector code has more instructions than needed.

Since it's a masked store, whenever a value is being conditionally set we don't
need the intermediate VEC_COND_EXPR.  This loop can be vectorized as:

.L3:
ld1bz29.s, p7/z, [x0, x5]
ld1wz31.s, p7/z, [x2, x5, lsl 2]
cmpne   p4.b, p6/z, z29.b, #0
st1wz31.s, p4, [x2, x5, lsl 2]
add x5, x5, x4
whilelo p7.s, w5, w3
b.any   .L3
.L1:
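At the source level the better codegen corresponds to treating the conditional
update as a conditional store (a sketch of the equivalence, not a literal
transformation; foo_equiv is a hypothetical name):

void
foo_equiv (char *restrict a, int *restrict b, int *restrict c,
           int n, int stride)
{
  if (stride <= 1)
    return;

  /* res = c[i]; if (a[i]) res = t; c[i] = res;  is the same as only
     storing t when a[i] is nonzero, which maps directly to a masked
     store governed by the combined predicate.  */
  for (int i = 0; i < n; i++)
    if (a[i] != 0)
      c[i] = b[i + stride];
}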

I have prototyped a load-to-store forwarding optimization in forwprop but am
looking to move it into the vectorizer to cost it properly; however, I'm not
entirely sure what the best way to do so is.

I can certainly fix it up during codegen but to cost it I need to do so during
analysis. I could detect it during vectorizable_condition but then the dead
load is still costed. Or I could maybe use a pattern, but unsure how to
represent the mask into the load.

Is it valid to produce a pattern with .IFN_MASK_STORE?

[Bug target/115464] [14 Backport] ICE when building libaom on arm64 (neon sve bridge usage with tbl/perm)

2024-06-13 Thread tnfchris at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=115464

--- Comment #10 from Tamar Christina  ---
Thanks for the fix, but I don't think it's sufficient.

what I meant with the earlier comment was that the subregs are broken in
general, so not just the one generated by the undef fast path.

i.e.

#include <arm_neon.h>
#include <arm_sve.h>
#include <arm_neon_sve_bridge.h>

svuint16_t
convolve4_4_x (uint16x8x2_t permute_tbl, svuint16_t a)
{
return svset_neonq_u16 (a, permute_tbl.val[1]);
}

seems to still ICE for me because it goes into the general expander which
produces the same subreg.

[Bug target/115464] ICE when building libaom on arm64 (neon sve bridge usage with tbl/perm)

2024-06-12 Thread tnfchris at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=115464

--- Comment #7 from Tamar Christina  ---
(In reply to Tamar Christina from comment #6)
> (In reply to Richard Sandiford from comment #5)
> > In this kind of situation, we should go through a fresh pseudo rather than
> > try to take the subreg directly.
> 
> I did try that but fwprop pushed it back in.

Ahh no, I used force_reg, doh.. Fair.

[Bug target/115464] ICE when building libaom on arm64 (neon sve bridge usage with tbl/perm)

2024-06-12 Thread tnfchris at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=115464

--- Comment #6 from Tamar Christina  ---
(In reply to Richard Sandiford from comment #5)
> In this kind of situation, we should go through a fresh pseudo rather than
> try to take the subreg directly.

I did try that but fwprop pushed it back in.

[Bug target/115464] ICE when building libaom on arm64 (neon sve bridge usage with tbl/perm)

2024-06-12 Thread tnfchris at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=115464

Tamar Christina  changed:

   What|Removed |Added

 CC||rsandifo at gcc dot gnu.org

--- Comment #4 from Tamar Christina  ---
Looks like the Neon lowpart optimization doesn't take into account aggregate
vectors.

That means the simplest reproduction is:

#include <arm_neon.h>
#include <arm_sve.h>
#include <arm_neon_sve_bridge.h>

svuint16_t
convolve4_4_x (uint16x8x2_t permute_tbl)
{
return svset_neonq_u16 (svundef_u16 (), permute_tbl.val[1]);
}

This generates a subreg from E_V2x8HImode to E_VNx8HImode.

This subreg is an invalid paradoxical subreg as there's no strict ordered
relationship between the modes.

This fails because ordered_p does not have a relationship defined between a
poly vector
and an aggregate non-poly vector.

I think one should probably be provided; essentially the NEON<->SVE bridge is
broken for aggregate types in general at the moment.  I don't know the exact
semantics for poly-ints.

I tried patching ordered_p but the ICE just moves. Any thoughts Richard?

[Bug target/115464] ICE when building libaom on arm64 (neon sve bridge usage with tbl/perm)

2024-06-12 Thread tnfchris at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=115464

Tamar Christina  changed:

   What|Removed |Added

   Last reconfirmed||2024-06-12
 CC||tnfchris at gcc dot gnu.org
 Status|UNCONFIRMED |NEW
 Ever confirmed|0   |1

--- Comment #3 from Tamar Christina  ---
Confirmed, slightly more cleaned up example:

#include <arm_neon.h>
#include <arm_sve.h>
#include <arm_neon_sve_bridge.h>

int16x8_t
convolve4_4_x (uint16x8x2_t permute_tbl, svint16_t res)
{
return svget_neonq_s16 (
svtbl_s16 (res,
   svset_neonq_u16 (svundef_u16 (), permute_tbl.val[0])));
}

[Bug tree-optimization/114932] IVopts inefficient handling of signed IV used for addressing.

2024-06-06 Thread tnfchris at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=114932

--- Comment #15 from Tamar Christina  ---
(In reply to rguent...@suse.de from comment #14)
> On Thu, 6 Jun 2024, tnfchris at gcc dot gnu.org wrote:
> 
> > https://gcc.gnu.org/bugzilla/show_bug.cgi?id=114932
> > 
> > --- Comment #13 from Tamar Christina  ---
> > (In reply to rguent...@suse.de from comment #12) 
> > > > since we don't care about overflow here, it looks like the stripping 
> > > > should
> > > > be recursive as long as it's a NOP expression between two integral 
> > > > types.
> > > > 
> > > > That would get them to hash to the same IV expression.  Trying now..
> > > 
> > > Note tree-affine is a tool that's used for this kind of "weak" equalities.
> > > Convert both to affine, subtract them and if that's zero they are equal.
> > 
> > Hmm that's useful, though in this case this becomes the actual expression 
> > that
> > IVOpts uses.
> > 
> > For instance this is done in alloc_iv and add_iv_candidate_for_use when
> > determining the uses for the IV.
> > 
> > It looks like it's trying to force a canonical representation with as 
> > minimum
> > casting as possible.
> > 
> > would the "affine"'ed tree be safe to use for this context?
> 
> No, I'd use that just for the comparison.
> 
> > What I've done currently is make a STRIP_ALL_NOPS that recursively strips 
> > NOPs
> > for PLUS/MULT/MINUS.
> 
> But the stripped expression isn't necessarily valid to use either because
> of possible undefined overflow.  It's probably safe to pick any of the
> existing expressions (all are evaluated at each iteration), but if you
> strip all NOPs from all of them you might end up with new undefined
> behavior.
> 
> At least if that stripped expression is inserted somewhere or other
> new expressions are built upon it.

Does overflow matter for addressing modes though?  If you have undefined
behavior in your address space then your program would have crashed anyway,
no?

In this case IVOpts would have already stripped away the outer NOPs, so
building upon this one could also cause undefined overflow, can it not?  I.e.
if the IV was ((signed) unsigned_calculation).

[Bug tree-optimization/114932] IVopts inefficient handling of signed IV used for addressing.

2024-06-05 Thread tnfchris at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=114932

--- Comment #13 from Tamar Christina  ---
(In reply to rguent...@suse.de from comment #12) 
> > since we don't care about overflow here, it looks like the stripping should
> > be recursive as long as it's a NOP expression between two integral types.
> > 
> > That would get them to hash to the same IV expression.  Trying now..
> 
> Note tree-affine is a tool that's used for this kind of "weak" equalities.
> Convert both to affine, subtract them and if that's zero they are equal.

Hmm that's useful, though in this case this becomes the actual expression that
IVOpts uses.

For instance this is done in alloc_iv and add_iv_candidate_for_use when
determining the uses for the IV.

It looks like it's trying to force a canonical representation with as little
casting as possible.

would the "affine"'ed tree be safe to use for this context?

What I've done currently is make a STRIP_ALL_NOPS that recursively strips NOPs
for PLUS/MULT/MINUS.

[Bug tree-optimization/114932] IVopts inefficient handling of signed IV used for addressing.

2024-06-05 Thread tnfchris at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=114932

--- Comment #11 from Tamar Christina  ---
(In reply to Richard Biener from comment #10)
> I think the question is why IVOPTs ends up using both the signed and
> unsigned variant of the same IV instead of expressing all uses of both with
> one IV?
> 
> That's where I'd look into.

It looks like this is because of a subtle difference in the expressions.

In get_loop_invariant_expr IVOPTs first tries to strip away all casts with
STRIP_NOPS.

The first expression is (unsigned long) (stride.3_27 * 4) and the second
expression is ((unsigned long) stride.3_27) * 4 (The pretty printing here is
pretty bad...)

So the first one becomes:
  (unsigned long) (stride.3_27 * 4) -> stride.3_27 * 4

and the second one:
  ((unsigned long) stride.3_27) * 4 -> ((unsigned long) stride.3_27) * 4

since we don't care about overflow here, it looks like the stripping should
be recursive as long as it's a NOP expression between two integral types.

That would get them to hash to the same IV expression.  Trying now..
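Written as C for clarity, the two expressions that should hash to the same
inv_expr (they compute the same value whenever the signed multiply doesn't
overflow; function names are illustrative):

unsigned long inv1 (long stride) { return (unsigned long) (stride * 4); }
unsigned long inv2 (long stride) { return (unsigned long) stride * 4; }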

[Bug tree-optimization/54013] Loop with control flow not vectorized

2024-06-05 Thread tnfchris at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=54013

Tamar Christina  changed:

   What|Removed |Added

 Blocks||115130

--- Comment #4 from Tamar Christina  ---
Since there's only one source here, alignment peeling should be enough to
vectorize it.

Our pending patches should support it.  Will add it to the verify list.


Referenced Bugs:

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=115130
[Bug 115130] [meta-bug] early break vectorization

[Bug tree-optimization/114932] IVopts inefficient handling of signed IV used for addressing.

2024-06-05 Thread tnfchris at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=114932

--- Comment #9 from Tamar Christina  ---
It's taken me a bit of time to track down all the reasons for the speedup with
the earlier patch.

This comes from two parts:

1. Signed IVs don't get simplified.  Due to possible UB with signed overflows
gimple expressions don't get simplified when the type is signed.

However for addressing modes it doesn't matter: even after simplifying the
constants, any potential overflow can still happen.  Secondly, most
architectures say you can never reach the full address space range anyway.
Those that do (like those that offer baremetal variants, such as Arm and
AArch64) explicitly specify that overflow is defined as wrapping around.  That
means that IVs, for their use in IVopts, should be safe to simplify as if they
were unsigned.

I have a patch that during the creation of IV candidates folds them to unsigned
and then folds them back to their original signed types.  This maintains all
the original overflow analysis and the correct typing in gimple.

2. The second problem is that due to Fortran not having unsigned types, the
front-end generates a signed IV.  Some optimizations as they work can convert
these to unsigned due to folding, e.g. extract_muldiv is one place where this
is done.

This can make us end up having the same IV as both signed and unsigned, as is
the case here:

<Invariant Expressions>:
inv_expr 1:     stride.3_27 * 4
inv_expr 2:     (unsigned long) stride.3_27 * 4

These end up being used in the same group:

Group 1:
  cand  cost    compl.  inv.expr.       inv.vars
  1     0       0       NIL;            6
  2     0       0       NIL;            6
  3     0       0       NIL;            6
  4     0       0       NIL;            6

which ends up with IV opts picking the signed and unsigned IVs:

Improved to:
  cost: 24 (complexity 3)
  reg_cost: 9
  cand_cost: 15
  cand_group_cost: 0 (complexity 3)
  candidates: 1, 6, 8
   group:0 --> iv_cand:6, cost=(0,1)
   group:1 --> iv_cand:1, cost=(0,0)
   group:2 --> iv_cand:8, cost=(0,1)
   group:3 --> iv_cand:8, cost=(0,1)
  invariant variables: 6
  invariant expressions: 1, 2

and so generates the same IV as both signed and unsigned:

;;   basic block 21, loop depth 3, count 214748368 (estimated locally, freq 58.2545), maybe hot
;;    prev block 28, next block 31, flags: (NEW, REACHABLE, VISITED)
;;    pred:       28 [always]  count:23622320 (estimated locally, freq 6.4080) (FALLTHRU,EXECUTABLE)
;;                25 [always]  count:191126046 (estimated locally, freq 51.8465) (FALLTHRU,DFS_BACK,EXECUTABLE)
  # .MEM_66 = PHI <.MEM_34(28), .MEM_22(25)>
  # ivtmp.22_41 = PHI <0(28), ivtmp.22_82(25)>
  # ivtmp.26_51 = PHI 
  # ivtmp.28_90 = PHI 

...

;;   basic block 24, loop depth 3, count 214748366 (estimated locally, freq 58.2545), maybe hot
;;    prev block 22, 

[Bug target/114860] [14/15 regression] [aarch64] 511.povray regresses by ~5.5% with -O3 -flto -march=native -mcpu=neoverse-v2 since r14-10014-ga2f4be3dae04fa

2024-05-22 Thread tnfchris at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=114860

--- Comment #9 from Tamar Christina  ---
(In reply to prathamesh3492 from comment #8)
> Hi Tamar,
> Using -falign-loops=5 indeed brings back the performance.
> The adrp instruction has same address (0x4ae784) by setting -falign-loops=5
> (which reduces misalignment to 4) with/without a2f4be3dae0. So I guess this
> is really code-alignment issue ?
> 

Indeed, we don't aggressively align loops if they'd require too much padding,
so as not to bloat the binaries.  That's why sometimes you just get unlucky
and the hot loop gets misaligned.

[Bug tree-optimization/115130] (early-break) [meta-bug] early break vectorization

2024-05-17 Thread tnfchris at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=115130

Tamar Christina  changed:

   What|Removed |Added

 Ever confirmed|0   |1
   Last reconfirmed||2024-05-17
 Status|UNCONFIRMED |NEW

[Bug tree-optimization/115130] New: (early-break) [meta-bug] early break vectorization

2024-05-17 Thread tnfchris at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=115130

Bug ID: 115130
   Summary: (early-break) [meta-bug] early break vectorization
   Product: gcc
   Version: 14.0
Status: UNCONFIRMED
  Keywords: meta-bug, missed-optimization
  Severity: normal
  Priority: P3
 Component: tree-optimization
  Assignee: unassigned at gcc dot gnu.org
  Reporter: tnfchris at gcc dot gnu.org
Blocks: 53947
  Target Milestone: ---

Meta tickets about early break vectorization to better keep track of early
break specific issues


Referenced Bugs:

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=53947
[Bug 53947] [meta-bug] vectorizer missed-optimizations

[Bug tree-optimization/115120] Bad interaction between ivcanon and early break vectorization

2024-05-17 Thread tnfchris at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=115120

--- Comment #3 from Tamar Christina  ---
That makes sense, though I also wonder how it works for scalar multi-exit
loops; IVopts has various checks on single exits.

I guess one problem is that the code in IVopts that does this uses the exit to
determine niters.

But in the case of the multiple exits vector code the vectorizer could have
picked a different exit.

So I guess the question is how we even tell which one is used, or whether the
transformation could be driven from the PHI nodes themselves instead of an
exit.

[Bug target/114860] [14/15 regression] [aarch64] 511.povray regresses by ~5.5% with -O3 -flto -march=native -mcpu=neoverse-v2 since r14-10014-ga2f4be3dae04fa

2024-05-16 Thread tnfchris at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=114860

--- Comment #7 from Tamar Christina  ---
Yeah, it's most likely an alignment issue, especially as there are no code
changes.

We run our benchmarking with different flags, which may be why we don't see
it.  The loop seems misaligned; you can try increasing the alignment guarantee
to check, e.g. -falign-loops=5.

But ultimately, I think it's just bad luck.  We don't align loops and labels
if they require too many padding instructions.

[Bug target/114412] [14/15 Regression] 7% slowdown of 436.cactusADM on aarch64

2024-05-16 Thread tnfchris at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=114412

--- Comment #5 from Tamar Christina  ---
(In reply to Filip Kastl from comment #4)
> (In reply to Tamar Christina from comment #3)
> > Hi Filip,
> > 
> > Do you generate these runs with counters based PGO or compiler
> > instrumentation?
> > 
> > Just so I know before I start trying to reproduce them.
> 
> Hi Tamar,
> 
> By counters you mean some sort of hardware counters? I didn't know there
> were multiple ways to do PGO with GCC.
> 
> I think that the answer to your question is "compiler instrumentation". I
> just do -fprofile-generate, run the instrumented binary and then
> -fprofile-use.

Yeah, with some elbow grease the perf record method works too, but it's not
very accurate on Armv8.

I'll try to reproduce and bisect these over the weekend!
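
For reference, a sketch of the two flows being contrasted (file names
illustrative):

  gcc -O3 -fprofile-generate -o bench bench.c
  ./bench     # representative run; writes *.gcda profile data
  gcc -O3 -fprofile-use -o bench bench.c

versus the perf-based route, which goes through AutoFDO (perf record on an
uninstrumented binary, converted with create_gcov and consumed via
-fauto-profile) and depends on hardware branch sampling, which is presumably
why it's less accurate on Armv8.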

[Bug target/115087] New: dead block not eliminated in SVE intrinsics code

2024-05-14 Thread tnfchris at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=115087

Bug ID: 115087
   Summary: dead block not eliminated in SVE intrinsics code
   Product: gcc
   Version: 14.0
Status: UNCONFIRMED
  Keywords: missed-optimization
  Severity: normal
  Priority: P3
 Component: target
  Assignee: unassigned at gcc dot gnu.org
  Reporter: tnfchris at gcc dot gnu.org
  Target Milestone: ---

The testcase in https://gcc.gnu.org/bugzilla/show_bug.cgi?id=114151 had another
"regression" in that the same loop seems to have been peeled, but only for one
iteration.

However, the loops are identical, so the peeling is weird.

This was caused by

f5fb9ff2396fd41fdd2e6d35a412e936d2d42f75 is the first bad commit
commit f5fb9ff2396fd41fdd2e6d35a412e936d2d42f75
Author: Jan Hubicka 
Date:   Fri Jul 28 16:18:32 2023 +0200

loop-split improvements, part 3

extend tree-ssa-loop-split to understand test of the form
 if (i==0)
and
 if (i!=0)
which triggers only during the first iteration.  Naturally we should
also be able to trigger last iteration or split into 3 cases if
the test indeed can fire in the middle of the loop.
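
A sketch (my example, not from the commit) of the shape this handles, where
init and work are hypothetical calls:

for (int i = 0; i < n; i++)
  {
    if (i == 0)
      init ();
    work (i);
  }

Since the i == 0 test can only fire on the first iteration, the loop can be
split into a peeled first iteration plus a test-free remainder:

if (n > 0)
  {
    init ();
    work (0);
    for (int i = 1; i < n; i++)
      work (i);
  }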

However, the commit is innocent; it looks like we're no longer below some magic
threshold, which is what triggers the issue.

However a simpler testcase:

#include <arm_sve.h>

void test(int size, char uplo, float16_t *p_mat)
{
  int col_stride = uplo == 'u' ? 1 : size;
  auto *a = &p_mat[0];
  auto pg = svptrue_b16();
  for (int j = 0; j < size; ++j) {
auto *a_j = &a[j];
if (j > 0) {
  int col_i = j + 1;
  auto v_a_ji_0 = svld1_vnum_f16(pg, (const float16_t *)&a_j[col_i], 0);
  v_a_ji_0 = svcmla_f16_x(pg, v_a_ji_0, v_a_ji_0, v_a_ji_0, 180);
}

int col_i = j * col_stride;
auto v_a_ji_0 = svld1_vnum_f16(pg, (const float16_t *)&a_j[col_i], 0);
auto v_old_a_jj_0 = svld1_vnum_f16(pg, (const float16_t *)&a_j[j], 0);
v_a_ji_0 = svmul_f16_x(pg, v_old_a_jj_0, v_a_ji_0);

svst1_vnum_f16(pg, (float16_t *)&a_j[col_i], 0, v_a_ji_0);
  }
}

shows that the change in the patch is a positive one.

The issue seems to be that GCC does not see the if block as dead code:

if (j > 0) {
  int col_i = j + 1;
  auto v_a_ji_0 = svld1_vnum_f16(pg, (const float16_t *)&a_j[col_i], 0);
  v_a_ji_0 = svcmla_f16_x(pg, v_a_ji_0, v_a_ji_0, v_a_ji_0, 180);
}

is dead because v_a_ji_0 is overwritten before use.

  _29 = MEM <__SVFloat16_t> [(__fp16 *)_88 + ivtmp.10_52 * 2];
  svcmla_f16_x ({ -1, 0, ... }, _29, _29, _29, 180);

_29 is dead, but I guess it's not eliminated because the compiler doesn't know
what svcmla_f16_x does. But are these intrinsics not marked as CONST|PURE?

We finally eliminate it at the RTL level, but I think we should mark these
intrinsics as ECF_CONST.
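
A minimal analogue (my sketch, not from the PR) of what ECF_CONST would buy at
the GIMPLE level; __attribute__((const)) expresses the same property for plain
functions, and lets DCE delete a call whose result is unused:

__attribute__((const)) int f_const (int);
int f_plain (int);

void g (int x)
{
  f_const (x);  /* result unused, no side effects: DCE removes the call */
  f_plain (x);  /* may have side effects: the call must stay */
}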

[Bug target/114412] [14/15 Regression] 7% slowdown of 436.cactusADM on aarch64

2024-05-13 Thread tnfchris at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=114412

Tamar Christina  changed:

   What|Removed |Added

 CC||tnfchris at gcc dot gnu.org

--- Comment #3 from Tamar Christina  ---
Hi Filip,

Do you generate these runs with counter-based PGO or compiler instrumentation?

Just so I know before I start trying to reproduce them.

[Bug tree-optimization/114932] Improvement in CHREC can give large performance gains

2024-05-13 Thread tnfchris at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=114932

Tamar Christina  changed:

   What|Removed |Added

 Ever confirmed|0   |1
 Status|UNCONFIRMED |ASSIGNED
   Last reconfirmed||2024-05-13
   Assignee|unassigned at gcc dot gnu.org  |tnfchris at gcc dot gnu.org

--- Comment #8 from Tamar Christina  ---
(In reply to Richard Biener from comment #7)
> Likely
> 
>   Base: (integer(kind=4) *) &block + ((sizetype) ((unsigned long) l0_19(D) *
> 324) + 36)
> 
> vs.
> 
>   Base: (integer(kind=4) *) &block + ((sizetype) ((integer(kind=8)) l0_19(D)
> * 81) + 9) * 4
> 
> where we fail to optimize the outer multiply.  It's
> 
>  ((unsigned)((signed)x * 81) + 9) * 4
> 
> and likely done by extract_muldiv for the case of (unsigned)x.  The trick
> would be to promote the inner multiply to unsigned to make the otherwise
> profitable transform valid.  But best not by enhancing extract_muldiv ...

Ah, thanks!

Mine then.
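
To spell the fold out (a sketch; the C types below stand in for the dump's
integer(kind=8) and sizetype):

unsigned long bad (long x)
{
  /* The outer * 4 isn't distributed, because the inner multiply is signed
     and naively distributing could change overflow behaviour.  */
  return ((unsigned long)(x * 81) + 9) * 4;
}

unsigned long good (long x)
{
  /* Signed overflow in x * 81 is UB, so the compiler may assume it doesn't
     wrap; promoting the multiply to unsigned then makes distributing the
     outer * 4 valid, yielding the single-multiply form.  */
  return (unsigned long)x * 324 + 36;
}

Both compute the same value whenever the original expression is well defined.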

[Bug tree-optimization/114932] Improvement in CHREC can give large performance gains

2024-05-03 Thread tnfchris at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=114932

--- Comment #6 from Tamar Christina  ---
Created attachment 58096
  --> https://gcc.gnu.org/bugzilla/attachment.cgi?id=58096&action=edit
exchange2.fppized-bad.f90.187t.ivopts

[Bug tree-optimization/114932] Improvement in CHREC can give large performance gains

2024-05-03 Thread tnfchris at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=114932

--- Comment #5 from Tamar Christina  ---
Created attachment 58095
  --> https://gcc.gnu.org/bugzilla/attachment.cgi?id=58095&action=edit
exchange2.fppized-good.f90.187t.ivopts

[Bug tree-optimization/114932] Improvement in CHREC can give large performance gains

2024-05-03 Thread tnfchris at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=114932

--- Comment #4 from Tamar Christina  ---
reduced more:

---
module brute_force
  integer, parameter :: r=9
  integer block(r, r, 0)
contains
  subroutine brute
    do
      do
        do
          do
            do
              do
                do i7 = l0, 1
                  select case(1)
                  case(1)
                    block(:2, 7:, 1) = block(:2, 7:, i7) - 1
                  end select
                  do i8 = 1, 1
                    do i9 = 1, 1
                      if(1 == 1) then
                        call digits_20
                      end if
                    end do
                  end do
                end do
              end do
            end do
          end do
        end do
      end do
    end do
  end
end
---

I'll have to stop now till I'm back, but the main difference seems to be in:

good:

:
IV struct:
  SSA_NAME: _1
  Type: integer(kind=8)
  Base: (integer(kind=8)) ((unsigned long) l0_19(D) * 81)
  Step: 81
  Biv:  N
  Overflowness wrto loop niter: Overflow
IV struct:
  SSA_NAME: _20
  Type: integer(kind=8)
  Base: (integer(kind=8)) l0_19(D)
  Step: 1
  Biv:  N
  Overflowness wrto loop niter: No-overflow
IV struct:
  SSA_NAME: i7_28
  Type: integer(kind=4)
  Base: l0_19(D) + 1
  Step: 1
  Biv:  Y
  Overflowness wrto loop niter: No-overflow
IV struct:
  SSA_NAME: vectp.22_46
  Type: integer(kind=4) *
  Base: (integer(kind=4) *) &block + ((sizetype) ((unsigned long) l0_19(D) *
324) + 36)
  Step: 324
  Object:   (void *) &block
  Biv:  N
  Overflowness wrto loop niter: No-overflow

bad:

:
IV struct:
  SSA_NAME: _1
  Type: integer(kind=8)
  Base: (integer(kind=8)) l0_19(D) * 81
  Step: 81
  Biv:  N
  Overflowness wrto loop niter: No-overflow
IV struct:
  SSA_NAME: _20
  Type: integer(kind=8)
  Base: (integer(kind=8)) l0_19(D)
  Step: 1
  Biv:  N
  Overflowness wrto loop niter: No-overflow
IV struct:
  SSA_NAME: i7_28
  Type: integer(kind=4)
  Base: l0_19(D) + 1
  Step: 1
  Biv:  Y
  Overflowness wrto loop niter: No-overflow
IV struct:
  SSA_NAME: vectp.22_46
  Type: integer(kind=4) *
  Base: (integer(kind=4) *) &block + ((sizetype) ((integer(kind=8)) l0_19(D) *
81) + 9) * 4
  Step: 324
  Object:   (void *) &block
  Biv:  N
  Overflowness wrto loop niter: No-overflow
