[Bug target/99195] Optimise away vec_concat of 64-bit AdvancedSIMD operations with zeroes in aarch64

2024-04-04 Thread ktkachov at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=99195

ktkachov at gcc dot gnu.org changed:

   What|Removed |Added

  Known to work||14.0
 Resolution|--- |FIXED
 Status|NEW |RESOLVED

--- Comment #21 from ktkachov at gcc dot gnu.org ---
I think all the straightforward cases are handled and the infrastructure for
doing this is added. Any future improvements in the area should be tracked
separately. Marking as fixed for GCC 14.1

[Bug rtl-optimization/113019] [NOT A BUG] Multi-architecture binaries for Linux

2023-12-14 Thread ktkachov at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=113019

ktkachov at gcc dot gnu.org changed:

   What|Removed |Added

 CC||ktkachov at gcc dot gnu.org

--- Comment #1 from ktkachov at gcc dot gnu.org ---
GCC provides the Function Multiversioning feature that's supported on some
architectures:
https://gcc.gnu.org/onlinedocs/gcc/Function-Multiversioning.html

That seems to do what you want?

[Bug middle-end/111782] New: [11/12/13/14 Regression] Extra move in complex double multiplication

2023-10-12 Thread ktkachov at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=111782

Bug ID: 111782
   Summary: [11/12/13/14 Regression] Extra move in complex double
multiplication
   Product: gcc
   Version: 14.0
Status: UNCONFIRMED
  Keywords: missed-optimization
  Severity: normal
  Priority: P3
 Component: middle-end
  Assignee: unassigned at gcc dot gnu.org
  Reporter: ktkachov at gcc dot gnu.org
  Target Milestone: ---
Target: aarch64

The testcase:
__complex double
foo (__complex double a, __complex double b)
{
  return a * b;
}

With GCC trunk at -Ofast I see on aarch64:
foo(double _Complex, double _Complex):
fmov    d31, d1
fmul    d1, d1, d2
fmadd   d1, d0, d3, d1
fmul    d31, d31, d3
fnmsub  d0, d0, d2, d31
ret

with GCC 10 the codegen used to be tighter:
foo(double _Complex, double _Complex):
fmul    d4, d1, d3
fmul    d5, d1, d2
fmadd   d1, d0, d3, d5
fnmsub  d0, d0, d2, d4
ret

There's an extra fmov emitted on trunk.
I noticed this regressed with the GCC 11 series.
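
For reference, a minimal sketch of what the multiplication reduces to once -Ofast drops the NaN/infinity recovery path. The struct and function names are illustrative assumptions and not part of the testcase; with d0/d1 holding a and d2/d3 holding b, the four operations line up with the fmul/fmul/fmadd/fnmsub sequence from GCC 10:

```c
/* Hypothetical stand-in for __complex double, for illustration only.  */
struct cplx { double re, im; };

/* (a.re + i*a.im) * (b.re + i*b.im) without any NaN/Inf fix-up.
   Two multiplies plus one fused multiply-add and one fused
   multiply-subtract are enough, so no extra register move is needed.  */
struct cplx
cmul (struct cplx a, struct cplx b)
{
  struct cplx r;
  r.re = a.re * b.re - a.im * b.im;
  r.im = a.re * b.im + a.im * b.re;
  return r;
}
```

Calling cmul with (1 + 2i) and (3 + 4i) gives (-5 + 10i).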

[Bug target/111733] New: Emit inline SVE FSCALE instruction for ldexp

2023-10-09 Thread ktkachov at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=111733

Bug ID: 111733
   Summary: Emit inline SVE FSCALE instruction for ldexp
   Product: gcc
   Version: 14.0
Status: UNCONFIRMED
  Keywords: missed-optimization
  Severity: normal
  Priority: P3
 Component: target
  Assignee: unassigned at gcc dot gnu.org
  Reporter: ktkachov at gcc dot gnu.org
  Target Milestone: ---
Target: aarch64

Having noticed https://github.com/llvm/llvm-project/pull/67552 in LLVM, GCC
should be able to emit the SVE fscale instruction [1] to implement the ldexp
standard function.

There is already an ldexpm3 optab defined, so it should be a relatively simple
matter of wiring up the expander for TARGET_SVE.

[1]
https://developer.arm.com/documentation/ddi0596/2021-12/SVE-Instructions/FSCALE--Floating-point-adjust-exponent-by-vector--predicated--?lang=en
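
A small sketch of the kind of source this is about. The function name and signature are made up for illustration, and whether the vectoriser would actually pick FSCALE for this exact loop is an assumption that depends on how the expander ends up being wired:

```c
#include <math.h>

/* Scale each x[i] by 2^e[i].  Without an inline expansion every
   iteration is a call to the library ldexp; an ldexp expander for
   TARGET_SVE could, in principle, map this onto FSCALE.  */
void
scale_by_pow2 (double *restrict out, const double *restrict x,
               const int *restrict e, int n)
{
  for (int i = 0; i < n; i++)
    out[i] = ldexp (x[i], e[i]);
}
```

For example, inputs {1.0, 3.0, -0.5} with exponents {4, -1, 2} produce {16.0, 1.5, -2.0}.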

[Bug tree-optimization/111478] [12/13/14 regression] aarch64 SVE ICE: in compute_live_loop_exits, at tree-ssa-loop-manip.cc:250

2023-09-27 Thread ktkachov at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=111478

ktkachov at gcc dot gnu.org changed:

   What|Removed |Added

 CC||ktkachov at gcc dot gnu.org
   Target Milestone|14.0|12.4
   Priority|P3  |P1

--- Comment #3 from ktkachov at gcc dot gnu.org ---
Marking as P1. We hit this with a Fortran reproducer:
      SUBROUTINE REPRODUCER( M, A, LDA )
      IMPLICIT NONE
      INTEGER LDA, M, I
      COMPLEX A( LDA, * )
      DO I = 2, M
        A( I, 1 ) = A( I, 1 ) / A( 1, 1 )
      END DO
      RETURN
      END

on aarch64 with -march=armv8-a+sve -O3.
The ICE triggers on 12.3 but the code compiles fine with 12.2.

[Bug tree-optimization/111476] [14 regression] ICE when building Ruby 3.1.4

2023-09-19 Thread ktkachov at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=111476

ktkachov at gcc dot gnu.org changed:

   What|Removed |Added

 Ever confirmed|0   |1
 Status|UNCONFIRMED |NEW
   Last reconfirmed||2023-09-19
 CC||ktkachov at gcc dot gnu.org

--- Comment #2 from ktkachov at gcc dot gnu.org ---
Confirmed. Reduced testcase.

int a, b, c, d;
void
e() {
  int f, g, h;
  for (;;)
    switch (c) {
    case '-':
      if (!b) {
        if (a) {
          g = 0;
          goto i;
        }
        goto j;
      }
      for (; a;)
      i:
        g++;
      if (b)
        continue;
      f = 1;
      for (; f < g; f++) {
        b++;
        if (b)
          h *= 10;
      }
    }
j:
  d = h;
}

[Bug middle-end/111378] Missed optimization for comparing with exact_log2 constants

2023-09-12 Thread ktkachov at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=111378

ktkachov at gcc dot gnu.org changed:

   What|Removed |Added

 Ever confirmed|0   |1
 Status|UNCONFIRMED |NEW
   Last reconfirmed||2023-09-12
 CC||ktkachov at gcc dot gnu.org

--- Comment #1 from ktkachov at gcc dot gnu.org ---
Confirmed. On aarch64 GCC generates:
test:
mov w2, 65535
cmp w1, w2
bhi .L2
b   do_something
.L2:
b   do_something_other

but LLVM generates the shorter:
test:   // @test
lsr w8, w1, #16
cbnz w8, .LBB0_2
b   do_something
.LBB0_2:
b   do_something_other
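
The two sequences rely on a simple equivalence for unsigned values, sketched below. The function names are hypothetical and the original testcase is not reproduced here; the constant 65535 is taken from the GCC output above:

```c
#include <stdint.h>

/* Straightforward form: compare against 2^16 - 1.  This needs the
   constant materialised in a register before the compare.  */
int
above_u16_cmp (uint32_t x)
{
  return x > 65535u;
}

/* Equivalent form: any set bit at position 16 or above means
   x >= 2^16, so a shift followed by a test against zero suffices.  */
int
above_u16_shift (uint32_t x)
{
  return (x >> 16) != 0;
}
```

Both functions agree for every 32-bit input, which is what lets LLVM use lsr + cbnz.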

[Bug web/111120] Rrrrr

2023-08-23 Thread ktkachov at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=111120

ktkachov at gcc dot gnu.org changed:

   What|Removed |Added

 Resolution|--- |INVALID
 Status|UNCONFIRMED |RESOLVED

--- Comment #1 from ktkachov at gcc dot gnu.org ---
.

[Bug target/110280] internal compiler error: in const_unop, at fold-const.cc:1884

2023-06-16 Thread ktkachov at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=110280

ktkachov at gcc dot gnu.org changed:

   What|Removed |Added

 Ever confirmed|0   |1
 CC||ktkachov at gcc dot gnu.org
   Last reconfirmed||2023-06-16
 Status|UNCONFIRMED |NEW
 Target|arm64   |aarch64

--- Comment #2 from ktkachov at gcc dot gnu.org ---
Confirmed, reducing.

[Bug target/110235] [14 Regression] Wrong use of us_truncate in SSE and AVX RTL representation

2023-06-15 Thread ktkachov at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=110235

--- Comment #4 from ktkachov at gcc dot gnu.org ---
(In reply to Hongtao.liu from comment #3)
> (In reply to Hongtao.liu from comment #2)
> > FAIL: gcc.target/i386/avx2-vpackssdw-2.c execution test
> > 
> > This one is about sign saturation which should match rtl SS_TRUNCATE.
> 
> I realize for 256-bit/512-bit vpackssdw, it's an 128-bit iterleave of src1
> and src2, and then ss_truncate to the dest, not just vec_concat src1 and
> src2. So the simplification exposed the bug.

Thanks for looking at it. I think it'd make sense for someone with x86/sse/avx
experience to rewrite the RTL representation of the patterns involved to match
the correct semantics for saturation and lane behaviour.
Alternatively, a quick solution would be to convert uses of
us_truncate/ss_truncate in the problematic patterns to an x86-specific UNSPEC,
which would make things work like they did before the simplification was added.
That would be just a stop-gap solution as it's better to use standard RTL
operations where possible.

[Bug target/110235] New: Wrong use of us_truncate in SSE and AVX RTL representation

2023-06-13 Thread ktkachov at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=110235

Bug ID: 110235
   Summary: Wrong use of us_truncate in SSE and AVX RTL
representation
   Product: gcc
   Version: 14.0
Status: UNCONFIRMED
  Keywords: wrong-code
  Severity: normal
  Priority: P3
 Component: target
  Assignee: unassigned at gcc dot gnu.org
  Reporter: ktkachov at gcc dot gnu.org
CC: uros at gcc dot gnu.org
  Target Milestone: ---
Target: x86

After g:921b841350c4fc298d09f6c5674663e0f4208610 added constant-folding for
SS_TRUNCATE and US_TRUNCATE, some tests in i386.exp started failing:
FAIL: gcc.target/i386/avx-vpackuswb-1.c execution test
FAIL: gcc.target/i386/avx2-vpackssdw-2.c execution test
FAIL: gcc.target/i386/avx2-vpackusdw-2.c execution test
FAIL: gcc.target/i386/avx2-vpackuswb-2.c execution test
FAIL: gcc.target/i386/sse2-packuswb-1.c execution test

From what I can gather from the documentation for intrinsics like
_mm_packus_epi16 the operation they perform is not what we model as us_truncate
in RTL. That is, they don't perform a truncation while treating their input as
an unsigned value. Rather, they treat the input as a signed value and saturate
it to the unsigned min and max of the narrow mode before truncation. In that
regard they seem similar to the SQMOVUN instructions in aarch64.

I think it'd be best to change the representation of those instructions to a
truncating clamp operation, similar to
g:b747f54a2a930da55330c2861cd1e344f67a88d9 in aarch64.
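
To make the difference concrete, here is a scalar sketch of the two behaviours for a 16-bit to 8-bit narrowing. The helper names are hypothetical; they model the semantics described above, not the actual RTL or intrinsic implementations:

```c
#include <stdint.h>

/* us_truncate as modelled in RTL: the input is an unsigned value,
   saturated to the unsigned range of the narrow mode.  */
uint8_t
us_truncate_u16 (uint16_t x)
{
  return x > UINT8_MAX ? UINT8_MAX : (uint8_t) x;
}

/* packuswb-style narrowing: the input is a signed value, clamped to
   the unsigned min and max of the narrow mode before truncation.  */
uint8_t
packus_s16 (int16_t x)
{
  if (x < 0)
    return 0;
  if (x > UINT8_MAX)
    return UINT8_MAX;
  return (uint8_t) x;
}
```

The two disagree whenever the top bit of the input is set. For the bit pattern 0xffff, us_truncate_u16 gives 255 while packus_s16 (which sees -1) gives 0, so constant-folding one as the other produces wrong results.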

[Bug target/110059] When SPEC is used to test the GCC (10.3.1), the test result of subitem 548 fluctuates abnormally.

2023-05-31 Thread ktkachov at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=110059

ktkachov at gcc dot gnu.org changed:

   What|Removed |Added

 CC||ktkachov at gcc dot gnu.org

--- Comment #3 from ktkachov at gcc dot gnu.org ---
548.exchange2_r was improved in GCC 12 after PR98782 was fixed. I'd suggest you
try out a later version of GCC

[Bug target/110039] [14 Regression] FAIL: gcc.target/aarch64/rev16_2.c scan-assembler-times rev16\\tw[0-9]+ 2

2023-05-30 Thread ktkachov at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=110039

ktkachov at gcc dot gnu.org changed:

   What|Removed |Added

   Target Milestone|--- |14.0

[Bug target/110039] New: FAIL: gcc.target/aarch64/rev16_2.c scan-assembler-times rev16\\tw[0-9]+ 2

2023-05-30 Thread ktkachov at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=110039

Bug ID: 110039
   Summary: FAIL: gcc.target/aarch64/rev16_2.c
scan-assembler-times rev16\\tw[0-9]+ 2
   Product: gcc
   Version: 13.0
Status: UNCONFIRMED
  Keywords: missed-optimization
  Severity: normal
  Priority: P3
 Component: target
  Assignee: unassigned at gcc dot gnu.org
  Reporter: ktkachov at gcc dot gnu.org
  Target Milestone: ---
Target: aarch64

I think after g:d8545fb2c71683f407bfd96706103297d4d6e27b the test regresses on
aarch64.
We now generate:
__rev16_32_alt:
rev w0, w0
ror w0, w0, 16
ret

__rev16_32:
rev w0, w0
ror w0, w0, 16
ret

whereas before it was:
__rev16_32_alt:
rev16   w0, w0
ret

__rev16_32:
rev16   w0, w0
ret

I think the GIMPLE at expand time is better and the RTL that it tries to match
is simpler:
Failed to match this instruction:
(set (reg:SI 95)
(rotate:SI (bswap:SI (reg:SI 96))
(const_int 16 [0x10])))

So maybe it's simply a matter of adding that pattern to aarch64.md.

Anyway, filing this here to track the regression
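
As a sanity check on why the unmatched RTL is still a rev16, here is a scalar sketch. The function names are illustrative only and are not the ones used in rev16_2.c:

```c
#include <stdint.h>

/* Swap the two bytes inside each 16-bit half of x, which is what
   REV16 on a W register computes.  */
uint32_t
rev16_mask (uint32_t x)
{
  return ((x & 0xff00ff00u) >> 8) | ((x & 0x00ff00ffu) << 8);
}

/* The form combine now sees: a full byte swap followed by a rotate
   by 16, i.e. (rotate:SI (bswap:SI x) (const_int 16)).  */
uint32_t
rev16_bswap_rot (uint32_t x)
{
  uint32_t s = ((x & 0x000000ffu) << 24) | ((x & 0x0000ff00u) << 8)
               | ((x & 0x00ff0000u) >> 8) | ((x & 0xff000000u) >> 24);
  return (s << 16) | (s >> 16);
}
```

Both map 0x11223344 to 0x22114433, so a pattern recognising the rotate-of-bswap form could emit a single rev16.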

[Bug target/109939] Invalid return type for __builtin_arm_ssat: Unsigned instead of signed

2023-05-24 Thread ktkachov at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=109939

ktkachov at gcc dot gnu.org changed:

   What|Removed |Added

 Status|NEW |ASSIGNED
   Assignee|unassigned at gcc dot gnu.org  |ktkachov at gcc dot gnu.org

--- Comment #5 from ktkachov at gcc dot gnu.org ---
Fixed for GCC 14. It should be a very low risk patch to backport to the
branches as it fixes an inconsistency with the spec. Will do so after some time
for testing on trunk.

[Bug target/109855] [14 Regression] ICE: in curr_insn_transform, at lra-constraints.cc:4231 unable to generate reloads for {aarch64_mlav4hi_vec_concatz_le} at -O1

2023-05-23 Thread ktkachov at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=109855

ktkachov at gcc dot gnu.org changed:

   What|Removed |Added

 Status|ASSIGNED|RESOLVED
 Resolution|--- |FIXED

--- Comment #9 from ktkachov at gcc dot gnu.org ---
Fixed, thanks for the report.

[Bug target/109939] Invalid return type for __builtin_arm_ssat: Unsigned instead of signed

2023-05-23 Thread ktkachov at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=109939

ktkachov at gcc dot gnu.org changed:

   What|Removed |Added

 Status|WAITING |NEW
 CC||ktkachov at gcc dot gnu.org

--- Comment #3 from ktkachov at gcc dot gnu.org ---
I think you're right, the qualifier for the return value of
SAT_BINOP_UNSIGNED_IMM should be qualifier_none
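
A portable sketch of why the signedness of the result matters. The helper below is a hypothetical model of a signed saturate to a given bit width (the ACLE __ssat intrinsic is declared as returning int32_t); it is not the builtin itself:

```c
#include <stdint.h>

/* Saturate x to the signed range of 'bits' bits, 1 <= bits <= 32.
   The result can be negative, so the return type has to be signed.  */
int32_t
ssat_model (int32_t x, unsigned bits)
{
  int64_t max = ((int64_t) 1 << (bits - 1)) - 1;
  int64_t min = -max - 1;
  if (x > max)
    return (int32_t) max;
  if (x < min)
    return (int32_t) min;
  return x;
}
```

ssat_model (-1000, 8) yields -128. If the result were typed as unsigned, the same bit pattern would read as 4294967168, which changes comparisons and widening conversions on it.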

[Bug c/109940] [14 Regression] ICE in decide_candidate_validity, bisected

2023-05-23 Thread ktkachov at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=109940

ktkachov at gcc dot gnu.org changed:

   What|Removed |Added

 Ever confirmed|0   |1
 Status|UNCONFIRMED |NEW
   Target Milestone|--- |14.0
  Known to fail||14.0
  Known to work||13.1.0
   Last reconfirmed||2023-05-23
Summary|ICE in  |[14 Regression] ICE in
   |decide_candidate_validity,  |decide_candidate_validity,
   |bisected|bisected
 CC||ktkachov at gcc dot gnu.org

--- Comment #1 from ktkachov at gcc dot gnu.org ---
Confirmed. A more cleaned up testcase:
int a;
int *b;
void
c (int *d) { *d = a; }

int
e(int d, int f) {
  if (d <= 1)
    return 1;
  int g = d / 2;
  for (int h = 0; h < g; h++)
    if (f == (long int)b > b[h])
      c(&b[h]);
  e(g, f);
  e(g, f);
}

[Bug target/109855] [14 Regression] ICE: in curr_insn_transform, at lra-constraints.cc:4231 unable to generate reloads for {aarch64_mlav4hi_vec_concatz_le} at -O1

2023-05-22 Thread ktkachov at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=109855

ktkachov at gcc dot gnu.org changed:

   What|Removed |Added

   Assignee|unassigned at gcc dot gnu.org  |ktkachov at gcc dot gnu.org
 Status|NEW |ASSIGNED

--- Comment #7 from ktkachov at gcc dot gnu.org ---
I'll take it.

[Bug target/109855] [14 Regression] ICE: in curr_insn_transform, at lra-constraints.cc:4231 unable to generate reloads for {aarch64_mlav4hi_vec_concatz_le} at -O1

2023-05-22 Thread ktkachov at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=109855

--- Comment #6 from ktkachov at gcc dot gnu.org ---
(In reply to ktkachov from comment #5)
> (In reply to rsand...@gcc.gnu.org from comment #4)
> > I guess the problem is that the define_subst output template has:
> > 
> >   (match_operand: 0)
> > 
> > which creates a new operand 0 with an empty predicate and constraint,
> > as opposed to a (match_dup 0), which would be substituted with the
> > original operand 0.  Unfortunately
> > 
> >   (match_dup: 0)
> > 
> > doesn't work as a way of inserting the original destination with
> > a different mode, since the : is ignored.  Perhaps we should
> > “fix” that.  Alternatively:
> > 
> >   (match_operand: 0 "register_operand" "=w")
> > 
> > should work, but probably locks us into using patterns that have one
> > alternative only.
> 
> I think this approach is the most promising and probably okay for the vast
> majority of cases we want to handle with these substs.

Interestingly, it does seem to do the right thing for multi-alternative
patterns too. For example:
(define_insn ("aarch64_cmltv4hf_vec_concatz_le")
 [
   (set (match_operand:V8HI 0 ("register_operand") ("=w,w"))
        (vec_concat:V8HI
          (neg:V4HI
            (lt:V4HI (match_operand:V4HF 1 ("register_operand") ("w,w"))
                     (match_operand:V4HF 2 ("aarch64_simd_reg_or_zero") ("w,YDz"))))
          (match_operand:V4HI 3 ("aarch64_simd_or_scalar_imm_zero") (""))))
 ] ("(!BYTES_BIG_ENDIAN) && ((TARGET_SIMD) && (TARGET_SIMD_F16INST))") ("@
  fcmgt\t%0.4h, %2.4h, %1.4h
  fcmlt\t%0.4h, %1.4h, 0")
 [
   (set_attr ("type") ("neon_fp_compare_s"))
   (set_attr ("add_vec_concat_subst_le") ("no"))
 ])

[Bug target/109855] [14 Regression] ICE: in curr_insn_transform, at lra-constraints.cc:4231 unable to generate reloads for {aarch64_mlav4hi_vec_concatz_le} at -O1

2023-05-22 Thread ktkachov at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=109855

--- Comment #5 from ktkachov at gcc dot gnu.org ---
(In reply to rsand...@gcc.gnu.org from comment #4)
> I guess the problem is that the define_subst output template has:
> 
>   (match_operand: 0)
> 
> which creates a new operand 0 with an empty predicate and constraint,
> as opposed to a (match_dup 0), which would be substituted with the
> original operand 0.  Unfortunately
> 
>   (match_dup: 0)
> 
> doesn't work as a way of inserting the original destination with
> a different mode, since the : is ignored.  Perhaps we should
> “fix” that.  Alternatively:
> 
>   (match_operand: 0 "register_operand" "=w")
> 
> should work, but probably locks us into using patterns that have one
> alternative only.

I think this approach is the most promising and probably okay for the vast
majority of cases we want to handle with these substs.

[Bug target/109855] [14 Regression] ICE: in curr_insn_transform, at lra-constraints.cc:4231 unable to generate reloads for {aarch64_mlav4hi_vec_concatz_le} at -O1

2023-05-22 Thread ktkachov at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=109855

ktkachov at gcc dot gnu.org changed:

   What|Removed |Added

   Last reconfirmed||2023-05-22
 Ever confirmed|0   |1
 Status|UNCONFIRMED |NEW

--- Comment #2 from ktkachov at gcc dot gnu.org ---
Confirmed.
The ICE in LRA happens very early on:
** Local #1: **

   Spilling non-eliminable hard regs: 31
alt=0: Bad operand -- refuse


The pattern matches:
 [(set (match_operand:VDQ_BHSI 0 "register_operand" "=w")
   (plus:VDQ_BHSI (mult:VDQ_BHSI
(match_operand:VDQ_BHSI 2 "register_operand" "w")
(match_operand:VDQ_BHSI 3 "register_operand" "w"))
  (match_operand:VDQ_BHSI 1 "register_operand" "0")))]

I wonder whether the substitution breaks something on the constraint in operand
1, which is tied to 0. The define_subst rule adds another operand to the
pattern to match the zero vector, but I would have expected the substitution
machinery to handle it all transparently...

[Bug target/108140] ICE expanding __rbit

2023-05-09 Thread ktkachov at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=108140

ktkachov at gcc dot gnu.org changed:

   What|Removed |Added

 Status|ASSIGNED|RESOLVED
 Resolution|--- |FIXED

--- Comment #9 from ktkachov at gcc dot gnu.org ---
This should have been fixed for 12.3.

[Bug target/109636] [14 Regression] ICE: in paradoxical_subreg_p, at rtl.h:3205 with -O -mcpu=a64fx

2023-04-28 Thread ktkachov at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=109636

--- Comment #7 from ktkachov at gcc dot gnu.org ---
(In reply to rsand...@gcc.gnu.org from comment #6)
> Ugh.  I guess we've got no option but to force the original
> subreg into a fresh register, but that's going to pessimise
> cases where arithmetic is done on tuple types.
> 
> Perhaps we should just expose the SVE operation as a native
> V2DI one.  Handling predicated ops would be a bit more challenging
> though.

I did try a copy_to_mode_reg to a fresh V2DI register for non-REG_P arguments
and that did progress, but (surprisingly?) still ICEd during fwprop:
during RTL pass: fwprop1
mulice.c: In function 'foom':
mulice.c:17:1: internal compiler error: in paradoxical_subreg_p, at rtl.h:3205
   17 | }
  | ^
0xe903b9 paradoxical_subreg_p(machine_mode, machine_mode)
$SRC/gcc/rtl.h:3205
0xe903b9 simplify_context::simplify_subreg(machine_mode, rtx_def*,
machine_mode, poly_int<2u, unsigned long>)
$SRC/gcc/simplify-rtx.cc:7533
0xe1b5f7 insn_propagation::apply_to_rvalue_1(rtx_def**)
$SRC/gcc/recog.cc:1176
0xe1b3d8 insn_propagation::apply_to_rvalue_1(rtx_def**)
$SRC/gcc/recog.cc:1118
0xe1b7b7 insn_propagation::apply_to_rvalue_1(rtx_def**)
$SRC/gcc/recog.cc:1254
0xe1babf insn_propagation::apply_to_pattern_1(rtx_def**)
$SRC/gcc/recog.cc:1361
0xe1bae4 insn_propagation::apply_to_pattern(rtx_def**)
$SRC/gcc/recog.cc:1383
0x1c22e5b try_fwprop_subst_pattern
$SRC/gcc/fwprop.cc:454
0x1c22e5b try_fwprop_subst
$SRC/gcc/fwprop.cc:627
0x1c239a9 forward_propagate_and_simplify
$SRC/gcc/fwprop.cc:823
0x1c239a9 forward_propagate_into
$SRC/gcc/fwprop.cc:886
0x1c23bc1 fwprop_insn
$SRC/gcc/fwprop.cc:943
0x1c23d98 fwprop
$SRC/gcc/fwprop.cc:995
0x1c240e1 execute
$SRC/gcc/fwprop.cc:1033
Please submit a full bug report, with preprocessed source (by using
-freport-bug).
Please include the complete backtrace with any bug report.
See <https://gcc.gnu.org/bugs/> for instructions.

fwprop ended up creating:
(mult:VNx2DI (subreg:VNx2DI (reg/v:V2DI 95 [ v ]) 0)
(subreg:VNx2DI (subreg:V2DI (reg/v:OI 97 [ w ]) 16) 0))

and something blew up anyway, so it seems the RTL passes *really* don't like
these kinds of subregs ;)
I'll look into expressing these ops as native V2DI patterns. I guess for the
unpredicated SVE2 mul that's easy, but for the predicated forms perhaps we can
have them consume a predicate register, generated at expand time, similar to
the aarch64-sve.md expanders. Not super-pretty, but maybe it'll be enough.

[Bug target/109636] [14 Regression] ICE: in paradoxical_subreg_p, at rtl.h:3205 with -O -mcpu=a64fx

2023-04-28 Thread ktkachov at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=109636

ktkachov at gcc dot gnu.org changed:

   What|Removed |Added

   Priority|P3  |P1
   Assignee|unassigned at gcc dot gnu.org  |ktkachov at gcc dot gnu.org
 Status|NEW |ASSIGNED

--- Comment #5 from ktkachov at gcc dot gnu.org ---
The multiplication case also ICEs
void foom (V v, W w)
{
  bar (__builtin_shuffle (v, __builtin_shufflevector ((V){}, w, 4, 5) * v));
}

as mulv2di3 was implemented with a similar trick for TARGET_SVE.
I'll take this, once I figure out how to wire up the Neon modes through SVE...

[Bug target/109636] [14 Regression] ICE: in paradoxical_subreg_p, at rtl.h:3205 with -O -mcpu=a64fx

2023-04-27 Thread ktkachov at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=109636

ktkachov at gcc dot gnu.org changed:

   What|Removed |Added

   Last reconfirmed||2023-04-27
 Status|UNCONFIRMED |NEW
 CC||rsandifo at gcc dot gnu.org
 Ever confirmed|0   |1

--- Comment #4 from ktkachov at gcc dot gnu.org ---
Confirmed. The operand that's blowing it up is:
(subreg:V2DI (reg/v:OI 97 [ w ]) 16)
at
rtx sve_op1 = simplify_gen_subreg (sve_mode, operands[1], mode, 0);

simplify_gen_subreg, lowpart_subreg, copy_to_mode_reg and force_reg all ICE :(

[Bug target/109636] [14 Regression] ICE: in paradoxical_subreg_p, at rtl.h:3205 with -O -mcpu=a64fx

2023-04-26 Thread ktkachov at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=109636

ktkachov at gcc dot gnu.org changed:

   What|Removed |Added

 CC||ktkachov at gcc dot gnu.org

--- Comment #2 from ktkachov at gcc dot gnu.org ---
(In reply to Andrew Pinski from comment #1)
> Are you sure this is not a regression also in GCC 13.1.0.
> The most obvious revision which caused this is r13-6620-gf23dc726875c26f2c3 .

I'd expect it's g:c69db3ef7f7d82a50f46038aa5457b7c8cc2d643 but haven't looked
deeper yet

[Bug tree-optimization/53947] [meta-bug] vectorizer missed-optimizations

2023-04-24 Thread ktkachov at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=53947
Bug 53947 depends on bug 109406, which changed state.

Bug 109406 Summary: Missing use of aarch64 SVE2 unpredicated integer multiply
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=109406

   What|Removed |Added

 Status|NEW |RESOLVED
 Resolution|--- |FIXED

[Bug target/109406] Missing use of aarch64 SVE2 unpredicated integer multiply

2023-04-24 Thread ktkachov at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=109406

ktkachov at gcc dot gnu.org changed:

   What|Removed |Added

 Status|NEW |RESOLVED
   Target Milestone|--- |14.0
 Resolution|--- |FIXED

--- Comment #3 from ktkachov at gcc dot gnu.org ---
Fixed for GCC 14

[Bug target/108779] AARCH64 should add an option to change TLS register location to support EL1/EL2/EL3 system registers

2023-04-21 Thread ktkachov at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=108779

ktkachov at gcc dot gnu.org changed:

   What|Removed |Added

 Status|ASSIGNED|RESOLVED
   Target Milestone|--- |14.0
 Resolution|--- |FIXED

--- Comment #10 from ktkachov at gcc dot gnu.org ---
Implemented for GCC 14.

[Bug c/109553] New: Atomic operations vs const locations

2023-04-19 Thread ktkachov at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=109553

Bug ID: 109553
   Summary: Atomic operations vs const locations
   Product: gcc
   Version: 13.0
Status: UNCONFIRMED
  Keywords: diagnostic
  Severity: normal
  Priority: P3
 Component: c
  Assignee: unassigned at gcc dot gnu.org
  Reporter: ktkachov at gcc dot gnu.org
  Target Milestone: ---

When reasoning about optimal sequences for atomic operations for various
targets, the issue of read-only memory locations keeps coming up, particularly
when talking about doing non-native larger-sized accesses locklessly.

I wonder if the frontends in GCC should be more assertive with warnings on such
constructs. Consider, for example:
#include <stdint.h>

uint32_t
load_uint32_t (const uint32_t *a)
{
  return __atomic_load_n (a, __ATOMIC_ACQUIRE);
}

void
casa_uint32_t (const uint32_t *a, uint32_t *b, uint32_t *c)
{
  __atomic_compare_exchange_n (a, b, 3, 0, __ATOMIC_ACQUIRE, __ATOMIC_ACQUIRE);
}

Both of these functions compile fine with GCC.
With Clang casa_uint32_t gives a hard error:
error: address argument to atomic operation must be a pointer to non-const type
('const uint32_t *' (aka 'const unsigned int *') invalid)
  __atomic_compare_exchange_n (a, b, 3, 0, __ATOMIC_ACQUIRE, __ATOMIC_ACQUIRE);

I would argue that for both cases the compiler should emit something. I think
an error is appropriate for the __atomic_compare_exchange_n case, but even
for atomic load we may want to hint to the user to avoid doing an atomic load
from const types.
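
For contrast, a sketch of the compare-and-swap written against a non-const location, which is the form a diagnostic would presumably steer users towards. The wrapper name and its parameter names are hypothetical; only the __atomic builtin and memory-order macros come from the example above:

```c
#include <stdbool.h>
#include <stdint.h>

/* Strong compare-and-swap on a writable location.  On failure the
   value actually found in *a is written back to *expected.  */
bool
cas_uint32_t (uint32_t *a, uint32_t *expected, uint32_t desired)
{
  return __atomic_compare_exchange_n (a, expected, desired, 0,
                                      __ATOMIC_ACQUIRE, __ATOMIC_ACQUIRE);
}
```

Since a successful exchange stores to *a, dropping the const qualifier makes the write visible in the function's type.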

[Bug target/108840] Aarch64 doesn't optimize away shift counter masking

2023-04-19 Thread ktkachov at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=108840

ktkachov at gcc dot gnu.org changed:

   What|Removed |Added

 Resolution|--- |FIXED
 Status|ASSIGNED|RESOLVED
   Target Milestone|--- |14.0

--- Comment #5 from ktkachov at gcc dot gnu.org ---
Fixed for GCC 14.

[Bug tree-optimization/109154] [13 regression] jump threading de-optimizes nested floating point comparisons

2023-04-05 Thread ktkachov at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=109154

ktkachov at gcc dot gnu.org changed:

   What|Removed |Added

   Priority|P3  |P1

--- Comment #43 from ktkachov at gcc dot gnu.org ---
Indeed, thank you for the high quality analysis and improvements!
Marking this as P1 as it's a regression on aarch64-linux in GCC 13 so we'd want
to track this for the release, but of course it's up to RMs for the final say.

[Bug target/109406] Missing use of aarch64 SVE2 unpredicated integer multiply

2023-04-04 Thread ktkachov at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=109406

ktkachov at gcc dot gnu.org changed:

   What|Removed |Added

   Severity|normal  |enhancement

[Bug target/109406] New: Missing use of aarch64 SVE2 unpredicated integer multiply

2023-04-04 Thread ktkachov at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=109406

Bug ID: 109406
   Summary: Missing use of aarch64 SVE2 unpredicated integer
multiply
   Product: gcc
   Version: 13.0
Status: UNCONFIRMED
  Keywords: missed-optimization
  Severity: normal
  Priority: P3
 Component: target
  Assignee: unassigned at gcc dot gnu.org
  Reporter: ktkachov at gcc dot gnu.org
  Target Milestone: ---
Target: aarch64

For the testcase
#define N 1024

long long res[N];
long long in1[N];
long long in2[N];

void
mult (void)
{
  for (int i = 0; i < N; i++)
    res[i] = in1[i] * in2[i];
}

With -O3 -march=armv8.5-a+sve2 we generate the loop:
ptrue   p1.b, all
whilelo p0.d, wzr, w2
.L2:
ld1d    z0.d, p0/z, [x4, x0, lsl 3]
ld1d    z1.d, p0/z, [x3, x0, lsl 3]
mul z0.d, p1/m, z0.d, z1.d
st1d    z0.d, p0, [x1, x0, lsl 3]
incd    x0
whilelo p0.d, w0, w2
b.any   .L2
ret

SVE2 supports the MUL (vectors, unpredicated) instruction that would allow us
to eliminate the use of p1. Clang manages to do this (though it has other
inefficiencies) in https://godbolt.org/z/7xj6xEchx

[Bug tree-optimization/109401] New: Optimise max (a, b) + min (a, b) into a + b

2023-04-04 Thread ktkachov at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=109401

Bug ID: 109401
   Summary: Optimise max (a, b) + min (a, b) into a + b
   Product: gcc
   Version: 13.0
Status: UNCONFIRMED
  Keywords: missed-optimization
  Severity: normal
  Priority: P3
 Component: tree-optimization
  Assignee: unassigned at gcc dot gnu.org
  Reporter: ktkachov at gcc dot gnu.org
  Target Milestone: ---

The testcase

#include <stdint.h>
#include <algorithm>

uint32_t
foo (uint32_t a, uint32_t b)
{
  return std::max (a, b) + std::min (a, b);
}

uint32_t
foom (uint32_t a, uint32_t b)
{
  return std::max (a, b) * std::min (a, b);
}

could optimise foo into a + b and foom into a * b.
Should be a matter of some match.pd patterns?
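
The identity holds because max and min between them always return exactly the two inputs, and addition and multiplication are commutative. A C rendering of the same idea, with hypothetical helper names standing in for std::max and std::min:

```c
#include <stdint.h>

static inline uint32_t
max_u32 (uint32_t a, uint32_t b)
{
  return a > b ? a : b;
}

static inline uint32_t
min_u32 (uint32_t a, uint32_t b)
{
  return a < b ? a : b;
}

/* Expected to be equivalent to a + b, including on wraparound.  */
uint32_t
sum_maxmin (uint32_t a, uint32_t b)
{
  return max_u32 (a, b) + min_u32 (a, b);
}

/* Expected to be equivalent to a * b, including on wraparound.  */
uint32_t
prod_maxmin (uint32_t a, uint32_t b)
{
  return max_u32 (a, b) * min_u32 (a, b);
}
```

This sketch only covers the unsigned integer case from the testcase; floating-point types would need separate thought because of NaNs and signed zeros.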

[Bug target/109332] Bug in gcc (13.0.1) support for ARM SVE, which randomly ignore the predict register

2023-03-29 Thread ktkachov at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=109332

ktkachov at gcc dot gnu.org changed:

   What|Removed |Added

 CC||ktkachov at gcc dot gnu.org
 Status|UNCONFIRMED |RESOLVED
 Resolution|--- |INVALID

--- Comment #1 from ktkachov at gcc dot gnu.org ---
That's expected. Please see
https://github.com/ARM-software/acle/blob/main/main/acle.md#sve-naming-convention
Since the input uses the _x form of the intrinsic svsub_n_s64_x the predication
behaviour is left to the compiler and the ACLE specifies:
"This form of predication removes the need to choose between zeroing and
merging in cases where the inactive elements are unimportant. The code
generator can then pick whichever form of instruction seems to give the best
code. This includes using unpredicated instructions, where available and
suitable."

So using an unpredicated sub instruction is appropriate here and not a bug.

[Bug tree-optimization/109176] [13 Regression] internal compiler error: in to_constant, at poly-int.h:504

2023-03-21 Thread ktkachov at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=109176

--- Comment #10 from ktkachov at gcc dot gnu.org ---
For the testcase, having it in gcc.target/aarch64/sve as
/* { dg-options "-O2" } */

#include <arm_sve.h>

svbool_t
foo (svint8_t a, svint8_t b, svbool_t c)
{
  svbool_t d = svcmplt_s8 (svptrue_pat_b8 (SV_ALL), a, b);
  return svsel_b (d, c, d);
}

would be fine.

[Bug tree-optimization/109176] [13 Regression] internal compiler error: in to_constant, at poly-int.h:504

2023-03-20 Thread ktkachov at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=109176

--- Comment #3 from ktkachov at gcc dot gnu.org ---
Created attachment 54708
  --> https://gcc.gnu.org/bugzilla/attachment.cgi?id=54708&action=edit
Reduced testcase

Reduced testcase ICEs at -O2

[Bug tree-optimization/109176] internal compiler error: in to_constant, at poly-int.h:504

2023-03-17 Thread ktkachov at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=109176

ktkachov at gcc dot gnu.org changed:

   What|Removed |Added

 Status|UNCONFIRMED |NEW
 CC||ktkachov at gcc dot gnu.org
   Target Milestone|--- |13.0
 Ever confirmed|0   |1
   Last reconfirmed||2023-03-17

--- Comment #2 from ktkachov at gcc dot gnu.org ---
Confirmed. Running reduction

[Bug middle-end/109153] missed vector constructor optimizations

2023-03-16 Thread ktkachov at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=109153

ktkachov at gcc dot gnu.org changed:

   What|Removed |Added

 CC||ktkachov at gcc dot gnu.org
 Ever confirmed|0   |1
   Last reconfirmed||2023-03-16
 Status|UNCONFIRMED |NEW

--- Comment #1 from ktkachov at gcc dot gnu.org ---
Confirmed. Does the midend have a way of judging whether a constructor is
cheaper?

[Bug c++/108967] internal compiler error: in expand_debug_expr, at cfgexpand.cc:5450

2023-02-28 Thread ktkachov at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=108967

ktkachov at gcc dot gnu.org changed:

   What|Removed |Added

 Target||aarch64
   Last reconfirmed||2023-02-28
 Status|UNCONFIRMED |NEW
 CC||ktkachov at gcc dot gnu.org
 Ever confirmed|0   |1
   Target Milestone|--- |13.0
   Keywords||ice-on-valid-code
  Known to fail||13.0

--- Comment #2 from ktkachov at gcc dot gnu.org ---
Confirmed

[Bug rtl-optimization/106594] [13 Regression] sign-extensions no longer merged into addressing mode

2023-02-27 Thread ktkachov at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=106594

ktkachov at gcc dot gnu.org changed:

   What|Removed |Added

 CC||ktkachov at gcc dot gnu.org

--- Comment #12 from ktkachov at gcc dot gnu.org ---
(In reply to Tamar Christina from comment #11)
> This patch seems to have stalled. CC'ing the maintainers as this is still a
> large regression for us.

Roger's latest updated patch was posted recently at
https://gcc.gnu.org/pipermail/gcc-patches/2023-February/612840.html

[Bug target/108840] Aarch64 doesn't optimize away shift counter masking

2023-02-24 Thread ktkachov at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=108840

--- Comment #3 from ktkachov at gcc dot gnu.org ---
Created attachment 54531
  --> https://gcc.gnu.org/bugzilla/attachment.cgi?id=54531&action=edit
Candidate patch

Candidate patch attached.

[Bug tree-optimization/108901] [13 Regression] Testsuite failures in gcc.target/aarch64/sve/cond_*

2023-02-23 Thread ktkachov at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=108901

ktkachov at gcc dot gnu.org changed:

   What|Removed |Added

 Status|UNCONFIRMED |RESOLVED
 Resolution|--- |FIXED

--- Comment #2 from ktkachov at gcc dot gnu.org ---
Yes, they are fixed now. Thank you!

[Bug tree-optimization/108901] [13 Regression] Testsuite failures in gcc.target/aarch64/sve/cond_*

2023-02-23 Thread ktkachov at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=108901

ktkachov at gcc dot gnu.org changed:

   What|Removed |Added

   Target Milestone|--- |13.0
   Priority|P3  |P1

[Bug tree-optimization/108901] New: [13 Regression] Testsuite failures in gcc.target/aarch64/sve/cond_*

2023-02-23 Thread ktkachov at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=108901

Bug ID: 108901
   Summary: [13 Regression] Testsuite failures in
gcc.target/aarch64/sve/cond_*
   Product: gcc
   Version: 13.0
Status: UNCONFIRMED
  Keywords: testsuite-fail
  Severity: normal
  Priority: P3
 Component: tree-optimization
  Assignee: unassigned at gcc dot gnu.org
  Reporter: ktkachov at gcc dot gnu.org
  Target Milestone: ---
Target: aarch64

After g:3da77f217c8b2089ecba3eb201e727c3fcdcd19d we're seeing testsuite
failures like:
gcc.target/aarch64/sve/cond_fmaxnm_7.c
gcc.target/aarch64/sve/cond_fminnm_7.c
gcc.target/aarch64/sve/cond_fmaxnm_8.c
gcc.target/aarch64/sve/cond_fminnm_8.c
gcc.target/aarch64/sve/cond_fminnm_6.c
gcc.target/aarch64/sve/fmla_2.c
gcc.target/aarch64/sve/cond_xorsign_2.c
gcc.target/aarch64/sve/cond_xorsign_1.c
gcc.target/aarch64/sve/cond_fmaxnm_6.c

on aarch64. I haven't looked into the cause, just reporting here for tracking

[Bug target/108874] [10/11/12/13 Regression] Missing bswap detection

2023-02-22 Thread ktkachov at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=108874

--- Comment #3 from ktkachov at gcc dot gnu.org ---
(In reply to Richard Biener from comment #2)
> The regression is probably rtl-optimization/target specific since we never
> had this kind of pattern detected on the tree/GIMPLE level and there's no
> builtin or IFN for this shuffling on u32.

FWIW a colleague reported that he bisected the failure to
g:98e30e515f184bd63196d4d500a682fbfeb9635e though I haven't tried it myself.
We do have patterns for these in aarch64 and arm, but combine would need to
match about 5 insns to get there and that's beyond its current limit of 4

[Bug tree-optimization/108874] [10/11/12/13 Regression] Missing bswap detection

2023-02-21 Thread ktkachov at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=108874

--- Comment #1 from ktkachov at gcc dot gnu.org ---
(In reply to ktkachov from comment #0)
> If we look at the arm testcases in gcc.target/arm/rev16.c
> typedef unsigned int __u32;
> 
> __u32
> __rev16_32_alt (__u32 x)
> {
>   return (((__u32)(x) & (__u32)0xff00ff00UL) >> 8)
>  | (((__u32)(x) & (__u32)0x00ff00ffUL) << 8);
> }
> 
> __u32
> __rev16_32 (__u32 x)
> {
>   return (((__u32)(x) & (__u32)0x00ff00ffUL) << 8)
>  | (((__u32)(x) & (__u32)0xff00ff00UL) >> 8);
> }
> 

This isn't a simple __builtin_bswap16, as that returns a uint16_t. It is more
like a __builtin_bswap16 applied to each of the half-words of the u32.
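
To illustrate the distinction, here is a small sketch comparing the
shift-and-mask expression with a byte swap applied to each 16-bit half
separately. The helper names rev16_32_expr and rev16_32_by_halves are made up
for this example and are not part of the testsuite:

```c
#include <stdint.h>

/* Shift-and-mask form, the same expression as __rev16_32 in the testcase.  */
uint32_t
rev16_32_expr (uint32_t x)
{
  return ((x & 0x00ff00ffU) << 8) | ((x & 0xff00ff00U) >> 8);
}

/* Hypothetical helper: byte-swap each 16-bit half independently and
   put the halves back in their original positions.  */
uint32_t
rev16_32_by_halves (uint32_t x)
{
  uint32_t lo = __builtin_bswap16 ((uint16_t) x);
  uint32_t hi = __builtin_bswap16 ((uint16_t) (x >> 16));
  return (hi << 16) | lo;
}
```

Both functions should agree for every input, e.g. 0x11223344 becomes
0x22114433 in either form, which is what a single rev16 computes.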

[Bug tree-optimization/108874] New: [10/11/12/13 Regression] Missing bswap detection

2023-02-21 Thread ktkachov at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=108874

Bug ID: 108874
   Summary: [10/11/12/13 Regression] Missing bswap detection
   Product: gcc
   Version: unknown
Status: UNCONFIRMED
  Keywords: missed-optimization
  Severity: normal
  Priority: P3
 Component: tree-optimization
  Assignee: unassigned at gcc dot gnu.org
  Reporter: ktkachov at gcc dot gnu.org
  Target Milestone: ---

If we look at the arm testcases in gcc.target/arm/rev16.c
typedef unsigned int __u32;

__u32
__rev16_32_alt (__u32 x)
{
  return (((__u32)(x) & (__u32)0xff00ff00UL) >> 8)
 | (((__u32)(x) & (__u32)0x00ff00ffUL) << 8);
}

__u32
__rev16_32 (__u32 x)
{
  return (((__u32)(x) & (__u32)0x00ff00ffUL) << 8)
 | (((__u32)(x) & (__u32)0xff00ff00UL) >> 8);
}

we should be able to generate rev16 instructions for aarch64 (and arm) i.e.
recognise a __builtin_bswap16 essentially.
GCC fails to do so and generates:
__rev16_32_alt:
lsr w1, w0, 8
lsl w0, w0, 8
and w1, w1, 16711935
and w0, w0, -16711936
orr w0, w1, w0
ret
__rev16_32:
lsl w1, w0, 8
lsr w0, w0, 8
and w1, w1, -16711936
and w0, w0, 16711935
orr w0, w1, w0
ret

whereas clang manages to recognise it all into:
__rev16_32_alt: // @__rev16_32_alt
rev16   w0, w0
ret
__rev16_32: // @__rev16_32
rev16   w0, w0
ret

does the bswap pass need some tweaking perhaps?

Looks like this worked fine with GCC 5 but broke in the GCC 6 timeframe so
marking as a regression

[Bug target/108840] Aarch64 doesn't optimize away shift counter masking

2023-02-21 Thread ktkachov at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=108840

ktkachov at gcc dot gnu.org changed:

   What|Removed |Added

   Assignee|unassigned at gcc dot gnu.org  |ktkachov at gcc dot gnu.org
 Status|NEW |ASSIGNED

--- Comment #2 from ktkachov at gcc dot gnu.org ---
I have a patch to simplify and fix the aarch64 rtx costs for this case. I'll
aim it for GCC 14 as it's not a regression.

[Bug target/108779] AARCH64 should add an option to change TLS register location to support EL1/EL2/EL3 system registers

2023-02-14 Thread ktkachov at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=108779

--- Comment #3 from ktkachov at gcc dot gnu.org ---
Created attachment 54459
  --> https://gcc.gnu.org/bugzilla/attachment.cgi?id=54459&action=edit
Candidate patch

Patch that implements -mtp= similar to clang if you have the capability to try
it out

[Bug target/108779] AARCH64 should add an option to change TLS register location to support EL1/EL2/EL3 system registers

2023-02-14 Thread ktkachov at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=108779

ktkachov at gcc dot gnu.org changed:

   What|Removed |Added

 Ever confirmed|0   |1
 CC||ktkachov at gcc dot gnu.org
 Status|UNCONFIRMED |ASSIGNED
   Assignee|unassigned at gcc dot gnu.org  |ktkachov at gcc dot gnu.org
   Last reconfirmed||2023-02-14

--- Comment #2 from ktkachov at gcc dot gnu.org ---
Confirmed. I have a patch I'm testing for it.
Since GCC 13 is in stage4 (regression and wrong-code fixes only) this would be
GCC 14 material. Would that timeline be okay with you?

[Bug target/108659] Suboptimal 128 bit atomics codegen on AArch64 and x64

2023-02-03 Thread ktkachov at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=108659

ktkachov at gcc dot gnu.org changed:

   What|Removed |Added

 CC||ktkachov at gcc dot gnu.org

--- Comment #2 from ktkachov at gcc dot gnu.org ---
(In reply to Niall Douglas from comment #0)
> Related:
> - https://gcc.gnu.org/bugzilla/show_bug.cgi?id=80878
> - https://gcc.gnu.org/bugzilla/show_bug.cgi?id=94649
> - https://gcc.gnu.org/bugzilla/show_bug.cgi?id=104688
> 
> I got bitten by this again, latest GCC still does not emit single
> instruction 128 bit atomics, even when the -march is easily new enough. Here
> is a godbolt comparing latest MSVC, latest GCC and latest clang for the
> skylake-avx512 architecture, which unquestionably supports cmpxchg16b. Only
> clang emits the single instruction atomic:
> 
> https://godbolt.org/z/EnbeeW4az
> 
> I'm gathering from the issue comments and from the comments at
> https://gcc.gnu.org/bugzilla/show_bug.cgi?id=104688 that you're going to
> wait for AMD to guarantee atomicity of SSE instructions before changing the
> codegen here, which makes sense. However I also wanted to raise potentially
> suboptimal 128 bit atomic codegen by GCC for AArch64 as compared to clang:
> 
> https://godbolt.org/z/oKv4o81nv
> 
> GCC emits `dmb` to force a global memory fence, whereas clang does not.
> 
> I think clang is in the right here, the seq_cst atomic semantics are not
> supposed to globally memory fence.

FWIW, the GCC codegen for aarch64 is at https://godbolt.org/z/qvx9484nY (arm
and aarch64 are different targets). It emits a call to libatomic, which for GCC
13 will use a lockless implementation when possible at runtime, see
g:d1288d850944f69a795e4ff444a427eba3fec11b

[Bug target/108495] [10/11/12/13 Regression] aarch64 ICE with __builtin_aarch64_rndr

2023-01-25 Thread ktkachov at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=108495

--- Comment #7 from ktkachov at gcc dot gnu.org ---
Yes, GCC could be more helpful here. The intrinsics and their use are documented
in the ACLE document:
https://github.com/ARM-software/acle/blob/main/main/acle.md#random-number-generation-intrinsics
There is work ongoing to augment it with more user-friendly information about
compiler flags, but GCC could keep track of the options used to gate these
builtins/intrinsics and report a hint.

[Bug target/108495] [10/11/12/13 Regression] aarch64 ICE with __builtin_aarch64_rndr

2023-01-23 Thread ktkachov at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=108495

ktkachov at gcc dot gnu.org changed:

   What|Removed |Added

 Ever confirmed|0   |1
   Keywords||ice-on-invalid-code
   Last reconfirmed||2023-01-23
 Status|UNCONFIRMED |NEW

--- Comment #1 from ktkachov at gcc dot gnu.org ---
Confirmed. That said, __builtin_aarch64_rndr is not supposed to be used
directly by the user. They should include <arm_acle.h> and use the __rndr
intrinsic instead.
That will give the appropriate error:
inlining failed in call to 'always_inline' '__rndr': target specific option
mismatch

Still, I suppose the compiler shouldn't ICE

[Bug tree-optimization/108446] New: GCC fails to elide udiv/msub when doing modulus by select of constants

2023-01-18 Thread ktkachov at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=108446

Bug ID: 108446
   Summary: GCC fails to elide udiv/msub when doing modulus by
select of constants
   Product: gcc
   Version: unknown
Status: UNCONFIRMED
  Keywords: missed-optimization
  Severity: normal
  Priority: P3
 Component: tree-optimization
  Assignee: unassigned at gcc dot gnu.org
  Reporter: ktkachov at gcc dot gnu.org
  Target Milestone: ---

unsigned foo(int vl, unsigned len) {
  unsigned pad = vl <= 256 ? 128 : 256;
  return len % pad;
}

At -O2 aarch64 gcc generates:
foo:
        cmp     w0, 256
        mov     w2, 256
        mov     w0, 128
        csel    w2, w2, w0, gt
        udiv    w0, w1, w2
        msub    w0, w0, w2, w1
        ret

clang, for example can generate the cheaper:
foo:                                    // @foo
        cmp     w0, #256
        mov     w8, #127
        mov     w9, #255
        csel    w8, w9, w8, gt
        and     w0, w8, w1
        ret

Similar situation on x86.
I suppose this could be a match.pd fix or otherwise something during
expand-time?
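
For reference, the cheaper sequence relies on both candidate divisors being
powers of two, so the unsigned remainder reduces to a mask. A rough
source-level sketch of that equivalence (foo_masked is just an illustrative
name, not something GCC or clang produces):

```c
/* Original form: remainder by a selected power-of-two constant.  */
unsigned
foo (int vl, unsigned len)
{
  unsigned pad = vl <= 256 ? 128 : 256;
  return len % pad;
}

/* Hand-written equivalent: for unsigned len and a power-of-two pad,
   len % pad == len & (pad - 1), so select the mask instead.  */
unsigned
foo_masked (int vl, unsigned len)
{
  unsigned mask = vl <= 256 ? 127 : 255;
  return len & mask;
}
```

The two functions should return the same value for all inputs; the second one
maps directly onto the cmp/csel/and sequence shown above.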

[Bug middle-end/88345] -Os overrides -falign-functions=N on the command line

2023-01-17 Thread ktkachov at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=88345

ktkachov at gcc dot gnu.org changed:

   What|Removed |Added

 Status|UNCONFIRMED |NEW
 Ever confirmed|0   |1
 CC||ktkachov at gcc dot gnu.org
   Last reconfirmed||2023-01-17

--- Comment #12 from ktkachov at gcc dot gnu.org ---
(In reply to Kito Cheng from comment #7)
> We are hitting this issue on RISC-V, and got some complain from linux kernel
> developers, but in different form as the original report, we found cold
> function or any function is marked as cold by `-fguess-branch-probability`
> are all not honor to the -falign-functions=N setting, that become problem on
> some linux kernel feature since they want to control the minimal alignment
> to make sure they can atomically update the instruction which require align
> to 4 byte.
> 
> However current GCC behavior can't guarantee that even -falign-functions=4
> is given, there is 3 option in my mind:
> 
> 1. Fix -falign-functions=N, let it work as expect on -Os and all cold
> functions
> 2. Force align to 4 byte if -fpatchable-function-entry is given, that's
> should be doable by adjust RISC-V's FUNCTION_BOUNDARY
> 3. Adjust RISC-V's FUNCTION_BOUNDARY to let it honor to -falign-functions=N
> 4. Adding a -malign-functions=N...Okay, I know that suck idea, x86 already
> deprecated that.
> 
> But I think ideally this should fixed by 1 option if possible.
> 
> Testcase from RISC-V kernel guy:
> ```
> /* { dg-do compile } */
> /* { dg-options "-march=rv64gc -mabi=lp64d -O1 -falign-functions=128" } */
> /* { dg-final { scan-assembler-times ".align 7" 2 } } */
> 
> // Using 128 byte align rather than 4 byte align since it easier to observe.
> 
> __attribute__((__cold__)) void a() {} // This function isn't align to 128
> byte
> void b() {} // This function align to 128 byte.
> ```
> 
> Proposed fix:
> ```
> diff --git a/gcc/varasm.c b/gcc/varasm.c
> index 49d5cda122f..6f8ed85fea9 100644
> --- a/gcc/varasm.c
> +++ b/gcc/varasm.c
> @@ -1907,8 +1907,7 @@ assemble_start_function (tree decl, const char *fnname)
>   Note that we still need to align to DECL_ALIGN, as above,
>   because ASM_OUTPUT_MAX_SKIP_ALIGN might not do any alignment at all. 
> */
>if (! DECL_USER_ALIGN (decl)
> -  && align_functions.levels[0].log > align
> -  && optimize_function_for_speed_p (cfun))
> +  && align_functions.levels[0].log > align)
>  {
>  #ifdef ASM_OUTPUT_MAX_SKIP_ALIGN
>int align_log = align_functions.levels[0].log;
> 
> ```

I think this patch makes sense given the extra information you and Mark have
provided. Would you mind testing it and posting it to gcc-patches for review
please?

[Bug rust/106072] [13 Regression] -Wnonnull warning breaks rust bootstrap

2022-12-20 Thread ktkachov at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=106072

--- Comment #18 from ktkachov at gcc dot gnu.org ---
(In reply to Richard Biener from comment #17)
> Fixed(?)

Yes on aarch64, thanks!

[Bug target/102218] 128-bit atomic compare and exchange does not honor memory model on AArch64 and Arm

2022-12-20 Thread ktkachov at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=102218

ktkachov at gcc dot gnu.org changed:

   What|Removed |Added

 CC||ktkachov at gcc dot gnu.org

--- Comment #3 from ktkachov at gcc dot gnu.org ---
Does this need to be backported to other release versions as it's a wrong-code
bug?

[Bug target/95751] [aarch64] Consider using ldapr for __atomic_load_n(acquire) on ARMv8.3-RCPC

2022-12-20 Thread ktkachov at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=95751

ktkachov at gcc dot gnu.org changed:

   What|Removed |Added

 CC||ktkachov at gcc dot gnu.org
 Ever confirmed|0   |1
   Last reconfirmed||2022-12-20
 Status|UNCONFIRMED |NEW

--- Comment #1 from ktkachov at gcc dot gnu.org ---
I had not seen this report at the time, but LDAPR generation has now been
implemented in GCC 13.1 for acquire loads with
https://gcc.gnu.org/g:0431e8ae5bdb854bda5f9005e41c8c4d03f6d74e and follow-ups.
Any testing/evaluation/feedback would be welcome
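
For anyone wanting to try it, a minimal acquire load looks like the sketch
below. The function name is arbitrary, and enabling the feature through an
-march string that includes the +rcpc extension is an assumption on my part
about the simplest way to test it:

```c
#include <stdint.h>

/* Acquire load: a candidate for LDAPR when RCPC is available,
   and for LDAR otherwise.  */
uint64_t
load_acquire (const uint64_t *p)
{
  return __atomic_load_n (p, __ATOMIC_ACQUIRE);
}
```

Comparing the assembly with and without the extension enabled should show
whether the acquire load is emitted as ldapr or ldar.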

[Bug target/107209] [13 Regression] ICE: verify_gimple failed (error: statement marked for throw, but doesn't)

2022-12-20 Thread ktkachov at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=107209

--- Comment #5 from ktkachov at gcc dot gnu.org ---
(In reply to Jakub Jelinek from comment #4)
> Looking at other backends, rs6000 uses in *gimple_fold_builtin gsi_replace
> (..., true);
> all the time, ix86 gsi_replace (..., false); all the time, alpha with true,
> aarch64 with true.  But perhaps what is more important if the builtins
> folded are declared nothrow or not, if they are nothrow, then they shouldn't
> have any EH edges at the start already and so it shouldn't matter what is
> used.

The vmulx_f64 intrinsic is not marked "nothrow" by the logic:
static tree
aarch64_get_attributes (unsigned int f, machine_mode mode)
{
  tree attrs = NULL_TREE;

  if (!aarch64_modifies_global_state_p (f, mode))
    {
      if (aarch64_reads_global_state_p (f, mode))
        attrs = aarch64_add_attribute ("pure", attrs);
      else
        attrs = aarch64_add_attribute ("const", attrs);
    }

  if (!flag_non_call_exceptions || !aarch64_could_trap_p (f, mode))
    attrs = aarch64_add_attribute ("nothrow", attrs);

  return aarch64_add_attribute ("leaf", attrs);
}

aarch64_could_trap_p returns true for it as it can raise an FP exception.
Should that affect the nothrow attribute though? Shouldn't that be for C++
exceptions only?

[Bug middle-end/108140] ICE expanding __rbit

2022-12-16 Thread ktkachov at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=108140

ktkachov at gcc dot gnu.org changed:

   What|Removed |Added

   Target Milestone|--- |12.3
 CC||ktkachov at gcc dot gnu.org
 Ever confirmed|0   |1
   Last reconfirmed||2022-12-16
 Status|UNCONFIRMED |ASSIGNED

--- Comment #5 from ktkachov at gcc dot gnu.org ---
Confirmed the ICE and I'm testing a patch to fix that, thanks for the report

[Bug rust/108084] New: AArch64 Linux bootstrap failure in rust

2022-12-13 Thread ktkachov at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=108084

Bug ID: 108084
   Summary: AArch64 Linux bootstrap failure in rust
   Product: gcc
   Version: unknown
Status: UNCONFIRMED
  Keywords: build
  Severity: normal
  Priority: P3
 Component: rust
  Assignee: unassigned at gcc dot gnu.org
  Reporter: ktkachov at gcc dot gnu.org
CC: dkm at gcc dot gnu.org
  Target Milestone: ---
  Host: aarch64-none-linux-gnu
Target: aarch64-none-linux-gnu

Congratulations on getting the rust frontend committed!
When trying a bootstrap on aarch64-none-linux with
--enable-languages=c,c++,fortran,rust I get a -Werror=nonnull failure

In file included from $SRC/gcc/rust/parse/rust-parse.h:730,
 from $SRC/gcc/rust/expand/rust-macro-builtins.cc:25:
$SRC/gcc/rust/parse/rust-parse-impl.h: In member function
'Rust::AST::ClosureParam
Rust::Parser::parse_closure_param() [with
ManagedTokenSource = Rust::Lexer]':
$SRC/gcc/rust/parse/rust-parse-impl.h:8916:70: error: 'this' pointer is null
[-Werror=nonnull]
 8916 | std::move (type), std::move (outer_attrs));
  |  ^
In file included from $SRC/gcc/rust/parse/rust-parse.h:730,
 from $SRC/gcc/rust/expand/rust-macro-expand.h:23,
 from $SRC/gcc/rust/expand/rust-macro-expand.cc:19:
$SRC/gcc/rust/parse/rust-parse-impl.h: In member function
'Rust::AST::ClosureParam
Rust::Parser::parse_closure_param() [with
ManagedTokenSource = Rust::MacroInvocLexer]':
$SRC/gcc/rust/parse/rust-parse-impl.h:8916:70: error: 'this' pointer is null
[-Werror=nonnull]
 8916 | std::move (type), std::move (outer_attrs));
  |  ^
In file included from $SRC/gcc/rust/parse/rust-parse.h:730,
 from $SRC/gcc/rust/rust-session-manager.cc:23:
$SRC/gcc/rust/parse/rust-parse-impl.h: In member function
'Rust::AST::ClosureParam
Rust::Parser::parse_closure_param() [with
ManagedTokenSource = Rust::Lexer]':
$SRC/gcc/rust/parse/rust-parse-impl.h:8916:70: error: 'this' pointer is null
[-Werror=nonnull]
 8916 | std::move (type), std::move (outer_attrs));

[Bug target/108006] [13 Regression] ICE in aarch64_move_imm building 502.gcc_r

2022-12-07 Thread ktkachov at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=108006

ktkachov at gcc dot gnu.org changed:

   What|Removed |Added

 CC||wdijkstr at arm dot com

--- Comment #1 from ktkachov at gcc dot gnu.org ---
Wilco, is this something you've touched recently?

[Bug target/108006] New: [13 Regression] ICE in aarch64_move_imm building 502.gcc_r

2022-12-07 Thread ktkachov at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=108006

Bug ID: 108006
   Summary: [13 Regression] ICE in aarch64_move_imm building
502.gcc_r
   Product: gcc
   Version: unknown
Status: UNCONFIRMED
  Severity: normal
  Priority: P3
 Component: target
  Assignee: unassigned at gcc dot gnu.org
  Reporter: ktkachov at gcc dot gnu.org
  Target Milestone: ---

Building 502.gcc_r from SPEC2017 with -O2 -mcpu=neoverse-v1 ICEs with trunk.
Reduced testcase:

void c();

short *foo;
short *bar;
void
a() {
  for (bar; bar < foo; bar++)
    *bar = 999;
  c();
}

backtrace is:
during RTL pass: expand
ice.c: In function a:
ice.c:8:10: internal compiler error: in aarch64_move_imm, at
config/aarch64/aarch64.cc:5692
8 | *bar = 999;
  | ~^
0x129db4c aarch64_move_imm(unsigned long, machine_mode)
$SRC/gcc/config/aarch64/aarch64.cc:5692
0x12c01cd aarch64_expand_sve_const_vector
$SRC/gcc/config/aarch64/aarch64.cc:6516
0x12c63cb aarch64_expand_mov_immediate(rtx_def*, rtx_def*)
$SRC/gcc/config/aarch64/aarch64.cc:6996
0x18c3248 gen_movvnx8hi(rtx_def*, rtx_def*)
$SRC/gcc/config/aarch64/aarch64-sve.md:662
0xa09062 rtx_insn* insn_gen_fn::operator()(rtx_def*,
rtx_def*) const
$SRC/gcc/recog.h:407
0xa09062 emit_move_insn_1(rtx_def*, rtx_def*)
$SRC/gcc/expr.cc:4172
0xa095bb emit_move_insn(rtx_def*, rtx_def*)
$SRC/gcc/expr.cc:4342
0x9db8aa copy_to_mode_reg(machine_mode, rtx_def*)
$SRC/gcc/explow.cc:654
0xd0607d maybe_legitimize_operand
$SRC/gcc/optabs.cc:7809
0xd0607d maybe_legitimize_operands(insn_code, unsigned int, unsigned int,
expand_operand*)
$SRC/gcc/optabs.cc:7941
0xd06366 maybe_gen_insn(insn_code, unsigned int, expand_operand*)
$SRC/gcc/optabs.cc:7960
0xd06592 maybe_expand_insn(insn_code, unsigned int, expand_operand*)
$SRC/gcc/optabs.cc:8005
0xd05b17 expand_insn(insn_code, unsigned int, expand_operand*)
$SRC/gcc/optabs.cc:8036
0xb53fb7 expand_partial_store_optab_fn
$SRC/gcc/internal-fn.cc:2878
0xb54307 expand_MASK_STORE
$SRC/gcc/internal-fn.def:141
0xb59960 expand_internal_call(internal_fn, gcall*)
$SRC/gcc/internal-fn.cc:4436
0xb5997a expand_internal_call(gcall*)
$SRC/gcc/internal-fn.cc:
0x8b6161 expand_call_stmt
$SRC/gcc/cfgexpand.cc:2737
0x8b6161 expand_gimple_stmt_1

[Bug target/107988] [13 Regression] ICE: in extract_insn, at recog.cc:2791 (unrecognizable insn) on aarch64-unknown-linux-gnu

2022-12-06 Thread ktkachov at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=107988

ktkachov at gcc dot gnu.org changed:

   What|Removed |Added

   Last reconfirmed||2022-12-06
 Ever confirmed|0   |1
 CC||ktkachov at gcc dot gnu.org,
   ||tnfchris at gcc dot gnu.org
 Status|UNCONFIRMED |NEW

--- Comment #1 from ktkachov at gcc dot gnu.org ---
Confirmed. Looks related to the recent div-by-special-constant changes but ICEs
only at -O0

[Bug target/107830] [13 Regression] ICE in gen_aarch64_bitmask_udiv3, at ./insn-opinit.h:813

2022-11-23 Thread ktkachov at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=107830

ktkachov at gcc dot gnu.org changed:

   What|Removed |Added

 CC||tnfchris at gcc dot gnu.org

--- Comment #2 from ktkachov at gcc dot gnu.org ---
I think it's more likely Tamar's recent patches for that optab

[Bug target/107102] SVE function fails to realize it doesn't need the frame-pointer in the tail call.

2022-10-04 Thread ktkachov at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=107102

ktkachov at gcc dot gnu.org changed:

   What|Removed |Added

 CC||ktkachov at gcc dot gnu.org
 Status|UNCONFIRMED |NEW
 Ever confirmed|0   |1
   Last reconfirmed||2022-10-04

--- Comment #1 from ktkachov at gcc dot gnu.org ---
Confirmed, clang tail-calls this:
bar:                                    // @bar
        ptrue   p1.b
        ptrue   p0.s
        and     p0.b, p1/z, p1.b, p0.b
        b       foo

[Bug target/107025] gas doesn't accept code produced by -mcpu=thunderx3t110

2022-09-26 Thread ktkachov at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=107025

ktkachov at gcc dot gnu.org changed:

   What|Removed |Added

   Last reconfirmed||2022-09-26
 Ever confirmed|0   |1
 Status|UNCONFIRMED |NEW

--- Comment #2 from ktkachov at gcc dot gnu.org ---
In the Arm architecture this is FEAT_LRCPC2. LLVM does have an MC (essentially
assembler-level) feature string for it called "rcpc-immo", so if we wanted to
support this I guess we'd want to be compatible.
That said, it may be cleaner to just remove support for thunderx3t110 if we
think it's the right time.

Unfortunately we do still have some cases where our features aren't
fine-grained enough and are tied to architecture levels that some CPUs don't
claim to support:
https://godbolt.org/z/axbnd4c5o

[Bug target/106583] New: Suboptimal immediate generation on aarch64

2022-08-11 Thread ktkachov at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=106583

Bug ID: 106583
   Summary: Suboptimal immediate generation on aarch64
   Product: gcc
   Version: unknown
Status: UNCONFIRMED
  Keywords: missed-optimization
  Severity: normal
  Priority: P3
 Component: target
  Assignee: unassigned at gcc dot gnu.org
  Reporter: ktkachov at gcc dot gnu.org
  Target Milestone: ---
Target: aarch64

A simple codegen issue:
unsigned long long
foo (void)
{
  return 0x7efefefefefefeff;
}

generates at -O2
foo:
        mov     x0, 65279
        movk    x0, 0xfefe, lsl 16
        movk    x0, 0xfefe, lsl 32
        movk    x0, 0x7efe, lsl 48
        ret

whereas LLVM can do:
foo:                                    // @foo
        mov     x0, #-72340172838076674
        movk    x0, #65279
        movk    x0, #32510, lsl #48
        ret

Should be a matter of just making aarch64_internal_mov_immediate in aarch64.cc
a bit smarter
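
To make the comparison concrete, here is a small C model of the two sequences.
The movk helper and the imm_gcc/imm_llvm names are made up for this sketch and
only model the MOVK semantics of replacing one 16-bit field; they are not how
aarch64_internal_mov_immediate is written:

```c
#include <stdint.h>

/* Model of MOVK: replace the 16-bit field at SHIFT with IMM16 and
   keep the other bits of X unchanged.  */
static uint64_t
movk (uint64_t x, uint64_t imm16, unsigned shift)
{
  return (x & ~(0xffffULL << shift)) | (imm16 << shift);
}

/* The four-instruction sequence GCC emits today.  */
uint64_t
imm_gcc (void)
{
  uint64_t x = 65279;                  /* mov  x0, 65279 (0xfeff) */
  x = movk (x, 0xfefe, 16);            /* movk x0, 0xfefe, lsl 16 */
  x = movk (x, 0xfefe, 32);            /* movk x0, 0xfefe, lsl 32 */
  x = movk (x, 0x7efe, 48);            /* movk x0, 0x7efe, lsl 48 */
  return x;
}

/* The three-instruction LLVM sequence: start from the repeating
   pattern 0xfefefefefefefefe and patch the two fields that differ.  */
uint64_t
imm_llvm (void)
{
  uint64_t x = (uint64_t) -72340172838076674LL;
  x = movk (x, 65279, 0);              /* movk x0, #65279 */
  x = movk (x, 32510, 48);             /* movk x0, #32510, lsl #48 */
  return x;
}
```

Both functions produce 0x7efefefefefefeff. The saving comes from the first
instruction already setting the two middle 16-bit fields correctly.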

[Bug middle-end/106568] -freorder-blocks-algorithm appears to causes a crash in stable code, no way to disable it

2022-08-08 Thread ktkachov at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=106568

ktkachov at gcc dot gnu.org changed:

   What|Removed |Added

 CC||ktkachov at gcc dot gnu.org

--- Comment #1 from ktkachov at gcc dot gnu.org ---

> We are fairly certain the problem is with the -freorder-blocks-algorithm
> optimization. The problem we are now having is, we don't know how to disable
> it. The following fails to compile:
> 
> -fno-reorder-blocks-algorithm
> -freorder-blocks-algorithm=none
> -freorder-blocks-algorithm=
> 

You should be able to use -fno-reorder-blocks to disable it.
Alternatively, if you use -freorder-blocks-algorithm= you can only pass it the
"simple" or "stc" options as per the documentation. This will pick one of the
two available algorithms.

That said, one major change that happened in GCC 12.1 was enabling
auto-vectorisation by default at -O2. See
https://gcc.gnu.org/gcc-12/changes.html
The vectorisation at -O2 uses less aggressive heuristics than at -O3, so it
could trigger different behaviour than -O3, or than lower optimisation levels
(where it doesn't vectorise at all). May be worth investigating.

[Bug tree-optimization/106343] Addition with constants is not vectorized by SLP when it includes zero

2022-07-18 Thread ktkachov at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=106343

ktkachov at gcc dot gnu.org changed:

   What|Removed |Added

 Status|UNCONFIRMED |NEW
 Ever confirmed|0   |1
   Last reconfirmed||2022-07-18
 CC||ktkachov at gcc dot gnu.org,
   ||rguenth at gcc dot gnu.org
 Target|aarch64 |aarch64, x86_64

--- Comment #1 from ktkachov at gcc dot gnu.org ---
Confirmed, it's quite odd. x86_64 is also affected:
https://godbolt.org/z/q46z3hh9Y

[Bug target/106324] ptrue not reused between vector instructions and predicate instructions

2022-07-18 Thread ktkachov at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=106324

ktkachov at gcc dot gnu.org changed:

   What|Removed |Added

   Last reconfirmed||2022-07-18
 Ever confirmed|0   |1
 CC||ktkachov at gcc dot gnu.org
   Keywords||missed-optimization
 Status|UNCONFIRMED |NEW

--- Comment #2 from ktkachov at gcc dot gnu.org ---
Confirmed.

[Bug tree-optimization/98138] BB vect fail to SLP one case

2022-07-06 Thread ktkachov at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=98138

ktkachov at gcc dot gnu.org changed:

   What|Removed |Added

   Last reconfirmed||2022-07-06
 CC||ktkachov at gcc dot gnu.org
 Ever confirmed|0   |1
 Status|UNCONFIRMED |NEW

--- Comment #10 from ktkachov at gcc dot gnu.org ---
Note that current clang does a pretty decent job on this now on aarch64 (in
case it gives some inspiration on the approach)
https://godbolt.org/z/EPvqMhh7v

[Bug tree-optimization/106064] Wrong code comparing two global zero-sized arrays

2022-06-23 Thread ktkachov at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=106064

ktkachov at gcc dot gnu.org changed:

   What|Removed |Added

   Keywords||wrong-code
 CC||ktkachov at gcc dot gnu.org

--- Comment #1 from ktkachov at gcc dot gnu.org ---
This seems to have changed in the GCC 9 series. GCC 8.5 generates:
f():
mov w0, 1
ret
g():
mov w0, 1
ret
b:
a:

Tagging as a claimed wrong-code bug.

[Bug tree-optimization/105793] New: Missed vectorisation with conditional-select inside loop

2022-05-31 Thread ktkachov at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=105793

Bug ID: 105793
   Summary: Missed vectorisation with conditional-select inside
loop
   Product: gcc
   Version: unknown
Status: UNCONFIRMED
  Keywords: missed-optimization
  Severity: normal
  Priority: P3
 Component: tree-optimization
  Assignee: unassigned at gcc dot gnu.org
  Reporter: ktkachov at gcc dot gnu.org
  Target Milestone: ---

The code:
#define N 1024

float f(const float in[N], unsigned int n) {
    float a = 0.0f;

    for (unsigned i = 0; i < N; ++i) {
        float b = in[i];
        if (b < 10.f)
            a += b;
        else
            a -= b;
    }

    return a;
}

with -Ofast does not vectorise (on aarch64, for example):
f:
        movi    v0.2s, #0
        add     x1, x0, 4096
        fmov    s3, 1.0e+1
.L5:
        ldr     s1, [x0], 4
        fsub    s2, s0, s1
        fcmpe   s1, s3
        fadd    s0, s0, s1
        fcsel   s0, s0, s2, mi
        cmp     x1, x0
        bne     .L5
        ret

whereas clang can and does. Commenting out the "else a -=b;" line allows GCC to
vectorise it:
f:
        movi    v0.4s, 0
        add     x1, x0, 4096
        fmov    v3.4s, 1.0e+1
.L2:
        ldr     q2, [x0], 16
        fcmgt   v1.4s, v3.4s, v2.4s
        and     v1.16b, v1.16b, v2.16b
        fadd    v0.4s, v0.4s, v1.4s
        cmp     x1, x0
        bne     .L2
        faddp   v0.4s, v0.4s, v0.4s
        faddp   v0.4s, v0.4s, v0.4s
        ret

Examples at https://gcc.godbolt.org/z/qbn6T73qE
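
As a point of comparison, the two branches differ only in the sign of b, and
a - b is the same as a + (-b). So the loop can be written with one conditional
select feeding a single accumulation, as in the sketch below. The f_select
name is made up for this example, the unused n parameter is dropped, and
whether GCC vectorises this form is not something established here:

```c
#define N 1024

/* Same loop as in the report.  */
float
f (const float in[N])
{
  float a = 0.0f;

  for (unsigned i = 0; i < N; ++i)
    {
      float b = in[i];
      if (b < 10.f)
        a += b;
      else
        a -= b;
    }

  return a;
}

/* Select the signed addend first, then accumulate once per iteration.  */
float
f_select (const float in[N])
{
  float a = 0.0f;

  for (unsigned i = 0; i < N; ++i)
    {
      float b = in[i];
      a += (b < 10.f) ? b : -b;
    }

  return a;
}
```

Without fast-math reassociation the two functions perform the same additions
in the same order, so they should return identical results.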

[Bug target/99037] Invalid representation of vector zero in aarch64-simd.md

2022-05-06 Thread ktkachov at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=99037

ktkachov at gcc dot gnu.org changed:

   What|Removed |Added

 Resolution|--- |FIXED
 Status|ASSIGNED|RESOLVED

--- Comment #10 from ktkachov at gcc dot gnu.org ---
This has been fixed in all active branches.

[Bug target/105219] [12 Regression] SVE: Wrong code with -O3 -msve-vector-bits=128 -mtune=thunderx

2022-04-11 Thread ktkachov at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=105219

ktkachov at gcc dot gnu.org changed:

   What|Removed |Added

 Ever confirmed|0   |1
 Status|UNCONFIRMED |NEW
   Target Milestone|--- |12.0
 CC||ktkachov at gcc dot gnu.org
   Priority|P3  |P1

--- Comment #2 from ktkachov at gcc dot gnu.org ---
Confirmed then.

[Bug target/105162] [AArch64] outline-atomics drops dmb ish barrier on __sync builtins

2022-04-07 Thread ktkachov at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=105162

ktkachov at gcc dot gnu.org changed:

   What|Removed |Added

 CC||ktkachov at gcc dot gnu.org

--- Comment #6 from ktkachov at gcc dot gnu.org ---
Can you please send the patch to gcc-patches for review? It'll get more eyes
there.

[Bug middle-end/104026] [12 Regression] ICE in wide_int_to_tree_1, at tree.c:1755 via tree-vect-loop-manip.c:673

2022-01-14 Thread ktkachov at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=104026

ktkachov at gcc dot gnu.org changed:

   What|Removed |Added

 Status|UNCONFIRMED |NEW
   Last reconfirmed||2022-01-14
 Ever confirmed|0   |1
 CC||ktkachov at gcc dot gnu.org
 Target|amdgcn-amdhsa   |amdgcn-amdhsa, aarch64

--- Comment #9 from ktkachov at gcc dot gnu.org ---
We're also seeing this on aarch64-none-elf with:
#include <stdlib.h>

void execute(int *y);

void foo (int n) {
  int *b = (int *)malloc((n - 1) * sizeof(int));
  execute(b);

  int n1 = 1.0 / (n - 1);
  for (int i = 0; i < n - 1; i++) {
    b[i] *= n1;
  }
}

compiled with -O2 -march=armv8-a+sve

[Bug other/79469] Feature request: provide `__builtin_assume` builtin function to allow more aggressive optimizations and to match clang

2021-11-24 Thread ktkachov at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=79469

ktkachov at gcc dot gnu.org changed:

   What|Removed |Added

 Status|UNCONFIRMED |NEW
   Last reconfirmed||2021-11-24
 CC||aldyh at gcc dot gnu.org,
   ||ktkachov at gcc dot gnu.org
 Ever confirmed|0   |1

--- Comment #3 from ktkachov at gcc dot gnu.org ---
We've received requests from some users for this builtin as well. Given the new
ranger infrastructure, would it be able to make use of the semantics of such a
builtin in a useful way? (It'd be good to see GCC eliminate some redundant
extensions, maybe threading opportunities could be improved etc)
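
For context, users commonly approximate such a builtin today with __builtin_unreachable. A minimal sketch follows; the ASSUME macro and the element function are hypothetical names for this illustration, not GCC features, and how much GCC actually exploits the assumption depends on the version and the passes involved.

```c
/* Hypothetical ASSUME macro built on __builtin_unreachable: if the
   condition is false the behaviour is undefined, so the compiler may
   treat the condition as always true past this point.  */
#define ASSUME(cond) \
  do { if (!(cond)) __builtin_unreachable (); } while (0)

/* With i known to be non-negative, the sign extension of the index
   is a candidate for elimination.  */
long element (const long *base, int i)
{
  ASSUME (i >= 0);
  return base[i];
}
```

Unlike a dedicated builtin, the condition here is a real expression that has to be evaluated or optimised away, so it should be free of side effects.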

[Bug tree-optimization/102652] Unnecessary zeroing out of local ARM NEON arrays

2021-10-08 Thread ktkachov at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=102652

ktkachov at gcc dot gnu.org changed:

   What|Removed |Added

 CC||ktkachov at gcc dot gnu.org
 Status|UNCONFIRMED |NEW
 Ever confirmed|0   |1
   Last reconfirmed||2021-10-08

--- Comment #1 from ktkachov at gcc dot gnu.org ---
Confirmed on the GCC 11 release. There is an active effort to improve the code
generation for these intrinsics and current trunk produces:
bug:
        ldr     q5, [x1]
        sshr    v4.16b, v5.16b, 7
        mov     v0.16b, v5.16b
        mov     v1.16b, v4.16b
        mov     v2.16b, v4.16b
        mov     v3.16b, v4.16b
        st4     {v0.16b - v3.16b}, [x0], 64
        ldr     q4, [x1, 16]
        mov     v0.16b, v4.16b
        sshr    v4.16b, v4.16b, 7
        mov     v1.16b, v4.16b
        mov     v2.16b, v4.16b
        mov     v3.16b, v4.16b
        st4     {v0.16b - v3.16b}, [x0]
        ret

Not optimal yet, but moving in the right direction

[Bug tree-optimization/102324] ICE in initialize_matrix_A, at tree-data-ref.c:3959

2021-09-14 Thread ktkachov at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=102324

ktkachov at gcc dot gnu.org changed:

   What|Removed |Added

   Target Milestone|--- |10.4
 Target||aarch64
  Known to fail||10.3.1, 11.1.1, 12.0

[Bug tree-optimization/102324] New: ICE in initialize_matrix_A, at tree-data-ref.c:3959

2021-09-14 Thread ktkachov at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=102324

Bug ID: 102324
   Summary: ICE in initialize_matrix_A, at tree-data-ref.c:3959
   Product: gcc
   Version: 12.0
Status: UNCONFIRMED
  Keywords: ice-on-valid-code
  Severity: normal
  Priority: P3
 Component: tree-optimization
  Assignee: unassigned at gcc dot gnu.org
  Reporter: ktkachov at gcc dot gnu.org
  Target Milestone: ---

The AArch64 SVE ACLE testcase below

#include <arm_sve.h>
svint8_t doit(svbool_t ptrue, svint8_t m0) {
    auto combine_low =
        []( svint8_t in) -> svint8_t {
            int8_t data [2000];
            svst1(ptrue, (int8_t *)data, in);
            for (int _i = (int)svcntb()/2; _i < (int)svcntb(); ++_i)
                data[_i] = data[_i-(int)svcntb()];
            in = svld1(ptrue, data);
            return in;
        };
    return combine_low(m0);

}

ICEs with -march=armv8-a+sve -O2 
ice.cc: In lambda function:
ice.cc:4:5: internal compiler error: in initialize_matrix_A, at
tree-data-ref.c:3959
4 | []( svint8_t in) -> svint8_t {
  | ^
0x1ed2988 initialize_matrix_A
$SRC/gcc/tree-data-ref.c:3959
0x1ed2965 initialize_matrix_A
$SRC/gcc/tree-data-ref.c:3929
0x1ed8454 analyze_subscript_affine_affine
$SRC/gcc/tree-data-ref.c:4361
0x1edb8fd analyze_siv_subscript
$SRC/gcc/tree-data-ref.c:4703
0x1edb8fd analyze_overlapping_iterations
$SRC/gcc/tree-data-ref.c:4933
0x1edb8fd subscript_dependence_tester_1
$SRC/gcc/tree-data-ref.c:5487
0x1edc10c subscript_dependence_tester
$SRC/gcc/tree-data-ref.c:5537
0x1edc10c compute_affine_dependence(data_dependence_relation*, loop*)
$SRC/gcc/tree-data-ref.c:5597
0x118ea4d loop_distribution::get_data_dependence(graph*, data_reference*,
data_reference*)
$SRC/gcc/tree-loop-distribution.c:1379
0x118eaba loop_distribution::data_dep_in_cycle_p(graph*, data_reference*,
data_reference*)
$SRC/gcc/tree-loop-distribution.c:1398
0x118ed49 loop_distribution::update_type_for_merge(graph*, partition*,
partition*)
$SRC/gcc/tree-loop-distribution.c:1441
0x118f927 loop_distribution::build_rdg_partition_for_vertex(graph*, int)
$SRC/gcc/tree-loop-distribution.c:1485
0x118fb51 loop_distribution::rdg_build_partitions(graph*, vec, vec*)
$SRC/gcc/tree-loop-distribution.c:1938
0x1191c19 loop_distribution::distribute_loop(loop*, vec const&, control_dependences*, int*, bool*, bool)
$SRC/gcc/tree-loop-distribution.c:2984
0x11940f8 loop_distribution::execute(function*)
$SRC/gcc/tree-loop-distribution.c:3353
0x119508d execute
$SRC/gcc/tree-loop-distribution.c:3441
Please submit a full bug report,
with preprocessed source if appropriate.
Please include the complete backtrace with any bug report.
See <https://gcc.gnu.org/bugs/> for instructions.

[Bug target/102252] svbool_t with SVE can generate invalid assembly

2021-09-09 Thread ktkachov at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=102252

ktkachov at gcc dot gnu.org changed:

   What|Removed |Added

   Assignee|unassigned at gcc dot gnu.org  |ktkachov at gcc dot gnu.org
 Status|NEW |ASSIGNED

--- Comment #4 from ktkachov at gcc dot gnu.org ---
Testing a patch

[Bug target/102252] svbool_t with SVE can generate invalid assembly

2021-09-09 Thread ktkachov at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=102252

--- Comment #3 from ktkachov at gcc dot gnu.org ---
The RTL for the offending insn:

(insn 9 8 10 (set (reg:VNx16BI 68 p0)
(mem:VNx16BI (plus:DI (mult:DI (reg:DI 1 x1 [93])
(const_int 8 [0x8]))
(reg/f:DI 0 x0 [92])) [2 work_3(D)->array[offset_4(D)]+0 S8
A16])) "asm.c":29:29 4465 {*aarch64_sve_movvnx16bi}
 (nil))

That addressing mode isn't valid for predicate loads.
In aarch64.c:aarch64_classify_address if we set allow_reg_index_p to false when
vec_flags & VEC_SVE_PRED that fixes it, but will need more testing

[Bug target/102252] svbool_t with SVE can generate invalid assembly

2021-09-09 Thread ktkachov at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=102252

ktkachov at gcc dot gnu.org changed:

   What|Removed |Added

   Last reconfirmed||2021-09-09
 Status|UNCONFIRMED |NEW
 CC||ktkachov at gcc dot gnu.org
 Target||aarch64
 Ever confirmed|0   |1

--- Comment #2 from ktkachov at gcc dot gnu.org ---
Confirmed.

[Bug target/102226] [12 Regression] ICE with -O3 -msve-vector-bits=128

2021-09-07 Thread ktkachov at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=102226

ktkachov at gcc dot gnu.org changed:

   What|Removed |Added

   Priority|P3  |P1
Summary|ICE with -O3|[12 Regression] ICE with
   |-msve-vector-bits=128   |-O3 -msve-vector-bits=128
   Target Milestone|--- |12.0
  Known to fail||12.0
  Known to work||11.0

--- Comment #4 from ktkachov at gcc dot gnu.org ---
Works in GCC 11

[Bug target/102226] ICE with -O3 -msve-vector-bits=128

2021-09-07 Thread ktkachov at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=102226

ktkachov at gcc dot gnu.org changed:

   What|Removed |Added

 CC||ktkachov at gcc dot gnu.org
   Last reconfirmed||2021-09-07
 Status|UNCONFIRMED |NEW
 Ever confirmed|0   |1

--- Comment #3 from ktkachov at gcc dot gnu.org ---
Reduced testcase
template  struct b { using c = a; };
template  class> using f = b;
template  class g>
using h = typename f::c;
struct i {
  template  using k = typename j::l;
};
struct m : i {
  using l = h;
};
class n {
public:
  char operator[](long o) {
m::l s;
return s[o];
  }
} p;
n r;
int q() {
  long d;
  for (long e; e; e++)
if (p[e] == r[e])
  d++;
  return d;
}

[Bug target/95969] Use of __builtin_aarch64_im_lane_boundsi in AArch64 arm_neon.h interferes with gimple optimisation

2021-09-02 Thread ktkachov at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=95969

ktkachov at gcc dot gnu.org changed:

   What|Removed |Added

 CC||ktkachov at gcc dot gnu.org

--- Comment #4 from ktkachov at gcc dot gnu.org ---
(In reply to Andrew Pinski from comment #3)
> Created attachment 51396 [details]
> Patch
> 
> Simple patch which adds both generic and gimple level folding for
> __builtin_aarch64_im_lane_boundsi.
> In this case (and most likely others), __builtin_aarch64_im_lane_boundsi is
> removed during early inlining so it will fix the majority of the issue.

looks like the wrong patch was attached?

[Bug target/102066] aarch64: Suboptimal addressing modes for SVE LD1W, ST1W

2021-08-25 Thread ktkachov at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=102066

--- Comment #2 from ktkachov at gcc dot gnu.org ---
(In reply to rsand...@gcc.gnu.org from comment #1)
> > I guess the predicates and constraints in @aarch64_pred_mov in 
> > aarch64-sve.md should allow for the scaled address modes
> They already allow them.  I'm guessing this is an ivopts problem,
> in that it doesn't realise it can promote the unsigned iterator
> to uint64_t for a svcntw() step.

ah indeed
#include 

void foo(int n, float *x, float *y) {
for (uint64_t i=0; i

[Bug target/102066] New: aarch64: Suboptimal addressing modes for SVE LD1W, ST1W

2021-08-25 Thread ktkachov at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=102066

Bug ID: 102066
   Summary: aarch64: Suboptimal addressing modes for SVE LD1W,
ST1W
   Product: gcc
   Version: 12.0
Status: UNCONFIRMED
  Keywords: missed-optimization
  Severity: normal
  Priority: P3
 Component: target
  Assignee: unassigned at gcc dot gnu.org
  Reporter: ktkachov at gcc dot gnu.org
CC: rsandifo at gcc dot gnu.org
  Target Milestone: ---
Target: aarch64

For the code:
#include 

void foo(int n, float *x, float *y) {
for (unsigned i=0; i in
aarch64-sve.md should allow for the scaled address modes

[Bug tree-optimization/101637] #pragma omp for simd defeats VECT_COMPARE_COSTS optimisations

2021-07-27 Thread ktkachov at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=101637

ktkachov at gcc dot gnu.org changed:

   What|Removed |Added

 Ever confirmed|0   |1
 CC||ktkachov at gcc dot gnu.org
   Last reconfirmed||2021-07-27
 Status|UNCONFIRMED |NEW

--- Comment #2 from ktkachov at gcc dot gnu.org ---
Confirmed, though it also needs -fopenmp to trigger for me

[Bug tree-optimization/101390] Expand vector mod as vector div + multiply-subtract

2021-07-13 Thread ktkachov at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=101390

--- Comment #3 from ktkachov at gcc dot gnu.org ---
(In reply to Richard Biener from comment #2)
> scalar patterns are the appropriate way to do this

There may be parts of the compiler I'm not familiar with here, so apologies...
By scalar patterns do you mean something in match.pd?

[Bug tree-optimization/101390] New: Expand vector mod as vector div + multiply-subtract

2021-07-09 Thread ktkachov at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=101390

Bug ID: 101390
   Summary: Expand vector mod as vector div + multiply-subtract
   Product: gcc
   Version: 12.0
Status: UNCONFIRMED
  Keywords: missed-optimization
  Severity: normal
  Priority: P3
 Component: tree-optimization
  Assignee: unassigned at gcc dot gnu.org
  Reporter: ktkachov at gcc dot gnu.org
  Target Milestone: ---

When the target supports an sdiv/udiv pattern for vector modes we could
synthesise a vector modulus operation using the division and a
multiply-subtract operation.
#define N 128

extern signed int si_a[N], si_b[N], si_c[N];

void
test_si ()
{
  for (int i = 0; i < N; i++)
    si_c[i] = si_a[i] % si_b[i];
}

On AArch64, SVE (but not Neon) has vector SDIV/UDIV instructions, so GCC could
generate:
.L2:
        ld1w    z2.s, p0/z, [x4, x0, lsl 2]
        ld1w    z1.s, p0/z, [x3, x0, lsl 2]
        movprfx z0, z2
        sdiv    z0.s, p1/m, z0.s, z1.s
        msb     z0.s, p1/m, z1.s, z2.s
        st1w    z0.s, p0, [x1, x0, lsl 2]
        incw    x0
        whilelo p0.s, w0, w2
        b.any   .L2

This can be achieved by implementing the smod and umod optabs in the aarch64
backend for SVE, but this is a generic transformation, so it could be handled
more generally in vect_recog_divmod_pattern and/or the vector lowering code so
that more targets can benefit.
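
The scalar identity behind the transformation is a % b == a - (a / b) * b, which holds in C whenever a / b is representable. A minimal sketch in plain C follows; mod_ref and mod_div_msub are illustrative names only, not GCC internals, and the divisors are assumed to be non-zero.

```c
#include <stddef.h>

/* Reference: element-wise remainder using the % operator.  */
void mod_ref (const int *a, const int *b, int *c, size_t n)
{
  for (size_t i = 0; i < n; i++)
    c[i] = a[i] % b[i];
}

/* Same result from a division followed by a multiply-subtract,
   i.e. the sdiv + msb shape in the loop above.  */
void mod_div_msub (const int *a, const int *b, int *c, size_t n)
{
  for (size_t i = 0; i < n; i++)
    {
      int q = a[i] / b[i];      /* truncating division */
      c[i] = a[i] - q * b[i];   /* multiply-subtract */
    }
}
```

Because C division truncates towards zero, the two functions agree for negative operands as well.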

[Bug target/100441] [8/9 Regression] ICE in output_constant_pool_2, at varasm.c:3955

2021-05-07 Thread ktkachov at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=100441

--- Comment #12 from ktkachov at gcc dot gnu.org ---
Should be fixed on GCC 8 and 9 branches now?

[Bug tree-optimization/96974] [10/11 Regression] ICE in vect_get_vector_types_for_stmt compiling for SVE

2021-03-31 Thread ktkachov at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=96974

--- Comment #13 from ktkachov at gcc dot gnu.org ---
Fixed now?

[Bug target/99820] aarch64: ICE (segfault) in aarch64_analyze_loop_vinfo with -moverride=tune=use_new_vector_costs

2021-03-30 Thread ktkachov at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=99820

ktkachov at gcc dot gnu.org changed:

   What|Removed |Added

 CC||ktkachov at gcc dot gnu.org
 Status|UNCONFIRMED |RESOLVED
 Resolution|--- |FIXED

--- Comment #2 from ktkachov at gcc dot gnu.org ---
Fixed.

[Bug target/99822] [11 Regression] Assembler messages: Error: integer register expected in the extended/shifted operand register at operand 3 -- `adds x1,xzr,#2'

2021-03-30 Thread ktkachov at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=99822

ktkachov at gcc dot gnu.org changed:

   What|Removed |Added

 Status|ASSIGNED|RESOLVED
 Resolution|--- |FIXED

--- Comment #4 from ktkachov at gcc dot gnu.org ---
Fixed.
