[Bug libgomp/114765] linking to libgomp and setting CPU_PROC_BIND causes affinity reset

2024-04-18 Thread amonakov at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=114765

Alexander Monakov  changed:

   What|Removed |Added

 CC||amonakov at gcc dot gnu.org

--- Comment #5 from Alexander Monakov  ---
Can libgomp defer changing affinity of the initial thread to the launch of the
first parallel region (i.e. change it only at thread pool initialization,
together with new threads)?

[Bug c++/114480] g++: internal compiler error: Segmentation fault signal terminated program cc1plus

2024-04-05 Thread amonakov at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=114480

--- Comment #21 from Alexander Monakov  ---
It is possible to reduce gcc_qsort workload by improving the presorted-ness of
the array, but of course avoiding quadratic behavior would be much better.

With the following change, we go from

   261,250,628,954  cycles:u
   533,040,964,437  instructions:u   # 2.04  insn per cycle
   114,415,857,214  branches:u
       395,327,966  branch-misses:u  # 0.35% of all branches

to

   256,620,517,403  cycles:u
   526,337,243,809  instructions:u   # 2.05  insn per cycle
   113,447,583,099  branches:u
       383,121,251  branch-misses:u  # 0.34% of all branches

diff --git a/gcc/tree-into-ssa.cc b/gcc/tree-into-ssa.cc
index d12a4a97f6..621793f7f4 100644
--- a/gcc/tree-into-ssa.cc
+++ b/gcc/tree-into-ssa.cc
@@ -805,21 +805,22 @@ prune_unused_phi_nodes (bitmap phis, bitmap kills, bitmap uses)
      locate the nearest dominating def in logarithmic time by binary search.  */
   bitmap_ior (to_remove, kills, phis);
   n_defs = bitmap_count_bits (to_remove);
+  adef = 2 * n_defs + 1;
   defs = XNEWVEC (struct dom_dfsnum, 2 * n_defs + 1);
   defs[0].bb_index = 1;
   defs[0].dfs_num = 0;
-  adef = 1;
+  struct dom_dfsnum *head = defs + 1, *tail = defs + adef;
   EXECUTE_IF_SET_IN_BITMAP (to_remove, 0, i, bi)
 {
   def_bb = BASIC_BLOCK_FOR_FN (cfun, i);
-  defs[adef].bb_index = i;
-  defs[adef].dfs_num = bb_dom_dfs_in (CDI_DOMINATORS, def_bb);
-  defs[adef + 1].bb_index = i;
-  defs[adef + 1].dfs_num = bb_dom_dfs_out (CDI_DOMINATORS, def_bb);
-  adef += 2;
+  head->bb_index = i;
+  head->dfs_num = bb_dom_dfs_in (CDI_DOMINATORS, def_bb);
+  head++, tail--;
+  tail->bb_index = i;
+  tail->dfs_num = bb_dom_dfs_out (CDI_DOMINATORS, def_bb);
 }
+  gcc_assert (head == tail);
   BITMAP_FREE (to_remove);
-  gcc_assert (adef == 2 * n_defs + 1);
   qsort (defs, adef, sizeof (struct dom_dfsnum), cmp_dfsnum);
   gcc_assert (defs[0].bb_index == 1);

[Bug c++/114480] g++: internal compiler error: Segmentation fault signal terminated program cc1plus

2024-04-04 Thread amonakov at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=114480

--- Comment #20 from Alexander Monakov  ---
(note that if you uninclude the testcase and compile with -fno-exceptions it's
much faster)

On the smaller testcase from comment 14, prune_unused_phi_nodes invokes
gcc_qsort 53386 times. There are two distinct phases.

In the first phase, the count of struct dom_dfsnum to sort grows in a roughly
linear fashion up to 23437 on the 12294th invocation. Hence this first phase
is quadratic in the overall number of processed dom_dfsnum-s.

In the second phase, it almost always sorts exactly 7 elements for the
remaining ~41000 invocations.

The number of pairwise comparisons performed by gcc_qsort is approximately
(n+1)*(log_2(n)-1), which results in 1.8e9 comparisons overall for the 53386
invocations. perf shows 10.9e9 cycles spent under gcc_qsort, i.e. 6 cycles per
comparison, which looks about right. It's possible to reduce that further by
switching from classic to bidirectional merge, and using cmovs instead of
bitwise arithmetic for branchless selects.

> I'll note the swapping of 8 bytes is a bit odd and it seems to be
> if-converted, thus always doing a write.

That is not a swap. That's the merge step of a mergesort: we are taking the
smaller element from the heads of two arrays and moving it to the tail of a
third array.

Basically there's quadratic behavior in tree-into-ssa, gcc_qsort shows
relatively higher on the profile because the log_2(n) factor becomes
noticeable.

Hope that helps!

[Bug lto/114337] LTO symbol table doesn't include builtin functions

2024-03-25 Thread amonakov at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=114337

Alexander Monakov  changed:

   What|Removed |Added

 CC||amonakov at gcc dot gnu.org

--- Comment #5 from Alexander Monakov  ---
Gold handles such rescanning correctly. BFD ld regressed in 2.27, this
bugreport contains references to previous discussions about rescanning:
https://sourceware.org/bugzilla/show_bug.cgi?id=23935

(in the above bug there's a patch for ld.bfd that seemingly went nowhere)

[Bug target/108866] Allow to pass Windows resource file (.rc) as input to gcc

2024-03-14 Thread amonakov at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=108866

Alexander Monakov  changed:

   What|Removed |Added

 CC||amonakov at gcc dot gnu.org

--- Comment #4 from Alexander Monakov  ---
It's possible to tinker with spec strings without rebuilding the compiler, see
this blog post by Geoff Wozniak (which supplies links to formal docs):
https://wozniak.ca/blog/2024/01/02/1/index.html

[Bug rtl-optimization/114261] [13/14 Regression] Scheduling takes excessive time (97%) since r13-5154-g733a1b777f1

2024-03-13 Thread amonakov at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=114261

--- Comment #10 from Alexander Monakov  ---
Indeed, but OTOH according to bug 84402 comment 58 it caused a noticeable hit
on gimple-match.cc compilation:

733a1b777f16cd397b43a242d9c31761f66d3da8 13th January 2023
sched-deps: do not schedule pseudos across calls [PR108117] (Alexander Monakov)
Stage 2: +14%
Stage 3: +9%


In any case, if the proposed band-aid is unnecessary, that's fine with me.

[Bug rtl-optimization/114261] [13/14 Regression] Scheduling takes excessive time (97%) since r13-5154-g733a1b777f1

2024-03-13 Thread amonakov at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=114261

--- Comment #8 from Alexander Monakov  ---
If we want to get rid of the compilation time regression sooner rather than
later, I can suggest limiting my change only to functions that call setjmp:

diff --git a/gcc/sched-deps.cc b/gcc/sched-deps.cc
index c23218890f..ae23f55274 100644
--- a/gcc/sched-deps.cc
+++ b/gcc/sched-deps.cc
@@ -3695,7 +3695,7 @@ deps_analyze_insn (class deps_desc *deps, rtx_insn *insn)

   CANT_MOVE (insn) = 1;

-  if (!reload_completed)
+  if (!reload_completed && cfun->calls_setjmp)
{
   /* Scheduling across calls may increase register pressure by extending
      live ranges of pseudos over the call.  Worse, in presence of setjmp


That way we retain the "correctness fix" part of r13-5154-g733a1b777f1 and keep
the previous status quo on normal functions (quadraticness on asms like
demonstrated in comment #5 would also remain).

[Bug rtl-optimization/114261] [13/14 Regression] Scheduling takes excessive time (97%)

2024-03-11 Thread amonakov at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=114261

Alexander Monakov  changed:

   What|Removed |Added

 CC||mkuvyrkov at gcc dot gnu.org

--- Comment #5 from Alexander Monakov  ---
It appears sched-deps is O(N*M) given N reg_pending_barriers and M distinct
pseudos in a region (or even a basic block). For instance, on the following
testcase

#define x10(x) x x x x x x x x x x
#define x100(x) x10(x10(x))
#define x1(x) x100(x100(x))

void f(int);

void g(int *p)
{
#if 1
x1(f(*p++);)
#else
x1(asm("" :: "r"(*p++));)
#endif
}

gcc -O -fschedule-insns invokes add_dependence 2 times for each asm/call
after the first. There is a loop

  for (i = 0; i < (unsigned) deps->max_reg; i++)
    {
      struct deps_reg *reg_last = &deps->reg_last[i];
      reg_last->sets = alloc_INSN_LIST (insn, reg_last->sets);
      SET_REGNO_REG_SET (&deps->reg_last_in_use, i);
    }

that registers the insn with reg_pending_barrier != 0 in reg_last->sets of each
pseudo, and then all those reg_last->sets will be inspected on the next
reg_pending_barrier insn.

[Bug rtl-optimization/114261] [13/14 Regression] Scheduling takes excessive time (97%)

2024-03-06 Thread amonakov at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=114261

--- Comment #3 from Alexander Monakov  ---
The first attachment is empty (perhaps you made a non-recursive archive when
you meant to recursively zip a directory).

[Bug c++/66487] sanitizer/warnings for lifetime DSE

2024-02-28 Thread amonakov at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=66487

--- Comment #28 from Alexander Monakov  ---
The bug is about the lack of diagnostics; it should be fine to make
note of various approaches to remedy the problem in one bug report.

(in any case, all discussion of the Valgrind-based approach happened on the
gcc-patches mailing list, not here)

[Bug rtl-optimization/113903] sched1 should schedule across EBBS

2024-02-13 Thread amonakov at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=113903

Alexander Monakov  changed:

   What|Removed |Added

 CC||amonakov at gcc dot gnu.org

--- Comment #1 from Alexander Monakov  ---
Lifting those insns from the L8 BB to the L10 BB requires duplicating them on
all incoming edges targeting L8, doesn't it?

Why is decreasing live ranges important here?

[Bug ipa/113890] -fdump-tree-modref ICE with _BitInt

2024-02-12 Thread amonakov at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=113890

Alexander Monakov  changed:

   What|Removed |Added

 CC||amonakov at gcc dot gnu.org

--- Comment #2 from Alexander Monakov  ---
Earlier reported (with a less straightforward testcase) in PR 106783, which is
tracked as a regression since gcc-12.

[Bug target/113560] Strange code generated when optimizing a multiplication on x86_64

2024-01-24 Thread amonakov at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=113560

Alexander Monakov  changed:

   What|Removed |Added

 CC||amonakov at gcc dot gnu.org

--- Comment #3 from Alexander Monakov  ---
(In reply to Roger Sayle from comment #2)
> The costs look sane, and I'd expect the synth_mult generated sequence to be
> faster, though it would be good to get some microbenchmarking.
> A reduced test case is:
> __int128 foo(__int128 x) { return x*100; }

This is not an equivalent testcase, mulx is a widening multiply from 64-bit
source operands. It has latency 3 or 4 on most implementations. Costing it as a
synthesized general 128-bit multiplication is wrong.

[Bug ipa/113293] Incorrect code after inlining function containing extended asm

2024-01-09 Thread amonakov at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=113293

Alexander Monakov  changed:

   What|Removed |Added

 CC||amonakov at gcc dot gnu.org

--- Comment #3 from Alexander Monakov  ---
(In reply to KBDeveloper from comment #2)
> 
> Ah, that makes sense. I had assumed that taking the address of arg would
> force gcc to store it in memory somewhere. 
> Is there a reason why gcc then allocates 8 bytes on the stack and fills r1
> with sp - #7? Or is what I had just UB and gcc can do whatever?

The compiler allocates stack memory for 'arg' and passes the address of 'arg'
to the asm; it is necessary in case the asm does something with it without
reading 'arg' itself. One example would be installing a hardware watchpoint on
that memory location.

[Bug rtl-optimization/113280] Strange error for empty inline assembly with +X constraint

2024-01-09 Thread amonakov at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=113280

Alexander Monakov  changed:

   What|Removed |Added

 CC||amonakov at gcc dot gnu.org

--- Comment #6 from Alexander Monakov  ---
From the context given in the gcc-help thread, the goal is to place an
optimization barrier in a sequence of floating-point calculations. "+r" is
inappropriate for floats, as it usually incurs a reload from a floating-point
register to a GPR and back, and there's no universal constraint for FP regs
(e.g. on amd64 it is "+x" for SSE registers, but "+t" for long double (or x87
FPU on 32-bit x86)).

[Bug libstdc++/113159] More robust std::sort for silly comparator functions

2023-12-28 Thread amonakov at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=113159

Alexander Monakov  changed:

   What|Removed |Added

 CC||amonakov at gcc dot gnu.org

--- Comment #3 from Alexander Monakov  ---
Can you outline what the merge criteria for such an enhancement would look like?

If it comes at a cost of increased code complexity and runtime overhead, what
sacrifice is acceptable in the name of increased robustness?

[Bug middle-end/113082] builtin transforms do not honor errno

2023-12-19 Thread amonakov at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=113082

Alexander Monakov  changed:

   What|Removed |Added

 CC||amonakov at gcc dot gnu.org

--- Comment #4 from Alexander Monakov  ---
re. comment #3, you'd need to be careful to avoid miscompiling

#include <stdlib.h>

int f(size_t sz, void **out, int *eptr)
{
int e = *eptr;
*out = malloc(sz);
return *eptr - e;
}

to asm that unconditionally returns 0, because that changes the outcome for

  void *p;
  errno = 0;
  f(SIZE_MAX, &p, &errno);

IOW, I'm not sure how you can go beyond TBAA since user code can pass around
the address of errno in a plain 'int *' anyway.


re. comment #2, Glibc has

* lazy PLT resolver calling back into the dynamic linker
* LD_AUDIT callbacks
* LD_PROFILE hooks
* IFUNC resolvers

and you'd have to guarantee they won't clobber errno either. For lazy PLT and
LD_PROFILE it is necessary anyway (otherwise it's a Glibc bug), but audit and
ifunc callbacks are provided by the user, not Glibc, and might accidentally
clobber errno.

[Bug c/44179] warn about sizeof(char) and sizeof('x')

2023-12-16 Thread amonakov at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=44179

Alexander Monakov  changed:

   What|Removed |Added

 CC||amonakov at gcc dot gnu.org

--- Comment #5 from Alexander Monakov  ---
Warning when sizeof 'x' appears as a term in an addition/subtraction would
catch the misuses while leaving instances like assert(sizeof 'x' == 4) be.

[Bug middle-end/112697] [14 Regression] 30-40% exec time regression of 433.milc on zen2 since r14-4972-g8aa47713701b1f

2023-12-01 Thread amonakov at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=112697

--- Comment #9 from Alexander Monakov  ---
... as does inserting a nop before the compare ¯\_(ツ)_/¯


--- d.out.ltrans0.ltrans.slow.s 2023-12-01 18:32:54.255841611 +0300
+++ d.out.ltrans0.ltrans.s  2023-12-01 18:53:04.909438690 +0300
@@ -743,6 +743,7 @@ add_force_to_mom:
.p2align 4,,10
.p2align 3
 .L58:
+   nop
cmpb$1, -680(%r11,%r12)
movapd  %xmm5, %xmm7
jne .L54

[Bug middle-end/112697] [14 Regression] 30-40% exec time regression of 433.milc on zen2 since r14-4972-g8aa47713701b1f

2023-12-01 Thread amonakov at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=112697

--- Comment #8 from Alexander Monakov  ---
Thanks, I can reproduce it. It is pretty tricky though. For instance, just
swapping the mov and the compare is enough to make it fast:

--- d.out.ltrans0.ltrans.slow.s 2023-12-01 18:32:54.255841611 +0300
+++ d.out.ltrans0.ltrans.fast.s 2023-12-01 18:32:20.318668991 +0300
@@ -743,8 +743,8 @@ add_force_to_mom:
.p2align 4,,10
.p2align 3
 .L58:
-   cmpb$1, -680(%r11,%r12)
movapd  %xmm5, %xmm7
+   cmpb$1, -680(%r11,%r12)
jne .L54
xorpd   %xmm6, %xmm7
 .L54:

[Bug middle-end/112697] [14 Regression] 30-40% exec time regression of 433.milc on zen2 since r14-4972-g8aa47713701b1f

2023-11-27 Thread amonakov at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=112697

Alexander Monakov  changed:

   What|Removed |Added

 CC||amonakov at gcc dot gnu.org

--- Comment #5 from Alexander Monakov  ---
Martin, if you still have the binaries, would you mind sharing perf profiles?
You can produce plain-text reports with 'perf report --stdio' and 'perf
annotate --stdio'.

[Bug target/111107] i686-w64-mingw32 does not realign stack when __attribute__((aligned)) or __attribute__((vector_size)) are used

2023-11-25 Thread amonakov at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=111107

Alexander Monakov  changed:

   What|Removed |Added

 CC||amonakov at gcc dot gnu.org

--- Comment #9 from Alexander Monakov  ---
-mpreferred-stack-boundary=n means that functions consume stack in increments
of 2**n. This is sufficient to ensure that once the stack is aligned to that
boundary, it stays aligned without the need for dynamic realignment.

-mincoming-stack-boundary specifies the guaranteed alignment on entry. If the
function needs to place local variables with a greater alignment requirement on
the stack, it has to perform dynamic realignment.

[Bug preprocessor/112701] New: wrong type inference for ternary operator in preprocessing context

2023-11-24 Thread amonakov at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=112701

Bug ID: 112701
   Summary: wrong type inference for ternary operator in
preprocessing context
   Product: gcc
   Version: 13.0
Status: UNCONFIRMED
  Severity: normal
  Priority: P3
 Component: preprocessor
  Assignee: unassigned at gcc dot gnu.org
  Reporter: amonakov at gcc dot gnu.org
  Target Milestone: ---

In the following snippet, the result of the ternary operator is (-1, cast to an
unsigned type), so the comparison yields false, and both conditional inclusions
must come out empty:

#if (0 ? 0u : -1) < 0
int foo = (0 ? 0u : -1) < 0;
#endif

#if (0 ? 0/0u : -1) < 0
int bar = (0 ? 0/0u : -1) < 0;
#endif

However, GCC emits:

bar:
.zero   4

So clearly the evaluation of the second expression is inconsistent between
preprocessing context (where it incorrectly yields 1) vs. initializer context
(where it is zero as it should be, as seen from the resulting asm).

[Bug c/112699] Should limits.h in freestanding environment be self-contained?

2023-11-24 Thread amonakov at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=112699

--- Comment #2 from Alexander Monakov  ---
Sorry, even though GCC's limits.h is installed under include-fixed, it is
generated separately, not by the generic fixincludes mechanism. I was confused.

[Bug c/112699] Should limits.h in freestanding environment be self-contained?

2023-11-24 Thread amonakov at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=112699

Alexander Monakov  changed:

   What|Removed |Added

 CC||amonakov at gcc dot gnu.org

--- Comment #1 from Alexander Monakov  ---
Can you clarify which file you mean? gcc/ginclude does not have a limits.h.

I assume you are not talking about the fixinclude'd limits.h?

[Bug middle-end/111655] [11/12/13/14 Regression] wrong code generated for __builtin_signbit and 0./0. on x86-64 -O2

2023-11-24 Thread amonakov at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=111655

--- Comment #13 from Alexander Monakov  ---
> Then there is the MULT_EXPR x * x case

This is PR 111701.

It would be nice to clarify what "nonnegative" means in the contracts of this
family of functions, because it's ambiguous for NaNs and negative zeros (x < 0
is false while signbit is set, and x >= 0 is also false for positive NaNs).

[Bug rtl-optimization/110307] ICE in move_insn, at haifa-sched.cc:5473 when building Ruby on alpha with -fPIC -O2 (or -fpeephole2 -fschedule-insns2)

2023-11-11 Thread amonakov at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=110307

Alexander Monakov  changed:

   What|Removed |Added

 CC||uros at gcc dot gnu.org

--- Comment #15 from Alexander Monakov  ---
It did not bring enlightenment. It looks like INT_MIN REG_EH_REGION annotating
a call that *does not* perform a non-local goto was a late addition, breaking
the assumption "EH_REGION notes may appear only on insns that may throw
exceptions", and now a few places in the compiler look as if they may forget to
preserve the special INT_MIN REG_EH_REGION note.

Uros, would you mind reading the discussion in this bug? Do you have
suggestions how to proceed here?

[Bug target/82242] IRA spills allocno in loop body if it crosses throwing call outside the loop

2023-11-10 Thread amonakov at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=82242

--- Comment #5 from Alexander Monakov  ---
The small testcase from comment 3 is now improved on trunk, possibly thanks to
work in PR 110215.

[Bug c/112367] wrong rounding of sum of floating-point constants

2023-11-03 Thread amonakov at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=112367

Alexander Monakov  changed:

   What|Removed |Added

 CC||amonakov at gcc dot gnu.org

--- Comment #5 from Alexander Monakov  ---
As far as I can tell this was broken for all targets before gcc-12, and fixed
for all targets starting from gcc-12.

Paul, can this bug be closed?

[Bug c++/66487] sanitizer/warnings for lifetime DSE

2023-10-30 Thread amonakov at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=66487

Alexander Monakov  changed:

   What|Removed |Added

 CC||amonakov at gcc dot gnu.org

--- Comment #26 from Alexander Monakov  ---
RFC patch for detecting lifetime-dse issues via Valgrind (rather than MSan):
https://inbox.sourceware.org/gcc-patches/20231024141124.210708-1-exactl...@ispras.ru/

[Bug c/111884] New: unsigned char no longer aliases anything under -std=c2x

2023-10-19 Thread amonakov at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=111884

Bug ID: 111884
   Summary: unsigned char no longer aliases anything under
-std=c2x
   Product: gcc
   Version: 13.2.1
Status: UNCONFIRMED
  Keywords: wrong-code
  Severity: normal
  Priority: P3
 Component: c
  Assignee: unassigned at gcc dot gnu.org
  Reporter: amonakov at gcc dot gnu.org
  Target Milestone: ---

int f(int i)
{
int f = 1;
return i[(unsigned char *)&f];
}
int g(int i)
{
int f = 1;
return i[(signed char *)&f];
}
int h(int i)
{
int f = 1;
return i[(char *)&f];
}


gcc -O2 -std=c2x compiles 'f' as though inspecting representation via an
'unsigned char *' is not valid (with a confusing warning under -Wall).

[Bug target/111768] X86: -march=native does not support alder lake big.little cache infor correctly

2023-10-12 Thread amonakov at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=111768

--- Comment #11 from Alexander Monakov  ---
(In reply to Hongtao.liu from comment #10)
> > indeed (but I believe it did happen with Alder Lake already, by accident,
> > with AVX512 on P-cores but not on E-cores).
> 
> AVX512 is physically fused off for Alderlake P-core, P-core and E-core share
> the same ISA level(AVX2).

I think Arsen means initial Alder Lake batches, where AVX-512 wasn't yet fused
off (but BIOS support was unofficial/experimental anyway).

[Bug target/111768] X86: -march=native does not support alder lake big.little cache infor correctly

2023-10-12 Thread amonakov at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=111768

--- Comment #9 from Alexander Monakov  ---
(In reply to Arsen Arsenović from comment #8)
> indeed (but I believe it did happen with Alder Lake already, by accident,
> with AVX512 on P-cores but not on E-cores).

AFAIK on those Alder Lake CPUs you could only get AVX-512 by disabling E-cores
in the BIOS, so you couldn't boot in a configuration when both E-cores are
available and AVX-512 on P-cores is available.

[Bug target/111768] X86: -march=native does not support alder lake big.little cache infor correctly

2023-10-12 Thread amonakov at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=111768

Alexander Monakov  changed:

   What|Removed |Added

 CC||amonakov at gcc dot gnu.org

--- Comment #7 from Alexander Monakov  ---
I'm afraid hybrid CPUs with varying ISA feature sets are not practical for the
current ecosystem: you wouldn't be able to reschedule from a higher- to
lower-capable core. Not to mention scenarios like Mesa on-disk llvmpipe shader
cache.

"Always" probing all cores is not a good idea (the compiler would have to
manually reschedule itself to all cores, of which there could be hundreds).
Plus, a portable API for such probing across available cores does not exist
afaik.

I think releasing an x86 hybrid CPU with varying capabilities across cores
would require substantial preparatory work in the kernel and likely in the
userland as well, so probably best to leave it until the time comes and
specifics of what can differ are known.

[Bug target/111768] X86: -march=native does not support alder lake big.little cache infor correctly

2023-10-11 Thread amonakov at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=111768

--- Comment #5 from Alexander Monakov  ---
I think it's similar to attempting -march=native under distcc, which is already
warned about on Gentoo wiki: https://wiki.gentoo.org/wiki/Distcc

The difference here is that Intel so far decided to make ISA feature set the
same between 'performance' and 'power-efficient' cores, so the differences for
-march=native detection are minimal.

Intel also added a cpuid bit for hybrid CPUs, so in principle native arch
detection could inspect that bit and then override l1-cache-size to 32 KiB
(having the exact size in the param is not important, specifying a lower value
is ok), or just drop it and let cc1 fall back to the default value (64) from
params.opt.

Short term, I would advise users to add --param=l1-cache-size=32 after
-march=native in CFLAGS.

[Bug tree-optimization/111694] [13/14 Regression] Wrong behavior for signbit of negative zero when optimizing

2023-10-09 Thread amonakov at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=111694

--- Comment #7 from Alexander Monakov  ---
No backport for gcc-13 planned?

[Bug sanitizer/111736] Address sanitizer is not compatible with named address spaces

2023-10-09 Thread amonakov at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=111736

--- Comment #3 from Alexander Monakov  ---
Sorry, the second half of my comment is confusing. To clarify, ASan works fine
for TLS data (the compiler knows that TLS base is at fs:0; libsanitizer uses
some hacks to initialize shadow for TLS anyway, so it seems explicit
registration is not needed).

The difference is, &x produces an address in the generic address space by using
the knowledge that fs:0 stores the segment base. For __seg_{fs,gs} that can't
be done, and &x is the offset w.r.t. the segment base.

[Bug sanitizer/111736] Address sanitizer is not compatible with named address spaces

2023-10-09 Thread amonakov at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=111736

Alexander Monakov  changed:

   What|Removed |Added

 CC||amonakov at gcc dot gnu.org

--- Comment #2 from Alexander Monakov  ---
Looks that way — even though __seg_gs AS is a subset of the generic AS, the
compiler has no way to find out the base of __seg_gs.

We already skip registering TLS data with ASan:

__thread int x;

int foo (void)
{
  return x;
}

[Bug ipa/111643] __attribute__((flatten)) with -O1 runs out of memory (killed cc1)

2023-10-06 Thread amonakov at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=111643

Alexander Monakov  changed:

   What|Removed |Added

 CC||amonakov at gcc dot gnu.org

--- Comment #10 from Alexander Monakov  ---
(In reply to Lukas Grätz from comment #9)
> I also wondered whether
> 
> int bar_alias (void) { return bar_original(); }
> 
> could be a portable alternative to attribute alias. Except that current GCC
> does not translate it that way.

That's because function addresses are significant and so

  &bar_alias == &bar_original

must evaluate to false, but would be true for aliases.

In theory compilers could do better by introducing fall-through aliases:
https://gcc.gnu.org/wiki/cauldron2019talks?action=AttachFile=view=fallthrough-aliases.pdf

[Bug middle-end/111701] New: [11/12/13/14 Regression] wrong code for __builtin_signbit(x*x)

2023-10-05 Thread amonakov at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=111701

Bug ID: 111701
   Summary: [11/12/13/14 Regression] wrong code for
__builtin_signbit(x*x)
   Product: gcc
   Version: 13.2.1
Status: UNCONFIRMED
  Keywords: wrong-code
  Severity: normal
  Priority: P3
 Component: middle-end
  Assignee: unassigned at gcc dot gnu.org
  Reporter: amonakov at gcc dot gnu.org
CC: amonakov at gcc dot gnu.org, eggert at cs dot ucla.edu,
rguenth at gcc dot gnu.org, unassigned at gcc dot gnu.org
Depends on: 111655
  Target Milestone: ---
Target: x86_64-linux-gnu

+++ This bug was initially created as a clone of Bug #111655 +++

See bug 111655 comment 11: we incorrectly deduce nonnegative_p for
floating-point 'x * x', and the following aborts:

__attribute__((noipa))
static int f(float *x)
{
*x *= *x;
return __builtin_signbit(*x);
}

int main()
{
float x = -__builtin_nan("");
int s = f();
if (s != __builtin_signbit(x))
__builtin_abort();
}


Referenced Bugs:

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=111655
[Bug 111655] [11/12/13/14 Regression] wrong code generated for
__builtin_signbit and 0./0. on x86-64 -O2

[Bug tree-optimization/111694] [13/14 Regression] Wrong behavior for signbit of negative zero when optimizing

2023-10-04 Thread amonakov at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=111694

Alexander Monakov  changed:

   What|Removed |Added

 CC||amonakov at gcc dot gnu.org
  Component|web |tree-optimization
Summary|Wrong behavior for signbit  |[13/14 Regression] Wrong
   |of negative zero when   |behavior for signbit of
   |optimizing  |negative zero when
   ||optimizing
   Keywords||wrong-code

--- Comment #1 from Alexander Monakov  ---
Reduced:

#define signbit(x) __builtin_signbit(x)

static void test(double l, double r)
{
  if (l == r && (signbit(l) || signbit(r)))
;
  else
__builtin_abort();
}

int main()
{
  test(0.0, -0.0);
}

[Bug middle-end/111683] [11/12/13/14 Regression] Incorrect answer when using SSE2 intrinsics with -O3

2023-10-04 Thread amonakov at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=111683

Alexander Monakov  changed:

   What|Removed |Added

 CC||amonakov at gcc dot gnu.org

--- Comment #3 from Alexander Monakov  ---
This is predcom. It's easier to see what's going wrong with

#pragma GCC unroll 99

added to the innermost loop so it's unrolled at -O2 and comparing

g++ -O2 -fpredictive-commoning --param=max-unroll-times=1

vs. plain g++ -O2 (or diffing 'pcom' dump against the preceding pass).

[Bug middle-end/111655] [11/12/13/14 Regression] wrong code generated for __builtin_signbit and 0./0. on x86-64 -O2

2023-10-04 Thread amonakov at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=111655

--- Comment #11 from Alexander Monakov  ---
(In reply to Richard Biener from comment #10)
> And this conservatively has to apply to all FP divisions where we might infer
> "nonnegative" unless we can also infer !zerop?

Yes, I think the logic in tree_binary_nonnegative_warnv_p is incorrect for
floating-point division. Likewise for multiplication: it returns true for 'x *
x', but when x is a NaN, 'x * x' is also a NaN (potentially with the same
sign).


> On the side of replacing all uses I'd error on simply not folding.

Yes, as preceding transforms might have duplicated the division already. We can
only do such folding very early, when we are sure no duplication might have
taken place.


> Note 6.5.5/6 says "In both operations, if the value of the second operand is
> zero, the behavior is undefined." only remotely implying this doesn't
> apply to non-integer types (remotely by including modulo behavior in this
> sentence).
> 
> Possibly in some other place the C standard makes FP division by zero subject
> to other rules.

I think the intention is that Annex F makes it follow IEEE rules (returns an
Inf/NaN and sets FE_DIVBYZERO/FE_INVALID), but it doesn't seem to be clearly
worded, afaict.

[Bug middle-end/51446] -fno-trapping-math generates NaN constant with different sign

2023-10-02 Thread amonakov at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=51446

Alexander Monakov  changed:

   What|Removed |Added

 CC||amonakov at gcc dot gnu.org

--- Comment #21 from Alexander Monakov  ---
Bug 111655 is not a dup, I left a comment and reopened.

[Bug target/111655] [11/12/13/14 Regression] wrong code generated for __builtin_signbit and 0./0. on x86-64 -O2

2023-10-02 Thread amonakov at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=111655

Alexander Monakov  changed:

   What|Removed |Added

 CC||amonakov at gcc dot gnu.org
 Ever confirmed|0   |1
Summary|wrong code generated for|[11/12/13/14 Regression]
   |__builtin_signbit and 0./0. |wrong code generated for
   |on x86-64 -O2   |__builtin_signbit and 0./0.
   ||on x86-64 -O2
 Resolution|DUPLICATE   |---
 Status|RESOLVED|NEW
   Last reconfirmed||2023-10-02

--- Comment #9 from Alexander Monakov  ---
It's true that the sign of 0./0 is unpredictable, but we can fold it only when
the division is being eliminated by the folding. 

It's fine to fold

  t = 0./0;
  s = __builtin_signbit(t);

to

  s = 0

with t eliminated from IR, but it's not OK to fold

  t = 0./0
  s = __builtin_signbit(t);

to

  t = 0./0
  s = 0

because the resulting program runs as if 0./0 was evaluated twice, first to a
positive NaN (which was used for signbit), then to a negative NaN (which fed
the following computations). This is not allowed.

This bug was incorrectly classified as a dup. The fix is either to not fold
this, or fold only when we know that the division will be eliminated (e.g. the
only use was in the signbit). Reopening.

[Bug c/111210] Wrong code at -Os on x86_64-linux-gnu since r12-4849-gf19791565d7

2023-08-28 Thread amonakov at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=111210

--- Comment #4 from Alexander Monakov  ---
The testcase is small enough to notice the issue by inspection.

Note that you get the "expected" answer with -fno-strict-aliasing, and as
explained in https://gcc.gnu.org/bugs/ it is one of the things you should check
when submitting a bug report:

Before reporting that GCC compiles your code incorrectly, compile it with gcc
-Wall -Wextra and see whether this shows anything wrong with your code.
Similarly, if compiling with -fno-strict-aliasing -fwrapv
-fno-aggressive-loop-optimizations makes a difference, or if compiling with
-fsanitize=undefined produces any run-time errors, then your code is probably
not correct.

[Bug c/111210] Wrong code at -Os on x86_64-linux-gnu since r12-4849-gf19791565d7

2023-08-28 Thread amonakov at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=111210

Alexander Monakov  changed:

   What|Removed |Added

 Resolution|--- |INVALID
 Status|UNCONFIRMED |RESOLVED
 CC||amonakov at gcc dot gnu.org

--- Comment #1 from Alexander Monakov  ---
'c' is called with 'd' pointing to 'long e[2]', so

  return *(int *)(d + 1);

is an aliasing violation (dereferencing a pointer to an incompatible type).

[Bug rtl-optimization/111143] [missed optimization] unlikely code slows down diffutils x86-64 ASCII processing

2023-08-26 Thread amonakov at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=111143

--- Comment #6 from Alexander Monakov  ---
Thanks.

i5-1335U has two "performance cores" (with HT, four logical CPUs) and eight
"efficiency cores". They have different micro-architecture. Are you binding the
benchmark to some core in particular?

On the "performance cores", 'add rbx, 1' can be eliminated ("executed" with
zero latency), this optimization appeared in the Alder Lake generation with the
"Golden Cove" uarch and was found by Andreas Abel. There are limitations (e.g.
it works for 64-bit additions but not 32-bit, the addend must be an immediate
less than 1024).

Of course, it is better to have 'add rbx, 1' instead of 'add rbx, rax' in this
loop on any CPU ('mov eax, 1' competes for ALU ports with other instructions,
so when it's delayed due to contention the dependent 'add rbx, rax; movsx rax,
[rbx]' get delayed too), but ascribing the difference to compiler scheduling on
a CPU that does out-of-order dynamic scheduling is strange.

[Bug rtl-optimization/111143] [missed optimization] unlikely code slows down diffutils x86-64 ASCII processing

2023-08-25 Thread amonakov at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=111143

Alexander Monakov  changed:

   What|Removed |Added

 CC||amonakov at gcc dot gnu.org

--- Comment #4 from Alexander Monakov  ---
(In reply to Paul Eggert from comment #0)
> The "movl $1, %eax" immediately followed by "addq %rax, %rbx" is poorly
> scheduled; the resulting dependency makes the code run quite a bit slower
> than it should. Replacing it with "addq $1, %rbx" and readjusting the
> surrounding code accordingly, as is done in the attached file
> code-mcel-opt.s, causes the benchmark to run 38% faster on my laptop's Intel
> i5-1335U.

This is a mischaracterization. The modified loop has one uop less, because you
are replacing 'mov eax, 1; add rbx, rax' with 'add rbx, 1'.

To evaluate scheduling aspect, keep 'mov eax, 1' while changing 'add rbx, rax'
to 'add rbx, 1'.

There are two separate loop-carried data dependencies, both one cycle per
iteration (addition chains over r12 and rbx).

[Bug rtl-optimization/111101] -finline-small-functions may invert FP arguments breaking FP bit accuracy in case of NaNs

2023-08-22 Thread amonakov at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=111101

Alexander Monakov  changed:

   What|Removed |Added

 Status|UNCONFIRMED |RESOLVED
 Resolution|--- |INVALID
 CC||amonakov at gcc dot gnu.org

--- Comment #1 from Alexander Monakov  ---
0x7fe5ed65 is a quiet NaN, not signaling (it differs from the input 0x7fa5ed65
sNaN by the leading mantissa bit 0x00400000).

IEEE-754 does not pin down which of the two payloads should be propagated when
both operands are NaNs, and neither do language standards, so for GCC
floating-point addition and similar operations are commutative.

Observed NaN payloads are not predictable and may change depending on
optimization level, choice of x87 vs. SSE instructions, etc. This is not a bug.

[Bug middle-end/111009] [12/13/14 regression] -fno-strict-overflow erroneously elides null pointer checks and causes SIGSEGV on perf from linux-6.4.10

2023-08-14 Thread amonakov at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=111009

Alexander Monakov  changed:

   What|Removed |Added

 CC||amonakov at gcc dot gnu.org

--- Comment #1 from Alexander Monakov  ---
Triggered by GIMPLE loop invariant motion lifting

  a_9 = _8(D)->maj;

across a (dso != NULL) test.

[Bug target/110979] Miss-optimization for O2 fully masked loop on floating point reduction.

2023-08-11 Thread amonakov at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=110979

--- Comment #2 from Alexander Monakov  ---
Yes, it is wrong-code to full extent. To demonstrate, you can initialize 'sum'
and the array to negative zeroes:

#define FLT double
#define N 20

__attribute__((noipa))
FLT
foo3 (FLT *a)
{
FLT sum = -0.0;
for (int i = 0; i != N; i++)
  sum += a[i];
return sum;
}

int main()
{
FLT a[N];
for (int i = 0; i != N; i++)
a[i] = -0.0;
if (!__builtin_signbit(foo3(a)))
__builtin_abort();
}

[Bug ipa/110946] 3x perf regression with -Os on M1 Pro

2023-08-08 Thread amonakov at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=110946

--- Comment #11 from Alexander Monakov  ---
(In reply to Alexander Monakov from comment #8)
> inline void mbedtls_put_unaligned_uint64(void *p, uint64_t x)
> {
> memcpy(p, &x, sizeof(x));
> }
> 
> 
> We're deciding not to inline this, while inlining its get_unaligned
> counterpart? Seems bizarre.

I can reproduce this part, and on my side it's caused by _FORTIFY_SOURCE: with
fortification, put_unaligned indeed looks bigger during inlining:

mbedtls_put_unaligned_uint32 (void * p, uint32_t x)
{
  long unsigned int _3;

  <bb 2> [local count: 1073741824]:
  _3 = __builtin_object_size (p_2(D), 0);
  __builtin___memcpy_chk (p_2(D), &x, 4, _3);
  return;

}

mbedtls_get_unaligned_uint64 (const void * p)
{
  long unsigned int _3;

  <bb 2> [local count: 1073741824]:
  _3 = MEM <long unsigned int> [(char * {ref-all})p_2(D)];
  return _3;

}

[Bug ipa/110946] 3x perf regression with -Os on M1 Pro

2023-08-08 Thread amonakov at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=110946

--- Comment #10 from Alexander Monakov  ---
Ah, the non-static inlines are intentional, the corresponding extern
declarations appear in library/platform_util.c. Sorry, I missed that file the
first time around.

[Bug ipa/110946] 3x perf regression with -Os on M1 Pro

2023-08-08 Thread amonakov at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=110946

--- Comment #9 from Alexander Monakov  ---
(In reply to Alexander Monakov from comment #2)
> Note that inline functions in mbedtls/library/alignment.h all miss the
> 'static' qualifier, which affects inlining decisions, and looks like a
> mistake anyway (if they are really meant to be non-static inlines, shouldn't
> there be a comment?)

Can you address this on the mbedtls side? Even if it doesn't help with the
observed slowdown, it will remain a problem for the future if left unfixed.

[Bug ipa/110946] 3x perf regression with -Os on M1 Pro

2023-08-08 Thread amonakov at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=110946

--- Comment #8 from Alexander Monakov  ---
Why? There's no bswap here, in particular mbedtls_put_unaligned_uint64 is a
straightforward wrapper for memcpy:

inline void mbedtls_put_unaligned_uint64(void *p, uint64_t x)
{
memcpy(p, &x, sizeof(x));
}


We're deciding not to inline this, while inlining its get_unaligned counterpart?
Seems bizarre.

[Bug other/110946] 3x perf regression with -Os on M1 Pro

2023-08-08 Thread amonakov at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=110946

Alexander Monakov  changed:

   What|Removed |Added

 CC||amonakov at gcc dot gnu.org

--- Comment #2 from Alexander Monakov  ---
So basically missed inlining at -Os, even memcpy wrappers are not inlined.

Can you provide a reproducible testcase?

Note that inline functions in mbedtls/library/alignment.h all miss the 'static'
qualifier, which affects inlining decisions, and looks like a mistake anyway
(if they are really meant to be non-static inlines, shouldn't there be a
comment?)

Does making them 'static inline' rectify the problem?

[Bug target/110926] [14 regression] Bootstrap failure (matmul_i1.c:1781:1: internal compiler error: RTL check: expected elt 0 type 'i' or 'n', have 'w' (rtx const_int) in vpternlog_redundant_operand_m

2023-08-06 Thread amonakov at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=110926

Alexander Monakov  changed:

   What|Removed |Added

 CC||amonakov at gcc dot gnu.org

--- Comment #7 from Alexander Monakov  ---
Thanks for identifying the problem. Please don't rename the argument to
'op_mask' though: the parameter itself is not a mask, it's an eight-bit control
word of the vpternlog instruction (holding the logic table of a three-operand
Boolean function). The function derives a three-bit mask from it.

[Bug target/110202] _mm512_ternarylogic_epi64 generates unnecessary operations

2023-08-05 Thread amonakov at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=110202

Alexander Monakov  changed:

   What|Removed |Added

 Status|NEW |RESOLVED
 Resolution|--- |FIXED

--- Comment #12 from Alexander Monakov  ---
We now generate

negate1:
vmovdqa64   zmm0, ZMMWORD PTR [rdi]
vpternlogq  zmm0, zmm0, zmm0, 85
ret
negate2:
vmovdqa32   zmm0, ZMMWORD PTR [rdi]
vpternlogd  zmm0, zmm0, zmm0, 0x55
ret

Fixed for gcc-14.

[Bug sanitizer/110799] [tsan] False positive due to -fhoist-adjacent-loads

2023-07-31 Thread amonakov at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=110799

--- Comment #16 from Alexander Monakov  ---
In C11 and C++11 the issue of compiler-introduced racing loads is discussed as
follows (5.1.2.4 Multi-threaded executions and data races in C11):

28 NOTE 14 Transformations that introduce a speculative read of a potentially
shared memory location may not preserve the semantics of the program as defined
in this standard, since they potentially introduce a data race. However, they
are typically valid in the context of an optimizing compiler that targets a
specific machine with well-defined semantics for data races. They would be
invalid for a hypothetical machine that is not tolerant of races or provides
hardware race detection.


So for TSan we'd allow hoisting only after TSan instrumentation, and for
Helgrind we'd ask them to handle load-load-cmov combo as only consuming one of
the loads?


I think the other optimizations from comment #8 introduce racing loads more
rarely in practice, because they are limited to non-trapping accesses.

[Bug rtl-optimization/110823] [missed optimization] >50% speedup for x86-64 ASCII processing a la GNU diffutils

2023-07-30 Thread amonakov at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=110823

Alexander Monakov  changed:

   What|Removed |Added

 CC||amonakov at gcc dot gnu.org

--- Comment #4 from Alexander Monakov  ---
It's a weakness in the REE pass. AFAICT normally it would handle this, but here
there are two elimination candidates in 'main', the first is eliminated
successfully, and then REE punts on the second because one of its reaching
definitions is the first redundant extension:

  /* If def_insn is already scheduled to be deleted, don't attempt
 to modify it.  */
  if (state->modified[INSN_UID (def_insn)].deleted)
return false;

While looking into this I noticed that the fix for PR 61094 introduced a
write-only bitfield 'do_not_reextend' (the Changelog wrongly claimed it was
used).

[Bug sanitizer/110799] [tsan] False positive due to -fhoist-adjacent-loads

2023-07-26 Thread amonakov at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=110799

--- Comment #9 from Alexander Monakov  ---
(In reply to Tom de Vries from comment #7)
> Can you elaborate on what you consider a correct approach?

I think this optimization is incorrect and should be active only under -Ofast.

I can offer two arguments. First, even without considering correctness,
breaking TSan and Helgrind is a substantial QoI issue and we should consider
shielding -O2 users from that (otherwise they'll discover it the hard way,
curse at us, stick -fno-hoist-adjacent-loads in their build system and consider
switching to another compiler).

Second, I can upgrade the initial example to an actual miscompilation. The
upgrade is based on two considerations: the optimization works on
possibly-trapping accesses, and relies on types of memory references to decide
if it's safe, but it runs late where the types are not what they were in the C
source. Hence, the following example:

struct S {
int a;
};
struct M {
int a, b;
};

int f(struct S *p, int c, int d)
{
int r;
if (c)
if (d)
r = p->a;
else
r = ((struct M*)p)->a;
else
r = ((struct M*)p)->b;
return r;
}

is miscompiled to

f:
mov eax, DWORD PTR [rdi+4]
testesi, esi
cmovne  eax, DWORD PTR [rdi]
ret

even though the original program never accesses beyond struct S if 'c && d'.
Phi-opt incorrectly performs hoisting after PRE collapses 'if (d) ... else
...'.

[Bug sanitizer/110799] [tsan] False positive due to -fhoist-adjacent-loads

2023-07-25 Thread amonakov at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=110799

Alexander Monakov  changed:

   What|Removed |Added

 CC||amonakov at gcc dot gnu.org

--- Comment #5 from Alexander Monakov  ---
(In reply to Richard Biener from comment #1)
> We consider introducing load data races OK, what's the difference here? 
> There are other passes that would do similar things but in practice the
> loads would be considered to possibly trap so the real-world impact might be
> limited?

What are the examples of other transforms that can introduce data races?

This trips Valgrind's data race detector (valgrind --tool=helgrind) too. So I
don't think checking SANITIZE_THREAD is the correct approach.

[Bug target/110762] inappropriate use of SSE (or AVX) insns for v2sf mode operations

2023-07-21 Thread amonakov at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=110762

--- Comment #14 from Alexander Monakov  ---
That seems undesirable in light of comment #4, you'd risk creating a situation
when -fno-trapping-math is unpredictably slower when denormals appear in dirty
upper halves.

[Bug target/110762] inappropriate use of SSE (or AVX) insns for v2sf mode operations

2023-07-21 Thread amonakov at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=110762

Alexander Monakov  changed:

   What|Removed |Added

 CC||amonakov at gcc dot gnu.org

--- Comment #4 from Alexander Monakov  ---
In addition to FPU exception issue, it's also a performance trap due to
handling of accidental denormals in upper halves.

[Bug target/110611] X86 is not honouring POINTERS_EXTEND_UNSIGNED in m32 code.

2023-07-10 Thread amonakov at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=110611

Alexander Monakov  changed:

   What|Removed |Added

 CC||amonakov at gcc dot gnu.org

--- Comment #5 from Alexander Monakov  ---
You cannot use this internal macro to deduce how your C testcase should behave.
The language standard says this conversion has implementation-defined behavior,
and GCC manual (the user manual, not the internals manual) has a chapter on
implementation-defined behavior, which explicitly says:

https://gcc.gnu.org/onlinedocs/gcc/Arrays-and-pointers-implementation.html

A cast from pointer to integer [...] sign-extends if the pointer representation
is smaller than the integer type [...].

So the behavior is the same for all targets.

[Bug target/110438] generating all-ones zmm needs dep-breaking pxor before ternlog

2023-07-04 Thread amonakov at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=110438

--- Comment #3 from Alexander Monakov  ---
Patch available:
https://inbox.sourceware.org/gcc-patches/8f73371d732237ed54ede44b7bd88...@ispras.ru/T/#u

[Bug rtl-optimization/110202] _mm512_ternarylogic_epi64 generates unnecessary operations

2023-06-27 Thread amonakov at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=110202

--- Comment #9 from Alexander Monakov  ---
(In reply to Hongtao.liu from comment #8)
> 
> For this one, we can load *a into %zmm0 to avoid false_dependence.
> 
> vmovdqa64  zmm0, ZMMWORD PTR [rdi]
> vpternlogq  zmm0, zmm0, zmm0, 85

Yes, since ternlog with memory operand needs two fused-domain uops on Intel
CPUs, breaking out the load would be more efficient for both negate1 and
negate2.

[Bug rtl-optimization/110202] _mm512_ternarylogic_epi64 generates unnecessary operations

2023-06-27 Thread amonakov at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=110202

--- Comment #7 from Alexander Monakov  ---
Note that vpxor serves as a dependency-breaking instruction (see PR 110438). So
in negate1 we do the right thing for the wrong reasons, and in negate2 we can
cause a substantial stall if the previous computation of xmm0 has a non-trivial
dependency chain.

[Bug target/110438] generating all-ones zmm needs dep-breaking pxor before ternlog

2023-06-27 Thread amonakov at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=110438

--- Comment #1 from Alexander Monakov  ---
We might want to omit PXOR when optimizing for size.

[Bug target/110438] New: generating all-ones zmm needs dep-breaking pxor before ternlog

2023-06-27 Thread amonakov at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=110438

Bug ID: 110438
   Summary: generating all-ones zmm needs dep-breaking pxor before
ternlog
   Product: gcc
   Version: 13.0
Status: UNCONFIRMED
  Severity: normal
  Priority: P3
 Component: target
  Assignee: unassigned at gcc dot gnu.org
  Reporter: amonakov at gcc dot gnu.org
  Target Milestone: ---
Target: x86_64-*-*

VPTERNLOG is never a dependency-breaking instruction on existing x86
implementations, so generating a vector of all-ones via bare ternlog can stall
waiting on destination register. GCC should emit a dependency-breaking PXOR,
otherwise it will be a false-dependency-on-popcnt-lzcnt debacle all over again.

#include <immintrin.h>

__m512i g(void)
{
return (__m512i){ 0 } - 1;
}

g:
# waits until previous computation
# of zmm0 has completed
vpternlogd  zmm0, zmm0, zmm0, 0xFF
ret

[Bug rtl-optimization/110237] gcc.dg/torture/pr58955-2.c is miscompiled by RTL scheduling after reload

2023-06-27 Thread amonakov at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=110237

--- Comment #21 from Alexander Monakov  ---
(In reply to rguent...@suse.de from comment #19)
> But the size argument doesn't have anything to do with TBAA (and
> may_alias is about TBAA).  I don't think we have any way to circumvent
> C object access rules.  That is, for example, with -fno-strict-aliasing
> the following isn't going to work.
> 
> int a;
> int b;
> 
> int main()
> {
>   a = 1;
>   b = 2;
>   if (&a + 1 == &b) // equality compare of unrelated pointers OK
> {
>   long x = *(long *)&a; // access outside of 'a' not OK
>   if (x != 0x00010002)
> abort ();
> }
> }
> 
> there's no command-line flag or attribute to form a pointer
> to an object composing 'a' and 'b' besides changing how the
> storage is declared.

But store-merging and SLP can introduce a wide long-sized access where on
source level you had two adjacent loads or even memcpy's, so we really seem to
have a problem here and might need to be able to annotate types or individual
accesses as "may-alias-with-oob-ok" in the IR: PR 110431.

[Bug middle-end/110431] New: Incorrect disambiguation of wide accesess from store-merging or SLP

2023-06-27 Thread amonakov at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=110431

Bug ID: 110431
   Summary: Incorrect disambiguation of wide accesess from
store-merging or SLP
   Product: gcc
   Version: 12.3.0
Status: UNCONFIRMED
  Keywords: wrong-code
  Severity: normal
  Priority: P3
 Component: middle-end
  Assignee: unassigned at gcc dot gnu.org
  Reporter: amonakov at gcc dot gnu.org
  Target Milestone: ---

Inspired by bug 110237 comment 19:

int b, a;

int main()
{
int *pa = &a, *pb = &b;
asm("" : "+r"(pa));
asm("" : "+r"(pb));
if (pa + 1 == pb) {
a = 1, b = 2;
long x;
__builtin_memcpy(&x, pa, 4);
__builtin_memcpy(4 + (char *)&x, pa+1, 4);
return (x - 0x00020001) * 131 >> 32;
}
}

https://godbolt.org/z/b67zxMv54

On GIMPLE, both store-merging and SLP vectorization are capable of introducing
merged long-sized access in place of individual int-sized memcpy's, which is
then disambiguated against initial stores on the RTL level, leading to a
miscompilation.

[Bug rtl-optimization/110237] gcc.dg/torture/pr58955-2.c is miscompiled by RTL scheduling after reload

2023-06-26 Thread amonakov at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=110237

--- Comment #18 from Alexander Monakov  ---
(In reply to rguent...@suse.de from comment #17)
> Yes, we do the same to loads.  I hope that's not a common technique
> though but I have to admit the vectorizer itself assesses whether it's
> safe to access "gaps" by looking at alignment so its code generation
> is prone to this same "mistake".
> 
> Now, is "alignment to 16 is ensured externally" good enough here?
> If we consider
> 
> static int a[2];
> 
> and code doing
> 
>  if (is_aligned (a))
>{
>  __v4si v = *(__attribute__((may_alias)) __v4si *) &a;
>}
> 
> then we cannot even use a DECL_ALIGN that's insufficient for decls
> that bind locally.

I agree. I went with the 'extern' example because there it should be more
obvious the construction ought to work.


> Note we have similar arguments with aggregate type sizes (and TBAA)
> where when we infer a dynamic type from one access we check if
> the other access would fit.  Wouldn't the above then extend to that
> as well given we could also do aggregate copies of "padding" and
> ignore the bits if we'd have ensured the larger access wouldn't trap?

I think a read via a may_alias type just tells you that N bytes are accessible
for reading, not necessarily for writing. So I don't see a problem, but maybe I
didn't quite catch what you are saying.


> So supporting the above might be a bit of a stretch (though I think
> we have to fix the vectorizer here).

What would the solution be? Using a may_alias type for such accesses?


> > > If the v4si store is masked we cannot do this anymore, but the IL
> > > we seed the alias oracle with doesn't know the store is partial.
> > > The only way to "fix" it is to take away all of the information from it.
> > 
> > But that won't fix the trapping issue? I think we need a distinct RTX for
> > memory accesses where hardware does fault suppression for masked-out 
> > elements.
> 
> Yes, it doesn't fix that part.  The idea of using BLKmode instead of
> a vector mode for the MEMs would, I guess, together with specifying
> MEM_SIZE as not known.

Unfortunate if that works for the trapping side, but not for the aliasing side.

[Bug rtl-optimization/110237] gcc.dg/torture/pr58955-2.c is miscompiled by RTL scheduling after reload

2023-06-26 Thread amonakov at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=110237

--- Comment #16 from Alexander Monakov  ---
(In reply to rguent...@suse.de from comment #14)
> vectors of T and scalar T interoperate TBAA wise.  What we disambiguate is
> 
> int a[2];
> 
> int foo(int *p)
> {
>   a[0] = 1;
>   *(v4si *)p = {0,0,0,0};
>   return a[0];
> }
> 
> because the V4SI vector store is too large for the a[] object.  That
> doesn't even use TBAA (it works with -fno-strict-aliasing just fine).

Thank you for the example. If we do the same for vector loads, that's a footgun
for users who use vector loads to access small objects:

// alignment to 16 is ensured externally
extern int a[2];

int foo()
{
  a[0] = 1;

  __v4si v = *(__attribute__((may_alias)) __v4si *) &a;
  // mask out extra elements in v and continue
 ...
}

This is a benign data race on data that follows 'a' in the address space, but
otherwise should be a valid and useful technique.

> If the v4si store is masked we cannot do this anymore, but the IL
> we seed the alias oracle with doesn't know the store is partial.
> The only way to "fix" it is to take away all of the information from it.

But that won't fix the trapping issue? I think we need a distinct RTX for
memory accesses where hardware does fault suppression for masked-out elements.

[Bug target/110273] [12/13/14 Regression] i686-w64-mingw32 with -mavx512f generates AVX instructions without stack alignment

2023-06-26 Thread amonakov at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=110273

--- Comment #8 from Alexander Monakov  ---
(In reply to Sam James from comment #7)
> We keep getting quite a few reports of this downstream.

Of this mingw32 stack realignment issue specifically, i.e. Wine breakage when
AVX512 is enabled via CFLAGS?

[Bug rtl-optimization/110237] gcc.dg/torture/pr58955-2.c is miscompiled by RTL scheduling after reload

2023-06-26 Thread amonakov at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=110237

--- Comment #13 from Alexander Monakov  ---
(In reply to rguent...@suse.de from comment #12)
> As explained in comment#3 the issue is related to the tree alias oracle
> part that gets invoked on the MEM_EXPR for the load where there is
> no information that the load could be partial so it gets disambiguated
> against a decl that's off less size than the full vector.

With my example I'm trying to say that types in the IR are wrong if we
disambiguate like that. People writing C need to attach may_alias to vector
types for plain load/stores to validly overlap with scalar accesses, and when
vectorizer introduces vector accesses it needs to do something like that, or
else intermixed scalar accesses may be incorrectly disambiguated against new
vector ones.

[Bug rtl-optimization/110237] gcc.dg/torture/pr58955-2.c is miscompiled by RTL scheduling after reload

2023-06-26 Thread amonakov at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=110237

--- Comment #11 from Alexander Monakov  ---
The trapping angle seems valid, but I have a really hard time understanding the
DSE issue, and the preceding issue about disambiguation based on RTL aliasing.

How would DSE optimize out 'd[5] = 1' in your example when the mask_store reads
it? Isn't that a data dependency?

How is the initial issue different from

int f(__m128i *p, __m128i v, int *q)
{
  *q = 0;
  *p = v;
  return *q;
}

that we cannot optimize to 'return 0' due to __attribute__((may_alias))
attached to __m128i?

[Bug rtl-optimization/110307] ICE in move_insn, at haifa-sched.cc:5473 when building Ruby on alpha with -fPIC -O2 (or -fpeephole2 -fschedule-insns2)

2023-06-25 Thread amonakov at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=110307

--- Comment #13 from Alexander Monakov  ---
Note to self: check how control_flow_insn_p relates.

[Bug tree-optimization/110369] wrong code on x86_64-linux-gnu

2023-06-22 Thread amonakov at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=110369

Alexander Monakov  changed:

   What|Removed |Added

 CC||amonakov at gcc dot gnu.org

--- Comment #2 from Alexander Monakov  ---
The loop over 'e' is never entered, because 'a' is zero.

[Bug rtl-optimization/110307] ICE in move_insn, at haifa-sched.cc:5473 when building Ruby on alpha with -fPIC -O2 (or -fpeephole2 -fschedule-insns2)

2023-06-22 Thread amonakov at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=110307

--- Comment #10 from Alexander Monakov  ---
I think the first patch may result in duplicated notes, so I wouldn't recommend
picking it.

[Bug rtl-optimization/110307] ICE in move_insn, at haifa-sched.cc:5473 when building Ruby on alpha with -fPIC -O2 (or -fpeephole2 -fschedule-insns2)

2023-06-21 Thread amonakov at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=110307

--- Comment #8 from Alexander Monakov  ---
REG_EH_REGION is handled further down that function, but
copy_reg_eh_region_note_backward does not copy the note. Perhaps it needs

diff --git a/gcc/except.cc b/gcc/except.cc
index e728aa43b6..cfe140c4d0 100644
--- a/gcc/except.cc
+++ b/gcc/except.cc
@@ -1795,7 +1795,7 @@ copy_reg_eh_region_note_backward (rtx note_or_insn, rtx_insn *last, rtx first)
   note = XEXP (note, 0);

   for (insn = last; insn != first; insn = PREV_INSN (insn))
-if (insn_could_throw_p (insn))
+if (insn_could_throw_p (insn) || can_nonlocal_goto (insn))
   add_reg_note (insn, REG_EH_REGION, note);
 }

?

[Bug rtl-optimization/110307] ICE in move_insn, at haifa-sched.cc:5473 when building Ruby on alpha with -fPIC -O2 (or -fpeephole2 -fschedule-insns2)

2023-06-20 Thread amonakov at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=110307

--- Comment #6 from Alexander Monakov  ---
Cross-compiler needs HAVE_AS_EXPLICIT_RELOCS=1.

With checking enabled, we get:

t.c:8:1: error: flow control insn inside a basic block
(call_insn 97 96 98 4 (parallel [
(set (reg:DI 0 $0)
(call (mem:DI (reg:DI 27 $27) [0  S8 A64])
(const_int 0 [0])))
(set (reg:DI 29 $29)
(unspec:DI [
(reg:DI 29 $29)
(const_int 6 [0x6])
] UNSPEC_LDGP1))
(use (symbol_ref:DI ("__tls_get_addr") [flags 0x41]  ))
(use (unspec [
(const_int 1 [0x1])
] UNSPEC_TLSGD_CALL))
(clobber (reg:DI 26 $26))
]) "t.c":6:22 -1
 (nil)
(expr_list (use (reg:DI 16 $16))
(nil)))
during RTL pass: peephole2
dump file: t.c.313r.peephole2
t.c:8:1: internal compiler error: in rtl_verify_bb_insns, at cfgrtl.cc:2797


Insn 96 appears via:

Splitting with gen_peephole2_8 (alpha.md:5972)
scanning new insn with uid = 96.
scanning new insn with uid = 97.
scanning new insn with uid = 98.
deleting insn with uid = 25.

Insn 25 was:

(call_insn/u 25 39 26 4 (parallel [
(set (reg:DI 0 $0)
(call (mem:DI (symbol_ref:DI ("__tls_get_addr") [flags 0x41] 
) [0  S8 A64])
(const_int 0 [0])))
(unspec [
(const_int 1 [0x1])
] UNSPEC_TLSGD_CALL)
(use (reg:DI 29 $29))
(clobber (reg:DI 26 $26))
]) "t.c":6:22 346 {call_value_osf_tlsgd}
 (expr_list:REG_DEAD (reg:DI 16 $16)
(expr_list:REG_EH_REGION (const_int -2147483648 [0xffffffff80000000])
(nil)))
(expr_list (use (reg:DI 16 $16))
(nil)))

Note the REG_EH_REGION. This is relevant because can_nonlocal_goto checks it,
so for insn 25 we knew it wouldn't return to the setjmp receiver. Applying the
peephole dropped the note.

[Bug rtl-optimization/110307] ICE in move_insn, at haifa-sched.cc:5473 when building Ruby on alpha with -fPIC -O2 (or -fpeephole2 -fschedule-insns2)

2023-06-20 Thread amonakov at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=110307

--- Comment #5 from Alexander Monakov  ---
It's not necessary yet for this particular bug, but might be helpful for future
bugs (if disk space is not an issue).

[Bug rtl-optimization/110307] ICE in move_insn, at haifa-sched.cc:5473 when building Ruby on alpha with -fPIC -O2 (or -fpeephole2 -fschedule-insns2)

2023-06-19 Thread amonakov at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=110307

--- Comment #3 from Alexander Monakov  ---
Do you have older versions of GCC to check on this testcase?

[Bug rtl-optimization/110307] ICE in move_insn, at haifa-sched.cc:5473 on alpha with -fPIC -fpeephole2 -fschedule-insns2

2023-06-19 Thread amonakov at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=110307

Alexander Monakov  changed:

   What|Removed |Added

 CC||amonakov at gcc dot gnu.org

--- Comment #1 from Alexander Monakov  ---
I tried building a cross-compiler from trunk with
--target=alpha-unknown-linux-gnu --with-gnu-ld --with-gnu-as --enable-secureplt
--enable-languages=c --enable-tls and got

t.c:8:1: error: unrecognizable insn:
8 | }
  | ^
(insn 23 22 24 5 (set (reg/f:DI 74)
(symbol_ref:DI ("ruby_current_ec") [flags 0x10]  )) "t.c":6:22 -1
 (nil))
during RTL pass: vregs

Would you mind compiling the testcase with -fdump-tree-all -fdump-rtl-all and
attaching a tar.gz with the resulting dumps?

[Bug target/110273] [12/13/14 Regression] i686-w64-mingw32 with -mavx512f generates AVX instructions without stack alignment

2023-06-16 Thread amonakov at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=110273

--- Comment #6 from Alexander Monakov  ---
Huh? Just compile the supplied testcases without avx512, you'll see proper
stack realignment.

[Bug target/110273] i686-w64-mingw32 with -march=znver4 generates AVX instructions without stack alignment

2023-06-16 Thread amonakov at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=110273

--- Comment #4 from Alexander Monakov  ---
Further reduced:

void f()
{
int c[4] = { 0, 0, 0, 0 };
int cc[8] = { 0 };
asm("" :: "m"(c), "m"(cc));
}

Also reproducible with -march=skylake-avx512 or even plain -mavx512f,
retitling.

[Bug target/110273] i686-w64-mingw32 with -march=znver4 generates AVX instructions without stack alignment

2023-06-16 Thread amonakov at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=110273

--- Comment #3 from Alexander Monakov  ---
Seems to work fine with explicit '-mincoming-stack-boundary=2' on the command
line, even though it should make no difference for the 32-bit MinGW target.

[Bug target/110260] Multiple applications misbehave at runtime when compiled with -march=znver4

2023-06-15 Thread amonakov at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=110260

--- Comment #10 from Alexander Monakov  ---
Right, those are different issues. Any chance of a standalone testcase
extracted from Wine? If you already see a function where stack realignment is
missing, just give us preprocessed containing source, full gcc command line,
and output of 'gcc -v', as described on https://gcc.gnu.org/bugs/

(please open a new bug with that, and mention the new bug # here)

[Bug target/110260] Multiple applications misbehave when compiled with -march=znver4

2023-06-15 Thread amonakov at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=110260

--- Comment #6 from Alexander Monakov  ---
(In reply to Jimi Huotari from comment #0)
> (By the by, is ADCX a typo of ADX?  I see -madx as an option but only one
> use of it otherwise, and no -adcx as an option and lots of mentions of it...
> but perhaps I'm not reading them correct-like.)

ADX is an x86 extension that adds two new instructions, ADCX and ADOX:
https://en.wikipedia.org/wiki/Intel_ADX

[Bug target/110260] Multiple applications misbehave when compiled with -march=znver4

2023-06-15 Thread amonakov at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=110260

Alexander Monakov  changed:

   What|Removed |Added

 CC||amonakov at gcc dot gnu.org

--- Comment #4 from Alexander Monakov  ---
Um, sched1 is not enabled on x86 so -fno-schedule-insns does nothing — you
probably meant -fno-schedule-insns2?

Another thing to try is -fstack-reuse=none, as indicated by comment #1.

[Bug web/110250] Broken url to README in st/cli-be project

2023-06-14 Thread amonakov at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=110250

Alexander Monakov  changed:

   What|Removed |Added

 CC||amonakov at gcc dot gnu.org

--- Comment #1 from Alexander Monakov  ---
It's under refs/vendors/st in the new git repo:

https://gcc.gnu.org/git/?p=gcc.git;a=blob_plain;f=README;h=1b3709666818ce43833c42bd858e14ff3d233ff6;hb=refs/vendors/st/heads/README

You can use 'git ls-remote git://gcc.gnu.org/git/gcc.git' to view the list of
all remote heads and tags (or just 'git ls-remote' from a cloned repo).

[Bug c/110249] __builtin_unreachable helps optimisation at -O1 but not at -O2

2023-06-14 Thread amonakov at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=110249

Alexander Monakov  changed:

   What|Removed |Added

 CC||amonakov at gcc dot gnu.org

--- Comment #1 from Alexander Monakov  ---
Please show 'gcc -v' for the case when you get one aligned load at -O1.

[Bug rtl-optimization/110237] gcc.dg/torture/pr58955-2.c is miscompiled by RTL scheduling after reload

2023-06-14 Thread amonakov at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=110237

Alexander Monakov  changed:

   What|Removed |Added

 CC||amonakov at gcc dot gnu.org

--- Comment #1 from Alexander Monakov  ---
Sorry, was your recent patch g:1c3661e224e3ddfc6f773b095740c0f5a7ddf5fc
tackling a different issue?

[Bug rtl-optimization/110202] _mm512_ternarylogic_epi64 generates unnecessary operations

2023-06-12 Thread amonakov at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=110202

Alexander Monakov  changed:

   What|Removed |Added

 CC||amonakov at gcc dot gnu.org

--- Comment #6 from Alexander Monakov  ---
(In reply to Jakub Jelinek from comment #3)
> And I must say I don't immediately see easy rules how to find out from the
> immediate value which set is which, so unless we find some easy rule for
> that, we'd need to hardcode the mapping between the 256 values to a bitmask
> which inputs are actually used.

Well, that's really easy. The immediate is just an eight-entry look-up table
from any possible input bit triple to the output bit. The leftmost operand
corresponds to the most significant bit in the triple, so to check if the
operation vpternlog(A, B, C, I) is invariant w.r.t A you check if nibbles of I
are equal. Here we have 0x55, equal nibbles, and the operation is invariant
w.r.t A.

Similarly, to check if it's invariant w.r.t B we check if two-bit groups in I
come in pairs, or in code: (I & 0x33) == ((I >> 2) & 0x33). For 0x55 both sides
evaluate to 0x11, so again, invariant w.r.t B.

Finally, checking invariantness w.r.t C is (I & 0x55) == ((I >> 1) & 0x55).

[Bug c/110169] wrong code with '-Ofast'

2023-06-08 Thread amonakov at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=110169

Alexander Monakov  changed:

   What|Removed |Added

 CC||amonakov at gcc dot gnu.org

--- Comment #2 from Alexander Monakov  ---
It seems csmith was run with the --float argument. Differences under -Ofast are
expected (but even without -Ofast, it seems csmith can emit bit manipulation of
arbitrary floats, which can lead to unpredictable results when NaNs are
involved).

[Bug tree-optimization/110035] Missed optimization for dependent assignment statements

2023-06-06 Thread amonakov at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=110035

--- Comment #15 from Alexander Monakov  ---
malloc and friends modify 'errno' on failure, so they would have to be
special-cased for alias analysis.

[Bug middle-end/109967] [10/11/12/13/14 Regression] Wrong code at -O2 on x86_64-linux-gnu

2023-06-05 Thread amonakov at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=109967

Alexander Monakov  changed:

   What|Removed |Added

 CC||amonakov at gcc dot gnu.org

--- Comment #4 from Alexander Monakov  ---
See PR 90348 for the discussion of problematic lifetime semantics.

[Bug middle-end/110089] sub-optimal code for attempting to produce JNA (jump on CF or ZF)

2023-06-02 Thread amonakov at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=110089

Alexander Monakov  changed:

   What|Removed |Added

 CC||amonakov at gcc dot gnu.org

--- Comment #4 from Alexander Monakov  ---
If 'min' is needed anyway you can use it in subtraction:

void bar (unsigned);
#define MIN(a, b) ((a) < (b) ? (a) : (b))
void foo (unsigned int n, unsigned s)
{
  unsigned np;
  do
    {
      np = MIN (n, s);
      bar (np);
    }
  while (n -= np);
}

but getting the sub-jcc trick to work should yield more efficient code.
