[Bug target/112980] 64-bit powerpc ELFv2 does not allow nops to be generated before function global entry point

2024-04-11 Thread matz at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=112980

--- Comment #16 from Michael Matz  ---
(In reply to Kewen Lin from comment #15)
> I agree, thanks for the comments! btw, I'm not fighting for the current
> implementation, just want to know more details why users are unable to make
> use of the current implementation, is it just due to its inefficiency (like
> the above sequence) or un-usability (unused at all). As your comments, I
> think it's due to the former (inefficiency)?!

Okay.  So, yeah, I _think_ that other way (with NOPs between GEP and LEP,
plus a jump around them) could be made to work with userspace live patching.
It would just be inefficient.  But do note that that jump around was _not_
part of the original way of -fpatchable-function-entry, so a change to codegen
would have had to happen anyway to make that other way usable.  And it has
the (perhaps theoretical, who knows :) ) problem of not using the normal
8-byte difference between GEP and LEP.

I think your current proposal from comment #10 is the better one from all
perspectives.

[Bug target/112980] 64-bit powerpc ELFv2 does not allow nops to be generated before function global entry point

2024-04-09 Thread matz at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=112980

--- Comment #14 from Michael Matz  ---
(In reply to Kewen Lin from comment #13)
> (In reply to Giuliano Belinassi from comment #12)
> > With your patch we have:
> > 
> > > .LPFE0:
> > > ...
> > Which seems what is expected.
> 
> Hi Giuliano, thanks for your time on testing it!  Could you kindly help to
> explain a bit why "In such way we can't use this space to place a
> trampoline to the new function"? Is it due to inefficient code, like needing
> more branches?
> 
> global entry:
>   [b localentry]
> L1:
>   [patched code]
> localentry:
>   [b L1]
> 
> Or some other reason which makes it unused at all?

Hmm?  But this is not how the global-to-local hand-off is implemented (and
expected by tooling): it's a fall-through.  The global entry sets up the GOT
register; there simply is no '[b localentry]'.

If you mean to imply that the '[b localentry]' should also be patched in at
live-patch application time (and hence the GOT setup would need to be moved
somewhere else still), then you have the problem that (in the not-yet-patched
case), as long as the L1-nops sit between global and local entry, they will
always be executed when the global entry is called.  That's wasteful.

Additionally, tooling will be surprised if the address difference between
global and local entry isn't exactly 8 (i.e. two instructions).  The psABI
allows for different values, of course.  But I'm willing to bet that there are
bugs in the wild when different values are actually used.

So, the nops-between-GEP-and-LEP approach could probably be made to work
somehow with userspace live patching, but your most recent patch here makes
this all moot.  It generates exactly the sequence we want: a single nop at the
LEP, and a configurable patching area outside of, but near to, the function
(here: in front of the GEP).

[Bug target/113874] GNU2 TLS descriptor calls do not follow psABI on x86_64-linux-gnu

2024-02-12 Thread matz at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=113874

--- Comment #31 from Michael Matz  ---
(In reply to H.J. Lu from comment #30)
> (In reply to Michael Matz from comment #29)
> > It not only can call malloc.  As the backtrace of H.J. shows, it quite
> > clearly _does_ so :-)
> 
> ld.so can only call the malloc implementation internal to ld.so.

(And string functions for initializing that memory.)  If that's ensured already
everywhere: super.  Because I agree that this is the best thing to do here.
From my perspective this is a pure internal implementation detail, and hence
setting up thread-local areas should not be expected to be interposable by
users.
(A custom allocator that isn't malloc or doesn't interact with it would also
work.)

[Bug target/113874] GNU2 TLS descriptor calls do not follow psABI on x86_64-linux-gnu

2024-02-12 Thread matz at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=113874

--- Comment #29 from Michael Matz  ---
It not only can call malloc.  As the backtrace of H.J. shows, it quite clearly
_does_ so :-)

That's why there is talk earlier in this report about potentially not using
malloc as a one-time allocator for thread-local areas at all, or allocating the
memory at a different time than from __tls_get_addr.

[Bug target/112980] 64-bit powerpc ELFv2 does not allow nops to be generated before function global entry point

2024-01-18 Thread matz at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=112980

--- Comment #3 from Michael Matz  ---
(In reply to Kewen Lin from comment #1)
> 
> As Segher's review comments in [2], to support "before NOPs" before global
> entry and "after NOPs" after global entry,

Just to be perfectly clear here: the "after NOPs" need to come after the local
entry (which strictly speaking is of course after the global one as well, but
I'm being anal :) ).

[Bug target/99888] Add powerpc ELFv2 support for -fpatchable-function-entry*

2024-01-17 Thread matz at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=99888

Michael Matz  changed:

   What|Removed |Added

 CC||matz at gcc dot gnu.org

--- Comment #15 from Michael Matz  ---
Umm.  I just noticed this one as we now try to implement userspace live
patching for ppc64le.  The point of the "before" NOPs is (and always was) that
they are completely out of the way of patchable but as-of-yet unpatched
functions.

For ppc that means the "before" and "after" NOPs cannot be consecutive.  The
two NOP sets being consecutive was never a design criterion or requirement.

So, while the original bug is fixed by what was committed (the local entry was
skipping the patching-nops), the chosen solution is exactly the wrong one :-/

[Bug tree-optimization/113372] wrong code with _BitInt() arithmetics at -O1

2024-01-15 Thread matz at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=113372

--- Comment #16 from Michael Matz  ---
A general remark: in principle I don't see problems with undoing a little CSE,
as proposed.  I.e. for each SSA name use, see if it (trivially, conservatively,
or optimistically) refers to an address of a tracked DECL, and include those
in the dont-share set.

But I will remark that this then handles code that was invalid to start with
as if it were valid, and I'm not sure if this will change the diagnostics
outcome in any way (it certainly will prevent the slot sharing, and hence any
diagnostics that are based on such sharing, e.g. when dirtying memory at
lifetime-end).

What I mean is the classic anti-pattern in C:

   char * ptr;
   {
     char localarray[n];
     dosomething (localarray);
     ptr = localarray;
   }
   domore (ptr);

The proposed solution will make the reference from the SSA name of 'ptr',
even though it was invalid in the source, keep localarray[] conflicting
and hence unshared with any other surrounding storage.

Depending on how PHI nodes are handled this will also mean that an SSA name
can't ever be un-associated with a DECL anymore:

  char * ptr = someotherarray;
  if (cond1)
    {
      char localarray[n];
      ptr = localarray;
      foo (ptr);
    }
  if (cond2) { char localarray[n]; ptr = localarray; foo (ptr); }
  if (!cond1 && !cond2)
    foo (ptr);

Here we will have a PHI for ptr: 'ptr_3 = PHI'.  The question
is how the reference(s) to 'localarray' will be handled: do they lead to ptr_3
also referring to them or not?  I.e. the usual question in PHI merging:
optimistic or pessimistic.

Note that the answer to the latter question might influence the whole
effectiveness of stack slot sharing.  IIRC the latter type of code was one
of the examples that explodes in stack usage without sharing (localarray[]
being instead local C++ objects; but I don't remember if references to them
could bleed out of scopes in the same way as above).

[Bug middle-end/88345] -Os overrides -falign-functions=N on the command line

2023-12-06 Thread matz at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=88345

--- Comment #21 from Michael Matz  ---
(In reply to Jan Hubicka from comment #20)
> > 
> > Live patching (user-space) doesn't depend on any particular alignment of
> > functions, on x86-64 at least.  (The plan for other architectures wouldn't
> > need any specific alignment either).  Note that the above complaints about
> > missing alignment is for kernel (!) ftrace/livepatching on arm64 (!), not
> > on x86_64.
> 
> I had impression that x86-64 also needs forced alignment so the patching
> can be always done atomically.  But it is not a big practical difference
> if we go with a flag specifying minimal function alignment.

Not for userspace live patching (it's done while the process is stopped).
Kernel live patching may or may not need it.  Point being that alignment
shouldn't be implied by the live patching options, as it's orthogonal.

[Bug middle-end/88345] -Os overrides -falign-functions=N on the command line

2023-12-06 Thread matz at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=88345

--- Comment #19 from Michael Matz  ---
(In reply to Jan Hubicka from comment #18)
> Reading all the discussion again, I am leaning towards -falign-all-functions
> + documentation update explaining that -falign-functions/-falign-loops are
> optimizations and ignored for -Os.
> 
> I do use -falign-functions/-falign-loops when tuning for new generations of
> CPUs and I definitely want to have a way to specify alignment that is ignored
> for cold functions (as a performance optimization), and we have had this
> behavior since profile code was introduced in 2002.
> 
> As an optimization, we also want to have hot functions aligned more than 8
> byte boundary needed for patching.
> 
> I will prepare a patch for this and send it for discussion.  Perhaps we want
> -flive-patching to also imply a FUNCTION_BOUNDARY increase on x86-64? Or is
> live patching useful if function entries are not aligned?

Live patching (user-space) doesn't depend on any particular alignment of
functions, on x86-64 at least.  (The plan for other architectures wouldn't need
any specific alignment either.)  Note that the above complaints about missing
alignment are for kernel (!) ftrace/livepatching on arm64 (!), not x86_64.

[Bug target/111591] ppc64be: miscompilation with -mstrict-align / -O3

2023-10-23 Thread matz at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=111591

Michael Matz  changed:

   What|Removed |Added

 CC||matz at gcc dot gnu.org

--- Comment #28 from Michael Matz  ---
(In reply to Richard Biener from comment #27)
> (In reply to Kewen Lin from comment #26)
> > Thanks for the clarification! Is it possible to update the alias set for the
> > indirect accesses as well? since we know the address is originally taken
> > from one coalesced decl (also update its propagated ones).

That's not generally possible; the address-taking and the actual access might
be separated by arbitrary obfuscating code:

   char *p = &x;
   char *p2 = get_some_pointer(p);
   *p2 = ...;

Here p2 may, or may not, point to x.  So we'd need to be fairly conservative
here ...

> I suppose we could record a bitmap of all decls participating in any
> coalescing, check whether a MEM could possibly refer to any of them
> via the points-to API

... which the points-to API of course will be.

> and then force alias-set zero for those.

So that will work.  But I wonder if the result then won't be that essentially
all of the mem accesses get alias set zero, at least if there was any
coalescing.  At that point we may as well bite the bullet and just do away
with TBAA alias sets in RTL altogether.

> We
> could also try to do sophisticated analysis to make assigning a new
> alias-set for each coalesce group work, merging groups when there's
> indirect accesses that could alias a member of more than a single
> group.

The question is whether the sophistication is worth it.

[Bug middle-end/88345] -Os overrides -falign-functions=N on the command line

2023-09-12 Thread matz at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=88345

Michael Matz  changed:

   What|Removed |Added

 CC||matz at gcc dot gnu.org

--- Comment #16 from Michael Matz  ---
I think -falign-functions=N should simply override any consideration of -Os or
coldness or suchlike.  Clearly the explicit use of such an option reflects some
intention of the compiler user, and as the compiler can't know _what_ exactly
is intended, it needs to assume that the alignment needs to hold for all
functions.  Richi's proposal would of course work, but seems a rather awkward
workaround for behaviour that actually is surprising.

[Bug target/110899] RFE: Attributes preserve_most and preserve_all

2023-08-08 Thread matz at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=110899

--- Comment #11 from Michael Matz  ---
(In reply to Florian Weimer from comment #10)
> (In reply to Michael Matz from comment #9)
> > > > I don't see how that helps.  Imagine a preserve_all function foo that 
> > > > calls
> > > > printf.  How do you propose that 'foo' saves all parts of the SSE 
> > > > registers,
> > > > even those that aren't invented yet, or those that can't be touched by 
> > > > the
> > > > current ISA?  (printf might clobber all of these)
> > > 
> > > Vector registers are out of scope for this.
> > 
> > Why do you say that?  From clang: "Furthermore it also preserves all
> > floating-point registers (XMMs/YMMs)."  (for preserve_all, but this
> > bugreport does include that variant of the attribute).
> 
> Ugh, I preferred not to look at it because it's likely that the Clang
> implementation is broken (not future-proof).

I see; then we need to make clear that, in the context of this report, we
aren't going to do anything about preserve_all with clang's wording.

FWIW, in my implementation referred to above I chose to also have two variants:
one saving/restoring only the SSE2 parts of *mm8-*mm15 (i.e. xmm8-xmm15),
and one guaranteeing to not clobber anything of *mm8-*mm15.  (No guarantees
about *mm16 upwards.)  The first variant can call foreign functions;
the second variant is simply allowed to only call functions that also give
that guarantee.

(There is also the question of mask registers; the clang documentation doesn't
talk about them.  And I still would like to know the reason for the seemingly
arbitrary choice to leave some regs call-clobbered for aarch64.)

> > > But lets look at APX. If printf is recompiled to use APX, then it will
> > > clobber the extended register file. If we define __preserve_most__ the way
> > > we do in my psABI proposal (i.e., *not* as everything but %r11), the
> > > extended APX registers are still caller-saved.
> > 
> > Right, for preserve_most _with your wording_ it works out.  preserve_all
> > or preserve_most with clang wording doesn't.
> 
> In glibc, we already use a full context switch with XSAVE for the dynamic
> loader trampoline. As far as I understand it, it's not future-proof. The
> kernel could provide an interface that is guaranteed to work because it only
> enables those parts of the register file that it can context-switch. I can
> probably get the userspace-only implementation into glibc, but the kernel
> interface seems unlikely. We'd also have to work out the interaction of
> preserve_all and unwinding, setjmp etc.; not sure if there is a proper
> solution for that.

There are a couple of possibilities to implement a halfway solution for this,
via XSAVE and friends, or via runtime dispatching depending on the current CPU
(e.g. provide a generic save/restore-stuff function in libgcc).  The problem
will always be where the memory for this save/restore pattern should come from;
its size isn't constant at compile time.  That's also solvable, but it's
becoming more and more hairy.

That's why I chose to simply disallow calling foreign functions from those
that want to give very strict guarantees.  We could do the same for
preserve_all, if we absolutely want to have it.

[Bug target/110899] RFE: Attributes preserve_most and preserve_all

2023-08-07 Thread matz at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=110899

--- Comment #9 from Michael Matz  ---
(In reply to Florian Weimer from comment #8)
> (In reply to Michael Matz from comment #7)
> > > > Does the clang implementation take into account the various problematic
> > > > cases that arise when calling a normal function from a (say) 
> > > > preserve_all
> > > > function
> > > > (hint: such call can't usually be done)?
> > > 
> > > How so? We need to version the __preserve_most__ attribute with the ISA
> > > level, of course.
> > 
> > I don't see how that helps.  Imagine a preserve_all function foo that calls
> > printf.  How do you propose that 'foo' saves all parts of the SSE registers,
> > even those that aren't invented yet, or those that can't be touched by the
> > current ISA?  (printf might clobber all of these)
> 
> Vector registers are out of scope for this.

Why do you say that?  From clang: "Furthermore it also preserves all
floating-point registers (XMMs/YMMs)."  (For preserve_all, but this bugreport
does include that variant of the attribute.)

> But lets look at APX. If printf is recompiled to use APX, then it will
> clobber the extended register file. If we define __preserve_most__ the way
> we do in my psABI proposal (i.e., *not* as everything but %r11), the
> extended APX registers are still caller-saved.

Right, for preserve_most _with your wording_ it works out.  preserve_all
or preserve_most with clang wording doesn't.

> (These details are the main reason why I want this in the psABI
> documentation. This stuff is out there and will be used, so let's make sure
> that it's somewhat reasonable.)

I agree with that.  I just want a little hashing-out-the-details before
putting anything in the psABI.

[Bug target/110899] RFE: Attributes preserve_most and preserve_all

2023-08-07 Thread matz at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=110899

--- Comment #7 from Michael Matz  ---
(In reply to Florian Weimer from comment #5)
> > It also makes argument registers be callee-saved, which is very
> > unconventional.
> 
> Isn't this done for the this pointer in some C++ ABIs?

There are some, yes.  But those are then incompatible with the normal
definition of C++ calling conventions in terms of the ones for C (where 'this'
is simply the first pointer argument).

The sense of that deviation lies in observing that 'this' usually isn't changed
by the callee.  For arbitrary arguments this doesn't hold; they belong to the
called function, and so using a call-clobbered reg makes the most sense.  It's
possible to do differently of course, but it's very unusual.

> > Does the clang implementation take into account the various problematic
> > cases that arise when calling a normal function from a (say) preserve_all
> > function
> > (hint: such call can't usually be done)?
> 
> How so? We need to version the __preserve_most__ attribute with the ISA
> level, of course.

I don't see how that helps.  Imagine a preserve_all function foo that calls
printf.  How do you propose that 'foo' saves all parts of the SSE registers,
even those that aren't invented yet, or those that can't be touched by the
current ISA?  (printf might clobber all of these)

[Bug target/110899] RFE: Attributes preserve_most and preserve_all

2023-08-07 Thread matz at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=110899

--- Comment #4 from Michael Matz  ---
(In reply to Florian Weimer from comment #2)
> I tried to write up something for the x86-64 psABI:
> 
> Document the ABI for __preserve_most__ function calls
> 
> 
> This should match what Clang implements (but not what it documents).

Shouldn't we first discuss whether any of this, or particular aspects of it,
are a good idea at all, before specifying behaviour that a random compiler
happened to implement?

[Bug target/110899] RFE: Attributes preserve_most and preserve_all

2023-08-07 Thread matz at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=110899

--- Comment #3 from Michael Matz  ---
Huh, since when does clang implement this?  See also
  https://gcc.gnu.org/pipermail/gcc-patches/2023-July/624004.html
where I asked for comments about a similar, but not identical, mechanism.  I
came from the angle of preserving a couple of SSE registers on x86-64.

For ABIs you generally want a good mix between caller- and callee-saved
registers. The x86-64 psABI didn't do that on the SSE regs for conscious, but
meanwhile irrelevant, reasons, so my approach above tried to rectify this.

The clang attributes seem to go against that general idea; they move all regs
(or all general regs) into being callee-saved (except, strangely, for
aarch64?).  It also makes argument registers callee-saved, which is very
unconventional.

Does the clang implementation take into account the various problematic cases
that arise when calling a normal function from a (say) preserve_all function
(hint: such a call can't usually be done)?  Do exception handling,
setjmp/longjmp and make/swapcontext interact with the clang attributes?

That said: the attribute itself (after a more detailed spec) can be
implemented in the same framework as the above patch.  (It's possible that
something more needs doing for the unusual situation that argument regs are
then callee-saved.)

[Bug middle-end/110728] should __attribute__((cleanup())) callback get invoked for indirect edges of asm goto

2023-08-03 Thread matz at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=110728

--- Comment #11 from Michael Matz  ---
(In reply to Michael Matz from comment #9)
> Just for completeness: I agree with Andrew that the initial code example in
> comment #0 doesn't show any problem.  The edge from asmgoto to l0 doesn't
> cross
> the scope of the variable, hence no cleanups should be run.  The cleanup
> call that is there is from the edge that leaves the function scope before
> return, and it's placed correctly.

I was reminded that this is incorrect.  Though it isn't documented that way
(AFAICS), the cleanup attribute itself creates a scope, as we're using the
try/finally middle-end mechanisms to implement this attribute.  We can't change
that behaviour anymore of course, so that's how it has to be: jumping in front
of the decl of 'x' _is_ supposed to run the cleanup.

The initial example essentially boils down to:

void test4(void) {
l0:;
    try { /* implicit scope per instantiation of __cleanup__ variable */
        int x __attribute__((cleanup(test4cleanup)));
        asm goto("# %l0" : : : : l0); /* <-- leaves scope created by x */
    } finally {
        test4cleanup();
    }
}

[Bug middle-end/110728] should __attribute__((cleanup())) callback get invoked for indirect edges of asm goto

2023-08-03 Thread matz at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=110728

Michael Matz  changed:

   What|Removed |Added

 CC||matz at gcc dot gnu.org

--- Comment #9 from Michael Matz  ---
Just for completeness: I agree with Andrew that the initial code example in
comment #0 doesn't show any problem.  The edge from asmgoto to l0 doesn't cross
the scope of the variable, hence no cleanups should be run.  The cleanup
call that is there is from the edge that leaves the function scope before
return, and it's placed correctly.

But the example from comment #8 indeed shows a problem with asm goto, though it
may be impossible to fix.  It has to do with how we need to (possibly) split
critical edges, which changes label identity, which in turn might actually
be the thing that's required by the asm goto.  As in this situation:

--
extern int printf (const char *, ...);
extern void doit(void *p);

void foo (int cond)
{
  {
    int x __attribute__((cleanup(doit)));
    if (cond)
      asm goto("# A %l0" : : : : out);
    else
      asm goto("# B %l0" : : : : out);
  }
  printf("...");
 out:;
}
--

The calls to doit need to be emitted on the two edges asm1->out and asm2->out,
but must _not_ be emitted on the fallthrough printf->out (of course, because
the cleanup was already done on the edge x-scope->printf).  The only
way to do that is by splitting those edges, which introduces new labels:

{
  {
    int x;
    if (cond)
      asm goto("# A %l0" : : : : out1);
    else
      asm goto("# B %l0" : : : : out2);
    doit();
  }
  printf("...");
 out:
  return;
 out1: doit(); goto out;
 out2: doit(); goto out;
}

(In this easy example we could save ourselves by realizing that the out1 and
out2 code sequences are the same, and hence could get away with emitting just
one new label instead of two.  In general that's not possible.)

That's the CFG manipulation that needs to happen.  The question is what
it does to the asm goto: both now receive different labels, whereas before
they received the same one.  If whatever the asm does relies on the identity of
these labels, then this will break things.

Now, asm gotos were specifically introduced to be able to jump to stuff,
not to capture the identity of labels/code-addresses (as is done for instance
in the Linux kernel alternatives mechanism), so it may be fine to do the edge
splitting and associated renaming of labels.  But that needs to be a conscious
decision.

[Bug target/108742] Incorrect constant folding with (or exposed by) -fexcess-precision=standard

2023-02-09 Thread matz at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=108742

--- Comment #10 from Michael Matz  ---
(In reply to Jakub Jelinek from comment #9)
> (In reply to Michael Matz from comment #7)
> > (In reply to Jakub Jelinek from comment #5)
> > > https://eel.is/c++draft/cfloat.syn points to the C standard for
> > > FLT_EVAL_METHOD
> > > (plus https://eel.is/c++draft/expr#pre-6 talks about excess precision too)
> > > and e.g. C17
> > > 5.2.4.2.2/9):
> > > "2 evaluate all operations and constants to the range and precision of the
> > > long double type."
> > > 
> > > Note the " and constants" above.
> > 
> > Yes.  But that leaves unspecified exactly to what bit pattern the string
> > "4.2" should be converted to.
> 
> It should be converted to the closest long double, which is
> 0x8.8p-1,
> otherwise the constants wouldn't be evaluated to the range and precision of
> the long double type, only to double type then extended to long double.

Yes.

> In that case there
> would be no point to mention the " and constants" above, only operations
> would have excess precision, so double d; ... d = d + 4.2; would be d =
> (double) ((long double) d + (long double) (double) 4.2), while it actuall
> should be d = (double) ((long double) d + 4.2L);

As is clear by now, I disagree on that.  FLT_EVAL_METHOD is for dealing with
excess-precision hardware in a predictable way.  x87 loads double constants
into the 80-bit regs trivially (with the value I want "(long double)4.2" to
have, not 4.2L), so that's what the frontend should do.  I think an argument
built on the standard otherwise "having no point mentioning" something is
slippery at best.

> Sure, see the typeof above, 4.2 with FLT_EVAL_METHOD == 2 has typeof double,
> but value of 4.2L.

I think we'll have to agree to disagree here and let Joseph have the say :-)

[Bug target/108742] Incorrect constant folding with (or exposed by) -fexcess-precision=standard

2023-02-09 Thread matz at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=108742

--- Comment #8 from Michael Matz  ---
(In reply to Michael Matz from comment #7)
> So, my interpretation is that unsuffixed "4.2" has to be the double constant
> 4.2 (in IEEE double aka 0x1.0cccdp+2), which is then, because of
> FLT_EVAL_METHOD, evaluated to "range and precision of long double" leading
> to 0x8.8p-1, not to 0x8.666p-1 .

That is, I think that the real_from_string3 call in interpret_float should use
TYPE_MODE(type) plus conversion to const_type, not const_type plus conversion
to type (for over/underflow warnings).

[Bug target/108742] Incorrect constant folding with (or exposed by) -fexcess-precision=standard

2023-02-09 Thread matz at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=108742

--- Comment #7 from Michael Matz  ---
(In reply to Jakub Jelinek from comment #5)
> https://eel.is/c++draft/cfloat.syn points to the C standard for
> FLT_EVAL_METHOD
> (plus https://eel.is/c++draft/expr#pre-6 talks about excess precision too)
> and e.g. C17
> 5.2.4.2.2/9):
> "2 evaluate all operations and constants to the range and precision of the
> long double type."
> 
> Note the " and constants" above.

Yes.  But that leaves unspecified exactly what bit pattern the string "4.2"
should be converted to.  Both values used (0x8.666p-1 and
0x8.8p-1) are in "the range and precision of the long double type".

For that: 6.4.4.2 Floating constants (sorry, only C11 here; I hope wording to
the same effect is still in C17):

4 An unsuffixed floating constant has type double. If suffixed by the letter f
  or F...

5  Floating constants are converted to internal format as if at
translation-time. ...

So, my interpretation is that unsuffixed "4.2" has to be the double constant
4.2 (in IEEE double aka 0x1.0cccdp+2), which is then, because of
FLT_EVAL_METHOD, evaluated to "range and precision of long double", leading
to 0x8.8p-1, not 0x8.666p-1.

[Bug target/108742] Incorrect constant folding with (or exposed by) -fexcess-precision=standard

2023-02-09 Thread matz at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=108742

--- Comment #3 from Michael Matz  ---
(In reply to Jakub Jelinek from comment #2)
> Note, internally in standard excess precision, 4.2 seen by the lexer is
> actually
> EXCESS_PRECISION ,

Then _that_ is the problem.  The literal "4.2" simply is not a long double
literal "4.2L".

> when it is assigned to a double variable or
> cast
> to double (i.e. in places where C/C++ require the excess precision to be
> converted to the narrower one) it is rounded to double,
> but when used as (long double)4.2 it is the same as 4.2L

I disagree.  As "4.2" is "(double)4.2", then "(long double)4.2" should be
the same as "(long double)(double)4.2".

> and even (long double)d == (long double)4.2 should behave
> the same as (long double)d == 4.2 and d == 4.2.

On this we agree, all these expressions should behave the same.  But I say they
should _not_ behave the same as "(long double)d == 4.2L".

[Bug middle-end/108742] New: Incorrect constant folding with (or exposed by) -fexcess-precision=standard

2023-02-09 Thread matz at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=108742

Bug ID: 108742
   Summary: Incorrect constant folding with (or exposed by)
-fexcess-precision=standard
   Product: gcc
   Version: 12.2.1
Status: UNCONFIRMED
  Severity: normal
  Priority: P3
 Component: middle-end
  Assignee: unassigned at gcc dot gnu.org
  Reporter: matz at gcc dot gnu.org
  Target Milestone: ---

This came from a discussion about a user wondering why g++ behaved differently
between GCC 12 and GCC 13 regarding equality comparisons of floating-point
values on x87, i.e. -m32, which ultimately was because GCC 13 activated
-fexcess-precision=standard with -std=c++17.  Now this is not in itself
surprising, but I think something is still wrong.  From commit
98e341130f87984af07c884fea773c0bb3cc8821 (adding this to c++17):

...
The existing testcases tweaks were for cases on i686-linux where excess
precision breaks those tests, e.g. if we have
  double d = 4.2;
  if (d == 4.2)
then it does the expected thing only with -fexcess-precision=fast,
because with -fexcess-precision=standard it is actually
  double d = 4.2;
  if ((long double) d == 4.2L)
where 4.2L is different from 4.2.
...

And indeed a simple testcase about this (for C, but it demonstrates the same
thing): 
% cat float.c
#define bool _Bool
int main()
{
  double d = 4.2;

  bool eq1 = (long double)d == (long double)4.2;
  bool eq2 = (long double)d == 4.2L;
  bool eq3 = (long double)d == (long double)(double)4.2;

  __builtin_printf ("equal: %d/%d/%d\n", eq1, eq2, eq3);
  __builtin_printf ("1: %La\n", 4.2L);
  __builtin_printf ("2: %La\n", (long double)4.2);
  __builtin_printf ("3: %La\n", (long double)(double)4.2);

  return 0;
}
% gcc -m32 -O0 -fexcess-precision=fast -o fast float.c
% gcc -m32 -O0 -fexcess-precision=standard -o standard float.c
% ./fast
equal: 1/0/1
1: 0x8.666666666666666p-1
2: 0x8.6666666666668p-1
3: 0x8.6666666666668p-1
% ./standard
equal: 0/0/1
1: 0x8.666666666666666p-1
2: 0x8.666666666666666p-1
3: 0x8.6666666666668p-1


The bug I think exists is that also with =standard the first comparison
should yield 'true', and that irrespective of the excess-precision setting,
result (2) should always be the same as (3).  As the comment for the commit
correctly says:
   (long double)4.2 != 4.2L
hence, even with -fexcess-precision=standard the former should _not_ be
const-folded into the latter, but rather into the exact long double value that
corresponds to (double)4.2.  Namely:

(gdb) p/x (long double)4.2
$1 = 0x40018666666666666800
(gdb) p/x (long double)4.2L
$2 = 0x40018666666666666666

The former is the correct long double value to use for the casted double
literal, not the latter.

IOW: it's fine that -fexcess-precision=standard tries to emulate the excess
precision on the hardware by doing all arithmetic in the larger type.  But
then the constants also need to be folded exactly as the hardware would do.

In still other words: I think we have a bug in our constant folders and I don't
think it has anything directly to do with -fexcess-precision=standard, but is
merely exposed by it.  The wrong value is already present in 005t.original:

  __builtin_printf ((const char *) "1: %La\n",
4.1982652765240231929055880755186080932617e+0);
  __builtin_printf ((const char *) "2: %La\n",
4.1982652765240231929055880755186080932617e+0);
  __builtin_printf ((const char *) "3: %La\n",
4.20017763568394002504646778106689453125e+0);

[Bug c/106571] Implement -Wsection diag

2022-08-10 Thread matz at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=106571

Michael Matz  changed:

   What|Removed |Added

 CC||matz at gcc dot gnu.org

--- Comment #4 from Michael Matz  ---
To elaborate: normally you don't need to know which section a variable will be
placed into when generating code to access it.  (In fact you don't even need to
know if it's defined in the same linking component at all)

So, for declarations the section is immaterial, and checking for a mismatch
between decl and def doesn't seem appropriate, in the general case.  This is
not the general case: the per-cpu sections of the kernel have special
semantics, and I agree with Andrew that checking would be better implemented
with something orthogonal to the section attribute itself.

Boris: what does DECLARE_PER_CPU() expand into?  Are there other attributes
that could be usefully checked for mismatch between decl and def?

[Bug fortran/106353] [suboptimal] Why is a 3D array initialized, use case 2 two-layer loop?

2022-07-19 Thread matz at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=106353

--- Comment #2 from Michael Matz  ---
True, but in this case it could also be emitted in a better way by the fortran
frontend.

[Bug tree-optimization/106192] [11/12/13 Regression] ICE in vect_loop_versioning, at tree-vect-loop-manip.cc:3522

2022-07-05 Thread matz at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=106192

--- Comment #3 from Michael Matz  ---
(In reply to Michael Matz from comment #2)
> see why it should be.  If GOMP_SIMD_LANE has properties that make this
> transformation invalid I would argue that those properties are correctly

"are _not_" I meant to say, of course.

> represented.

[Bug tree-optimization/106192] [11/12/13 Regression] ICE in vect_loop_versioning, at tree-vect-loop-manip.cc:3522

2022-07-05 Thread matz at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=106192

--- Comment #2 from Michael Matz  ---
Unroll-and-jam simply unrolls the outer loop and merges all resulting
inner-loop bodies.  In this situation we have (before unroll-and-jam):

outerloop(i_1) {
  _12 = i_1 <= 59
  innerloop(i_1, j by 1) {
.GOMP_SIMD_LANE (simduid.6_16(D), 0, _12);
... uninteresting things (j) ...
  }
}

Unroll-and-jam then simply does (that's the unrolling):

outerloop(i by 2) {
  _12 = i_1 <= 59
  innerloop(i_1, j by 1) {
.GOMP_SIMD_LANE (simduid.6_16(D), 0, _12);
... uninteresting things (i, j) ...
  }
  i_2 = i_1 + 1
  _15 = i_2 <= 59
  innerloop(i_2, j by 1) {
.GOMP_SIMD_LANE (simduid.6_16(D), 0, _15);
... uninteresting things (i + 1, j) ...
  }
}

and then fuses the two inner loops, which means that the instructions between
them (the original pre-header of the inner loop) become replicated inside
the new inner loop body (here, the loop-invariant condition):

outerloop(i by 2) {
  _12 = i_1 <= 59
  innerloop(i_1, j by 1) {
.GOMP_SIMD_LANE (simduid.6_16(D), 0, _12);
... uninteresting things (i, j) ...
i_2 = i_1 + 1
_15 = i_2 <= 59
.GOMP_SIMD_LANE (simduid.6_16(D), 0, _15);
... uninteresting things (i + 1, j) ...
  }
}

There is nothing which somehow would indicate that this is invalid, and I can't
see why it should be.  If GOMP_SIMD_LANE has properties that make this
transformation invalid I would argue that those properties are correctly
represented.  One could of course hack bb_prevents_fusion_p or
unroll_jam_possible_p to avoid this situation, but that would seem like a
bad hack, as random other CFG transformation might introduce similar things:
namely a GOMP_SIMD_LANE statement that's fed by an unhoisted loop-invariant
condition.

So, I'd argue the assert is too eager.

[Bug c++/105169] Compiling C++ code with -fpatchable-function-entry=16,14 results in references to discarded sections

2022-04-07 Thread matz at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=105169

Michael Matz  changed:

   What|Removed |Added

 CC||matz at gcc dot gnu.org

--- Comment #8 from Michael Matz  ---
Something like this would be needed.  It's a proof-of-concept.  It actually
just copy-pastes code from default_function_rodata_section without proper
integration and without caring for the other cases handled there.  To be done
properly it would need abstracting what default_function_rodata_section does,
just with a choosable prefix.  Additionally the whole section handling in GCC
should be changed such that internally the section name is not the only
differentiator for the section hash table: currently that's the reason the
comdat name has to be appended to the section name, although ELF doesn't need
this for comdat sections.  So, a terrible hack, but ...

diff --git a/gcc/targhooks.c b/gcc/targhooks.c
index c9b5208853d..1ca5a4c3592 100644
--- a/gcc/targhooks.c
+++ b/gcc/targhooks.c
@@ -1963,14 +1963,36 @@ default_print_patchable_function_entry_1 (FILE *file,
   char buf[256];
   static int patch_area_number;
   section *previous_section = in_section;
+  section *sect;
   const char *asm_op = integer_asm_op (POINTER_SIZE_UNITS, false);

   gcc_assert (asm_op != NULL);
   patch_area_number++;
   ASM_GENERATE_INTERNAL_LABEL (buf, "LPFE", patch_area_number);

-  switch_to_section (get_section ("__patchable_function_entries",
-                                  flags, current_function_decl));
+  if (DECL_COMDAT_GROUP (current_function_decl) && HAVE_COMDAT_GROUP)
+    {
+      const char *dot;
+      size_t len;
+      char *rname;
+      const char *sname = "__patchable_function_entries";
+      const char *name = DECL_SECTION_NAME (current_function_decl);
+
+      dot = strchr (name + 1, '.');
+      if (!dot)
+        dot = name;
+      len = strlen (dot) + strlen (sname) + 1;
+      rname = (char *) alloca (len);
+
+      strcpy (rname, sname);
+      strcat (rname, dot);
+      sect = get_section (rname, (SECTION_LINKONCE | flags),
+                          current_function_decl);
+    }
+  else
+    sect = get_section ("__patchable_function_entries", flags,
+                        current_function_decl);
+  switch_to_section (sect);
   assemble_align (POINTER_SIZE);
   fputs (asm_op, file);
   assemble_name_raw (file, buf);

[Bug middle-end/104721] currently_expanding_gimple_stmt isn't cleared properly

2022-03-01 Thread matz at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=104721

--- Comment #2 from Michael Matz  ---
Is there a testcase where you noticed this, or was it just reading code?

[Bug tree-optimization/104543] [9/10/11 Regression] wrong code at -O3 on x86_64-linux-gnu

2022-02-15 Thread matz at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=104543

--- Comment #11 from Michael Matz  ---
(In reply to Richard Biener from comment #5)
> in particular the comment in bb_prevents_fusion_p saying
> 
>   /* BB is duplicated by outer unrolling and then all N-1 first copies
>  move into the body of the fused inner loop.  If BB exits the outer loop
>  the last copy still does so, and the first N-1 copies are cancelled
>  by loop unrolling, so also after fusion it's the exit block.
> 
> looks wrong.  Yes, the first is cancelled but the remaining is not the
> exit block.

I think your patch is fine and fixes the bug.  We can possibly do better also
for head-controlled loops by just reversing the above: the _first_ of the N
exit-bb copies needs to remain (and stay in front of the inner loop), and
the last N-1 copies need to be cancelled.  Needs to be carefully thought
through,
but something like that should be possible.

[Bug demangler/99188] cxxfilt may exist a uaf

2021-12-06 Thread matz at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=99188

Michael Matz  changed:

   What|Removed |Added

 Resolution|--- |FIXED
 Status|NEW |RESOLVED
 CC||matz at gcc dot gnu.org

--- Comment #7 from Michael Matz  ---
Actually, it _is_ fixed.  This problem report is about version 2.26, which is
many years old.  Current versions don't have this problem, at the very least
since the problematic code was removed wholesale in late 2018/early 2019.

[Bug c++/102067] SEGFAULT in varpool_node::get_constructor during lto1 when optimising or not using debug symbols

2021-08-25 Thread matz at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=102067

Michael Matz  changed:

   What|Removed |Added

 CC||matz at gcc dot gnu.org

--- Comment #8 from Michael Matz  ---
(In reply to Martin Liška from comment #7)
> Hm, note the ccxjqZU1 file contains:
> 
> libwave_common.a@0x4307c
> libwave_common.a@0x180314c
> 
> What does the syntax mean?

That's the syntax for 'the .o file found within the .a file at offset 0x4307c'.
(.a files are very simple and contain .o files mostly verbatim with some
headers and an index, so an offset within the archive is enough to describe a
contained object file; the sizes and such can be reconstructed from the object
file header itself.)

[Bug target/102024] [12 Regression] zero width bitfields and ABIs

2021-08-24 Thread matz at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=102024

--- Comment #8 from Michael Matz  ---
The only thing the x86-64 psABI says about this case is "'Unnamed bit-fields'
types do not affect the alignment of a structure or union."

(zero-width bit-fields are _always_ unnamed)

But the x86-64 psABI was written with the assumption in mind that zero-width
bitfields influence the alignment of the following field (i.e. that they break
the current bit bucket for two neighboring bitfield members), i.e. exactly
what GCC's stor-layout.c code always did for these when not in Microsoft
compat mode.

In the example in question this align-next-bitfield item doesn't come into
play.  But it does show that zero-width bitfields were designed and allowed as
extension for a purpose.  So it makes sense for instance on platforms that have
a concept of homogenous aggregates that zero-width bit-fields disable those,
without otherwise changing memory layout.

So, I think, not removing those members from the FE makes sense, they contain
crucial information.  Unfortunately that means that they need to be dealt with
in code dealing with layout (correct) or argument/return-value passing
(seemingly
fishy right now for some platforms?).

Also, all else being equal I think the C language defines the de-facto psABI,
so any difference between C and C++ should be resolved towards C.

[Bug middle-end/100394] wrong-code with EH and pure/const functions

2021-05-03 Thread matz at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=100394

--- Comment #4 from Michael Matz  ---
That still shows problems with the pure function and -O2, but with standard
C++ this works:

struct S {
int foo(int i) const { if (i) throw 42; return 0; }
};

int __attribute__((noinline)) bar2()
{
  S s;
  int a[2];
  try {
int res = s.foo (1);
a[0] = res;
  } catch (...) {
  return 0;
  }
  return 1;
}

int main()
{
  if (bar2 ())
__builtin_abort ();
  return 0;
}

[Bug middle-end/100394] wrong-code with EH and pure/const functions

2021-05-03 Thread matz at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=100394

Michael Matz  changed:

   What|Removed |Added

  Known to fail|3.4.6, 4.3.5|
 CC||matz at gcc dot gnu.org

--- Comment #3 from Michael Matz  ---
That's simply invalid C++.  A rethrow (which is what a 'throw' without value
is)
can only be done from within a catch.  You need to throw a value:

...
int __attribute__((pure,noinline)) foo () { if (x) throw 42; return y; }
...

[Bug target/100077] x86: by-value floating point array in struct - xmm regs spilling to stack

2021-04-14 Thread matz at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=100077

--- Comment #2 from Michael Matz  ---
Yeah, to solve this fully requires representing the parameter passing in a
better way, one that can be (a) used on the GIMPLE side (where the code is
already generated assuming the vec3a params go into memory) and (b) survives
the GIMPLE-to-RTL switch (or at least is used during that switch to find a
better expansion of the parameter into register loads, using shuffles in this
case).

No easy fix :-/

(Note: in normal programs such kernels should be inlined into whatever uses
the basic operations in loops, at which point this particular problem of
parameter
passing artifacts simply goes away, so it's visible only in micro tests.  It's
still a problem of course)

[Bug driver/99896] g++ drops -lc

2021-04-06 Thread matz at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=99896

Michael Matz  changed:

   What|Removed |Added

 CC||matz at gcc dot gnu.org

--- Comment #7 from Michael Matz  ---
(In reply to Jonathan Wakely from comment #1)
> (In reply to Tom de Vries from comment #0)
> > With g++, we have instead:
> > ...
> > collect2 ... main.o foo.o -lpcre2-posix ...
> > ...
> 
> It isn't dropped, it's moved to the end:
> 
> main.o foo.o -lpcre2-posix -lstdc++ -lm -lc -lgcc_s -lgcc -lc -lgcc_s -lgcc
> 
> If you need it before foo.o then -Wl,-lc seems like the right workaround for
> me.

Workaround is the correct term here.  The correct thing would be for g++ to not
reorder -l arguments.  The similarity to -I is superficial: duplicated -l
arguments have meaning (with static archives for instance) and their position
in relation to object and source files matters.  g++ can validly tack on
additional -l arguments to the end, and arguably also replace a lone -lc
argument that was originally at the end of the command line or implicit (e.g.
to inject its unwinder), but it shouldn't otherwise reorder such arguments.

I will of course agree that the issue that the added -lc "solves" is actually
a bug in the testcase (and gdb).  But that should be immaterial here.  At the
very least gcc and g++ should behave the same in this respect.

[Bug tree-optimization/99101] optimization bug with -ffinite-loops

2021-03-03 Thread matz at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=99101

--- Comment #22 from Michael Matz  ---
Over the last days I came to a similar conclusion.  Control dependence as
defined with postdoms doesn't capture the number of invocations of an
instruction, just whether it's invoked at all.  I.e. paths with and without
cycles are regarded the same on that level.

Basically the path expression for the testcase is (just inner loop):

  (bb11 -> bb4 -> ((bb5->bb9) | (bb6 -> (bb7->leave-cycle | bb9*

It's clear that bb7 is reached exactly once (because as soon as it's reached
it's
also the end of this function, and it must be reached because it's the only end
of
the function).

The thing with bb6 is that it can be invoked many times, and doesn't always
need
to lead to bb7, but rather can fall into bb9 (and the cycle restarts).
But it's also the case that bb6 doesn't need to be invoked in _all_ invocations
of
the cycle.  bb4 can choose to skip it (via bb5->bb9).

So, while it's true that bb6 and bb7 are not control dependent on bb4 in the
post-dominance formulation, it's pretty clear that bb4 isn't useless, because
it exactly can cause bb6 to be skipped _in the current cycle_.

If control dependence via postdoms doesn't capture the number of invocations
of a node then it simply cannot be used in cyclic regions to determine
uselessness
of predicates.

[Bug tree-optimization/99101] optimization bug with -ffinite-loops

2021-02-25 Thread matz at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=99101

--- Comment #19 from Michael Matz  ---
(In reply to Michael Matz from comment #18)
> But there are other blocks control dependent on bb4, namely bb5 and bb9.
> To see this observe that bb9 doesn't pdom bb4, but there's a path from bb4 to
> bb9 (e.g. bb4 -> bb5 -> bb9), where one node in that path that isn't bb4 is

... where _each_ node in that path that isn't bb4 ...  Sorry.

(Essentially this captures the idea that the path-start node that is the
control
node is the _latest_ node that determines (non)execution of the path-end, i.e.
all
further nodes after the control node unavoidably pass through path-end).

[Bug tree-optimization/99101] optimization bug with -ffinite-loops

2021-02-25 Thread matz at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=99101

--- Comment #18 from Michael Matz  ---
(In reply to Richard Biener from comment #11)
> (In reply to Richard Biener from comment #10)
> > Created attachment 50248 [details]
> > dot of the CFG as CD-DCE sees it
> 
> (gdb) p debug_dominance_info (CDI_POST_DOMINATORS)
> 2 3
> 3 11
> 4 6
> 5 9
> 6 7
> 7 1
> 9 11
> 11 4
> 12 3

So, on IRC you said that the at_eof is completely eliminated.  I wonder why.
It's true that bb6 post-dominates bb4, hence bb6 is not control dependent on
bb4.

But there are other blocks control dependent on bb4, namely bb5 and bb9.
To see this observe that bb9 doesn't pdom bb4, but there's a path from bb4 to
bb9 (e.g. bb4 -> bb5 -> bb9), where one node in that path that isn't bb4 is
pdom by bb9 (here bb5 and bb9 are such path nodes).  With the same reasoning
bb5 is also control dependent on bb4 (the example path being bb4->bb5, and
bb5 the only test node to check for pdom by bb5).

So, we have that bb5 and bb9 are control dependent on bb4, so removing
bb4 (or its predicate) would be wrong.  If our control-dependence machinery
doesn't figure out that cdep(bb4) = {bb5,bb9}, then that would be the bug.

(I haven't checked what our cdep calculation gives as result here, the above is
what should have happened).

[Bug tree-optimization/99101] optimization bug with -ffinite-loops

2021-02-25 Thread matz at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=99101

--- Comment #17 from Michael Matz  ---
(In reply to Richard Biener from comment #16)
> Of course since -ffinite-loops and the C++ standard require forward progress
> here and all testcases expect the loop to not terminate we're in the realm
> of undefined behavior.  But I'm not yet convinced the control-dependence /
> CD-DCE issue only shows up in such cases.  That said, it's fully expected
> that
> 
> int xx;
> int main()
> {
>   int jobs_ = 1;
>   int at_eof_ = 0;
>   while (1)
> {
>   for (int i = 0; i < jobs_; i++)
> {
>   if (at_eof_)
> continue;
>   at_eof_ = 1;
>   if (xx)
> return 1;
> }
>   jobs_ = 0;
> }
>   return 0;
> }
> 
> is eventually optimized to just return 1 with -ffinite-loops and we should
> try to preserve that behavior.

Just commenting on this last statement: I think that's wrong.  It's provable
that 'xx' doesn't change in the loop, and that it starts out as 0 (this is main
here).  So in fact we have produced an endless loop without a return, and hence
can do anything (when endless loops are undefined).

[Bug target/96895] ABI of returning V1DF differs between GCC and clang

2020-09-02 Thread matz at gcc dot gnu.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=96895

--- Comment #7 from Michael Matz  ---
(In reply to Richard Biener from comment #5)
> So vector types with element type T and N, a power-of-two, not otherwise
> specified are passes the same as
> 
> struct S { T a[N] };
> 
> ?

No.  Structs, if they are larger than 16 bytes, are of class MEMORY.  Vector
types larger than 16 bytes (which came only after the original psABI was
written) are passed differently than the corresponding struct would have been
(namely in the appropriate vector registers, when those types are native to
the hardware!).

That also explains your other observation:

> Ugh.  We pass struct S { double a[4]; } in %rdi  _and_ on the stack?!  And
> return by invisible reference _and_ in %eax?

Yes, a double[4] is larger than 16 bytes, hence MEMORY, hence passed on the
stack and returned via implicit reference.  You're misreading the assembler
code when you think it's passed in %rdi and on the stack.  %rdi contains the
address of the caller-allocated return slot, i.e. the implicit reference.
%rax contains that same address on return.

[Bug target/96895] ABI of returning V1DF differs between GCC and clang

2020-09-02 Thread matz at gcc dot gnu.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=96895

--- Comment #2 from Michael Matz  ---
The psABI doesn't say anything about such types, no.  Maybe it could in some
additional info pages, but it's always a problem to codify behaviour
retroactively
in it, when conflicting implementations already exist.  It is about extension
types, though, so we might be fine.

FWIW, even ignoring the obvious relation of v1Xf to Xf, GCC behaviour for float
and clang behaviour for double is the most logical one: this extended type is
most
similar to a struct containing one float/double, and such are passed in XMM
registers per psABI.  As this is also consistent with how a single
top-level float is passed, this choice is the most consistent one.  This is
also
what the psABI _would_ say, if we had written it into it, so at least both
compilers would need a change to implement it.

[Bug bootstrap/96794] --with-build-config=bootstrap-lto-lean with --enable-link-mutex leads to poor LTRANS utilization

2020-08-26 Thread matz at gcc dot gnu.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=96794

--- Comment #11 from Michael Matz  ---
(In reply to Jan Hubicka from comment #10)
> > We could also punt: when enable-link-mutex we could artificially up the job
> > number for make to account for the waiting link steps.  I.e. when normally 
> > -jN
> > was given, the link step could be done under -j(N + nr-language - 1).  That
> > would lead to the
> > nr-of-languages-1 job server tokens taken up by the sleeping link steps to 
> > be
> > accounted for, and hence the (single running) link step still be able to 
> > use N
> > threads/processes in parallel.
> 
> I do not think it is easy to do with current make jobserver.  The
> toplevel jobserver opens the pipe with -jN many tokens and all other
> makes connect to it.  It does not provide interface to return a token
> and re-take it unless we implement it of course.

Sure, but that's not the only way the above punting could be done.  The
parallel phase that currently communicates with the jobserver essentially
tries to get N tokens.  It could be changed to assume that there are
implicitly nr-of-languages-1 more tokens available.  I.e. the number of
threads/processes would always be enlarged by nr-of-languages-1, even without
any tokens from the jobserver.

[Bug bootstrap/96794] --with-build-config=bootstrap-lto-lean with --enable-link-mutex leads to poor LTRANS utilization

2020-08-26 Thread matz at gcc dot gnu.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=96794

--- Comment #9 from Michael Matz  ---
We could also punt: with --enable-link-mutex we could artificially raise the
job number for make to account for the waiting link steps.  I.e. when
normally -jN was given, the link step could be done under
-j(N + nr-of-languages - 1).  That would lead to the nr-of-languages-1
jobserver tokens taken up by the sleeping link steps being accounted for, and
hence the (single running) link step would still be able to use N
threads/processes in parallel.

[Bug target/96373] SVE miscompilation on vectorized division loop, leading to FP exception

2020-08-05 Thread matz at gcc dot gnu.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=96373

--- Comment #11 from Michael Matz  ---
(In reply to rguent...@suse.de from comment #9)
> How do we represent sNaNs with -fnon-call-exceptions?  That is,

I think we're currently simply buggy at various stages as soon as sNaNs are
involved _and_ STDC FENV_ACCESS is ON.

>  y_1 = x_2 + 1.;
> 
> may trap.  Does
> 
>  foo (x_2);
> 
> get transformed to
> 
>  tem_3 = x_2;
>  foo (tem_3);
> 
> and the SSA assignment now traps dependent on whether the call
> ABI requires pushing x_2 to a stack slot (which might trap)?

Whether copying an sNaN (to registers or memory) signals an invalid-op
depends on the CPU (and is implementation-defined in IEEE 754).  And GCC
doesn't necessarily preserve such signals even on CPUs that do signal (if
there are any), because it could e.g. use integer stores to transfer the bit
patterns.

> sNaNs are odd anyway I guess.

> But yes, a pure function can still trap (and also throw).
> 
> I think we don't have a good notion for trappingness of calls
> and I do expect inconsistencies here.

__builtin_sqrtf isn't an arbitrary call.  FP operations aren't arbitrary
expressions.

We don't really have the problem of generating calls to arbitrary functions
out of the blue, not even with non-throwing calls.  I'm not sure we should
mix this problem here with that more generic problem.

Btw, that we if-convert calls to __builtin_sqrt is indeed a bug without
special options giving a license for that.  But doing that for the original
testcase instead of the division would _not_ be a problem on SVE: the inactive
lanes are zeroed and that doesn't signal anything for square root.

(We could perhaps extend the meaning of -fno-math-errno to give this license,
i.e. guarantee that the user hasn't enabled stops for any FP exceptions; but
that might be too aggressive).

[Bug target/96373] SVE miscompilation on vectorized division loop, leading to FP exception

2020-08-05 Thread matz at gcc dot gnu.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=96373

--- Comment #10 from Michael Matz  ---
(In reply to Andreas Schwab from comment #5)
> > Just note that _all_ floating point operations, not just divisions, can trap
> > (without fast-math).  You never know if the user enabled stops for any of
> > the FP exceptions (overflow, underflow, inexact, invalid op, div-by-zero). 
> 
> You need #pragma STDC FENV_ACCESS ON for that, otherwise it's undefined
> behaviour.

Sure, that's for operations that occur due to normal abstract machine
evaluations.  The point here is that we introduce operations that aren't in
the original program (for the testcase, on "array elements" after [10]), and
if we do so those must be unobservable.  These pragmas don't give license to
introduce additional faults that can't possibly have happened in the abstract
machine.  (To see this, replace all array elements by 1.0; then the FP
operations are all exact, and we still get SIGFPE.)

[Bug target/96373] SVE miscompilation on vectorized division loop, leading to FP exception

2020-08-04 Thread matz at gcc dot gnu.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=96373

--- Comment #3 from Michael Matz  ---
(In reply to Richard Biener from comment #2)
> which means for non-memory gimple_could_trap_p (stmt) - sth you can
> easily check I guess.

Just note that _all_ floating point operations, not just divisions, can trap
(without fast-math).  You never know if the user enabled stops for any of the
FP exceptions (overflow, underflow, inexact, invalid op, div-by-zero).  So, on
SVE that means all FP operations need to use the loop mask.

(Of course enabling stops for something like inexact/xflow is silly, so punting
on that might be fine)

[Bug target/96373] New: SVE miscompilation on vectorized division loop, leading to FP exception

2020-07-29 Thread matz at gcc dot gnu.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=96373

Bug ID: 96373
   Summary: SVE miscompilation on vectorized division loop,
leading to FP exception
   Product: gcc
   Version: 10.2.1
Status: UNCONFIRMED
  Severity: normal
  Priority: P3
 Component: target
  Assignee: unassigned at gcc dot gnu.org
  Reporter: matz at gcc dot gnu.org
  Target Milestone: ---

I believe gcc-10 miscompiles the following program when SVE and vectorization
are enabled.  You need glibc to show this, or a different way to enable traps
on floating point exceptions:

% cat x.c
#define _GNU_SOURCE
#include <fenv.h>
void __attribute__((noinline, noclone)) div (double *d, double *s, int n)
{
  for (;n; n--, d++, s++)
*d = *d / *s;
}

extern int printf(const char*, ...);

int main()
{
  int i;
  double d[] = {1,2,3,4,5,6,7,8,9,10,11};
  double s[] = {11,10,9,8,7,6,5,4,3,2,1};
  //fesetenv(FE_NOMASK_ENV);
  feenableexcept(FE_DIVBYZERO|FE_INVALID);
  div(d, s, 11);
  for (i = 0; i < 11; i++)
printf(" %f", d[i]);
  printf("\n");
  return 0;
}

% gcc-10 --version
gcc-10 (SUSE Linux) 10.2.1 20200723 [revision
677b80db41f5345b32ce18cd000e45ea39b80d8f]

% gcc-10 -g -march=armv8.2-a -O2 -ftree-vectorize x.c -lm && ./a.out
 0.090909 0.200000 0.333333 0.500000 0.714286 1.000000 1.400000 2.000000
3.000000 5.000000 11.000000

% gcc-10 -g -march=armv8.2-a+sve -O2 -ftree-vectorize  x.c -lm && ./a.out 
Floating point exception (core dumped)

I think the code speaks for itself, excerpt from div():

        whilelo p0.d, wzr, w2
        ptrue   p1.b, all
        .p2align 3,,7
.L4:
        ld1d    z0.d, p0/z, [x0, x3, lsl 3]
        ld1d    z1.d, p0/z, [x1, x3, lsl 3]
        fdiv    z0.d, p1/m, z0.d, z1.d
        st1d    z0.d, p0, [x0, x3, lsl 3]
        incd    x3
        whilelo p0.d, w3, w2
        b.any   .L4

So, it enables all lanes in p1, while the active lanes in the loop are tracked
in p0.  In particular non-active lanes from the load are zeroed.  The
division uses p1 and hence divides all lanes, including those that were zeroed.

Indeed that's what happens when the exception is thrown:

% gdb ./a.out
...
Program received signal SIGFPE, Arithmetic exception.
(gdb) x/i $pc
=> 0x400848 :   fdiv    z0.d, p1/m, z0.d, z1.d
(gdb) p $p1
$1 = {255, 255, 255, 255, 255, 255, 255, 255}
(gdb) p $z1.d.f
$2 = {3, 2, 1, 0, 0, 0, 0, 0}

When traps aren't enabled (the default is disabled) then these zero divisions
simply lead to NaNs in the respective lanes, and as in further instructions
the p0 predicate is used that's of no issue as those are ignored then.

But if traps are enabled this leads to an incorrect FPE trap.

The same behaviour occurs already with gcc-9.  I haven't tested master.

We noticed this within OpenFOAM on SVE-capable hardware, but divisions in
vectorizable contexts should occur reasonably often for this to be a serious
problem.  (Traps on FP exceptions aren't enabled very often, though, so this
bug will often stay hidden.)

[Bug inline-asm/94522] Enhancement request: asm goto with outputs

2020-04-07 Thread matz at gcc dot gnu.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=94522

--- Comment #3 from Michael Matz  ---
See the llvm link of the respective patch.  They specify that the outputs are
reliable only on the fallthrough (i.e. no goto taken) path (in particular the
outputs might or might not have been changed on the jump paths).  That avoids
all
the problems reload and its replacements traditionally have with outputs on
jump insns, so it'd be relatively easy for us to support that as well.

This restriction, if documented, isn't even unreasonable and still allows some
interesting uses of goto asms with outputs.

[Bug testsuite/91626] [9/10 Regression] FAIL: gcc.dg/lto/pr48622 c_lto_pr48622_0.o-c_lto_pr48622_0.o link, -O -flto -finline-small-functions -fno-early-inlining

2020-04-01 Thread matz at gcc dot gnu.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=91626

--- Comment #7 from Michael Matz  ---
I've added it verbatim from PR48622, which itself was an autoreduced testcase
for
an ICE, at the time preventing bootstrapping.  I haven't verified if making the
testcase conforming C (i.e. provide a definition of ashift_qi_1) would
reproduce
the problem, but the missing definition might be crucial.

Maybe we could simply disable this test on hpux?

[Bug c++/92662] change in gcc 8 vs 9: call of overloaded ‘basic_string()’ is ambiguous

2020-01-30 Thread matz at gcc dot gnu.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=92662

--- Comment #8 from Michael Matz  ---
From the GCC perspective, yes.  From the standard-is-surprising perspective,
no, but that probably doesn't belong to the GCC bugzilla.  So, yeah, can be
closed
for gcc 9 (theoretically it's still a bug in gcc 8, but I don't think one worth
fixing).

[Bug middle-end/90348] [8/9/10 Regression] Partition of char arrays is incorrect in some cases

2020-01-23 Thread matz at gcc dot gnu.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=90348

--- Comment #18 from Michael Matz  ---
(In reply to Alexander Monakov from comment #17)
> I think part of the problem is trying to make "deaths" explicit via CLOBBERs
> without making "births" also explicit in the IR.

Yes, that's basically the crux (or rather, that our births (first mention of a
decl name) don't conflict with address-taking or access), and without changing
this we're bound to heap hacks onto hacks to make it all work.  So, ideally
we'd
represent the births of local decls as well (i.e. "allocate" them), represent
all accesses indirectly via pointers, and "free" them at deaths.  Optionally,
very late in the pipeline, we can rewrite alloc/free pairs whose pointers don't
escape into direct accesses to decls again (not sure if that's worthwhile at
all; the 'free/alloc' statements would be virtual anyway, and we would handle
only decls that must lie in memory this way).

With that we'd have to amend some optimizations nevertheless, as those
alloc/free pairs would hinder optimizations (no moving of instructions
accessing 
the locals over them), so they would have to be moved sometimes in order to
allow them (if deemed profitable).

[Bug target/93270] [8/9/10 Regression] DSE removes store incorrectly

2020-01-21 Thread matz at gcc dot gnu.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=93270

Michael Matz  changed:

   What|Removed |Added

 CC||matz at gcc dot gnu.org

--- Comment #5 from Michael Matz  ---
(In reply to Richard Biener from comment #2)
> And pasting from my ml analysis:
> 
> So clearly something is wrong:
> 
>   type  size 
> unit-size 
> align:128 warn_if_not_align:0 symtab:0 alias-set 2
> canonical-type 0x7682d3f0 precision:80
> pointer_to_this >
> 
> and thus GET_MODE_SIZE (XFmode) == 16.  The target cannot possibly
> just store 12 bytes here,
> it lies.  Why's XFmode not 12 bytes but with 8/16 byte alignment?
> Does the ABI say sizeof(long double) is 16?

Of course.  sizeof _must_ be a multiple of alignment, but it can contain
padding.  So the target is completely fine with stating that sizeof is 16,
but then only storing 12 bytes.  (That implies that there are 4 bytes of
padding).  The number of bits that are actually relevant for determining the
value of something is supposed to be in precision, though this use of precision
was never explicitly called out, so there might be cases where precision isn't
reliably usable for this purpose.

> That said, a mode-has-padding or whatever should be reflected in
> TYPE_SIZE & friends as well, inconsistencies
> there just make things worse.

No, I don't see any inconsistencies.

What of course needs to be said is that the original testcase simply is
invalid.  It stores .value and then reads from another union member, that's
not allowed.  (You can play these games if the other member would be a char
array, and GCC tries to allow you to do that also with other types, but as you
noticed this allowance has its limits).

So, I think, nothing needs fixing.  The quality of implementation could be
somewhat improved in one dimension by using precision instead of size, but
only at expense of lowering quality in another dimension (namely of not
removing
the memset).

[Bug target/92821] Miscompilation when passing 8-bit enum to extern function

2019-12-05 Thread matz at gcc dot gnu.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=92821

--- Comment #5 from Michael Matz  ---
Yes, we (intentionally) haven't required any extensions to happen for arguments
or return values smaller than 64bit (e.g. we haven't even specified that
arguments <= 32bit would be zero-extended in the high bits, as would have been
natural with the instruction set).  If LLVM relies on that it would be a bug.

[Bug c++/92662] change in gcc 8 vs 9: call of overloaded ‘basic_string()’ is ambiguous

2019-11-26 Thread matz at gcc dot gnu.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=92662

--- Comment #6 from Michael Matz  ---
(In reply to Jonathan Wakely from comment #5)
> 
> Before choosing which conversion operator to use, the compiler considers the
> constructors of S, finding S(const S&) and S(S&&) as candidates. There is a
> viable conversion from Test&& to const S& and also one from Test&& to S&&.
> Both of those conversion sequences have conversion rank "user-defined"
> (because they use a user-defined conversion operator) but they use different
> conversion operators. Two different user-defined conversions are ambiguous,

Ah, it makes sense now ...

> it doesn't matter that one is a better match for being called on an rvalue.

... except may be this should matter ;-)  But it is as it is.

> The moveme(t).str() case is different, because there's no user-defined
> conversion. Overload resolution chooses which str() to call, and that *does*
> consider the ref-qualifiers when deciding which str() is a better match.

Right overload resolution vs. UCS, okay got it, many thanks :)

[Bug c++/92662] change in gcc 8 vs 9: call of overloaded ‘basic_string()’ is ambiguous

2019-11-26 Thread matz at gcc dot gnu.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=92662

--- Comment #4 from Michael Matz  ---
Even though bugzilla isn't really for educating people, I'd still like to ask
why the two conversion sequences are deemed either incomparable or equal.  In

  S b { moveme(t) };

the return value of moveme() has type Test&&.  There exists a conversion
operator exactly matching that type, and giving S&&, which can be used for
initialization.  Why is the sequence going via 'const S &' as good as that one,
even though it requires a change in refness?

(In a way it's simply irritating that, despite the same signatures of all
involved functions, the compiler treats the two cases differently; in
particular the call to str() in 'S a {moveme(t).str()}' also gives type S&&.)

[Bug c++/92662] change in gcc 8 vs 9: call of overloaded ‘basic_string()’ is ambiguous

2019-11-25 Thread matz at gcc dot gnu.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=92662

--- Comment #1 from Michael Matz  ---
I _think_ a reduced program would be this:

-
template <typename _Tp> struct remove_ref { typedef _Tp type; };
template <typename _Tp> struct remove_ref<_Tp&> { typedef _Tp type; };
template <typename _Tp> struct remove_ref<_Tp&&> { typedef _Tp type; };

template <typename _Tp>
typename remove_ref<_Tp>::type&&
moveme(_Tp&& __t) noexcept;

struct S
{
  S();
};

struct Test
{
  S const& str() const&;
  S && str() &&;

  operator S const&() const&;
  operator S &&() &&;
};

int main()
{
  Test t;
  S a { moveme(t).str() };
  S b { moveme(t) };
  return 0;
}
-

At least it gives the same behaviour and captures the structure of the involved
moveme() functions.

% gcc-8 -c xreduced.cc
% gcc-9 -c xreduced.cc
x.cc: In function ‘int main()’:
x.cc:27:19: error: call of overloaded ‘S()’ is
ambiguous
   27 |   S b { moveme(t) };
  |   ^
x.cc:9:8: note: candidate: ‘constexpr S::S(const S&)’
9 | struct S
  |^
x.cc:9:8: note: candidate: ‘constexpr S::S(S&&)’

[Bug c++/92662] New: change in gcc 8 vs 9: call of overloaded ‘basic_string()’ is ambiguous

2019-11-25 Thread matz at gcc dot gnu.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=92662

Bug ID: 92662
   Summary: change in gcc 8 vs 9: call of overloaded
‘basic_string()’ is
ambiguous
   Product: gcc
   Version: 9.2.1
Status: UNCONFIRMED
  Severity: normal
  Priority: P3
 Component: c++
  Assignee: unassigned at gcc dot gnu.org
  Reporter: matz at gcc dot gnu.org
  Target Milestone: ---

A user of ours noted a difference in behaviour between gcc8 and gcc9 regarding
braced initializers.  Take this example:

-
#include <string>

// allow to steal _text from rvalues
struct Test
{
  std::string const& str() const& { return _text; }
  std::string && str() && { return std::move(_text); }

  operator std::string const&() const& { return _text; }
  operator std::string &&() && { return std::move(_text); }

  std::string _text;
};

int main()
{
  Test t;
  std::string a { std::move(t).str() };   // 1
  std::string b { std::move(t) }; // 2
  return 0;
}
-

gcc 8 accepts the program (and str() && is chosen in line 1), whereas gcc 9
only accepts line 1 but not line 2 due to:

xorig.cc: In function ‘int main()’:
xorig.cc:19:32: error: call of overloaded ‘basic_string()’ is ambiguous
   19 |   std::string b { std::move(t) };
  |^
In file included from /usr/include/c++/9/string:55,
 from xorig.cc:1:
/usr/include/c++/9/bits/basic_string.h:3628:7: note: candidate:
‘std::basic_string<_CharT, _Traits,
_Alloc>::basic_string(std::basic_string<_CharT, _Traits, _Alloc>&&) [with
_CharT = char; _Traits = std::char_traits<char>; _Alloc =
std::allocator<char>]’
 3628 |   basic_string(basic_string&& __str)
  |   ^~~~
/usr/include/c++/9/bits/basic_string.h:3564:7: note: candidate:
‘std::basic_string<_CharT, _Traits, _Alloc>::basic_string(const
std::basic_string<_CharT, _Traits, _Alloc>&) [with _CharT = char; _Traits =
std::char_traits<char>; _Alloc = std::allocator<char>]’
 3564 |   basic_string(const basic_string& __str);
  |   ^~~~


One difference is the temporary object
used for calling .str() in line 1.  But I still have difficulty seeing
why there's a difference in ambiguities.

So, who's right (8 or 9), and due to which reasons.  (Depending on that this
is either an error in gcc 8 or 9, so I'm not marking it yet).

[Bug middle-end/90796] [8/9 Regression] GCC: O2 vs O3 output differs on simple test

2019-11-20 Thread matz at gcc dot gnu.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=90796

--- Comment #17 from Michael Matz  ---
Author: matz
Date: Wed Nov 20 16:51:10 2019
New Revision: 278512

URL: https://gcc.gnu.org/viewcvs?rev=278512&root=gcc&view=rev
Log:
Fix PR90796

   PR middle-end/90796
   * gimple-loop-jam.c (any_access_function_variant_p): New function.
   (adjust_unroll_factor): Use it to constrain safety, new parameter.
   (tree_loop_unroll_and_jam): Adjust call and profitable unroll factor.

testsuite/
   Backport from mainline
   PR middle-end/90796
   * gcc.dg/unroll-and-jam.c: Disable loop-invariant motion and adjust.

   PR middle-end/90796
   * gcc.dg/unroll-and-jam.c: Add three invalid and one valid case.

Modified:
branches/gcc-9-branch/gcc/ChangeLog
branches/gcc-9-branch/gcc/gimple-loop-jam.c
branches/gcc-9-branch/gcc/testsuite/ChangeLog
branches/gcc-9-branch/gcc/testsuite/gcc.dg/unroll-and-jam.c

[Bug middle-end/90796] [8/9 Regression] GCC: O2 vs O3 output differs on simple test

2019-11-20 Thread matz at gcc dot gnu.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=90796

--- Comment #16 from Michael Matz  ---
(In reply to Jakub Jelinek from comment #14)
> Time to backport now?

Hmpf, I've actually done the regstrapping for gcc9 already but then forgot to
submit.  Thanks for the reminder.

[Bug middle-end/90796] [8/9 Regression] GCC: O2 vs O3 output differs on simple test

2019-10-22 Thread matz at gcc dot gnu.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=90796

Michael Matz  changed:

   What|Removed |Added

Summary|[8/9/10 Regression] GCC: O2 |[8/9 Regression] GCC: O2 vs
   |vs O3 output differs on |O3 output differs on simple
   |simple test |test
  Known to fail|10.0|

--- Comment #13 from Michael Matz  ---
Fixed in trunk so far.  Will be backporting in a few days.

[Bug middle-end/90796] [8/9/10 Regression] GCC: O2 vs O3 output differs on simple test

2019-10-22 Thread matz at gcc dot gnu.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=90796

--- Comment #12 from Michael Matz  ---
Author: matz
Date: Tue Oct 22 12:25:03 2019
New Revision: 277287

URL: https://gcc.gnu.org/viewcvs?rev=277287&root=gcc&view=rev
Log:
Fix PR middle-end/90796

PR middle-end/90796
* gimple-loop-jam.c (any_access_function_variant_p): New function.
(adjust_unroll_factor): Use it to constrain safety, new parameter.
(tree_loop_unroll_and_jam): Adjust call and profitable unroll factor.

testsuite/
* gcc.dg/unroll-and-jam.c: Add three invalid and one valid case.

Modified:
trunk/gcc/ChangeLog
trunk/gcc/gimple-loop-jam.c
trunk/gcc/testsuite/ChangeLog
trunk/gcc/testsuite/gcc.dg/unroll-and-jam.c

[Bug rtl-optimization/91898] [optimization] switch statements for state machines could be better

2019-09-25 Thread matz at gcc dot gnu.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=91898

--- Comment #3 from Michael Matz  ---
For purposes of discussion, let's make a full example:

% cat ex.c
int get(int);
int foo (int a, int *ary)
{
  //void *labelsy[] = {&, &, &, &_end};
  //int ary[] = {0, 1, 2, 3};
  int i = 0;
  int ret = 0;
  for (;;)
switch (ary[i++]) {
case 0:
ret += get (1);
break;
case 1:
ret += get (2);
break;
case 2:
ret += get (3);
break;
case 3:
goto after_loop;
case 4:
ret += get (4);
break;
case 34:
ret += get (5);
break;
case 44:
ret += get (6);
break;
case 46:
ret += get (7);
break;
case 444:
ret += get (8);
break;
case 124:
ret += get (9);
break;
case 4321:
ret += get (10);
break;
}
after_loop:
  return ret;
}

On at least x86-64 (with -O2) this is compiled to a table jump variant
for the switch.  What Nathan wants, I think (please correct), is that the
table jump sequence (in particular the indirect jump) be duplicated.

[Bug rtl-optimization/91898] [optimization] switch statements for state machines could be better

2019-09-25 Thread matz at gcc dot gnu.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=91898

Michael Matz  changed:

   What|Removed |Added

 CC||matz at gcc dot gnu.org

--- Comment #2 from Michael Matz  ---
In particular, you don't want to duplicate the indirect jump before
pass_duplicate_computed_gotos, as otherwise you'd get all the O(n^2) complexity
problems that the factorization of computed gotos is designed to prevent.

I.e. yes: RTL side.

[Bug middle-end/91358] Wrong code with dynamic allocation and optional like class

2019-08-08 Thread matz at gcc dot gnu.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=91358

--- Comment #7 from Michael Matz  ---
(In reply to Antony Polukhin from comment #6)
> (In reply to Michael Matz from comment #3)
> > I don't really see any, no good idea here :-/
> 
> How about moving all the optimizations based on reading uninitialized values
> under a flag like -funinitialized-logic, so that users could build with -O2
> -fno-uninitialized-logic ?

That can't work in general.  How would you propose that GCC automagically
detects that in:

struct S {int a, b;};
int foo (struct S *s) { return s->a ? s->b : 0; }

the read of ->b is uninitialized?  After all, it might have been initialized
by the caller or not.

[Bug middle-end/90796] [8/9/10 Regression] GCC: O2 vs O3 output differs on simple test

2019-08-07 Thread matz at gcc dot gnu.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=90796

--- Comment #11 from Michael Matz  ---
(In reply to rguent...@suse.de from comment #10)
> >It's the only affine functions that don't progress with each iteration.
> > I
> >think, at least :)
> 
> Hm. At least we analyze wrapping ones, but I guess 0, 1, 0, 1 would be
> caught in another way..

Yes, we analyze them, but for nothing.  They aren't affine either, and hence
result in unknown dependences.

[Bug middle-end/90796] [8/9/10 Regression] GCC: O2 vs O3 output differs on simple test

2019-08-06 Thread matz at gcc dot gnu.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=90796

--- Comment #9 from Michael Matz  ---
(In reply to rguent...@suse.de from comment #8)
> >The fun thing is, there's a difference between these two loop nests:
> >
> >   for (i) for (j) a[i][0] = f(a[i+1][0]);
> >   for (i) for (j) b[i][j] = f(a[i+1][j]);
> 
> What about
> 
>   B[i][j/2]...
> 
> ?

That would be a problem as well, but luckily that's not an affine function of
j,
and hence has no analyzable access function, and so isn't fused for different
reasons.

> It's really surprising that only invariants are special here.

It's the only affine functions that don't progress with each iteration.  I
think, at least :)

[Bug middle-end/91358] Wrong code with dynamic allocation and optional like class

2019-08-06 Thread matz at gcc dot gnu.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=91358

--- Comment #3 from Michael Matz  ---
(In reply to Antony Polukhin from comment #2)
> (In reply to Michael Matz from comment #1)
> Valgrind complains are distracting. GDB entering the destructor is
> missleading. Is there a simple way to change the GCC codegen to avoid the
> issue and not affect performance?

The only way I see would be to emit the jumpy sequence that RTL generates
(sometimes) from "if (x & y)" in the opposite order. There's no real reason
within the intermediate form to prefer one or the other, but it might help in
practice.  (The problem will be that there's also no reason that would prevent
GCC from transforming "x & y" into "y & x", e.g. for canonicalization, and then
the other order would create the problem). 

> Otherwise, is there some kind of a pattern that valgrind/gdb could detect to
> avoid false positives?

I don't really see any, no good idea here :-/

[Bug middle-end/91358] Wrong code with dynamic allocation and optional like class

2019-08-05 Thread matz at gcc dot gnu.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=91358

Michael Matz  changed:

   What|Removed |Added

 CC||matz at gcc dot gnu.org

--- Comment #1 from Michael Matz  ---
No wrong code involved here.  The uninitialized register that valgrind
complains
about is only uninitialized in exactly the case when %ebp (aka .m_initialized)
is zero.  So the insn sequence

=> 0x004007af <+31>:test   %rbx,%rbx
   0x004007b2 <+34>:je 0x4007b9 )+41>
   0x004007b4 <+36>:test   %bpl,%bpl
   0x004007b7 <+39>:jne0x4007c0 )+48>
   0x004007b9 <+41>:add$0x8,%rsp

goes to 0x4007b9 no matter what.  (if %rbx==uninit happens to be zero, then
directly, otherwise via the not-taken jump at 0x4007b7, because %ebp is zero).

The read from the uninitialized memory itself (from .ptr) is harmless as well,
because the memory backing that access must be there, as the structure is
large enough.

So, if you've seen a real problem somewhere (and not just valgrind complaining
about uninitialized registers in comparisons), then you've reduced the testcase
too much.  (The abort() in the testcase leads me to think that this was once
a larger testcase where the abort was triggered unexpectedly.  I'll note that
it isn't triggered with this testcase.)

[Bug tree-optimization/91240] [8/9/10 Regression] Wrong code with -O3 due to unroll and jam pass

2019-08-05 Thread matz at gcc dot gnu.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=91240

--- Comment #3 from Michael Matz  ---
Also fixed by the patch at PR90796.

[Bug middle-end/90796] [8/9/10 Regression] GCC: O2 vs O3 output differs on simple test

2019-08-05 Thread matz at gcc dot gnu.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=90796

--- Comment #7 from Michael Matz  ---
Created attachment 46675
  --> https://gcc.gnu.org/bugzilla/attachment.cgi?id=46675&action=edit
potential patch

Actually I was barking up the wrong tree.  It's not as easy as the CFG
manipulation for loop fusion going wrong (like missing some last iterations
or so).  It's really a problem in the dependence analysis.  See the extensive
comment in the patch.

The fun thing is, there's a difference between these two loop nests:

   for (i) for (j) a[i][0] = f(a[i+1][0]);
   for (i) for (j) b[i][j] = f(a[i+1][j]);

Even though the distance vector for the read/write in the single statement
is (-1,0) for both loops, unroll-and-jam is valid for the second but not
for the first.

[Bug tree-optimization/91240] [8/9/10 Regression] Wrong code with -O3 due to unroll and jam pass

2019-08-05 Thread matz at gcc dot gnu.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=91240

Michael Matz  changed:

   What|Removed |Added

 Status|NEW |ASSIGNED
   Assignee|unassigned at gcc dot gnu.org  |matz at gcc dot gnu.org

--- Comment #2 from Michael Matz  ---
Mine.

[Bug middle-end/90796] [8/9/10 Regression] GCC: O2 vs O3 output differs on simple test

2019-06-12 Thread matz at gcc dot gnu.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=90796

Michael Matz  changed:

   What|Removed |Added

 Status|NEW |ASSIGNED
   Assignee|unassigned at gcc dot gnu.org  |matz at gcc dot gnu.org

--- Comment #6 from Michael Matz  ---
I think I know what's going on.  Mine.

[Bug middle-end/90796] [8/9/10 Regression] GCC: O2 vs O3 output differs on simple test

2019-06-11 Thread matz at gcc dot gnu.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=90796

--- Comment #5 from Michael Matz  ---
FWIW, the reduced testcase from comment #3 is wrong.  Even with -O0 or with gcc
4.3 or 6 I get:

b:48
Aborted (core dumped)


But I can reproduce the problem with the original testcase.

[Bug middle-end/90348] [7/8/9/10 Regression] Partition of char arrays is incorrect in some cases

2019-05-07 Thread matz at gcc dot gnu.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=90348

--- Comment #12 from Michael Matz  ---
(In reply to Jakub Jelinek from comment #11)
> before that region.  If we can say for:
>   for (...)
> {
>   unsigned char v[10];
>   unsigned char *p = foo (v);
>   *p = 1;
>   unsigned char w[10];
>   bar (w);
> }
> hoist the p = foo (v); call before the loop, then indeed we are in big
> trouble.

This is effectively what the testcase is doing (just that 'foo' is not a call,
but
a normal address expression), so yes, we can do that, and yes we are in big
trouble :-/

> If I have to consider pt->anything and pt->escaped, then it will be as
> useless for the variable conflicts as is removing the important clearing of
> the bitmap bit on clobber stmt, we won't share stack slots pretty much at
> all.

Yeah; if we don't want to patch the specific situation for this testcase
(which might be okayish, we haven't seen this problem very often over the
last years), but want to really fix it we might have to take more involved
means like doing stack slot sharing before gimplification and rewriting
the IL to reflect this.  Or give up on sharing (which isn't a good idea).
Gnah.

[Bug middle-end/90348] [7/8/9/10 Regression] Partition of char arrays is incorrect in some cases

2019-05-06 Thread matz at gcc dot gnu.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=90348

--- Comment #7 from Michael Matz  ---
No, this is not a problem in the stack slot sharing algorithm, but rather in
the input.  As presented to expand, and only showing the important parts,
and reordering some BBs to make the flow more obvious:

;;   basic block 2, loop depth 0
;;pred:   ENTRY
  _30 = (unsigned long) 
  ivtmp.27_29 = _30 + 1;
  goto ; [100.00%]

So, 'in' becomes "live" here, its address in _30 and _29.  Fallthrough to bb5,
which also uses in, but otherwise is uninteresting, except that it can reach
only BBs 6 and 7:

;;   basic block 5, loop depth 1
  ...
  _2 = check_zero (, _31);
  if (_2 != 0)
goto ; [99.96%]
  else
goto ; [0.04%]

bb6 is a no-return block, hence uninteresting.  bb7 _is_ interesting in that
it clobbers in:

;;   basic block 7, loop depth 1
;;pred:   5
  in ={v} {CLOBBER};
  if (i_11 != 5)
goto ; [83.33%]
  else
goto ; [16.67%]

Note that the semantics of the clobber is not only that the former contents
are destroyed, but rather that the very storage associated with the clobbered
entity is gone.  So, from now on, any pointers into 'in', and memory accesses
into 'in' are invalid.  Nevertheless the flow from bb7 goes to bb 8 and 9,
the latter being the return block, so:

;;   basic block 8, loop depth 1
;;pred:   7
  if (i_11 > 0)
goto ; [100.00%]
  else
goto ; [0.00%]

and here we finally have a path into bb3, which is the other interesting one:

;;   basic block 3, loop depth 2
  # ivtmp.20_6 = PHI <_30(8), ivtmp.20_18(3)>

 BOOM!  Here _30 is used, and ...

  _4 = (void *) ivtmp.20_6;
  MEM[base: _4, offset: 0B] = 0;

... even written into ...  That's invalid.  _30 is associated with an
object that is clobbered and gone ...

  set_one ();
  buf ={v} {CLOBBER};
  ivtmp.20_18 = ivtmp.20_6 + 1;

... and as the MEM[] write can't have possibly been into 'in' (as that is
invalid, as 'in' didn't exist at the MEM access), it's okay and sensible to
allocate 'in' and 'buf' into the same memory.

It seems to be a common misconception of what the clobber is really about.
I designed it to mean what I wrote above, the storage associated with the
clobbered entities is gone after the clobber (not merely its former
contents!). 

But ivopts or dom extends the lifetime of 'in' (by moving an address-taken
earlier) and hence the lifetime of its storage, but without doing anything
about the clobber (and hence not _really_ extending the lifetime).  That
doesn't work.
It's basically a mirrored transformation of the usual take-address-of-local
and access it out of its declared scope, just that here the _start_ of the
supposed lifetime is moved out of the declared scope, not the end.

[Bug target/87561] [9 Regression] 416.gamess is slower by ~10% starting from r264866 with -Ofast

2019-03-12 Thread matz at gcc dot gnu.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=87561

--- Comment #9 from Michael Matz  ---
(In reply to Richard Biener from comment #8)
> 
> I'm out of ideas suitable for GCC 9 (besides reverting the patch, reverting
> to bogus state).

Either that or some hack (e.g. artificially avoiding vectorization if runtime
checks are necessary and the loop-nest isn't a box but a pyramid).  Whatever
we do, it's better to release GCC with internal bogus state than to release
GCC with a known 10% performance regression (you could revert only on the
release branch so that the regression stays in trunk).

[Bug target/89545] ABI clarification for over-aligned type stack passing

2019-03-01 Thread matz at gcc dot gnu.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=89545

--- Comment #12 from Michael Matz  ---
(In reply to H.J. Lu from comment #11)
> (In reply to Michael Matz from comment #10)
> > Ah, I missed that.  Yeah, I'd like to be co-owner.
> 
> Please send me your gitlab account name.

Err, right, that probably helps ;-)  It's 'susematz'

[Bug target/89545] ABI clarification for over-aligned type stack passing

2019-03-01 Thread matz at gcc dot gnu.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=89545

--- Comment #10 from Michael Matz  ---
Ah, I missed that.  Yeah, I'd like to be co-owner.

[Bug target/89545] ABI clarification for over-aligned type stack passing

2019-03-01 Thread matz at gcc dot gnu.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=89545

--- Comment #7 from Michael Matz  ---
What about this variant of the second part?

diff --git a/x86-64-ABI/low-level-sys-info.tex
b/x86-64-ABI/low-level-sys-info.tex
index 66270b9..93b5e95 100644
--- a/x86-64-ABI/low-level-sys-info.tex
+++ b/x86-64-ABI/low-level-sys-info.tex
@@ -517,7 +517,9 @@ Once arguments are classified, the registers get assigned
(in
 left-to-right order) for passing as follows:

 \begin{enumerate}
-\item If the class is MEMORY, pass the argument on the stack.
+\item If the class is MEMORY, pass the argument on the stack at an
+  address respecting the argument's alignment (which might be more
+  than its natural alignment).

 \item If the class is INTEGER, the next available register of the
   sequence \RDI, \RSI, \RDX, \RCX, \reg{r8} and \reg{r9} is

[Bug target/89545] ABI clarification for over-aligned type stack passing

2019-03-01 Thread matz at gcc dot gnu.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=89545

--- Comment #2 from Michael Matz  ---
I think we should say something about the addresses of stack slots individual
overaligned arguments as well (i.e. that the slot itself will also be aligned
accordingly), not just for the overall effect.

[Bug tree-optimization/88767] 'unroll and jam' not optimizing some loops

2019-01-09 Thread matz at gcc dot gnu.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=88767

--- Comment #3 from Michael Matz  ---
I don't see anything to improve either (as far as unroll-and-jam is concerned).
It's quite possible that cunrolli is harming more than helping in this case,
but with it disabled it seems the code is as it should be.

So, please state what you want to see changed: unroll-and-jam or cunrolli?

[Bug middle-end/86575] [7/8 Regression] -Wimplicit-fallthrough affects code generation

2018-11-20 Thread matz at gcc dot gnu.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=86575

--- Comment #7 from Michael Matz  ---
As I stated, it's only fixed in trunk, so it's still a regression in 7 and 8,
as marked in the summary.

[Bug tree-optimization/31130] [7/8 Regression] VRP no longer derives range for division after negation

2018-11-19 Thread matz at gcc dot gnu.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=31130

Michael Matz  changed:

   What|Removed |Added

 CC||matz at gcc dot gnu.org
Summary|[7/8/9 Regression] VRP no   |[7/8 Regression] VRP no
   |longer derives range for|longer derives range for
   |division after negation |division after negation

--- Comment #30 from Michael Matz  ---
As far as I can tell this is fixed in gcc-9 (and should be in gcc-8 as well).
Certainly all testcases herein are optimized as expected.

[Bug middle-end/86575] [7/8 Regression] -Wimplicit-fallthrough affects code generation

2018-11-14 Thread matz at gcc dot gnu.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=86575

Michael Matz  changed:

   What|Removed |Added

Summary|[7/8/9 Regression]  |[7/8 Regression]
   |-Wimplicit-fallthrough  |-Wimplicit-fallthrough
   |affects code generation |affects code generation

--- Comment #5 from Michael Matz  ---
Fixed in trunk.  Not planning backporting, it's not a very important problem.

[Bug middle-end/86575] [7/8/9 Regression] -Wimplicit-fallthrough affects code generation

2018-11-14 Thread matz at gcc dot gnu.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=86575

--- Comment #4 from Michael Matz  ---
Author: matz
Date: Wed Nov 14 15:43:54 2018
New Revision: 266148

URL: https://gcc.gnu.org/viewcvs?rev=266148&root=gcc&view=rev
Log:
Fix PR middle-end/86575

PR middle-end/86575
* gimplify.c (collect_fallthrough_labels): Add new argument,
return location via that, don't modify statements.
(warn_implicit_fallthrough_r): Adjust call, don't use
statement location directly.

Modified:
trunk/gcc/ChangeLog
trunk/gcc/gimplify.c

[Bug demangler/87675] Stack Overflow in function next_is_type_qual() in cp-demangle.c, as demonstrated by "nm -C"

2018-10-29 Thread matz at gcc dot gnu.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=87675

Michael Matz  changed:

   What|Removed |Added

 CC||matz at gcc dot gnu.org

--- Comment #1 from Michael Matz  ---
One of usual fuzzer fake CVEs.

This is basically a similar "problem" like initially reported in
  https://sourceware.org/bugzilla/show_bug.cgi?id=23008
where I actually analyzed it.  The problem is that C++ mangled names
have a recursive structure.  For demonstration purposes let's assume that the
character 'F' in a mangled name means "here come nested template arguments,
described next", then you need to recurse down to decode those nested args,
and if the next character is 'F' as well, you just recurse down again.  So
a mangled "name" with a million 'F' characters in succession will need
a recursion depth of a million.

So, when you feed the demangler such a name a stack overflow is expected.
Exactly when the overflow occurs depends on how the demangler is compiled,
i.e. how much stack space is needed from one to the next recursion level
(sometimes the recursion is tail recursion, so in some compilation modes
can even be elided and so lead to non-exhaustion).

Many characters of the mangled names have this property, so there are multiple
variants of names that all lead to stack exhaustion, so the fuzzers were able
to create many different testcases:

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=85122 (aka bugzilla PR23008)
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=85452
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=87335
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=87636
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=87675
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=87681

Unfortunately they now also started to submit fake CVEs for these, like this
one (CVE-2018-18701) or CVE-2018-18700 (aka bug 87681).

If libiberty ever implements a check for this (which essentially can only be an
arbitrary limit, which is frowned upon, especially as it must be very small, as
people might have their stack limit set very low) fine, if not, also fine.
Until then feeding such names to any demangling tool leads to stack exhaustion
and hence segfault.  Like any other memory exhaustion not a security bug.

[Bug tree-optimization/87074] [8/9 Regression] Unroll and jam bug: O3 result differ from O2

2018-09-01 Thread matz at gcc dot gnu.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=87074

Michael Matz  changed:

   What|Removed |Added

   Keywords|needs-bisection |
 Status|NEW |RESOLVED
 Resolution|--- |FIXED
   Assignee|unassigned at gcc dot gnu.org  |matz at gcc dot gnu.org

--- Comment #7 from Michael Matz  ---
Fixed in trunk and gcc-8-branch

[Bug tree-optimization/87074] [8/9 Regression] Unroll and jam bug: O3 result differ from O2

2018-09-01 Thread matz at gcc dot gnu.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=87074

--- Comment #6 from Michael Matz  ---
Author: matz
Date: Sat Sep  1 17:33:45 2018
New Revision: 264030

URL: https://gcc.gnu.org/viewcvs?rev=264030&root=gcc&view=rev
Log:
Fix PR87074

Backport from mainline
PR tree-optimization/87074
* gimple-loop-jam.c (unroll_jam_possible_p): Check loop exit
PHIs for outer-loop uses.

testsuite/
* gcc.dg/pr87074.c: New test.

Added:
branches/gcc-8-branch/gcc/testsuite/gcc.dg/pr87074.c
Modified:
branches/gcc-8-branch/gcc/ChangeLog
branches/gcc-8-branch/gcc/gimple-loop-jam.c
branches/gcc-8-branch/gcc/testsuite/ChangeLog

[Bug tree-optimization/87074] [8/9 Regression] Unroll and jam bug: O3 result differ from O2

2018-09-01 Thread matz at gcc dot gnu.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=87074

--- Comment #5 from Michael Matz  ---
Author: matz
Date: Sat Sep  1 17:22:05 2018
New Revision: 264029

URL: https://gcc.gnu.org/viewcvs?rev=264029&root=gcc&view=rev
Log:
Fix PR87074

PR tree-optimization/87074
* gimple-loop-jam.c (unroll_jam_possible_p): Check loop exit
PHIs for outer-loop uses.

testsuite/
* gcc.dg/pr87074.c: New test.

Added:
trunk/gcc/testsuite/gcc.dg/pr87074.c
Modified:
trunk/gcc/ChangeLog
trunk/gcc/gimple-loop-jam.c
trunk/gcc/testsuite/ChangeLog

[Bug target/86973] [6/7/8/9 Regression] ICE in expand_call, at calls.c:4217

2018-08-24 Thread matz at gcc dot gnu.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=86973

--- Comment #4 from Michael Matz  ---
FWIW, the testcase is broken since it can be compiled, namely since
the two attributes ms_abi and sysv_abi are accepted, which is r137525 from
2008.  Only broken with -mno-accumulate-outgoing-args of course.

So, I guess we can safely say that this is no regression in all sensible
definitions of the word.

[Bug target/86973] [6/7/8/9 Regression] ICE in expand_call, at calls.c:4217

2018-08-22 Thread matz at gcc dot gnu.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=86973

--- Comment #3 from Michael Matz  ---
A testcase that doesn't need -mabi cmdline args:

extern __attribute__((ms_abi)) void foo(void);
__attribute__((sysv_abi))
void a(__attribute__((__vector_size__(8 * sizeof(double)))) double b){
foo();
}

For very old GCCs (starting with r206947 from 2014) we actually get a sorry,
and
then the known ICE:

pr86973.i: In function ‘a’:
pr86973.i:4:8: sorry, unimplemented: ms_abi attribute requires
-maccumulate-outgoing-args or subtarget optimization implying it
 foo();
^
pr86973.i:4:8: internal compiler error: in expand_call, at calls.c:3471
0x6ed0b1 expand_call(tree_node*, rtx_def*, int)
../../gcc/gcc/calls.c:3469

Indeed adding -maccumulate-outgoing-args removes the error.  That revision
simply changed the default for that option.

We lost the sorry with r207677, but retained the ICE.  Since then we ICE.

[Bug target/86973] [6/7/8/9 Regression] ICE in expand_call, at calls.c:4217

2018-08-22 Thread matz at gcc dot gnu.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=86973

--- Comment #1 from Michael Matz  ---
I can reproduce the same error without any va-args:

% cat bug.i
extern void foo(void *);
__attribute__((sysv_abi))
void a(__attribute__((__vector_size__(8 * sizeof(double)))) double b){
foo(0);
}
% ./cc1 -mabi=ms -mavx512f pr86973.i
pr86973.i:4:5: internal compiler error: in expand_call, at calls.c:4218
4 | foo(0);

So, are you absolutely sure that r222173 (which delayed expansion of va_arg
until pass_stdarg) is the one that introduced this error?  As far as I can
determine this is some interaction between backend and call setup going
wrong, which specifically is about a sysv_abi function (with large param)
calling a ms_abi function.

[Bug demangler/85304] Segmentation fault

2018-04-16 Thread matz at gcc dot gnu.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=85304

Michael Matz  changed:

   What|Removed |Added

 CC||matz at gcc dot gnu.org

--- Comment #1 from Michael Matz  ---
Similar to https://sourceware.org/bugzilla/show_bug.cgi?id=23008,
the testcase contains a mangled name with roughly 29000 successive 'E'
characters.  Processing one 'E' character involves calling these three
routines:

#5  0x004e8901 in demangle_expression (work=0x7fffd810,
mangled=0x7fffd710, 
s=0x7fffd540, tk=tk_integral) at ../../libiberty/cplus-dem.c:1895
1895  success = demangle_template_value_parm (work, mangled, s, tk);
(gdb) 
#4  0x004e98cb in demangle_template_value_parm (work=0x7fffd810,
mangled=0x7fffd710, 
s=0x7fffd540, tk=tk_integral) at ../../libiberty/cplus-dem.c:2069
2069success = demangle_integral_value (work, mangled, s);
(gdb) 
#3  0x004e8b82 in demangle_integral_value (work=0x7fffd810,
mangled=0x7fffd710, 
s=0x7fffd540) at ../../libiberty/cplus-dem.c:1916
1916success = demangle_expression (work, mangled, s, tk_integral);

That advances *mangled by one character and uses 496 bytes of stack while
doing that (when compiled by gcc-6 with address sanitizer).  The linux default
stack of 8 MB is good for 16893 of the E characters until stack overflow
occurs.
Without sanitizer we need less stack per recursion level, so that the testcase
doesn't cause a problem (but just increasing the number of 'E' characters will
make it segfault there as well).

It seems all is working as designed: you request it to demangle a recursive
structure of > 2 levels deep and get what can be expected from that, a stack
overflow.

[Bug target/84829] -mieee-fp causes to link with -lieee but that is no longer available

2018-03-12 Thread matz at gcc dot gnu.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=84829

--- Comment #9 from Michael Matz  ---
(In reply to Carlos O'Donell from comment #8)
> > It matters from the users' point of view.  I think it's better to give them
> > a build error when they use unsupported functionality, rather than giving
> > the wrong results at run time.  The overloaded nature of -mieee-fp makes
> > this difficult to achieve, though.
> 
> I agree.

Except, of course, that it isn't unsupported functionality.  On the contrary,
it now is the default in glibc.

Even worse: the non-libm aspect of -mieee-fp is also the default on linux.  So
giving -mieee-fp is a no-op on new glibc, while it is not one on old glibc
(adding libieee.a).  I don't see how the current link error can be avoided
in a sensible way from within gcc (sensible includes: "works with old and new
glibc").

[Bug target/84829] -mieee-fp causes to link with -lieee but that is no longer available

2018-03-12 Thread matz at gcc dot gnu.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=84829

--- Comment #3 from Michael Matz  ---
(In reply to Jakub Jelinek from comment #2)
> I'd say the right thing is to keep libieee.a around, even if it will be an
> empty ar archive.

Agreed.

[Bug target/84829] -mieee-fp causes to link with -lieee but that is no longer available

2018-03-12 Thread matz at gcc dot gnu.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=84829

Michael Matz  changed:

   What|Removed |Added

 CC||matz at gcc dot gnu.org

--- Comment #1 from Michael Matz  ---
Note that the old libieee.a (which was only an object) set the global variable
_LIB_VERSION to _IEEE_.  That variable is (or was) checked by various libm
routines to conditionalize behaviour in error cases, which is the purpose of
-mieee-fp.  So that's what warranted the inclusion of -lieee in the past.

The change removing libieee from glibc also removed all handling of that global
variable, so the right thing to do for gcc is to not link against it if it
doesn't exist.  Which is of course a problem, as then you can't update glibc
without having to rebuild gcc :-/

[Bug tree-optimization/37239] peeling last iteration of a <= loop

2018-02-16 Thread matz at gcc dot gnu.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=37239

--- Comment #8 from Michael Matz  ---
(In reply to Richard Biener from comment #7)
> While hmmer is now split the testcase in this bug isn't.  We do find some
> split points but appearantly never split the loop.

This is not a case for normal loop splitting.  The loop in question has
multiple exits.  The potential split point that is found is in the loop in
question,
but relative to the outer loop:

for (qty--; i > 0; i--) siftDown(numbers, i, qty);

which inline-expanded looks somewhat like:

  for (qty--; i > 0; i--) {
    top = i, maxIdx = i, last = qty;
    while (last >= (maxIdx += maxIdx)) {

      /* This is where the comparison occurs and where a sufficiently
         good compiler can use a computed conditional result rather
         than using control logic. */
      if (maxIdx != last && numbers[maxIdx] < numbers[maxIdx + 1]) maxIdx++;

      if (tmp >= numbers[maxIdx]) break;
      numbers[top] = numbers[maxIdx];
      top = maxIdx;
    }
  }

The potential split point is the pre-header check if the inner loop is to
be entered at all (i.e. "last >= 2*maxIdx").  The outer loop isn't split
because i and 2*maxIdx (aka 2*i) don't have the same steps.

Of course even if we'd split the outer loop we wouldn't gain anything here.

Also note that the wanted transformation isn't exactly peeling the last
iteration.  The last iteration must only be entered if the non-normal exit
isn't taken.  Dealing with all that is currently not implemented.

[Bug tree-optimization/84111] [8 Regression] Compile time hog w/ -O2

2018-01-30 Thread matz at gcc dot gnu.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=84111

Michael Matz  changed:

   What|Removed |Added

 Status|NEW |RESOLVED
 Resolution|--- |FIXED

--- Comment #9 from Michael Matz  ---
(In reply to Michael Matz from comment #7)
> (It might of course be the case that we can't easily avoid creating that
> situation, in that case we have to change follow_copies_to_constant()
> or friends.  But I don't think we have established that yet).

So, explanation for future reference:
  https://gcc.gnu.org/ml/gcc-patches/2018-01/msg02322.html
