This patch attempts to add __builtin_operator_new/delete. So far they
are not optimized, which will need to be done by an extra flag of BUILT_IN_
code. Also the decl.cc code can be refactored to be less of cut
and I guess the has_builtin hack to return the proper value needs to be moved
to the C++ FE.
However
There is still a problem with loop bounds. I am testing a patch for that and
then we should (finally) be safe.
> Note GCC has not retuned its -Os heuristics for a long time because it has been
> decent enough for most folks and corner cases like this almost never come
> up.
There were quite a few changes to -Os heuristics :)
One of the bigger challenges is that we do see more and more C++ code built
with -Os
Looking at the prototype patch, why do the splitters need to change as well?
My original goal was to use splitters to expand to faster code sequences
while having patterns necessary for both variants. This makes it
possible to use optimize_insn_for_size/speed and make decisions using BB
profile, since
> > I guess PTA gets around by tracking points-to set also for non-pointer
> > types and consequently it also gives up on any such addition.
>
> It does. But note it does _not_ for POINTER_PLUS where it treats
> the offset operand as non-pointer.
>
> > I think it is
> Confirm. But option save/restore has been always implemented:
>
> .section .gnu.lto_.opts,"",@progbits
> .ascii "'-fno-openmp' '-fno-openacc' '-fno-pie' '-fcf-protection"
> .ascii "=none' '-mabi=lp64d' '-march=loongarch64' '-mfpu=64' '-m"
> .ascii "simd=lasx'
> But adds a return with a value. And then the inliner inlines foo into foo2 but
> we still have the return with a value around ...
I guess ICF can special case an unused return value, but why is this not
taken care of by ipa-sra?
> > If I comment it out as above patch, then O3/PGO can get 16% and 12%
> > performance
> > improvement compared to O2 on x86.
> >
> >               O2             O3             PGO
> > cycles        2,497,674,824  2,104,993,224  2,199,753,593
> > instructions
> This heuristic wants to catch
>
>
> if (foo) abort ();
>
>
> and avoid sinking "too far" across a path with "similar enough"
> execution count (I think the original motivation was to fix some
> spilling / register pressure issue). The loop depth test
> should be !(bb_loop_depth
> I suspect this is most likely the profile updates changes ...
Quite possibly. The goal of this exercise is to figure out if there are
some bugs in the profile estimate or whether passes somehow prefer a broken
profile or if it is just bad luck.
Looking at sphinx and fatigue it seems that LRA
>
> why disallow caller->indirect_calls?
See testcase in comment #9
>
> > + return false;
> > + for (cgraph_edge *e2 = callee->callees; e2; e2 = e2->next_callee)
>
> I don't think this flies - it looks quadratic. Can we compute this
> in the inline summary once instead?
I guess I
Just so it is recorded somewhere, here is a testcase showing that we can't
inline leaf functions into always_inlines unless we do some tracking of
what calls were formerly indirect calls.
We really overloaded always_inline from the original semantics "drop
inlining heuristics" into "be sure that result is inlined".
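
To make the situation concrete, here is a minimal sketch. It is not the
testcase referred to above; the names leaf, apply and use are made up
for illustration. It shows an always_inline wrapper whose only call is
indirect until the wrapper itself is inlined into a caller that passes a
known function.

```c
/* Hypothetical sketch, not the testcase from the PR.  */

/* A leaf function that the user insists on inlining.  */
static inline __attribute__ ((always_inline)) int
leaf (int x)
{
  return x + 1;
}

/* An always_inline wrapper whose body contains only an indirect call.  */
static inline __attribute__ ((always_inline)) int
apply (int (*fn) (int), int x)
{
  return fn (x);
}

/* Once apply is inlined here, fn is known to be leaf, so the formerly
   indirect call can turn into a direct call to an always_inline
   function.  */
int
use (int x)
{
  return apply (leaf, x);
}
```

Whether leaf ends up inlined into use depends on when the indirect call
gets resolved, which is why "be sure that result is inlined" is a much
stronger promise than "drop inlining heuristics".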
>
> There is no guarantee that std::vector::max_size() is PTRDIFF_MAX. It
> depends on the Allocator type, A. A user-defined allocator could have
> max_size() == 100.
If the inliner sees a path to the throw functions, it will not consider
_M_check_len early inlinable.
Perhaps we can
> > Indeed it is quite a long-standing problem with clang not building with lifetime
> > DSE and strict aliasing. I wonder why this is not fixed on clang side?
>
> Because the problems were not communicated? I knew that Firefox needed
> -flifetime-dse=1, but it's the first time I hear that any such
>
> Do you mean we should fix modeling of divisions there as well? I don't have
> latency/throughput measurements for those CPUs, nor access so I can run
> experiments myself, unfortunately.
>
> I guess you mean just making a patch to model division units separately,
> leaving latency/throughput
> > For this one it's PRE hoisting *b across the endless loop (PRE handles
> > calls as possibly not returning but not loops as possibly not
> > terminating...)
> > So it's a different bug.
>
> Btw, C++ requiring forward progress makes the testcase undefined.
In my understanding access to
> > My guess is that the
> > BUILD_BUG();
> > line is the sole thing that is wrong, it should be just break;
> > as the memory_is_poisoned_n(addr, size); will handle all the sizes,
> > regardless if they are constants or not.
>
> Sure, I'm going to suggest such a change.
To me it looked like a
> To me, all of these do the same thing and should generate the same code.
> As nobody else can see removeme, and we aren't leaking its address, shouldn't
> the compiler be able to deduce that all accesses to removeme are
> inconsequential and can be removed?
>
> My gcc 11.3 generates a condition
> I would say so. It saves code size and also uop space unless the two
> can magically fuse to an immediate to %xmm move (I doubt that).
I made a simple benchmark
double a=10;
int
main()
{
  long int i;
  double sum,val1,val2,val3,val4;
  for (i=0;i<10;i++)
    {
#if
> > According to znver2_cost
> >
> > Cost of sse_to_integer is a little bit less than fp_store, maybe increasing
> > sse_to_integer cost (to more than fp_store) can help RA to choose memory
> > instead of GPR.
>
> That sounds reasonable - GPR<->xmm is cheaper than GPR -> stack -> xmm
> but GPR<->xmm
> > bool
> Since the pass issues a bunch of other warnings (e.g., -Wstringop-overflow,
> -Wuse-after-free, etc.) the gate doesn't seem right. But since #pragma GCC
> diagnostic can re-enable warnings disabled by -w (or turn them into errors)
> any gate that considers the global option setting
So I assume that this is due to the new pass_waccess which was added into
early optimizations. I think this is not really an ipa component but
tree-optimize.
> So nothing to see? I guess our unit growth limit doesn't trigger because it's
> a small (benchmark) unit?
Yep, unit growth limits do not apply to very small units. ipa-cp heuristics
still IMO need work and should be based on relative speedups rather than
absolute ones for the cutoffs.
>
> Sure - I just remember (falsely?) that we finally decided to do it :)
I do not recall this, but I may have forgotten :))
> If we don't run IPA inline we don't figure we failed to inline the
> always_inline either ;) And IPA inline can expose more indirect
> always-inlines we only discover
> You can not disable an IPA pass because then we will mishandle
> optimize attributes. I think you simply want to set
>
> flag_inline_small_functions = 0
> flag_inline_functions_called_once = 0
Actually I forgot, we have flag_no_inline which makes
tree_inlinable_function_p return false
> --- Comment #6 from Richard Biener ---
> Honza, -Og was supposed to not do so much work, I intended to disable IPA
> inlining but there's no knob for that. I wonder where to best put such
> guard? I set flag_inline_small_functions to zero for -Og but we still
> run inline_small_functions ().
On zen2 and 3 with -flto the speedup seems to be about 12% for both -O2
and -Ofast -march=native, which is very nice in both cases!
Zen1 for some reason sees less improvement, about 6%.
With PGO it is 3.8%.
Overall it seems a win, but there are a few noteworthy issues.
I also see a 6.69% regression on x64 with
>
> Well, I'm specifically speaking about:
> error: the control flow of function ‘BZ2_compressBlock’ does not match its
> profile data (counter ‘arcs’)
>
> this type of errors should not happen even in a multi-threaded programs.
There are some cases where I see even those on clang build - I am
The patch passed testing on x86_64-linux.
This is a slightly modified patch I am testing. I added pre-computation of the
number of accesses, enabled the path for const functions (in case they
have a memory operand), initialized alias sets and clarified the logic
around every_* and global_memory_accesses
PR tree-optimization/103168
> (The -fno-semantic-interposition thing is probably the biggest performance gap
> between gcc -fpic and clang -fpic.)
Yep, it is often confusing to users (who do not understand what ELF
interposition is) that clang and gcc disagree on default flags here.
Recently -Ofast was extended to imply
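
For readers unfamiliar with ELF interposition, a minimal sketch of the
kind of code where the default matters. The names answer and
twice_answer are made up for illustration, and the comments describe
GCC's documented -fsemantic-interposition behaviour rather than anything
measured here.

```c
/* Sketch only; answer and twice_answer are hypothetical names.  */

/* Externally visible definition.  When this file is compiled with -fpic
   into a shared library, ELF rules allow another definition of answer
   (for example from an LD_PRELOAD library) to replace this one at run
   time.  */
int
answer (void)
{
  return 42;
}

/* With GCC's default of -fsemantic-interposition, the compiler must not
   assume that the call below reaches the definition above, so it does
   not inline it or propagate its return value.  With
   -fno-semantic-interposition it is free to fold this to return 84.  */
int
twice_answer (void)
{
  return 2 * answer ();
}
```

Declaring answer static, or giving it hidden visibility, removes the
question entirely because the symbol can no longer be interposed.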
Needs -O2 -floop-unroll-and-jam --param early-inlining-insns=14
to fail, so I guess it may be an issue with unroll-and-jam.
> @@ -1,4 +1,3 @@
> -static int
> __attribute__ ((noinline,const))
> infinite (int p)
> {
Just for the record, it crashes with or without static int here for me :)
I ran across it because the code tracking must access in ipa-sra is IMO
conceptually wrong. I noticed that because ipa-modref solves
Aha, but here is a better example (reproduces the same way).
In the former one I forgot the const attribute, which makes it invalid.
The testcase tests that ipa-sra is missing an ECF_LOOPING_CONST_OR_PURE
check
static int
__attribute__ ((noinline))
infinite (int p)
{
  if (p)
    while (1);
  return p;
}
Works for me even with the 3 warnings.
hubicka@lomikamen:/aux/hubicka/trunk/build-lto2/gcc$ cat >tt.c
__attribute__ ((noinline,const))
infinite (int p)
{
  if (p)
    while (1);
  return p;
}
__attribute__ ((noinline))
static void
test(int p, int *a)
{
  int v = infinite (p);
  if (*a && v)
> [659] %
> [659] % gcctk -O0 -w small.c
> [660] %
> [660] % gcctk -O1 -w small.c
> [661] % gcctk -O1 -w small.c
> [662] % gcctk -O1 -w small.c
> gcctk: internal compiler error: Segmentation fault signal terminated program
> cc1
> Please submit a full bug report,
> with preprocessed source if
> https://gcc.gnu.org/bugzilla/show_bug.cgi?id=103230
>
> --- Comment #2 from Martin Liška ---
> > How do you build ubsan compiler?
>
> F="-O0 -g -fsanitize=undefined" ; make -j16 all-host -k CFLAGS="$F"
> CXXFLAGS="$F" LDFLAGS="$F"
>
> is the fastest approach.
Thanks, it is similar to what I
> Happens with UBSAN compiler for:
>
> $ gcc gcc/testsuite/gcc.c-torture/execute/pr71494.c -O1 -flto
> ...
> /home/marxin/Programming/gcc/gcc/ipa-modref-tree.h:550:33: runtime error: load
> of value 255, which is not a valid value for type 'bool'
> #0 0x18acc38 in
> https://gcc.gnu.org/bugzilla/show_bug.cgi?id=103211
>
> --- Comment #2 from Martin Liška ---
> Optimized dump differs for couple of functions in the same way:
>
> diff -u good bad
> --- good 2021-11-12 17:42:36.995947103 +0100
> +++ bad 2021-11-12 17:41:56.728194961 +0100
> @@ -38,7
The sanity check verifies that functions accessing a parameter indirectly
also read the parameter (otherwise the indirect reference can not
happen). This patch moves the check earlier and removes some overactive
flag cleaning on function call boundaries which introduces the nonsensical
situation. I
Note that it still seems to me that the crossed_loop_header handling is
overly conservative. We have:
@@ -2771,6 +2771,7 @@ jt_path_registry::cancel_invalid_paths
(vec )
bool seen_latch = false;
int loops_crossed = 0;
bool crossed_latch = false;
+ bool crossed_loop_header = false;
>
> This PR is still open, at least for slowdown in the threader with LTO. The
> issue is ranger wide, so it may also cause slowdowns on non-LTO builds for
> WRF, though I haven't checked.
I just wanted to record the fact somewhere since I was looking up the
revision range mostly to figure out
> https://gcc.gnu.org/bugzilla/show_bug.cgi?id=102943
>
> Aldy Hernandez changed:
>
>        What |Removed |Added
>
>  Depends on |        |103058
>
> --- Comment
> https://gcc.gnu.org/bugzilla/show_bug.cgi?id=103040
>
> --- Comment #15 from Iain Buclaw ---
> Got it. The difference between D and C++ is a matter of early inlining.
>
> The C++ example Jakub posted fails in the same way that D does if you compile
> with: -O1 -fno-inline
Great, I will take a
> See above comments from Iain, even if that pre-initialization is removed it is
> still miscompiled. And, the testcase fails not because of the padding bits not
> being zero, but because the address of self stored into one of the fields isn't
> there or modref thinks it can't be changed or
> Not seen on Haswell (but w/o PGO). Is this PGO specific? There's another
> large jump visible end of 2019.
It is between 2019-11-15 and 18 but the revisions do not exist in git
- perhaps they refer to the old git mirror. Martin will know better.
In that range there are many of Richard's
>
> fixup_cfg already removes write-only stores so that seems fit for that
> purpose.
>
> Btw,
>
> static int x = 1;
>
> int main()
> {
> x = 1;
> }
>
> should ideally be handled as well as maybe the more common(?)
>
> static int x[128];
>
> int main()
> {
> memset (x, 0, 128*4);
> }
>