https://gcc.gnu.org/bugzilla/show_bug.cgi?id=111317
--- Comment #1 from Robin Dapp ---
I think the default cost model is not too bad for these simple cases. Our
emitted instructions match gimple pretty well.
The thing we don't model is vsetvl. We could ignore it under the assumption
that it
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=111337
--- Comment #12 from Robin Dapp ---
Yes, as far as I know. I would also go ahead and merge the test suite patch
now as there is already a v2 fix posted. Even if it's not the correct one it
will be done soon so we should not let that block
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=111337
--- Comment #8 from Robin Dapp ---
Yes, I doubt we would get much below 4 instructions with riscv specifics.
A quick grep yesterday didn't reveal any aarch64 or gcn patterns for those (as
long as they are not hidden behind some pattern
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=111337
--- Comment #10 from Robin Dapp ---
I would be OK with the riscv implementation, then we don't need to touch isel.
Maybe a future vector extension will also help us here so we could just switch
the implementation then.
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=53
--- Comment #2 from Robin Dapp ---
With the current trunk we don't spill anymore:
(VLS)
.L4:
vle32.v v2,0(a5)
vadd.vv v1,v1,v2
addi a5,a5,16
bne a5,a4,.L4
Considering just that loop I'd say costing works
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=53
--- Comment #4 from Robin Dapp ---
Yes, with VLS reduction this will improve.
On aarch64 + sve I see
loop inside costs: 2
This is similar to our VLS costs.
And their loop is indeed short:
ld1w z30.s, p7/z, [x0, x2, lsl 2]
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=111401
Robin Dapp changed:
What|Removed |Added
CC||rdapp at gcc dot gnu.org
--- Comment #2
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=111337
Robin Dapp changed:
What|Removed |Added
CC||rdapp at gcc dot gnu.org
--- Comment #1
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=111401
--- Comment #6 from Robin Dapp ---
Created attachment 55902
--> https://gcc.gnu.org/bugzilla/attachment.cgi?id=55902&action=edit
Tentative
You're referring to the case where we have init = -0.0, the condition is false
and we end up wrongly doing
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=111401
--- Comment #3 from Robin Dapp ---
Several other things came up, so I'm just going to post the latest status here
without having revised or tested it. Going to try fixing it and testing
tomorrow.
--- a/gcc/tree-vect-loop.cc
+++
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=111311
Bug ID: 111311
Summary: RISC-V regression testsuite errors with
--param=riscv-autovec-preference=scalable
Product: gcc
Version: 14.0
Status: UNCONFIRMED
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=111794
--- Comment #10 from Robin Dapp ---
From what I can tell with my barely working connection no regressions on x86,
aarch64 or power10 with the adjusted check.
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=112109
Bug ID: 112109
Summary: Missing riscv vectorized strcmp (and other) expanders
Product: gcc
Version: 14.0
Status: UNCONFIRMED
Severity: normal
Priority: P3
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=111600
--- Comment #30 from Robin Dapp ---
On my machine it is not nearly as bad as insn-emit.cc. What dominates for me
with a GCC 13 host compiler is the already fixed insn-opinit problem.
How long does it take for you (maybe in % of the total
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=111311
--- Comment #10 from Robin Dapp ---
As a general remark: Some of those are present on other backends as well, some
have been introduced by recent common-code changes and some are bogus test
prerequisites or checks. I'm not saying we are in
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=112363
--- Comment #1 from Robin Dapp ---
This test was introduced in order to check that we correctly "reduce" with -0.0
as neutral element, i.e. a reduction preserves an initial -0.0 and doesn't turn
it into 0.0 by adding 0.0. Kernel aborted means
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=112361
--- Comment #2 from Robin Dapp ---
I can have a look. Of course I tested it but neither the compile farm machine
(gcc188) I used nor my local device has AVX512 run capability. Anywhere else
I can test it?
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=112361
--- Comment #6 from Robin Dapp ---
So "before" we created
vect__3.12_55 = MEM [(float *)vectp_a.10_53];
vect__ifc__43.13_57 = VEC_COND_EXPR ;
// _ifc__43 = _24 ? _3 : 0.0;
stmp__44.14_58 = BIT_FIELD_REF ;
stmp__44.14_59 = r3_29 +
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=112359
--- Comment #2 from Robin Dapp ---
Would something like
+  bool allow_cond_op = flag_tree_loop_vectorize
+    && !gimple_bb (phi)->loop_father->dont_vectorize;
in convert_scalar_cond_reduction be sufficient, or are there more conditions to
check
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=111488
Bug ID: 111488
Summary: ICE in riscv gcc.dg/vect/vect-126.c
Product: gcc
Version: 14.0
Status: UNCONFIRMED
Severity: normal
Priority: P3
Component: target
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=111488
Robin Dapp changed:
What|Removed |Added
CC||juzhe.zhong at rivai dot ai
--- Comment
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=111428
--- Comment #2 from Robin Dapp ---
Reproduced locally. The identical binary sometimes works and sometimes doesn't
so it must be a race...
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=111506
--- Comment #5 from Robin Dapp ---
Ah, thanks Joseph, so this at least means that we do not need
!flag_trapping_math here.
However, the vectorizer emulates the 64-bit integer to _Float16 conversion via
an intermediate int32_t and now the riscv
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=111506
Robin Dapp changed:
What|Removed |Added
CC||joseph at codesourcery dot com
---
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=111600
--- Comment #16 from Robin Dapp ---
Confirming that it's the compilation of insn-emit.cc which takes > 10 minutes.
The rest (including auto generating of files) is reasonably fast. Going to do
some experiments with it and see which pass takes
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=111600
--- Comment #18 from Robin Dapp ---
Just finished an initial timing run, sorted, first 10:
Time variable usr sys wall
GGC
phase opt and generate : 567.60 ( 97%) 38.23
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=111600
--- Comment #20 from Robin Dapp ---
Mhm, why is your profile so different from mine? I'm also on an x86_64 host
with a 13.2.1 host compiler (Fedora).
Is it because of the preprocessed source? Or am I just reading the timing
report wrong?
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=111600
--- Comment #22 from Robin Dapp ---
Ah, then it's not that different, your machine is just faster ;)
callgraph ipa passes : 69.77 ( 11%) 5.97 ( 13%) 76.05 ( 12%)
2409M ( 10%)
integration: 91.95 (
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=111600
Robin Dapp changed:
What|Removed |Added
CC||law at gcc dot gnu.org
--- Comment #12
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=111600
--- Comment #23 from Robin Dapp ---
For the lack of a better idea (and time constraints as looking for compiler
bottlenecks is slow and tedious) I went with Kito's suggestion of splitting
insn-emit.cc
This reduces this part of the compilation
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=111600
--- Comment #25 from Robin Dapp ---
At least here locally the maximum I saw was 1.4 GB of RES for insn-emit-10.cc.
That's still not ideal (especially when 8 or 10 of those files compile in
parallel) but at least no 8 GB for a single file
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=111791
--- Comment #4 from Robin Dapp ---
This is a scalar popcount and as Kito already noted we will just emit
cpop a0, a0
once the zbb extension is present.
As to the question what is actually being vectorized here, I'm not so sure :D
It looks
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=111760
--- Comment #6 from Robin Dapp ---
Yes, thanks for filing this bug separately. The patch doesn't disable all of
those optimizations; of course I paid special attention not to mess them up.
The difference here is that we valueize, add
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=111428
--- Comment #3 from Robin Dapp ---
Still difficult to track down. The following is a smaller reproducer:
program main
  implicit none
  integer, parameter :: n=5, m=3
  integer, dimension(n,m) :: v
  real, dimension(n,m) :: r
  do
  call
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=111760
Robin Dapp changed:
What|Removed |Added
CC||rdapp at gcc dot gnu.org,
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=111794
--- Comment #5 from Robin Dapp ---
Disregarding the reasons for the precision adjustment, for this case here, we
seem to fail at:
/* We do not handle bit-precision changes. */
if ((CONVERT_EXPR_CODE_P (code)
|| code ==
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=111794
--- Comment #9 from Robin Dapp ---
Yes, that's from pattern recog:
slp.c:11:20: note: === vect_pattern_recog ===
slp.c:11:20: note: vect_recog_mask_conversion_pattern: detected: _5 = _2 &
_4;
slp.c:11:20: note: mask_conversion pattern
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=111794
--- Comment #7 from Robin Dapp ---
vectp.4_188 = x_50(D);
vect__1.5_189 = MEM [(int *)vectp.4_188];
mask__2.6_190 = { 1, 1, 1, 1, 1, 1, 1, 1 } == vect__1.5_189;
mask_patt_156.7_191 = VIEW_CONVERT_EXPR>(mask__2.6_190);
_1 = *x_50(D);
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=111600
--- Comment #26 from Robin Dapp ---
So insn-opinit.cc still takes 2-3 minutes to compile here, even though the file
is not gigantic.
With the same GCC 13.1 x86 host compiler I see:
phase opt and generate : 170.28 ( 99%) 0.75 (
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=111794
--- Comment #4 from Robin Dapp ---
Just to mention here as well. As this seems to be another instance where the
adjust_precision thing comes back to bite us, I'm going to go back and check if
the issue why it was introduced (DCE?) cannot be solved
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=110559
--- Comment #3 from Robin Dapp ---
I got back to this again today, now that pressure-aware scheduling is the
default. As mentioned before, it helps but doesn't get rid of the spills.
Testing with the "generic ooo" scheduling model it looks
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=36
Bug ID: 36
Summary: ICE in RISC-V test case since r14-3441-ga1558e9ad85693
Product: gcc
Version: 14.0
Status: UNCONFIRMED
Severity: normal
Priority: P3
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=53
--- Comment #1 from Robin Dapp ---
We seem to decide that a slightly more expensive loop (one instruction more)
without an epilogue is better than a loop with an epilogue. This looks
intentional in the vectorizer cost estimation and is not
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=36
--- Comment #4 from Robin Dapp ---
All gather-scatter tests pass for me again (the given example in particular)
after applying this.
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=108271
Robin Dapp changed:
What|Removed |Added
CC||rdapp at gcc dot gnu.org
--- Comment #3
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=108412
Robin Dapp changed:
What|Removed |Added
CC||rdapp at gcc dot gnu.org
--- Comment #3
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=112464
--- Comment #4 from Robin Dapp ---
Is there another way to make it more robust?
Or does the existing
void
vect_finish_replace_stmt (vec_info *vinfo,
stmt_vec_info stmt_info, gimple *vec_stmt)
{
gimple *scalar_stmt
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=112464
--- Comment #2 from Robin Dapp ---
I tested
diff --git a/gcc/tree-vect-loop.cc b/gcc/tree-vect-loop.cc
index a544bc9b059..257fd40793e 100644
--- a/gcc/tree-vect-loop.cc
+++ b/gcc/tree-vect-loop.cc
@@ -7084,7 +7084,7 @@
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=112406
--- Comment #11 from Robin Dapp ---
Thanks, this is helpful.
I have a patch that I just bootstrapped and ran the testsuite with on aarch64.
Going to post it soon, maybe Richi still has a better idea how to work around
this.
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=91213
rdapp at gcc dot gnu.org changed:
What|Removed |Added
CC||rdapp at gcc dot gnu.org
---
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=91213
--- Comment #8 from rdapp at gcc dot gnu.org ---
Hacked something together, inspired by the other cases that try two different
sequences. Does this go into the right direction? Works for me on s390. I see
some regressions related to predictive
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=91213
--- Comment #9 from rdapp at gcc dot gnu.org ---
The regressions are unrelated and due to another patch that I still had on the
same branch.
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=106701
rdapp at gcc dot gnu.org changed:
What|Removed |Added
Target|s390|s390 x86_64-linux-gnu
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=106701
--- Comment #3 from rdapp at gcc dot gnu.org ---
I thought expand (or combine) was independent of value range. What would be the
proper place for it then?
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=100756
rdapp at gcc dot gnu.org changed:
What|Removed |Added
CC||rdapp at gcc dot gnu.org
---
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=106919
--- Comment #8 from rdapp at gcc dot gnu.org ---
Yes, one of dst and dest is superfluous. Looks good like that. I bootstrapped
the same patch locally already, no regressions.
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=105988
rdapp at gcc dot gnu.org changed:
What|Removed |Added
Target|x86_64-pc-linux-gnu |x86_64-pc-linux-gnu s390
---
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=106527
Bug ID: 106527
Summary: ICE with modulo scheduling dump (-fdump-rtl-sms)
Product: gcc
Version: 13.0
Status: UNCONFIRMED
Severity: normal
Priority: P3
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=107617
rdapp at gcc dot gnu.org changed:
What|Removed |Added
Priority|P3 |P4
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=107617
Bug ID: 107617
Summary: SCC-VN with len_store and big endian
Product: gcc
Version: 13.0
Status: UNCONFIRMED
Severity: normal
Priority: P3
Component: middle-end
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=107617
--- Comment #1 from rdapp at gcc dot gnu.org ---
For completeness, the mailing list thread is here:
https://gcc.gnu.org/pipermail/gcc-patches/2022-September/602252.html
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=100756
--- Comment #8 from rdapp at gcc dot gnu.org ---
For completeness: haven't observed any fallout on s390 since and the regression
is fixed.
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=110559
--- Comment #1 from Robin Dapp ---
This can be improved in parts by enabling register-pressure aware scheduling.
The rest is due to the default issue rate of 1. Setting proper instruction
latency will then obviously cause a bit more reordering
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=113583
--- Comment #2 from Robin Dapp ---
> It's interesting, for Clang only RISC-V can vectorize it.
The full loop can be vectorized on clang x86 as well when I remove the first
conditional (which is not in the snippet I posted above). So that's
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=113583
Bug ID: 113583
Summary: Main loop in 519.lbm not vectorized.
Product: gcc
Version: 14.0
Status: UNCONFIRMED
Keywords: missed-optimization
Severity: normal
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=113575
Robin Dapp changed:
What|Removed |Added
CC||rdapp at gcc dot gnu.org
--- Comment #5
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=113575
--- Comment #7 from Robin Dapp ---
Ok, I'm going to check.
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=113570
--- Comment #2 from Robin Dapp ---
I'm pretty certain this is "works as intended" and -Ofast causes the precision
to be different than with -O3 (and dependent on the target). See also:
It has been reported that with gfortran -Ofast
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=113575
--- Comment #12 from Robin Dapp ---
Created attachment 57209
--> https://gcc.gnu.org/bugzilla/attachment.cgi?id=57209&action=edit
Tentative
I tested the attached "fix". On my machine with 13.2 host compiler it reduced
the build time for
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=113827
--- Comment #1 from Robin Dapp ---
x86 (-march=native -O3 on an i7 12th gen) looks pretty similar:
.L3:
movq (%rdi), %rax
vmovups (%rax), %xmm1
vdivps %xmm0, %xmm1, %xmm1
vmovups %xmm1, (%rax)
addq
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=113827
Bug ID: 113827
Summary: MrBayes benchmark redundant load
Product: gcc
Version: 14.0
Status: UNCONFIRMED
Severity: enhancement
Priority: P3
Component: target
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=112548
--- Comment #4 from Robin Dapp ---
Judging by the graph it looks like it was slow before, then got faster and now
slower again. Is there some more info on why it got faster in the first place?
Did the patch reverse something or is it rather a
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=114027
Robin Dapp changed:
What|Removed |Added
CC||rguenth at gcc dot gnu.org
Last
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=114027
--- Comment #9 from Robin Dapp ---
Argh, I actually just did a gcc -O3 -march=native pr114027.c
-fno-vect-cost-model on cfarm188 with a recent-ish GCC but realized that I used
my slightly modified version and not the original test case.
long
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=113014
--- Comment #4 from Robin Dapp ---
Richard has posted it and asked for reviews. I have tested it and we have
several testsuite regressions with it but no severe ones. Most or all of them
are dump fails because we combine into vx variants that
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=113014
--- Comment #2 from Robin Dapp ---
Yes, that's right.
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=112773
--- Comment #16 from Robin Dapp ---
I'd hope it was not fixed by this but just latent because we chose a VLS-mode
vectorization instead. Hopefully we're better off with the fix than without :)
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=112971
--- Comment #2 from Robin Dapp ---
It doesn't look like the same issue to me. The other bug is related to TImode
handling in combination with mask registers. I will also have a look at this
one.
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=112971
--- Comment #8 from Robin Dapp ---
Yes, can confirm that this helps.
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=112971
--- Comment #5 from Robin Dapp ---
Yes that's what I just tried. No infinite loop anymore then. But that's not a
new simplification and looks reasonable so there must be something special for
our backend.
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=112971
--- Comment #3 from Robin Dapp ---
In match.pd we do something like this:
;; Function e (e, funcdef_no=0, decl_uid=2751, cgraph_uid=1, symbol_order=4)
Pass statistics of "forwprop":
Matching expression match.pd:2771,
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=112999
--- Comment #1 from Robin Dapp ---
What actually gets in the way of vec_extract here is changing to a "better"
vector mode (which is RVVMF4QI here). If we tried to extract from the mask
directly everything would work directly.
I have a patch
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=112999
Bug ID: 112999
Summary: riscv: Infinite loop with mask extraction
Product: gcc
Version: 14.0
Status: UNCONFIRMED
Severity: normal
Priority: P3
Component:
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=112929
--- Comment #13 from Robin Dapp ---
I just built from the most recent commit and it still fails for me.
Could there be a difference in qemu? I'm on qemu-riscv64 version 8.1.91 but
yours is even newer so that might not explain it.
You could
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=112853
--- Comment #10 from Robin Dapp ---
I just realized that I forgot to post the comparison recently. With the patch
now upstream I don't see any differences for zvl128b and different vlens
anymore. What I haven't fully tested yet is zvl256b or
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=112929
--- Comment #9 from Robin Dapp ---
In the good version the length is 32 here because directly before the vsetvl we
have:
li a4,32
That seems to get lost somehow.
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=112929
--- Comment #6 from Robin Dapp ---
This seems to be gone when simple vsetvl (instead of lazy) is used or with
-fno-schedule-insns which might indicate a vsetvl pass problem.
We might have a few more of those. Maybe it would make sense to run
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=112929
--- Comment #7 from Robin Dapp ---
Here
0x105c6 vse8.v v8,(a5)
is where we overwrite m. The vl is 128 but the preceding vsetvl gets a4 =
46912504507016 as AVL which seems already broken.
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=112929
--- Comment #15 from Robin Dapp ---
I think we need to make sure that we're not writing out of bounds. In that
case anything might happen and if we just don't happen to overwrite this
variable we might hit another one but the test can still
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=112999
Robin Dapp changed:
What|Removed |Added
Resolution|--- |FIXED
Status|UNCONFIRMED
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=113249
--- Comment #1 from Robin Dapp ---
Yes, several (most?) of those are expected because the tests rely on the
default latency model. One option is to hard code the tune in those tests.
On the other hand the dump tests checking for a more or less
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=113281
--- Comment #2 from Robin Dapp ---
Confirmed. Funny, we shouldn't vectorize that but really optimize to "return
0". Costing might be questionable but we also haven't optimized away the loop
when comparing costs.
Disregarding that, of course
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=113247
--- Comment #9 from Robin Dapp ---
I also noticed this (likely unwanted) vector snippet and wondered where it is
being created. First I thought it's a vec_extract but doesn't look like it.
I'm going to check why we create this.
Pan, the test
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=112971
--- Comment #22 from Robin Dapp ---
Yes, going to the thread soon.
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=113474
--- Comment #1 from Robin Dapp ---
Good catch. Looks like the ifn expander always forces into a register. That's
probably necessary on all targets except riscv.
diff --git a/gcc/internal-fn.cc b/gcc/internal-fn.cc
index
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=113247
--- Comment #3 from Robin Dapp ---
Yes, sure and I gave a bit of detail why the values chosen there (same as
aarch64) make sense to me.
Using this generic vector cost model by default without adjusting the latencies
is possible. I would be OK
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=113247
--- Comment #4 from Robin Dapp ---
The other option is to assert that all tune models have at least a vector cost
model rather than NULL... But not falling back to the builtin costs still
makes sense.
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=113249
--- Comment #4 from Robin Dapp ---
> One of the reasons I've been testing things with generic-ooo is because
> generic-ooo had initial vector pipelines defined. For cleaning up the
> scheduler, I copied over the generic-ooo pipelines into
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=113247
--- Comment #1 from Robin Dapp ---
Hmm, so I tried reproducing this and without a vector cost model we indeed
vectorize. My qemu dynamic instruction count results are not as abysmal as
yours but still bad enough (20-30% increase in dynamic
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=112853
--- Comment #5 from Robin Dapp ---
Can confirm. The scalable build works with qemu vlen=128 but fails with
vlen=256. That's a good data point as I'm not sure we're already covering this
with the current runs?
I'm going to start a testsuite