[Bug tree-optimization/114760] New: traling zero count detection failure
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=114760 Bug ID: 114760 Summary: traling zero count detection failure Product: gcc Version: 14.0 Status: UNCONFIRMED Severity: normal Priority: P3 Component: tree-optimization Assignee: unassigned at gcc dot gnu.org Reporter: jiangning.liu at amperecomputing dot com Target Milestone: ---

For this small case, gcc fails to detect the trailing zero count calculation, so the x86 tzcnt instruction cannot be generated, whereas clang does generate it.

    unsigned ntz32_6a(unsigned x) {
        int n;
        n = 32;
        while (x != 0) {
            n = n - 1;
            x = x + x;
        }
        return n;
    }

If we change "x = x + x" to "x = x << 1", the optimization works.

    unsigned ntz32_6a(unsigned x) {
        int n;
        n = 32;
        while (x != 0) {
            n = n - 1;
            x = x << 1;
        }
        return n;
    }

It seems number_of_iterations_cltz/number_of_iterations_cltz_complement in tree-ssa-loop-niter.cc, or some other place, needs to be enhanced.
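As a hedged illustration of why the two loop forms in the report above should be treated alike, the sketch below places both next to a reference built on the GCC/Clang builtin `__builtin_ctz`. The function names `ntz32_add`, `ntz32_shift` and `ntz32_builtin` are hypothetical, and the code assumes a 32-bit `unsigned`.

```c
/* Loop from the report: double x until it becomes 0.
   Assumes unsigned is 32 bits wide. */
unsigned ntz32_add(unsigned x)
{
    unsigned n = 32;
    while (x != 0) {
        n = n - 1;
        x = x + x;      /* for unsigned x this equals x << 1 */
    }
    return n;
}

/* Same loop written with a shift, the form gcc already recognizes
   according to the report. */
unsigned ntz32_shift(unsigned x)
{
    unsigned n = 32;
    while (x != 0) {
        n = n - 1;
        x = x << 1;
    }
    return n;
}

/* Reference: __builtin_ctz(0) is undefined, so 0 is handled separately. */
unsigned ntz32_builtin(unsigned x)
{
    return x ? (unsigned)__builtin_ctz(x) : 32u;
}
```

All three functions should agree for every input, which is the property a pattern matcher for this idiom relies on.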
[Bug tree-optimization/98138] BB vect fail to SLP one case
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=98138 --- Comment #12 from Jiangning Liu ---

Hi Richi,

> That said, "failure" to identify the common (vector) load is known
> and I do have experimental patches trying to address that but did
> not yet arrive at a conclusive "best" approach.

That was a long time ago, so do you have a "best" approach now?

Thanks,
-Jiangning
[Bug target/106671] aarch64: BTI instruction are not inserted for cross-section direct calls
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=106671 --- Comment #11 from Jiangning Liu --- Hi Wilco, > "it means we will need a linker optimization to remove those redundant BTIs > (eg. by changing them into NOPs)" It will be only for performance optimization, right? If we don't care about performance, the linker doesn't need to optimize it to be NOP, right? It could still be useful if we only do this operation for a specific module. Thanks, -Jiangning
[Bug tree-optimization/109603] New: Vectorization failure for a small loop containing a simple branch
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=109603 Bug ID: 109603 Summary: Vectorization failure for a small loop containing a simple branch Product: gcc Version: 14.0 Status: UNCONFIRMED Severity: normal Priority: P3 Component: tree-optimization Assignee: unassigned at gcc dot gnu.org Reporter: jiangning.liu at amperecomputing dot com Target Milestone: ---

For the following small case,

    #include <stdio.h>
    #include <stdlib.h>
    #include <time.h>

    #define NANOSECS 1000000000L

    int main(int argc, char * argv[])
    {
        long long i, even, odd, c;
        char *eptr;
        struct timespec ts0, ts1;

        c = strtoll(argv[1], &eptr, 10);
        printf("c = %lld \n", c);
        even = odd = 0;
        clock_gettime(CLOCK_MONOTONIC, &ts0);
        for (i = 0; i < c; i++) {
            if (i % 2)
                even++;
            else
                odd++;
        }
        clock_gettime(CLOCK_MONOTONIC, &ts1);
        printf("even = %lld odd = %lld\n", even, odd);
        printf("elapsed %ld\n", (ts1.tv_sec - ts0.tv_sec) * NANOSECS + (ts1.tv_nsec - ts0.tv_nsec));
        return 0;
    }

Using "-mcpu=neoverse-n1", gcc fails to vectorize the loop, while using "-mcpu=neoverse-n1 -mtune=generic", or neither -mcpu nor -mtune, gcc vectorizes it successfully.

The scalar version of the loop looks like this:

    400660: 36000381  tbz  w1, #0, 4006d0
    400664: 91000694  add  x20, x20, #0x1
    400668: 91000421  add  x1, x1, #0x1
    40066c: eb01027f  cmp  x19, x1
    400670: 5481      b.ne 400660  // b.any
    ...
    4006d0: 910006b5  add  x21, x21, #0x1
    4006d4: 17e5      b    400668

The vectorized version is shown below (factor=2), and it is much faster on neoverse-n1.

    400670: 91000421  add  x1, x1, #0x1
    400674: 4e241c20  and  v0.16b, v1.16b, v4.16b
    400678: 4ee48421  add  v1.2d, v1.2d, v4.2d
    40067c: 4ee09800  cmeq v0.2d, v0.2d, #0
    400680: 6e631ca0  bsl  v0.16b, v5.16b, v3.16b
    400684: 4ee08442  add  v2.2d, v2.2d, v0.2d
    400688: eb13003f  cmp  x1, x19
    40068c: 5421      b.ne 400670  // b.any

It seems the neoverse-n1 vector cost model is inaccurate and does not work well for this small case.
(1) For the -mcpu=neoverse-n1 version, the vectorization cost model result is

    Vector inside of loop cost: 12
    Scalar iteration cost: 5

12 > 5*2, so gcc doesn't think vectorization is worthwhile for factor=2.

(2) For the version without -mcpu, the vectorization cost model result is

    Vector inside of loop cost: 4
    Scalar iteration cost: 5

Here the loop body cost for the vectorized version is 4, which is too small and looks incorrect as well, although in reality the vectorized version is faster than the scalar version. In contrast, the 12 for -mcpu=neoverse-n1 looks more reasonable, although it blocks the vectorization.
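The arithmetic quoted in the report can be written down as a tiny check. This is only a simplified sketch of the comparison as the report states it ("12 > 5*2"), not GCC's actual cost model code. The function name `vector_loop_looks_profitable` is hypothetical, and treating the equal case as profitable is an assumption.

```c
#include <stdbool.h>

/* Simplified comparison from the report: one vector iteration replaces
   vf scalar iterations, so vectorization looks worthwhile only if the
   vector body cost does not exceed vf scalar iterations.
   Assumption: the equal case counts as profitable. */
bool vector_loop_looks_profitable(int vector_inside_cost,
                                  int scalar_iter_cost,
                                  int vf)
{
    return !(vector_inside_cost > scalar_iter_cost * vf);
}
```

With the numbers from the report, the -mcpu=neoverse-n1 costs (12 versus 5 at factor 2) fail the check, while the generic costs (4 versus 5 at factor 2) pass it.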
[Bug rtl-optimization/109343] New: invalid if conversion optimization for aarch64
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=109343 Bug ID: 109343 Summary: invalid if conversion optimization for aarch64 Product: gcc Version: rust/master Status: UNCONFIRMED Severity: normal Priority: P3 Component: rtl-optimization Assignee: unassigned at gcc dot gnu.org Reporter: jiangning.liu at amperecomputing dot com Target Milestone: ---

For this small case, the if-conversion optimization in the back end generates a csel instruction for aarch64, which is unsafe. The address of variable sga_var could be invalid if sga_mapped is false.

    $ cat ttt2.c
    extern int sga_mapped, sga_var;

    int func(int j){
        int i=0;
        if(sga_mapped)
            i=i+sga_var;
        return i;
    }
    $ gcc -O3 -S ttt2.c
    $ cat ttt2.s
            .arch armv8-a
            .file "ttt2.c"
            .text
            .align 2
            .p2align 4,,11
            .global func
            .type func, %function
    func:
    .LFB0:
            .cfi_startproc
            adrp x0, sga_mapped
            adrp x1, sga_var
            ldr w0, [x0, #:lo12:sga_mapped]
            ldr w1, [x1, #:lo12:sga_var]
            cmp w0, 0
            csel w0, w1, w0, ne
            ret
            .cfi_endproc
    .LFE0:
            .size func, .-func
            .ident "GCC: (GNU) 12.2.1 20221121 (Red Hat 12.2.1-4)"
            .section .note.GNU-stack,"",@progbits

For x86, the following code is generated. It is safe because the memory access to sga_var(%rip) won't really be triggered if %eax is not set. Here x86 and aarch64 are different.

    $ cat ttt2.s
            .file "ttt2.c"
            .text
            .p2align 4
            .globl func
            .type func, @function
    func:
    .LFB0:
            .cfi_startproc
            endbr64
            movl sga_mapped(%rip), %eax
            testl %eax, %eax
            cmovne sga_var(%rip), %eax
            ret
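To make the concern in the report above concrete, here is a hedged C-level sketch of the two behaviours. The pointer parameter `var` is a stand-in for the address of `sga_var`, and both function names are hypothetical. The second function models what the ldr + csel sequence amounts to, which is an unconditional load followed by a select.

```c
#include <stddef.h>

/* Source-level semantics: *var is read only when mapped is nonzero,
   so an invalid var is harmless as long as mapped == 0. */
int load_guarded(int mapped, const int *var)
{
    int i = 0;
    if (mapped)
        i = i + *var;
    return i;
}

/* Model of the if-converted code: the load happens first, the select
   afterwards. This is only valid when var is always readable. */
int load_speculated(int mapped, const int *var)
{
    int loaded = *var;          /* executed even when mapped == 0 */
    return mapped ? loaded : 0;
}
```

Calling `load_guarded(0, NULL)` is fine, whereas `load_speculated(0, NULL)` would dereference a null pointer. That difference is the one the report describes between the source and the generated aarch64 code.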
[Bug tree-optimization/89430] A missing ifcvt optimization to generate csel
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=89430 --- Comment #17 from Jiangning Liu ---

Yes.

> -Original Message-
> From: tnfchris at gcc dot gnu.org
> Sent: Friday, November 11, 2022 4:48 PM
> To: JiangNing Liu
> Subject: [Bug tree-optimization/89430] A missing ifcvt optimization to
> generate csel
>
> https://gcc.gnu.org/bugzilla/show_bug.cgi?id=89430
>
> --- Comment #16 from Tamar Christina ---
> I think this can be closed now right?
[Bug c/106823] New: #pragma GCC diagnostic ignored "-Wattribute-warning" doesn't work for -flto
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=106823 Bug ID: 106823 Summary: #pragma GCC diagnostic ignored "-Wattribute-warning" doesn't work for -flto Product: gcc Version: 13.0 Status: UNCONFIRMED Severity: normal Priority: P3 Component: c Assignee: unassigned at gcc dot gnu.org Reporter: jiangning.liu at amperecomputing dot com Target Milestone: --- $ cat foo.cpp extern "C" __attribute__((__warning__(""))) void _foo(int) {}; void foo(int num) { #pragma GCC diagnostic ignored "-Wattribute-warning" ::_foo(num); } int main() { foo(1); } $ g++ foo.cpp $ g++ -flto foo.cpp foo.cpp: In function ‘foo’: foo.cpp:5:9: warning: call to ‘_foo’ declared with attribute warning: [-Wattribute-warning] 5 | ::_foo(num); | ^
[Bug rtl-optimization/98782] [11/12 Regression] Bad interaction between IPA frequences and IRA resulting in spills due to changes in BB frequencies
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=98782 --- Comment #7 from Jiangning Liu --- Without reverting the commit g:1118a3ff9d3ad6a64bba25dc01e7703325e23d92, we still see exchange2 performance issue for aarch64. BTW, we have been using -fno-inline-functions-called-once to get the best performance number for exchange2.
[Bug tree-optimization/100511] Fail to remove dead code in loop
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=100511 --- Comment #5 from Jiangning Liu --- If we change "c3 = a" to "c3 = x->b", GCC can optimize it without IPA. It seems VRP is working for this case. $ cat tt7.c #include int a; typedef struct { int b; int count; } XX; int g; __attribute__((noinline)) void f(XX *x) { int c1 = 0; int c3 = x->b; if (x) c1 = x->count; for (int i=0; icount) { if (i > c3) { printf("Unreachable!"); break; } else g = 2; } else g = i; } } void main(void) { XX x; x.count = 100; a = 100; f(&x); }
[Bug tree-optimization/100511] Fail to remove dead code in loop
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=100511 --- Comment #2 from Jiangning Liu --- Then why gcc can't optimize this case either? sizeof (XX) <> sizeof(g) here. #include int a; typedef struct { int b; int count; } XX; int g; __attribute__((noinline)) void f(XX *x) { int c1 = 0; int c3 = a; if (x) c1 = x->count; for (int i=0; icount) { if (i > c3) { printf("Unreachable!"); break; } else g = 2; } else g = i; } } void main(void) { XX x; x.count = 100; a = 100; f(&x); }
[Bug tree-optimization/100511] New: Fail to remove dead code in loop
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=100511 Bug ID: 100511 Summary: Fail to remove dead code in loop Product: gcc Version: 12.0 Status: UNCONFIRMED Severity: normal Priority: P3 Component: tree-optimization Assignee: unassigned at gcc dot gnu.org Reporter: jiangning.liu at amperecomputing dot com Target Milestone: ---

For this simple case, gcc doesn't know that the if condition (i > c2) is always false.

    #include <stdio.h>

    typedef struct {
        int count;
    } XX;

    int g;

    __attribute__((noinline)) void f(XX *x)
    {
        int c1 = 0;
        if (x)
            c1 = x->count;
        for (int i=0; i<c1; i++) {
            int c2 = x->count;
            if (i > c2) {
                printf("Unreachable!");
                break;
            } else
                g = i;
        }
    }

    void main(void)
    {
        XX x;
        x.count = 100;
        f(&x);
    }

If we change the type of variable g to float, gcc does optimize away this if condition inside the loop, so why can't alias analysis recognize that g is different from x->count?
[Bug tree-optimization/99946] fail to exchange if conditions in terms of likely/unlikely probability
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=99946 --- Comment #1 from Jiangning Liu --- Is there any gcc pass that can deal with this simple optimization?
[Bug tree-optimization/99946] New: fail to exchange if conditions in terms of likely/unlikely probability
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=99946 Bug ID: 99946 Summary: fail to exchange if conditions in terms of likely/unlikely probability Product: gcc Version: 11.0 Status: UNCONFIRMED Severity: normal Priority: P3 Component: tree-optimization Assignee: unassigned at gcc dot gnu.org Reporter: jiangning.liu at amperecomputing dot com Target Milestone: ---

For this simple case,

    $ cat test_cond.c
    #define likely(x) __builtin_expect((x),1)
    #define unlikely(x) __builtin_expect((x),0)

    extern void g(void);
    int a, b;

    void f(void)
    {
        if (likely(a>0))
            if (unlikely(b>0))
                g();
    }

we expect gcc to exchange the if conditions, as below,

    if (unlikely(b>0))
        if (likely(a>0))
            g();

This way, performance can be improved by saving the comparison for a>0 in the common case. At the moment, gcc generates the code below,

    .LFB0:
            .cfi_startproc
            movl a(%rip), %edx
            testl %edx, %edx
            jle .L1
            movl b(%rip), %eax
            testl %eax, %eax
            jg .L4
    .L1:
            ret
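The saving described in the report above can be sketched by counting how many conditions are tested in each ordering. Everything here is hypothetical instrumentation (`evaluations`, `check`, `count_checks`), and the return value 1 merely stands in for the call to g(). The swap is only valid because neither condition has side effects.

```c
/* Hypothetical instrumentation: counts how many conditions get tested. */
int evaluations;

int check(int cond)
{
    evaluations++;
    return cond;
}

/* Order as written in the report: the likely test on a runs first. */
int likely_first(int a, int b)
{
    if (check(a > 0))
        if (check(b > 0))
            return 1;   /* stands in for the call to g() */
    return 0;
}

/* Exchanged order: the unlikely test on b runs first and usually
   fails, so the test on a is usually skipped. */
int unlikely_first(int a, int b)
{
    if (check(b > 0))
        if (check(a > 0))
            return 1;   /* stands in for the call to g() */
    return 0;
}

/* Runs fn once and reports how many conditions it tested. */
int count_checks(int (*fn)(int, int), int a, int b)
{
    evaluations = 0;
    (void)fn(a, b);
    return evaluations;
}
```

In the expected common case (a > 0 and b <= 0) the original order tests two conditions and the exchanged order tests one, while both orders return the same result for every input.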
[Bug rtl-optimization/98782] [11 Regression] Bad interaction between IPA frequences and IRA resulting in spills due to changes in BB frequencies
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=98782 --- Comment #4 from Jiangning Liu ---

Hi Honza,

Do you see any other real-world problems if the patch g:1118a3ff9d3ad6a64bba25dc01e7703325e23d92 is not applied? If exchange2 is the only benchmark affected by this patch so far, then, because we have observed a big performance regression, it sounds like we need to provide an IRA fix along with this patch to avoid an unexpected performance degradation for the gcc11 release vs. gcc10.

Thanks,
-Jiangning
[Bug tree-optimization/98598] Missed opportunity to optimize dependent loads in loops
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=98598 --- Comment #12 from Jiangning Liu --- MGO RFC is at https://gcc.gnu.org/pipermail/gcc/2021-January/234682.html
[Bug tree-optimization/98598] Missed opportunity to optimize dependent loads in loops
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=98598 --- Comment #11 from Jiangning Liu --- (In reply to rguent...@suse.de from comment #8) > On Sat, 9 Jan 2021, jiangning.liu at amperecomputing dot com wrote: > > > https://gcc.gnu.org/bugzilla/show_bug.cgi?id=98598 > > > > --- Comment #7 from Jiangning Liu > com> --- > > (In reply to rguent...@suse.de from comment #6) > > > On January 9, 2021 4:17:17 AM GMT+01:00, "jiangning.liu at amperecomputing > > > dot com" wrote: > > > >https://gcc.gnu.org/bugzilla/show_bug.cgi?id=98598 > > > > > > > >--- Comment #5 from Jiangning Liu > > >com> --- > > > >> It has to be done with care of course, cost modeling is difficult > > > >> (we need to have a good estimate of n and m or need to version > > > >> the whole nest). That said, usually we attempt the reverse > > > >transform. > > > > > > > >Before tuning the cost model good enough, we may implement this > > > >optimization by > > > >adding a new optimization command line option. This won't hurt gcc, > > > >right? > > > > > > New options not enabled by default tend to bitrot, be broken from the > > > start > > > and won't be used by the lazy user. So I see no point in doing that. > > > > > > > Understand. I mean we can enable it by default eventually, but we need to > > implement and tune it step by step. It is unrealistic to work out the best > > cost > > model at the very beginning. > > Sure. The "easiest" thing is to rely on a profile from PGO, we did > have some transforms only enabled by -fprofile-use by default. That is, > the cost model needs to be conservative, esp. if you introduce dynamic > allocation for this. In the end I guess only a variant that versions > the nest on the size of the temporary will be good enough to not trigger > OOM or excessive overhead for small sizes anyway. People usually don't use PGO unless they can't find any better static compiler switches. This optimization should always benefit performance if we can tune the cost model good enough. 
It is true that the temp memory size needs to be checked to avoid OOM, which is one of the runtime overheads.
[Bug tree-optimization/98598] Missed opportunity to optimize dependent loads in loops
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=98598 --- Comment #10 from Jiangning Liu --- (In reply to Hongtao.liu from comment #9) > It looks like a SOA/AOC opt opportunity which is discussed in > https://gcc.gnu.org/wiki/ > cauldron2015?action=AttachFile&do=view&target=Olga+Golovanevsky_+Memory+Layou > t+Optimizations+of+Structures+and+Objects.pdf > > And i remember there's someone working on enabling SOA/AOS opt in GCC. No. The key difference is the optimization opportunity here doesn't rely on LTO at all. It is purely a local optimization within a function instead.
[Bug tree-optimization/98598] Missed opportunity to optimize dependent loads in loops
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=98598 --- Comment #7 from Jiangning Liu --- (In reply to rguent...@suse.de from comment #6) > On January 9, 2021 4:17:17 AM GMT+01:00, "jiangning.liu at amperecomputing > dot com" wrote: > >https://gcc.gnu.org/bugzilla/show_bug.cgi?id=98598 > > > >--- Comment #5 from Jiangning Liu >com> --- > >> It has to be done with care of course, cost modeling is difficult > >> (we need to have a good estimate of n and m or need to version > >> the whole nest). That said, usually we attempt the reverse > >transform. > > > >Before tuning the cost model good enough, we may implement this > >optimization by > >adding a new optimization command line option. This won't hurt gcc, > >right? > > New options not enabled by default tend to bitrot, be broken from the start > and won't be used by the lazy user. So I see no point in doing that. > Understand. I mean we can enable it by default eventually, but we need to implement and tune it step by step. It is unrealistic to work out the best cost model at the very beginning.
[Bug tree-optimization/98598] Missed opportunity to optimize dependent loads in loops
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=98598 --- Comment #5 from Jiangning Liu --- > It has to be done with care of course, cost modeling is difficult > (we need to have a good estimate of n and m or need to version > the whole nest). That said, usually we attempt the reverse transform. Before tuning the cost model good enough, we may implement this optimization by adding a new optimization command line option. This won't hurt gcc, right? > > My personal opinion is that hinting the user to possibly refactor > his code (guided by profiling to be not too noisy) is much > prefered to the idea that the compiler can ever apply such transform > to the loops where it matters and not to the loops where it is > harmful. Sometimes, it is not always easy for the user to modify the code, and even the user may be lazy and reluctant to change the code. This kind of Memory Gathering Optimization can make end-user's life easier.
[Bug tree-optimization/98598] Missed opportunity to optimize dependent loads in loops
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=98598 --- Comment #2 from Jiangning Liu ---

Loop distribution can only handle very simple cases. If the inner loop has complicated control flow and other memory accesses with loop-carried dependences, it would be hard to handle. For example,

    int foo (int n, int m, A *pa)
    {
        int sum = 0;

        for (int i = 0; i < n; i++) {
            for (int j = 0; j < m; j++) {
                sum += pa[j].pb->pc->val; // each value is repeatedly loaded "n" times
                sum = sum % 7;
            }
            sum = sum % 13;
        }

        return sum;
    }

Alternatively, we can detect "invariant" dependent memory loads in the nested loops, with alias conflicts checked. If the outer loop is hot enough, we could have a chance to "hoist" them to create a cache.

As for temp storage, is it gcc's rule of thumb not to introduce temp storage on the heap, or is it just that gcc doesn't have it yet and we want to have it?
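A hand-written sketch of the "hoist into a cache" idea from the comment above might look as follows. The struct layouts and the names `foo_direct` and `foo_gathered` are assumptions made for illustration; the comment does not define type A. The sketch also assumes nothing in the loop nest stores to the loaded locations, and it falls back to the direct form when the temporary cannot be allocated.

```c
#include <stdlib.h>

/* Hypothetical layouts; the comment above does not define type A. */
typedef struct C { int val; } C;
typedef struct B { C *pc; } B;
typedef struct A { B *pb; } A;

/* The loop nest from the comment: each pa[j].pb->pc->val is reloaded
   through three dependent loads on every outer iteration. */
int foo_direct(int n, int m, A *pa)
{
    int sum = 0;
    for (int i = 0; i < n; i++) {
        for (int j = 0; j < m; j++) {
            sum += pa[j].pb->pc->val;
            sum = sum % 7;
        }
        sum = sum % 13;
    }
    return sum;
}

/* Hand-applied gathering: chase the pointers once, then reuse a dense
   temporary. Assumes no store in the nest aliases the loaded values. */
int foo_gathered(int n, int m, A *pa)
{
    if (n <= 0 || m <= 0)
        return 0;                       /* no loads happen in the original */

    int *cache = malloc((size_t)m * sizeof *cache);
    if (cache == NULL)
        return foo_direct(n, m, pa);    /* fall back if allocation fails */

    for (int j = 0; j < m; j++)
        cache[j] = pa[j].pb->pc->val;   /* dependent loads done once */

    int sum = 0;
    for (int i = 0; i < n; i++) {
        for (int j = 0; j < m; j++) {
            sum += cache[j];
            sum = sum % 7;
        }
        sum = sum % 13;
    }
    free(cache);
    return sum;
}
```

The heap allocation and its failure path are exactly the runtime overheads that the cost-model discussion in this thread is concerned with.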
[Bug web/95380] New: ipcp-unit-growth was renamed to ipa-cp-unit-growth
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=95380 Bug ID: 95380 Summary: ipcp-unit-growth was renamed to ipa-cp-unit-growth Product: gcc Version: 10.1.0 Status: UNCONFIRMED Severity: normal Priority: P3 Component: web Assignee: unassigned at gcc dot gnu.org Reporter: jiangning.liu at amperecomputing dot com Target Milestone: --- https://gcc.gnu.org/onlinedocs/gcc/Optimize-Options.html#Optimize-Options Option ipcp-unit-growth (9.1.0) has been renamed to ipa-cp-unit-growth (10.1.0), but the document in the link above doesn't reflect the change. The 10.1.0 pdf document at https://gcc.gnu.org/onlinedocs/gcc-10.1.0/gcc.pdf also doesn't have correct info.
[Bug c++/93163] internal compiler error: verify_gimple failed
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=93163 Jiangning Liu changed: What|Removed |Added Status|UNCONFIRMED |RESOLVED Resolution|--- |FIXED --- Comment #3 from Jiangning Liu --- Confirmed that the issue has been fixed on trunk.
[Bug c/93163] internal compiler error: verify_gimple failed
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=93163 --- Comment #1 from Jiangning Liu --- Created attachment 47591 --> https://gcc.gnu.org/bugzilla/attachment.cgi?id=47591&action=edit bad case from llvm build
[Bug c/93163] New: internal compiler error: verify_gimple failed
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=93163 Bug ID: 93163 Summary: internal compiler error: verify_gimple failed Product: gcc Version: 10.0 Status: UNCONFIRMED Severity: normal Priority: P3 Component: c Assignee: unassigned at gcc dot gnu.org Reporter: jiangning.liu at amperecomputing dot com Target Milestone: --- LLVM trunk build with gcc trunk exposed failure "internal compiler error: verify_gimple failed". $ g++ -O3 -c bad.cpp bad.cpp: In constructor ‘{anonymous}::AArch64SIMDInstrOpt::AArch64SIMDInstrOpt()’: bad.cpp:140602:3: error: incorrect sharing of tree nodes 140602 | AArch64SIMDInstrOpt() : MachineFunctionPass(ID) { | ^~~ *D.397057 D.397057->RC = FPR128RegClass; bad.cpp:140602:3: error: incorrect sharing of tree nodes *D.397057 D.397057->RC = FPR128RegClass; bad.cpp:140602:3: error: incorrect sharing of tree nodes *D.397057 D.397057->RC = FPR64RegClass; bad.cpp:140602:3: error: incorrect sharing of tree nodes *D.397057 D.397057->RC = FPR128RegClass; bad.cpp:140602:3: error: incorrect sharing of tree nodes *D.397057 D.397057->RC = FPR64RegClass; bad.cpp:140602:3: error: incorrect sharing of tree nodes *D.397057 D.397057->RC = FPR128RegClass; bad.cpp:140602:3: error: incorrect sharing of tree nodes *D.397057 D.397057->RC = FPR64RegClass; bad.cpp:140602:3: error: incorrect sharing of tree nodes *D.397057 D.397057->RC = FPR128RegClass; bad.cpp:140602:3: error: incorrect sharing of tree nodes *D.397057 D.397057->RC = FPR128RegClass; bad.cpp:140602:3: error: incorrect sharing of tree nodes *D.397057 D.397057->RC = FPR64RegClass; bad.cpp:140602:3: error: incorrect sharing of tree nodes *D.397057 D.397057->RC = FPR128RegClass; bad.cpp:140602:3: error: incorrect sharing of tree nodes *D.397057 D.397057->RC = FPR64RegClass; bad.cpp:140602:3: error: incorrect sharing of tree nodes *D.397057 D.397057->RC = FPR128RegClass; bad.cpp:140602:3: error: incorrect sharing of tree nodes *D.397057 D.397057->RC = FPR64RegClass; during GIMPLE pass: cfg bad.cpp:140602:3: 
internal compiler error: verify_gimple failed 0x100bbff verify_gimple_in_cfg(function*, bool) ../../gcc/gcc/tree-cfg.c:5445 0xebad33 execute_function_todo ../../gcc/gcc/passes.c:1983 0xebbc4b execute_todo ../../gcc/gcc/passes.c:2037 Please submit a full bug report, with preprocessed source if appropriate. Please include the complete backtrace with any bug report. Bisect run shows the failure is related to commit https://gcc.gnu.org/viewcvs/gcc?view=revision&revision=279576
[Bug tree-optimization/92649] dead store elimination
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=92649 --- Comment #5 from Jiangning Liu --- Unrolling 1024 iterations would increase code size a lot, so usually we don't do that. 1024 is only an example. Without knowing we could eliminate most of them, we don't really want to do loop unrolling, I guess. Yes. Assigning 5 to all a's elements is only an example as well. It could be any random value or predefined number. Let me give a more complicated case, extern int rand(void); #define LIVE_SIZE 100 #define DATA_SIZE 256 int f(void) { int a[DATA_SIZE], b[DATA_SIZE][DATA_SIZE]; int i,j; long long s = 0; int next; for (i=0; i
[Bug tree-optimization/92649] dead store elimination
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=92649 --- Comment #3 from Jiangning Liu ---

It is a stupid test, but it is simplified from a real application. To handle even more complicated scenarios, this simple case needs to be addressed first. If we change the case to the one below,

    int f(void)
    {
        int i, a[1024], s=0;
        for (i=0; i<1024; i++)
            a[i] = 5;
        for (i=0; i<37; i++)
            s += a[i];
        return s;
    }

loop peeling will not work, but the compiler should still know that the stores to elements with index >= 37 can all be eliminated. Can any framework in GCC solve this problem?
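For reference, the result the comment above asks the compiler to reach can be written by hand. This is only a sketch of the desired outcome, not a description of any GCC pass, and the names `f_all_stores` and `f_live_stores_only` are hypothetical.

```c
/* The function from the comment: 1024 stores, of which only the
   first 37 are ever read. */
int f_all_stores(void)
{
    int i, a[1024], s = 0;
    for (i = 0; i < 1024; i++)
        a[i] = 5;
    for (i = 0; i < 37; i++)
        s += a[i];
    return s;
}

/* Hand-trimmed equivalent: stores to indices >= 37 are dead because
   the array is local and nothing reads those elements. */
int f_live_stores_only(void)
{
    int i, a[37], s = 0;
    for (i = 0; i < 37; i++)
        a[i] = 5;
    for (i = 0; i < 37; i++)
        s += a[i];
    return s;
}
```

Both functions return 37 * 5 = 185, so removing the trailing stores does not change observable behaviour.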
[Bug tree-optimization/92649] New: dead store elimination
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=92649 Bug ID: 92649 Summary: dead store elimination Product: gcc Version: 10.0 Status: UNCONFIRMED Severity: normal Priority: P3 Component: tree-optimization Assignee: unassigned at gcc dot gnu.org Reporter: jiangning.liu at amperecomputing dot com Target Milestone: ---

For this small case,

    int f(void)
    {
        int i, a[1024];
        for (i=0; i<1024; i++)
            a[i] = 5;
        return a[0];
    }

"gcc -O3" can't figure out that the stores to a[1] through a[1023] can all be eliminated. The assembly code for aarch64 is as below.

            movi v0.4s, 0x5
            sub sp, sp, #4096
            mov x0, sp
            add x1, sp, 4096
    .L2:
            str q0, [x0], 16
            cmp x0, x1
            bne .L2
            ldr w0, [sp]
            add sp, sp, 4096
            ret
[Bug tree-optimization/91246] vectorization failure for a small loop to search array element
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=91246 --- Comment #3 from Jiangning Liu --- Expect to vectorize the inner loop by generating the code below for x86, vpbroadcastd [mem], ymm0 vpaddd [mem], ymm0, ymm1 vpbroadcastd reg, ymm2 vpcmpeqd ymm2, ymm1, k0 kortestw k0, k0 cmovne ... AArch64 should have vectorization instructions counterpart to implement the same functionality.
[Bug tree-optimization/91246] vectorization failure for a small loop to search array element
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=91246 --- Comment #2 from Jiangning Liu --- Created attachment 46626 --> https://gcc.gnu.org/bugzilla/attachment.cgi?id=46626&action=edit A new test Attached is a test case that is more closely matching the real-world code.
[Bug tree-optimization/91246] New: vectorization failure for a small loop to search array element
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=91246 Bug ID: 91246 Summary: vectorization failure for a small loop to search array element Product: gcc Version: 10.0 Status: UNCONFIRMED Severity: normal Priority: P3 Component: tree-optimization Assignee: unassigned at gcc dot gnu.org Reporter: jiangning.liu at amperecomputing dot com Target Milestone: ---

For the following simple case, the inner loop can be completely removed by vectorization. GCC fails to do that. SIZE can be either 4 or 8.

    #define SIZE 4

    int f(int *data, int x)
    {
        int i, j;
        int s = 0;

        for (i = 0; i < 1024; i++) {
            int found = 0;
            for (j = 0; j < SIZE; j++) {
                if (data[j] == x) {
                    found = 1;
                    break;
                }
            }
            s += found;
        }

        return s;
    }
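One way to picture the vectorized form the report above hopes for is a scalar rewrite in which the inner search has no early exit: every element is compared and the results are OR-ed together, roughly what a vector compare followed by a mask test would do. This is a hedged sketch, `f_branch_free` is a hypothetical name, and it assumes all SIZE elements of data are readable even after a match.

```c
#define SIZE 4

/* The function from the report, with the early break. */
int f(int *data, int x)
{
    int i, j;
    int s = 0;

    for (i = 0; i < 1024; i++) {
        int found = 0;
        for (j = 0; j < SIZE; j++) {
            if (data[j] == x) {
                found = 1;
                break;
            }
        }
        s += found;
    }
    return s;
}

/* Branch-free inner search: compare all SIZE elements and OR the
   results. Unlike the original, it reads every element even after a
   match, which is what a full-width vector load would do as well. */
int f_branch_free(int *data, int x)
{
    int i, j;
    int s = 0;

    for (i = 0; i < 1024; i++) {
        int found = 0;
        for (j = 0; j < SIZE; j++)
            found |= (data[j] == x);
        s += found;
    }
    return s;
}
```

With SIZE equal to 4 or 8, the inner loop of the second form maps onto a single vector comparison, which is why the report expects the loop itself to disappear.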
[Bug middle-end/91195] [10 regression] incorrect may be used uninitialized smw (272711, 273474]
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=91195 Jiangning Liu changed: What|Removed |Added CC||msebor at gcc dot gnu.org --- Comment #8 from Jiangning Liu --- Martin is arguing setting the no-warning bit in middle-end for this scenario is not a robust solution at https://gcc.gnu.org/ml/gcc-patches/2019-07/msg01525.html. What about moving the case below to -O3? Could it be acceptable by -Wmaybe-uninitialized tests? tree base = get_base_address (lhs); if (!nontrap->contains (lhs) && auto_var_p (base) && TREE_ADDRESSABLE (base) && optimization_level > 2) { /* Do conditional store replacement by inserting a load. */ }
[Bug middle-end/91195] [10 regression] incorrect may be used uninitialized smw (272711, 273474]
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=91195 --- Comment #6 from Jiangning Liu --- It seems -Werror=maybe-uninitialized cannot always work, and it fails to report the error message for the case below. However, the option name is "maybe-xxx", so I can understand it is OK, but for the same reason, it should be also OK if we report error message for the original case. $ cat pr89430-1.c unsigned test(unsigned k, unsigned b) { unsigned a[2]; if (b < a[k]) { a[k] = b; } return a[0]+a[1]; } $ gcc -O2 -S pr89430-1.c -Werror=maybe-uninitialized $ cat pr89430-1.s .file "pr89430-1.c" .text .p2align 4 .globl test .type test, @function test: .LFB0: .cfi_startproc movl%edi, %edi cmpl%esi, -8(%rsp,%rdi,4) cmovbe -8(%rsp,%rdi,4), %esi movl%esi, -8(%rsp,%rdi,4) movl-4(%rsp), %eax addl-8(%rsp), %eax ret .cfi_endproc .LFE0: .size test, .-test .ident "GCC: (GNU) 10.0.0 20190722 (experimental)" .section.note.GNU-stack,"",@progbits
[Bug middle-end/91195] [10 regression] incorrect may be used uninitialized smw (272711, 273474]
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=91195 --- Comment #3 from Jiangning Liu --- The gcc compilation difference between FOR_UP_LIMIT is 3 and 4 is that, cunrolli can do loop unrolling when FOR_UP_LIMIT is 3, for which the control flow can be significantly simplified, so the conditional store optimization in phiopt will not be triggered. The following code is generated with conditional store optimization, and "cstore_8 = MEM [(void *)&Msg][0];" is inserted in the else branch "if (m1_9(D) != 0B)" statement. [local count: 214748364]: if (m1_9(D) != 0B) goto ; [70.00%] else goto ; [30.00%] [local count: 64424509]: cstore_8 = MEM [(void *)&Msg][0]; [local count: 214748364]: # num_2 = PHI <1(2), 0(3)> # cstore_4 = PHI MEM [(void *)&Msg][0] = cstore_4; The possible solution is to disable this optimization when "-Werror=maybe-uninitialized" is enabled.
[Bug tree-optimization/89134] A missing optimization opportunity for a simple branch in loop
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=89134 --- Comment #13 from Jiangning Liu --- Feng already sent out the 1st patch at https://gcc.gnu.org/ml/gcc-patches/2019-03/msg00541.html . But the 2nd one is related to https://gcc.gnu.org/bugzilla/show_bug.cgi?id=89713 .
[Bug rtl-optimization/89430] A missing ifcvt optimization to generate csel
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=89430 --- Comment #8 from Jiangning Liu --- It is related to https://gcc.gnu.org/ml/gcc-patches/2015-11/msg02998.html Bernd's patch is an overkill.
[Bug rtl-optimization/89430] A missing ifcvt optimization to generate csel
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=89430 --- Comment #7 from Jiangning Liu ---

To avoid the "readonly" issue, try this case,

    unsigned test(unsigned k, unsigned b)
    {
        unsigned a[2];
        if (b < a[k]) {
            a[k] = b;
        }
        return a[0]+a[2];
    }

Variable a is local, and it is NOT readonly, so now the following code is generated,

            sub sp, sp, #16
            uxtw x0, w0
            add x2, sp, 8
            ldr w3, [x2, x0, lsl 2]
            cmp w3, w1
            bls .L2
            str w1, [x2, x0, lsl 2]
    .L2:
            ldr w1, [sp, 8]
            ldr w0, [sp, 16]
            add sp, sp, 16
            add w0, w1, w0
            ret

But gcc should generate the code below instead,

            uxtw x2, w0
            add x3, sp, 8
            ldr w5, [sp, 16]
            ldr w4, [x3, x2, lsl 2]
            cmp w4, w1
            csel w1, w1, w4, hi
            str w1, [x3, x2, lsl 2]
            ldr w0, [sp, 8]
            add sp, sp, 16
            add w0, w0, w5
            ret

Any other glass jaw?
[Bug rtl-optimization/89430] A missing ifcvt optimization to generate csel
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=89430 --- Comment #6 from Jiangning Liu --- (In reply to Richard Biener from comment #5) > (In reply to Jiangning Liu from comment #4) > > >We need to be careful with loads > > >or stores, for instance a load might not trap, while a store would, > > >so if we see a dominating read access this doesn't mean that a later > > >write access would not trap. > > > > Why? For this case, there is a dominating load for the same address. I don't > > see why it might trap. Any example? > > The memory might be mapped readonly. But in such a simple basic block, how can it be mapped readonly? We can easily know it is NOT to do readonly mapping.
[Bug rtl-optimization/89430] A missing ifcvt optimization to generate csel
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=89430 --- Comment #4 from Jiangning Liu --- >We need to be careful with loads >or stores, for instance a load might not trap, while a store would, >so if we see a dominating read access this doesn't mean that a later >write access would not trap. Why? For this case, there is a dominating load for the same address. I don't see why it might trap. Any example?
[Bug rtl-optimization/89430] New: A missing ifcvt optimization to generate csel
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=89430 Bug ID: 89430 Summary: A missing ifcvt optimization to generate csel Product: gcc Version: 9.0 Status: UNCONFIRMED Severity: normal Priority: P3 Component: rtl-optimization Assignee: unassigned at gcc dot gnu.org Reporter: jiangning.liu at amperecomputing dot com Target Milestone: ---

For a small case,

    unsigned *a;

    void test(unsigned k, unsigned b)
    {
        if (b < a[k]) {
            a[k] = b;
        }
    }

"gcc -O3 -S" generates,

            adrp x2, a
            uxtw x0, w0
            ldr x2, [x2, #:lo12:a]
            ldr w3, [x2, x0, lsl 2]
            cmp w3, w1
            bls .L1
            str w1, [x2, x0, lsl 2]

Actually we should use a csel instruction instead of a conditional branch, so we expect the following to be generated,

            adrp x2, a
            uxtw x0, w0
            ldr x2, [x2, #:lo12:a]
            ldr w3, [x2, x0, lsl 2]
            cmp w3, w1
            csel w1, w1, w3, hi
            str w1, [x2, x0, lsl 2]

The RTL ifcvt optimization misses this opportunity.
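At the C level, the requested csel form corresponds to selecting the value and then storing unconditionally. The sketch below puts the two forms side by side. The names `test_branch` and `test_select` are hypothetical, and the global `a` mirrors the one in the report.

```c
unsigned *a;

/* Form from the report: the store happens only when b < a[k]. */
void test_branch(unsigned k, unsigned b)
{
    if (b < a[k]) {
        a[k] = b;
    }
}

/* Select-then-store form matching the expected csel sequence:
   the store always happens, possibly writing back the old value. */
void test_select(unsigned k, unsigned b)
{
    unsigned old = a[k];
    a[k] = (b < old) ? b : old;
}
```

Both forms leave the same values in memory, but the second one always writes. That is why the follow-up comments on this bug discuss whether a store after a dominating load of the same address can still trap, for example when the memory is mapped read-only.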
[Bug tree-optimization/89134] A missing optimization opportunity for a simple branch in loop
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=89134 --- Comment #10 from Jiangning Liu --- (In reply to Martin Sebor from comment #9) > But since GCC emits infinite loops regardless of whether or not > they have any side-effects, whether inc() is pure or not may not matter. I think "for (; it != m.end (); ++it); /* get an empty loop */" is a finite loop.
[Bug tree-optimization/89134] A missing optimization opportunity for a simple branch in loop
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=89134 --- Comment #5 from Jiangning Liu --- The loop below should be treated as a finite loop, for (iter = booktable.begin(); iter!=booktable.end(); ++iter) { ... } so there is a chance to optimize away the empty loop, in which do_something doesn't exist at all.
[Bug tree-optimization/89134] A missing optimization opportunity for a simple branch in loop
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=89134 Jiangning Liu changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
             Status|RESOLVED                    |UNCONFIRMED
         Resolution|INVALID                     |---

--- Comment #2 from Jiangning Liu ---
The original case is only a simple example, and what if GCC can figure out it is NOT an infinite loop? For example,

    std::map BookTable;
    BookTable::iterator iter;
    BookTable booktable;
    for (iter = booktable.begin(); iter!=booktable.end(); ++iter) {
        if (b) {
            b = do_something();
        }
    }

Then GCC should be able to figure out that this loop is finite, because it iterates over a standard C++ STL std::map. The cost of iterating over a std::map might be high, so we had better consider optimizing away the empty loop.
[Bug tree-optimization/89134] New: A missing optimization opportunity for a simple branch in loop
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=89134 Bug ID: 89134 Summary: A missing optimization opportunity for a simple branch in loop Product: gcc Version: 9.0 Status: UNCONFIRMED Severity: normal Priority: P3 Component: tree-optimization Assignee: unassigned at gcc dot gnu.org Reporter: jiangning.liu at amperecomputing dot com Target Milestone: --- For this simple case, __attribute__((pure)) __attribute__((noinline)) int inc(int i) { /* Do something else here */ return i+1; } extern int do_something(void); extern int b; void test(int n) { for (int i=0; i
[Bug tree-optimization/88492] New: SLP optimization generates ugly code
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=88492 Bug ID: 88492 Summary: SLP optimization generates ugly code Product: gcc Version: 9.0 Status: UNCONFIRMED Severity: normal Priority: P3 Component: tree-optimization Assignee: unassigned at gcc dot gnu.org Reporter: jiangning.liu at amperecomputing dot com Target Milestone: ---

For aarch64, SLP optimization generates ugly code for the case below,

    int test_slp( unsigned char *b )
    {
        unsigned int tmp[4][4];
        int sum = 0;
        for( int i = 0; i < 4; i++, b += 4 )
        {
            tmp[i][0] = b[0];
            tmp[i][2] = b[1];
            tmp[i][1] = b[2];
            tmp[i][3] = b[3];
        }
        for( int i = 0; i < 4; i++ )
        {
            sum += tmp[0][i] + tmp[1][i] + tmp[2][i] + tmp[3][i];
        }
        return sum;
    }

With command line "gcc -O3", the following code is generated,

test_slp:
        adrp    x1, 0
        sub     sp, sp, #0x40
        ldr     q1, [x0]
        ldr     q0, [x1]
        tbl     v1.16b, {v1.16b}, v0.16b
        uxtl    v2.8h, v1.8b
        uxtl2   v1.8h, v1.16b
        uxtl    v3.4s, v2.4h
        uxtl2   v2.4s, v2.8h
        uxtl    v0.4s, v1.4h
        uxtl2   v1.4s, v1.8h
        fmov    x0, d3
        stp     q3, q2, [sp]
        ldr     w8, [sp, #24]
        stp     q0, q1, [sp, #32]
        fmov    x2, d1
        lsr     x1, x0, #32
        fmov    x0, d2
        ldp     w6, w5, [sp, #8]
        lsr     x3, x2, #32
        ldr     w2, [sp, #40]
        lsr     x7, x0, #32
        fmov    x0, d0
        add     v0.2s, v0.2s, v1.2s
        add     w7, w7, w1
        add     w6, w6, w8
        ldr     w8, [sp, #28]
        lsr     x0, x0, #32
        fmov    w1, s0
        add     v0.2s, v3.2s, v2.2s
        add     w3, w3, w0
        add     w3, w3, w7
        ldp     w0, w7, [sp, #56]
        fmov    w4, s0
        add     w2, w2, w0
        ldr     w0, [sp, #44]
        add     w2, w2, w6
        add     w1, w1, w4
        add     w0, w0, w7
        add     w1, w1, w3
        add     w3, w5, w8
        add     w1, w1, w2
        add     w0, w0, w3
        add     w0, w1, w0
        add     sp, sp, #0x40
        ret

The code is vectorized, but ugly instructions are generated as well, e.g. memory stores and register copies from SIMD registers to general-purpose registers.

With command line "gcc -O3 -fno-tree-slp-vectorize", the following code can be generated, and it looks pretty clean. Usually, this code sequence is friendly to hardware prefetch.

test_slp:
        ldrb    w4, [x0, #8]
        ldrb    w2, [x0, #4]
        ldrb    w1, [x0, #12]
        ldrb    w3, [x0]
        ldrb    w6, [x0, #10]
        add     w1, w1, w4
        ldrb    w5, [x0, #6]
        add     w3, w3, w2
        ldrb    w4, [x0, #14]
        add     w1, w1, w3
        ldrb    w2, [x0, #2]
        ldrb    w3, [x0, #1]
        add     w4, w4, w6
        ldrb    w7, [x0, #9]
        add     w2, w2, w5
        ldrb    w6, [x0, #5]
        add     w4, w4, w2
        ldrb    w5, [x0, #13]
        add     w1, w1, w4
        add     w3, w3, w6
        ldrb    w2, [x0, #3]
        add     w5, w5, w7
        ldrb    w4, [x0, #15]
        add     w3, w3, w5
        ldrb    w6, [x0, #7]
        ldrb    w5, [x0, #11]
        add     w1, w1, w3
        add     w0, w2, w6
        add     w2, w4, w5
        add     w0, w0, w2
        add     w0, w1, w0
        ret

Anyway, it looks like the heuristic that enables SLP optimization needs to be improved.
[Bug tree-optimization/88459] New: vectorization failure for a simple sum reduction loop
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=88459 Bug ID: 88459 Summary: vectorization failure for a simple sum reduction loop Product: gcc Version: 9.0 Status: UNCONFIRMED Severity: normal Priority: P3 Component: tree-optimization Assignee: unassigned at gcc dot gnu.org Reporter: jiangning.liu at amperecomputing dot com Target Milestone: ---

For the simple loop below, gcc -O3 fails to vectorize it.

    unsigned int tmp[1024];
    unsigned int test_vec(int n)
    {
        int sum = 0;
        for(int i = 0; i < 1024; i++)
        {
            sum += tmp[i];
        }
        return sum;
    }

The kernel loop is,

.L2:
        ldr     w2, [x1], 4
        add     w0, w0, w2
        cmp     x3, x1
        bne     .L2

But if we change the data type of sum from "int" to "unsigned int" as below,

    unsigned int tmp[1024];
    unsigned int test_vec(int n)
    {
        unsigned int sum = 0;
        for(int i = 0; i < 1024; i++)
        {
            sum += tmp[i];
        }
        return sum;
    }

gcc can vectorize it, and the kernel loop is like,

.L2:
        ldr     q1, [x0], 16
        add     v0.4s, v0.4s, v1.4s
        cmp     x1, x0
        bne     .L2
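Both variants in this report perform the same arithmetic: in "sum += tmp[i]" with an int sum, C's usual arithmetic conversions convert sum to unsigned int, add, and convert the result back to int. The sketch below puts the two side by side so the equivalence can be checked. sum_int and sum_unsigned are hypothetical names, and the sketch assumes the out-of-range unsigned-to-int conversion wraps modulo 2^32, which is implementation-defined in ISO C and is what GCC documents.

```c
unsigned int tmp[1024];

/* Variant with a signed accumulator, as in the first test case. */
unsigned int sum_int(void)
{
    int sum = 0;
    for (int i = 0; i < 1024; i++) {
        /* sum is converted to unsigned, added, then converted back to int. */
        sum += tmp[i];
    }
    return sum;
}

/* Variant with an unsigned accumulator, as in the second test case. */
unsigned int sum_unsigned(void)
{
    unsigned int sum = 0;
    for (int i = 0; i < 1024; i++) {
        sum += tmp[i];
    }
    return sum;
}
```

Under that assumption the two functions return the same value for any contents of tmp, so writing the accumulator as unsigned int is a source-level workaround until the vectorizer handles the signed form.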
[Bug tree-optimization/88398] vectorization failure for a small loop to do byte comparison
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=88398 --- Comment #4 from Jiangning Liu --- I expect "gcc -O3 -flto" could work.
[Bug tree-optimization/88398] vectorization failure for a small loop to do byte comparison
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=88398 --- Comment #2 from Jiangning Liu --- memcmp doesn't return the position where they differ.
[Bug tree-optimization/88398] New: vectorization failure for a small loop to do byte comparison
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=88398 Bug ID: 88398 Summary: vectorization failure for a small loop to do byte comparison Product: gcc Version: 9.0 Status: UNCONFIRMED Severity: normal Priority: P3 Component: tree-optimization Assignee: unassigned at gcc dot gnu.org Reporter: jiangning.liu at amperecomputing dot com Target Milestone: ---

For the small case below, GCC -O3 can't vectorize the small loop to do byte comparison in func2.

    void *malloc(long unsigned int);
    typedef struct {
        unsigned char *buffer;
    } data;

    static unsigned char *func1(data *d)
    {
        return d->buffer;
    }

    static int func2(int max, int pos, unsigned char *cur)
    {
        unsigned char *p = cur + pos;
        int len = 0;
        while (++len != max)
            if (p[len] != cur[len])
                break;
        return cur[len];
    }

    int main (int argc)
    {
        data d;
        d.buffer = malloc(2*argc);
        return func2(argc, argc, func1(&d));
    }

At the moment, the following code is generated for this loop,

  4004d4:       ldrb    w2, [x3,x1]
  4004d8:       cmp     w2, w0
  4004dc:       b.ne    4004f0
  4004e0:       ldrb    w0, [x4,x1]
  4004e4:       cmp     w19, w1
  4004e8:       add     x1, x1, #0x1
  4004ec:       b.ne    4004d4

In fact, this loop can be vectorized by checking whether the comparison size is aligned to the SIMD register length. It may introduce run-time overhead, but the cost model could decide whether to do it or not.
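One way to read the suggestion in this report is a blocked comparison followed by a scalar tail. The sketch below is only an illustration of that shape: memcmp on 16-byte blocks stands in for a SIMD compare, and CHUNK, first_mismatch_scalar and first_mismatch_chunked are hypothetical names. It assumes max >= 1 and that both buffers are readable for max bytes, which the scalar loop does not require because it stops at the first mismatch.

```c
#include <string.h>

#define CHUNK 16  /* assumed SIMD register width in bytes */

/* The loop from func2, returning the position instead of cur[len]. */
static int first_mismatch_scalar(const unsigned char *p,
                                 const unsigned char *cur, int max)
{
    int len = 0;
    while (++len != max)
        if (p[len] != cur[len])
            break;
    return len;
}

/* Blocked variant: skip whole equal blocks, then finish byte by byte. */
static int first_mismatch_chunked(const unsigned char *p,
                                  const unsigned char *cur, int max)
{
    int len = 1;  /* the original loop never compares index 0 */
    while (max - len >= CHUNK && memcmp(p + len, cur + len, CHUNK) == 0)
        len += CHUNK;
    /* Scalar tail: locates the mismatch inside a block, or reaches max. */
    while (len != max && p[len] == cur[len])
        len++;
    return len;
}
```

Unlike memcmp alone, both functions return the position where the buffers first differ, which is what comment #2 on this bug points out is needed.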
[Bug tree-optimization/88259] New: vectorization failure for a typical loop for getting max value and index
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=88259 Bug ID: 88259 Summary: vectorization failure for a typical loop for getting max value and index Product: gcc Version: 9.0 Status: UNCONFIRMED Severity: normal Priority: P3 Component: tree-optimization Assignee: unassigned at gcc dot gnu.org Reporter: jiangning.liu at amperecomputing dot com Target Milestone: ---

GCC -O3 can't vectorize the following typical loop for getting max value and index from an array.

    void test_vec(int *data, int n)
    {
        int best_i, best = 0;
        for (int i = 0; i < n; i++) {
            if (data[i] > best) {
                best = data[i];
                best_i = i;
            }
        }
        data[best_i] = data[0];
        data[0] = best;
    }

The code generated in the kernel loop is as below,

.L4:
        ldr     w4, [x0, x2, lsl 2]
        cmp     w3, w4
        csel    w6, w4, w3, lt
        csel    w5, w2, w5, lt
        add     x2, x2, 1
        mov     w3, w6
        cmp     w1, w2
        bgt     .L4

If n is a constant like 1024, gcc -O3 still fails to vectorize it.

If we only get the max value and keep only one statement in the if statement inside the loop,

    void test_vec(int *data, int n)
    {
        int best = 0;
        for (int i = 0; i < n; i++) {
            if (data[i] > best) {
                best = data[i];
            }
        }
        data[0] = best;
    }

"gcc -O3" can do vectorization and the kernel loop is like below,

.L4:
        ldr     q1, [x2], 16
        smax    v0.4s, v0.4s, v1.4s
        cmp     x2, x3
        bne     .L4
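Since the max-only loop in this report does vectorize, one possible source-level workaround is to split the work into two passes: a max reduction, then a search for the first position holding that value. This is a sketch under assumptions, not something proposed in the report: max_and_index is a hypothetical name, the -1 result for "no element greater than 0" is an added convention (the original leaves best_i uninitialized in that case), and the report says nothing about whether the second, early-exit loop vectorizes.

```c
/* Returns the index of the first maximum element greater than 0,
   or -1 if there is none; the maximum itself is stored in *best_out. */
static int max_and_index(const int *data, int n, int *best_out)
{
    int best = 0;
    /* Pass 1: plain max reduction, the form the report shows as vectorized. */
    for (int i = 0; i < n; i++) {
        if (data[i] > best) {
            best = data[i];
        }
    }

    int best_i = -1;
    if (best > 0) {
        /* Pass 2: first occurrence of the maximum, matching the strict
           '>' comparison in the original single-pass loop. */
        for (int i = 0; i < n; i++) {
            if (data[i] == best) {
                best_i = i;
                break;
            }
        }
    }
    *best_out = best;
    return best_i;
}
```

The trade-off is a second walk over the array, so this only pays off when the vectorized first pass saves more than the extra scan costs.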
[Bug tree-optimization/86530] Vectorization failure for a simple loop
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=86530 --- Comment #1 from Jiangning Liu --- Created attachment 44396 --> https://gcc.gnu.org/bugzilla/attachment.cgi?id=44396&action=edit vectorization failure Attached is the -O3 result for aarch64, in which no vectorized code is generated at all.
[Bug tree-optimization/86530] New: Vectorization failure for a simple loop
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=86530 Bug ID: 86530 Summary: Vectorization failure for a simple loop Product: gcc Version: 9.0 Status: UNCONFIRMED Severity: normal Priority: P3 Component: tree-optimization Assignee: unassigned at gcc dot gnu.org Reporter: jiangning.liu at amperecomputing dot com Target Milestone: ---

GCC -O3 can't vectorize the following simple case.

    $ cat test_loop_2.c
    int test_loop_2(char *p1, char *p2)
    {
        int s = 0;
        for(int i=0; i<4; i++, p1+=4, p2+=4)
        {
            s += (p1[0]-p2[0]) + (p1[1]-p2[1]) + (p1[2]-p2[2]) + (p1[3]-p2[3]);
        }
        return s;
    }

The vector size is 4*1=4 bytes, which doesn't directly fit into an 8-byte or 16-byte vector, but we can still extend each element to 32 bits and use vector operations on a 4*4=16-byte vector.
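The widening idea in this report can be written out in scalar C as four 32-bit lanes that are summed at the end. This is only a sketch of the intended data flow, not vectorizer output; test_loop_2_widened and the lanes array are hypothetical names.

```c
/* Original loop from the report. */
int test_loop_2(char *p1, char *p2)
{
    int s = 0;
    for (int i = 0; i < 4; i++, p1 += 4, p2 += 4) {
        s += (p1[0]-p2[0]) + (p1[1]-p2[1]) + (p1[2]-p2[2]) + (p1[3]-p2[3]);
    }
    return s;
}

/* Hypothetical lane-wise form: each byte is extended to a 32-bit int,
   so the four lanes together model one 4*4=16-byte vector accumulator. */
int test_loop_2_widened(char *p1, char *p2)
{
    int lanes[4] = {0, 0, 0, 0};
    for (int i = 0; i < 4; i++, p1 += 4, p2 += 4) {
        for (int j = 0; j < 4; j++) {
            lanes[j] += (int)p1[j] - (int)p2[j];
        }
    }
    /* Final horizontal add across the four lanes. */
    return lanes[0] + lanes[1] + lanes[2] + lanes[3];
}
```

Because the differences are small integers, reordering the additions this way cannot overflow, and both functions return the same result.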
[Bug tree-optimization/86504] vectorization failure for a nest loop
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=86504 --- Comment #1 from Jiangning Liu --- Created attachment 44387 --> https://gcc.gnu.org/bugzilla/attachment.cgi?id=44387&action=edit bad vectorization result for boundary size 8
[Bug tree-optimization/86504] New: vectorization failure for a nest loop
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=86504 Bug ID: 86504 Summary: vectorization failure for a nest loop Product: gcc Version: 9.0 Status: UNCONFIRMED Severity: normal Priority: P3 Component: tree-optimization Assignee: unassigned at gcc dot gnu.org Reporter: jiangning.liu at amperecomputing dot com Target Milestone: --- Created attachment 44386 --> https://gcc.gnu.org/bugzilla/attachment.cgi?id=44386&action=edit bad vectorization result for boundary size 16

For the case below, the code generated by “gcc -O3” is very ugly, and the inner loop can be correctly vectorized. Please refer to attached file test_loop_inner_16.s.

    char g_d[1024], g_s1[1024], g_s2[1024];
    void test_loop(void)
    {
        char *d = g_d, *s1 = g_s1, *s2 = g_s2;
        for ( int y = 0; y < 128; y++ )
        {
            for ( int x = 0; x < 16; x++ )
                d[x] = s1[x] + s2[x];
            d += 16;
        }
    }

If we change the inner loop “for ( int x = 0; x < 16; x++ )” to “for ( int x = 0; x < 32; x++ )”, i.e. the loop boundary size changes from 16 to 32, very beautiful vectorized code is generated. For example, the code below is the aarch64 result for loop boundary size 32, and it is the same case for x86.

test_loop:
.LFB0:
        .cfi_startproc
        adrp    x2, g_s1
        adrp    x3, g_s2
        add     x2, x2, :lo12:g_s1
        add     x3, x3, :lo12:g_s2
        adrp    x0, g_d
        adrp    x1, g_d+2048
        add     x0, x0, :lo12:g_d
        add     x1, x1, :lo12:g_d+2048
        ldp     q1, q2, [x2]
        ldp     q3, q0, [x3]
        add     v1.16b, v1.16b, v3.16b
        add     v0.16b, v0.16b, v2.16b
        .p2align 3,,7
.L2:
        str     q1, [x0]
        str     q0, [x0, 16]!
        cmp     x0, x1
        bne     .L2
        ret

The code generated for loop boundary size 8 is also very bad. Any idea?