[Bug tree-optimization/114760] New: traling zero count detection failure

2024-04-17 Thread jiangning.liu at amperecomputing dot com via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=114760

Bug ID: 114760
   Summary: traling zero count detection failure
   Product: gcc
   Version: 14.0
Status: UNCONFIRMED
  Severity: normal
  Priority: P3
 Component: tree-optimization
  Assignee: unassigned at gcc dot gnu.org
  Reporter: jiangning.liu at amperecomputing dot com
  Target Milestone: ---

For this small case, gcc fails to recognize the trailing-zero-count
computation, so it cannot generate the x86 tzcnt instruction, whereas clang can.

unsigned  ntz32_6a(unsigned x) {
  int n;

  n = 32;
  while (x != 0) {
    n = n - 1;
    x = x + x;
  }
  return n;
}

If we change "x = x + x" to "x = x << 1", the optimization works.

unsigned  ntz32_6a(unsigned x) {
  int n;

  n = 32;
  while (x != 0) {
    n = n - 1;
    x = x << 1;
  }
  return n;
}

It seems number_of_iterations_cltz/number_of_iterations_cltz_complement in
tree-ssa-loop-niter.cc, or some other place, needs to be enhanced.
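One way to check that the two loop forms compute the same value is to compare them with the compiler builtin. This is only a sketch: the names ntz32_add, ntz32_shl and ntz32_ref are made up for illustration, unsigned is assumed to be 32 bits wide, and __builtin_ctz is the GCC/Clang builtin, which is undefined for 0, so that input is handled separately.

```c
/* Loop from the report: doubles x with an addition. */
unsigned ntz32_add(unsigned x)
{
  int n = 32;
  while (x != 0) {
    n = n - 1;
    x = x + x;
  }
  return (unsigned)n;
}

/* Same loop, but doubling x with a shift. */
unsigned ntz32_shl(unsigned x)
{
  int n = 32;
  while (x != 0) {
    n = n - 1;
    x = x << 1;
  }
  return (unsigned)n;
}

/* Reference: trailing zero count via the builtin, with 32 for x == 0. */
unsigned ntz32_ref(unsigned x)
{
  return x ? (unsigned)__builtin_ctz(x) : 32u;
}
```

All three functions should agree for every input, which is why one would hope both loops end up as the same tzcnt-based code.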

[Bug tree-optimization/98138] BB vect fail to SLP one case

2023-10-04 Thread jiangning.liu at amperecomputing dot com via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=98138

--- Comment #12 from Jiangning Liu ---
Hi Richi,

> That said, "failure" to identify the common (vector) load is known
> and I do have experimental patches trying to address that but did
> not yet arrive at a conclusive "best" approach.

That was a long time ago, so do you have the "best" approach now?

Thanks,
-Jiangning

[Bug target/106671] aarch64: BTI instruction are not inserted for cross-section direct calls

2023-08-14 Thread jiangning.liu at amperecomputing dot com via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=106671

--- Comment #11 from Jiangning Liu ---
Hi Wilco,

> "it means we will need a linker optimization to remove those redundant BTIs 
> (eg. by changing them into NOPs)"

That would only be a performance optimization, right? If we don't care about
performance, the linker doesn't need to turn them into NOPs, right? It could
still be useful if we only do this for a specific module.

Thanks,
-Jiangning

[Bug tree-optimization/109603] New: Vectorization failure for a small loop containing a simple branch

2023-04-24 Thread jiangning.liu at amperecomputing dot com via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=109603

Bug ID: 109603
   Summary: Vectorization failure for a small loop containing a
simple branch
   Product: gcc
   Version: 14.0
Status: UNCONFIRMED
  Severity: normal
  Priority: P3
 Component: tree-optimization
  Assignee: unassigned at gcc dot gnu.org
  Reporter: jiangning.liu at amperecomputing dot com
  Target Milestone: ---

For the following small case,

#include <stdio.h>
#include <stdlib.h>
#include <time.h>

#define NANOSECS 1000000000L

int main(int argc, char * argv[])
{
  long long i, even, odd, c;
  char *eptr;
  struct timespec ts0, ts1;

  c = strtoll(argv[1], &eptr, 10);

  printf("c = %lld \n", c);

  even = odd = 0;

  clock_gettime(CLOCK_MONOTONIC, &ts0);

  for (i = 0; i < c; i++)
  {
    if (i % 2)
      even++;
    else
      odd++;
  }

  clock_gettime(CLOCK_MONOTONIC, &ts1);

  printf("even = %lld odd = %lld\n", even, odd);
  printf("elapsed %ld\n", (ts1.tv_sec - ts0.tv_sec) * NANOSECS +
         (ts1.tv_nsec - ts0.tv_nsec));

  return 0;
}

Using "-mcpu=neoverse-n1" gcc fails to vectorize the loop, while using
"-mcpu=neoverse-n1 -mtune=generic" or without -mcpu and -mtune, gcc can
successfully vectorize it.



The scalar version for the loop is like,

  400660:   36000381    tbz     w1, #0, 4006d0
  400664:   91000694    add     x20, x20, #0x1
  400668:   91000421    add     x1, x1, #0x1
  40066c:   eb01027f    cmp     x19, x1
  400670:   54ffff81    b.ne    400660  // b.any
  ...
  4006d0:   910006b5    add     x21, x21, #0x1
  4006d4:   17ffffe5    b       400668

The vectorization version is like below (factor=2), and it is much faster on
neoverse-n1.

  400670:   91000421    add     x1, x1, #0x1
  400674:   4e241c20    and     v0.16b, v1.16b, v4.16b
  400678:   4ee48421    add     v1.2d, v1.2d, v4.2d
  40067c:   4ee09800    cmeq    v0.2d, v0.2d, #0
  400680:   6e631ca0    bsl     v0.16b, v5.16b, v3.16b
  400684:   4ee08442    add     v2.2d, v2.2d, v0.2d
  400688:   eb13003f    cmp     x1, x19
  40068c:   54ffff21    b.ne    400670  // b.any



It seems the neoverse-n1 vector cost model is inaccurate and does not work well
for this small case.

(1) For -mcpu=neoverse-n1 version, the vectorization cost model result is

Vector inside of loop cost: 12
Scalar iteration cost: 5

12 > 5*2, so gcc doesn't think it's worth doing vectorization for factor=2.

(2) For the version without -mcpu, the vectorization cost model result is

Vector inside of loop cost: 4
Scalar iteration cost: 5

Here the loop body cost for the vectorized version is 4, which is too small and
looks incorrect as well, although in reality the vectorized version is faster
than the scalar version. In contrast, the 12 for -mcpu=neoverse-n1 looks more
reasonable, although it blocks the vectorization.
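The comparison quoted above ("12 > 5*2") can be written out as a tiny helper. This is a simplified sketch of that one inequality only, not GCC's actual cost model code; it ignores any costs outside the loop body, and the function name vector_body_is_cheaper is hypothetical.

```c
#include <stdbool.h>

/* One vector iteration replaces `vf` scalar iterations, so the vector loop
   body is considered cheaper only if its cost is below vf scalar
   iterations.  Simplified illustration; not taken from GCC sources.  */
bool vector_body_is_cheaper(int vector_inside_cost, int scalar_iter_cost,
                            int vf)
{
  return vector_inside_cost < scalar_iter_cost * vf;
}
```

With the numbers from the report and factor=2, the neoverse-n1 costs (12 versus 5) fail this test, while the costs seen without -mcpu (4 versus 5) pass it.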

[Bug rtl-optimization/109343] New: invalid if conversion optimization for aarch64

2023-03-30 Thread jiangning.liu at amperecomputing dot com via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=109343

Bug ID: 109343
   Summary: invalid if conversion optimization for aarch64
   Product: gcc
   Version: rust/master
Status: UNCONFIRMED
  Severity: normal
  Priority: P3
 Component: rtl-optimization
  Assignee: unassigned at gcc dot gnu.org
  Reporter: jiangning.liu at amperecomputing dot com
  Target Milestone: ---

For this small case, if-conversion optimization in back-end generated csel
instruction for aarch64, which is unsafe. The address of variable sga_var could
be invalid if sga_mapped is false.

$ cat ttt2.c
extern int sga_mapped, sga_var;
int func(int j){
    int i=0;
    if(sga_mapped)
        i=i+sga_var;
    return i;
}
$ gcc -O3 -S ttt2.c
$ cat ttt2.s
.arch armv8-a
.file   "ttt2.c"
.text
.align  2
.p2align 4,,11
.global func
.type   func, %function
func:
.LFB0:
.cfi_startproc
adrp    x0, sga_mapped
adrp    x1, sga_var
ldr w0, [x0, #:lo12:sga_mapped]
ldr w1, [x1, #:lo12:sga_var]
cmp w0, 0
csel    w0, w1, w0, ne
ret
.cfi_endproc
.LFE0:
.size   func, .-func
.ident  "GCC: (GNU) 12.2.1 20221121 (Red Hat 12.2.1-4)"
.section    .note.GNU-stack,"",@progbits

For x86, the following code is generated. It is safe because the memory access
to sga_var(%rip) won't be really triggered if %eax is not set. Here x86 and
aarch64 are different.

$ cat ttt2.s
.file   "ttt2.c"
.text
.p2align 4
.globl  func
.type   func, @function
func:
.LFB0:
.cfi_startproc
endbr64
movl    sga_mapped(%rip), %eax
testl   %eax, %eax
cmovne  sga_var(%rip), %eax
ret
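In C terms, the AArch64 sequence above corresponds to loading first and selecting afterwards. The sketch below uses pointer parameters instead of the extern variables so that it can be run; the names load_if_mapped and load_then_select are hypothetical.

```c
#include <stddef.h>

/* Source-level semantics: *var is read only when mapped is nonzero. */
int load_if_mapped(int mapped, const int *var)
{
  int i = 0;
  if (mapped)
    i = i + *var;
  return i;
}

/* What the ldr + csel sequence effectively does: read *var
   unconditionally, then select.  This is only valid when var is known
   to be dereferenceable even if mapped is zero.  */
int load_then_select(int mapped, const int *var)
{
  int loaded = *var;            /* unconditional load */
  return mapped ? loaded : 0;   /* select, as csel does */
}
```

The two functions return the same value whenever var points to readable memory; they differ only when it does not, which is the case the report is concerned about.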

[Bug tree-optimization/89430] A missing ifcvt optimization to generate csel

2022-11-11 Thread jiangning.liu at amperecomputing dot com via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=89430

--- Comment #17 from Jiangning Liu ---
Yes.

> -Original Message-
> From: tnfchris at gcc dot gnu.org 
> Sent: Friday, November 11, 2022 4:48 PM
> To: JiangNing Liu 
> Subject: [Bug tree-optimization/89430] A missing ifcvt optimization to
> generate csel
> 
> https://gcc.gnu.org/bugzilla/show_bug.cgi?id=89430
> 
> --- Comment #16 from Tamar Christina ---
> I think this can be closed now right?

[Bug c/106823] New: #pragma GCC diagnostic ignored "-Wattribute-warning" doesn't work for -flto

2022-09-03 Thread jiangning.liu at amperecomputing dot com via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=106823

Bug ID: 106823
   Summary: #pragma GCC diagnostic ignored "-Wattribute-warning"
doesn't work for -flto
   Product: gcc
   Version: 13.0
Status: UNCONFIRMED
  Severity: normal
  Priority: P3
 Component: c
  Assignee: unassigned at gcc dot gnu.org
  Reporter: jiangning.liu at amperecomputing dot com
  Target Milestone: ---

$ cat foo.cpp
extern "C" __attribute__((__warning__(""))) void _foo(int) {};

void foo(int num) {
#pragma GCC diagnostic ignored "-Wattribute-warning"
  ::_foo(num);
}

int main()
{
foo(1);
}
$ g++ foo.cpp
$ g++ -flto foo.cpp
foo.cpp: In function ‘foo’:
foo.cpp:5:9: warning: call to ‘_foo’ declared with attribute warning: 
[-Wattribute-warning]
5 |   ::_foo(num);
  | ^

[Bug rtl-optimization/98782] [11/12 Regression] Bad interaction between IPA frequences and IRA resulting in spills due to changes in BB frequencies

2021-11-28 Thread jiangning.liu at amperecomputing dot com via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=98782

--- Comment #7 from Jiangning Liu  ---
Without reverting the commit g:1118a3ff9d3ad6a64bba25dc01e7703325e23d92, we
still see exchange2 performance issue for aarch64. BTW, we have been using
-fno-inline-functions-called-once to get the best performance number for
exchange2.

[Bug tree-optimization/100511] Fail to remove dead code in loop

2021-05-11 Thread jiangning.liu at amperecomputing dot com via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=100511

--- Comment #5 from Jiangning Liu  ---
If we change "c3 = a" to "c3 = x->b", GCC can optimize it without IPA. It seems
VRP is working for this case.

$ cat tt7.c
#include <stdio.h>

int a;
typedef struct {
int b;
int count;
} XX;

int g;

__attribute__((noinline)) void f(XX *x)
{
int c1 = 0;
int c3 = x->b;
if (x)
c1 = x->count;
for (int i=0; icount) {
if (i > c3) {
printf("Unreachable!");
break;
}
else
g = 2;
} else
g = i;
}
}

void main(void)
{
XX x;
x.count = 100;
a = 100;
f(&x);
}

[Bug tree-optimization/100511] Fail to remove dead code in loop

2021-05-10 Thread jiangning.liu at amperecomputing dot com via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=100511

--- Comment #2 from Jiangning Liu  ---
Then why can't gcc optimize this case either? sizeof(XX) != sizeof(g) here.

#include <stdio.h>

int a;
typedef struct {
int b;
int count;
} XX;

int g;

__attribute__((noinline)) void f(XX *x)
{
int c1 = 0;
int c3 = a;
if (x)
c1 = x->count;
for (int i=0; icount) {
if (i > c3) {
printf("Unreachable!");
break;
}
else
g = 2;
} else
g = i;
}
}

void main(void)
{
XX x;
x.count = 100;
a = 100;
f(&x);
}

[Bug tree-optimization/100511] New: Fail to remove dead code in loop

2021-05-10 Thread jiangning.liu at amperecomputing dot com via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=100511

Bug ID: 100511
   Summary: Fail to remove dead code in loop
   Product: gcc
   Version: 12.0
Status: UNCONFIRMED
  Severity: normal
  Priority: P3
 Component: tree-optimization
  Assignee: unassigned at gcc dot gnu.org
  Reporter: jiangning.liu at amperecomputing dot com
  Target Milestone: ---

For this simple case, gcc doesn't know the if condition (i > c2) is always
false.

#include <stdio.h>

typedef struct {
int count;
} XX;

int g;

__attribute__((noinline)) void f(XX *x)
{
    int c1 = 0;
    if (x)
        c1 = x->count;
    for (int i=0; i<c1; i++) {
        int c2 = x->count;
        if (i > c2) {
            printf("Unreachable!");
            break;
        }
        else
            g = i;
    }
}

void main(void)
{
XX x;
x.count = 100;
f(&x);
}

If we change the type of variable g to float, gcc does optimize away this if
condition inside the loop, so why can't alias analysis recognize that g is
different from x->count?
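If the hypothesis above is right (the store to g is assumed to possibly modify x->count, since both are int), then reading x->count once into a local removes the dependence by hand. This is only a sketch under the assumption that g and *x never overlap; the name f_hoisted is hypothetical.

```c
typedef struct {
  int count;
} XX;

int g;

/* x->count is read once into a local, so stores to g cannot change the
   loop bound.  With i < count guaranteed by the loop condition, the
   "i > count" check can never hold and is dropped.  */
void f_hoisted(XX *x)
{
  int count = x ? x->count : 0;
  for (int i = 0; i < count; i++)
    g = i;
}
```

This is the form one would expect the compiler to reach on its own once it knows the store to g does not affect x->count.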

[Bug tree-optimization/99946] fail to exchange if conditions in terms of likely/unlikely probability

2021-04-06 Thread jiangning.liu at amperecomputing dot com via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=99946

--- Comment #1 from Jiangning Liu  ---
Is there any gcc pass that can deal with this simple optimization?

[Bug tree-optimization/99946] New: fail to exchange if conditions in terms of likely/unlikely probability

2021-04-06 Thread jiangning.liu at amperecomputing dot com via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=99946

Bug ID: 99946
   Summary: fail to exchange if conditions in terms of
likely/unlikely probability
   Product: gcc
   Version: 11.0
Status: UNCONFIRMED
  Severity: normal
  Priority: P3
 Component: tree-optimization
  Assignee: unassigned at gcc dot gnu.org
  Reporter: jiangning.liu at amperecomputing dot com
  Target Milestone: ---

For this simple case,

$ cat test_cond.c 
#define likely(x)   __builtin_expect((x),1)
#define unlikely(x) __builtin_expect((x),0)

extern void g(void);

int a, b;
void f(void)
{
  if (likely(a>0))
    if (unlikely(b>0))
      g();
}

We expect gcc to exchange the if conditions, as below,

  if (unlikely(b>0))
    if (likely(a>0))
      g();

This way, the comparison a>0 is skipped in the common case, which improves
performance.

At the moment, gcc generates the code below,

.LFB0:
.cfi_startproc
movl    a(%rip), %edx
testl   %edx, %edx
jle .L1
movl    b(%rip), %eax
testl   %eax, %eax
jg  .L4
.L1:
ret
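The saving can be made visible by counting comparisons. The sketch below is an illustration only: the counter, the helper counted and the names a_first and b_first are hypothetical, and the return value stands for "g() would be called".

```c
/* Number of comparisons performed so far; for illustration only. */
int comparisons;

static int counted(int cond)
{
  comparisons++;
  return cond;
}

/* Order as written in the test case: the likely condition first. */
int a_first(int a, int b)
{
  return counted(a > 0) && counted(b > 0);
}

/* Exchanged order: the unlikely condition first. */
int b_first(int a, int b)
{
  return counted(b > 0) && counted(a > 0);
}
```

For the expected common input (a > 0 true, b > 0 false), a_first performs two comparisons and b_first only one, while both return the same result.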

[Bug rtl-optimization/98782] [11 Regression] Bad interaction between IPA frequences and IRA resulting in spills due to changes in BB frequencies

2021-02-22 Thread jiangning.liu at amperecomputing dot com via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=98782

--- Comment #4 from Jiangning Liu  ---
Hi Honza,

Do you see any other real case problems if the patch
g:1118a3ff9d3ad6a64bba25dc01e7703325e23d92 is not applied?

If exchange2 is the only one affected by this patch so far, and because we have
observed a big performance regression, it sounds like we need to provide an IRA
fix along with this patch to avoid unexpected performance degradation for the
gcc11 release vs. gcc10.

Thanks,
-Jiangning

[Bug tree-optimization/98598] Missed opportunity to optimize dependent loads in loops

2021-01-14 Thread jiangning.liu at amperecomputing dot com via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=98598

--- Comment #12 from Jiangning Liu ---
MGO RFC is at https://gcc.gnu.org/pipermail/gcc/2021-January/234682.html

[Bug tree-optimization/98598] Missed opportunity to optimize dependent loads in loops

2021-01-11 Thread jiangning.liu at amperecomputing dot com via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=98598

--- Comment #11 from Jiangning Liu ---
(In reply to rguent...@suse.de from comment #8)
> On Sat, 9 Jan 2021, jiangning.liu at amperecomputing dot com wrote:
> 
> > https://gcc.gnu.org/bugzilla/show_bug.cgi?id=98598
> > 
> > --- Comment #7 from Jiangning Liu ---
> > (In reply to rguent...@suse.de from comment #6)
> > > On January 9, 2021 4:17:17 AM GMT+01:00, "jiangning.liu at amperecomputing
> > > dot com"  wrote:
> > > >https://gcc.gnu.org/bugzilla/show_bug.cgi?id=98598
> > > >
> > > >--- Comment #5 from Jiangning Liu ---
> > > >> It has to be done with care of course, cost modeling is difficult
> > > >> (we need to have a good estimate of n and m or need to version
> > > >> the whole nest).  That said, usually we attempt the reverse
> > > >transform.
> > > >
> > > >Before tuning the cost model good enough, we may implement this
> > > >optimization by
> > > >adding a new optimization command line option. This won't hurt gcc,
> > > >right?
> > > 
> > > New options not enabled by default tend to bitrot, be broken from the
> > > start and won't be used by the lazy user. So I see no point in doing that.
> > > 
> > 
> > Understand. I mean we can enable it by default eventually, but we need to
> > implement and tune it step by step. It is unrealistic to work out the best
> > cost model at the very beginning.
> 
> Sure.  The "easiest" thing is to rely on a profile from PGO, we did
> have some transforms only enabled by -fprofile-use by default.  That is,
> the cost model needs to be conservative, esp. if you introduce dynamic
> allocation for this.  In the end I guess only a variant that versions
> the nest on the size of the temporary will be good enough to not trigger
> OOM or excessive overhead for small sizes anyway.

People usually don't use PGO unless they can't find any better static compiler
switches. This optimization should always benefit performance if we can tune
the cost model well enough. It is true that the temp memory size needs to be
checked to avoid OOM, which is one of the runtime overheads.

[Bug tree-optimization/98598] Missed opportunity to optimize dependent loads in loops

2021-01-11 Thread jiangning.liu at amperecomputing dot com via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=98598

--- Comment #10 from Jiangning Liu ---
(In reply to Hongtao.liu from comment #9)
> It looks like a SOA/AOC opt opportunity which is discussed in
> https://gcc.gnu.org/wiki/cauldron2015?action=AttachFile&do=view&target=Olga+Golovanevsky_+Memory+Layout+Optimizations+of+Structures+and+Objects.pdf
> 
> And i remember there's someone working on enabling SOA/AOS opt in GCC.

No. The key difference is the optimization opportunity here doesn't rely on LTO
at all. It is purely a local optimization within a function instead.

[Bug tree-optimization/98598] Missed opportunity to optimize dependent loads in loops

2021-01-09 Thread jiangning.liu at amperecomputing dot com via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=98598

--- Comment #7 from Jiangning Liu  ---
(In reply to rguent...@suse.de from comment #6)
> On January 9, 2021 4:17:17 AM GMT+01:00, "jiangning.liu at amperecomputing
> dot com"  wrote:
> >https://gcc.gnu.org/bugzilla/show_bug.cgi?id=98598
> >
> >--- Comment #5 from Jiangning Liu ---
> >> It has to be done with care of course, cost modeling is difficult
> >> (we need to have a good estimate of n and m or need to version
> >> the whole nest).  That said, usually we attempt the reverse
> >transform.
> >
> >Before tuning the cost model good enough, we may implement this
> >optimization by
> >adding a new optimization command line option. This won't hurt gcc,
> >right?
> 
> New options not enabled by default tend to bitrot, be broken from the start
> and won't be used by the lazy user. So I see no point in doing that. 
> 

Understand. I mean we can enable it by default eventually, but we need to
implement and tune it step by step. It is unrealistic to work out the best cost
model at the very beginning.

[Bug tree-optimization/98598] Missed opportunity to optimize dependent loads in loops

2021-01-08 Thread jiangning.liu at amperecomputing dot com via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=98598

--- Comment #5 from Jiangning Liu  ---
> It has to be done with care of course, cost modeling is difficult
> (we need to have a good estimate of n and m or need to version
> the whole nest).  That said, usually we attempt the reverse transform.

Before tuning the cost model good enough, we may implement this optimization by
adding a new optimization command line option. This won't hurt gcc, right?

> 
> My personal opinion is that hinting the user to possibly refactor
> his code (guided by profiling to be not too noisy) is much
> prefered to the idea that the compiler can ever apply such transform
> to the loops where it matters and not to the loops where it is
> harmful.

It is not always easy for the user to modify the code, and the user may even be
lazy and reluctant to change it. This kind of Memory Gathering Optimization can
make the end user's life easier.

[Bug tree-optimization/98598] Missed opportunity to optimize dependent loads in loops

2021-01-08 Thread jiangning.liu at amperecomputing dot com via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=98598

--- Comment #2 from Jiangning Liu  ---
Loop distribution can only handle very simple cases. If the inner loop has
complicated control flow and other memory accesses with loop-carried
dependences, it would be hard to handle. For example,

int foo (int n, int m, A *pa) {
  int sum = 0;

  for (int i = 0; i < n; i++) {
    for (int j = 0; j < m; j++) {
      sum += pa[j].pb->pc->val;  // each value is repeatedly loaded "n" times
      sum = sum % 7;
    }
    sum = sum % 13;
  }

  return sum;
}

Alternatively, we can detect "invariant" dependent memory loads in the nested
loops, with alias conflicts checked. If the outer loop is hot enough, we could
have a chance to "hoist" them to create a cache.

As for temp storage, is it gcc's rule of thumb not to introduce temp storage
on the heap, or is it just that gcc doesn't have it yet and we want to have it?
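Written by hand, the "hoist into a cache" idea could look like the sketch below. It is an illustration under stated assumptions, not a proposed implementation: the struct layouts for A, B and C are guessed from the access pa[j].pb->pc->val, the name foo_gathered is hypothetical, the pointed-to values are assumed not to change during the loop nest, and the temporary is allocated on the heap with a fallback when allocation fails.

```c
#include <stdlib.h>

typedef struct { int val; } C;
typedef struct { C *pc; } B;
typedef struct { B *pb; } A;

/* Original shape: every pa[j].pb->pc->val is loaded n times. */
int foo(int n, int m, A *pa)
{
  int sum = 0;
  for (int i = 0; i < n; i++) {
    for (int j = 0; j < m; j++) {
      sum += pa[j].pb->pc->val;
      sum = sum % 7;
    }
    sum = sum % 13;
  }
  return sum;
}

/* Gathered version: the dependent loads are done once per j and the
   results are kept in a contiguous temporary array.  */
int foo_gathered(int n, int m, A *pa)
{
  if (n <= 0 || m <= 0)
    return 0;                        /* same result as foo for empty loops */

  int *cache = malloc((size_t)m * sizeof *cache);
  if (cache == NULL)
    return foo(n, m, pa);            /* fall back to the original nest */

  for (int j = 0; j < m; j++)
    cache[j] = pa[j].pb->pc->val;    /* each value is loaded only once */

  int sum = 0;
  for (int i = 0; i < n; i++) {
    for (int j = 0; j < m; j++) {
      sum += cache[j];
      sum = sum % 7;
    }
    sum = sum % 13;
  }

  free(cache);
  return sum;
}
```

The inner loop of the gathered version reads a dense int array instead of chasing two pointers per element, at the cost of m extra stores and the allocation.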

[Bug web/95380] New: ipcp-unit-growth was renamed to ipa-cp-unit-growth

2020-05-27 Thread jiangning.liu at amperecomputing dot com
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=95380

Bug ID: 95380
   Summary: ipcp-unit-growth was renamed to ipa-cp-unit-growth
   Product: gcc
   Version: 10.1.0
Status: UNCONFIRMED
  Severity: normal
  Priority: P3
 Component: web
  Assignee: unassigned at gcc dot gnu.org
  Reporter: jiangning.liu at amperecomputing dot com
  Target Milestone: ---

https://gcc.gnu.org/onlinedocs/gcc/Optimize-Options.html#Optimize-Options

Option ipcp-unit-growth (9.1.0) has been renamed to ipa-cp-unit-growth
(10.1.0), but the document in the link above doesn't reflect the change. The
10.1.0 pdf document at https://gcc.gnu.org/onlinedocs/gcc-10.1.0/gcc.pdf also
doesn't have correct info.

[Bug c++/93163] internal compiler error: verify_gimple failed

2020-01-05 Thread jiangning.liu at amperecomputing dot com
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=93163

Jiangning Liu  changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
             Status|UNCONFIRMED                 |RESOLVED
         Resolution|---                         |FIXED

--- Comment #3 from Jiangning Liu  ---
Confirmed that the issue has been fixed on trunk.

[Bug c/93163] internal compiler error: verify_gimple failed

2020-01-05 Thread jiangning.liu at amperecomputing dot com
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=93163

--- Comment #1 from Jiangning Liu  ---
Created attachment 47591
  --> https://gcc.gnu.org/bugzilla/attachment.cgi?id=47591&action=edit
bad case from llvm build

[Bug c/93163] New: internal compiler error: verify_gimple failed

2020-01-05 Thread jiangning.liu at amperecomputing dot com
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=93163

Bug ID: 93163
   Summary: internal compiler error: verify_gimple failed
   Product: gcc
   Version: 10.0
Status: UNCONFIRMED
  Severity: normal
  Priority: P3
 Component: c
  Assignee: unassigned at gcc dot gnu.org
  Reporter: jiangning.liu at amperecomputing dot com
  Target Milestone: ---

LLVM trunk build with gcc trunk exposed failure "internal compiler error:
verify_gimple failed".

$ g++ -O3 -c bad.cpp
bad.cpp: In constructor
‘{anonymous}::AArch64SIMDInstrOpt::AArch64SIMDInstrOpt()’:
bad.cpp:140602:3: error: incorrect sharing of tree nodes
140602 |   AArch64SIMDInstrOpt() : MachineFunctionPass(ID) {
   |   ^~~
*D.397057
D.397057->RC = FPR128RegClass;
bad.cpp:140602:3: error: incorrect sharing of tree nodes
*D.397057
D.397057->RC = FPR128RegClass;
bad.cpp:140602:3: error: incorrect sharing of tree nodes
*D.397057
D.397057->RC = FPR64RegClass;
bad.cpp:140602:3: error: incorrect sharing of tree nodes
*D.397057
D.397057->RC = FPR128RegClass;
bad.cpp:140602:3: error: incorrect sharing of tree nodes
*D.397057
D.397057->RC = FPR64RegClass;
bad.cpp:140602:3: error: incorrect sharing of tree nodes
*D.397057
D.397057->RC = FPR128RegClass;
bad.cpp:140602:3: error: incorrect sharing of tree nodes
*D.397057
D.397057->RC = FPR64RegClass;
bad.cpp:140602:3: error: incorrect sharing of tree nodes
*D.397057
D.397057->RC = FPR128RegClass;
bad.cpp:140602:3: error: incorrect sharing of tree nodes
*D.397057
D.397057->RC = FPR128RegClass;
bad.cpp:140602:3: error: incorrect sharing of tree nodes
*D.397057
D.397057->RC = FPR64RegClass;
bad.cpp:140602:3: error: incorrect sharing of tree nodes
*D.397057
D.397057->RC = FPR128RegClass;
bad.cpp:140602:3: error: incorrect sharing of tree nodes
*D.397057
D.397057->RC = FPR64RegClass;
bad.cpp:140602:3: error: incorrect sharing of tree nodes
*D.397057
D.397057->RC = FPR128RegClass;
bad.cpp:140602:3: error: incorrect sharing of tree nodes
*D.397057
D.397057->RC = FPR64RegClass;
during GIMPLE pass: cfg
bad.cpp:140602:3: internal compiler error: verify_gimple failed
0x100bbff verify_gimple_in_cfg(function*, bool)
../../gcc/gcc/tree-cfg.c:5445
0xebad33 execute_function_todo
../../gcc/gcc/passes.c:1983
0xebbc4b execute_todo
../../gcc/gcc/passes.c:2037
Please submit a full bug report,
with preprocessed source if appropriate.
Please include the complete backtrace with any bug report.

Bisect run shows the failure is related to commit
https://gcc.gnu.org/viewcvs/gcc?view=revision&revision=279576

[Bug tree-optimization/92649] dead store elimination

2019-11-25 Thread jiangning.liu at amperecomputing dot com
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=92649

--- Comment #5 from Jiangning Liu  ---
Unrolling 1024 iterations would increase code size a lot, so usually we don't
do that. 1024 is only an example. Without knowing we could eliminate most of
them, we don't really want to do loop unrolling, I guess.

Yes. Assigning 5 to all a's elements is only an example as well. It could be
any random value or predefined number.

Let me give a more complicated case,

extern int rand(void);

#define LIVE_SIZE 100
#define DATA_SIZE 256

int f(void)
{
int a[DATA_SIZE], b[DATA_SIZE][DATA_SIZE];
int i,j;
long long s = 0;
int next;

for (i=0; i

[Bug tree-optimization/92649] dead store elimination

2019-11-25 Thread jiangning.liu at amperecomputing dot com
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=92649

--- Comment #3 from Jiangning Liu  ---
It is a stupid test, but it is simplified from a real application.

To solve even more complicated scenarios, this simple case needs to be
addressed first.

If we change the case to be as below,

int f(void)
{
    int i, a[1024], s=0;

    for (i=0; i<1024; i++)
        a[i] = 5;

    for (i=0; i<37; i++)
        s += a[i];
    return s;
}

loop peeling will not work, but the compiler should still know that the stores
to elements with index >= 37 can all be eliminated. Can any framework in GCC
solve this problem?
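For reference, the result one would hope for is equivalent to the hand-trimmed function below, which only stores the elements that are read afterwards. This is a sketch for comparison; the names f_full and f_trimmed are hypothetical.

```c
/* The case above: all 1024 elements are stored, 37 are read. */
int f_full(void)
{
  int i, a[1024], s = 0;

  for (i = 0; i < 1024; i++)
    a[i] = 5;

  for (i = 0; i < 37; i++)
    s += a[i];
  return s;
}

/* Hand-trimmed version: stores to elements with index >= 37 are gone. */
int f_trimmed(void)
{
  int i, a[37], s = 0;

  for (i = 0; i < 37; i++)
    a[i] = 5;

  for (i = 0; i < 37; i++)
    s += a[i];
  return s;
}
```

Both functions return the same value, so the extra 987 stores in f_full have no observable effect.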

[Bug tree-optimization/92649] New: dead store elimination

2019-11-24 Thread jiangning.liu at amperecomputing dot com
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=92649

Bug ID: 92649
   Summary: dead store elimination
   Product: gcc
   Version: 10.0
Status: UNCONFIRMED
  Severity: normal
  Priority: P3
 Component: tree-optimization
  Assignee: unassigned at gcc dot gnu.org
  Reporter: jiangning.liu at amperecomputing dot com
  Target Milestone: ---

For this small case,

int f(void)
{
    int i, a[1024];

    for (i=0; i<1024; i++)
        a[i] = 5;
    return a[0];
}

"gcc -O3" can't figure out that the stores to a[1] through a[1023] can all be
eliminated. The assembly code for aarch64 is as below.

movi    v0.4s, 0x5
sub sp, sp, #4096
mov x0, sp
add x1, sp, 4096
.L2:
str q0, [x0], 16
cmp x0, x1
bne .L2
ldr w0, [sp]
add sp, sp, 4096
ret

[Bug tree-optimization/91246] vectorization failure for a small loop to search array element

2019-07-24 Thread jiangning.liu at amperecomputing dot com
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=91246

--- Comment #3 from Jiangning Liu  ---
Expect to vectorize the inner loop by generating the code below for x86,

vpbroadcastd [mem], ymm0
vpaddd [mem], ymm0, ymm1
vpbroadcastd reg, ymm2
vpcmpeqd ymm2, ymm1, k0
kortestw k0, k0
cmovne ...

AArch64 should have counterpart vector instructions to implement the same
functionality.

[Bug tree-optimization/91246] vectorization failure for a small loop to search array element

2019-07-24 Thread jiangning.liu at amperecomputing dot com
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=91246

--- Comment #2 from Jiangning Liu  ---
Created attachment 46626
  --> https://gcc.gnu.org/bugzilla/attachment.cgi?id=46626&action=edit
A new test

Attached is a test case that is more closely matching the real-world code.

[Bug tree-optimization/91246] New: vectorization failure for a small loop to search array element

2019-07-24 Thread jiangning.liu at amperecomputing dot com
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=91246

Bug ID: 91246
   Summary: vectorization failure for a small loop to search array
element
   Product: gcc
   Version: 10.0
Status: UNCONFIRMED
  Severity: normal
  Priority: P3
 Component: tree-optimization
  Assignee: unassigned at gcc dot gnu.org
  Reporter: jiangning.liu at amperecomputing dot com
  Target Milestone: ---

For the following simple case, the inner loop can be completely removed by
vectorization. GCC fails to do that. SIZE can be either 4 or 8.

#define SIZE 4
int f(int *data, int x)
{
    int i, j;
    int s = 0;

    for (i = 0; i < 1024; i++) {
        int found = 0;
        for (j = 0; j < SIZE; j++) {
            if (data[j] == x) {
                found = 1;
                break;
            }
        }
        s += found;
    }

    return s;
}
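In scalar terms, removing the inner loop means replacing the early-exit search with an OR over SIZE comparisons, which is what a SIZE-wide vector compare computes. The sketch below is an illustration with a hypothetical name f_flat; unlike the loop with break, it always reads all SIZE elements, so it assumes data[0..SIZE-1] are all readable.

```c
#define SIZE 4

/* Loop nest from the report. */
int f(int *data, int x)
{
  int i, j;
  int s = 0;

  for (i = 0; i < 1024; i++) {
    int found = 0;
    for (j = 0; j < SIZE; j++) {
      if (data[j] == x) {
        found = 1;
        break;
      }
    }
    s += found;
  }

  return s;
}

/* Flattened form: no early exit, one OR-reduction over SIZE compares.
   Since data and x do not change, the outer loop just adds the same
   value 1024 times.  */
int f_flat(int *data, int x)
{
  int found = 0;
  for (int j = 0; j < SIZE; j++)
    found |= (data[j] == x);
  return 1024 * found;
}
```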

[Bug middle-end/91195] [10 regression] incorrect may be used uninitialized smw (272711, 273474]

2019-07-23 Thread jiangning.liu at amperecomputing dot com
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=91195

Jiangning Liu  changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
                 CC|                            |msebor at gcc dot gnu.org

--- Comment #8 from Jiangning Liu  ---
Martin argues at https://gcc.gnu.org/ml/gcc-patches/2019-07/msg01525.html that
setting the no-warning bit in the middle-end is not a robust solution for this
scenario.

What about moving the case below to -O3? Would that be acceptable for the
-Wmaybe-uninitialized tests?

tree base = get_base_address (lhs);
if (!nontrap->contains (lhs)
    && auto_var_p (base)
    && TREE_ADDRESSABLE (base)
    && optimization_level > 2)
  {
    /* Do conditional store replacement by inserting a load. */
  }

[Bug middle-end/91195] [10 regression] incorrect may be used uninitialized smw (272711, 273474]

2019-07-22 Thread jiangning.liu at amperecomputing dot com
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=91195

--- Comment #6 from Jiangning Liu  ---
It seems -Werror=maybe-uninitialized does not always work: it fails to report
an error for the case below. Since the option name is "maybe-xxx", I can
understand that this is acceptable, but for the same reason it should also be
acceptable to report an error for the original case.

$ cat pr89430-1.c
unsigned test(unsigned k, unsigned b) {
    unsigned a[2];
    if (b < a[k]) {
        a[k] = b;
    }
    return a[0]+a[1];
}
$ gcc -O2 -S pr89430-1.c -Werror=maybe-uninitialized
$ cat pr89430-1.s
.file   "pr89430-1.c"
.text
.p2align 4
.globl  test
.type   test, @function
test:
.LFB0:
.cfi_startproc
movl    %edi, %edi
cmpl    %esi, -8(%rsp,%rdi,4)
cmovbe  -8(%rsp,%rdi,4), %esi
movl    %esi, -8(%rsp,%rdi,4)
movl    -4(%rsp), %eax
addl    -8(%rsp), %eax
ret
.cfi_endproc
.LFE0:
.size   test, .-test
.ident  "GCC: (GNU) 10.0.0 20190722 (experimental)"
.section    .note.GNU-stack,"",@progbits

[Bug middle-end/91195] [10 regression] incorrect may be used uninitialized smw (272711, 273474]

2019-07-21 Thread jiangning.liu at amperecomputing dot com
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=91195

--- Comment #3 from Jiangning Liu  ---
The difference in gcc's compilation between FOR_UP_LIMIT being 3 and 4 is that
cunrolli can unroll the loop when FOR_UP_LIMIT is 3, which simplifies the
control flow significantly, so the conditional store optimization in phiopt is
not triggered.

The following code is generated with the conditional store optimization, and
"cstore_8 = MEM  [(void *)&Msg][0];" is inserted in the else branch of the
"if (m1_9(D) != 0B)" statement.

  <bb 2> [local count: 214748364]:
  if (m1_9(D) != 0B)
    goto <bb 4>; [70.00%]
  else
    goto <bb 3>; [30.00%]

  <bb 3> [local count: 64424509]:
  cstore_8 = MEM  [(void *)&Msg][0];

  <bb 4> [local count: 214748364]:
  # num_2 = PHI <1(2), 0(3)>
  # cstore_4 = PHI 
  MEM  [(void *)&Msg][0] = cstore_4;

A possible solution is to disable this optimization when
"-Werror=maybe-uninitialized" is enabled.

[Bug tree-optimization/89134] A missing optimization opportunity for a simple branch in loop

2019-03-29 Thread jiangning.liu at amperecomputing dot com
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=89134

--- Comment #13 from Jiangning Liu ---
Feng already sent out the 1st patch at
https://gcc.gnu.org/ml/gcc-patches/2019-03/msg00541.html .

But the 2nd one is related to
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=89713 .

[Bug rtl-optimization/89430] A missing ifcvt optimization to generate csel

2019-02-26 Thread jiangning.liu at amperecomputing dot com
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=89430

--- Comment #8 from Jiangning Liu  ---
It is related to https://gcc.gnu.org/ml/gcc-patches/2015-11/msg02998.html

Bernd's patch is an overkill.

[Bug rtl-optimization/89430] A missing ifcvt optimization to generate csel

2019-02-21 Thread jiangning.liu at amperecomputing dot com
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=89430

--- Comment #7 from Jiangning Liu  ---
To avoid "readonly" issue, try this case,

unsigned test(unsigned k, unsigned b) {
unsigned a[2];
if (b < a[k]) {
a[k] = b;
}
return a[0]+a[2];
}

Variable a is local, and it is NOT readonly, so now the following code is
generated,

sub sp, sp, #16
uxtw x0, w0
add x2, sp, 8
ldr w3, [x2, x0, lsl 2]
cmp w3, w1
bls .L2
str w1, [x2, x0, lsl 2]
.L2:
ldr w1, [sp, 8]
ldr w0, [sp, 16]
add sp, sp, 16
add w0, w1, w0
ret

But gcc should generate code below instead,

uxtw x2, w0
add x3, sp, 8
ldr w5, [sp, 16]
ldr w4, [x3, x2, lsl 2]
cmp w4, w1
csel w1, w1, w4, hi
str w1, [x3, x2, lsl 2]
ldr w0, [sp, 8]
add sp, sp, 16
add w0, w0, w5
ret

Any other glass jaw?

[Bug rtl-optimization/89430] A missing ifcvt optimization to generate csel

2019-02-21 Thread jiangning.liu at amperecomputing dot com
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=89430

--- Comment #6 from Jiangning Liu  ---
(In reply to Richard Biener from comment #5)
> (In reply to Jiangning Liu from comment #4)
> > >We need to be careful with loads
> > >or stores, for instance a load might not trap, while a store would,
> > >so if we see a dominating read access this doesn't mean that a later
> > >write access would not trap.  
> > 
> > Why? For this case, there is a dominating load for the same address. I don't
> > see why it might trap. Any example?
> 
> The memory might be mapped readonly.

But in such a simple basic block, how can it be mapped readonly? We can easily
tell that the memory here is NOT mapped readonly.

[Bug rtl-optimization/89430] A missing ifcvt optimization to generate csel

2019-02-21 Thread jiangning.liu at amperecomputing dot com
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=89430

--- Comment #4 from Jiangning Liu  ---
>We need to be careful with loads
>or stores, for instance a load might not trap, while a store would,
>so if we see a dominating read access this doesn't mean that a later
>write access would not trap.  

Why? For this case, there is a dominating load for the same address. I don't
see why it might trap. Any example?

[Bug rtl-optimization/89430] New: A missing ifcvt optimization to generate csel

2019-02-21 Thread jiangning.liu at amperecomputing dot com
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=89430

Bug ID: 89430
   Summary: A missing ifcvt optimization to generate csel
   Product: gcc
   Version: 9.0
Status: UNCONFIRMED
  Severity: normal
  Priority: P3
 Component: rtl-optimization
  Assignee: unassigned at gcc dot gnu.org
  Reporter: jiangning.liu at amperecomputing dot com
  Target Milestone: ---

For a small case,

unsigned *a;
void test(unsigned k, unsigned b) {
if (b < a[k]) {
a[k] = b;
}
}

"gcc -O3 -S" generates,

adrp x2, a
uxtw x0, w0
ldr x2, [x2, #:lo12:a]
ldr w3, [x2, x0, lsl 2]
cmp w3, w1
bls .L1
str w1, [x2, x0, lsl 2]

Actually we should use a csel instruction instead of a conditional branch, so
we expect the following to be generated,

adrp x2, a
uxtw x0, w0
ldr x2, [x2, #:lo12:a]
ldr w3, [x2, x0, lsl 2]
cmp w3, w1
csel w1, w1, w3, hi
str w1, [x2, x0, lsl 2]

RTL optimization ifcvt misses this opportunity.
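
As a source-level sketch of the requested transformation (not GCC's actual
ifcvt implementation; the names store_min_branch and store_min_select are made
up for illustration), the conditional store can be rewritten as a select
followed by an unconditional store. The two forms only agree when a[k] is known
to be writable, because the second form always stores.

```c
/* Branchy form, as in the test case: the store happens only when b < a[k]. */
void store_min_branch(unsigned *a, unsigned k, unsigned b)
{
    if (b < a[k])
        a[k] = b;
}

/* If-converted form: select the value first (a csel on aarch64), then store
   unconditionally. Assumption: a[k] is writable memory. */
void store_min_select(unsigned *a, unsigned k, unsigned b)
{
    unsigned old = a[k];
    a[k] = (b < old) ? b : old;
}
```

When b >= a[k] the second form writes back the value it just loaded, so the
memory contents end up the same as with the branchy form.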

[Bug tree-optimization/89134] A missing optimization opportunity for a simple branch in loop

2019-01-31 Thread jiangning.liu at amperecomputing dot com
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=89134

--- Comment #10 from Jiangning Liu  ---
(In reply to Martin Sebor from comment #9)
> But since GCC emits infinite loops regardless of whether or not
> they have any side-effects, whether inc() is pure or not may not matter. 

I think "for (; it != m.end (); ++it);  /* get an empty loop */" is a finite
loop.

[Bug tree-optimization/89134] A missing optimization opportunity for a simple branch in loop

2019-01-31 Thread jiangning.liu at amperecomputing dot com
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=89134

--- Comment #5 from Jiangning Liu  ---
The loop below should be treated as a finite loop,

for (iter = booktable.begin(); iter!=booktable.end(); ++iter) {
   ...
}

so there is a chance to optimize away the empty loop, in which do_something
doesn't exist at all.

[Bug tree-optimization/89134] A missing optimization opportunity for a simple branch in loop

2019-01-31 Thread jiangning.liu at amperecomputing dot com
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=89134

Jiangning Liu  changed:

   What|Removed |Added

 Status|RESOLVED|UNCONFIRMED
 Resolution|INVALID |---

--- Comment #2 from Jiangning Liu  ---
The original case is only a simple example, and what if GCC can figure out it
is NOT an infinite loop? For example,

std::map BookTable;

BookTable::iterator iter;
BookTable booktable;
for (iter = booktable.begin(); iter!=booktable.end(); ++iter) {
   if (b) {
  b = do_something();
   }
}

Then GCC should be able to figure out that this loop is finite, because it
iterates over a standard C++ STL std::map. The cost of iterating over a
std::map might be high, so we'd better consider optimizing away the empty loop.

[Bug tree-optimization/89134] New: A missing optimization opportunity for a simple branch in loop

2019-01-30 Thread jiangning.liu at amperecomputing dot com
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=89134

Bug ID: 89134
   Summary: A missing optimization opportunity for a simple branch
in loop
   Product: gcc
   Version: 9.0
Status: UNCONFIRMED
  Severity: normal
  Priority: P3
 Component: tree-optimization
  Assignee: unassigned at gcc dot gnu.org
  Reporter: jiangning.liu at amperecomputing dot com
  Target Milestone: ---

For this simple case,

__attribute__((pure)) __attribute__((noinline)) int inc(int i)
{
/* Do something else here */

return i+1;
}
extern int do_something(void);
extern int b;
void test(int n)
{
for (int i=0; i

[Bug tree-optimization/88492] New: SLP optimization generates ugly code

2018-12-13 Thread jiangning.liu at amperecomputing dot com
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=88492

Bug ID: 88492
   Summary: SLP optimization generates ugly code
   Product: gcc
   Version: 9.0
Status: UNCONFIRMED
  Severity: normal
  Priority: P3
 Component: tree-optimization
  Assignee: unassigned at gcc dot gnu.org
  Reporter: jiangning.liu at amperecomputing dot com
  Target Milestone: ---

For aarch64, SLP optimization generates ugly code for the case below,

int test_slp( unsigned char *b )
{
unsigned int tmp[4][4];
int sum = 0;
for( int i = 0; i < 4; i++, b += 4 )
{
tmp[i][0] = b[0];
tmp[i][2] = b[1];
tmp[i][1] = b[2];
tmp[i][3] = b[3];
}
for( int i = 0; i < 4; i++ )
{
sum += tmp[0][i] + tmp[1][i] + tmp[2][i] + tmp[3][i];
}
return sum;
}

With command line "gcc -O3", the following code is generated,

 :
   0:   90000001    adrp    x1, 0
   4:   d10103ff    sub     sp, sp, #0x40
   8:   3dc00001    ldr     q1, [x0]
   c:   3dc00020    ldr     q0, [x1]
  10:   4e000021    tbl     v1.16b, {v1.16b}, v0.16b
  14:   2f08a422    uxtl    v2.8h, v1.8b
  18:   6f08a421    uxtl2   v1.8h, v1.16b
  1c:   2f10a443    uxtl    v3.4s, v2.4h
  20:   6f10a442    uxtl2   v2.4s, v2.8h
  24:   2f10a420    uxtl    v0.4s, v1.4h
  28:   6f10a421    uxtl2   v1.4s, v1.8h
  2c:   9e660060    fmov    x0, d3
  30:   ad000be3    stp     q3, q2, [sp]
  34:   b9401be8    ldr     w8, [sp, #24]
  38:   ad0107e0    stp     q0, q1, [sp, #32]
  3c:   9e660022    fmov    x2, d1
  40:   d360fc01    lsr     x1, x0, #32
  44:   9e660040    fmov    x0, d2
  48:   294117e6    ldp     w6, w5, [sp, #8]
  4c:   d360fc43    lsr     x3, x2, #32
  50:   b9402be2    ldr     w2, [sp, #40]
  54:   d360fc07    lsr     x7, x0, #32
  58:   9e660000    fmov    x0, d0
  5c:   0ea18400    add     v0.2s, v0.2s, v1.2s
  60:   0b0100e7    add     w7, w7, w1
  64:   0b0800c6    add     w6, w6, w8
  68:   b9401fe8    ldr     w8, [sp, #28]
  6c:   d360fc00    lsr     x0, x0, #32
  70:   1e260001    fmov    w1, s0
  74:   0ea28460    add     v0.2s, v3.2s, v2.2s
  78:   0b000063    add     w3, w3, w0
  7c:   0b070063    add     w3, w3, w7
  80:   29471fe0    ldp     w0, w7, [sp, #56]
  84:   1e260004    fmov    w4, s0
  88:   0b000042    add     w2, w2, w0
  8c:   b9402fe0    ldr     w0, [sp, #44]
  90:   0b060042    add     w2, w2, w6
  94:   0b040021    add     w1, w1, w4
  98:   0b070000    add     w0, w0, w7
  9c:   0b030021    add     w1, w1, w3
  a0:   0b0800a3    add     w3, w5, w8
  a4:   0b020021    add     w1, w1, w2
  a8:   0b030000    add     w0, w0, w3
  ac:   0b000020    add     w0, w1, w0
  b0:   910103ff    add     sp, sp, #0x40
  b4:   d65f03c0    ret

In the code, vectorization code is generated, but there are ugly instructions
generated as well, e.g. memory store and register copy from SIMD register to
general purpose register.

With command line "gcc -O3 -fno-tree-slp-vectorize", the following code can be
generated, and it looks pretty clean. Usually, this code sequence is friendly
to hardware prefetch.

 :
   0:   39402004    ldrb    w4, [x0, #8]
   4:   39401002    ldrb    w2, [x0, #4]
   8:   39403001    ldrb    w1, [x0, #12]
   c:   39400003    ldrb    w3, [x0]
  10:   39402806    ldrb    w6, [x0, #10]
  14:   0b040021    add     w1, w1, w4
  18:   39401805    ldrb    w5, [x0, #6]
  1c:   0b020063    add     w3, w3, w2
  20:   39403804    ldrb    w4, [x0, #14]
  24:   0b030021    add     w1, w1, w3
  28:   39400802    ldrb    w2, [x0, #2]
  2c:   39400403    ldrb    w3, [x0, #1]
  30:   0b060084    add     w4, w4, w6
  34:   39402407    ldrb    w7, [x0, #9]
  38:   0b050042    add     w2, w2, w5
  3c:   39401406    ldrb    w6, [x0, #5]
  40:   0b020084    add     w4, w4, w2
  44:   39403405    ldrb    w5, [x0, #13]
  48:   0b040021    add     w1, w1, w4
  4c:   0b060063    add     w3, w3, w6
  50:   39400c02    ldrb    w2, [x0, #3]
  54:   0b0700a5    add     w5, w5, w7
  58:   39403c04    ldrb    w4, [x0, #15]
  5c:   0b050063    add     w3, w3, w5
  60:   39401c06    ldrb    w6, [x0, #7]
  64:   39402c05    ldrb    w5, [x0, #11]
  68:   0b030021    add     w1, w1, w3
  6c:   0b060040    add     w0, w2, w6
  70:   0b050082    add     w2, w4, w5
  74:   0b020000    add     w0, w0, w2
  78:   0b000020    add     w0, w1, w0
  7c:   d65f03c0    ret

Anyway, it looks like the heuristic rule to enable SLP optimization needs to be
improved.
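
For reference, here is a small sketch of what the test case computes. The
permuted stores into tmp only reorder elements within a row, and every element
of tmp is added exactly once, so the result is just the sum of the 16 input
bytes. The name sum16 is hypothetical and only serves the comparison; nothing
here claims that GCC should or does perform this rewrite.

```c
/* The test case from the report, unchanged in behavior. */
int test_slp(unsigned char *b)
{
    unsigned int tmp[4][4];
    int sum = 0;
    for (int i = 0; i < 4; i++, b += 4) {
        tmp[i][0] = b[0];
        tmp[i][2] = b[1];
        tmp[i][1] = b[2];
        tmp[i][3] = b[3];
    }
    for (int i = 0; i < 4; i++)
        sum += tmp[0][i] + tmp[1][i] + tmp[2][i] + tmp[3][i];
    return sum;
}

/* Hypothetical reduced form: a plain sum over the same 16 bytes. */
int sum16(const unsigned char *b)
{
    int sum = 0;
    for (int i = 0; i < 16; i++)
        sum += b[i];
    return sum;
}
```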

[Bug tree-optimization/88459] New: vectorization failure for a simple sum reduction loop

2018-12-11 Thread jiangning.liu at amperecomputing dot com
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=88459

Bug ID: 88459
   Summary: vectorization failure for a simple sum reduction loop
   Product: gcc
   Version: 9.0
Status: UNCONFIRMED
  Severity: normal
  Priority: P3
 Component: tree-optimization
  Assignee: unassigned at gcc dot gnu.org
  Reporter: jiangning.liu at amperecomputing dot com
  Target Milestone: ---

For the simple loop below, gcc -O3 fails to vectorize it.

unsigned int tmp[1024];
unsigned int test_vec(int n)
{
int sum = 0;
for(int i = 0; i < 1024; i++)
{
sum += tmp[i];
}
return sum;
}

The kernel loop is,

.L2:
ldr w2, [x1], 4
add w0, w0, w2
cmp x3, x1
bne .L2


But if we change the data type of sum from "int" to "unsigned int" as below,

unsigned int tmp[1024];
unsigned int test_vec(int n)
{
unsigned int sum = 0;
for(int i = 0; i < 1024; i++)
{
sum += tmp[i];
}
return sum;
}

gcc can vectorize it, and the kernel loop is like,

.L2:
ldr q1, [x0], 16
add v0.4s, v0.4s, v1.4s
cmp x1, x0
bne .L2
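
A small sketch of why the two variants are expected to return the same value.
With "int sum", the statement "sum += tmp[i]" is evaluated in unsigned int and
the result is converted back to int, which is implementation-defined; GCC
documents that conversion as reduction modulo 2^N. The function names below
are made up for illustration, and the unused parameter n is dropped.

```c
unsigned int tmp[1024];

/* Accumulator declared as int: each step is (int)((unsigned)sum + tmp[i]). */
unsigned int test_vec_int(void)
{
    int sum = 0;
    for (int i = 0; i < 1024; i++)
        sum += tmp[i];
    return sum;
}

/* Accumulator declared as unsigned int: plain wrapping addition. */
unsigned int test_vec_uint(void)
{
    unsigned int sum = 0;
    for (int i = 0; i < 1024; i++)
        sum += tmp[i];
    return sum;
}
```

Under that assumption about the conversion, only the type of the accumulator
differs, not the value computed.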

[Bug tree-optimization/88398] vectorization failure for a small loop to do byte comparison

2018-12-07 Thread jiangning.liu at amperecomputing dot com
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=88398

--- Comment #4 from Jiangning Liu  ---
I expect "gcc -O3 -flto" could work.

[Bug tree-optimization/88398] vectorization failure for a small loop to do byte comparison

2018-12-06 Thread jiangning.liu at amperecomputing dot com
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=88398

--- Comment #2 from Jiangning Liu  ---
memcmp doesn't return the position where they differ.

[Bug tree-optimization/88398] New: vectorization failure for a small loop to do byte comparison

2018-12-06 Thread jiangning.liu at amperecomputing dot com
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=88398

Bug ID: 88398
   Summary: vectorization failure for a small loop to do byte
comparison
   Product: gcc
   Version: 9.0
Status: UNCONFIRMED
  Severity: normal
  Priority: P3
 Component: tree-optimization
  Assignee: unassigned at gcc dot gnu.org
  Reporter: jiangning.liu at amperecomputing dot com
  Target Milestone: ---

For the small case below, GCC -O3 can't vectorize the small loop to do byte
comparison in func2.

void *malloc(long unsigned int);
typedef struct {
unsigned char *buffer;
} data;

static unsigned char *func1(data *d)
{
return d->buffer;
}

static int func2(int max, int pos, unsigned char *cur)
{
unsigned char *p = cur + pos;
int len = 0;
while (++len != max)
if (p[len] != cur[len])
break;
return cur[len];
}

int main (int argc) {
data d;
d.buffer = malloc(2*argc);
return func2(argc, argc, func1(&d));
}

At the moment, the following code is generated for this loop,

  4004d4:   38616862    ldrb    w2, [x3,x1]
  4004d8:   6b00005f    cmp     w2, w0
  4004dc:   540000a1    b.ne    4004f0
  4004e0:   38616880    ldrb    w0, [x4,x1]
  4004e4:   6b01027f    cmp     w19, w1
  4004e8:   91000421    add     x1, x1, #0x1
  4004ec:   54ffff41    b.ne    4004d4

In fact, this loop can be vectorized by checking whether the comparison size is
aligned to the SIMD register length. It may introduce run time overhead, but
the cost model could decide whether to do it or not.
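
A hand-written sketch of the idea, assuming max >= 1 as in the test case. It
compares 8 bytes at a time while a full chunk still fits below max, then falls
back to the byte loop to locate the exact position. The chunk width of 8 and
the function names are assumptions for illustration; a real vectorizer would
use the SIMD register width and its own cost model.

```c
#include <stdint.h>
#include <string.h>

/* Same loop as in func2, but returning the stopping position len. */
int mismatch_scalar(const unsigned char *p, const unsigned char *cur, int max)
{
    int len = 0;
    while (++len != max)
        if (p[len] != cur[len])
            break;
    return len;
}

/* Chunked variant: skip whole 8-byte chunks that compare equal. */
int mismatch_chunked(const unsigned char *p, const unsigned char *cur, int max)
{
    int len = 1;
    while (len + 8 <= max) {          /* never reads at or beyond index max */
        uint64_t a, b;
        memcpy(&a, p + len, 8);
        memcpy(&b, cur + len, 8);
        if (a != b)
            break;                    /* mismatch is somewhere in this chunk */
        len += 8;
    }
    while (len < max && p[len] == cur[len])   /* scalar tail */
        len++;
    return len;
}
```

The chunked loop only reads indices below max, so it touches no byte that the
scalar loop could not also have touched.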

[Bug tree-optimization/88259] New: vectorization failure for a typical loop for getting max value and index

2018-11-29 Thread jiangning.liu at amperecomputing dot com
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=88259

Bug ID: 88259
   Summary: vectorization failure for a typical loop for getting
max value and index
   Product: gcc
   Version: 9.0
Status: UNCONFIRMED
  Severity: normal
  Priority: P3
 Component: tree-optimization
  Assignee: unassigned at gcc dot gnu.org
  Reporter: jiangning.liu at amperecomputing dot com
  Target Milestone: ---

GCC -O3 can't vectorize the following typical loop for getting max value and
index from an array.

void test_vec(int *data, int n) {
int best_i, best = 0;

for (int i = 0; i < n; i++) {
if (data[i] > best) {
best = data[i];
best_i = i;
}
}

data[best_i] = data[0];
data[0] = best;
}

The code generated in the kernel loop is as below,

.L4:
ldr w4, [x0, x2, lsl 2]
cmp w3, w4
csel w6, w4, w3, lt
csel w5, w2, w5, lt
add x2, x2, 1
mov w3, w6
cmp w1, w2
bgt .L4

If n is a constant like 1024, gcc -O3 still fails to vectorize it.

If we only get the max value and keep only one statement in the if statement
inside the loop,

void test_vec(int *data, int n) {
int best = 0;
for (int i = 0; i < n; i++) {
if (data[i] > best) {
best = data[i];
}
}

data[0] = best;
}

"gcc -O3" can do vectorization and the kernel loop is like below,

.L4:
ldr q1, [x2], 16
smax v0.4s, v0.4s, v1.4s
cmp x2, x3
bne .L4
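
One possible source-level workaround, sketched under assumptions: split the
loop into a max reduction followed by a min reduction over the indices that
hold the max. Each pass is then a single reduction of the kind shown above.
The function names and the -1 result for "no element is greater than 0" are
made up for illustration (the original leaves best_i uninitialized in that
case), and whether a given GCC version vectorizes both passes has not been
checked here.

```c
/* One-pass form, as in the report, returning the index instead of swapping. */
int argmax_one_pass(const int *data, int n, int *best_out)
{
    int best_i = -1, best = 0;
    for (int i = 0; i < n; i++) {
        if (data[i] > best) {
            best = data[i];
            best_i = i;
        }
    }
    *best_out = best;
    return best_i;
}

/* Two-pass form: max reduction, then the smallest index holding the max. */
int argmax_two_pass(const int *data, int n, int *best_out)
{
    int best = 0;
    for (int i = 0; i < n; i++)
        if (data[i] > best)
            best = data[i];

    int best_i = n;
    for (int i = 0; i < n; i++) {
        int cand = (data[i] == best) ? i : n;
        if (cand < best_i)
            best_i = cand;
    }
    *best_out = best;
    return (best > 0) ? best_i : -1;
}
```

Because the one-pass loop uses a strict ">", it keeps the first index of the
maximum, which is why the second pass takes the minimum index.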

[Bug tree-optimization/86530] Vectorization failure for a simple loop

2018-07-16 Thread jiangning.liu at amperecomputing dot com
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=86530

--- Comment #1 from Jiangning Liu  ---
Created attachment 44396
  --> https://gcc.gnu.org/bugzilla/attachment.cgi?id=44396&action=edit
vectorization failure

Attached is the -O3 result for aarch64, in which no vectorization code is
generated at all.

[Bug tree-optimization/86530] New: Vectorization failure for a simple loop

2018-07-16 Thread jiangning.liu at amperecomputing dot com
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=86530

Bug ID: 86530
   Summary: Vectorization failure for a simple loop
   Product: gcc
   Version: 9.0
Status: UNCONFIRMED
  Severity: normal
  Priority: P3
 Component: tree-optimization
  Assignee: unassigned at gcc dot gnu.org
  Reporter: jiangning.liu at amperecomputing dot com
  Target Milestone: ---

GCC -O3 can't vectorize the following simple case. 

$ cat test_loop_2.c
int test_loop_2(char *p1, char *p2)
{
int s = 0;
for(int i=0; i<4; i++, p1+=4, p2+=4)
{
s += (p1[0]-p2[0]) + (p1[1]-p2[1]) + (p1[2]-p2[2]) + (p1[3]-p2[3]);
}

return s;
}

The vector size is 4*1=4 bytes, and it doesn't directly fit into 8-byte or
16-byte vector, but we still can extend the element to be 32-bit, and use the
vector operations on 4*4=16 bytes vector.
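
The widening idea can be written out in scalar C as a sketch: treat each group
of four chars as four 32-bit lanes, accumulate the differences lane by lane,
and add the lanes at the end. The name test_loop_2_lanes and the explicit acc
array are assumptions for illustration, not what the vectorizer emits.

```c
/* The test case from the report. */
int test_loop_2(char *p1, char *p2)
{
    int s = 0;
    for (int i = 0; i < 4; i++, p1 += 4, p2 += 4)
        s += (p1[0]-p2[0]) + (p1[1]-p2[1]) + (p1[2]-p2[2]) + (p1[3]-p2[3]);
    return s;
}

/* Hypothetical lane-wise form: four int accumulators model a 4 x 32-bit
   vector; the final sum is the horizontal reduction. */
int test_loop_2_lanes(const char *p1, const char *p2)
{
    int acc[4] = {0, 0, 0, 0};
    for (int row = 0; row < 4; row++)
        for (int lane = 0; lane < 4; lane++)
            acc[lane] += (int)p1[4 * row + lane] - (int)p2[4 * row + lane];
    return acc[0] + acc[1] + acc[2] + acc[3];
}
```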

[Bug tree-optimization/86504] vectorization failure for a nest loop

2018-07-12 Thread jiangning.liu at amperecomputing dot com
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=86504

--- Comment #1 from Jiangning Liu  ---
Created attachment 44387
  --> https://gcc.gnu.org/bugzilla/attachment.cgi?id=44387&action=edit
bad vectorization result for boundary size 8

[Bug tree-optimization/86504] New: vectorization failure for a nest loop

2018-07-12 Thread jiangning.liu at amperecomputing dot com
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=86504

Bug ID: 86504
   Summary: vectorization failure for a nest loop
   Product: gcc
   Version: 9.0
Status: UNCONFIRMED
  Severity: normal
  Priority: P3
 Component: tree-optimization
  Assignee: unassigned at gcc dot gnu.org
  Reporter: jiangning.liu at amperecomputing dot com
  Target Milestone: ---

Created attachment 44386
  --> https://gcc.gnu.org/bugzilla/attachment.cgi?id=44386&action=edit
bad vectorization result for boundary size 16

For the case below, the code generated by “gcc -O3” is very ugly, and the inner
loop can be correctly vectorized. Please refer to attached file
test_loop_inner_16.s.

char g_d[1024], g_s1[1024], g_s2[1024];
void test_loop(void)
{
char *d = g_d, *s1 = g_s1, *s2 = g_s2;

for ( int y = 0; y < 128; y++ )
{
for ( int x = 0; x < 16; x++ )
d[x] = s1[x] + s2[x];
d += 16;
}
}

If we change inner loop “for ( int x = 0; x < 16; x++ )” to be like “for ( int
x = 0; x < 32; x++ )”, i.e. the loop boundary size changes from 16 to 32, very
beautiful vectorization code would be generated. For example, the code below is
the aarch64 result for loop boundary size 32, and it is the same case for x86.

test_loop:
.LFB0:
.cfi_startproc
adrp x2, g_s1
adrp x3, g_s2
add x2, x2, :lo12:g_s1
add x3, x3, :lo12:g_s2
adrp x0, g_d
adrp x1, g_d+2048
add x0, x0, :lo12:g_d
add x1, x1, :lo12:g_d+2048
ldp q1, q2, [x2]
ldp q3, q0, [x3]
add v1.16b, v1.16b, v3.16b
add v0.16b, v0.16b, v2.16b
.p2align 3,,7
.L2:
str q1, [x0]
str q0, [x0, 16]!
cmp x0, x1
bne .L2
ret
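
In the output above the two vector adds sit before .L2 and the loop only
stores, because s1 and s2 never advance and every row of d receives the same
values. A source-level sketch of that shape for a 16-byte row follows. The
function names, the ROW constant and passing the buffers as parameters are
assumptions for illustration, and the sketch assumes d does not overlap s1 or
s2. Note that 128 rows of 16 bytes need a 2048-byte destination.

```c
#include <string.h>

enum { ROW = 16 };

/* Same structure as test_loop, with the buffers passed in. */
void fill_rows_ref(char *d, const char *s1, const char *s2, int rows)
{
    for (int y = 0; y < rows; y++) {
        for (int x = 0; x < ROW; x++)
            d[x] = s1[x] + s2[x];
        d += ROW;
    }
}

/* Hoisted form: compute the invariant row once, then copy it to each row. */
void fill_rows_hoisted(char *d, const char *s1, const char *s2, int rows)
{
    char row[ROW];
    for (int x = 0; x < ROW; x++)
        row[x] = s1[x] + s2[x];
    for (int y = 0; y < rows; y++)
        memcpy(d + y * ROW, row, ROW);
}
```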

The code generated for loop boundary size 8 is also very bad. 

Any idea?