RE: [PATCH PR95696] regrename creates overlapping register allocations for vliw

2020-08-03 Thread Yangfei (Felix)
Hi Richard,

Thanks for reviewing this fix and the detailed suggestions :-)
Looks like my colleague Yunde was having some issue setting up his local 
repo.
I have prepared one for him.  Attached please find the patch.
Bootstrapped and tested on aarch64-linux-gnu.  Please help install if it's 
good to go.
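With your two style comments below applied, the touched lines should read 
roughly as follows (a sketch based on the quoted hunks; not necessarily 
byte-for-byte the attached patch):

/* Analyze the current function and build chains for renaming.
   If INCLUDE_ALL_BLOCKS_P is set to true, process all blocks,
   ignoring BB_DISABLE_SCHEDULE.  The default value is true.  */

and

  if (dump_file)
    fprintf (dump_file, "avoid disrupting the sms schedule of bb %d\n",
             bb1->index);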

Felix

> -Original Message-
> From: Richard Sandiford [mailto:richard.sandif...@arm.com]
> Sent: Friday, July 31, 2020 5:33 PM
> To: Zhongyunde 
> Cc: gcc-patches@gcc.gnu.org; Yangfei (Felix) 
> Subject: Re: [PATCH PR95696] regrename creates overlapping register
> allocations for vliw
> 
> Thanks for the update, looks good.  Could you post a changelog too so that I
> can use it when committing?
> 
> The changelog was the only reason I didn't just push the patch, but FWIW, a
> couple of very minor things…
> 
> Zhongyunde  writes:
> > diff --git a/gcc/regrename.c b/gcc/regrename.c
> > old mode 100644
> > new mode 100755
> > index 637b3cbe6d7..815ed22805d
> > --- a/gcc/regrename.c
> > +++ b/gcc/regrename.c
> > @@ -684,10 +684,12 @@ merge_chains (du_head_p c1, du_head_p c2)
> >c1->cannot_rename |= c2->cannot_rename;
> >  }
> >
> > -/* Analyze the current function and build chains for renaming.  */
> > +/* Analyze the current function and build chains for renaming.
> > +   If INCLUDE_ALL_BLOCKS_P is set to true, should process all blocks,
> > +   ignoring BB_DISABLE_SCHEDULE.  The default value is true.  */
> 
> I think s/should// here, since GCC comments usually use an imperative style.
> 
> > @@ -737,6 +739,14 @@ regrename_analyze (bitmap bb_mask)
> >if (dump_file)
> > fprintf (dump_file, "\nprocessing block %d:\n", bb1->index);
> >
> > +  if (!include_all_block_p && (bb1->flags & BB_DISABLE_SCHEDULE) != 0)
> > +   {
> > + if (dump_file)
> > +   fprintf (dump_file, "avoid disrupting the sms schedule of bb %d\n",
> > +bb1->index);
> 
> bb1->index should be indented below “dump_file”.
> 
> Richard


pr95696-v0.diff
Description: pr95696-v0.diff


RE: [PATCH PR95961] vect: ICE: in exact_div, at poly-int.h:2182

2020-07-02 Thread Yangfei (Felix)
Hi,

> -Original Message-
> From: Richard Sandiford [mailto:richard.sandif...@arm.com]
> Sent: Thursday, July 2, 2020 5:17 PM
> To: Yangfei (Felix) 
> Cc: Richard Biener ; Richard Biener
> ; gcc-patches@gcc.gnu.org
> Subject: Re: [PATCH PR95961] vect: ICE: in exact_div, at poly-int.h:2182
> 

Cut...

> 
> Thanks, pushed to master with a minor whitespace fix for…

Thanks for finding it.

> > + nscalars = (STMT_SLP_TYPE (stmt_info)
> > +   ? vf * DR_GROUP_SIZE (stmt_info) : vf);
> 
> …the indentation on this line.  Hope you don't mind, but I also “reflowed”
> the commit message to make it fit within 72 chars.
> (The text itself is the same.)

It's OK.  :-)
BTW: Is this the rule for the GCC git commit message format, 72 chars instead 
of 80 chars?

Felix


RE: [PATCH PR95961] vect: ICE: in exact_div, at poly-int.h:2182

2020-07-02 Thread Yangfei (Felix)
Hi,

> -Original Message-
> From: Richard Sandiford [mailto:richard.sandif...@arm.com]
> Sent: Wednesday, July 1, 2020 9:03 PM
> To: Yangfei (Felix) 
> Cc: Richard Biener ; Richard Biener
> ; gcc-patches@gcc.gnu.org
> Subject: Re: [PATCH PR95961] vect: ICE: in exact_div, at poly-int.h:2182
> 
> "Yangfei (Felix)"  writes:
> >> On June 30, 2020 4:23:03 PM GMT+02:00, Richard Sandiford
> >>  wrote:
> >> >Richard Biener  writes:
> >> >> So it seems odd to somehow put in the number of vectors...  so to
> >> >> me it would have made sense if it did
> >> >>
> >> >>   possible_npeel_number = lower_bound (nscalars);
> >> >>
> >> >> or whatever is necessary to make the polys happy.  Thus simply
> >> >> elide the vect_get_num_vectors call?  But it's been very long
> >> >> since I've dived into the alignment peeling code...
> >> >
> >> >Ah, I see what you mean.  So rather than:
> >> >
> >> >/* Save info about DR in the hash table.  Also include peeling
> >> >   amounts according to the explanation above.  */
> >> >  for (j = 0; j < possible_npeel_number; j++)
> >> >{
> >> >  vect_peeling_hash_insert (&peeling_htab, loop_vinfo,
> >> >  dr_info, npeel_tmp);
> >> >npeel_tmp += target_align / dr_size;
> >> >}
> >> >
> >> >just have something like:
> >> >
> >> >while (known_le (npeel_tmp, nscalars))
> >> >  {
> >> >…
> >> >  }
> >> >
> >> >?
> >>
> >> Yeah.
> >
> > Not sure if I understand correctly.  I am supposing the following check in
> the original code is not necessary if we go like that.
> >
> > 1822   if (unlimited_cost_model (LOOP_VINFO_LOOP (loop_vinfo)))
> >
> > Is that correct?
> 
> I think we still need it.  I guess there are two choices:
> 
> - make nscalars default to npeel_tmp before the “if” above.

I think this will be simpler.  How about the v2 patch?
Bootstrapped and tested on aarch64-linux-gnu & x86_64-linux-gnu.
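For reference, the overall shape being discussed is roughly the following 
(pieced together from the snippets quoted above; not necessarily the exact 
v2 hunk):

  poly_uint64 nscalars = npeel_tmp;
  if (unlimited_cost_model (LOOP_VINFO_LOOP (loop_vinfo)))
    nscalars = (STMT_SLP_TYPE (stmt_info)
                ? vf * DR_GROUP_SIZE (stmt_info) : vf);

  /* Save info about DR in the hash table.  Also include peeling
     amounts according to the explanation above.  */
  while (known_le (npeel_tmp, nscalars))
    {
      vect_peeling_hash_insert (&peeling_htab, loop_vinfo,
                                dr_info, npeel_tmp);
      npeel_tmp += target_align / dr_size;
    }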

Thanks,
Felix


pr95961-v2.diff
Description: pr95961-v2.diff


RE: [PATCH PR95961] vect: ICE: in exact_div, at poly-int.h:2182

2020-07-01 Thread Yangfei (Felix)
Hi,

> -Original Message-
> From: Richard Biener [mailto:rguent...@suse.de]
> Sent: Tuesday, June 30, 2020 10:50 PM
> To: Richard Sandiford 
> Cc: Richard Biener ; Yangfei (Felix)
> ; gcc-patches@gcc.gnu.org
> Subject: Re: [PATCH PR95961] vect: ICE: in exact_div, at poly-int.h:2182
> 
> On June 30, 2020 4:23:03 PM GMT+02:00, Richard Sandiford
>  wrote:
> >Richard Biener  writes:
> >> On Tue, 30 Jun 2020, Richard Sandiford wrote:
> >>
> >>> Richard Biener  writes:
> >>> > On Tue, Jun 30, 2020 at 2:18 PM Richard Sandiford
> >>> >  wrote:
> >>> >>
> >>> >> "Yangfei (Felix)"  writes:
> >>> >> > diff --git a/gcc/testsuite/gcc.dg/vect/vect-ctor-1.c
> >b/gcc/testsuite/gcc.dg/vect/vect-ctor-1.c
> >>> >> > index e050db1a2e4..ea39fcac0e0 100644
> >>> >> > --- a/gcc/testsuite/gcc.dg/vect/vect-ctor-1.c
> >>> >> > +++ b/gcc/testsuite/gcc.dg/vect/vect-ctor-1.c
> >>> >> > @@ -1,6 +1,7 @@
> >>> >> >  /* { dg-do compile } */
> >>> >> >  /* { dg-additional-options "-O3" } */
> >>> >> >  /* { dg-additional-options "-mavx2" { target { i?86-*-*
> >x86_64-*-* } } } */
> >>> >> > +/* { dg-additional-options "-march=armv8.2-a+sve
> >-fno-vect-cost-model" { target aarch64*-*-* } } */
> >>> >> >
> >>> >> >  typedef struct {
> >>> >> >  unsigned short mprr_2[5][16][16];
> >>> >>
> >>> >> This test is useful for Advanced SIMD too, so I think we should
> >continue
> >>> >> to test it with whatever options the person running the testsuite
> >chose.
> >>> >> Instead we could duplicate the test in gcc.target/aarch64/sve
> >with
> >>> >> appropriate options.
> >>> >>
> >>> >> > diff --git a/gcc/tree-vect-data-refs.c
> >b/gcc/tree-vect-data-refs.c
> >>> >> > index eb8288e7a85..b30a7d8a3bb 100644
> >>> >> > --- a/gcc/tree-vect-data-refs.c
> >>> >> > +++ b/gcc/tree-vect-data-refs.c
> >>> >> > @@ -1823,8 +1823,11 @@ vect_enhance_data_refs_alignment
> >(loop_vec_info loop_vinfo)
> >>> >> >   {
> >>> >> > poly_uint64 nscalars = (STMT_SLP_TYPE
> >(stmt_info)
> >>> >> > ? vf * DR_GROUP_SIZE
> >(stmt_info) : vf);
> >>> >> > -   possible_npeel_number
> >>> >> > - = vect_get_num_vectors (nscalars, vectype);
> >>> >> > +   if (maybe_lt (nscalars, TYPE_VECTOR_SUBPARTS
> >(vectype)))
> >>> >> > + possible_npeel_number = 0;
> >>> >> > +   else
> >>> >> > + possible_npeel_number
> >>> >> > +   = vect_get_num_vectors (nscalars, vectype);
> >>> >> >
> >>> >> > /* NPEEL_TMP is 0 when there is no
> >misalignment, but also
> >>> >> >allow peeling NELEMENTS.  */
> >>> >>
> >>> >> OK, so this is coming from:
> >>> >>
> >>> >>   int s[16][2];
> >>> >>   …
> >>> >>   … =s[j][1];
> >>> >>
> >>> >> and an SLP node containing 16 instances of “s[j][1]”.  The
> >DR_GROUP_SIZE
> >>> >> is 2 because that's the inner dimension of “s”.
> >>> >>
> >>> >> I don't think maybe_lt is right here though.  The same problem
> >could in
> >>> >> principle happen for cases in which NSCALARS >
> >TYPE_VECTOR_SUBPARTS,
> >>> >> e.g. for different inner dimensions of “s”.
> >>> >>
> >>> >> I think the testcase shows that using DR_GROUP_SIZE in this
> >calculation
> >>> >> is flawed.  I'm not sure whether we can really do better given
> >the current
> >>> >> representation though.  This is one case where having a separate
> >dr_vec_info
> >>> >> per SLP node would help.
> >>> >
> >>> > I guess what the code likes to know is what we now have in
> >SLP_TREE_LANES
> >>> > (or formerly group_size) but t

[PATCH PR95961] vect: ICE: in exact_div, at poly-int.h:2182

2020-06-29 Thread Yangfei (Felix)
Hi,

PR: https://gcc.gnu.org/bugzilla/show_bug.cgi?id=95961 

In the test case for PR95961, the vectorization factor computed by 
vect_determine_vectorization_factor is [8,8].  But this is updated to [1,1] 
later by vect_update_vf_for_slp.
When we call vect_get_num_vectors in vect_enhance_data_refs_alignment, the 
number of scalars, which is based on the vectorization factor, is not a 
multiple of the number of elements in the vector type.  This leads to the ICE.
We should check for that before calling vect_get_num_vectors and set the local 
variable 'possible_npeel_number' to zero if there are too few scalars.
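Concretely, the guard added in vect_enhance_data_refs_alignment looks roughly 
like this (the same hunk is quoted in the review thread above):

  if (maybe_lt (nscalars, TYPE_VECTOR_SUBPARTS (vectype)))
    possible_npeel_number = 0;
  else
    possible_npeel_number
      = vect_get_num_vectors (nscalars, vectype);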

Bootstrapped and tested on aarch64-linux-gnu.  ChangeLog updates are contained 
in the patch.

Comments?

Thanks,
Felix


pr95961-v1.diff
Description: pr95961-v1.diff


[PATCH] vect: Use vect_relevant_for_alignment_p consistently

2020-06-18 Thread Yangfei (Felix)
Hi,

  Noticed two places in tree-vect-data-refs.c where we can use the function 
vect_relevant_for_alignment_p.
  Looks like these two were missed when the function was introduced.
  Bootstrapped and tested on aarch64-linux-gnu.  OK to go?

  ChangeLog modification is contained in the patch.
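  The change is of the following general shape in both places (illustrative 
only; not the exact hunks):

  /* Before: open-coded checks in the loop over the data references.  */
  if (!STMT_VINFO_RELEVANT_P (stmt_info))
    continue;
  if (STMT_VINFO_GROUPED_ACCESS (stmt_info)
      && DR_GROUP_FIRST_ELEMENT (stmt_info) != stmt_info)
    continue;

  /* After: a single call to the helper.  */
  if (!vect_relevant_for_alignment_p (dr_info))
    continue;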

Thanks,
Felix


patch-v1.diff
Description: patch-v1.diff


RE: [PATCH] vect: Use LOOP_VINFO_DATAREFS and LOOP_VINFO_DDRS consistently

2020-06-15 Thread Yangfei (Felix)
Hi,

> -Original Message-
> From: Richard Biener [mailto:richard.guent...@gmail.com]
> Sent: Monday, June 15, 2020 5:12 PM
> To: Yangfei (Felix) 
> Cc: gcc-patches@gcc.gnu.org
> Subject: Re: [PATCH] vect: Use LOOP_VINFO_DATAREFS and
> LOOP_VINFO_DDRS consistently
> 
> >
> > Thanks for reviewing this.   Could you please help install it?
> 
> Pushed.  Please remember to verify the ChangeLog - I needed to replace
> leading 8 spaces with tabs.
 
Thanks for the effort. 
Looks like the issue was caused by copy-and-paste; I will pay attention next 
time.

Felix


RE: [PATCH] vect: Use LOOP_VINFO_DATAREFS and LOOP_VINFO_DDRS consistently

2020-06-15 Thread Yangfei (Felix)
Hi Richard,

> -Original Message-
> From: Richard Biener [mailto:richard.guent...@gmail.com]
> Sent: Monday, June 15, 2020 3:25 PM
> To: Yangfei (Felix) 
> Cc: gcc-patches@gcc.gnu.org
> Subject: Re: [PATCH] vect: Use LOOP_VINFO_DATAREFS and
> LOOP_VINFO_DDRS consistently
> 
> On Sat, Jun 13, 2020 at 4:46 AM Yangfei (Felix) 
> wrote:
> >
> > Hi,
> >
> > These are minor code refactorings in tree-vect-data-refs.c and 
> > tree-vect-loop.c.
> > Use LOOP_VINFO_DATAREFS and LOOP_VINFO_DDRS when possible and
> rename
> > several parameters to make code more consistent.
> >
> > Bootstrapped and tested on aarch64-linux-gnu.  OK?
> 
> OK.

Thanks for reviewing this.   Could you please help install it?

Regards,
Felix


[PATCH] vect: Use LOOP_VINFO_DATAREFS and LOOP_VINFO_DDRS consistently

2020-06-12 Thread Yangfei (Felix)
Hi,

These are minor code refactorings in tree-vect-data-refs.c and tree-vect-loop.c:
use LOOP_VINFO_DATAREFS and LOOP_VINFO_DDRS when possible and rename
several parameters to make the code more consistent.

Bootstrapped and tested on aarch64-linux-gnu.  OK?

Thanks,
Felix

gcc/

+2020-06-13  Felix Yang  
+
+   * tree-vect-data-refs.c (vect_verify_datarefs_alignment): Rename
+   parameter to loop_vinfo and update uses.  Use LOOP_VINFO_DATAREFS
+   when possible.
+   (vect_analyze_data_refs_alignment): Likewise, and use LOOP_VINFO_DDRS
+   when possible.
+   * tree-vect-loop.c (vect_dissolve_slp_only_groups): Use
+   LOOP_VINFO_DATAREFS when possible.
+   (update_epilogue_loop_vinfo): Likewise.


patch-v1.diff
Description: patch-v1.diff


RE: [PATCH PR95570] vect: ICE: Segmentation fault in vect_loop_versioning

2020-06-11 Thread Yangfei (Felix)
Hi,

> -Original Message-
> From: Richard Sandiford [mailto:richard.sandif...@arm.com]
> Sent: Friday, June 12, 2020 2:29 AM
> To: Yangfei (Felix) 
> Cc: gcc-patches@gcc.gnu.org
> Subject: Re: [PATCH PR95570] vect: ICE: Segmentation fault in
> vect_loop_versioning
> 
> "Yangfei (Felix)"  writes:
> > From 30a0196b0afd45bae9291cfab3fee4ad6b90cbcb Mon Sep 17 00:00:00
> 2001
> > From: Fei Yang 
> > Date: Thu, 11 Jun 2020 19:33:22 +0800
> > Subject: [PATCH] vect: Fix an ICE in vect_loop_versioning [PR95570]
> >
> > In the test case for PR95570, the only data reference in the loop is a
> > gather-scatter access.  Scalar evolution analysis for this data
> > reference failed, so DR_STEP is NULL_TREE.  This leads to the segmentation
> fault.
> > We should filter out scatter-gather access in
> vect_enhance_data_refs_alignment.
> 
> Looks good, just a couple of very minor details…
> 
> > 2020-06-11  Felix Yang  
> >
> > gcc/
> > PR tree-optimization/95570
> > * tree-vect-data-refs.c (vect_relevant_for_alignment_p): New
> function.
> > (vect_verify_datarefs_alignment): Call it to filter out data 
> > references
> > in the loop whose alignment is irrelevant.
> > (vect_get_peeling_costs_all_drs): Likewise.
> > (vect_peeling_supportable): Likewise.
> > (vect_enhance_data_refs_alignment): Likewise.
> >
> > gcc/testsuite/
> >
> > PR tree-optimization/95570
> > * gcc.dg/vect/pr95570.c: New test.
> > ---
> >  gcc/testsuite/gcc.dg/vect/pr95570.c | 11 
> >  gcc/tree-vect-data-refs.c   | 83 -
> >  2 files changed, 45 insertions(+), 49 deletions(-)  create mode
> > 100644 gcc/testsuite/gcc.dg/vect/pr95570.c
> >
> > diff --git a/gcc/testsuite/gcc.dg/vect/pr95570.c
> > b/gcc/testsuite/gcc.dg/vect/pr95570.c
> > new file mode 100644
> > index 000..b9362614004
> > --- /dev/null
> > +++ b/gcc/testsuite/gcc.dg/vect/pr95570.c
> > @@ -0,0 +1,11 @@
> > +/* { dg-do compile } */
> > +/* { dg-additional-options "-march=armv8.2-a+sve
> > +-msve-vector-bits=256 -mstrict-align -fwrapv" { target aarch64*-*-* }
> > +} */
> > +
> > +int x[8][32];
> > +
> > +void
> > +foo (int start)
> > +{
> > +  for (int i = start; i < start + 16; i++)
> > +x[start][i] = i;
> > +}
> > diff --git a/gcc/tree-vect-data-refs.c b/gcc/tree-vect-data-refs.c
> > index 39d5a1b554c..98f6fb76ff9 100644
> > --- a/gcc/tree-vect-data-refs.c
> > +++ b/gcc/tree-vect-data-refs.c
> > @@ -1129,6 +1129,35 @@ vect_update_misalignment_for_peel
> (dr_vec_info *dr_info,
> >SET_DR_MISALIGNMENT (dr_info, DR_MISALIGNMENT_UNKNOWN);  }
> >
> > +/* Return TRUE if alignment is relevant for DR_INFO.  */
> 
> Just “Return true …“ for new code.  TRUE is a hold-over from C days.

OK.

> > +static bool
> > +vect_relevant_for_alignment_p (dr_vec_info *dr_info)
> > +{
> > +  stmt_vec_info stmt_info = dr_info->stmt;
> > +
> > +  if (!STMT_VINFO_RELEVANT_P (stmt_info))
> > +return false;
> > +
> > +  /* For interleaving, only the alignment of the first access
> > + matters.  */
> > +  if (STMT_VINFO_GROUPED_ACCESS (stmt_info)
> > +  && DR_GROUP_FIRST_ELEMENT (stmt_info) != stmt_info)
> > +return false;
> > +
> > +  /* For scatter-gather or invariant accesses, alignment is irrelevant
> > + for them.  */
> 
> Maybe:
> 
>   /* Scatter-gather and invariant accesses continue to address individual
>  scalars, so vector-level alignment is irrelevant.  */
 
Much better :-)   Updated accordingly.
Also bootstrapped and tested on x86_64-linux-gnu.

Thanks,
Felix

gcc/

+2020-06-12  Felix Yang  
+
+   PR tree-optimization/95570
+   * tree-vect-data-refs.c (vect_relevant_for_alignment_p): New function.
+   (vect_verify_datarefs_alignment): Call it to filter out data references
+   in the loop whose alignment is irrelevant.
+   (vect_get_peeling_costs_all_drs): Likewise.
+   (vect_peeling_supportable): Likewise.
+   (vect_enhance_data_refs_alignment): Likewise.

gcc/testsuite/

+2020-06-12  Felix Yang  
+
+   PR tree-optimization/95570
+   * gcc.dg/vect/pr95570.c: New test.



pr95570-v3.diff
Description: pr95570-v3.diff


RE: [PATCH PR95570] vect: ICE: Segmentation fault in vect_loop_versioning

2020-06-11 Thread Yangfei (Felix)
Hi,

> -Original Message-
> From: Richard Sandiford [mailto:richard.sandif...@arm.com]
> Sent: Thursday, June 11, 2020 12:23 AM
> To: Yangfei (Felix) 
> Cc: gcc-patches@gcc.gnu.org
> Subject: Re: [PATCH PR95570] vect: ICE: Segmentation fault in
> vect_loop_versioning
> 
> "Yangfei (Felix)"  writes:
> > Hi,
> >
> > PR: https://gcc.gnu.org/bugzilla/show_bug.cgi?id=95570
> >
> > Here, we are doing loop versioning for alignment. The only dr here is a
> gather-scatter operation: x[start][i].
> > Scalar evolution analysis for this dr failed, so DR_STEP is NULL_TREE, which
> leads to the segfault.
> > But scatter-gather operation should be filtered out in
> vect_enhance_data_refs_alignment.
> > There are similar issues in vect_verify_datarefs_alignment,
> vect_get_peeling_costs_all_drs and vect_peeling_supportable.
> > Proposed patch adds back the necessary tests.  Bootstrapped and tested
> on aarch64-linux-gnu & x86_64-linux-gnu.
> >
> > Test coverage:
> > Existing tests [1] and newly added test ensures coverage for all the changes
> except for the changes in vect_peeling_supportable.
> > Currently I don't have a test to cover the changes in
> vect_peeling_supportable.  Should we keep them?
> 
> Rather than add several instances of the new test, I think it would make
> sense to split the (hopefully) correct conditions in
> vect_enhance_data_refs_alignment out into a subroutine and use it in the
> other sites.  Doing that for vect_peeling_supportable would then be
> justifiable as a clean-up.

OK.

> How about something like vect_relevant_for_alignment_p as the name of
> the subroutine?

Nice name.   Does the v2 patch look better?
Bootstrapped and tested on aarch64-linux-gnu.
The newly added test fails without the fix and passes otherwise.

gcc/

+2020-06-11  Felix Yang  
+
+   PR tree-optimization/95570
+   * tree-vect-data-refs.c (vect_relevant_for_alignment_p): New function.
+   (vect_verify_datarefs_alignment): Call it to filter out data references
+   in the loop whose alignment is irrelevant.
+   (vect_get_peeling_costs_all_drs): Likewise.
+   (vect_peeling_supportable): Likewise.
+   (vect_enhance_data_refs_alignment): Likewise.

gcc/testsuite/

+2020-06-11  Felix Yang  
+
+   PR tree-optimization/95570
+   * gcc.dg/vect/pr95570.c: New test.

Thanks,
Felix


pr95570-v2.diff
Description: pr95570-v2.diff


[PATCH PR95570] vect: ICE: Segmentation fault in vect_loop_versioning

2020-06-10 Thread Yangfei (Felix)
Hi,

PR: https://gcc.gnu.org/bugzilla/show_bug.cgi?id=95570 

Here, we are doing loop versioning for alignment.  The only dr here is a 
gather-scatter operation: x[start][i]. 
Scalar evolution analysis for this dr failed, so DR_STEP is NULL_TREE, which 
leads to the segfault. 
But scatter-gather operations should be filtered out in 
vect_enhance_data_refs_alignment. 
There are similar issues in vect_verify_datarefs_alignment, 
vect_get_peeling_costs_all_drs and vect_peeling_supportable. 
The proposed patch adds back the necessary tests.  Bootstrapped and tested on 
aarch64-linux-gnu & x86_64-linux-gnu. 
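The kind of check being added back is roughly the following (a sketch only; 
the later v2/v3 patches, quoted earlier in this archive, move it into the new 
vect_relevant_for_alignment_p helper):

  /* Scatter-gather and invariant accesses continue to address individual
     scalars, so vector-level alignment is irrelevant.  */
  if (STMT_VINFO_GATHER_SCATTER_P (stmt_info)
      || integer_zerop (DR_STEP (dr_info->dr)))
    continue;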

Test coverage: 
Existing tests [1] and the newly added test ensure coverage for all the changes 
except for the changes in vect_peeling_supportable. 
Currently I don't have a test to cover the changes in vect_peeling_supportable. 
 Should we keep them? 

Thanks,
Felix

gcc:
+2020-06-10  Felix Yang  
+
+   PR tree-optimization/95570
+   * tree-vect-data-refs.c (vect_verify_datarefs_alignment): Filter
+   out gather-scatter and invariant accesses when iterating over all
+   data references.
+   (vect_get_peeling_costs_all_drs): Likewise.
+   (vect_peeling_supportable): Likewise, and filter out irrelevant data
+   references.
+   (vect_enhance_data_refs_alignment): Likewise when checking if
+   versioning for alignment is needed.

gcc/testsuite:
+2020-06-10  Felix Yang  
+
+   PR tree-optimization/95570
+   * gcc.dg/vect/pr95570.c: New test.


[1]
gcc.dg/pr50133.c
gcc.dg/vect/pr49771.c
gcc.dg/pr87894.c
gcc.dg/vect/bb-slp-42.c
gcc.target/aarch64/sve/mask_gather_load_1.c



pr95570-v1.diff
Description: pr95570-v1.diff


RE: [PATCH PR95254] aarch64: gcc generate inefficient code with fixed sve vector length

2020-06-04 Thread Yangfei (Felix)
Hi,

> -Original Message-
> From: Richard Sandiford [mailto:richard.sandif...@arm.com]
> Sent: Tuesday, June 2, 2020 7:17 PM
> To: Yangfei (Felix) 
> Cc: gcc-patches@gcc.gnu.org; Uros Bizjak ; Jakub
> Jelinek ; Hongtao Liu ; H.J. Lu
> 
> Subject: Re: [PATCH PR95254] aarch64: gcc generate inefficient code with
> fixed sve vector length
>

Snip...
 
> >
> >> FAIL: gcc.target/i386/avx512f-vcvtps2ph-2.c (test for excess errors)
> >> UNRESOLVED: gcc.target/i386/avx512f-vcvtps2ph-2.c compilation failed
> >> to produce executable
> > 154803c154803
> 
> Looks good.  (I know I said that last time too :-))  I've also tested it on 
> arm-
> linux-gnueabihf and powerpc64le-linux-gnu without problems.

Thanks for reviewing and testing the patch  :-)

> As before, I'll hold off applying until the AVX512 problem is fixed.

Looks like the AVX512 problem is fixed with:

https://gcc.gnu.org/git/gitweb.cgi?p=gcc.git;h=43088bb4dadd3d14b6b594c5f9363fe879f3d7f7
 

I'm using: $ runtest --tool gcc i386.exp=avx512f-vcvtps2ph-2.c

Thanks,
Felix


RE: [PATCH PR95459] aarch64: ICE in aarch64_short_vector_p, at config/aarch64/aarch64.c:16803

2020-06-03 Thread Yangfei (Felix)
Hi Richard,

> -Original Message-
> From: Richard Sandiford [mailto:richard.sandif...@arm.com]
> Sent: Wednesday, June 3, 2020 1:19 AM
> To: Yangfei (Felix) 
> Cc: gcc-patches@gcc.gnu.org
> Subject: Re: [PATCH PR95459] aarch64: ICE in aarch64_short_vector_p, at
> config/aarch64/aarch64.c:16803
> 
> "Yangfei (Felix)"  writes:
> > Hi,
> >
> > Please review this trivial patch fixing an ICE in 
> > aarch64_short_vector_p.
> > Bug: https://gcc.gnu.org/bugzilla/show_bug.cgi?id=95459
> >
> > In aarch64_short_vector_p, we are simply checking whether a type (and
> a mode)
> > is a 64/128-bit short vector or not.  This should not be affected by the
> value
> > of TARGET_SVE.  Simply leave later code to report an error if SVE is
> disabled.
> > Bootstrapped and tested on aarch64-linux-gnu.  OK?
> 
> OK, thanks.  Pushed to master.

Thanks for reviewing and installing the patch :-)

Felix


[PATCH PR95459] aarch64: ICE in aarch64_short_vector_p, at config/aarch64/aarch64.c:16803

2020-06-02 Thread Yangfei (Felix)
Hi,

Please review this trivial patch fixing an ICE in aarch64_short_vector_p.
Bug: https://gcc.gnu.org/bugzilla/show_bug.cgi?id=95459 

In aarch64_short_vector_p, we are simply checking whether a type (and a mode)
is a 64/128-bit short vector or not.  This should not be affected by the value
of TARGET_SVE.  Simply leave later code to report an error if SVE is disabled.
Bootstrapped and tested on aarch64-linux-gnu.  OK?

gcc/ChangeLog
@@ -1,3 +1,9 @@
+2020-06-02  Felix Yang  
+
+   PR target/95459
+   * config/aarch64/aarch64.c (aarch64_short_vector_p):
+   Leave later code to report an error if SVE is disabled.

gcc/testsuite/ChangeLog
@@ -1,3 +1,8 @@
+2020-06-02  Felix Yang  
+
+   PR target/95459
+   * gcc.target/aarch64/mgeneral-regs_6.c: New test.

Thanks,
Felix


pr95459-v0.diff
Description: pr95459-v0.diff


PING: RE: [PATCH PR94026] combine missed opportunity to simplify comparisons with zero

2020-06-02 Thread Yangfei (Felix)
Gentle ping ...

> -Original Message-
> From: Yangfei (Felix)
> Sent: Wednesday, May 27, 2020 11:52 AM
> To: 'Segher Boessenkool' 
> Cc: gcc-patches@gcc.gnu.org; Zhanghaijian (A) 
> Subject: RE: [PATCH PR94026] combine missed opportunity to simplify
> comparisons with zero
> 
> Hi,
> 
> > -Original Message-
> > From: Segher Boessenkool [mailto:seg...@kernel.crashing.org]
> > Sent: Tuesday, May 26, 2020 11:32 PM
> > To: Yangfei (Felix) 
> > Cc: gcc-patches@gcc.gnu.org; Zhanghaijian (A)
> > 
> > Subject: Re: [PATCH PR94026] combine missed opportunity to simplify
> > comparisons with zero
> 
> Snip...
> 
> >
> > Yes, please try to get this sorted somehow.  Maybe you can ask other
> > people in your company that have this same problem?
> 
> Will try and see.
> 
> > > > > +   new_rtx = gen_rtx_AND (mode, new_rtx,
> > > > > +  gen_int_mode (mask << real_pos,
> mode));
> > > > > + }
> > > >
> > > > So this changes
> > > >   ((X >> C) & M) == ...
> > > > to
> > > >   (X & (M << C)) == ...
> > > > ?
> > > >
> > > > Where then does it check what ... is?  This is only valid like
> > > > this if that is
> > zero.
> > > >
> > > > Why should this go in combine and not in simplify-rtx instead?
> > >
> > > True.  This is only valid when ... is zero.
> > > That's why we need the "&& equality_comparison " condition here.
> >
> > But that doesn't test if the other side of the comparison is 0.
> 
> Well, the caller has ensured that.
> 
> Here, local variable "equality_comparison" in
> make_compound_operation_int depends on parameter "in_code":
>  8088   if (in_code == EQ)
>  8089 {
>  8090   equality_comparison = true;
>  8091   in_code = COMPARE;
>  8092 }
> 
> The only caller of make_compound_operation_int is
> make_compound_operation.
> The comment of the caller says something about " in_code ":
> 
>  8512IN_CODE says what kind of expression we are processing.  Normally, it
> is
>  8513SET.  In a memory address it is MEM.  When processing the arguments
> of
>  8514a comparison or a COMPARE against zero, it is COMPARE, or EQ if
> more
>  8515precisely it is an equality comparison against zero.  */
> 
> For the given test case, we have a call trace of:
> (gdb) bt
> #0  make_compound_operation_int (mode=..., x_ptr=0xbd08,
> in_code=COMPARE, next_code_ptr=0xbd1c) at ../../gcc-
> git/gcc/combine.c:8248
> #1  0x0208983c in make_compound_operation (x=0xb211c768,
> in_code=EQ) at ../../gcc-git/gcc/combine.c:8539
> #2  0x020970fc in simplify_comparison (code=NE,
> pop0=0xc1e8, pop1=0xc1e0) at ../../gcc-
> git/gcc/combine.c:13032
> #3  0x02084544 in simplify_set (x=0xb211c240) at ../../gcc-
> git/gcc/combine.c:6932
> #4  0x02082688 in combine_simplify_rtx (x=0xb211c240,
> op0_mode=E_VOIDmode, in_dest=0, in_cond=0) at ../../gcc-
> git/gcc/combine.c:6445
> #5  0x0208025c in subst (x=0xb211c240, from=0xb211c138,
> to=0xb211c150, in_dest=0, in_cond=0, unique_copy=0)
> at ../../gcc-git/gcc/combine.c:5724
> #6  0x02079110 in try_combine (i3=0xb22cc3c0, i2=0xb22cc340,
> i1=0x0, i0=0x0, new_direct_jump_p=0xceb4,
> last_combined_insn=0xb22cc3c0) at ../../gcc-git/gcc/combine.c:3413
> #7  0x02073004 in combine_instructions (f=0xb211d038,
> nregs=103) at ../../gcc-git/gcc/combine.c:1305
> #8  0x0209cc50 in rest_of_handle_combine () at ../../gcc-
> git/gcc/combine.c:15088
> 
> In simplify_comparison (combine.c:13032):
> 
> 13028   rtx_code op0_mco_code = SET;
> 13029   if (op1 == const0_rtx)
> 13030 op0_mco_code = code == NE || code == EQ ? EQ : COMPARE;
> 13031
> 13032   op0 = make_compound_operation (op0, op0_mco_code);
> 13033   op1 = make_compound_operation (op1, SET);
> 
> 
> > > > > --- /dev/null
> > > > > +++ b/gcc/testsuite/gcc.dg/pr94026.c
> > > > > @@ -0,0 +1,21 @@
> > > > > +/* { dg-do compile { target aarch64*-*-* i?86-*-* x86_64-*-* }
> > > > > +} */
> > > >
> > > > Why restrict this to only some targets?
> > >
> > > That's because I only have these targets for verification.
> > > But I think this can work on other targets.  Removed from the v4 patch.
> > > Could you please help check

RE: [PATCH PR95254] aarch64: gcc generate inefficient code with fixed sve vector length

2020-06-01 Thread Yangfei (Felix)
Hi,

> -Original Message-
> From: Richard Sandiford [mailto:richard.sandif...@arm.com]
> Sent: Monday, June 1, 2020 4:47 PM
> To: Yangfei (Felix) 
> Cc: gcc-patches@gcc.gnu.org; Uros Bizjak ; Jakub
> Jelinek ; Hongtao Liu ; H.J. Lu
> 
> Subject: Re: [PATCH PR95254] aarch64: gcc generate inefficient code with
> fixed sve vector length

Snip...
 
> Sounds good.  Maybe at this point the x_inner and y_inner code is getting
> complicated enough to put into a lambda too:
> 
>   x_inner = ... (x);
>   y_inner = ... (y);
> 
> Just a suggestion though.

Yes, that's a good suggestion.  I see the code becomes cleaner with 
another lambda.
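For reference, the lambda presumably ends up looking something like this (a 
sketch only: the helper name is mine, and the exact committed code, including 
how the mov_optab condition from v5 is folded in, may differ in detail):

  /* Return the inner register of a same-size (subreg (reg)) operand whose
     inner mode has a move pattern, or NULL_RTX otherwise.  */
  auto candidate_subreg_reg = [&](rtx op) -> rtx
    {
      if (SUBREG_P (op)
          && REG_P (SUBREG_REG (op))
          && known_eq (GET_MODE_SIZE (GET_MODE (SUBREG_REG (op))),
                       GET_MODE_SIZE (mode))
          && optab_handler (mov_optab, GET_MODE (SUBREG_REG (op)))
             != CODE_FOR_nothing)
        return SUBREG_REG (op);
      return NULL_RTX;
    };

  x_inner = candidate_subreg_reg (x);
  y_inner = candidate_subreg_reg (y);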
 
> Yeah, looks good.
> 
> Formatting nit though: multi-line conditions should be wrapped in (...),
> i.e.:
> 
> return (...
> && ...
> && ...);
> 

Done.  The v6 patch is based on trunk 20200601.
Bootstrapped and tested on aarch64-linux-gnu. 
Also bootstrapped on x86_64-linux-gnu with --enable-multilib (for building -m32 
x86 libgcc).
Regression testing on x86_64-linux-gnu looks good except for the following 
failures, which have been confirmed by x86 devs: 

> FAIL: gcc.target/i386/avx512f-vcvtps2ph-2.c (test for excess errors)
> UNRESOLVED: gcc.target/i386/avx512f-vcvtps2ph-2.c compilation failed to 
> produce executable
154803c154803

Thanks,
Felix



pr95254-v6.diff
Description: pr95254-v6.diff


RE: [PATCH PR95254] aarch64: gcc generate inefficient code with fixed sve vector length

2020-05-31 Thread Yangfei (Felix)
Hi,

> -Original Message-
> From: Richard Sandiford [mailto:richard.sandif...@arm.com]
> Sent: Sunday, May 31, 2020 12:01 AM
> To: Yangfei (Felix) 
> Cc: gcc-patches@gcc.gnu.org; Uros Bizjak ; Jakub
> Jelinek ; Hongtao Liu ; H.J. Lu
> 
> Subject: Re: [PATCH PR95254] aarch64: gcc generate inefficient code with
> fixed sve vector length
> 

Snip...

> >
> > The v5 patch attached addressed this issue.
> >
> > There two added changes compared with the v4 patch:
> > 1. In candidate_mem_p, mov_optab for innermode should be available.
> >  In this case, mov_optab for SDmode is not there and subreg are added
> back by emit_move_insn_1.  So we won't get the benefit with the patch.
> 
> I agree we should have this check.  I think the rule applies to all of the
> transforms though, not just the mem one, so we should add the check to the
> register and constant cases too.

OK.  I changed to make this an extra condition for calculating x_inner & 
y_inner.

> > 2. Instead of using adjust_address, I changed to use adjust_address_nv to
> avoid the emit of invalid insn 13.
> > The latter call to validize_mem() in emit_move_insn will take care of 
> > the
> address for us.
> 
> The validation performed by validize_mem is the same as that performed by
> adjust_address, so the only case this should make a difference is for
> push_operands:

True.

>   /* If X or Y are memory references, verify that their addresses are valid
>  for the machine.  */
>   if (MEM_P (x)
>   && (! memory_address_addr_space_p (GET_MODE (x), XEXP (x, 0),
>MEM_ADDR_SPACE (x))
> && ! push_operand (x, GET_MODE (x))))
> x = validize_mem (x);
> 
>   if (MEM_P (y)
>   && ! memory_address_addr_space_p (GET_MODE (y), XEXP (y, 0),
>   MEM_ADDR_SPACE (y)))
> y = validize_mem (y);
> 
> So I think the fix is to punt on push_operands instead (and continue to use
> adjust_address rather than adjust_address_nv).

Not sure if I understand it correctly.
Do you mean excluding push_operand in candidate_mem_p? Like:

 3830   auto candidate_mem_p = [&](machine_mode innermode, rtx mem) {
 3831 return !targetm.can_change_mode_class (innermode, GET_MODE (mem), ALL_REGS)
 3832&& !push_operand (mem, GET_MODE (mem))
 3833/* Not a candidate if innermode requires too much alignment.  */
 3834&& (MEM_ALIGN (mem) >= GET_MODE_ALIGNMENT (innermode)
 3835|| targetm.slow_unaligned_access (GET_MODE (mem),
 3836  MEM_ALIGN (mem))
 3837|| !targetm.slow_unaligned_access (innermode, MEM_ALIGN (mem)));
 3838   };

Thanks,
Felix


RE: [PATCH PR95254] aarch64: gcc generate inefficient code with fixed sve vector length

2020-05-30 Thread Yangfei (Felix)
Hi,

> -Original Message-
> From: Yangfei (Felix)
> Sent: Friday, May 29, 2020 2:56 PM
> To: 'Hongtao Liu' ; H.J. Lu 
> Cc: gcc-patches@gcc.gnu.org; Uros Bizjak ; Jakub
> Jelinek ; Richard Sandiford
> 
> Subject: RE: [PATCH PR95254] aarch64: gcc generate inefficient code with
> fixed sve vector length

Snip...

> 
> Yes, I tried your configure and reproduced the error.  Thanks for pointing 
> this
> out.
> The patch can pass bootstrap on x86_64 with the following configure options.
> Surprised to see that it failed to build with your configure.

Turns out that this ICE triggers under x86 -m32.

Command to reproduce:
~/build-gcc/x86_64-pc-linux-gnu/32/libgcc$ gcc  -g -O2 -m32 -O2  -g -O2 
-DIN_GCC -W -Wall -Wno-narrowing -Wwrite-strings -Wcast-qual 
-Wstrict-prototypes -Wmissing-prototypes -Wold-style-definition  -isystem 
./include   -fpic -mlong-double-80 -DUSE_ELF_SYMVER -fcf-protection -mshstk -g 
-DIN_LIBGCC2 -fbuilding-libgcc -fno-stack-protector   -fpic -mlong-double-80 
-DUSE_ELF_SYMVER -fcf-protection -mshstk -I. -I. -I../../.././gcc 
-I../../../../gcc-git/libgcc -I../../../../gcc-git/libgcc/. 
-I../../../../gcc-git/libgcc/../gcc -I../../../../gcc-git/libgcc/../include 
-I../../../../gcc-git/libgcc/config/libbid -DENABLE_DECIMAL_BID_FORMAT 
-DHAVE_CC_TLS  -DUSE_TLS -o _isinfd32.o -MT _isinfd32.o -MD -MP -MF 
_isinfd32.dep -c ../../../../gcc-git/libgcc/config/libbid/_isinfd32.c

Source:
 28 int
 29 isinfd32 (_Decimal32 x) {
 30   int res;
 31   UINT64 x64;
 32   union decimal32 ux;
 33
 34   ux.d = x;
 35   x64 = __bid32_to_bid64 (ux.i);
 36   res = __bid64_isInf (x64);
 37   return (res);
 38 }

On the crash site (emit_single_push_insn), we have three insns:
(gdb) p debug_rtx (prev)
(insn 12 0 13 (parallel [
(set (reg/f:SI 7 sp)
(plus:SI (reg/f:SI 7 sp)
(const_int -12 [0xfff4])))
(clobber (reg:CC 17 flags))
]) "../../../../gcc-git/libgcc/config/libbid/_isinfd32.c":35:9 -1
 (expr_list:REG_ARGS_SIZE (const_int 12 [0xc])
(nil)))
$1 = void
(gdb) p debug_rtx (last)
(insn 14 13 0 (set (mem:SI (reg/f:SI 87) [0  S4 A32])
(subreg:SI (reg/v:SD 86 [ x ]) 0)) 
"../../../../gcc-git/libgcc/config/libbid/_isinfd32.c":35:9 -1
 (nil))
$2 = void
(gdb) p debug_rtx (PREV_INSN (last))
(insn 13 12 14 (set (reg/f:SI 87)
(pre_dec:SI (reg/f:SI 7 sp))) 
"../../../../gcc-git/libgcc/config/libbid/_isinfd32.c":35:9 -1
 (nil))
$3 = void

Here, insn 13 is invalid.  It is emitted by: x = adjust_address (x, GET_MODE (y_inner), 0);

The v5 patch attached addressed this issue.

There are two added changes compared with the v4 patch:
1. In candidate_mem_p, mov_optab for innermode should be available. 
 In this case, mov_optab for SDmode is not there and subregs are added back 
by emit_move_insn_1, so we won't get the benefit of the patch. 

2. Instead of using adjust_address, I changed to use adjust_address_nv to avoid 
emitting the invalid insn 13. 
The latter call to validize_mem() in emit_move_insn will take care of the 
address for us. 

Bootstrapped and tested on aarch64-linux-gnu.  I am running another test on 
x86.  
Richard, could you please take a further look? 

Thanks,
Felix


pr95254-v5.diff
Description: pr95254-v5.diff


RE: [PATCH PR95254] aarch64: gcc generate inefficient code with fixed sve vector length

2020-05-29 Thread Yangfei (Felix)
Hi,

> -Original Message-
> From: Hongtao Liu [mailto:crazy...@gmail.com]
> Sent: Friday, May 29, 2020 11:24 AM
> To: H.J. Lu 
> Cc: Yangfei (Felix) ; gcc-patches@gcc.gnu.org;
> Uros Bizjak ; Jakub Jelinek ;
> Richard Sandiford 
> Subject: Re: [PATCH PR95254] aarch64: gcc generate inefficient code with
> fixed sve vector length
> 

Snip...

> > >
> > > This is due to define_subst magic.  The generators automatically
> > > create a vec_merge form of the instruction based on the information
> > > in the  attributes.
> > >
> > > AFAICT the rtl above is for the line-125 instruction, which looks ok.
> > > The problem is the line-126 instruction, since vcvtps2ph doesn't
> > > AIUI allow zero masking.
> > >
> 
> zero masking is not allowed for mem_operand here, but available for
> register_operand.
> there's something wrong in the pattern, we need to fix it.
> (define_insn "avx512f_vcvtps2ph512"

Thanks for confirming that :-)

> 
> > > The "mask" define_subst allows both zeroing and merging, so I guess
> > > this means that the pattern should either be using a different
> > > define_subst, or should be enforcing merging in some other way.
> > > Please could one of the x86 devs take a look?
> > >
> >
> > Hongtao, can you take a look?
> >
> > Thanks.
> >
> >
> > --
> > H.J.
> 
> BTW, i failed to build gcc when apply pr95254-v4.txt.
> 
> gcc configure:
> 
> Using built-in specs.
> COLLECT_GCC=./gcc/xgcc
> Target: x86_64-pc-linux-gnu
> Configured with: ../../gcc/gnu-toolchain/gcc/configure
> --enable-languages=c,c++,fortran --disable-bootstrap Thread model: posix
> Supported LTO compression algorithms: zlib gcc version 11.0.0 20200528
> (experimental) (GCC)
> 
> host on x86_64 rel8.

Yes, I tried your configure and reproduced the error.  Thanks for pointing this 
out.
The patch can pass bootstrap on x86_64 with the following configure options.
Surprised to see that it failed to build with your configure.

Using built-in specs.
COLLECT_GCC=gcc
COLLECT_LTO_WRAPPER=/home/yangfei/gcc-hacking/install-gcc/libexec/gcc/x86_64-pc-linux-gnu/11/lto-wrapper
Target: x86_64-pc-linux-gnu
Configured with: ../gcc-git/configure 
--prefix=/home/yangfei/gcc-hacking/install-gcc 
--enable-languages=c,c++,objc,obj-c++,fortran,lto --enable-shared 
--enable-threads=posix --enable-checking=yes --disable-multilib 
--with-system-zlib --enable-__cxa_atexit --disable-libunwind-exceptions 
--enable-gnu-unique-object --enable-linker-build-id 
--with-gcc-major-version-only --enable-plugin --enable-initfini-array 
--without-isl --disable-libmpx --enable-gnu-indirect-function
Thread model: posix
Supported LTO compression algorithms: zlib
gcc version 11.0.0 20200526 (experimental) (GCC)

Felix


RE: [PATCH PR95254] aarch64: gcc generate inefficient code with fixed sve vector length

2020-05-28 Thread Yangfei (Felix)
Hi,

> -Original Message-
> From: Richard Sandiford [mailto:richard.sandif...@arm.com]
> Sent: Thursday, May 28, 2020 12:07 AM
> To: Yangfei (Felix) 
> Cc: gcc-patches@gcc.gnu.org
> Subject: Re: [PATCH PR95254] aarch64: gcc generate inefficient code with
> fixed sve vector length
> 

Snip...

> 
> Ah, OK.  But in that case, shouldn't we allow the change if the original
> unaligned MEM was also “slow”?
> 
> I guess there might be cases in which both modes are slow enough for the
> hook to return true for them, but one is worse than the other.
> But I don't think there's much we can do about that as things stand:
> changing the mode might move from a slow mode to a slower mode, but it
> might move in the other direction too.

Good point.

> > +2020-05-27  Felix Yang  
> > +   Richard Sandiford  
> 
> I appreciate the gesture, but I don't think it's appropriate to list me as an
> author.  I haven't written any of the code, I've just reviewed it. :-)

OK.

> > diff --git a/gcc/expr.c b/gcc/expr.c
> > index dfbeae71518..3035791c764 100644
> > --- a/gcc/expr.c
> > +++ b/gcc/expr.c
> > @@ -3814,6 +3814,69 @@ emit_move_insn (rtx x, rtx y)
> >gcc_assert (mode != BLKmode
> >   && (GET_MODE (y) == mode || GET_MODE (y) == VOIDmode));
> >
> > +  /* If we have a copy which looks like one of the following patterns:
> 
> s/which/that/ (I think)

OK.

> > +   (set (subreg:M1 (reg:M2 ...)) (subreg:M1 (reg:M2 ...)))
> > +   (set (subreg:M1 (reg:M2 ...)) (mem:M1 ADDR))
> > +   (set (mem:M1 ADDR) (subreg:M1 (reg:M2 ...)))
> > +   (set (subreg:M1 (reg:M2 ...)) (constant C))
> > + where mode M1 is equal in size to M2 and target hook
> can_change_mode_class
> > + (M1, M2, ALL_REGS) returns false, try to remove the subreg.  This
> avoids
> > + an implicit round trip through memory.  */
> 
> How about:
> 
>  where mode M1 is equal in size to M2, try to detect whether the
>  mode change involves an implicit round trip through memory.
>  If so, see if we can avoid that by removing the subregs and
>  doing the move in mode M2 instead.  */
> 
> > +  else if (x_inner != NULL_RTX
> > +  && MEM_P (y)
> > +  && ! targetm.can_change_mode_class (GET_MODE (x_inner),
> > +  mode, ALL_REGS)
> > +  /* Stop if the inner mode requires too much alignment.  */
> > +  && (! targetm.slow_unaligned_access (GET_MODE (x_inner),
> > +   MEM_ALIGN (y))
> > +  || MEM_ALIGN (y) >= GET_MODE_ALIGNMENT (GET_MODE
> (x_inner
> 
> It's better to check the alignment first, since it's cheaper.
> So taking the comment above into account, I think this ends up as:
> 
>  && (MEM_ALIGN (y) >= GET_MODE_ALIGNMENT (GET_MODE
> (x_inner))
>  || targetm.slow_unaligned_access (mode, MEM_ALIGN (y)
>  || !targetm.slow_unaligned_access (GET_MODE (x_inner),
> MEM_ALIGN (y))
> 
> (Note: no space after "!", although the sources aren't as consistent about
> that as they could be.)

OK.

> TBH I think it would be good to avoid duplicating such a complicated condition
> in both directions, so at the risk of getting flamed, how about using a 
> lambda?
> 
>   auto candidate_mem_p = [&](machine_mode inner_mode, rtx mem) {
> return ...;
>   };
> 
> with ... containing everything after the MEM_P check?

Yes, this avoids duplicating code.

> Looks good otherwise, thanks,

Thanks for reviewing this.
Attached please find the v5 patch.
Note: we also need to modify the local variable "mode" once we catch one of 
these cases.  I see test failures without this change.
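Presumably something along these lines (illustrative only; the exact placement 
is hypothetical):

      x = x_inner;
      y = y_inner;
      /* Keep the local mode variable in sync with the new operands.  */
      mode = GET_MODE (x_inner);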

Bootstrapped and tested on aarch64-linux-gnu.
Also bootstrapped on x86_64-linux-gnu.  Regression testing shows two failed 
tests on this platform:

1> FAIL: gcc.target/i386/avx512f-vcvtps2ph-2.c (test for excess errors)
2> FAIL: gcc.target/i386/pr67609.c scan-assembler movdqa

I have adjusted the second one in the v4 patch, but the first one looks strange 
to me.
I see GCC emits invalid x86 vcvtps2ph instructions which look like:

125 vcvtps2ph   $0, %zmm0, -112(%rbp){%k1}
126 vcvtps2ph   $0, %zmm0, -80(%rbp){%k1}{z}

This happens in the combine phase, where a combined insn looks like:

1989 Trying 31 -> 33:
1990 31: r106:V16HI=vec_merge(unspec[r103:V16SF,0] 133,[frame:DI-0x60],r109:HI)
1991 33: [frame:DI-0x60]=r106:V16HI
1992   REG_DEAD r106:V16HI
1993 Successfully matched this instruction:
1994 (set (mem/j/c:V16HI (plus:DI (reg/f:DI 19 frame)
1995  

RE: [PATCH PR95254] aarch64: gcc generate inefficient code with fixed sve vector length

2020-05-27 Thread Yangfei (Felix)
Hi,

> -Original Message-
> From: Richard Sandiford [mailto:richard.sandif...@arm.com]
> Sent: Tuesday, May 26, 2020 11:58 PM
> To: Yangfei (Felix) 
> Cc: gcc-patches@gcc.gnu.org
> Subject: Re: [PATCH PR95254] aarch64: gcc generate inefficient code with
> fixed sve vector length
> 
> Sorry for the slow reply, was off for a few days.
> 
> I think the new code ought to happen earlier in emit_move_insn, before:
> 
>   if (CONSTANT_P (y))
> {
> 
> That way, all the canonicalisation happens on the mode we actually want the
> move to have.

OK. That’s a good point.

> "Yangfei (Felix)"  writes:
> > diff --git a/gcc/expr.c b/gcc/expr.c
> > index dfbeae71518..4442fb83367 100644
> > --- a/gcc/expr.c
> > +++ b/gcc/expr.c
> > @@ -3852,6 +3852,62 @@ emit_move_insn (rtx x, rtx y)
> >
> >gcc_assert (mode != BLKmode);
> >
> > +  rtx x_inner = NULL_RTX;
> > +  rtx y_inner = NULL_RTX;
> > +  machine_mode x_inner_mode, y_inner_mode;
> > +
> > +  if (SUBREG_P (x)
> > +  && REG_P (SUBREG_REG (x))
> > +  && known_eq (SUBREG_BYTE (x), 0))
> > +{
> > +  x_inner = SUBREG_REG (x);
> > +  x_inner_mode = GET_MODE (x_inner);
> > +}
> > +  if (SUBREG_P (y)
> > +  && REG_P (SUBREG_REG (y))
> > +  && known_eq (SUBREG_BYTE (y), 0))
> > +{
> > +  y_inner = SUBREG_REG (y);
> > +  y_inner_mode = GET_MODE (y_inner);
> > +}
> 
> The later code is only interested in SUBREG_REGs that are the same size as
> "mode", so I think it would make sense to check that in the "if"s above
> instead of checking SUBREG_BYTE.  (SUBREG_BYTE is always zero when the
> modes are the same size, but the reverse is not true.)
> 
> It might also be better to avoid [xy]_inner_mode and just use GET_MODE
> where necessary.
> 
> It would be good to have a block comment above the code to explain what
> we're doing.

Good suggestion. Done.

> > +  if (x_inner != NULL_RTX
> > +  && y_inner != NULL_RTX
> > +  && x_inner_mode == y_inner_mode
> > +  && known_eq (GET_MODE_SIZE (x_inner_mode), GET_MODE_SIZE
> (mode))
> > +  && ! targetm.can_change_mode_class (x_inner_mode, mode,
> ALL_REGS))
> > +{
> > +  x = x_inner;
> > +  y = y_inner;
> > +}
> > +  else if (x_inner != NULL_RTX && CONSTANT_P (y)
> 
> Formatting nit: one subcondition per line when the condition spans multiple
> lines.

OK.

> > +  && known_eq (GET_MODE_SIZE (x_inner_mode),
> GET_MODE_SIZE (mode))
> > +  && ! targetm.can_change_mode_class (x_inner_mode, mode,
> ALL_REGS)
> > +  && targetm.legitimate_constant_p (x_inner_mode, y))
> 
> This call isn't valid, since the mode has to match the rtx.  ("y" still has 
> mode
> "mode" at this point.)  I think instead we should just do:
> 
>  && (y_inner = simplify_subreg (GET_MODE (x_inner), y, mode, 0))
> 
> to convert the constant, and use it if the result is nonnull.
> The existing CONSTANT_P emit_move_insn code will handle cases in which
> the new constant isn't legitimate.

Good catch. Done.

> > +
> > +{
> > +  x = x_inner;
> > +}
> > +  else if (x_inner != NULL_RTX && MEM_P (y)
> > +  && known_eq (GET_MODE_SIZE (x_inner_mode),
> GET_MODE_SIZE (mode))
> > +  && ! targetm.can_change_mode_class (x_inner_mode, mode,
> ALL_REGS)
> > +  && (! targetm.slow_unaligned_access (x_inner_mode,
> MEM_ALIGN (y))
> > +  || MEM_ALIGN (y) >= GET_MODE_ALIGNMENT
> (x_inner_mode)))
> 
> What is the last condition protecting against?  Seems worth a comment.

Comment added.  Here I intend to avoid generating a slow unaligned memory 
access.
Machine modes like VNx2HImode may have a smaller alignment than modes like 
V4HImode.
For the given test case, SLP forces the alignment of the memory access of mode 
VNx2HImode to be 32 bytes.
In theory, we may have other cases where the alignment of the innermode is 
bigger than that of the outermode.

Attached please find the v3 patch.  Bootstrapped and tested on 
aarch64-linux-gnu.
Does it look better?

gcc/ChangeLog:
+2020-05-27  Felix Yang  
+   Richard Sandiford  
+
+   PR target/95254
+   * expr.c (emit_move_insn): If we have a copy of which src and/or dest
+   is a subreg, try to remove the subreg when innermode and outermode are
+   equal in size and targetm.can_change_mode_class (outermode, innermode,
+   ALL_REGS) returns false.

testsuite/ChangeLog:
+2020-05-27  Felix Yang  
+   Richard Sandiford  
+
+   PR target/95254
+   * gcc.target/aarch64/pr95254.c: New test.

Thanks,
Felix




pr95254-v3.diff
Description: pr95254-v3.diff


RE: [PATCH PR94026] combine missed opportunity to simplify comparisons with zero

2020-05-26 Thread Yangfei (Felix)
Hi,

> -Original Message-
> From: Segher Boessenkool [mailto:seg...@kernel.crashing.org]
> Sent: Tuesday, May 26, 2020 11:32 PM
> To: Yangfei (Felix) 
> Cc: gcc-patches@gcc.gnu.org; Zhanghaijian (A) 
> Subject: Re: [PATCH PR94026] combine missed opportunity to simplify
> comparisons with zero

Snip...

> 
> Yes, please try to get this sorted somehow.  Maybe you can ask other people
> in your company that have this same problem?

Will try and see.

> > > > + new_rtx = gen_rtx_AND (mode, new_rtx,
> > > > +gen_int_mode (mask << real_pos, mode));
> > > > +   }
> > >
> > > So this changes
> > >   ((X >> C) & M) == ...
> > > to
> > >   (X & (M << C)) == ...
> > > ?
> > >
> > > Where then does it check what ... is?  This is only valid like this if 
> > > that is
> zero.
> > >
> > > Why should this go in combine and not in simplify-rtx instead?
> >
> > True.  This is only valid when ... is zero.
> > That's why we need the "&& equality_comparison " condition here.
> 
> But that doesn't test if the other side of the comparison is 0.

Well, the caller has ensured that.

Here, local variable "equality_comparison" in make_compound_operation_int 
depends on parameter "in_code":
 8088   if (in_code == EQ)
 8089 {
 8090   equality_comparison = true;
 8091   in_code = COMPARE;
 8092 }

The only caller of make_compound_operation_int is make_compound_operation.
The comment of the caller says something about " in_code ":

 8512   IN_CODE says what kind of expression we are processing.  Normally, it is
 8513   SET.  In a memory address it is MEM.  When processing the arguments of
 8514   a comparison or a COMPARE against zero, it is COMPARE, or EQ if more
 8515   precisely it is an equality comparison against zero.  */

For the given test case, we have a call trace of:
(gdb) bt
#0  make_compound_operation_int (mode=..., x_ptr=0xbd08, 
in_code=COMPARE, next_code_ptr=0xbd1c) at 
../../gcc-git/gcc/combine.c:8248
#1  0x0208983c in make_compound_operation (x=0xb211c768, 
in_code=EQ) at ../../gcc-git/gcc/combine.c:8539
#2  0x020970fc in simplify_comparison (code=NE, pop0=0xc1e8, 
pop1=0xc1e0) at ../../gcc-git/gcc/combine.c:13032
#3  0x02084544 in simplify_set (x=0xb211c240) at 
../../gcc-git/gcc/combine.c:6932
#4  0x02082688 in combine_simplify_rtx (x=0xb211c240, 
op0_mode=E_VOIDmode, in_dest=0, in_cond=0) at ../../gcc-git/gcc/combine.c:6445
#5  0x0208025c in subst (x=0xb211c240, from=0xb211c138, 
to=0xb211c150, in_dest=0, in_cond=0, unique_copy=0)
at ../../gcc-git/gcc/combine.c:5724
#6  0x02079110 in try_combine (i3=0xb22cc3c0, i2=0xb22cc340, 
i1=0x0, i0=0x0, new_direct_jump_p=0xceb4,
last_combined_insn=0xb22cc3c0) at ../../gcc-git/gcc/combine.c:3413
#7  0x02073004 in combine_instructions (f=0xb211d038, nregs=103) at 
../../gcc-git/gcc/combine.c:1305
#8  0x0209cc50 in rest_of_handle_combine () at 
../../gcc-git/gcc/combine.c:15088

In simplify_comparison (combine.c:13032):

13028   rtx_code op0_mco_code = SET;
13029   if (op1 == const0_rtx)
13030 op0_mco_code = code == NE || code == EQ ? EQ : COMPARE;
13031
13032   op0 = make_compound_operation (op0, op0_mco_code);
13033   op1 = make_compound_operation (op1, SET);
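
To make the equality restriction concrete, a hand-worked illustration (not 
part of the patch):

  /* With C = 8 and M = 7:
       ((x >> 8) & 7) == 0  is equivalent to  (x & (7 << 8)) == 0,
     since both test whether bits 8..10 of x are all zero.
     The rewrite is not valid for an ordered comparison: for x = 0x100,
     (x >> 8) & 7 is 1 (so ">= 2" is false), while x & 0x700 is 0x100
     (so ">= 2" would be true).  */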


> > > > --- /dev/null
> > > > +++ b/gcc/testsuite/gcc.dg/pr94026.c
> > > > @@ -0,0 +1,21 @@
> > > > +/* { dg-do compile { target aarch64*-*-* i?86-*-* x86_64-*-* } }
> > > > +*/
> > >
> > > Why restrict this to only some targets?
> >
> > That's because I only have these targets for verification.
> > But I think this can work on other targets.  Removed from the v4 patch.
> > Could you please help check the other ports?
> 
> In general, you should never restrict anything to some targets simply
> because you haven't tested it on other targets.
> 
> If it is a good test it will just work on those other targets.  Traffic on 
> gcc-
> testresults@ will show you if it actually does.

OK.  Thanks for pointing this out  :-)

> > > > +/* { dg-options "-O2 -fdump-rtl-combine" } */
> > > > +
> > > > +int
> > > > +foo (int c)
> > > > +{
> > > > +  int a = (c >> 8) & 7;
> > > > +
> > > > +  if (a >= 2) {
> > > > +return 1;
> > > > +  }
> > > > +
> > > > +  return 0;
> > > > +}
> > > > +
> > > > +/* The combine phas

RE: [PATCH PR94026] combine missed opportunity to simplify comparisons with zero

2020-05-25 Thread Yangfei (Felix)
Hi,

> -Original Message-
> From: Segher Boessenkool [mailto:seg...@kernel.crashing.org]
> Sent: Tuesday, May 26, 2020 12:27 AM
> To: Yangfei (Felix) 
> Cc: gcc-patches@gcc.gnu.org; Zhanghaijian (A) 
> Subject: Re: [PATCH PR94026] combine missed opportunity to simplify
> comparisons with zero

Snip...

> > I am using Outlook and I didn't find the place to change the MIME type
> > : - (
> 
> The simplest option is to use a different email client, one that plays nicely
> with others.  You use git, maybe you could even use git-send-email?

The bad news is that it would be hard to switch to a different email client 
with my company's IT policy  :-( 
But I think I can ask IT if that is possible. Sorry for the trouble.

> I'll paste things manually...
> 
> > From a19238c02c1e6ab9593a14a13e1e3dff90ed Mon Sep 17 00:00:00
> 2001
> > From: Fei Yang 
> > Date: Mon, 25 May 2020 10:19:30 +0800
> > Subject: [PATCH] combine: missed opportunity to simplify comparisons
> > with zero  [PR94026]
> 
> (Capital "M" on "Missed" please)
> 
> But, the subject should say what the patch *does*.  So maybe
>   combine: Simplify more comparisons with zero (PR94026)

OK. 

> > If we have (and (lshiftrt X C) M) and M is a constant that would
> > select a field of bits within an item, but not the entire word, fold
> > this into a simple AND if we are in an equality comparison against zero.
> 
> But that subject doesn't really describe what the patch does, anyway?

OK.  Modified in the v4 patch.  Does it look better?

> > gcc/
> > PR rtl-optimization/94026
> > * combine.c (make_compound_operation_int): If we have (and
> > (lshiftrt X C) M) and M is a constant that would select a field
> > of bits within an item, but not the entire word, fold this into
> > a simple AND if we are in an equality comparison.
> >
> > gcc/testsuite/
> > PR rtl-optimization/94026
> > * gcc.dg/pr94026.c: New test.
> 
> > --- a/gcc/ChangeLog
> > +++ b/gcc/ChangeLog
> > @@ -1,3 +1,11 @@
> > +2020-05-25  Felix Yang  
> > +
> > +   PR rtl-optimization/94026
> > +   * combine.c (make_compound_operation_int): If we have (and
> > +   (lshiftrt X C) M) and M is a constant that would select a field
> > +   of bits within an item, but not the entire word, fold this into
> > +   a simple AND if we are in an equality comparison.
> 
> Don't put the changelog in the patch.

OK.  I paste it here:

gcc/ChangeLog

+2020-05-26  Felix Yang  
+
+   PR rtl-optimization/94026
+   * combine.c (make_compound_operation_int): If we have (and
+   (lshiftrt X C) M) and M is a constant that would select a field
+   of bits within an item, but not the entire word, fold this into
+   a simple AND if we are in an equality comparison.

gcc/testsuite/ChangeLog

+2020-05-26  Felix Yang  
+
+   PR rtl-optimization/94026
+   * gcc.dg/pr94026.c: New test.

> > diff --git a/gcc/combine.c b/gcc/combine.c index
> > b044f29fd36..76d62b0bd17 100644
> > --- a/gcc/combine.c
> > +++ b/gcc/combine.c
> > @@ -8178,6 +8178,10 @@ make_compound_operation_int
> (scalar_int_mode mode, rtx *x_ptr,
> >if (!CONST_INT_P (XEXP (x, 1)))
> > break;
> >
> > +  HOST_WIDE_INT pos;
> > +  unsigned HOST_WIDE_INT len;
> > +  pos = get_pos_from_mask (UINTVAL (XEXP (x, 1)), &len);
> 
>   unsigned HOST_WIDE_INT len;
>   HOST_WIDE_INT pos = get_pos_from_mask (UINTVAL (XEXP (x, 1)), &len);
> 
> > @@ -8231,6 +8235,22 @@ make_compound_operation_int
> (scalar_int_mode mode, rtx *x_ptr,
> >   new_rtx = make_compound_operation (new_rtx, in_code);
> > }
> >
> > +  /* If we have (and (lshiftrt X C) M) and M is a constant that would
> select
> > +a field of bits within an item, but not the entire word, this might be
> > +representable by a simple AND if we are in an equality comparison.
> */
> > +  else if (pos > 0 && equality_comparison
> 
> That "&& equality_comparison" should be on a separate line as well.

OK.

> > +  && GET_CODE (XEXP (x, 0)) == LSHIFTRT
> > +  && CONST_INT_P (XEXP (XEXP (x, 0), 1))
> > +  && pos + UINTVAL (XEXP (XEXP (x, 0), 1))
> > + <= GET_MODE_BITSIZE (mode))
> > +   {
> > + new_rtx = make_compound_operation (XEXP (XEXP (x, 0), 0),
> next_code);
> > + HOST_WIDE_INT real_pos = pos + UINTVAL (XEXP (XEXP (x, 0), 1));
> > + unsigned HOST_WIDE_INT mask = ((unsigned HOST_WIDE_INT)1 << len) - 1;
> 
> Space after cast

RE: [PATCH PR94026] combine missed opportunity to simplify comparisons with zero

2020-05-24 Thread Yangfei (Felix)
Hi,

> -Original Message-
> From: Segher Boessenkool [mailto:seg...@kernel.crashing.org]
> Sent: Saturday, May 23, 2020 10:57 PM
> To: Yangfei (Felix) 
> Cc: gcc-patches@gcc.gnu.org; Zhanghaijian (A) 
> Subject: Re: [PATCH PR94026] combine missed opportunity to simplify
> comparisons with zero
> 
> Hi!
> 
> Sorry this is taking so long.
> 
> On Wed, May 06, 2020 at 08:57:52AM +, Yangfei (Felix) wrote:
> > > On Tue, Mar 24, 2020 at 06:30:12AM +, Yangfei (Felix) wrote:
> > > > I modified combine emitting a simple AND operation instead of
> > > > making one
> > > zero_extract for this scenario.
> > > > Attached please find the new patch.  Hope this solves both of our
> concerns.
> > >
> > > This looks promising.  I'll try it out, see what it does on other
> > > targets.  (It will have to wait for GCC 11 stage 1, of course).
> 
> It creates better code on all targets :-)  A quite small improvement, but not
> entirely trivial.

Thanks for the effort.  It's great to hear that :-)
Attached please find the v3 patch.  Rebased on the latest trunk. 
Bootstrapped and tested on aarch64-linux-gnu.  Could you please help install it?

> > > p.s.  Please use a correct mime type?  application/octet-stream
> > > isn't something I can reply to.  Just text/plain is fine :-)
> >
> > I am using plain text now, hope that works for you.  :-)
> 
> Nope:
> 
> [-- Attachment #2: pr94026-v2.diff --]
> [-- Type: application/octet-stream, Encoding: base64, Size: 5.9K --]

This time I switched to the UUEncode type for the attachment.  Does it work?
I am using Outlook and couldn't find where to change the MIME type : - (

Felix


pr94026-v3.diff
Description: pr94026-v3.diff


RE: [PATCH PR95254] aarch64: gcc generate inefficient code with fixed sve vector length

2020-05-22 Thread Yangfei (Felix)
Hi Richard,

Thanks for the suggestions.

> -Original Message-
> From: Richard Sandiford [mailto:richard.sandif...@arm.com]
> Sent: Thursday, May 21, 2020 5:22 PM
> To: Yangfei (Felix) 
> Cc: gcc-patches@gcc.gnu.org
> Subject: Re: [PATCH PR95254] aarch64: gcc generate inefficient code with
> fixed sve vector length
> 
> "Yangfei (Felix)"  writes:
> > Hi,
> >
> >   Notice a tiny SVE-related performance issue:
> > https://gcc.gnu.org/bugzilla/show_bug.cgi?id=95254
> >
> >   For the given test case, SLP succeeds with VNx8HImode with or without
> option -msve-vector-bits=256.
> >   The root cause for the difference is that we choose a different mode in
> aarch64_vectorize_related_mode under -msve-vector-bits=256:
> VNx2HImode instead of V4HImode.
> >   Then in the final tree ssa forwprop pass, we need to do a VIEW_CONVERT
> from V4HImode to VNx2HImode.
> >
> >   PATCH catch and simplify the pattern in
> aarch64_expand_sve_mem_move, emitting a mov pattern of V4HImode
> instead.
> >   I am assuming endianness does not make a difference here considering
> this simplification.
> >   Bootstrap and tested on aarch64-linux-gnu.  Comments?
> 
> I think we should try to catch this at the gimple level, possibly during SLP 
> itself.

Agreed.  It would be better if this could be handled in SLP. 
For the given test case, the difference only shows up after the final ssa 
forwprop pass. 
So I guess it might be hard for SLP to catch and evaluate in the vect cost 
model.

> Although the patch handles the case in which the V4HI is stored directly to
> memory, I assume it won't help if the code instead does:
> 
> for (i = 0; i < 4; i++)
>   b[i] = u.a[i] + 1;

SLP fails if the loop is modified like that.  

> targetm.can_change_mode_class (..., ALL_REGS) would be a good indicator
> of whether the needed VIEW_CONVERT_EXPR is cheap or expensive.
> 
> I agree it might still be worth handling this in the move patterns too.
> It feels like a target-independent optimisation though, and for example
> should also apply to V4HI moves involving subregs of VNx2HIs.
> 
> So I think it would be worth trying to do this in emit_move_insn.
> In principle it would be useful for:
> 
>   // M1 and M2 equal size, !targetm.can_change_mode_class (M1, M2,
> ALL_REGS)
> 
>   (set (subreg:M1 (reg:M2 ...)) (subreg:M1 (reg:M2 ...)))
>   (set (subreg:M1 (reg:M2 ...)) (mem:M1 ADDR))
>   (set (mem:M1 ADDR) (subreg:M1 (reg:M2 ...)))
>   (set (subreg:M1 (reg:M2 ...)) (constant C))
> 
> It would be nice if we could do this even without the
> can_change_mode_class condition, provided that it doesn't turn valid M1
> constants or MEMs into invalid M2 ones (or at least, M2 ones that need to be
> validated).
> Unfortunately, going that far is likely to interfere with target-specific 
> choices,
> so it's probably too dangerous.
> 
> With the can_change_mode_class condition it should be fine though, since
> it's avoiding an implicit round trip through memory.  The change should be a
> win even if the MEMs or constants need legitimising for M2.

Good suggestion.  I worked out a v2 patch with the logic moved into 
emit_move_insn.
For the (set (subreg:M1 (reg:M2 ...)) (constant C)) case, I simply require that C 
also be a legitimate constant for M2.
It works for the test case.  I will do a full test of the v2 patch if you like 
it.

> > [...]
> > +  if (inner != NULL_RTX
> > +  && aarch64_classify_vector_mode (inner_mode) == VEC_ADVSIMD
> > +  && GET_MODE_INNER (mode) == GET_MODE_INNER (inner_mode)
> > +  && known_eq (GET_MODE_SIZE (mode), GET_MODE_SIZE
> (inner_mode))
> > +  && GET_MODE_BITSIZE (inner_mode).to_constant () <= alignment)
> > +{
> > +  rtx addr, mem;
> > +  if (MEM_P (src))
> > +   {
> > + addr = XEXP (src, 0);
> > + mem = gen_rtx_MEM (inner_mode, addr);
> > + emit_move_insn (inner, mem);
> > +   }
> > +  else
> > +   {
> > + addr = XEXP (dest, 0);
> > + mem = gen_rtx_MEM (inner_mode, addr);
> > + emit_move_insn (mem, inner);
> 
> gen_rtx_MEM shouldn't be used to create new versions of existing MEMs,
> since it drops all the attributes.  It's better to use something like
> adjust_address instead.  That will also take care of making sure that the
> address is valid for the new mode.

It was my fault.  I should have realized that difference.  :-)

Felix


pr95254-v2.diff
Description: pr95254-v2.diff


[PATCH PR95254] aarch64: gcc generate inefficient code with fixed sve vector length

2020-05-21 Thread Yangfei (Felix)
Hi,

  I noticed a tiny SVE-related performance issue:  
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=95254 

  For the given test case, SLP succeeds with VNx8HImode with or without option 
-msve-vector-bits=256.
  The root cause for the difference is that we choose a different mode in 
aarch64_vectorize_related_mode under -msve-vector-bits=256: VNx2HImode instead 
of V4HImode.
  Then in the final tree ssa forwprop pass, we need to do a VIEW_CONVERT from 
V4HImode to VNx2HImode.

  The patch catches and simplifies the pattern in aarch64_expand_sve_mem_move, emitting 
a mov pattern of V4HImode instead.
  I am assuming endianness does not make a difference here considering this 
simplification.
  Bootstrapped and tested on aarch64-linux-gnu.  Comments?

Thanks,
Felix




pr95254-v1.diff
Description: pr95254-v1.diff


RE: [PATCH PR94991] aarch64: ICE: Segmentation fault with option -mgeneral-regs-only

2020-05-11 Thread Yangfei (Felix)
Hi,

> -Original Message-
> From: Richard Sandiford [mailto:richard.sandif...@arm.com]
> Sent: Monday, May 11, 2020 10:27 PM
> To: Yangfei (Felix) 
> Cc: gcc-patches@gcc.gnu.org
> Subject: Re: [PATCH PR94991] aarch64: ICE: Segmentation fault with option -
> mgeneral-regs-only
> 
> LGTM.  Pushed with one minor formatting fix:
> 
> > @@ -1364,7 +1364,11 @@
> >  if (!TARGET_FLOAT)
> >{
> > aarch64_err_no_fpadvsimd (mode);
> > -   FAIL;
> > +   machine_mode intmode
> > +   = int_mode_for_size (GET_MODE_BITSIZE (mode),
> 0).require ();
> 
> The "=" should only be indented by two spaces relative to the first line.

Thanks for fixing the bad formatting  :- )
I was expecting issues like that to be reported by contrib/check_GNU_style.sh.
I will be more careful.

Felix


[PATCH PR94991] aarch64: ICE: Segmentation fault with option -mgeneral-regs-only

2020-05-07 Thread Yangfei (Felix)
Hi,

I ran into another ICE with option -mgeneral-regs-only. 
  I have created a bug for that: 
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=94991 

  For the given testcase, we FAIL the scalar floating-point move expand 
pattern since TARGET_FLOAT
  is false with option -mgeneral-regs-only.  But a move expand pattern cannot 
fail.  It would be better to 
  replace the FAIL with code that bitcasts to the equivalent integer mode, 
using gen_lowpart.
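  The idea is roughly the following (a sketch only; the exact hunk is in the
attached patch and may differ in details):

    if (!TARGET_FLOAT)
      {
        aarch64_err_no_fpadvsimd (mode);
        machine_mode intmode
          = int_mode_for_size (GET_MODE_BITSIZE (mode), 0).require ();
        /* Do the move in the equivalent integer mode instead of FAILing.  */
        emit_move_insn (gen_lowpart (intmode, operands[0]),
                        gen_lowpart (intmode, operands[1]));
        DONE;
      }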

  Bootstrapped and tested on aarch64-linux-gnu.  Comments?

Thanks,
Felix


pr94991-v1.diff
Description: pr94991-v1.diff


RE: [PATCH PR94026] combine missed opportunity to simplify comparisons with zero

2020-05-06 Thread Yangfei (Felix)
Hi,

> -Original Message-
> From: Segher Boessenkool [mailto:seg...@kernel.crashing.org]
> Sent: Tuesday, March 24, 2020 10:58 PM
> To: Yangfei (Felix) 
> Cc: gcc-patches@gcc.gnu.org; Zhanghaijian (A) 
> Subject: Re: [PATCH PR94026] combine missed opportunity to simplify
> comparisons with zero
> 
> On Tue, Mar 24, 2020 at 06:30:12AM +, Yangfei (Felix) wrote:
> > I modified combine emitting a simple AND operation instead of making one
> zero_extract for this scenario.
> > Attached please find the new patch.  Hope this solves both of our concerns.
> 
> This looks promising.  I'll try it out, see what it does on other targets.  
> (It will
> have to wait for GCC 11 stage 1, of course).

I see GCC 11 stage 1 is now open for commits.
I have rebased the patch on the latest repo.  Attached please find the v2 patch.
Bootstrapped and tested on x86-64-linux-gnu and aarch64-linux-gnu.
Is this good to go?
 
> 
> p.s.  Please use a correct mime type?  application/octet-stream isn't
> something I can reply to.  Just text/plain is fine :-)

I am using plain text now, hope that works for you.  :-)

Thanks,
Felix


pr94026-v2.diff
Description: pr94026-v2.diff


RE: [PATCH PR94784] ICE: in simplify_vector_constructor, at tree-ssa-forwprop.c:2482

2020-04-27 Thread Yangfei (Felix)
> -Original Message-
> From: Richard Sandiford [mailto:richard.sandif...@arm.com]
> Sent: Monday, April 27, 2020 6:10 PM
> To: Yangfei (Felix) 
> Cc: gcc-patches@gcc.gnu.org
> Subject: Re: [PATCH PR94784] ICE: in simplify_vector_constructor, at tree-
> ssa-forwprop.c:2482
> 
> > Good suggestion.  Modified accordingly.
> >
> >> LGTM otherwise, and sorry for the breakage.
> >
> > Does the v2 patch look better?
> >
> > Manually run the following three tests with runtest:
> > gcc.target/aarch64/sve/acle/general/pr94683.c
> > gcc.target/aarch64/sve/acle/general/pr94700.c
> > gcc.dg/pr94784.c
> 
> Thanks, pushed to master after testing on aarch64-linux-gnu.

My local bootstrap and regression tests for the v2 patch have just finished.  
The results look good too.  Thanks for the efforts : - )

Best regards,
Felix
 





RE: [PATCH PR94784] ICE: in simplify_vector_constructor, at tree-ssa-forwprop.c:2482

2020-04-27 Thread Yangfei (Felix)
> -Original Message-
> From: Richard Sandiford [mailto:richard.sandif...@arm.com]
> Sent: Monday, April 27, 2020 3:51 PM
> To: Yangfei (Felix) 
> Cc: gcc-patches@gcc.gnu.org
> Subject: Re: [PATCH PR94784] ICE: in simplify_vector_constructor, at tree-
> ssa-forwprop.c:2482
> 
> "Yangfei (Felix)"  writes:
> > Hi,
> >
> > PR tree-optimization/94269
> > * tree-ssa-math-opts.c (convert_plusminus_to_widen): Restrict
> > -   this
> > -   operation to single basic block.
> > +   this operation to single basic block.
> 
> Not wrong, but might as well leave the entry as-is.

OK.  I'll leave it there.

> >
> We don't need to fall back to constant comparisons here.
> We should assert for known_eq between the TYPE_VECTOR_SUBPARTS
> instead.
> 
> Same for the other assert.

Good suggestion.  Modified accordingly.

> LGTM otherwise, and sorry for the breakage.

Does the v2 patch look better?

Manually run the following three tests with runtest:
gcc.target/aarch64/sve/acle/general/pr94683.c
gcc.target/aarch64/sve/acle/general/pr94700.c
gcc.dg/pr94784.c

Thanks for your help,
Felix


pr94784-v2.diff
Description: pr94784-v2.diff


[PATCH PR94784] ICE: in simplify_vector_constructor, at tree-ssa-forwprop.c:2482

2020-04-27 Thread Yangfei (Felix)
Hi,

  I see one gcc_assert was introduced in:  
https://gcc.gnu.org/pipermail/gcc-patches/2020-April/544271.html
  This is causing an ICE for certain cases.  I have created a PR for this: 
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=94784
  I did some checking and it looks like everything works fine before the ICE.  In 
the testcase we have two vectors with the same ABI identity but with different 
TYPE_MODEs.
  The proposed patch flips the assert around so that it checks that the two 
vectors have equal TYPE_VECTOR_SUBPARTS and that converting the corresponding 
element types is a useless_type_conversion_p.
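  In other words, the check becomes something along these lines (a sketch with 
placeholder names vectype1/vectype2 for the two vector types involved, not 
necessarily the exact hunk in the attached patch):

    /* Same number of elements and compatible element types, even if the
       two vector types have different TYPE_MODEs.  */
    gcc_assert (known_eq (TYPE_VECTOR_SUBPARTS (vectype1),
                          TYPE_VECTOR_SUBPARTS (vectype2))
                && useless_type_conversion_p (TREE_TYPE (vectype1),
                                              TREE_TYPE (vectype2)));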

  Bootstrapped and tested on aarch64-linux-gnu.  OK?

Thanks,
Felix


pr94784-v1.diff
Description: pr94784-v1.diff


[PATCH] aarch64: add tests for CPP predefines under -mgeneral-regs-only

2020-04-23 Thread Yangfei (Felix)
Hi,

I noticed that gcc.target/aarch64/pragma_cpp_predefs_1.c performs testing 
for -mgeneral-regs-only.
This adds similar testing in the following two tests to make sure CPP 
predefine redefinitions via #pragma
work as expected when the -mgeneral-regs-only option is specified (see 
PR94678):
gcc.target/aarch64/pragma_cpp_predefs_2.c
gcc.target/aarch64/pragma_cpp_predefs_3.c

The two tests pass with the modification.  OK?

gcc/testsuite/
PR target/94678
* gcc.target/aarch64/pragma_cpp_predefs_2.c: Fix typos, pop_pragma ->
pop_options. Add tests for general-regs-only.
* gcc.target/aarch64/pragma_cpp_predefs_3.c: Add tests for
general-regs-only.

Thanks for your help,
Felix


add-sve-predef-tests-v1.diff
Description: add-sve-predef-tests-v1.diff


RE: [PATCH PR94678] aarch64: unexpected result with -mgeneral-regs-only and sve

2020-04-22 Thread Yangfei (Felix)
Hi,

> -Original Message-
> From: Richard Sandiford [mailto:richard.sandif...@arm.com]
> Sent: Wednesday, April 22, 2020 6:03 PM
> To: Yangfei (Felix) 
> Cc: gcc-patches@gcc.gnu.org
> Subject: Re: [PATCH PR94678] aarch64: unexpected result with -mgeneral-
> regs-only and sve
> 
> Mostly LGTM, just a couple of minor points:

Thanks for the very careful code review.  :-) 
I think the revised patch fixed these points.  
GCC builds OK and the newly added test still works. 
Please help push if it's good to go. 

Felix


0001-aarch64-unexpected-result-with-mgeneral-regs-only-an.patch
Description: 0001-aarch64-unexpected-result-with-mgeneral-regs-only-an.patch


RE: [PATCH PR94678] aarch64: unexpected result with -mgeneral-regs-only and sve

2020-04-22 Thread Yangfei (Felix)
> -Original Message-
> From: Richard Sandiford [mailto:richard.sandif...@arm.com]
> Sent: Tuesday, April 21, 2020 6:11 PM
> To: Yangfei (Felix) 
> Cc: gcc-patches@gcc.gnu.org
> Subject: Re: [PATCH PR94678] aarch64: unexpected result with -mgeneral-
> regs-only and sve
> > Should aarch64_sve::init_builtins ()/aarch64_sve::handle_arm_sve_h () be
> guarded by TARGET_SVE?
> >
> > Could you please confirm that?
> 
> Yeah, that's right.  As Jakub says, the SVE stuff is (deliberately) registered
> unconditionally because it's possible to switch SVE on and off later.  Also,
> protecting it with just TARGET_SVE would mean that we'd continue to
> register SVE2 functions even if SVE2 isn't currently enabled.

Well, I was thinking maybe we can call aarch64_sve::init_builtins () in 
aarch64_pragma_target_parse when SVE is switched on.
But I think I will leave this alone.

> So what matters is whether SVE is enabled at the point of use, not the point
> of the #include.  FWIW, arm_neon.h works the same way: the same
> functions are defined regardless of what the current prevailing architecture 
> is,
> and what matters is whether the necessary features are enabled when the
> functions are called.  (Inlining fails if they're not.) But because we're
> implementing arm_sve.h directly in the compiler, we don't need the
> overhead of a full target switch when defining the functions.
>
> And like you say, the second of the above tests makes sure that turning SVE
> on later does indeed work, while the first makes sure that we get an
> appropriate error if SVE is disabled at the point of use.

Thanks for the details, it helps :-)
I have modified accordingly and attached please find the adapted v2 patch.
Bootstrapped and tested on the aarch64 Linux platform.  Does it look better?

Note that we need to disable TARGET_GENERAL_REGS_ONLY before 
register_builtin_types.
Otherwise we got ICEs like:
: internal compiler error: in register_builtin_types, at 
config/aarch64/aarch64-sve-builtins.cc:3336
0x185bcfb register_builtin_types
../../gcc-git/gcc/config/aarch64/aarch64-sve-builtins.cc:3332
0x185c467 aarch64_sve::init_builtins()
../../gcc-git/gcc/config/aarch64/aarch64-sve-builtins.cc:3375
0x17c075b aarch64_init_builtins
../../gcc-git/gcc/config/aarch64/aarch64.c:13086

, where TYPE_MODE (vectype) is BLKmode.

The reason is targetm.vector_mode_supported_p (mode) and 
have_regs_of_mode[mode] are false when TARGET_GENERAL_REGS_ONLY is enabled.
(gdb) bt
#0  vector_type_mode (t=0xb7aa2f18) at ../../gcc-git/gcc/tree.c:13825
#1  0x01297ee0 in layout_type (type=0xb7aa2f18) at 
../../gcc-git/gcc/stor-layout.c:2400
#2  0x016e19d8 in make_vector_type (innertype=0xb7aa2a80, 
nunits=..., mode=E_VNx16BImode) at ../../gcc-git/gcc/tree.c:9984
#3  0x016e79b4 in build_truth_vector_type_for_mode (nunits=..., 
mask_mode=E_VNx16BImode) at ../../gcc-git/gcc/tree.c:10929
#4  0x0185bb00 in aarch64_sve::register_builtin_types () at 
../../gcc-git/gcc/config/aarch64/aarch64-sve-builtins.cc:3331
#5  0x0185c468 in aarch64_sve::init_builtins () at 
../../gcc-git/gcc/config/aarch64/aarch64-sve-builtins.cc:3375
#6  0x017c075c in aarch64_init_builtins () at 
../../gcc-git/gcc/config/aarch64/aarch64.c:13086
#7  0x00a69cb4 in c_define_builtins 
(va_list_ref_type_node=0xb79ea540, va_list_arg_type_node=0xb79e79d8)
at ../../gcc-git/gcc/c-family/c-common.c:3973

Felix


pr94678-v2.patch
Description: pr94678-v2.patch


RE: [PATCH PR94678] aarch64: unexpected result with -mgeneral-regs-only and sve

2020-04-21 Thread Yangfei (Felix)
> -Original Message-
> From: Richard Sandiford [mailto:richard.sandif...@arm.com]
> Sent: Tuesday, April 21, 2020 4:01 PM
> To: Yangfei (Felix) 
> Cc: gcc-patches@gcc.gnu.org
> Subject: Re: [PATCH PR94678] aarch64: unexpected result with -mgeneral-
> regs-only and sve
> 
> "Yangfei (Felix)"  writes:
> > Hi,
> >
> >   It looks like there are several issues out there for sve codegen with -
> mgeneral-regs-only.
> >   I have created a bug for that:
> > https://gcc.gnu.org/bugzilla/show_bug.cgi?id=94678
> >
> >   We do ISA extension checks for SVE in
> check_required_extensions(aarch64-sve-builtins.cc).
> >   I think we may also need to check -mgeneral-regs-only there and issue an
> error message when this option is specified.
> >   This would be cheaper as compared with adding &&
> TARGET_GENERAL_REGS_ONLY to TARGET_SVE and similar macros.
> 
> We should probably do both.
> 
> The middle end should never try to use vector patterns when the vector
> modes have been disabled by !have_regs_of_mode.  But it's still wrong for
> the target to provide patterns that would inevitably lead to spill failure 
> due to
> lack of registers.  So I think we should check !TARGET_GENERAL_REGS_ONLY
> in TARGET_SVE.

Yes, that's right.  And I have a question here:
Should aarch64_sve::init_builtins ()/aarch64_sve::handle_arm_sve_h () be 
guarded by TARGET_SVE?

I mean in aarch64_init_builtins:
if (TARGET_SVE)
  aarch64_sve::init_builtins ();

and in aarch64_pragma_aarch64:
if (TARGET_SVE)
  aarch64_sve::handle_arm_sve_h ();

It looks to me that this is not wanted, judging from the following two tests: 
./gcc.target/aarch64/sve/acle/general/nosve_1.c
./gcc.target/aarch64/sve/acle/general/nosve_2.c

Could you please confirm that?  

> I guess the main danger is for instructions like ADDVL, ADDPL and CNT[BHWD]
> that do actually operate on general registers.  Perhaps there'll be weird
> corner cases in which the compiler wants to know the VL even for -mgeneral-
> regs.  I can't think of one off-hand though.
> 
> If that becomes a problem, we can introduce a second macro to control the
> general register operations.  But I think we can safely assume for now that
> one macro is enough.

OK, let's see.


> >   Attached please find the proposed patch.  Bootstrap and tested on
> aarch64 Linux platform.
> >   Suggestions?
> >
> > Thanks,
> > Felix
> >
> > diff --git a/gcc/ChangeLog b/gcc/ChangeLog index 721928d..c109d7b
> > 100644
> > --- a/gcc/ChangeLog
> > +++ b/gcc/ChangeLog
> > @@ -1,3 +1,11 @@
> > +2020-04-21  Felix Yang  
> > +
> > +   PR target/94678
> > +   * config/aarch64/aarch64-sve-builtins.cc
> (check_required_extensions):
> > +   Add check for TARGET_GENERAL_REGS_ONLY.
> > +   (report_missing_extension): Print different error message under
> > +   TARGET_GENERAL_REGS_ONLY.
> > +
> >  2020-04-20  Andreas Krebbel  
> >
> > * config/s390/vector.md ("popcountv8hi2_vx", "popcountv4si2_vx")
> > diff --git a/gcc/config/aarch64/aarch64-sve-builtins.cc
> > b/gcc/config/aarch64/aarch64-sve-builtins.cc
> > index ca4a0eb..4a77db7 100644
> > --- a/gcc/config/aarch64/aarch64-sve-builtins.cc
> > +++ b/gcc/config/aarch64/aarch64-sve-builtins.cc
> > @@ -649,11 +649,21 @@ report_missing_extension (location_t location,
> tree fndecl,
> >if (reported_missing_extension_p)
> >  return;
> >
> > -  error_at (location, "ACLE function %qD requires ISA extension %qs",
> > -   fndecl, extension);
> > -  inform (location, "you can enable %qs using the command-line"
> > - " option %<-march%>, or by using the %"
> > - " attribute or pragma", extension);
> > +  if (TARGET_GENERAL_REGS_ONLY)
> > +{
> > +  error_at (location, "ACLE function %qD requires ISA extension %qs"
> > +   " which is incompatible with the use of %qs",
> > +   fndecl, extension, "-mgeneral-regs-only");
> > +}
> > +  else
> > +{
> > +  error_at (location, "ACLE function %qD requires ISA extension %qs",
> > +   fndecl, extension);
> > +  inform (location, "you can enable %qs using the command-line"
> > + " option %<-march%>, or by using the %"
> > + " attribute or pragma", extension);
> > +}
> > +
> >reported_missing_extension_p = true;  }
> >
> > @@ -666,7 +676,14 @@ check_required_extensions (location_t location,
> > tree fndecl,  {
> >uint64_t m

[PATCH PR94678] aarch64: unexpected result with -mgeneral-regs-only and sve

2020-04-21 Thread Yangfei (Felix)
Hi,

  It looks like there are several issues out there for sve codegen with 
-mgeneral-regs-only. 
  I have created a bug for that: 
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=94678 

  We do ISA extension checks for SVE in 
check_required_extensions(aarch64-sve-builtins.cc). 
  I think we may also need to check -mgeneral-regs-only there and issue an 
error message when this option is specified. 
  This would be cheaper as compared with adding && TARGET_GENERAL_REGS_ONLY to 
TARGET_SVE and similar macros. 
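  For reference, the kind of usage this is about looks roughly like the 
following (illustrative only; the PR has the actual testcases).  With options 
along the lines of -march=armv8.2-a+sve -mgeneral-regs-only, the intent is 
that the ACLE call below gets a clear error pointing at -mgeneral-regs-only 
instead of the current misbehaviour:

    #include <arm_sve.h>

    svint32_t
    add_one (svbool_t pg, svint32_t x)
    {
      return svadd_n_s32_x (pg, x, 1);  /* needs SVE registers */
    }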
  Attached please find the proposed patch.  Bootstrapped and tested on the aarch64 
Linux platform.  
  Suggestions?

Thanks,
Felix


pr94678-v1.patch
Description: pr94678-v1.patch


RE: [PATCH] ICE: in vectorizable_load, at tree-vect-stmts.c:9173

2020-03-31 Thread Yangfei (Felix)
Hi!

> -Original Message-
> From: Richard Sandiford [mailto:richard.sandif...@arm.com]
> Sent: Tuesday, March 31, 2020 4:55 PM
> To: Yangfei (Felix) 
> Cc: gcc-patches@gcc.gnu.org; rguent...@suse.de
> Subject: Re: [PATCH] ICE: in vectorizable_load, at tree-vect-stmts.c:9173
> >
> > Yes, I have modified accordingly.  Attached please find the adapted patch.
> > Bootstrapped and tested on aarch64-linux-gnu.  Newly add test will fail
> without the patch and pass otherwise.
> 
> Looks good.  OK for master.

Thanks for reviewing this.  

> > I think I need a sponsor if this patch can go separately.
> 
> Yeah, please fill in the form on:
> 
>https://sourceware.org/cgi-bin/pdw/ps_form.cgi
> 
> listing me as sponsor.

Hmm, I already have an account : - )  
But my network connection is not working well and I am having some trouble 
committing the patch.  
Could you please help commit this patch?  

Felix


RE: [PATCH] ICE: in vectorizable_load, at tree-vect-stmts.c:9173

2020-03-31 Thread Yangfei (Felix)
Hi!

> -Original Message-
> From: Richard Sandiford [mailto:richard.sandif...@arm.com]
> Sent: Monday, March 30, 2020 8:08 PM
> To: Yangfei (Felix) 
> Cc: gcc-patches@gcc.gnu.org; rguent...@suse.de
> Subject: Re: [PATCH] ICE: in vectorizable_load, at tree-vect-stmts.c:9173
> 
> "Yangfei (Felix)"  writes:
> > Hi!
> >> -----Original Message-
> >> From: Yangfei (Felix)
> >> Sent: Monday, March 30, 2020 5:28 PM
> >> To: gcc-patches@gcc.gnu.org
> >> Cc: 'rguent...@suse.de' 
> >> Subject: [PATCH] ICE: in vectorizable_load, at tree-vect-stmts.c:9173
> >>
> >> Hi,
> >>
> >> New bug: https://gcc.gnu.org/bugzilla/show_bug.cgi?id=94398
> >>
> >> With -mstrict-align, aarch64_builtin_support_vector_misalignment will
> >> returns false when misalignment factor is unknown at compile time.
> >> Then vect_supportable_dr_alignment returns dr_unaligned_unsupported,
> >> which triggers the ICE.  I have pasted the call trace on the bug report.
> >>
> >> vect_supportable_dr_alignment is expected to return either dr_aligned
> >> or dr_unaligned_supported for masked operations.
> >> But it seems that this function only catches internal_fn
> >> IFN_MASK_LOAD & IFN_MASK_STORE.
> >> We are emitting a mask gather load here for this test case.
> >> As backends have their own vector misalignment support policy, I am
> >> supposing this should be better handled in the auto-vect shared code.
> >>
> >
> > I can only reply to comment on the bug here as my account got locked by the
> bugzilla system for now.
> 
> Sorry to hear that.  What was the reason?

Looks like it got filtered by SpamAssassin.  The admin has helped unlock it.  

> > The way Richard (rsandifo) suggests also works for me and I agree it should
> be more consistent and better for compile time.
> > One question here: when will a IFN_MASK_LOAD/IFN_MASK_STORE be
> passed to vect_supportable_dr_alignment?
> 
> I'd expect that to happen in the same cases as for unmasked load/stores.
> It certainly will for unconditional loads and stores that get masked via 
> full-loop
> masking.
> 
> In principle, the same rules apply to both masked and unmasked accesses.
> But for SVE, both masked and unmasked accesses should support misalignment,
> which is why I think the current target hook is probably wrong for SVE +
> -mstrict-align.

I stopped looking into the backend further when I saw no distinction between 
different types of access
in the target hook aarch64_builtin_support_vector_misalignment. 

> > @@ -8051,8 +8051,15 @@ vectorizable_store (stmt_vec_info stmt_info,
> gimple_stmt_iterator *gsi,
>auto_vec<tree> dr_chain (group_size);
> >oprnds.create (group_size);
> >
> > -  alignment_support_scheme
> > -= vect_supportable_dr_alignment (first_dr_info, false);
> > +  /* Strided accesses perform only component accesses, alignment
> > + is irrelevant for them.  */
> > +  if (STMT_VINFO_STRIDED_P (first_dr_info->stmt)
> > +  && !STMT_VINFO_GROUPED_ACCESS (first_dr_info->stmt))
> 
> I think this should be based on memory_access_type ==
> VMAT_GATHER_SCATTER instead.  At this point, STMT_VINFO_* describes
> properties of the original scalar access(es) while memory_access_type
> describes the vector implementation strategy.  It's the latter that matters
> here.
> 
> Same thing for loads.

Yes, I have modified accordingly.  Attached please find the adapted patch.  
Bootstrapped and tested on aarch64-linux-gnu.  The newly added test fails without 
the patch and passes with it.  
I think I need a sponsor if this patch can go separately.  

Thanks,
Felix


pr94398-v1.patch
Description: pr94398-v1.patch


RE: [PATCH] ICE: in vectorizable_load, at tree-vect-stmts.c:9173

2020-03-30 Thread Yangfei (Felix)
Hi!
> -Original Message-
> From: Yangfei (Felix)
> Sent: Monday, March 30, 2020 5:28 PM
> To: gcc-patches@gcc.gnu.org
> Cc: 'rguent...@suse.de' 
> Subject: [PATCH] ICE: in vectorizable_load, at tree-vect-stmts.c:9173
> 
> Hi,
> 
> New bug: https://gcc.gnu.org/bugzilla/show_bug.cgi?id=94398
> 
> With -mstrict-align, aarch64_builtin_support_vector_misalignment will returns
> false when misalignment factor is unknown at compile time.
> Then vect_supportable_dr_alignment returns dr_unaligned_unsupported,
> which triggers the ICE.  I have pasted the call trace on the bug report.
> 
> vect_supportable_dr_alignment is expected to return either dr_aligned or
> dr_unaligned_supported for masked operations.
> But it seems that this function only catches internal_fn IFN_MASK_LOAD &
> IFN_MASK_STORE.
> We are emitting a mask gather load here for this test case.
> As backends have their own vector misalignment support policy, I am supposing
> this should be better handled in the auto-vect shared code.
> 

I can only reply to comment on the bug here as my account got locked by the 
bugzilla system for now. 
The way Richard (rsandifo) suggests also works for me and I agree it should be 
more consistent and better for compile time. 
One question here: when will an IFN_MASK_LOAD/IFN_MASK_STORE be passed to 
vect_supportable_dr_alignment? 

New patch:
diff --git a/gcc/tree-vect-stmts.c b/gcc/tree-vect-stmts.c
index 12beef6..2825023 100644
--- a/gcc/tree-vect-stmts.c
+++ b/gcc/tree-vect-stmts.c
@@ -8051,8 +8051,15 @@ vectorizable_store (stmt_vec_info stmt_info, 
gimple_stmt_iterator *gsi,
   auto_vec<tree> dr_chain (group_size);
   oprnds.create (group_size);

-  alignment_support_scheme
-= vect_supportable_dr_alignment (first_dr_info, false);
+  /* Strided accesses perform only component accesses, alignment
+ is irrelevant for them.  */
+  if (STMT_VINFO_STRIDED_P (first_dr_info->stmt)
+  && !STMT_VINFO_GROUPED_ACCESS (first_dr_info->stmt))
+alignment_support_scheme = dr_unaligned_supported;
+  else
+alignment_support_scheme
+  = vect_supportable_dr_alignment (first_dr_info, false);
+
   gcc_assert (alignment_support_scheme);
   vec_loop_masks *loop_masks
 = (loop_vinfo && LOOP_VINFO_FULLY_MASKED_P (loop_vinfo)
@@ -9168,8 +9175,15 @@ vectorizable_load (stmt_vec_info stmt_info, 
gimple_stmt_iterator *gsi,
   ref_type = reference_alias_ptr_type (DR_REF (first_dr_info->dr));
 }

-  alignment_support_scheme
-= vect_supportable_dr_alignment (first_dr_info, false);
+  /* Strided accesses perform only component accesses, alignment
+ is irrelevant for them.  */
+  if (STMT_VINFO_STRIDED_P (first_dr_info->stmt)
+  && !STMT_VINFO_GROUPED_ACCESS (first_dr_info->stmt))
+alignment_support_scheme = dr_unaligned_supported;
+  else
+alignment_support_scheme
+  = vect_supportable_dr_alignment (first_dr_info, false);
+
   gcc_assert (alignment_support_scheme);
   vec_loop_masks *loop_masks
 = (loop_vinfo && LOOP_VINFO_FULLY_MASKED_P (loop_vinfo)


[PATCH] ICE: in vectorizable_load, at tree-vect-stmts.c:9173

2020-03-30 Thread Yangfei (Felix)
Hi,

New bug: https://gcc.gnu.org/bugzilla/show_bug.cgi?id=94398 

With -mstrict-align, aarch64_builtin_support_vector_misalignment returns 
false when the misalignment factor is unknown at compile time.
Then vect_supportable_dr_alignment returns dr_unaligned_unsupported, which 
triggers the ICE.  I have pasted the call trace on the bug report.

vect_supportable_dr_alignment is expected to return either dr_aligned or 
dr_unaligned_supported for masked operations.
But it seems that this function only catches internal_fn IFN_MASK_LOAD & 
IFN_MASK_STORE.
We are emitting a mask gather load here for this test case.
As backends have their own vector misalignment support policy, I suppose 
this would be better handled in the shared auto-vectorization code.
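For reference, the shape of loop that ends up as a masked gather load here is 
roughly the following (illustrative only, not the exact testcase from the PR; 
with options along the lines of -O3 -march=armv8.2-a+sve -mstrict-align):

    void
    f (double *restrict dst, double *restrict src, int *restrict idx, int n)
    {
      for (int i = 0; i < n; i++)
        dst[i] = src[idx[i]];   /* indirect load -> masked gather load */
    }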

Proposed fix:
diff --git a/gcc/tree-vect-data-refs.c b/gcc/tree-vect-data-refs.c
index 0192aa6..67d3345 100644
--- a/gcc/tree-vect-data-refs.c
+++ b/gcc/tree-vect-data-refs.c
@@ -6509,11 +6509,26 @@ vect_supportable_dr_alignment (dr_vec_info *dr_info,

   /* For now assume all conditional loads/stores support unaligned
  access without any special code.  */
-  if (gcall *stmt = dyn_cast <gcall *> (stmt_info->stmt))
-if (gimple_call_internal_p (stmt)
-   && (gimple_call_internal_fn (stmt) == IFN_MASK_LOAD
-   || gimple_call_internal_fn (stmt) == IFN_MASK_STORE))
-  return dr_unaligned_supported;
+  gcall *call = dyn_cast <gcall *> (stmt_info->stmt);
+  if (call && gimple_call_internal_p (call))
+{
+  internal_fn ifn = gimple_call_internal_fn (call);
+  switch (ifn)
+   {
+ case IFN_MASK_LOAD:
+ case IFN_MASK_LOAD_LANES:
+ case IFN_MASK_GATHER_LOAD:
+ case IFN_MASK_STORE:
+ case IFN_MASK_STORE_LANES:
+ case IFN_MASK_SCATTER_STORE:
+   return dr_unaligned_supported;
+ default:
+   break;
+   }
+}
+
+  if (loop_vinfo && LOOP_VINFO_FULLY_MASKED_P (loop_vinfo))
+return dr_unaligned_supported;

   if (loop_vinfo)
 {

Suggestions?

Thanks,
Felix


RE: [RFC] Should widening_mul should consider block frequency?

2020-03-26 Thread Yangfei (Felix)
> -Original Message-
> From: Richard Biener [mailto:richard.guent...@gmail.com]
> Sent: Thursday, March 26, 2020 3:37 PM
> To: Yangfei (Felix) 
> Cc: gcc-patches@gcc.gnu.org
> Subject: Re: [RFC] Should widening_mul should consider block frequency?
> 
> >
> > That's a good point.  I have attached the v2 patch.
> > Also did a spec2017 test on aarch64, no obvious impact witnessed with this.
> > Can you sponsor this patch please?  My networking does not work well
> > and I am having some trouble pushing it : - (
> 
> Pushed.  For the future can you please attach patches suitable for git am?

Sure, will do.  Thanks for the help : - )  

Felix


RE: [RFC] Should widening_mul should consider block frequency?

2020-03-25 Thread Yangfei (Felix)
Hi!

> -Original Message-
> From: Richard Biener [mailto:richard.guent...@gmail.com]
> Sent: Tuesday, March 24, 2020 10:14 PM
> To: Yangfei (Felix) 
> Cc: gcc-patches@gcc.gnu.org
> Subject: Re: [RFC] Should widening_mul should consider block frequency?
> 
> > > As written in the PR I'd follow fma generation and restrict defs to the 
> > > same
> BB.
> >
> > Thanks for the suggestion.  That should be more consistent.
> > Attached please find the adapted patch.
> > Bootstrap and tested on both x86_64 and aarch64 Linux platform.
> 
> OK with moving the BB check before the is_widening_mult_p call since it's way
> cheaper.

That's a good point.  I have attached the v2 patch.  
I also did a SPEC2017 test on aarch64; no obvious impact was observed with this.
Can you sponsor this patch please?  My network connection is not working well and I am 
having some trouble pushing it : - (  

git commit msg: 

widening_mul: restrict ops to be defined in the same basic-block when 
converting plusminus to widen

In the testcase for PR94269, widening_mul moves two multiply instructions 
from outside the loop to inside
the loop, merging with two add instructions separately.  This increases the 
cost of the loop.  Like FMA detection
in the same pass, simply restrict ops to be defined in the same basic-block 
to avoid possibly moving multiply
to a different block with a higher execution frequency.  

2020-03-26  Felix Yang  

gcc/
PR tree-optimization/94269
* tree-ssa-math-opts.c (convert_plusminus_to_widen): Restrict this
operation to single basic block.

gcc/testsuite/
PR tree-optimization/94269
* gcc.dg/pr94269.c: New test.

change log:

gcc:
+2020-03-26  Felix Yang  
+
+   PR tree-optimization/94269
+   * tree-ssa-math-opts.c (convert_plusminus_to_widen): Restrict this
+   operation to single basic block.

gcc/testsuite:
+2020-03-26  Felix Yang  
+
+   PR tree-optimization/94269
+   * gcc.dg/pr94269.c: New test.
+


pr94269-v2.patch
Description: pr94269-v2.patch


RE: [RFC] Should widening_mul should consider block frequency?

2020-03-24 Thread Yangfei (Felix)
Hi!

> -Original Message-
> From: Richard Biener [mailto:richard.guent...@gmail.com]
> Sent: Monday, March 23, 2020 11:25 PM
> To: Yangfei (Felix) 
> Cc: gcc-patches@gcc.gnu.org
> Subject: Re: [RFC] Should widening_mul should consider block frequency?
> 
> On Mon, Mar 23, 2020 at 10:53 AM Yangfei (Felix) 
> wrote:
> >
> > Hi,
> >
> >   I created a bug for this issue:
> https://gcc.gnu.org/bugzilla/show_bug.cgi?id=94269
> >   Looks like widening_mul phase may move multiply instruction from outside
> the loop to inside the loop, merging with one add instruction inside the loop.
> >   This will increase the cost of the loop at least on aarch64 (4 cycles vs 1
> cycle).  I think widening_mul should consider block frequency when doing such
> a combination.
> >   I mean something like:
> 
> As written in the PR I'd follow fma generation and restrict defs to the same 
> BB.

Thanks for the suggestion.  That should be more consistent.  
Attached please find the adapted patch.  
Bootstrapped and tested on both x86_64 and aarch64 Linux platforms.  

gcc:
+2020-03-24  Felix Yang  
+
+   PR tree-optimization/94269
+   * tree-ssa-math-opts.c (convert_plusminus_to_widen): Restrict this
+   operation to single basic block.

gcc/testsuite:
+2020-03-24  Felix Yang  
+
+   PR tree-optimization/94269
+   * gcc.dg/pr94269.c: New test.
+

Thanks,
Felix


pr94269-v1.patch
Description: pr94269-v1.patch


RE: [PATCH PR94026] combine missed opportunity to simplify comparisons with zero

2020-03-24 Thread Yangfei (Felix)
Hi!

> -Original Message-
> From: Segher Boessenkool [mailto:seg...@kernel.crashing.org]
> Sent: Monday, March 23, 2020 8:10 PM
> To: Yangfei (Felix) 
> Cc: gcc-patches@gcc.gnu.org; Zhanghaijian (A) 
> Subject: Re: [PATCH PR94026] combine missed opportunity to simplify
> comparisons with zero
> 
> Yeah, maybe not in simplify-rtx.c, hrm.  There is SELECT_CC_MODE for these
> things, and combine knows about that (not many passes do).

I modified combine to emit a simple AND operation instead of making a 
zero_extract for this scenario.  
Attached please find the new patch.  Hope this solves both of our concerns.  

gcc
+2020-03-24  Felix Yang   
+
+   PR rtl-optimization/94026
+   * combine.c (make_compound_operation_int): If we have (and
+   (lshiftrt X C) M) and M is a constant that would select a field
+   of bits within an item, but not the entire word, fold this into
+   a simple AND if we are in an equality comparison.

gcc/testsuite
+2020-03-24  Felix Yang   
+
+   PR rtl-optimization/94026
+   * gcc.dg/pr94026.c: New test.


Thanks,
Felix


pr94026-v1.patch
Description: pr94026-v1.patch


[RFC] Should widening_mul should consider block frequency?

2020-03-23 Thread Yangfei (Felix)
Hi,

  I created a bug for this issue: 
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=94269 
  Looks like the widening_mul phase may move a multiply instruction from outside the 
loop to inside the loop, merging it with an add instruction inside the loop.  
  This increases the cost of the loop, at least on aarch64 (4 cycles vs 1 
cycle).  I think widening_mul should consider block frequency when doing such a 
combination.  
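  For illustration, the shape of the problem is roughly the following 
(illustrative only, not the exact testcase from the PR):

    long
    f (int *a, int n, int x, int y)
    {
      long m = (long) x * y;   /* loop-invariant widening multiply */
      long s = 0;
      for (int i = 0; i < n; i++)
        s += m + a[i];         /* fused into a widening multiply-add inside the loop */
      return s;
    }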
  I mean something like:
diff --git a/gcc/tree-ssa-math-opts.c b/gcc/tree-ssa-math-opts.c
index 54ba035..4439452 100644
--- a/gcc/tree-ssa-math-opts.c
+++ b/gcc/tree-ssa-math-opts.c
@@ -2721,7 +2721,10 @@ convert_plusminus_to_widen (gimple_stmt_iterator *gsi, 
gimple *stmt,
 {
   if (!has_single_use (rhs1)
   || !is_widening_mult_p (rhs1_stmt, &type1, &mult_rhs1,
- &type2, &mult_rhs2))
+ &type2, &mult_rhs2)
+ || (gimple_bb (rhs1_stmt) != gimple_bb (stmt)
+ && gimple_bb (rhs1_stmt)->count.to_frequency(cfun)
+< gimple_bb (stmt)->count.to_frequency(cfun)))
return false;
   add_rhs = rhs2;
   conv_stmt = conv1_stmt;
@@ -2730,7 +2733,10 @@ convert_plusminus_to_widen (gimple_stmt_iterator *gsi, 
gimple *stmt,
 {
   if (!has_single_use (rhs2)
   || !is_widening_mult_p (rhs2_stmt, &type1, &mult_rhs1,
- &type2, &mult_rhs2))
+ &type2, &mult_rhs2)
+ || (gimple_bb (rhs2_stmt) != gimple_bb (stmt)
+ && gimple_bb (rhs2_stmt)->count.to_frequency(cfun)
+< gimple_bb (stmt)->count.to_frequency(cfun)))
return false;
   add_rhs = rhs1;
   conv_stmt = conv2_stmt;

  Comments?

Thanks,
Felix


RE: [PATCH PR94026] combine missed opportunity to simplify comparisons with zero

2020-03-23 Thread Yangfei (Felix)
> -Original Message-
> From: Segher Boessenkool [mailto:seg...@kernel.crashing.org]
> Sent: Friday, March 20, 2020 9:38 AM
> To: Yangfei (Felix) 
> Cc: gcc-patches@gcc.gnu.org; Zhanghaijian (A) 
> Subject: Re: [PATCH PR94026] combine missed opportunity to simplify
> comparisons with zero
> 
> On Thu, Mar 19, 2020 at 01:43:40AM +, Yangfei (Felix) wrote:
> > 2. Given that the patterns for ubfx and ubfiz are already not simple, I am
> afraid the pattern we got by combining the three would be much complex.
> >   And even more complex when further merged with insn 14 here in order to
> make sure that we are doing a equality comparison with zero.
> 
> It will be just as simple as with the other approach:

I think the problem here is how to make sure we are doing an ***equality*** 
comparison with zero.  
We can only do the transformation under this condition.  
Then I see combine tries the following pattern: 

173 Failed to match this instruction:
174 (set (reg:SI 101)
175 (ne:SI (and:SI (lshiftrt:SI (reg:SI 102)
176 (const_int 8 [0x8]))
177 (const_int 6 [0x6]))
178 (const_int 0 [0])))

But this cannot match a 'tst' instruction as the above pattern does not clobber 
the CC flag register.  
Also this means that we depend on specific use cases, so we may need different 
patterns to match all possible cases.  

> > > Another approach:
> > >
> > > Trying 7 -> 9:
> > > 7: r99:SI=r103:SI>>0x8
> > >   REG_DEAD r103:SI
> > > 9: cc:CC_NZ=cmp(r99:SI&0x6,0)
> > >   REG_DEAD r99:SI
> > > Failed to match this instruction:
> > > (set (reg:CC_NZ 66 cc)
> > > (compare:CC_NZ (and:SI (lshiftrt:SI (reg:SI 103)
> > > (const_int 8 [0x8]))
> > > (const_int 6 [0x6]))
> > > (const_int 0 [0])))
> > >
> > > This can be recognised as just that "tst" insn, no?  But combine (or
> > > simplify-rtx) should get rid of the shift here, just the "and" is
> > > simpler after all (it just needs to change the constant for that).
> >
> > No, this does not mean an equality comparison with zero.  I have mentioned
> this in my previous mail.
> 
> This should be simplified to
> (set (reg:CC_NZ 66 cc)
>  (compare:CC_NZ (and:SI (reg:SI 103)
> (const_int 1536))
> (const_int 0)))
> (but it isn't), and that is just *and3nr_compare0, which is a "tst"
> instruction.  If this is fixed (in simplify-rtx.c), it will work as you want.

But I don't think it's correct for the logic in simplify-rtx.c to further simplify 
this RTL:  

(compare:CC_NZ (and:SI (lshiftrt:SI (reg:SI 102)
(const_int 8 [0x8]))
(const_int 6 [0x6]))
(const_int 0 [0]))

The reason is that it knows nothing about CC_NZ.  
CC_NZ is aarch64-port specific and it does not necessarily mean an equality 
comparison with zero.  
Correct me if I missed anything.  

Thanks,
Felix


RE: [PATCH PR94026] combine missed opportunity to simplify comparisons with zero

2020-03-18 Thread Yangfei (Felix)
Hi,

> -Original Message-
> From: Segher Boessenkool [mailto:seg...@kernel.crashing.org]
> Sent: Thursday, March 19, 2020 7:52 AM
> To: Yangfei (Felix) 
> Cc: gcc-patches@gcc.gnu.org; Zhanghaijian (A) 
> Subject: Re: [PATCH PR94026] combine missed opportunity to simplify
> comparisons with zero
> 
> Hi!
> 
> On Tue, Mar 17, 2020 at 02:05:19AM +, Yangfei (Felix) wrote:
> > > Trying 7 -> 8:
> > > 7: r99:SI=r103:SI>>0x8
> > >   REG_DEAD r103:SI
> > > 8: r100:SI=r99:SI&0x6
> > >   REG_DEAD r99:SI
> > > Failed to match this instruction:
> > > (set (reg:SI 100)
> > > (and:SI (lshiftrt:SI (reg:SI 103)
> > > (const_int 8 [0x8]))
> > > (const_int 6 [0x6])))
> > >
> > > That should match already, perhaps with a splitter.  aarch64 does
> > > not have very generic rotate-and-mask (or -insert) instructions, so
> > > the
> > > aarch64 backend needs to help combine with the less trivial cases.
> > >
> > > If you have a splitter for *this* one, all else will probably work
> > > "automatically": you split it to two ubfm, and the second of those
> > > can then merge into the compare instruction, and everything works out.
> >
> > Do you mean splitting the above pattern into a combination of ubfx and 
> > ubfiz?
> (Both are aliases of ubfm).
> 
> Sure.  The problem with aarch's bitfield instruction is that either the 
> source or
> the dest has to be right-aligned, which isn't natural for the compiler.
> 
> > I still don't see how the benefit can be achieved.
> > The following is the expected assembly for the test case:
> > tst x0, 1536
> cset    w0, ne
> > ret
> > This may not happen when the remaining ubfx is there.  Also what
> instruction be matched when ubfiz is merged into the compare?
> > Anything I missed?
> 
> The second insn could combine with the compare, and then that can combine
> back further.

Let me paste the RTL input to the combine phase:
/
(insn 6 3 7 2 (set (reg:SI 98)
(ashiftrt:SI (reg:SI 102)
(const_int 8 [0x8]))) "foo.c":3:16 742 
{*aarch64_ashr_sisd_or_int_si3}
 (expr_list:REG_DEAD (reg:SI 102)
(nil)))
(note 7 6 8 2 NOTE_INSN_DELETED)
(insn 8 7 9 2 (set (reg:CC_NZ 66 cc)
(compare:CC_NZ (and:SI (reg:SI 98)
(const_int 6 [0x6]))
(const_int 0 [0]))) "foo.c":5:8 698 {*andsi3nr_compare0}
 (expr_list:REG_DEAD (reg:SI 98)
(nil)))
(note 9 8 14 2 NOTE_INSN_DELETED)
(insn 14 9 15 2 (set (reg/i:SI 0 x0)
(ne:SI (reg:CC_NZ 66 cc)
(const_int 0 [0]))) "foo.c":10:1 494 {aarch64_cstoresi}
 (expr_list:REG_DEAD (reg:CC 66 cc)
(nil)))
*/

Two issues that I can see here:
1. When the ubfiz is combined with the compare, the combined insn does not 
necessarily mean an equality comparison with zero.  
  This is also the case when all the three insns (ubfx & ubfiz & compare) are 
combined together.  

2. Given that the patterns for ubfx and ubfiz are already not simple, I am 
afraid the pattern we got by combining the three would be much more complex.
  And even more complex when further merged with insn 14 here in order to make 
sure that we are doing an equality comparison with zero.  

So it looks difficult when we go this port-specific way without matching a 
"zero_extact".  

> Another approach:
> 
> Trying 7 -> 9:
> 7: r99:SI=r103:SI>>0x8
>   REG_DEAD r103:SI
> 9: cc:CC_NZ=cmp(r99:SI&0x6,0)
>   REG_DEAD r99:SI
> Failed to match this instruction:
> (set (reg:CC_NZ 66 cc)
> (compare:CC_NZ (and:SI (lshiftrt:SI (reg:SI 103)
> (const_int 8 [0x8]))
> (const_int 6 [0x6]))
> (const_int 0 [0])))
> 
> This can be recognised as just that "tst" insn, no?  But combine (or
> simplify-rtx) should get rid of the shift here, just the "and" is simpler 
> after all (it
> just needs to change the constant for that).

No, this does not mean an equality comparison with zero.  I have mentioned this 
in my previous mail.  

Thanks,
Felix


RE: [PATCH PR94026] combine missed opportunity to simplify comparisons with zero

2020-03-16 Thread Yangfei (Felix)
Hi,

> -Original Message-
> From: Segher Boessenkool [mailto:seg...@kernel.crashing.org]
> Sent: Tuesday, March 17, 2020 1:58 AM
> To: Yangfei (Felix) 
> Cc: gcc-patches@gcc.gnu.org; Zhanghaijian (A) 
> Subject: Re: [PATCH PR94026] combine missed opportunity to simplify
> comparisons with zero
> 
> On Mon, Mar 16, 2020 at 06:29:39AM +, Yangfei (Felix) wrote:
> > Sorry for not getting your point here.
> 
> Trying 7 -> 8:
> 7: r99:SI=r103:SI>>0x8
>   REG_DEAD r103:SI
> 8: r100:SI=r99:SI&0x6
>   REG_DEAD r99:SI
> Failed to match this instruction:
> (set (reg:SI 100)
> (and:SI (lshiftrt:SI (reg:SI 103)
> (const_int 8 [0x8]))
> (const_int 6 [0x6])))
> 
> That should match already, perhaps with a splitter.  aarch64 does not have
> very generic rotate-and-mask (or -insert) instructions, so the
> aarch64 backend needs to help combine with the less trivial cases.
> 
> If you have a splitter for *this* one, all else will probably work
> "automatically": you split it to two ubfm, and the second of those can then
> merge into the compare instruction, and everything works out.

Do you mean splitting the above pattern into a combination of ubfx and ubfiz?  
(Both are aliases of ubfm).  
I still don't see how the benefit can be achieved.  
The following is the expected assembly for the test case:  
tst x0, 1536
cset    w0, ne
ret
This may not happen when the remaining ubfx is there.  Also what instruction would be 
matched when the ubfiz is merged into the compare?  
Anything I missed?  

> > Also, this issue is there for ports like x86.  If we go that way, then we 
> > need
> to handle each port affected.
> 
> Yes, you need to do target-specific stuff in every backend separately.
> 
> > So I am inclined to handle this in an arch-independent way.
> 
> But you don't.  The transformation you do looks to be actively harmful on
> many targets.  (I haven't tested it yet, that takes 8h currently).

Now I know your concern about zero_extract.  Maybe this should be mentioned in 
docs like gccint.  
Also it would be interesting to see how this may affect those archs.  

Thanks,
Felix


RE: [PATCH PR94026] combine missed opportunity to simplify comparisons with zero

2020-03-16 Thread Yangfei (Felix)
Hi,

> -Original Message-
> From: Segher Boessenkool [mailto:seg...@kernel.crashing.org]
> Sent: Saturday, March 14, 2020 12:07 AM
> To: Yangfei (Felix) 
> Cc: gcc-patches@gcc.gnu.org; Zhanghaijian (A) 
> Subject: Re: [PATCH PR94026] combine missed opportunity to simplify
> comparisons with zero
> 
> On Fri, Mar 13, 2020 at 03:21:18AM +, Yangfei (Felix) wrote:
> > > On Wed, Mar 04, 2020 at 08:39:36AM +, Yangfei (Felix) wrote:
> > > >   This is a simple fix for PR94026.
> > > >   With this fix, combine will try make an extraction if we are in
> > > > a equality
> > > comparison and this is an AND
> > > >   with a constant which is power of two minus one.  Shift here
> > > > should be an
> > > constant.  For example, combine
> > > >   will transform (compare (and (lshiftrt x 8) 6) 0) to (compare
> > > > (zero_extract
> > > (x 2 9)) 0).
> > >
> > > Why is that a good thing?
> >
> > The reported test case is reduced from spec2017 541.leela_r.  I have pasted
> original code snippet on the bugzilla.
> > We found other compilers like aocc/llvm can catch this pattern and simplify 
> > it.
> 
> That wasn't my question, let me rephrase: why would writing it as zero_extract
> (instead of as a more canonical form) be wanted?

Sorry for not getting your point here. 

> The aarch backend only has zero_extract formulations for most of the bitfield
> instructions.  If you fix that problem, all of this should go away?  Like, the
> testcase in the PR starts with
> 
> Trying 7 -> 8:
> 7: r99:SI=r103:SI>>r104:SI#0
>   REG_DEAD r104:SI
>   REG_DEAD r103:SI
> 8: r100:SI=r99:SI&0x6
>   REG_DEAD r99:SI
> Failed to match this instruction:
> (set (reg:SI 100)
> (and:SI (ashiftrt:SI (reg:SI 103)
> (subreg:QI (reg:SI 104) 0))
> (const_int 6 [0x6])))
> 
> and that should match already (that's an ubfm (ubfx))?

For aarch64, if we use "ubfm/ubfx" for the reduced test case, then we still 
need to do a compare with zero.  Then we won't get the benefit.  
For aarch64, we need to emit a "tst" instruction here.  So we need to catch 
something like:  

149 (set (reg:CC_NZ 66 cc)
150 (compare:CC_NZ (and:SI (lshiftrt:SI (reg:SI 102)
151 (const_int 8 [0x8]))
152 (const_int 6 [0x6]))
153 (const_int 0 [0])))

But this pattern is not accurate enough: we can only accept an equality comparison 
with zero here (as indicated by the checking of equality_comparison in my 
original patch).  
Also, this issue is there for ports like x86.  If we go that way, then we need 
to handle each port affected.  
So I am inclined to handle this in an arch-independent way.  
I looked into tree phases like fwprop & fold-const before, but didn't see an 
appropriate point to catch this opportunity.  
Then I came to the combine phase.  

> 
> > > (There should be thorough tests on many archs, showing it helps on
> > > average, and it doesn't regress anything.  I can do that for you, but not
> right now).
> >
> > I only have aarch64 & x86_64 linux available and have tested this patch with
> spec17 on both platforms.
> > No obvious improvement & regression witnessed.  This is expected as only
> one instruction is reduced here.
> 
> What should be tested is what new combinations are done, and which are *no
> longer* done.

In theory, we won't lose anything but will emit more zero_extracts with my patch.  

> > > In general, we should have *fewer* zero_extract, not more.
> 
> Some reasons for that:
> 
> 1) All those can be expressed with simpler operations as well;
> 2) Most very similar expressions cannot be expressed as zero_extract,
> although many architectures can handle (some of) those just fine;
> 3) The optimizers do not handle zero_extract very well at all (this includes
> simplify-rtx, to start with).
> 
> sign_extract is nastier -- we really want to have a sign_extend that works on
> separate bits, not as coarse as address units as we have now -- but it 
> currently
> isn't handled much either.

Thanks for explaining this.  I have to admit that I didn't realize this issue 
when I was creating my original patch.  


Felix


RE: [PATCH PR94026] combine missed opportunity to simplify comparisons with zero

2020-03-12 Thread Yangfei (Felix)
> -Original Message-
> From: Segher Boessenkool [mailto:seg...@kernel.crashing.org]
> Sent: Friday, March 13, 2020 7:50 AM
> To: Yangfei (Felix) 
> Cc: gcc-patches@gcc.gnu.org; Zhanghaijian (A) 
> Subject: Re: [PATCH PR94026] combine missed opportunity to simplify
> comparisons with zero
> 
> Hi!
> 
> Please Cc: me on combine patches; you sent it nine days ago, and I didn't see 
> it
> until now.

OK.  

> On Wed, Mar 04, 2020 at 08:39:36AM +, Yangfei (Felix) wrote:
> >   This is a simple fix for PR94026.
> >   With this fix, combine will try make an extraction if we are in a equality
> comparison and this is an AND
> >   with a constant which is power of two minus one.  Shift here should be an
> constant.  For example, combine
> >   will transform (compare (and (lshiftrt x 8) 6) 0) to (compare 
> > (zero_extract
> (x 2 9)) 0).
> 
> Why is that a good thing?

The reported test case is reduced from SPEC2017 541.leela_r.  I have pasted the 
original code snippet on the bugzilla.  
We found that other compilers like AOCC/LLVM can catch this pattern and simplify it. 
 

> (There should be thorough tests on many archs, showing it helps on average,
> and it doesn't regress anything.  I can do that for you, but not right now).

I only have aarch64 & x86_64 Linux available and have tested this patch with 
SPEC2017 on both platforms.  
No obvious improvement or regression was observed.  This is expected as only one 
instruction is reduced here.  
It's appreciated if this can be tested on other archs.  

> The code needs more comments, and the commit message should say what is
> done and why you made those choices.
> In general, we should have *fewer* zero_extract, not more.

OK.  I can add more comments and a commit message if we finally decide to go 
with this patch.  

Thanks,
Felix


RE: [PATCH PR94026] combine missed opportunity to simplify comparisons with zero

2020-03-05 Thread Yangfei (Felix)
> -Original Message-
> From: Jeff Law [mailto:l...@redhat.com]
> Sent: Thursday, March 5, 2020 11:37 PM
> To: Yangfei (Felix) ; gcc-patches@gcc.gnu.org
> Cc: Zhanghaijian (A) 
> Subject: Re: [PATCH PR94026] combine missed opportunity to simplify
> comparisons with zero
> 
> On Wed, 2020-03-04 at 08:39 +, Yangfei (Felix) wrote:
> > Hi,
> >
> >   This is a simple fix for PR94026.
> >   With this fix, combine will try make an extraction if we are in a
> > equality comparison and this is an AND
> >   with a constant which is power of two minus one.  Shift here should
> > be an constant.  For example, combine
> >   will transform (compare (and (lshiftrt x 8) 6) 0) to (compare
> > (zero_extract (x 2 9)) 0).
> >
> >   Added one test case for this.  Bootstrap and tested on both x86_64
> > and
> > aarch64 Linux platform.
> >   Any suggestion?
> >
> > Thanks,
> > Felix
> >
> > gcc:
> > +2020-03-04  Felix Yang  
> > +
> > +   PR rtl-optimization/94026
> > +   * combine.c (make_compound_operation_int): Make an extraction
> > + if we are in a equality comparison and this is an AND with a
> > + constant which is power of two minus one.
> > +
> >
> > gcc/testsuite:
> > +2020-03-04  Felix Yang  
> > +
> > +   PR rtl-optimization/94026
> > +   * gcc.dg/pr94026.c: New test.
> Just a note.  We're in stage4 of our development cycle, meaning we focus on
> regression bugfixes.  I've queued this for evaluation in gcc-11.
> jeff

Sure, this is intended for 11.  Thanks for doing that : - ) 

Best regards,
Felix


[PATCH PR94026] combine missed opportunity to simplify comparisons with zero

2020-03-04 Thread Yangfei (Felix)
Hi,

  This is a simple fix for PR94026.  
  With this fix, combine will try to make an extraction if we are in an equality 
comparison and this is an AND
  with a constant which is a power of two minus one.  The shift here should be a 
constant.  For example, combine
  will transform (compare (and (lshiftrt x 8) 6) 0) to (compare (zero_extract 
(x 2 9)) 0).  
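  A reduced example of the kind of code this targets (illustrative only; the 
attached gcc.dg/pr94026.c may differ) is:

    int
    f (unsigned int x)
    {
      /* Selects bits 9 and 10 of x, i.e. the same test as (x & 0x600) != 0,
         so a single tst/cset sequence is enough on aarch64.  */
      return ((x >> 8) & 6) != 0;
    }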

  Added one test case for this.  Bootstrapped and tested on both x86_64 and 
aarch64 Linux platforms.  
  Any suggestions?  

Thanks,
Felix

gcc:
+2020-03-04  Felix Yang  
+
+   PR rtl-optimization/94026
+   * combine.c (make_compound_operation_int): Make an extraction
+   if we are in an equality comparison and this is an AND with a
+   constant which is a power of two minus one.
+

gcc/testsuite:
+2020-03-04  Felix Yang  
+
+   PR rtl-optimization/94026
+   * gcc.dg/pr94026.c: New test.
+
diff --git a/gcc/combine.c b/gcc/combine.c
index 58366a6d331..c05064fc333 100644
--- a/gcc/combine.c
+++ b/gcc/combine.c
@@ -8170,14 +8170,31 @@ make_compound_operation_int (scalar_int_mode mode, rtx 
*x_ptr,
   if (!CONST_INT_P (XEXP (x, 1)))
break;
 
+  HOST_WIDE_INT pos;
+  unsigned HOST_WIDE_INT len;
+  pos = get_pos_from_mask (UINTVAL (XEXP (x, 1)), &len);
+
   /* If the constant is a power of two minus one and the first operand
-is a logical right shift, make an extraction.  */
+is a logical right shift, make an extraction.
+If we are in a equality comparison and this is an AND with a constant
+which is power of two minus one, also make an extraction.  */
   if (GET_CODE (XEXP (x, 0)) == LSHIFTRT
- && (i = exact_log2 (UINTVAL (XEXP (x, 1)) + 1)) >= 0)
+ && (pos == 0 || (pos > 0 && equality_comparison
+  && CONST_INT_P (XEXP (XEXP (x, 0), 1)
{
  new_rtx = make_compound_operation (XEXP (XEXP (x, 0), 0), next_code);
- new_rtx = make_extraction (mode, new_rtx, 0, XEXP (XEXP (x, 0), 1),
-i, 1, 0, in_code == COMPARE);
+ if (pos == 0)
+   {
+ new_rtx = make_extraction (mode, new_rtx, 0,
+XEXP (XEXP (x, 0), 1), len, 1, 0,
+in_code == COMPARE);
+   }
+ else
+   {
+ int real_pos = pos + UINTVAL (XEXP (XEXP (x, 0), 1));
+ new_rtx = make_extraction (mode, new_rtx, real_pos, NULL_RTX,
+len, 1, 0, in_code == COMPARE);
+   }
}
 
   /* Same as previous, but for (subreg (lshiftrt ...)) in first op.  */
@@ -8186,13 +8203,25 @@ make_compound_operation_int (scalar_int_mode mode, rtx 
*x_ptr,
   && is_a  (GET_MODE (SUBREG_REG (XEXP (x, 0))),
  _mode)
   && GET_CODE (SUBREG_REG (XEXP (x, 0))) == LSHIFTRT
-  && (i = exact_log2 (UINTVAL (XEXP (x, 1)) + 1)) >= 0)
+  && (pos == 0
+  || (pos > 0 && equality_comparison
+  && CONST_INT_P (XEXP (SUBREG_REG (XEXP (x, 0)), 1)
{
  rtx inner_x0 = SUBREG_REG (XEXP (x, 0));
  new_rtx = make_compound_operation (XEXP (inner_x0, 0), next_code);
- new_rtx = make_extraction (inner_mode, new_rtx, 0,
-XEXP (inner_x0, 1),
-i, 1, 0, in_code == COMPARE);
+ if (pos == 0)
+   {
+ new_rtx = make_extraction (inner_mode, new_rtx, 0,
+XEXP (inner_x0, 1),
+len, 1, 0, in_code == COMPARE);
+   }
+ else
+   {
+ int real_pos = pos + UINTVAL (XEXP (inner_x0, 1));
+ new_rtx = make_extraction (inner_mode, new_rtx, real_pos,
+NULL_RTX, len, 1, 0,
+in_code == COMPARE);
+   }
 
  /* If we narrowed the mode when dropping the subreg, then we lose.  */
  if (GET_MODE_SIZE (inner_mode) < GET_MODE_SIZE (mode))
@@ -8200,10 +8229,10 @@ make_compound_operation_int (scalar_int_mode mode, rtx 
*x_ptr,
 
  /* If that didn't give anything, see if the AND simplifies on
 its own.  */
- if (!new_rtx && i >= 0)
+ if (!new_rtx)
{
  new_rtx = make_compound_operation (XEXP (x, 0), next_code);
- new_rtx = make_extraction (mode, new_rtx, 0, NULL_RTX, i, 1,
+ new_rtx = make_extraction (mode, new_rtx, pos, NULL_RTX, len, 1,
 0, in_code == COMPARE);
}
}
@@ -8212,7 +8241,7 @@ make_compound_operation_int (scalar_int_mode mode, rtx 
*x_ptr,
|| GET_CODE (XEXP (x, 0)) == IOR)
   && GET_CODE (XEXP (XEXP (x, 0), 0)) == LSHIFTRT
   && GET_CODE (XEXP 

Re: [PATCH, AArch64] atomics: prefetch the destination for write prior to ldxr/stxr loops

2016-03-07 Thread Yangfei (Felix)
> On Mon, Mar 7, 2016 at 7:27 PM, Yangfei (Felix) <felix.y...@huawei.com> wrote:
> > Hi,
> >
> > As discussed in LKML:
> http://lists.infradead.org/pipermail/linux-arm-kernel/2015-July/355996.html, 
> the
> cost of changing a cache line
> > from shared to exclusive state can be significant on aarch64 cores,
> especially when this is triggered by an exclusive store, since it may
> > result in having to retry the transaction.
> > This patch makes use of the "prfm PSTL1STRM" instruction to prefetch
> cache lines for write prior to ldxr/stxr loops generated by the ll/sc atomic
> routines.
> > Bootstrapped on AArch64 server, is it OK?
> 
> 
> I don't think this is a good thing in general.  For an example on ThunderX, 
> the
> prefetch just adds a cycle for no benefit.  This really depends on the
> micro-architecture of the core and how LDXR/STXR are
> implemented.   So after this patch, it will slow down ThunderX.
> 
> Thanks,
> Andrew Pinski
> 

Hi Andrew,

   I am not quite familiar with the ThunderX micro-architecture.  But yes, I 
agree it depends on the micro-architecture of the core.  
   As the mentioned kernel patch has been merged upstream, I think the added 
prefetch instruction in the atomic routines is beneficial for most AArch64 
cores on the market.  
   If it does not help ThunderX, then how about adding a check here?  I mean 
disabling the generation of the prfm if we are tuning for ThunderX.  
   
Thanks,
Felix


[PATCH, AArch64] atomics: prefetch the destination for write prior to ldxr/stxr loops

2016-03-07 Thread Yangfei (Felix)
Hi,

As discussed in LKML: 
http://lists.infradead.org/pipermail/linux-arm-kernel/2015-July/355996.html, 
the cost of changing a cache line
from shared to exclusive state can be significant on aarch64 cores, 
especially when this is triggered by an exclusive store, since it may
result in having to retry the transaction. 
This patch makes use of the "prfm PSTL1STRM" instruction to prefetch cache 
lines for write prior to ldxr/stxr loops generated by the ll/sc atomic 
routines. 
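
As an illustration (a sketch only, not part of the patch), a plain C11 atomic 
update of the kind whose expansion contains such an ldxr/stxr retry loop on a 
base ARMv8-A target without the LSE atomics:

#include <stdatomic.h>

/* GCC expands this into an ldxr/stxr retry loop on such a target; with
   the patch a "prfm pstl1strm" on the counter's cache line is emitted
   just before that loop, the idea being that the line is already in an
   exclusive state when the stxr executes.  Function name is illustrative.  */
void
bump (atomic_long *counter)
{
  atomic_fetch_add_explicit (counter, 1, memory_order_relaxed);
}
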
Bootstrapped on an AArch64 server.  Is it OK? 

Thanks,
Felix


aarch64-atomics-v0.diff
Description: aarch64-atomics-v0.diff


Re: [PATCH] Only accept BUILT_IN_NORMAL stringops for interesting_stringop_to_profile_p

2015-08-20 Thread Yangfei (Felix)
, stmt, 
HIST_TYPE_SINGLE_VALUE);
   if (!histogram)
 return false;
+
   val = histogram->hvalue.counters[0];
   count = histogram->hvalue.counters[1];
   all = histogram->hvalue.counters[2];
   gimple_remove_histogram_value (cfun, stmt, histogram);
+
   /* We require that count is at least half of all; this means
  that for the transformation to fire the value must be constant
  at least 80% of time.  */
@@ -1719,8 +1712,10 @@ gimple_stringops_transform (gimple_stmt_iterator *
 prob = GCOV_COMPUTE_SCALE (count, all);
   else
 prob = 0;
+
   dest = gimple_call_arg (stmt, 0);
   dest_align = get_pointer_alignment (dest);
+  fcode = DECL_FUNCTION_CODE (gimple_call_fndecl (stmt));
   switch (fcode)
 {
 case BUILT_IN_MEMCPY:
@@ -1811,6 +1806,7 @@ stringop_block_profile (gimple stmt, unsigned int
 

 /* Find values inside STMT for that we want to measure histograms for
division/modulo optimization.  */
+
 static void
 gimple_divmod_values_to_profile (gimple stmt, histogram_values *values)
 {
@@ -1891,21 +1887,18 @@ gimple_indirect_call_to_profile (gimple stmt, hist
 
 /* Find values inside STMT for that we want to measure histograms for
string operations.  */
+
 static void
 gimple_stringops_values_to_profile (gimple stmt, histogram_values *values)
 {
-  tree fndecl;
-  tree blck_size;
   tree dest;
+  tree blck_size;
   int size_arg;
 
-  if (gimple_code (stmt) != GIMPLE_CALL)
+  if (!gimple_call_builtin_p (stmt, BUILT_IN_NORMAL))
 return;
-  fndecl = gimple_call_fndecl (stmt);
-  if (!fndecl)
-return;
 
-  if (!interesting_stringop_to_profile_p (fndecl, stmt, &size_arg))
+  if (!interesting_stringop_to_profile_p (stmt, &size_arg))
 return;
 
   dest = gimple_call_arg (stmt, 0);
@@ -1919,6 +1912,7 @@ gimple_stringops_values_to_profile (gimple stmt, h
   values->safe_push (gimple_alloc_histogram_value (cfun, HIST_TYPE_AVERAGE,
   stmt, blck_size));
 }
+
   if (TREE_CODE (blck_size) != INTEGER_CST)
    values->safe_push (gimple_alloc_histogram_value (cfun, HIST_TYPE_IOR,
 stmt, dest));
Index: gcc/ChangeLog
===
--- gcc/ChangeLog   (revision 141081)
+++ gcc/ChangeLog   (working copy)
@@ -1,3 +1,12 @@
+2014-08-20  Felix Yang  felix.y...@huawei.com
+   Jiji Jiang  jiangj...@huawei.com
+
+   * value-prof.c (interesting_stringop_to_profile_p): Removed FNDECL 
argument
+   and get builtin function code directly from CALL.
+   (gimple_stringop_fixed_value): Modified accordingly.
+   (gimple_stringops_transform, gimple_stringops_values_to_profile): 
Modified
+   accordingly and only accept BUILT_IN_NORMAL string operations.
+
 2015-08-18  Segher Boessenkool  seg...@kernel.crashing.org
 
Backport from mainline:



 
 On Thu, Aug 20, 2015 at 5:17 AM, Yangfei (Felix) felix.y...@huawei.com
 wrote:
  Hi,
 
  As DECL_FUNCTION_CODE is overloaded for builtin functions in different
 classes, so need to check builtin class before using fcode.
  Patch posted below.  Bootstrapped on x86_64-suse-linux, OK for trunk?
  Thanks.
 
 Ugh.  The code in the callers already looks like it could have some TLC, like
 instead of
 
   fndecl = gimple_call_fndecl (stmt);
   if (!fndecl)
 return false;
   fcode = DECL_FUNCTION_CODE (fndecl);
   if (!interesting_stringop_to_profile_p (fndecl, stmt, size_arg))
 return false;
 
 simply do
 
   if (!gimple_call_builtin_p (stmt, BUILT_IN_NORMAL))
 return false;
   if (!interesting_stringop_to_profile_p (gimple_call_fndecl (stmt), ))
 
 similar for the other caller.  interesting_stringop_to_profile_p can also get
 function-code directly from stmt, removing the redundant first argument or 
 even
 do the gimple_call_builtin_p call itself.
 
 Mind reworking the patch accordingly?
 
 Thanks,
 Richard.



patch-v2.diff
Description: patch-v2.diff


[PATCH] Only accept BUILT_IN_NORMAL stringops for interesting_stringop_to_profile_p

2015-08-19 Thread Yangfei (Felix)
Hi,

As DECL_FUNCTION_CODE is overloaded for builtin functions in different 
classes, so need to check builtin class before using fcode. 
Patch posted below.  Bootstrapped on x86_64-suse-linux, OK for trunk? 
Thanks. 

Index: gcc/value-prof.c
===
--- gcc/value-prof.c(revision 141081)
+++ gcc/value-prof.c(working copy)
@@ -1547,8 +1547,12 @@ gimple_ic_transform (gimple_stmt_iterator *gsi)
 static bool
 interesting_stringop_to_profile_p (tree fndecl, gimple call, int *size_arg)
 {
-  enum built_in_function fcode = DECL_FUNCTION_CODE (fndecl);
+  enum built_in_function fcode;
 
+  if (DECL_BUILT_IN_CLASS (fndecl) != BUILT_IN_NORMAL)
+return false;
+
+  fcode = DECL_FUNCTION_CODE (fndecl);
  if (fcode != BUILT_IN_MEMCPY && fcode != BUILT_IN_MEMPCPY
      && fcode != BUILT_IN_MEMSET && fcode != BUILT_IN_BZERO)
 return false;
Index: gcc/ChangeLog
===
--- gcc/ChangeLog   (revision 141081)
+++ gcc/ChangeLog   (working copy)
@@ -1,3 +1,9 @@
+2015-08-20  Felix Yang  felix.y...@huawei.com
+   Jiji Jiang  jiangj...@huawei.com
+
+   * value-prof.c (interesting_stringop_to_profile_p): Only accept string
+   operations which belong to the BUILT_IN_NORMAL builtin class.
+
 2015-08-18  Segher Boessenkool  seg...@kernel.crashing.org
 
Backport from mainline:


[PING^2, AArch64] Add long-call attribute and pragma interfaces

2015-05-05 Thread Yangfei (Felix)
Patch ping ...


 
  On 04/02/2015 11:59 PM, Yangfei (Felix) wrote:
   Patch ping: https://gcc.gnu.org/ml/gcc-patches/2014-11/msg01148.html
 
  This patch needs documentation for the new attributes and pragmas
  before it can be committed.  (Since this is a new feature I think it
  has to wait until stage 1, too, but that's not my call.)
 
  -Sandra
 
 
 Sure, I will update the docs when this patch is approved by the port 
 maintainers.
 I didn't get any feedback from Richard for the v2 patch.  Thanks.


Re: [PING, AArch64] Add long-call attribute and pragma interfaces

2015-04-12 Thread Yangfei (Felix)
 On 04/02/2015 11:59 PM, Yangfei (Felix) wrote: 
  Patch ping: https://gcc.gnu.org/ml/gcc-patches/2014-11/msg01148.html
 
 This patch needs documentation for the new attributes and pragmas before it
 can be committed.  (Since this is a new feature I think it has to wait until 
 stage 1,
 too, but that's not my call.)
 
 -Sandra


Sure, I will update the docs when this patch is approved by the port 
maintainers.  
I didn't get any feedback from Richard for the v2 patch.  Thanks.  


Re: [RFC AArch64] Implement TARGET_PROMOTE_FUNCTION_MODE for ILP32 code generation

2015-04-07 Thread Yangfei (Felix)
Hi Andrew, 

Sorry for the late reply.  Seems that I misunderstood the AAPCS64 
specification.  
Thanks for the clarification.  


 
  On Mar 16, 2015, at 2:28 AM, Yangfei (Felix) felix.y...@huawei.com wrote:
 
  Hi,
 
   For this trivial testcase:
 
  extern int bar (int , int);
 
  int foo (int *a, int *b)
  {
 return bar (*a, *b);
  }
 
  I noticed that GCC generate redundant zero-extension instructions under 
  ILP32
 (aarch64-linux-gnu-gcc -S -O2 -mabi=ilp32).
  Assembly code:
 .arch armv8-a+fp+simd
 .file   1.c
 .text
 .align  2
 .p2align 3,,7
 .global foo
 .type   foo, %function
  foo:
 uxtwx0, w0
 uxtwx1, w1
 ldr w0, [x0]
 ldr w1, [x1]
 b   bar
 .size   foo, .-foo
 .ident  GCC: (20140403) 5.0.0 20150312 (experimental)
 
  According the ILP32 ABI, the two uxtw instrunctions here is not necessary.
  The following is a proposed patch to fix this issue, i.e. promoting pointer
 function arguments to word_mode.
  But I don't know whether it's a good idea to do this for pointer return 
  values.
  Any comments?
 
 
 Actually they are required. The abi says the upper 32bits are undefined for
 arguments smaller then 64bits. I had this discussion a year or more ago about 
 this
 case.
 
 A simple struct like
 struct a { int * b; int c; };
 
 Will break the code if we don't have the zero extends
 
 Try
 void f(int *);
 void g(struct a d)
 {
   f(d.b);
 }
 
 And see that there is no zero extends inside g.  I saw this exact thing when
 working on getting gccgo working.
 
 It also means the assembly functions in glibc are broken and need to be fixed.
 
 Thanks,
 Andrew
 
 
  Index: gcc/config/aarch64/aarch64.c
 
 =
 ==
  --- gcc/config/aarch64/aarch64.c(revision 221393)
  +++ gcc/config/aarch64/aarch64.c(working copy)
  @@ -1587,7 +1587,7 @@ aarch64_function_value (const_tree type, const_tre
machine_mode ag_mode;
 
mode = TYPE_MODE (type);
  -  if (INTEGRAL_TYPE_P (type))
  +  if (INTEGRAL_TYPE_P (type) || POINTER_TYPE_P (type))
  mode = promote_function_mode (type, mode, unsignedp, func, 1);
 
if (aarch64_return_in_msb (type))
  @@ -1650,6 +1650,24 @@ aarch64_function_value_regno_p (const unsigned
 int
return false;
  }
 
  +/* Implement TARGET_PROMOTE_FUNCTION_MODE.  */
  +
  +static machine_mode
  +aarch64_promote_function_mode (const_tree type, machine_mode mode,
  +   int *punsignedp, const_tree fntype,
  +   int for_return)
  +{
  +  /* Pointer function arguments and return values are promoted to
  +word_mode.  */
  +  if (type != NULL_TREE  POINTER_TYPE_P (type))
  +{
  +  *punsignedp = POINTERS_EXTEND_UNSIGNED;
  +  return word_mode;
  +}
  +
  +  return default_promote_function_mode (type, mode, punsignedp, fntype,
  +for_return); }
  +
  /* Implement TARGET_RETURN_IN_MEMORY.
 
 If the type T of the result of a function is such that @@ -11329,6
  +11347,9 @@ aarch64_gen_adjusted_ldpstp (rtx *operands, bool l #define
  TARGET_OVERRIDE_OPTIONS_AFTER_CHANGE \
aarch64_override_options_after_change
 
  +#undef TARGET_PROMOTE_FUNCTION_MODE
  +#define TARGET_PROMOTE_FUNCTION_MODE
 aarch64_promote_function_mode
  +
  #undef TARGET_PASS_BY_REFERENCE
  #define TARGET_PASS_BY_REFERENCE aarch64_pass_by_reference
  aarch64-promote-v2.diff


[PING, AArch64] Add long-call attribute and pragma interfaces

2015-04-03 Thread Yangfei (Felix)
Patch ping: https://gcc.gnu.org/ml/gcc-patches/2014-11/msg01148.html 

Thanks. 


[RFC AArch64] Implement TARGET_PROMOTE_FUNCTION_MODE for ILP32 code generation

2015-03-16 Thread Yangfei (Felix)
Hi,

  For this trivial testcase: 

extern int bar (int , int);

int foo (int *a, int *b)
{
return bar (*a, *b);
}

I noticed that GCC generates redundant zero-extension instructions under ILP32 
(aarch64-linux-gnu-gcc -S -O2 -mabi=ilp32).
Assembly code:
.arch armv8-a+fp+simd
.file   1.c
.text
.align  2
.p2align 3,,7
.global foo
.type   foo, %function
foo:
uxtwx0, w0
uxtwx1, w1
ldr w0, [x0]
ldr w1, [x1]
b   bar
.size   foo, .-foo
.ident  GCC: (20140403) 5.0.0 20150312 (experimental)

According to the ILP32 ABI, the two uxtw instructions here are not necessary.
The following is a proposed patch to fix this issue, i.e. promoting pointer 
function arguments to word_mode.
But I don't know whether it's a good idea to do this for pointer return values.
Any comments?


Index: gcc/config/aarch64/aarch64.c
===
--- gcc/config/aarch64/aarch64.c(revision 221393)
+++ gcc/config/aarch64/aarch64.c(working copy)
@@ -1587,7 +1587,7 @@ aarch64_function_value (const_tree type, const_tre
   machine_mode ag_mode;
 
   mode = TYPE_MODE (type);
-  if (INTEGRAL_TYPE_P (type))
+  if (INTEGRAL_TYPE_P (type) || POINTER_TYPE_P (type))
 mode = promote_function_mode (type, mode, unsignedp, func, 1);
 
   if (aarch64_return_in_msb (type))
@@ -1650,6 +1650,24 @@ aarch64_function_value_regno_p (const unsigned int
   return false;
 }
 
+/* Implement TARGET_PROMOTE_FUNCTION_MODE.  */
+
+static machine_mode
+aarch64_promote_function_mode (const_tree type, machine_mode mode,
+  int *punsignedp, const_tree fntype,
+  int for_return)
+{
+  /* Pointer function arguments and return values are promoted to word_mode.  
*/
+  if (type != NULL_TREE  POINTER_TYPE_P (type))
+{
+  *punsignedp = POINTERS_EXTEND_UNSIGNED;
+  return word_mode;
+}
+
+  return default_promote_function_mode (type, mode, punsignedp, fntype,
+for_return);
+}
+
 /* Implement TARGET_RETURN_IN_MEMORY.
 
If the type T of the result of a function is such that
@@ -11329,6 +11347,9 @@ aarch64_gen_adjusted_ldpstp (rtx *operands, bool l
 #define TARGET_OVERRIDE_OPTIONS_AFTER_CHANGE \
   aarch64_override_options_after_change
 
+#undef TARGET_PROMOTE_FUNCTION_MODE
+#define TARGET_PROMOTE_FUNCTION_MODE aarch64_promote_function_mode
+
 #undef TARGET_PASS_BY_REFERENCE
 #define TARGET_PASS_BY_REFERENCE aarch64_pass_by_reference


aarch64-promote-v2.diff
Description: aarch64-promote-v2.diff


[PING ^ 5] [PATCH, AARCH64] Add support for -mlong-calls option

2015-02-14 Thread Yangfei (Felix)
Ping ... 


 
 Patch ping: https://gcc.gnu.org/ml/gcc-patches/2014-11/msg02258.html
 Any comments, Richard? Thanks.



Re: [PING] [PATCH] [AArch64, NEON] Add vfms_n_f32, vfmsq_n_f32 and vfmsq_n_f64 specified by the ACLE

2015-01-21 Thread Yangfei (Felix)
 On 21/01/15 09:22, Yangfei (Felix) wrote:
  This is a ping for:
  https://gcc.gnu.org/ml/gcc-patches/2014-12/msg01008.html
  I updated the testcase adding test for vfmsq_n_f64 intrinsic.
  Test OK for both aarch64-linux-gnu and aarch64_be-linux-gnu-gcc.
  OK for the trunk?  Thanks.
 
 
  Index: gcc/ChangeLog
 
 =
 ==
  --- gcc/ChangeLog   (revision 219845)
  +++ gcc/ChangeLog   (working copy)
  @@ -1,3 +1,8 @@
  +2015-01-21  Felix Yang  felix.y...@huawei.com
  +
  +   * config/aarch64/arm_neon.h (vfms_n_f32, vfmsq_n_f32, vfmsq_n_f64):
 New
  +   intrinsics.
  +
 
 Hi Felix,
 
 Thanks for the the patch. It LGTM apart from one point - you seem to have
 missed out vfms_n_f64?
 
 Thanks,
 Tejas.
 

Hello Tejas,

You are right, vfms_n_f64 is missing here.  And I find that vfma_n_f64 
is not there either. 
I would like to put together another patch to add them.  So is this patch OK 
for the trunk? 
Thanks.



[PING] [PATCH] [AArch64, NEON] Add vfms_n_f32, vfmsq_n_f32 and vfmsq_n_f64 specified by the ACLE

2015-01-21 Thread Yangfei (Felix)
This is a ping for: https://gcc.gnu.org/ml/gcc-patches/2014-12/msg01008.html
I updated the testcase adding test for vfmsq_n_f64 intrinsic.
Test OK for both aarch64-linux-gnu and aarch64_be-linux-gnu-gcc.
OK for the trunk?  Thanks.


Index: gcc/ChangeLog
===
--- gcc/ChangeLog   (revision 219845)
+++ gcc/ChangeLog   (working copy)
@@ -1,3 +1,8 @@
+2015-01-21  Felix Yang  felix.y...@huawei.com
+
+   * config/aarch64/arm_neon.h (vfms_n_f32, vfmsq_n_f32, vfmsq_n_f64): New
+   intrinsics.
+
 2015-01-19  Jiong Wang  jiong.w...@arm.com
Andrew Pinski  apin...@cavium.com
 
Index: gcc/testsuite/gcc.target/aarch64/advsimd-intrinsics/vfms_n.c
===
--- gcc/testsuite/gcc.target/aarch64/advsimd-intrinsics/vfms_n.c
(revision 0)
+++ gcc/testsuite/gcc.target/aarch64/advsimd-intrinsics/vfms_n.c
(revision 0)
@@ -0,0 +1,74 @@
+#include arm_neon.h
+#include arm-neon-ref.h
+#include compute-ref-data.h
+
+#if defined(__aarch64__) && defined(__ARM_FEATURE_FMA)
+/* Expected results.  */
+VECT_VAR_DECL(expected,hfloat,32,2) [] = { 0x4438ca3d, 0x44390a3d };
+VECT_VAR_DECL(expected,hfloat,32,4) [] = { 0x44869eb8, 0x4486beb8, 0x4486deb8, 
0x4486feb8 };
+VECT_VAR_DECL(expected,hfloat,64,2) [] = { 0x408906e1532b8520, 
0x40890ee1532b8520 };
+
+#define VECT_VAR_ASSIGN(S,Q,T1,W) S##Q##_##T1##W
+#define ASSIGN(S, Q, T, W, V) T##W##_t S##Q##_##T##W = V
+#define TEST_MSG VFMS_N/VFMSQ_N
+
+void exec_vfms_n (void)
+{
+  /* Basic test: v4=vfms_n(v1,v2), then store the result.  */
+#define TEST_VFMS_N(Q, T1, T2, W, N)   \
+  VECT_VAR(vector_res, T1, W, N) = \
+vfms##Q##_n_##T2##W(VECT_VAR(vector1, T1, W, N),   \
+   VECT_VAR(vector2, T1, W, N),\
+   VECT_VAR_ASSIGN(scalar, Q, T1, W)); \
+  vst1##Q##_##T2##W(VECT_VAR(result, T1, W, N), VECT_VAR(vector_res, T1, W, N))
+
+#define CHECK_VFMS_N_RESULTS(test_name,comment)
\
+  {\
+CHECK_FP(test_name, float, 32, 2, PRIx32, expected, comment);  \
+CHECK_FP(test_name, float, 32, 4, PRIx32, expected, comment);  \
+CHECK_FP(test_name, float, 64, 2, PRIx64, expected, comment);  \
+  }
+
+#define DECL_VFMS_N_VAR(VAR)   \
+  DECL_VARIABLE(VAR, float, 32, 2);\
+  DECL_VARIABLE(VAR, float, 32, 4);\
+  DECL_VARIABLE(VAR, float, 64, 2);
+
+  DECL_VFMS_N_VAR(vector1);
+  DECL_VFMS_N_VAR(vector2);
+  DECL_VFMS_N_VAR(vector3);
+  DECL_VFMS_N_VAR(vector_res);
+
+  clean_results ();
+
+  /* Initialize input vector1 from buffer.  */
+  VLOAD(vector1, buffer, , float, f, 32, 2);
+  VLOAD(vector1, buffer, q, float, f, 32, 4);
+  VLOAD(vector1, buffer, q, float, f, 64, 2);
+
+  /* Choose init value arbitrarily.  */
+  VDUP(vector2, , float, f, 32, 2, -9.3f);
+  VDUP(vector2, q, float, f, 32, 4, -29.7f);
+  VDUP(vector2, q, float, f, 64, 2, -15.8f);
+  
+  /* Choose init value arbitrarily.  */
+  ASSIGN(scalar, , float, 32, 81.2f);
+  ASSIGN(scalar, q, float, 32, 36.8f);
+  ASSIGN(scalar, q, float, 64, 51.7f);
+
+  /* Execute the tests.  */
+  TEST_VFMS_N(, float, f, 32, 2);
+  TEST_VFMS_N(q, float, f, 32, 4);
+  TEST_VFMS_N(q, float, f, 64, 2);
+
+  CHECK_VFMS_N_RESULTS (TEST_MSG, );
+}
+#endif
+
+int main (void)
+{
+#if defined(__aarch64__) && defined(__ARM_FEATURE_FMA)
+  exec_vfms_n ();
+#endif
+  return 0;
+}
Index: gcc/testsuite/ChangeLog
===
--- gcc/testsuite/ChangeLog (revision 219845)
+++ gcc/testsuite/ChangeLog (working copy)
@@ -1,3 +1,7 @@
+2015-01-21  Felix Yang  felix.y...@huawei.com
+
+   * gcc.target/aarch64/advsimd-intrinsics/vfms_n.c: New test.
+
 2015-01-19  Felix Yang  felix.y...@huawei.com
Haijian Zhang  z.zhanghaij...@huawei.com
 
Index: gcc/config/aarch64/arm_neon.h
===
--- gcc/config/aarch64/arm_neon.h   (revision 219845)
+++ gcc/config/aarch64/arm_neon.h   (working copy)
@@ -14774,7 +14774,24 @@ vfmsq_f64 (float64x2_t __a, float64x2_t __b, float
   return __builtin_aarch64_fmav2df (-__b, __c, __a);
 }
 
+__extension__ static __inline float32x2_t __attribute__ ((__always_inline__))
+vfms_n_f32 (float32x2_t __a, float32x2_t __b, float32_t __c)
+{
+  return __builtin_aarch64_fmav2sf (-__b, vdup_n_f32 (__c), __a);
+}
 
+__extension__ static __inline float32x4_t __attribute__ ((__always_inline__))
+vfmsq_n_f32 (float32x4_t __a, float32x4_t __b, float32_t __c)
+{
+  return __builtin_aarch64_fmav4sf (-__b, vdupq_n_f32 (__c), __a);
+}
+
+__extension__ static __inline float64x2_t __attribute__ ((__always_inline__))
+vfmsq_n_f64 (float64x2_t __a, float64x2_t __b, 

[PING] [PATCH] [AArch64, NEON] Fix testcases add by r218484

2015-01-19 Thread Yangfei (Felix)
Hi, 
  This is a ping for: https://gcc.gnu.org/ml/gcc-patches/2014-12/msg01328.html
  OK for the trunk? Thanks.


[PING ^ 4] [RFC PATCH, AARCH64] Add support for -mlong-calls option

2015-01-19 Thread Yangfei (Felix)
Patch ping: https://gcc.gnu.org/ml/gcc-patches/2014-11/msg02258.html 
Any comments, Richard? Thanks. 



Re: [PATCH, autofdo] Some code cleanup

2015-01-17 Thread Yangfei (Felix)
Hi,

I updated the patch adding one ChangeLog entry.
OK for the trunk?  Thanks.


Index: gcc/ChangeLog
===
--- gcc/ChangeLog   (revision 219297)
+++ gcc/ChangeLog   (working copy)
@@ -1,3 +1,12 @@
+2015-01-17  Felix Yang  felix.y...@huawei.com
+
+   * auto-profile.c (afdo_find_equiv_class): Remove unnecessary code.
+   (autofdo_source_profile::get_callsite_total_count,
+   function_instance::get_function_instance_by_decl,
+   string_table::get_index, string_table::get_index_by_decl,
+   afdo_vpt_for_early_inline, afdo_callsite_hot_enough_for_early_inline):
+   Fix comment typos and formatting.
+
 2015-01-06  Sandra Loosemore  san...@codesourcery.com
 
* doc/invoke.texi (RS/6000 and PowerPC Options): Tidy formatting
Index: gcc/auto-profile.c
===
--- gcc/auto-profile.c  (revision 219297)
+++ gcc/auto-profile.c  (working copy)
@@ -96,7 +96,7 @@ along with GCC; see the file COPYING3.  If not see
  standalone symbol, or a clone of a function that is inlined into another
  function.
 
-   Phase 2: Early inline + valur profile transformation.
+   Phase 2: Early inline + value profile transformation.
  Early inline uses autofdo_source_profile to find if a callsite is:
 * inlined in the profiled binary.
 * callee body is hot in the profiling run.
@@ -361,7 +361,7 @@ get_original_name (const char *name)
 
 /* Return the combined location, which is a 32bit integer in which
higher 16 bits stores the line offset of LOC to the start lineno
-   of DECL, The lower 16 bits stores the discrimnator.  */
+   of DECL, The lower 16 bits stores the discriminator.  */
 
 static unsigned
 get_combined_location (location_t loc, tree decl)
@@ -424,7 +424,7 @@ get_inline_stack (location_t locus, inline_stack *
 
 /* Return STMT's combined location, which is a 32bit integer in which
higher 16 bits stores the line offset of LOC to the start lineno
-   of DECL, The lower 16 bits stores the discrimnator.  */
+   of DECL, The lower 16 bits stores the discriminator.  */
 
 static unsigned
 get_relative_location_for_stmt (gimple stmt)
@@ -481,8 +481,8 @@ string_table::get_index (const char *name) const
   string_index_map::const_iterator iter = map_.find (name);
   if (iter == map_.end ())
 return -1;
-  else
-return iter-second;
+
+  return iter-second;
 }
 
 /* Return the index of a given function DECL. Return -1 if DECL is not 
@@ -502,8 +502,8 @@ string_table::get_index_by_decl (tree decl) const
 return ret;
   if (DECL_ABSTRACT_ORIGIN (decl))
 return get_index_by_decl (DECL_ABSTRACT_ORIGIN (decl));
-  else
-return -1;
+
+  return -1;
 }
 
 /* Return the function name of a given INDEX.  */
@@ -569,8 +569,8 @@ function_instance::get_function_instance_by_decl (
 }
   if (DECL_ABSTRACT_ORIGIN (decl))
 return get_function_instance_by_decl (lineno, DECL_ABSTRACT_ORIGIN (decl));
-  else
-return NULL;
+
+  return NULL;
 }
 
 /* Store the profile info for LOC in INFO. Return TRUE if profile info
@@ -597,7 +597,7 @@ function_instance::mark_annotated (location_t loc)
   iter-second.annotated = true;
 }
 
-/* Read the inlinied indirect call target profile for STMT and store it in
+/* Read the inlined indirect call target profile for STMT and store it in
MAP, return the total count for all inlined indirect calls.  */
 
 gcov_type
@@ -824,8 +824,8 @@ autofdo_source_profile::get_callsite_total_count (
   || afdo_string_table-get_index (IDENTIFIER_POINTER (
  DECL_ASSEMBLER_NAME (edge-callee-decl))) != s-name ())
 return 0;
-  else
-return s-total_count ();
+
+  return s-total_count ();
 }
 
 /* Read AutoFDO profile and returns TRUE on success.  */
@@ -956,9 +956,9 @@ read_profile (void)
histograms for indirect-call optimization.
 
This function is actually served for 2 purposes:
-     * before annotation, we need to mark histogram, promote and inline
-     * after annotation, we just need to mark, and let follow-up logic to
-       decide if it needs to promote and inline.  */
+ * before annotation, we need to mark histogram, promote and inline
+ * after annotation, we just need to mark, and let follow-up logic to
+   decide if it needs to promote and inline.  */
 
 static void
 afdo_indirect_call (gimple_stmt_iterator *gsi, const icall_target_map map,
@@ -1054,7 +1054,7 @@ set_edge_annotated (edge e, edge_set *annotated)
 }
 
 /* For a given BB, set its execution count. Attach value profile if a stmt
-   is not in PROMOTED, because we only want to promot an indirect call once.
+   is not in PROMOTED, because we only want to promote an indirect call once.
Return TRUE if BB is annotated.  */
 
 static bool
@@ -1138,7 +1138,7 @@ afdo_find_equiv_class (bb_set *annotated_bb)
 bb1-aux = bb;
 if (bb1-count  bb-count  is_bb_annotated (bb1, 

Re: [PATCH] [AArch64, NEON] Improve vpmaxX vpminX intrinsics

2015-01-13 Thread Yangfei (Felix)
 On 09/12/14 08:17, Yangfei (Felix) wrote:
  On 28 November 2014 at 09:23, Yangfei (Felix) felix.y...@huawei.com
 wrote:
  Hi,
 This patch converts vpmaxX  vpminX intrinsics to use builtin
  functions
  instead of the previous inline assembly syntax.
 Regtested with aarch64-linux-gnu on QEMU.  Also passed the
  glorious
  testsuite of Christophe Lyon.
 OK for the trunk?
 
  Hi Felix,   We know from experience that the advsimd intrinsics tend
  to be fragile for big endian and in general it is fairly easy to
  break the big endian case.  For these advsimd improvements that you
  are working on (that we very much appreciate) it is important to run
  both little endian and big endian regressions.
 
  Thanks
  /Marcus
 
 
  Okay.  Any plan for the advsimd big-endian improvement?
  I rebased this patch over Alan Lawrance's patch:
  https://gcc.gnu.org/ml/gcc-patches/2014-12/msg00279.html
  No regressions for aarch64_be-linux-gnu target too.  OK for the thunk?
 
 
  Index: gcc/ChangeLog
 
 =
 ==
  --- gcc/ChangeLog   (revision 218464)
  +++ gcc/ChangeLog   (working copy)
  @@ -1,3 +1,18 @@
  +2014-12-09  Felix Yang  felix.y...@huawei.com
  +
  +   * config/aarch64/aarch64-simd.md
 (aarch64_maxmin_unspmode): New
  +   pattern.
  +   * config/aarch64/aarch64-simd-builtins.def (smaxp, sminp, umaxp,
  +   uminp, smax_nanp, smin_nanp): New builtins.
  +   * config/aarch64/arm_neon.h (vpmax_s8, vpmax_s16, vpmax_s32,
  +   vpmax_u8, vpmax_u16, vpmax_u32, vpmaxq_s8, vpmaxq_s16,
 vpmaxq_s32,
  +   vpmaxq_u8, vpmaxq_u16, vpmaxq_u32, vpmax_f32, vpmaxq_f32,
 vpmaxq_f64,
  +   vpmaxqd_f64, vpmaxs_f32, vpmaxnm_f32, vpmaxnmq_f32,
 vpmaxnmq_f64,
  +   vpmaxnmqd_f64, vpmaxnms_f32, vpmin_s8, vpmin_s16, vpmin_s32,
 vpmin_u8,
  +   vpmin_u16, vpmin_u32, vpminq_s8, vpminq_s16, vpminq_s32,
 vpminq_u8,
  +   vpminq_u16, vpminq_u32, vpmin_f32, vpminq_f32, vpminq_f64,
 vpminqd_f64,
  +   vpmins_f32, vpminnm_f32, vpminnmq_f32, vpminnmq_f64,
  + vpminnmqd_f64,
  +
 
 
__extension__ static __inline float32x2_t __attribute__
  ((__always_inline__))
  Index: gcc/config/aarch64/aarch64-simd.md
 
 =
 ==
  --- gcc/config/aarch64/aarch64-simd.md  (revision 218464)
  +++ gcc/config/aarch64/aarch64-simd.md  (working copy)
  @@ -1017,6 +1017,28 @@
  DONE;
})
 
  +;; Pairwise Integer Max/Min operations.
  +(define_insn aarch64_maxmin_unspmode
  + [(set (match_operand:VDQ_BHSI 0 register_operand =w)
  +   (unspec:VDQ_BHSI [(match_operand:VDQ_BHSI 1
 register_operand w)
  +(match_operand:VDQ_BHSI 2 register_operand
 w)]
  +   MAXMINV))]
  + TARGET_SIMD
  + maxmin_uns_opp\t%0.Vtype, %1.Vtype, %2.Vtype
  +  [(set_attr type neon_minmaxq)]
  +)
  +
 
 Hi Felix,
 
 Sorry for the delay in getting back to you on this.
 
 If you've rolled aarch64_reduc_maxmin_uns_internalv2si into the above
 pattern, do you still need it? For all its call points, just point them to
 aarch64_maxmin_unspmode?
 
 Thanks,
 Tejas.
 


Hello Tejas,

  I didn't do this yet. 
  Currently the aarch64_reduc_maxmin_uns_internalv2si is only called by 
reduc_maxmin_uns_scal_mode. 
  I find it somewhat troublesome to handle this due to the use of iterators 
in the caller pattern. 
  Are you going to rework this part? 


[PATCH, autofdo] Some code cleanup

2015-01-12 Thread Yangfei (Felix)
Hi,

  The attached patch does some code cleanup for auto-profile.c: it fixes typos 
and removes some unnecessary MAX/MIN checks and else branches.
  OK for the trunk? 


Index: gcc/auto-profile.c
===
--- gcc/auto-profile.c  (revision 219297)
+++ gcc/auto-profile.c  (working copy)
@@ -96,7 +96,7 @@ along with GCC; see the file COPYING3.  If not see
  standalone symbol, or a clone of a function that is inlined into another
  function.
 
-   Phase 2: Early inline + valur profile transformation.
+   Phase 2: Early inline + value profile transformation.
  Early inline uses autofdo_source_profile to find if a callsite is:
 * inlined in the profiled binary.
 * callee body is hot in the profiling run.
@@ -361,7 +361,7 @@ get_original_name (const char *name)
 
 /* Return the combined location, which is a 32bit integer in which
higher 16 bits stores the line offset of LOC to the start lineno
-   of DECL, The lower 16 bits stores the discrimnator.  */
+   of DECL, The lower 16 bits stores the discriminator.  */
 
 static unsigned
 get_combined_location (location_t loc, tree decl)
@@ -424,7 +424,7 @@ get_inline_stack (location_t locus, inline_stack *
 
 /* Return STMT's combined location, which is a 32bit integer in which
higher 16 bits stores the line offset of LOC to the start lineno
-   of DECL, The lower 16 bits stores the discrimnator.  */
+   of DECL, The lower 16 bits stores the discriminator.  */
 
 static unsigned
 get_relative_location_for_stmt (gimple stmt)
@@ -481,8 +481,8 @@ string_table::get_index (const char *name) const
   string_index_map::const_iterator iter = map_.find (name);
   if (iter == map_.end ())
 return -1;
-  else
-return iter-second;
+
+  return iter-second;
 }
 
 /* Return the index of a given function DECL. Return -1 if DECL is not 
@@ -502,8 +502,8 @@ string_table::get_index_by_decl (tree decl) const
 return ret;
   if (DECL_ABSTRACT_ORIGIN (decl))
 return get_index_by_decl (DECL_ABSTRACT_ORIGIN (decl));
-  else
-return -1;
+
+  return -1;
 }
 
 /* Return the function name of a given INDEX.  */
@@ -569,8 +569,8 @@ function_instance::get_function_instance_by_decl (
 }
   if (DECL_ABSTRACT_ORIGIN (decl))
 return get_function_instance_by_decl (lineno, DECL_ABSTRACT_ORIGIN (decl));
-  else
-return NULL;
+
+  return NULL;
 }
 
 /* Store the profile info for LOC in INFO. Return TRUE if profile info
@@ -597,7 +597,7 @@ function_instance::mark_annotated (location_t loc)
   iter-second.annotated = true;
 }
 
-/* Read the inlinied indirect call target profile for STMT and store it in
+/* Read the inlined indirect call target profile for STMT and store it in
MAP, return the total count for all inlined indirect calls.  */
 
 gcov_type
@@ -824,8 +824,8 @@ autofdo_source_profile::get_callsite_total_count (
   || afdo_string_table-get_index (IDENTIFIER_POINTER (
  DECL_ASSEMBLER_NAME (edge-callee-decl))) != s-name ())
 return 0;
-  else
-return s-total_count ();
+
+  return s-total_count ();
 }
 
 /* Read AutoFDO profile and returns TRUE on success.  */
@@ -956,9 +956,9 @@ read_profile (void)
histograms for indirect-call optimization.
 
This function is actually served for 2 purposes:
-     * before annotation, we need to mark histogram, promote and inline
-     * after annotation, we just need to mark, and let follow-up logic to
-       decide if it needs to promote and inline.  */
+ * before annotation, we need to mark histogram, promote and inline
+ * after annotation, we just need to mark, and let follow-up logic to
+   decide if it needs to promote and inline.  */
 
 static void
 afdo_indirect_call (gimple_stmt_iterator *gsi, const icall_target_map map,
@@ -1054,7 +1054,7 @@ set_edge_annotated (edge e, edge_set *annotated)
 }
 
 /* For a given BB, set its execution count. Attach value profile if a stmt
-   is not in PROMOTED, because we only want to promot an indirect call once.
+   is not in PROMOTED, because we only want to promote an indirect call once.
Return TRUE if BB is annotated.  */
 
 static bool
@@ -1138,7 +1138,7 @@ afdo_find_equiv_class (bb_set *annotated_bb)
 bb1-aux = bb;
 if (bb1-count  bb-count  is_bb_annotated (bb1, *annotated_bb))
   {
-bb-count = MAX (bb-count, bb1-count);
+bb-count = bb1-count;
 set_bb_annotated (bb, annotated_bb);
   }
   }
@@ -1150,7 +1150,7 @@ afdo_find_equiv_class (bb_set *annotated_bb)
 bb1-aux = bb;
 if (bb1-count  bb-count  is_bb_annotated (bb1, *annotated_bb))
   {
-bb-count = MAX (bb-count, bb1-count);
+bb-count = bb1-count;
 set_bb_annotated (bb, annotated_bb);
   }
   }
@@ -1455,13 +1455,14 @@ afdo_vpt_for_early_inline (stmt_set *promoted_stmt
   }
   }
   }
+
   if (has_vpt)
 {
   optimize_inline_calls 

[PATCH] Fix PR64240

2014-12-16 Thread Yangfei (Felix)
Hi,

  This patch fixes an obvious typo which may affect the DDG creation of SMS and 
make this optimization produce buggy code. 
  Bootstrapped on x86_64-suse-linux.  Also passed the check-gcc tests for 
aarch64-linux-gnu. 
  OK for the trunk? 


Index: gcc/ddg.c
===
--- gcc/ddg.c   (revision 218582)
+++ gcc/ddg.c   (working copy)
@@ -77,7 +77,7 @@ mark_mem_use (rtx *x, void *)
 {
   subrtx_iterator::array_type array;
   FOR_EACH_SUBRTX (iter, array, *x, NONCONST)
-if (MEM_P (*x))
+if (MEM_P (*iter))
   {
mem_ref_p = true;
break;
Index: gcc/ChangeLog
===
--- gcc/ChangeLog   (revision 218582)
+++ gcc/ChangeLog   (working copy)
@@ -1,3 +1,8 @@
+2014-12-16  Felix Yang  felix.y...@huawei.com
+
+   PR rtl-optimization/64240
+   * ddg.c (mark_mem_use): Check *iter instead of *x.
+
 2014-12-10  Felix Yang  felix.y...@huawei.com
 
* config/aarch64/aarch64-protos.h (aarch64_function_profiler): Remove


pr64240.diff
Description: pr64240.diff


Re: [PATCH] Fix PR64240

2014-12-16 Thread Yangfei (Felix)
 On December 16, 2014 9:51:25 AM CET, Yangfei (Felix) felix.y...@huawei.com
 wrote:
 Hi,
 
 This patch fixes an obvious typo which may affect the DDG creation of
 SMS and make this optimization produce buggy code.
 Bootstrapped on x86_64-suse-linux.  Also passed check-gcc test for
 aarch64-linux-gnu.
   OK for the trunk?
 
 Do you have a testcase? If so please add it.

 OK

Yes, the patch is updated with the testcase added. 


Index: gcc/ddg.c
===
--- gcc/ddg.c   (revision 218582)
+++ gcc/ddg.c   (working copy)
@@ -77,7 +77,7 @@ mark_mem_use (rtx *x, void *)
 {
   subrtx_iterator::array_type array;
   FOR_EACH_SUBRTX (iter, array, *x, NONCONST)
-if (MEM_P (*x))
+if (MEM_P (*iter))
   {
mem_ref_p = true;
break;
Index: gcc/ChangeLog
===
--- gcc/ChangeLog   (revision 218582)
+++ gcc/ChangeLog   (working copy)
@@ -1,3 +1,8 @@
+2014-12-16  Felix Yang  felix.y...@huawei.com
+
+   PR rtl-optimization/64240
+   * ddg.c (mark_mem_use): Check *iter instead of *x.
+
 2014-12-10  Felix Yang  felix.y...@huawei.com
 
* config/aarch64/aarch64-protos.h (aarch64_function_profiler): Remove
Index: gcc/testsuite/gcc.dg/sms-12.c
===
--- gcc/testsuite/gcc.dg/sms-12.c   (revision 0)
+++ gcc/testsuite/gcc.dg/sms-12.c   (revision 0)
@@ -0,0 +1,43 @@
+/* { dg-do run } */
+/* { dg-skip-if  { ! { aarch64-*-* } } { * } {  } } */
+/* { dg-options -O2 -fmodulo-sched -funroll-loops -fdump-rtl-sms --param 
sms-min-sc=1 -fmodulo-sched-allow-regmoves -fPIC } */
+
+extern void abort (void);
+
+int X[1000]={0};
+int Y[1000]={0};
+
+extern void abort (void);
+
+__attribute__ ((noinline))
+int
+foo (int len, long a)
+{
+  int i;
+  long res = a;
+
+  len = 1000;
+  for (i = 0; i < len; i++)
+res += X[i]* Y[i];
+
+  if (res != 601)
+abort ();
+
+}
+
+int
+main ()
+{
+  X[0] = Y[1] = 2;
+  Y[0] = X[1] = 21;
+  X[2] = Y[3] = 3;
+  Y[2] = X[3] = 31;
+  X[4] = Y[5] = 4;
+  Y[4] = X[5] = 41;
+
+  foo (6, 3);
+  return 0;
+}
+
+/* { dg-final { cleanup-rtl-dump sms } } */
+

Property changes on: gcc/testsuite/gcc.dg/sms-12.c
___
Added: svn:executable
   + *

Index: gcc/testsuite/ChangeLog
===
--- gcc/testsuite/ChangeLog (revision 218582)
+++ gcc/testsuite/ChangeLog (working copy)
@@ -1,3 +1,8 @@
+2014-12-16  Felix Yang  felix.y...@huawei.com
+
+   PR rtl-optimization/64240
+   * gcc.dg/sms-12.c: New test.
+
 2014-12-10  Martin Liska  mli...@suse.cz
 
* gcc.dg/ipa/pr63909.c: New test.


pr64240-v2.diff
Description: pr64240-v2.diff


Re: [PATCH] [AArch64, NEON] Fix testcases add by r218484

2014-12-16 Thread Yangfei (Felix)
   #define DECL_VABD_VAR(VAR) \
 be careful with your cut and paste. VABD should probably be VFMA_N here,
 although it's purely a naming convention :-)

The attached v3 patch fixes this minor issue.  Thanks.

 It's OK for me with that change, but I'm not a maintainer.
 
 One more question: are there any corner-cases we would want to check?
 (for instance, rounding, nan, infinity, ...)

We don't see any existing testsuite coverage for these intrinsics. 
So we are adding these testcases to test the basic functionality. 
For now, I don't see any corner-cases that need to be checked for this patch. 


Index: gcc/testsuite/gcc.target/aarch64/advsimd-intrinsics/compute-ref-data.h
===
--- gcc/testsuite/gcc.target/aarch64/advsimd-intrinsics/compute-ref-data.h  
(revision 218582)
+++ gcc/testsuite/gcc.target/aarch64/advsimd-intrinsics/compute-ref-data.h  
(working copy)
@@ -142,6 +142,10 @@ VECT_VAR_DECL_INIT(buffer, poly, 16, 8);
 PAD(buffer_pad, poly, 16, 8);
 VECT_VAR_DECL_INIT(buffer, float, 32, 4);
 PAD(buffer_pad, float, 32, 4);
+#ifdef __aarch64__
+VECT_VAR_DECL_INIT(buffer, float, 64, 2);
+PAD(buffer_pad, float, 64, 2);
+#endif
 
 /* The tests for vld1_dup and vdup expect at least 4 entries in the
input buffer, so force 1- and 2-elements initializers to have 4
Index: gcc/testsuite/gcc.target/aarch64/advsimd-intrinsics/vfma_n.c
===
--- gcc/testsuite/gcc.target/aarch64/advsimd-intrinsics/vfma_n.c
(revision 218582)
+++ gcc/testsuite/gcc.target/aarch64/advsimd-intrinsics/vfma_n.c
(working copy)
@@ -2,6 +2,7 @@
 #include arm-neon-ref.h
 #include compute-ref-data.h
 
+#if defined(__aarch64__) && defined(__ARM_FEATURE_FMA)
 /* Expected results.  */
 VECT_VAR_DECL(expected,hfloat,32,2) [] = { 0x4438ca3d, 0x44390a3d };
 VECT_VAR_DECL(expected,hfloat,32,4) [] = { 0x44869eb8, 0x4486beb8, 0x4486deb8, 
0x4486feb8 };
@@ -9,33 +10,34 @@ VECT_VAR_DECL(expected,hfloat,64,2) [] = { 0x40890
 
 #define VECT_VAR_ASSIGN(S,Q,T1,W) S##Q##_##T1##W
 #define ASSIGN(S, Q, T, W, V) T##W##_t S##Q##_##T##W = V
-#define TEST_MSG VFMA/VFMAQ
+#define TEST_MSG VFMA_N/VFMAQ_N
+
 void exec_vfma_n (void)
 {
   /* Basic test: v4=vfma_n(v1,v2), then store the result.  */
-#define TEST_VFMA(Q, T1, T2, W, N) \
+#define TEST_VFMA_N(Q, T1, T2, W, N)   \
   VECT_VAR(vector_res, T1, W, N) = \
 vfma##Q##_n_##T2##W(VECT_VAR(vector1, T1, W, N),   \
- VECT_VAR(vector2, T1, W, N),  \
- VECT_VAR_ASSIGN(Scalar, Q, T1, W));   
\
+   VECT_VAR(vector2, T1, W, N),\
+   VECT_VAR_ASSIGN(scalar, Q, T1, W)); \
   vst1##Q##_##T2##W(VECT_VAR(result, T1, W, N), VECT_VAR(vector_res, T1, W, N))
 
-#define CHECK_VFMA_RESULTS(test_name,comment)  \
+#define CHECK_VFMA_N_RESULTS(test_name,comment)
\
   {\
 CHECK_FP(test_name, float, 32, 2, PRIx32, expected, comment);  \
 CHECK_FP(test_name, float, 32, 4, PRIx32, expected, comment);  \
-   CHECK_FP(test_name, float, 64, 2, PRIx64, expected, comment);   \
-  }
+CHECK_FP(test_name, float, 64, 2, PRIx64, expected, comment);  \
+  }
 
-#define DECL_VABD_VAR(VAR) \
+#define DECL_VFMA_N_VAR(VAR)   \
   DECL_VARIABLE(VAR, float, 32, 2);\
   DECL_VARIABLE(VAR, float, 32, 4);\
-  DECL_VARIABLE(VAR, float, 64, 2);
+  DECL_VARIABLE(VAR, float, 64, 2);
 
-  DECL_VABD_VAR(vector1);
-  DECL_VABD_VAR(vector2);
-  DECL_VABD_VAR(vector3);
-  DECL_VABD_VAR(vector_res);
+  DECL_VFMA_N_VAR(vector1);
+  DECL_VFMA_N_VAR(vector2);
+  DECL_VFMA_N_VAR(vector3);
+  DECL_VFMA_N_VAR(vector_res);
 
   clean_results ();
 
@@ -50,20 +52,23 @@ void exec_vfma_n (void)
   VDUP(vector2, q, float, f, 64, 2, 15.8f);
   
   /* Choose init value arbitrarily.  */
-  ASSIGN(Scalar, , float, 32, 81.2f);
-  ASSIGN(Scalar, q, float, 32, 36.8f);
-  ASSIGN(Scalar, q, float, 64, 51.7f);
+  ASSIGN(scalar, , float, 32, 81.2f);
+  ASSIGN(scalar, q, float, 32, 36.8f);
+  ASSIGN(scalar, q, float, 64, 51.7f);
 
   /* Execute the tests.  */
-  TEST_VFMA(, float, f, 32, 2);
-  TEST_VFMA(q, float, f, 32, 4);
-  TEST_VFMA(q, float, f, 64, 2);
+  TEST_VFMA_N(, float, f, 32, 2);
+  TEST_VFMA_N(q, float, f, 32, 4);
+  TEST_VFMA_N(q, float, f, 64, 2);
 
-  CHECK_VFMA_RESULTS (TEST_MSG, );
+  CHECK_VFMA_N_RESULTS (TEST_MSG, );
 }
+#endif
 
 int main (void)
 {
+#if defined(__aarch64__) && defined(__ARM_FEATURE_FMA)
   exec_vfma_n ();
+#endif
   return 0;
 }
Index: gcc/testsuite/gcc.target/aarch64/advsimd-intrinsics/vfma.c

Re: [PATCH] [AArch64, NEON] Fix testcases add by r218484

2014-12-12 Thread Yangfei (Felix)
Thanks for reviewing the patch.  See my comments inlined:


This patch fix this two issues.  Three changes:
1. vfma_f32, vfmaq_f32, vfms_f32, vfmsq_f32 are only available for
 arm*-*-* target with the FMA feature, we take care of this through the macro
 __ARM_FEATURE_FMA.
2. vfma_n_f32 and vfmaq_n_f32 are only available for aarch64 target, we
 take care of this through the macro __aarch64__.
3. vfmaq_f64, vfmaq_n_f64 and vfmsq_f64 are only available for aarch64
 target, we just exclude test for them to keep the testcases clean. (Note: They
 also pass on aarch64  aarch64_be target and we can add test for them if
 needed).
 I would prefer to have all the available variants tested.

OK, the attached v2 patch has all the available variants added. 

  +#ifdef __aarch64__
   /* Expected results.  */
   VECT_VAR_DECL(expected,hfloat,32,2) [] = { 0x4438ca3d, 0x44390a3d };
   VECT_VAR_DECL(expected,hfloat,32,4) [] = { 0x44869eb8, 0x4486beb8,
  0x4486deb8, 0x4486feb8 };
  -VECT_VAR_DECL(expected,hfloat,64,2) [] = { 0x408906e1532b8520,
  0x40890ee1532b8520 };
 
 Why do you remove this one?

We need to make some changes to the header files for this test. 
Initially, I did not want to touch the header files, so I reduced this testcase 
to a minimal one. 

 
   int main (void)
   {
  +#ifdef __ARM_FEATURE_FMA
 exec_vfms ();
  +#endif
 return 0;
   }
 
 In the other tests, I try to put as much code in common as possible, between 
 the
 'a' and 's' variants (e.g. vmla/vmls). Maybe you can do that as a follow-up?

Yes, I think we can handle this with a follow-on patch.
The v2 patch is tested on armeb-linux-gnueabi, arm-linux-gnueabi, 
aarch64-linux-gnu and aarch64_be-linux-gnu.
How about this one?  Thanks.


Index: gcc/testsuite/gcc.target/aarch64/advsimd-intrinsics/compute-ref-data.h
===
--- gcc/testsuite/gcc.target/aarch64/advsimd-intrinsics/compute-ref-data.h  
(revision 218582)
+++ gcc/testsuite/gcc.target/aarch64/advsimd-intrinsics/compute-ref-data.h  
(working copy)
@@ -142,6 +142,10 @@ VECT_VAR_DECL_INIT(buffer, poly, 16, 8);
 PAD(buffer_pad, poly, 16, 8);
 VECT_VAR_DECL_INIT(buffer, float, 32, 4);
 PAD(buffer_pad, float, 32, 4);
+#ifdef __aarch64__
+VECT_VAR_DECL_INIT(buffer, float, 64, 2);
+PAD(buffer_pad, float, 64, 2);
+#endif
 
 /* The tests for vld1_dup and vdup expect at least 4 entries in the
input buffer, so force 1- and 2-elements initializers to have 4
Index: gcc/testsuite/gcc.target/aarch64/advsimd-intrinsics/vfma_n.c
===
--- gcc/testsuite/gcc.target/aarch64/advsimd-intrinsics/vfma_n.c
(revision 218582)
+++ gcc/testsuite/gcc.target/aarch64/advsimd-intrinsics/vfma_n.c
(working copy)
@@ -2,6 +2,7 @@
 #include arm-neon-ref.h
 #include compute-ref-data.h
 
+#if defined(__aarch64__) && defined(__ARM_FEATURE_FMA)
 /* Expected results.  */
 VECT_VAR_DECL(expected,hfloat,32,2) [] = { 0x4438ca3d, 0x44390a3d };
 VECT_VAR_DECL(expected,hfloat,32,4) [] = { 0x44869eb8, 0x4486beb8, 0x4486deb8, 
0x4486feb8 };
@@ -9,28 +10,29 @@ VECT_VAR_DECL(expected,hfloat,64,2) [] = { 0x40890
 
 #define VECT_VAR_ASSIGN(S,Q,T1,W) S##Q##_##T1##W
 #define ASSIGN(S, Q, T, W, V) T##W##_t S##Q##_##T##W = V
-#define TEST_MSG VFMA/VFMAQ
+#define TEST_MSG VFMA_N/VFMAQ_N
+
 void exec_vfma_n (void)
 {
   /* Basic test: v4=vfma_n(v1,v2), then store the result.  */
-#define TEST_VFMA(Q, T1, T2, W, N) \
+#define TEST_VFMA_N(Q, T1, T2, W, N)   \
   VECT_VAR(vector_res, T1, W, N) = \
 vfma##Q##_n_##T2##W(VECT_VAR(vector1, T1, W, N),   \
- VECT_VAR(vector2, T1, W, N),  \
- VECT_VAR_ASSIGN(Scalar, Q, T1, W));   
\
+   VECT_VAR(vector2, T1, W, N),\
+   VECT_VAR_ASSIGN(scalar, Q, T1, W)); \
   vst1##Q##_##T2##W(VECT_VAR(result, T1, W, N), VECT_VAR(vector_res, T1, W, N))
 
-#define CHECK_VFMA_RESULTS(test_name,comment)  \
+#define CHECK_VFMA_N_RESULTS(test_name,comment)
\
   {\
 CHECK_FP(test_name, float, 32, 2, PRIx32, expected, comment);  \
 CHECK_FP(test_name, float, 32, 4, PRIx32, expected, comment);  \
-   CHECK_FP(test_name, float, 64, 2, PRIx64, expected, comment);   \
-  }
+CHECK_FP(test_name, float, 64, 2, PRIx64, expected, comment);  \
+  }
 
 #define DECL_VABD_VAR(VAR) \
   DECL_VARIABLE(VAR, float, 32, 2);\
   DECL_VARIABLE(VAR, float, 32, 4);\
-  DECL_VARIABLE(VAR, float, 64, 2);
+  DECL_VARIABLE(VAR, float, 64, 2);
 
   DECL_VABD_VAR(vector1);
   DECL_VABD_VAR(vector2);
@@ -50,20 +52,23 

[PATCH] [AArch64, NEON] Add vfms_n_f32, vfmsq_n_f32 and vfmsq_n_f64 specified by the ACLE

2014-12-11 Thread Yangfei (Felix)
Hi, 

  This patch adds three intrinsics that are required by the ACLE specification. 
  A new testcase is added which covers vfms_n_f32 and vfmsq_n_f32.  Tested on 
both aarch64-linux-gnu and aarch64_be-linux-gnu. 
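
  For reference, a minimal usage sketch (illustrative only, separate from the 
patch and its testcase; the function name is made up):

#include <arm_neon.h>

/* Each lane computes acc[i] - x[i] * alpha with a fused multiply-subtract,
   so this is expected to map to a single FMLS instruction on an AArch64
   compiler that provides the vfms_n_f32 intrinsic.  */
float32x2_t
scaled_subtract (float32x2_t acc, float32x2_t x, float32_t alpha)
{
  return vfms_n_f32 (acc, x, alpha);
}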
  OK? 


Index: gcc/ChangeLog
===
--- gcc/ChangeLog   (revision 218582)
+++ gcc/ChangeLog   (working copy)
@@ -1,3 +1,8 @@
+2014-12-11  Felix Yang  felix.y...@huawei.com
+
+   * config/aarch64/arm_neon.h (vfms_n_f32, vfmsq_n_f32, vfmsq_n_f64): New
+   intrinsics.
+
 2014-12-10  Felix Yang  felix.y...@huawei.com
 
* config/aarch64/aarch64-protos.h (aarch64_function_profiler): Remove
Index: gcc/testsuite/gcc.target/aarch64/advsimd-intrinsics/vfms_n.c
===
--- gcc/testsuite/gcc.target/aarch64/advsimd-intrinsics/vfms_n.c
(revision 0)
+++ gcc/testsuite/gcc.target/aarch64/advsimd-intrinsics/vfms_n.c
(revision 0)
@@ -0,0 +1,67 @@
+#include arm_neon.h
+#include arm-neon-ref.h
+#include compute-ref-data.h
+
+#ifdef __aarch64__
+/* Expected results.  */
+VECT_VAR_DECL(expected,hfloat,32,2) [] = { 0x4438ca3d, 0x44390a3d };
+VECT_VAR_DECL(expected,hfloat,32,4) [] = { 0x44869eb8, 0x4486beb8, 0x4486deb8, 
0x4486feb8 };
+
+#define VECT_VAR_ASSIGN(S,Q,T1,W) S##Q##_##T1##W
+#define ASSIGN(S, Q, T, W, V) T##W##_t S##Q##_##T##W = V
+#define TEST_MSG VFMS_N/VFMSQ_N
+
+void exec_vfms_n (void)
+{
+  /* Basic test: v4=vfms_n(v1,v2), then store the result.  */
+#define TEST_VFMS(Q, T1, T2, W, N) \
+  VECT_VAR(vector_res, T1, W, N) = \
+vfms##Q##_n_##T2##W(VECT_VAR(vector1, T1, W, N),   \
+   VECT_VAR(vector2, T1, W, N),\
+   VECT_VAR_ASSIGN(scalar, Q, T1, W)); \
+  vst1##Q##_##T2##W(VECT_VAR(result, T1, W, N), VECT_VAR(vector_res, T1, W, N))
+
+#define CHECK_VFMS_RESULTS(test_name,comment)  \
+  {\
+CHECK_FP(test_name, float, 32, 2, PRIx32, expected, comment);  \
+CHECK_FP(test_name, float, 32, 4, PRIx32, expected, comment);  \
+  }
+
+#define DECL_VABD_VAR(VAR) \
+  DECL_VARIABLE(VAR, float, 32, 2);\
+  DECL_VARIABLE(VAR, float, 32, 4);\
+
+  DECL_VABD_VAR(vector1);
+  DECL_VABD_VAR(vector2);
+  DECL_VABD_VAR(vector3);
+  DECL_VABD_VAR(vector_res);
+
+  clean_results ();
+
+  /* Initialize input vector1 from buffer.  */
+  VLOAD(vector1, buffer, , float, f, 32, 2);
+  VLOAD(vector1, buffer, q, float, f, 32, 4);
+
+  /* Choose init value arbitrarily.  */
+  VDUP(vector2, , float, f, 32, 2, -9.3f);
+  VDUP(vector2, q, float, f, 32, 4, -29.7f);
+  
+  /* Choose init value arbitrarily.  */
+  ASSIGN(scalar, , float, 32, 81.2f);
+  ASSIGN(scalar, q, float, 32, 36.8f);
+
+  /* Execute the tests.  */
+  TEST_VFMS(, float, f, 32, 2);
+  TEST_VFMS(q, float, f, 32, 4);
+
+  CHECK_VFMS_RESULTS (TEST_MSG, );
+}
+#endif
+
+int main (void)
+{
+#ifdef __aarch64__
+  exec_vfms_n ();
+#endif
+  return 0;
+}
Index: gcc/testsuite/ChangeLog
===
--- gcc/testsuite/ChangeLog (revision 218582)
+++ gcc/testsuite/ChangeLog (working copy)
@@ -1,3 +1,7 @@
+2014-12-08  Felix Yang  felix.y...@huawei.com
+
+   * gcc.target/aarch64/advsimd-intrinsics/vfms_n.c: New test.
+
 2014-12-10  Martin Liska  mli...@suse.cz
 
* gcc.dg/ipa/pr63909.c: New test.
Index: gcc/config/aarch64/arm_neon.h
===
--- gcc/config/aarch64/arm_neon.h   (revision 218582)
+++ gcc/config/aarch64/arm_neon.h   (working copy)
@@ -15254,7 +15254,24 @@ vfmsq_f64 (float64x2_t __a, float64x2_t __b, float
   return __builtin_aarch64_fmav2df (-__b, __c, __a);
 }
 
+__extension__ static __inline float32x2_t __attribute__ ((__always_inline__))
+vfms_n_f32 (float32x2_t __a, float32x2_t __b, float32_t __c)
+{
+  return __builtin_aarch64_fmav2sf (-__b, vdup_n_f32 (__c), __a);
+}
 
+__extension__ static __inline float32x4_t __attribute__ ((__always_inline__))
+vfmsq_n_f32 (float32x4_t __a, float32x4_t __b, float32_t __c)
+{
+  return __builtin_aarch64_fmav4sf (-__b, vdupq_n_f32 (__c), __a);
+}
+
+__extension__ static __inline float64x2_t __attribute__ ((__always_inline__))
+vfmsq_n_f64 (float64x2_t __a, float64x2_t __b, float64_t __c)
+{
+  return __builtin_aarch64_fmav2df (-__b, vdupq_n_f64 (__c), __a);
+}
+
 /* vfms_lane  */
 
 __extension__ static __inline float32x2_t __attribute__ ((__always_inline__))


add-vfms_n-v1.diff
Description: add-vfms_n-v1.diff


Re: [PING ^ 3][PATCH, AArch64] Add doloop_end pattern for -fmodulo-sched

2014-12-10 Thread Yangfei (Felix)
  --- gcc/config/aarch64/aarch64.c(revision 217394)
  +++ gcc/config/aarch64/aarch64.c(working copy)
  @@ -10224,6 +10224,9 @@ aarch64_use_by_pieces_infrastructure_p
 (unsigned i
#define TARGET_USE_BY_PIECES_INFRASTRUCTURE_P \
  aarch64_use_by_pieces_infrastructure_p
 
  +#undef TARGET_CAN_USE_DOLOOP_P
  +#define TARGET_CAN_USE_DOLOOP_P can_use_doloop_if_innermost
  +
struct gcc_target targetm = TARGET_INITIALIZER;
 
#include gt-aarch64.h
 
 
 Hi Felix,
 
 This patch causes https://gcc.gnu.org/bugzilla/show_bug.cgi?id=64240
 when sms-3 is tested with -fPIC. It runs fine when I reverse this patch out.
 
 Please could you have a look?
 
 Thanks,
 Tejas.

OK, I have reproduced it with the -fPIC option.  Will take a look.  



Re: [COMMITTED] [PING] [PATCH] [AArch64, NEON] More NEON intrinsics improvement

2014-12-10 Thread Yangfei (Felix)
   +__extension__ static __inline float32x2_t __attribute__
   +((__always_inline__))
   +vfms_f32 (float32x2_t __a, float32x2_t __b, float32x2_t __c) {
   +  return __builtin_aarch64_fmav2sf (-__b, __c, __a); }
   +
   +__extension__ static __inline float32x4_t __attribute__
   +((__always_inline__))
   +vfmsq_f32 (float32x4_t __a, float32x4_t __b, float32x4_t __c) {
   +  return __builtin_aarch64_fmav4sf (-__b, __c, __a); }
   +
   +__extension__ static __inline float64x2_t __attribute__
   +((__always_inline__))
   +vfmsq_f64 (float64x2_t __a, float64x2_t __b, float64x2_t __c) {
   +  return __builtin_aarch64_fmav2df (-__b, __c, __a); }
   +
   +
  
  
   Thanks, the patch looks good. Just one comment:
   You could also add
   float32x2_t vfms_n_f32(float32x2_t a, float32x2_t b, float32_t n)
   and its Q-variant.
 
  You can, if you wish,  deal with Tejas' comment with a follow on patch
  rather than re-spinning this one.   Provided this patch has no
  regressions on a big endian and a little endian test run then you can 
  commit it.
  Thanks
  /Marcus
 
 
  No regressions for aarch64_be-linux-gnu target.  Committed as r218484.
  Will come up with a new patch to deal with Tejas' comment.  Thanks.
 
 My validations of trunk show that your new tests are incorrect: none of them
 compiles because the hfloat64_t type isn't defined.
 
 Also, keep in mind that the tests in this directory are executed by the 
 aarch32
 target too.
 
 Christophe


It seems that some code for the newly added testcases went missing when the 
patch was generated.  
I will fix this soon.  Thanks for pointing this out.  


[PATCH] [AArch64, NEON] Fix testcases add by r218484

2014-12-10 Thread Yangfei (Felix)
Hi,

  We found that the committed patch was not correctly generated from our local 
branch.  This caused some code necessary for the testcases to go missing. 
  As pointed out by Christophe in 
https://gcc.gnu.org/ml/gcc-patches/2014-12/msg00778.html, we need to rework the 
testcases so that they also work for the AArch32 target. 

  This patch fixes these two issues.  Three changes: 
  1. vfma_f32, vfmaq_f32, vfms_f32, vfmsq_f32 are only available for arm*-*-* 
targets with the FMA feature; we take care of this through the macro 
__ARM_FEATURE_FMA. 
  2. vfma_n_f32 and vfmaq_n_f32 are only available for the aarch64 target; we 
take care of this through the macro __aarch64__. 
  3. vfmaq_f64, vfmaq_n_f64 and vfmsq_f64 are only available for the aarch64 
target; we just exclude tests for them to keep the testcases clean.  (Note: 
they also pass on the aarch64 and aarch64_be targets and we can add tests for 
them if needed.) 
  
  Tested on armeb-linux-gnueabi, arm-linux-gnueabi, aarch64-linux-gnu and 
aarch64_be-linux-gnu.  OK for the trunk? 
  Sorry if this caused you any trouble; we will be more careful in our 
future work. 


Index: gcc/testsuite/gcc.target/aarch64/advsimd-intrinsics/vfma_n.c
===
--- gcc/testsuite/gcc.target/aarch64/advsimd-intrinsics/vfma_n.c
(revision 218582)
+++ gcc/testsuite/gcc.target/aarch64/advsimd-intrinsics/vfma_n.c
(working copy)
@@ -2,35 +2,34 @@
 #include arm-neon-ref.h
 #include compute-ref-data.h
 
+#ifdef __aarch64__
 /* Expected results.  */
 VECT_VAR_DECL(expected,hfloat,32,2) [] = { 0x4438ca3d, 0x44390a3d };
 VECT_VAR_DECL(expected,hfloat,32,4) [] = { 0x44869eb8, 0x4486beb8, 0x4486deb8, 
0x4486feb8 };
-VECT_VAR_DECL(expected,hfloat,64,2) [] = { 0x408906e1532b8520, 
0x40890ee1532b8520 };
 
 #define VECT_VAR_ASSIGN(S,Q,T1,W) S##Q##_##T1##W
 #define ASSIGN(S, Q, T, W, V) T##W##_t S##Q##_##T##W = V
-#define TEST_MSG VFMA/VFMAQ
+#define TEST_MSG VFMA_N/VFMAQ_N
+
 void exec_vfma_n (void)
 {
   /* Basic test: v4=vfma_n(v1,v2), then store the result.  */
 #define TEST_VFMA(Q, T1, T2, W, N) \
   VECT_VAR(vector_res, T1, W, N) = \
 vfma##Q##_n_##T2##W(VECT_VAR(vector1, T1, W, N),   \
- VECT_VAR(vector2, T1, W, N),  \
- VECT_VAR_ASSIGN(Scalar, Q, T1, W));   
\
+   VECT_VAR(vector2, T1, W, N),\
+   VECT_VAR_ASSIGN(scalar, Q, T1, W)); \
   vst1##Q##_##T2##W(VECT_VAR(result, T1, W, N), VECT_VAR(vector_res, T1, W, N))
 
 #define CHECK_VFMA_RESULTS(test_name,comment)  \
   {\
 CHECK_FP(test_name, float, 32, 2, PRIx32, expected, comment);  \
 CHECK_FP(test_name, float, 32, 4, PRIx32, expected, comment);  \
-   CHECK_FP(test_name, float, 64, 2, PRIx64, expected, comment);   \
-  }
+  }
 
 #define DECL_VABD_VAR(VAR) \
   DECL_VARIABLE(VAR, float, 32, 2);\
-  DECL_VARIABLE(VAR, float, 32, 4);\
-  DECL_VARIABLE(VAR, float, 64, 2);
+  DECL_VARIABLE(VAR, float, 32, 4);
 
   DECL_VABD_VAR(vector1);
   DECL_VABD_VAR(vector2);
@@ -42,28 +41,27 @@ void exec_vfma_n (void)
   /* Initialize input vector1 from buffer.  */
   VLOAD(vector1, buffer, , float, f, 32, 2);
   VLOAD(vector1, buffer, q, float, f, 32, 4);
-  VLOAD(vector1, buffer, q, float, f, 64, 2);
 
   /* Choose init value arbitrarily.  */
   VDUP(vector2, , float, f, 32, 2, 9.3f);
   VDUP(vector2, q, float, f, 32, 4, 29.7f);
-  VDUP(vector2, q, float, f, 64, 2, 15.8f);
   
   /* Choose init value arbitrarily.  */
-  ASSIGN(Scalar, , float, 32, 81.2f);
-  ASSIGN(Scalar, q, float, 32, 36.8f);
-  ASSIGN(Scalar, q, float, 64, 51.7f);
+  ASSIGN(scalar, , float, 32, 81.2f);
+  ASSIGN(scalar, q, float, 32, 36.8f);
 
   /* Execute the tests.  */
   TEST_VFMA(, float, f, 32, 2);
   TEST_VFMA(q, float, f, 32, 4);
-  TEST_VFMA(q, float, f, 64, 2);
 
   CHECK_VFMA_RESULTS (TEST_MSG, );
 }
+#endif
 
 int main (void)
 {
+#ifdef __aarch64__
   exec_vfma_n ();
+#endif
   return 0;
 }
Index: gcc/testsuite/gcc.target/aarch64/advsimd-intrinsics/vfma.c
===
--- gcc/testsuite/gcc.target/aarch64/advsimd-intrinsics/vfma.c  (revision 
218582)
+++ gcc/testsuite/gcc.target/aarch64/advsimd-intrinsics/vfma.c  (working copy)
@@ -2,12 +2,13 @@
 #include arm-neon-ref.h
 #include compute-ref-data.h
 
+#ifdef __ARM_FEATURE_FMA
 /* Expected results.  */
 VECT_VAR_DECL(expected,hfloat,32,2) [] = { 0x4438ca3d, 0x44390a3d };
 VECT_VAR_DECL(expected,hfloat,32,4) [] = { 0x44869eb8, 0x4486beb8, 0x4486deb8, 
0x4486feb8 };
-VECT_VAR_DECL(expected,hfloat,64,2) [] = { 0x408906e1532b8520, 
0x40890ee1532b8520 };
 
 #define TEST_MSG VFMA/VFMAQ
+
 void exec_vfma 

Re: [PATCH] [AArch64, NEON] Improve vpmaxX vpminX intrinsics

2014-12-09 Thread Yangfei (Felix)
 On 28 November 2014 at 09:23, Yangfei (Felix) felix.y...@huawei.com wrote:
  Hi,
This patch converts vpmaxX  vpminX intrinsics to use builtin functions
 instead of the previous inline assembly syntax.
Regtested with aarch64-linux-gnu on QEMU.  Also passed the glorious
 testsuite of Christophe Lyon.
OK for the trunk?
 
 Hi Felix,   We know from experience that the advsimd intrinsics tend
 to be fragile for big endian and in general it is fairly easy to break the 
 big endian
 case.  For these advsimd improvements that you are working on (that we very
 much appreciate) it is important to run both little endian and big endian
 regressions.
 
 Thanks
 /Marcus


Okay.  Any plan for the advsimd big-endian improvement? 
I rebased this patch over Alan Lawrence's patch: 
https://gcc.gnu.org/ml/gcc-patches/2014-12/msg00279.html 
No regressions for the aarch64_be-linux-gnu target either.  OK for the trunk? 


Index: gcc/ChangeLog
===
--- gcc/ChangeLog   (revision 218464)
+++ gcc/ChangeLog   (working copy)
@@ -1,3 +1,18 @@
+2014-12-09  Felix Yang  felix.y...@huawei.com
+
+   * config/aarch64/aarch64-simd.md (aarch64_maxmin_unspmode): New
+   pattern.
+   * config/aarch64/aarch64-simd-builtins.def (smaxp, sminp, umaxp,
+   uminp, smax_nanp, smin_nanp): New builtins.
+   * config/aarch64/arm_neon.h (vpmax_s8, vpmax_s16, vpmax_s32,
+   vpmax_u8, vpmax_u16, vpmax_u32, vpmaxq_s8, vpmaxq_s16, vpmaxq_s32,
+   vpmaxq_u8, vpmaxq_u16, vpmaxq_u32, vpmax_f32, vpmaxq_f32, vpmaxq_f64,
+   vpmaxqd_f64, vpmaxs_f32, vpmaxnm_f32, vpmaxnmq_f32, vpmaxnmq_f64,
+   vpmaxnmqd_f64, vpmaxnms_f32, vpmin_s8, vpmin_s16, vpmin_s32, vpmin_u8,
+   vpmin_u16, vpmin_u32, vpminq_s8, vpminq_s16, vpminq_s32, vpminq_u8,
+   vpminq_u16, vpminq_u32, vpmin_f32, vpminq_f32, vpminq_f64, vpminqd_f64,
+   vpmins_f32, vpminnm_f32, vpminnmq_f32, vpminnmq_f64, vpminnmqd_f64,
+
 2014-12-07  Felix Yang  felix.y...@huawei.com
Shanyao Chen  chenshan...@huawei.com
 
Index: gcc/config/aarch64/arm_neon.h
===
--- gcc/config/aarch64/arm_neon.h   (revision 218464)
+++ gcc/config/aarch64/arm_neon.h   (working copy)
@@ -8843,491 +8843,7 @@ vpadds_f32 (float32x2_t a)
   return result;
 }
 
-__extension__ static __inline float32x2_t __attribute__ ((__always_inline__))
-vpmax_f32 (float32x2_t a, float32x2_t b)
-{
-  float32x2_t result;
-  __asm__ (fmaxp %0.2s, %1.2s, %2.2s
-   : =w(result)
-   : w(a), w(b)
-   : /* No clobbers */);
-  return result;
-}
-
-__extension__ static __inline int8x8_t __attribute__ ((__always_inline__))
-vpmax_s8 (int8x8_t a, int8x8_t b)
-{
-  int8x8_t result;
-  __asm__ (smaxp %0.8b, %1.8b, %2.8b
-   : =w(result)
-   : w(a), w(b)
-   : /* No clobbers */);
-  return result;
-}
-
 __extension__ static __inline int16x4_t __attribute__ ((__always_inline__))
-vpmax_s16 (int16x4_t a, int16x4_t b)
-{
-  int16x4_t result;
-  __asm__ (smaxp %0.4h, %1.4h, %2.4h
-   : =w(result)
-   : w(a), w(b)
-   : /* No clobbers */);
-  return result;
-}
-
-__extension__ static __inline int32x2_t __attribute__ ((__always_inline__))
-vpmax_s32 (int32x2_t a, int32x2_t b)
-{
-  int32x2_t result;
-  __asm__ (smaxp %0.2s, %1.2s, %2.2s
-   : =w(result)
-   : w(a), w(b)
-   : /* No clobbers */);
-  return result;
-}
-
-__extension__ static __inline uint8x8_t __attribute__ ((__always_inline__))
-vpmax_u8 (uint8x8_t a, uint8x8_t b)
-{
-  uint8x8_t result;
-  __asm__ (umaxp %0.8b, %1.8b, %2.8b
-   : =w(result)
-   : w(a), w(b)
-   : /* No clobbers */);
-  return result;
-}
-
-__extension__ static __inline uint16x4_t __attribute__ ((__always_inline__))
-vpmax_u16 (uint16x4_t a, uint16x4_t b)
-{
-  uint16x4_t result;
-  __asm__ (umaxp %0.4h, %1.4h, %2.4h
-   : =w(result)
-   : w(a), w(b)
-   : /* No clobbers */);
-  return result;
-}
-
-__extension__ static __inline uint32x2_t __attribute__ ((__always_inline__))
-vpmax_u32 (uint32x2_t a, uint32x2_t b)
-{
-  uint32x2_t result;
-  __asm__ (umaxp %0.2s, %1.2s, %2.2s
-   : =w(result)
-   : w(a), w(b)
-   : /* No clobbers */);
-  return result;
-}
-
-__extension__ static __inline float32x2_t __attribute__ ((__always_inline__))
-vpmaxnm_f32 (float32x2_t a, float32x2_t b)
-{
-  float32x2_t result;
-  __asm__ (fmaxnmp %0.2s,%1.2s,%2.2s
-   : =w(result)
-   : w(a), w(b)
-   : /* No clobbers */);
-  return result;
-}
-
-__extension__ static __inline float32x4_t __attribute__ ((__always_inline__))
-vpmaxnmq_f32 (float32x4_t a, float32x4_t b)
-{
-  float32x4_t result;
-  __asm__ (fmaxnmp %0.4s,%1.4s,%2.4s
-   : =w(result)
-   : w(a), w(b)
-   : /* No clobbers */);
-  return result;
-}
-
-__extension__ static __inline float64x2_t

Re: [PATCH] [AArch64, NEON] Improve vpmaxX vpminX intrinsics

2014-12-09 Thread Yangfei (Felix)
 You'll need to rebase over Alan Lawrance's patch.
 https://gcc.gnu.org/ml/gcc-patches/2014-12/msg00279.html

Yes, see my new patch: https://gcc.gnu.org/ml/gcc-patches/2014-12/msg00750.html 

  +;; Pairwise Integer Max/Min operations.
  +(define_insn "aarch64_<maxmin_uns>p<mode>"
  + [(set (match_operand:VQ_S 0 "register_operand" "=w")
  +   (unspec:VQ_S [(match_operand:VQ_S 1 "register_operand" "w")
  +(match_operand:VQ_S 2 "register_operand" "w")]
  +   MAXMINV))]
  + "TARGET_SIMD"
  + "<maxmin_uns_op>p\t%0.<Vtype>, %1.<Vtype>, %2.<Vtype>"
  +  [(set_attr "type" "neon_minmax<q>")]
  +)
  +
 
 Could you roll aarch64_reduc_maxmin_uns_internalv2si into this pattern?

Will come up with another patch to fix this.  Thanks for pointing this out.

 
 Thanks,
 Tejas.



[PING ^ 3] [RFC PATCH, AARCH64] Add support for -mlong-calls option

2014-12-09 Thread Yangfei (Felix)
Hi, 
  This is a ping for: 
https://gcc.gnu.org/ml/gcc-patches/2014-11/msg02258.html 

Thanks.


[PATCH, trivial] [AArch64] Remove declaration of removed function from aarch64-protos.h

2014-12-09 Thread Yangfei (Felix)
The definition of the function aarch64_function_profiler was removed in GCC 4.9, 
but its declaration is still there in aarch64-protos.h, so remove it. 
OK for the trunk? 


Index: gcc/ChangeLog
===
--- gcc/ChangeLog   (revision 218464)
+++ gcc/ChangeLog   (working copy)
@@ -1,3 +1,8 @@
+2014-12-09  Felix Yang  felix.y...@huawei.com
+
+   * config/aarch64/aarch64-protos.h (aarch64_function_profiler): Remove
+   declaration of removed function.
+
 2014-12-07  Felix Yang  felix.y...@huawei.com
Shanyao Chen  chenshan...@huawei.com
 
Index: gcc/config/aarch64/aarch64-protos.h
===
--- gcc/config/aarch64/aarch64-protos.h (revision 218464)
+++ gcc/config/aarch64/aarch64-protos.h (working copy)
@@ -247,7 +247,6 @@ void aarch64_expand_epilogue (bool);
 void aarch64_expand_mov_immediate (rtx, rtx);
 void aarch64_expand_prologue (void);
 void aarch64_expand_vector_init (rtx, rtx);
-void aarch64_function_profiler (FILE *, int);
 void aarch64_init_cumulative_args (CUMULATIVE_ARGS *, const_tree, rtx,
   const_tree, unsigned);
 void aarch64_init_expanders (void);


[COMMITTED] [PING] [PATCH] [AArch64, NEON] More NEON intrinsics improvement

2014-12-08 Thread Yangfei (Felix)
 On 5 December 2014 at 18:44, Tejas Belagod tejas.bela...@arm.com wrote:
 
 
  +__extension__ static __inline float32x2_t __attribute__
  +((__always_inline__))
  +vfms_f32 (float32x2_t __a, float32x2_t __b, float32x2_t __c) {
  +  return __builtin_aarch64_fmav2sf (-__b, __c, __a); }
  +
  +__extension__ static __inline float32x4_t __attribute__
  +((__always_inline__))
  +vfmsq_f32 (float32x4_t __a, float32x4_t __b, float32x4_t __c) {
  +  return __builtin_aarch64_fmav4sf (-__b, __c, __a); }
  +
  +__extension__ static __inline float64x2_t __attribute__
  +((__always_inline__))
  +vfmsq_f64 (float64x2_t __a, float64x2_t __b, float64x2_t __c) {
  +  return __builtin_aarch64_fmav2df (-__b, __c, __a); }
  +
  +
 
 
  Thanks, the patch looks good. Just one comment:
  You could also add
  float32x2_t vfms_n_f32(float32x2_t a, float32x2_t b, float32_t n) and
  its Q-variant.
 
 You can, if you wish,  deal with Tejas' comment with a follow on patch
 rather than re-spinning this one.   Provided this patch has no
 regressions on a big endian and a little endian test run then you can commit 
 it.
 Thanks
 /Marcus


No regressions for aarch64_be-linux-gnu target.  Committed as r218484.  
Will come up with a new patch to deal with Tejas' comment.  Thanks.  
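
For reference, a minimal sketch of what that follow-up could look like, written in 
the same builtin-based style as the committed vfms_f32/vfmsq_f32 definitions (the 
use of vdup_n_f32/vdupq_n_f32 to splat the scalar operand is my assumption here, 
not code from any posted patch; the sketch is meant to live inside arm_neon.h): 

__extension__ static __inline float32x2_t __attribute__ ((__always_inline__))
vfms_n_f32 (float32x2_t __a, float32x2_t __b, float32_t __c)
{
  /* a - b * {c, c}, through the same fma builtin used by vfms_f32.  */
  return __builtin_aarch64_fmav2sf (-__b, vdup_n_f32 (__c), __a);
}

__extension__ static __inline float32x4_t __attribute__ ((__always_inline__))
vfmsq_n_f32 (float32x4_t __a, float32x4_t __b, float32_t __c)
{
  return __builtin_aarch64_fmav4sf (-__b, vdupq_n_f32 (__c), __a);
}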


[PING] [PATCH] [AArch64, NEON] More NEON intrinsics improvement

2014-12-03 Thread Yangfei (Felix)
Any comments?  Thanks.  


 Hi,
  This patch converts more intrinsics to use builtin functions instead of 
 the
 previous inline assembly syntax.
  Passed the glorious testsuite of Christophe Lyon.
 
  Three testcases are added for the testing of intriniscs which are not
 covered by the testsuite:
  gcc.target/aarch64/vfma.c
  gcc.target/aarch64/vfma_n.c
  gcc.target/aarch64/vfms.c
 
  Regtested with aarch64-linux-gnu on QEMU.  OK for the trunk?
 
 
 Index: gcc/ChangeLog
 =
 ==
 --- gcc/ChangeLog (revision 217394)
 +++ gcc/ChangeLog (working copy)
 @@ -1,3 +1,26 @@
 +2014-11-18  Felix Yang  felix.y...@huawei.com
 + Haijian Zhang  z.zhanghaij...@huawei.com
 + Jiji Jiang  jiangj...@huawei.com
 + Pengfei Sui  suipeng...@huawei.com
 +
 + * config/aarch64/arm_neon.h (vrecpe_u32, vrecpeq_u32): Rewrite using
 + builtin functions.
 + (vfma_f32, vfmaq_f32, vfmaq_f64, vfma_n_f32, vfmaq_n_f32,
 vfmaq_n_f64,
 + vfms_f32, vfmsq_f32, vfmsq_f64): Likewise.
 + (vhsub_s8, vhsub_u8, vhsub_s16, vhsub_u16, vhsub_s32, vhsub_u32,
 + vhsubq_s8, vhsubq_u8, vhsubq_s16, vhsubq_u16, vhsubq_s32,
 vhsubq_u32,
 + vsubhn_s16, vsubhn_u16, vsubhn_s32, vsubhn_u32, vsubhn_s64,
 vsubhn_u66,
 + vrsubhn_s16, vrsubhn_u16, vrsubhn_s32, vrsubhn_u32, vrsubhn_s64,
 + vrsubhn_u64, vsubhn_high_s16, vsubhn_high_u16, vsubhn_high_s32,
 + vsubhn_high_u32, vsubhn_high_s64, vsubhn_high_u64, vrsubhn_high_s16,
 + vrsubhn_high_u16, vrsubhn_high_s32, vrsubhn_high_u32,
 vrsubhn_high_s64,
 + vrsubhn_high_u64): Likewise.
 + * config/aarch64/iterators.md (VDQ_SI): New mode iterator.
 + * config/aarch64/aarch64.md (define_c_enum unspec): Add
 UNSPEC_URECPE.
 + * config/aarch64/aarch64-simd.md (aarch64_urecpemode): New
 pattern.
 + * config/aarch64/aarch64-simd-builtins.def (shsub, uhsub, subhn, rsubhn,
 + subhn2, rsubhn2, urecpe): New builtins.
 +
  2014-11-11  Andrew Pinski  apin...@cavium.com
 
   Bug target/61997
 Index: gcc/testsuite/gcc.target/aarch64/narrow_high-intrinsics.c
 =
 ==
 --- gcc/testsuite/gcc.target/aarch64/narrow_high-intrinsics.c (revision 
 217394)
 +++ gcc/testsuite/gcc.target/aarch64/narrow_high-intrinsics.c (working copy)
 @@ -107,9 +107,9 @@ ONE (vmovn_high, uint16x8_t, uint16x4_t, uint32x4_
 ONE (vmovn_high, uint32x4_t, uint32x2_t, uint64x2_t, u64)
 
 
 -/* { dg-final { scan-assembler-times \\tsubhn2 v 6} }  */
 +/* { dg-final { scan-assembler-times \\tsubhn2\\tv 6} }  */
  /* { dg-final { scan-assembler-times \\taddhn2\\tv 6} }  */
 -/* { dg-final { scan-assembler-times rsubhn2 v 6} }  */
 +/* { dg-final { scan-assembler-times rsubhn2\\tv 6} }  */
  /* { dg-final { scan-assembler-times raddhn2\\tv 6} }  */
  /* { dg-final { scan-assembler-times \\trshrn2 v 6} }  */
  /* { dg-final { scan-assembler-times \\tshrn2 v 6} }  */
 Index: gcc/testsuite/gcc.target/aarch64/advsimd-intrinsics/vfma_n.c
 =
 ==
 --- gcc/testsuite/gcc.target/aarch64/advsimd-intrinsics/vfma_n.c  
 (revision 0)
 +++ gcc/testsuite/gcc.target/aarch64/advsimd-intrinsics/vfma_n.c  
 (revision
 0)
 @@ -0,0 +1,69 @@
 +#include arm_neon.h
 +#include arm-neon-ref.h
 +#include compute-ref-data.h
 +
 +/* Expected results.  */
 +VECT_VAR_DECL(expected,hfloat,32,2) [] = { 0x4438ca3d, 0x44390a3d };
 +VECT_VAR_DECL(expected,hfloat,32,4) [] = { 0x44869eb8, 0x4486beb8,
 +0x4486deb8, 0x4486feb8 };
 +VECT_VAR_DECL(expected,hfloat,64,2) [] = { 0x408906e1532b8520,
 +0x40890ee1532b8520 };
 +
 +#define VECT_VAR_ASSIGN(S,Q,T1,W) S##Q##_##T1##W #define ASSIGN(S,
 Q,
 +T, W, V) T##W##_t S##Q##_##T##W = V #define TEST_MSG VFMA/VFMAQ
 +void exec_vfma_n (void)
 +{
 +  /* Basic test: v4=vfma_n(v1,v2), then store the result.  */
 +#define TEST_VFMA(Q, T1, T2, W, N)   \
 +  VECT_VAR(vector_res, T1, W, N) =   \
 +vfma##Q##_n_##T2##W(VECT_VAR(vector1, T1, W, N), \
 +   VECT_VAR(vector2, T1, W, N),  \
 +   VECT_VAR_ASSIGN(Scalar, Q, T1, W));   
 \
 +  vst1##Q##_##T2##W(VECT_VAR(result, T1, W, N), VECT_VAR(vector_res,
 +T1, W, N))
 +
 +#define CHECK_VFMA_RESULTS(test_name,comment)
 \
 +  {  \
 +CHECK_FP(test_name, float, 32, 2, PRIx32, expected, comment);\
 +CHECK_FP(test_name, float, 32, 4, PRIx32, expected, comment);\
 + CHECK_FP(test_name, float, 64, 2, PRIx64, expected, comment);   \
 +  }
 +
 +#define DECL_VABD_VAR(VAR)   \
 +  DECL_VARIABLE(VAR, float, 32, 2);  \
 +  DECL_VARIABLE(VAR, float, 32, 4);  \
 +  DECL_VARIABLE(VAR, float, 64, 2);
 +
 +  DECL_VABD_VAR(vector1);
 +  

[PATCH] [AArch64, NEON] Improve vpmaxX vpminX intrinsics

2014-11-28 Thread Yangfei (Felix)
Hi, 
  This patch converts the vpmaxX & vpminX intrinsics to use builtin functions 
instead of the previous inline assembly syntax. 
  Regtested with aarch64-linux-gnu on QEMU.  Also passed the glorious testsuite 
of Christophe Lyon. 
  OK for the trunk? 


Index: gcc/ChangeLog
===
--- gcc/ChangeLog   (revision 218128)
+++ gcc/ChangeLog   (working copy)
@@ -1,3 +1,19 @@
+2014-11-28  Felix Yang  felix.y...@huawei.com
+
+   * config/aarch64/aarch64-simd.md (aarch64_maxmin_unspmode): New
+   pattern.
+   * config/aarch64/aarch64-simd-builtins.def (smaxp, sminp, umaxp,
+   uminp, smax_nanp, smin_nanp): New builtins.
+   * config/aarch64/arm_neon.h (vpmax_s8, vpmax_s16, vpmax_s32,
+   vpmax_u8, vpmax_u16, vpmax_u32, vpmaxq_s8, vpmaxq_s16, vpmaxq_s32,
+   vpmaxq_u8, vpmaxq_u16, vpmaxq_u32, vpmax_f32, vpmaxq_f32, vpmaxq_f64,
+   vpmaxqd_f64, vpmaxs_f32, vpmaxnm_f32, vpmaxnmq_f32, vpmaxnmq_f64,
+   vpmaxnmqd_f64, vpmaxnms_f32, vpmin_s8, vpmin_s16, vpmin_s32, vpmin_u8,
+   vpmin_u16, vpmin_u32, vpminq_s8, vpminq_s16, vpminq_s32, vpminq_u8,
+   vpminq_u16, vpminq_u32, vpmin_f32, vpminq_f32, vpminq_f64, vpminqd_f64,
+   vpmins_f32, vpminnm_f32, vpminnmq_f32, vpminnmq_f64, vpminnmqd_f64,
+   vpminnms_f32): Rewrite using builtin functions.
+
 2014-11-27  Richard Biener  rguent...@suse.de
 
PR middle-end/64088
Index: gcc/config/aarch64/arm_neon.h
===
--- gcc/config/aarch64/arm_neon.h   (revision 218128)
+++ gcc/config/aarch64/arm_neon.h   (working copy)
@@ -8975,491 +8975,7 @@ vpadds_f32 (float32x2_t a)
   return result;
 }
 
-__extension__ static __inline float32x2_t __attribute__ ((__always_inline__))
-vpmax_f32 (float32x2_t a, float32x2_t b)
-{
-  float32x2_t result;
-  __asm__ (fmaxp %0.2s, %1.2s, %2.2s
-   : =w(result)
-   : w(a), w(b)
-   : /* No clobbers */);
-  return result;
-}
-
-__extension__ static __inline int8x8_t __attribute__ ((__always_inline__))
-vpmax_s8 (int8x8_t a, int8x8_t b)
-{
-  int8x8_t result;
-  __asm__ (smaxp %0.8b, %1.8b, %2.8b
-   : =w(result)
-   : w(a), w(b)
-   : /* No clobbers */);
-  return result;
-}
-
 __extension__ static __inline int16x4_t __attribute__ ((__always_inline__))
-vpmax_s16 (int16x4_t a, int16x4_t b)
-{
-  int16x4_t result;
-  __asm__ (smaxp %0.4h, %1.4h, %2.4h
-   : =w(result)
-   : w(a), w(b)
-   : /* No clobbers */);
-  return result;
-}
-
-__extension__ static __inline int32x2_t __attribute__ ((__always_inline__))
-vpmax_s32 (int32x2_t a, int32x2_t b)
-{
-  int32x2_t result;
-  __asm__ (smaxp %0.2s, %1.2s, %2.2s
-   : =w(result)
-   : w(a), w(b)
-   : /* No clobbers */);
-  return result;
-}
-
-__extension__ static __inline uint8x8_t __attribute__ ((__always_inline__))
-vpmax_u8 (uint8x8_t a, uint8x8_t b)
-{
-  uint8x8_t result;
-  __asm__ (umaxp %0.8b, %1.8b, %2.8b
-   : =w(result)
-   : w(a), w(b)
-   : /* No clobbers */);
-  return result;
-}
-
-__extension__ static __inline uint16x4_t __attribute__ ((__always_inline__))
-vpmax_u16 (uint16x4_t a, uint16x4_t b)
-{
-  uint16x4_t result;
-  __asm__ (umaxp %0.4h, %1.4h, %2.4h
-   : =w(result)
-   : w(a), w(b)
-   : /* No clobbers */);
-  return result;
-}
-
-__extension__ static __inline uint32x2_t __attribute__ ((__always_inline__))
-vpmax_u32 (uint32x2_t a, uint32x2_t b)
-{
-  uint32x2_t result;
-  __asm__ (umaxp %0.2s, %1.2s, %2.2s
-   : =w(result)
-   : w(a), w(b)
-   : /* No clobbers */);
-  return result;
-}
-
-__extension__ static __inline float32x2_t __attribute__ ((__always_inline__))
-vpmaxnm_f32 (float32x2_t a, float32x2_t b)
-{
-  float32x2_t result;
-  __asm__ (fmaxnmp %0.2s,%1.2s,%2.2s
-   : =w(result)
-   : w(a), w(b)
-   : /* No clobbers */);
-  return result;
-}
-
-__extension__ static __inline float32x4_t __attribute__ ((__always_inline__))
-vpmaxnmq_f32 (float32x4_t a, float32x4_t b)
-{
-  float32x4_t result;
-  __asm__ (fmaxnmp %0.4s,%1.4s,%2.4s
-   : =w(result)
-   : w(a), w(b)
-   : /* No clobbers */);
-  return result;
-}
-
-__extension__ static __inline float64x2_t __attribute__ ((__always_inline__))
-vpmaxnmq_f64 (float64x2_t a, float64x2_t b)
-{
-  float64x2_t result;
-  __asm__ (fmaxnmp %0.2d,%1.2d,%2.2d
-   : =w(result)
-   : w(a), w(b)
-   : /* No clobbers */);
-  return result;
-}
-
-__extension__ static __inline float64_t __attribute__ ((__always_inline__))
-vpmaxnmqd_f64 (float64x2_t a)
-{
-  float64_t result;
-  __asm__ (fmaxnmp %d0,%1.2d
-   : =w(result)
-   : w(a)
-   : /* No clobbers */);
-  return result;
-}
-
-__extension__ static __inline float32_t __attribute__ ((__always_inline__))
-vpmaxnms_f32 (float32x2_t a)
-{
-  

Re: [PATCH PR59593] [arm] Backport r217772 r217826 to 4.8 4.9

2014-11-28 Thread Yangfei (Felix)
 I've backported this fix to 4.8  4.9 branch.
 These patches have been tested for armeb-none-eabi-gcc/g++ with qemu, and
 both the test results were ok.

Looks OK to me.  Ramana, is this OK for the 4.8 and 4.9 branches?  Thanks. 



Re: [PATCH, PR63742][ARM] Fix arm *movhi_insn_arch4 pattern for big-endian

2014-11-20 Thread Yangfei (Felix)
 On 19/11/14 09:29, Yangfei (Felix) wrote:
  Sorry for missing the point.  It seems to me that 't2' here will
  conflict with
  condition of the pattern *movhi_insn_arch4:
  TARGET_ARM
arm_arch4
(register_operand (operands[0], HImode)
   || register_operand (operands[1], HImode))
 
  #define TARGET_ARM  (! TARGET_THUMB)
  /* 32-bit Thumb-2 code.  */
  #define TARGET_THUMB2   (TARGET_THUMB 
  arm_arch_thumb2)
 
 
  Bah, Indeed ! - I misremembered the t2 there, my mistake.
 
  Yes you are right there, but what I'd like you to do is to use that
  mechanism rather than putting all this logic in the predicate.
 
  So, I'd prefer you to add a v6t2 to the values for the arch
  attribute, don't forget to update the comments above.
 
  and in arch_enabled you need to enforce this with
 
 (and (eq_attr arch v6t2)
  (match_test TARGET_32BIT  arm_arch6 
 arm_arch_thumb2))
  (const_string yes)
 
  And in the pattern use v6t2 ...
 
  arm_arch_thumb2 implies that this is at the architecture level of v6t2.
  Therefore TARGET_ARM  arm_arch_thumb2 implies ARM state.
 
 
  Hi Ramana,
   Thank you for your suggestions.  I rebased the patch on the latest 
  trunk
 and updated it accordingly.
   As this patch will not work for architectures older than armv6t2,  I 
  also
 prefer Thomas's patch to fix for them.
   I am currently performing test for this patch.  Assuming no issues pops
 up, OK for the trunk?
   And is it necessary to backport this patch to the 4.8  4.9 branches?
 
 
 I've applied the following as obvious after Kugan mentioned on IRC this 
 morning
 noticing a movwne r0, #-32768. Obviously this won't be accepted as is by the
 assembler and we should be using the %L character. Applied to trunk as 
 obvious.
 
 Felix, How did you test this patch ?
 
 regards
 Ramana


I regtested the patch for arm-eabi-gcc/g++ & big-endian with qemu.  The test 
result was OK.  That's strange ...  

This issue can be reproduced by the following testcase.  Thanks for fixing it.  

#include <stdio.h>
unsigned short v = 0x5678;
int i;
int j = 0;
int *ptr = &j;
int func()
{
  for (i = 0; i < 1; ++i)
    {
      *ptr = -1;
      v = 0xF234;
    }
  return v;
}

 
 2014-11-20  Ramana Radhakrishnan  ramana.radhakrish...@arm.com
 
  PR target/59593
  * config/arm/arm.md (*movhi_insn): Use right formatting
  for immediate.


Re: [PATCH, PR63742][ARM] Fix arm *movhi_insn_arch4 pattern for big-endian

2014-11-20 Thread Yangfei (Felix)
  On 19/11/14 09:29, Yangfei (Felix) wrote:
   Sorry for missing the point.  It seems to me that 't2' here will
   conflict with
   condition of the pattern *movhi_insn_arch4:
   TARGET_ARM
 arm_arch4
 (register_operand (operands[0], HImode)
|| register_operand (operands[1], HImode))
  
   #define TARGET_ARM  (! TARGET_THUMB)
   /* 32-bit Thumb-2 code.  */
   #define TARGET_THUMB2   (TARGET_THUMB 
   arm_arch_thumb2)
  
  
   Bah, Indeed ! - I misremembered the t2 there, my mistake.
  
   Yes you are right there, but what I'd like you to do is to use that
   mechanism rather than putting all this logic in the predicate.
  
   So, I'd prefer you to add a v6t2 to the values for the arch
   attribute, don't forget to update the comments above.
  
   and in arch_enabled you need to enforce this with
  
  (and (eq_attr arch v6t2)
   (match_test TARGET_32BIT  arm_arch6 
  arm_arch_thumb2))
 (const_string yes)
  
   And in the pattern use v6t2 ...
  
   arm_arch_thumb2 implies that this is at the architecture level of v6t2.
   Therefore TARGET_ARM  arm_arch_thumb2 implies ARM state.
  
  
   Hi Ramana,
Thank you for your suggestions.  I rebased the patch on the
   latest trunk
  and updated it accordingly.
As this patch will not work for architectures older than
   armv6t2,  I also
  prefer Thomas's patch to fix for them.
I am currently performing test for this patch.  Assuming no
   issues pops
  up, OK for the trunk?
And is it necessary to backport this patch to the 4.8  4.9 branches?
  
 
  I've applied the following as obvious after Kugan mentioned on IRC
  this morning noticing a movwne r0, #-32768. Obviously this won't be
  accepted as is by the assembler and we should be using the %L character.
 Applied to trunk as obvious.
 
  Felix, How did you test this patch ?
 
  regards
  Ramana
 
 
 I regtested the patch for arm-eabi-gcc/g++  big-endian with qemu.  The test
 result is OK.  That's strange ...
 
 This issue can be reproduced by the following testcase.  Thanks for fixing it.
 
 #include <stdio.h>
 unsigned short v = 0x5678;
 int i;
 int j = 0;
 int *ptr = &j;
 int func()
 {
 for (i = 0; i < 1; ++i)
 {
 *ptr = -1;
 v = 0xF234;
 }
 return v;
 }


And the architecture level is set to armv7-a by default when testing. 


Re: [PATCH, PR63742][ARM] Fix arm *movhi_insn_arch4 pattern for big-endian

2014-11-19 Thread Yangfei (Felix)
  Sorry for missing the point.  It seems to me that 't2' here will conflict 
  with
 condition of the pattern *movhi_insn_arch4:
 TARGET_ARM
   arm_arch4
   (register_operand (operands[0], HImode)
  || register_operand (operands[1], HImode))
 
  #define TARGET_ARM  (! TARGET_THUMB)
  /* 32-bit Thumb-2 code.  */
  #define TARGET_THUMB2   (TARGET_THUMB 
 arm_arch_thumb2)
 
 
 Bah, Indeed ! - I misremembered the t2 there, my mistake.
 
 Yes you are right there, but what I'd like you to do is to use that mechanism
 rather than putting all this logic in the predicate.
 
 So, I'd prefer you to add a v6t2 to the values for the arch attribute, 
 don't forget
 to update the comments above.
 
 and in arch_enabled you need to enforce this with
 
   (and (eq_attr arch v6t2)
(match_test TARGET_32BIT  arm_arch6  arm_arch_thumb2))
(const_string yes)
 
 And in the pattern use v6t2 ...
 
 arm_arch_thumb2 implies that this is at the architecture level of v6t2.
 Therefore TARGET_ARM  arm_arch_thumb2 implies ARM state.


Hi Ramana, 
Thank you for your suggestions.  I rebased the patch on the latest trunk 
and updated it accordingly. 
As this patch will not work for architectures older than armv6t2, I also 
prefer Thomas's patch to fix them. 
I am currently testing this patch.  Assuming no issues pop up, 
OK for the trunk?  
And is it necessary to backport this patch to the 4.8 and 4.9 branches? 


Index: gcc/ChangeLog
===
--- gcc/ChangeLog   (revision 217717)
+++ gcc/ChangeLog   (working copy)
@@ -1,3 +1,11 @@
+2014-11-19  Felix Yang  felix.y...@huawei.com
+   Shanyao Chen  chenshan...@huawei.com
+
+   PR target/59593
+   * config/arm/arm.md (define_attr arch): Add v6t2.
+   (define_attr arch_enabled): Add test for the above.
+   (*movhi_insn_arch4): Add new alternative.
+
 2014-11-18  Felix Yang  felix.y...@huawei.com
 
* config/aarch64/aarch64.c (doloop_end): New pattern.
Index: gcc/config/arm/arm.md
===
--- gcc/config/arm/arm.md   (revision 217717)
+++ gcc/config/arm/arm.md   (working copy)
@@ -125,9 +125,10 @@
 ; This can be a for ARM, t for either of the Thumbs, 32 for
 ; TARGET_32BIT, t1 or t2 to specify a specific Thumb mode.  v6
 ; for ARM or Thumb-2 with arm_arch6, and nov6 for ARM without
-; arm_arch6.  This attribute is used to compute attribute enabled,
-; use type any to enable an alternative in all cases.
-(define_attr arch 
any,a,t,32,t1,t2,v6,nov6,neon_for_64bits,avoid_neon_for_64bits,iwmmxt,iwmmxt2,armv6_or_vfpv3
+; arm_arch6.  v6t2 for Thumb-2 with arm_arch6.  This attribute is
+; used to compute attribute enabled, use type any to enable an
+; alternative in all cases.
+(define_attr arch 
any,a,t,32,t1,t2,v6,nov6,v6t2,neon_for_64bits,avoid_neon_for_64bits,iwmmxt,iwmmxt2,armv6_or_vfpv3
   (const_string any))
 
 (define_attr arch_enabled no,yes
@@ -162,6 +163,10 @@
  (match_test TARGET_32BIT  !arm_arch6))
 (const_string yes)
 
+(and (eq_attr arch v6t2)
+ (match_test TARGET_32BIT  arm_arch6  arm_arch_thumb2))
+(const_string yes)
+
 (and (eq_attr arch avoid_neon_for_64bits)
  (match_test TARGET_NEON)
  (not (match_test TARGET_PREFER_NEON_64BITS)))
@@ -6288,8 +6293,8 @@
 
 ;; Pattern to recognize insn generated default case above
 (define_insn *movhi_insn_arch4
-  [(set (match_operand:HI 0 nonimmediate_operand =r,r,m,r)
-   (match_operand:HI 1 general_operand  rIk,K,r,mi))]
+  [(set (match_operand:HI 0 nonimmediate_operand =r,r,r,m,r)
+   (match_operand:HI 1 general_operand  rIk,K,n,r,mi))]
   TARGET_ARM
 arm_arch4
 (register_operand (operands[0], HImode)
@@ -6297,16 +6302,19 @@
   @
mov%?\\t%0, %1\\t%@ movhi
mvn%?\\t%0, #%B1\\t%@ movhi
+   movw%?\\t%0, %1\\t%@ movhi
str%(h%)\\t%1, %0\\t%@ movhi
ldr%(h%)\\t%0, %1\\t%@ movhi
   [(set_attr predicable yes)
-   (set_attr pool_range *,*,*,256)
-   (set_attr neg_pool_range *,*,*,244)
+   (set_attr pool_range *,*,*,*,256)
+   (set_attr neg_pool_range *,*,*,*,244)
+   (set_attr arch *,*,v6t2,*,*)
(set_attr_alternative type
  [(if_then_else (match_operand 1 const_int_operand 
)
 (const_string mov_imm )
 (const_string mov_reg))
   (const_string mvn_imm)
+  (const_string mov_imm)
   (const_string store1)
   (const_string load1)])]
 )


arm-patch-v3.diff
Description: arm-patch-v3.diff


[PATCH] [AArch64, NEON] More NEON intrinsics improvement

2014-11-18 Thread Yangfei (Felix)
Hi,
 This patch converts more intrinsics to use builtin functions instead of 
the previous inline assembly syntax. 
 Passed the glorious testsuite of Christophe Lyon. 

 Three testcases are added for the testing of intrinsics which are not 
covered by the testsuite: 
 gcc.target/aarch64/vfma.c
 gcc.target/aarch64/vfma_n.c
 gcc.target/aarch64/vfms.c

 Regtested with aarch64-linux-gnu on QEMU.  OK for the trunk? 


Index: gcc/ChangeLog
===
--- gcc/ChangeLog   (revision 217394)
+++ gcc/ChangeLog   (working copy)
@@ -1,3 +1,26 @@
+2014-11-18  Felix Yang  felix.y...@huawei.com
+   Haijian Zhang  z.zhanghaij...@huawei.com
+   Jiji Jiang  jiangj...@huawei.com
+   Pengfei Sui  suipeng...@huawei.com
+
+   * config/aarch64/arm_neon.h (vrecpe_u32, vrecpeq_u32): Rewrite using
+   builtin functions.
+   (vfma_f32, vfmaq_f32, vfmaq_f64, vfma_n_f32, vfmaq_n_f32, vfmaq_n_f64,
+   vfms_f32, vfmsq_f32, vfmsq_f64): Likewise.
+   (vhsub_s8, vhsub_u8, vhsub_s16, vhsub_u16, vhsub_s32, vhsub_u32,
+   vhsubq_s8, vhsubq_u8, vhsubq_s16, vhsubq_u16, vhsubq_s32, vhsubq_u32,
+   vsubhn_s16, vsubhn_u16, vsubhn_s32, vsubhn_u32, vsubhn_s64, vsubhn_u66,
+   vrsubhn_s16, vrsubhn_u16, vrsubhn_s32, vrsubhn_u32, vrsubhn_s64,
+   vrsubhn_u64, vsubhn_high_s16, vsubhn_high_u16, vsubhn_high_s32,
+   vsubhn_high_u32, vsubhn_high_s64, vsubhn_high_u64, vrsubhn_high_s16,
+   vrsubhn_high_u16, vrsubhn_high_s32, vrsubhn_high_u32, vrsubhn_high_s64,
+   vrsubhn_high_u64): Likewise.
+   * config/aarch64/iterators.md (VDQ_SI): New mode iterator.
+   * config/aarch64/aarch64.md (define_c_enum unspec): Add UNSPEC_URECPE.
+   * config/aarch64/aarch64-simd.md (aarch64_urecpemode): New pattern.
+   * config/aarch64/aarch64-simd-builtins.def (shsub, uhsub, subhn, rsubhn,
+   subhn2, rsubhn2, urecpe): New builtins.
+
 2014-11-11  Andrew Pinski  apin...@cavium.com
 
Bug target/61997
Index: gcc/testsuite/gcc.target/aarch64/narrow_high-intrinsics.c
===
--- gcc/testsuite/gcc.target/aarch64/narrow_high-intrinsics.c   (revision 
217394)
+++ gcc/testsuite/gcc.target/aarch64/narrow_high-intrinsics.c   (working copy)
@@ -107,9 +107,9 @@ ONE (vmovn_high, uint16x8_t, uint16x4_t, uint32x4_
 ONE (vmovn_high, uint32x4_t, uint32x2_t, uint64x2_t, u64)
 
 
-/* { dg-final { scan-assembler-times \\tsubhn2 v 6} }  */
+/* { dg-final { scan-assembler-times \\tsubhn2\\tv 6} }  */
 /* { dg-final { scan-assembler-times \\taddhn2\\tv 6} }  */
-/* { dg-final { scan-assembler-times rsubhn2 v 6} }  */
+/* { dg-final { scan-assembler-times rsubhn2\\tv 6} }  */
 /* { dg-final { scan-assembler-times raddhn2\\tv 6} }  */
 /* { dg-final { scan-assembler-times \\trshrn2 v 6} }  */
 /* { dg-final { scan-assembler-times \\tshrn2 v 6} }  */
Index: gcc/testsuite/gcc.target/aarch64/advsimd-intrinsics/vfma_n.c
===
--- gcc/testsuite/gcc.target/aarch64/advsimd-intrinsics/vfma_n.c
(revision 0)
+++ gcc/testsuite/gcc.target/aarch64/advsimd-intrinsics/vfma_n.c
(revision 0)
@@ -0,0 +1,69 @@
+#include arm_neon.h
+#include arm-neon-ref.h
+#include compute-ref-data.h
+
+/* Expected results.  */
+VECT_VAR_DECL(expected,hfloat,32,2) [] = { 0x4438ca3d, 0x44390a3d };
+VECT_VAR_DECL(expected,hfloat,32,4) [] = { 0x44869eb8, 0x4486beb8, 0x4486deb8, 
0x4486feb8 };
+VECT_VAR_DECL(expected,hfloat,64,2) [] = { 0x408906e1532b8520, 
0x40890ee1532b8520 };
+
+#define VECT_VAR_ASSIGN(S,Q,T1,W) S##Q##_##T1##W
+#define ASSIGN(S, Q, T, W, V) T##W##_t S##Q##_##T##W = V
+#define TEST_MSG VFMA/VFMAQ
+void exec_vfma_n (void)
+{
+  /* Basic test: v4=vfma_n(v1,v2), then store the result.  */
+#define TEST_VFMA(Q, T1, T2, W, N) \
+  VECT_VAR(vector_res, T1, W, N) = \
+vfma##Q##_n_##T2##W(VECT_VAR(vector1, T1, W, N),   \
+ VECT_VAR(vector2, T1, W, N),  \
+ VECT_VAR_ASSIGN(Scalar, Q, T1, W));   
\
+  vst1##Q##_##T2##W(VECT_VAR(result, T1, W, N), VECT_VAR(vector_res, T1, W, N))
+
+#define CHECK_VFMA_RESULTS(test_name,comment)  \
+  {\
+CHECK_FP(test_name, float, 32, 2, PRIx32, expected, comment);  \
+CHECK_FP(test_name, float, 32, 4, PRIx32, expected, comment);  \
+   CHECK_FP(test_name, float, 64, 2, PRIx64, expected, comment);   \
+  }
+
+#define DECL_VABD_VAR(VAR) \
+  DECL_VARIABLE(VAR, float, 32, 2);\
+  DECL_VARIABLE(VAR, float, 32, 4);\
+  DECL_VARIABLE(VAR, float, 64, 2);
+
+  DECL_VABD_VAR(vector1);
+  DECL_VABD_VAR(vector2);
+  DECL_VABD_VAR(vector3);
+  

Re: [PATCH 0/3][AArch64]More intrinsics/builtins improvements

2014-11-18 Thread Yangfei (Felix)
Yeah, I agree that your approach is better.  I missed this point.  Thanks.
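
To make the point from the quoted message below concrete, here is a small sketch of 
my own (not code from either patch; it assumes the usual arm_neon.h types and 
vdup_n_s32): the vector-extension style keeps the construction visible to the 
midend, whereas a builtin call would be opaque and block constant folding. 

#include <arm_neon.h>

/* Vector-extension style: the midend can see how the vector is built, so a
   dup of a known value can fold into a constant vector.  */
__extension__ static __inline int32x2_t __attribute__ ((__always_inline__))
my_vld1_dup_s32 (const int32_t *__a)
{
  return vdup_n_s32 (*__a);
}

int32x2_t
splat_answer (void)
{
  static const int32_t x = 42;
  return my_vld1_dup_s32 (&x);  /* can fold to the constant vector { 42, 42 }  */
}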


 
 Ah, sorry for the duplication of effort. And thanks for the heads-up about
 upcoming work! I don't think I have any plans for any of those others at the
 moment.
 
 In the case of vld1_dup, however, I'm going to argue that my approach
 (https://gcc.gnu.org/ml/gcc-patches/2014-11/msg01718.html) is better, in that 
 a
 builtin is opaque (blocks optimization) for the midend, whereas gcc vector
 extensions (as in vdup_n_...) allows the midend to perform constant-folding, 
 etc.
 Does that make sense?
 
 --Alan
 
 Yangfei (Felix) wrote:
  These three are logically independent, but all on a common theme, and
  I've tested them all together by
 
  bootstrapped + check-gcc on aarch64-none-elf cross-tested check-gcc
  on aarch64_be-none-elf
 
  Ok for trunk?
 
 
  Hi Alan,
 
  It seems that we are duplicating the work for the vld1_dup part. (Refer 
  to
 my message: https://gcc.gnu.org/ml/gcc-patches/2014-11/msg01462.html)
  I have a plan to improve these intrinsics/builtins:  vrsubhnX, vrsqrtX,
 vqrdmulX, vqmovX, vqdmulhqX, vqdmulhX, vpminX, vpmaxX, vpaddX, vpadaX
  vmvnX, vmulxX,
 vmovnX, vmlsX, vhsubX, vcvtX, vcopyX, vaddlvX, vabX, vfmX, vrecpeX, vcntX,
 vclsX
  And work for these intrinsics is in progress:  vfmX, vrecpeX, vhsubX,
 vcntX, vclsX
  Please let me know if you guys want to work on any of them.  Thanks.
 
 



Re: [PING ^ 3][PATCH, AArch64] Add doloop_end pattern for -fmodulo-sched

2014-11-18 Thread Yangfei (Felix)
Yes, most of the code is borrowed from the ARM port, with the expected mode of the 
loop pseudo changed to DImode. 
The function loop_canon_p, called from sms_schedule, already makes sure that SMS is 
only performed on innermost loops. 
But I agree it's a good idea to implement the TARGET_CAN_USE_DOLOOP_P hook 
here. 
See my updated patch below.  How about this one? 
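
For context, a small illustration of my own (not part of the patch) of the kind of 
loop this targets: an innermost counted loop whose control part SMS can only 
recognize once the doloop_end pattern rewrites the exit test into a 
decrement-compare-and-branch on the count register. 

/* Built with -O2 -fmodulo-sched on aarch64, the loop control below can be
   turned by the doloop_end expander into roughly "subs xN, xN, #1; b.ne ...",
   which is the shape SMS looks for when identifying the 'count' register.  */
void
saxpy (float *restrict y, const float *restrict x, float a, long n)
{
  for (long i = 0; i < n; i++)
    y[i] = a * x[i] + y[i];
}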


Index: gcc/ChangeLog
===
--- gcc/ChangeLog   (revision 217394)
+++ gcc/ChangeLog   (working copy)
@@ -1,3 +1,8 @@
+2014-11-13  Felix Yang  felix.y...@huawei.com
+
+   * config/aarch64/aarch64.md (doloop_end): New pattern.
+   * config/aarch64/aarch64.c (TARGET_CAN_USE_DOLOOP_P): Implement.
+
 2014-11-11  Andrew Pinski  apin...@cavium.com
 
Bug target/61997
Index: gcc/config/aarch64/aarch64.md
===
--- gcc/config/aarch64/aarch64.md   (revision 217394)
+++ gcc/config/aarch64/aarch64.md   (working copy)
@@ -4087,6 +4087,47 @@
   [(set_attr type mrs)])
 
 
+;; Define the subtract-one-and-jump insns so loop.c
+;; knows what to generate.
+(define_expand doloop_end
+  [(use (match_operand 0  ))  ; loop pseudo
+   (use (match_operand 1  ))] ; label
+  
+{
+  /* Currently SMS relies on the do-loop pattern to recognize loops
+ where (1) the control part consists of all insns defining and/or
+ using a certain 'count' register and (2) the loop count can be
+ adjusted by modifying this register prior to the loop.
+ ??? The possible introduction of a new block to initialize the
+ new IV can potentially affect branch optimizations.  */
+  if (optimize > 0 && flag_modulo_sched)
+{
+  rtx s0;
+  rtx bcomp;
+  rtx loc_ref;
+  rtx cc_reg;
+  rtx insn;
+  rtx cmp;
+
+  if (GET_MODE (operands[0]) != DImode)
+   FAIL;
+
+  s0 = operands [0];
+  insn = emit_insn (gen_adddi3_compare0 (s0, s0, GEN_INT (-1)));
+
+  cmp = XVECEXP (PATTERN (insn), 0, 0);
+  cc_reg = SET_DEST (cmp);
+  bcomp = gen_rtx_NE (VOIDmode, cc_reg, const0_rtx);
+  loc_ref = gen_rtx_LABEL_REF (VOIDmode, operands [1]);
+  emit_jump_insn (gen_rtx_SET (VOIDmode, pc_rtx,
+  gen_rtx_IF_THEN_ELSE (VOIDmode, bcomp,
+loc_ref, pc_rtx)));
+  DONE;
+}
+  else
+FAIL;
+})
+
 ;; AdvSIMD Stuff
 (include aarch64-simd.md)
 
Index: gcc/config/aarch64/aarch64.c
===
--- gcc/config/aarch64/aarch64.c(revision 217394)
+++ gcc/config/aarch64/aarch64.c(working copy)
@@ -10224,6 +10224,9 @@ aarch64_use_by_pieces_infrastructure_p (unsigned i
 #define TARGET_USE_BY_PIECES_INFRASTRUCTURE_P \
   aarch64_use_by_pieces_infrastructure_p
 
+#undef TARGET_CAN_USE_DOLOOP_P
+#define TARGET_CAN_USE_DOLOOP_P can_use_doloop_if_innermost
+
 struct gcc_target targetm = TARGET_INITIALIZER;
 
 #include gt-aarch64.h


 
 On 17 November 2014 07:59, Yangfei (Felix) felix.y...@huawei.com wrote:
 
  +2014-11-13  Felix Yang  felix.y...@huawei.com
  +
  + * config/aarch64/aarch64.md (doloop_end): New pattern.
  +
 
 This looks like a straight copy of the ARM code, but without the
 TARGET_CAN_USE_DOLOOP_P definition.  If the reason for including this code
 is for the benefit of module-sched then should the hook be defined to limit 
 the
 use of this pattern to inner most loops only?
 
 Cheers
 /Marcus


aarch64-doloop-v4.diff
Description: aarch64-doloop-v4.diff


Re: [PATCH, PR63742][ARM] Fix arm *movhi_insn_arch4 pattern for big-endian

2014-11-18 Thread Yangfei (Felix)
 On 06/11/14 08:35, Yangfei (Felix) wrote:
The idea is simple: Use movw for certain const source operand
  instead of
  ldrh.  And exclude the const values which cannot be handled by
  mov/mvn/movw.
I am doing regression test for this patch.  Assuming no issue
  pops up,
  OK for trunk?
 
  So, doesn't that makes the bug latent for architectures older than
  armv6t2 and big endian and only fixed this in ARM state ? I'd prefer
  a complete solution please. What about *thumb2_movhi_insn in
 thumb2.md ?
 
 
  Actually, we fix the bug by excluding the const values which cannot be 
  handled.
 The patch still works even without the adding of movw here.
  The new movw alternative here is just an small code optimization for 
  certain
 arch. We can handle consts by movw instead of ldrh and this better for
 performance.
  We find that this bug is not reproducible for *thumb2_movhi_insn. The reason
 is that this pattern can always move consts using movw.
 
 Please fix the PR number in your final commit with PR 59593.
 
  Index: gcc/config/arm/predicates.md
 
 =
 ==
  --- gcc/config/arm/predicates.md(revision 216838)
  +++ gcc/config/arm/predicates.md(working copy)
  @@ -144,6 +144,11 @@
 (and (match_code const_int)
  (match_test INTVAL (op) == 0)))
 
  +(define_predicate arm_movw_immediate_operand
  +  (and (match_test TARGET_32BIT  arm_arch_thumb2)
  +   (and (match_code const_int)
  +   (match_test (INTVAL (op)  0x) == 0
  +
   ;; Something valid on the RHS of an ARM data-processing instruction
  (define_predicate arm_rhs_operand
 (ior (match_operand 0 s_register_operand) @@ -211,6 +216,11 @@
 (ior (match_operand 0 arm_rhs_operand)
  (match_operand 0 arm_not_immediate_operand)))
 
  +(define_predicate arm_hi_operand
  +  (ior (match_operand 0 arm_rhsm_operand)
  +   (ior (match_operand 0 arm_not_immediate_operand)
  +(match_operand 0 arm_movw_immediate_operand
  +
 
 I think you don't need any of these predicates.
 
 
   (define_predicate arm_di_operand
 (ior (match_operand 0 s_register_operand)
  (match_operand 0 arm_immediate_di_operand)))
  Index: gcc/config/arm/arm.md
 
 =
 ==
  --- gcc/config/arm/arm.md   (revision 216838)
  +++ gcc/config/arm/arm.md   (working copy)
  @@ -6285,8 +6285,8 @@
 
   ;; Pattern to recognize insn generated default case above
  (define_insn *movhi_insn_arch4
  -  [(set (match_operand:HI 0 nonimmediate_operand =r,r,m,r)
  -   (match_operand:HI 1 general_operand  rIk,K,r,mi))]
  +  [(set (match_operand:HI 0 nonimmediate_operand =r,r,r,m,r)
  +   (match_operand:HI 1 arm_hi_operand rIk,K,j,r,mi))]
 
 Use `n' instead of `j' - movw can handle all of the numerical constants for 
 HImode
 values. And the predicate can remain general_operand.
 


Hello Ramana,

  We need to make sure that movw is only used for architectures which satisfy 
arm_arch_thumb2, as indicated by the following added predicate:

+(define_predicate arm_movw_immediate_operand
+  (and (match_test TARGET_32BIT  arm_arch_thumb2)
+   (and (match_code const_int)
+   (match_test (INTVAL (op)  0x) == 0

  I am modifying the predicate in order to fix the issue for the other, older 
architectures.
  It seems we can't achieve this simply by using 'n' instead of 'j' here, right?
  Thanks.



Re: [PING ^ 3][PATCH, AArch64] Add doloop_end pattern for -fmodulo-sched

2014-11-18 Thread Yangfei (Felix)
 On 11/18/2014 11:48 AM, Yangfei (Felix) wrote:
  +(define_expand doloop_end
  +  [(use (match_operand 0  ))  ; loop pseudo
  +   (use (match_operand 1  ))] ; label
  +  
  +{
  +  /* Currently SMS relies on the do-loop pattern to recognize loops
  + where (1) the control part consists of all insns defining and/or
  + using a certain 'count' register and (2) the loop count can be
  + adjusted by modifying this register prior to the loop.
  + ??? The possible introduction of a new block to initialize the
  + new IV can potentially affect branch optimizations.  */
  +  if (optimize  0  flag_modulo_sched)
 
 You'd be better off moving this condition into the expansion predicate (which 
 is
 currently ).
 
 This short-circuits a lot of unnecessary work.  See  pass_rtl_doloop::gate.
 
 
 r~


Yeah, that's a good idea.  See my updated patch :-)


Index: gcc/ChangeLog
===
--- gcc/ChangeLog   (revision 217394)
+++ gcc/ChangeLog   (working copy)
@@ -1,3 +1,8 @@
+2014-11-13  Felix Yang  felix.y...@huawei.com
+
+   * config/aarch64/aarch64.md (doloop_end): New pattern.
+   * config/aarch64/aarch64.c (TARGET_CAN_USE_DOLOOP_P): Implement.
+
 2014-11-11  Andrew Pinski  apin...@cavium.com
 
Bug target/61997
Index: gcc/config/aarch64/aarch64.md
===
--- gcc/config/aarch64/aarch64.md   (revision 217394)
+++ gcc/config/aarch64/aarch64.md   (working copy)
@@ -4087,6 +4087,43 @@
   [(set_attr type mrs)])
 
 
+;; Define the subtract-one-and-jump insns so loop.c
+;; knows what to generate.
+(define_expand doloop_end
+  [(use (match_operand 0  ))  ; loop pseudo
+   (use (match_operand 1  ))] ; label
+  "optimize > 0 && flag_modulo_sched"
+{
+  rtx s0;
+  rtx bcomp;
+  rtx loc_ref;
+  rtx cc_reg;
+  rtx insn;
+  rtx cmp;
+
+  /* Currently SMS relies on the do-loop pattern to recognize loops
+ where (1) the control part consists of all insns defining and/or
+ using a certain 'count' register and (2) the loop count can be
+ adjusted by modifying this register prior to the loop.
+ ??? The possible introduction of a new block to initialize the
+ new IV can potentially affect branch optimizations.  */
+
+  if (GET_MODE (operands[0]) != DImode)
+FAIL;
+
+  s0 = operands [0];
+  insn = emit_insn (gen_adddi3_compare0 (s0, s0, GEN_INT (-1)));
+
+  cmp = XVECEXP (PATTERN (insn), 0, 0);
+  cc_reg = SET_DEST (cmp);
+  bcomp = gen_rtx_NE (VOIDmode, cc_reg, const0_rtx);
+  loc_ref = gen_rtx_LABEL_REF (VOIDmode, operands [1]);
+  emit_jump_insn (gen_rtx_SET (VOIDmode, pc_rtx,
+  gen_rtx_IF_THEN_ELSE (VOIDmode, bcomp,
+loc_ref, pc_rtx)));
+  DONE;
+})
+
 ;; AdvSIMD Stuff
 (include aarch64-simd.md)
 
Index: gcc/config/aarch64/aarch64.c
===
--- gcc/config/aarch64/aarch64.c(revision 217394)
+++ gcc/config/aarch64/aarch64.c(working copy)
@@ -10224,6 +10224,9 @@ aarch64_use_by_pieces_infrastructure_p (unsigned i
 #define TARGET_USE_BY_PIECES_INFRASTRUCTURE_P \
   aarch64_use_by_pieces_infrastructure_p
 
+#undef TARGET_CAN_USE_DOLOOP_P
+#define TARGET_CAN_USE_DOLOOP_P can_use_doloop_if_innermost
+
 struct gcc_target targetm = TARGET_INITIALIZER;
 
 #include gt-aarch64.h



aarch64-doloop-v5.diff
Description: aarch64-doloop-v5.diff


[PING ^ 2][RFC PATCH, AARCH64] Add support for -mlong-calls option

2014-11-18 Thread Yangfei (Felix)
Ping again?  Any comment please? 
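
For anyone catching up on the thread quoted below, a minimal usage sketch of the 
interface under review (the attribute and pragma spellings follow my reading of the 
patch and of the ARM port it mirrors; none of this is a committed GCC interface): 

/* Force an indirect (long) call for a single callee...  */
extern void far_away_handler (void) __attribute__ ((long_call));

/* ...or for a group of declarations via the pragma.  */
#pragma long_calls
extern void far_init (void);
extern void far_shutdown (void);
#pragma long_calls_off

void
dispatch (void)
{
  far_away_handler ();  /* address loaded into a register, then blr  */
  far_init ();          /* likewise, because of the pragma  */
}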


 
 Ping?  I hope this patch can catch up with stage 1 of GCC-5.0.  Thanks.
 
 
 
 
   Hi Felix,
  
   Sorry for the delay responding, I've been out of the office recently
   and I'm only just catching up on a backlog of GCC related emails.
  
   I'm in two minds about this; I can potentially see the need for
   attributes to enable long calls for specific calls, and maybe also
   for pragmas that can be used to efficiently mark a group of
   functions in that way; but I don't really see the value in adding a
   -mlong-calls option to do
  this globally.
  
   The reasoning is as follows: long calls are generally very expensive
   and relatively few functions should need them in most applications
   (since code that needs to span more than a single block of 128Mbytes
   - the span of a BL or B instruction - will be very rare in reality).
  
   The best way to handle very large branches for those rare cases
   where you do have a very large contiguous block of code more than
   128MB is by having the linker insert veneers when needed; the code
   will branch to the veneer which will insert an indirect branch at
   that point (the ABI guarantees that at function call boundaries IP0
   and IP1 will not contain live values, making them available for such 
   purposes).
  
   In a very small number of cases it might be desirable to mark
   specific functions as being too far away to reach; in those cases
   the attributes and pragma methods can be used to mark such calls as being
 far calls.
  
   Aside: The reason -mlong-calls was added to GCC for ARM is that the
   code there pre-dates the EABI, which introduced the concept of
   link-time veneering of calls - the option should be unnecessary now
   that almost everyone uses the EABI as the basis for their platform
   ABI.  We don't have such a legacy for AArch64 and I'd need to see
   strong
  justification for its use before adding the option there as well.
  
   So please can you rework the patch to remove the -mlong-calls option
   and just leave the attribute and pragma interfaces.
  
   R.
 
 
  Hello Richard,
 
Thanks for the comments.  I agree with the idea.
And I updated the patch with the -mlong-calls option removed and use
  short call by default.
Reg-tested for aarch64-linux-gnu with qemu.  Is it OK for trunk?
 
 
  Index: gcc/ChangeLog
 
 =
  ==
  --- gcc/ChangeLog   (revision 217394)
  +++ gcc/ChangeLog   (working copy)
  @@ -1,3 +1,25 @@
  +2014-11-12  Felix Yang  felix.y...@huawei.com
  +   Haijian Zhang  z.zhanghaij...@huawei.com
  +
  +   * config/aarch64/aarch64.h (REGISTER_TARGET_PRAGMAS): Define.
  +   * config/aarch64/aarch64.c (aarch64_set_default_type_attributes,
  +   aarch64_attribute_table, aarch64_comp_type_attributes,
  +   aarch64_decl_is_long_call_p, aarch64_function_in_section_p,
  +   aarch64_pr_long_calls, aarch64_pr_no_long_calls,
  +   aarch64_pr_long_calls_off): New functions.
  +   (TARGET_SET_DEFAULT_TYPE_ATTRIBUTES): Define as
  +   aarch64_set_default_type_attributes.
  +   (TARGET_ATTRIBUTE_TABLE): Define as aarch64_attribute_table.
  +   (TARGET_COMP_TYPE_ATTRIBUTES): Define as
  aarch64_comp_type_attribute.
  +   (aarch64_pragma_enum): New enum.
  +   (aarch64_attribute_table): New attribute table.
  +   * config/aarch64/aarch64-protos.h (aarch64_pr_long_calls,
  +   aarch64_pr_no_long_calls, aarch64_pr_long_calls_off): New declarations.
  +   * config/aarch64/aarch64.md (sibcall, sibcall_value): Modified to
  +   generate indirect call for sibling call when needed.
  +   * config/aarch64/predicate.md (aarch64_call_insn_operand): Modified to
  +   exclude a symbol_ref for an indirect call.
  +
   2014-11-11  Andrew Pinski  apin...@cavium.com
 
  Bug target/61997
  Index: gcc/testsuite/gcc.target/aarch64/long-calls-1.c
 
 =
  ==
  --- gcc/testsuite/gcc.target/aarch64/long-calls-1.c (revision 0)
  +++ gcc/testsuite/gcc.target/aarch64/long-calls-1.c (revision 0)
  @@ -0,0 +1,133 @@
  +/* Check that long calls to different sections are not optimized to
  +bl.  */
  +/* { dg-do compile } */
  +/* { dg-options -O2 } */
  +/* This test expects that short calls are the default.  */
  +/* { dg-skip-if -mlong-calls in use { *-*-* } { -mlong-calls }
  +{  } } */
  +
  +#define section(S) __attribute__((section(S))) #define weak
  +__attribute__((weak)) #define noinline __attribute__((noinline))
  +#define long_call __attribute__((long_call)) #define short_call
  +__attribute__((short_call))
  +
  +#define REMOTE_CALL(ID, TARGET_ATTRS, CALL_ATTRS)  \
  +  const char *TARGET_ATTRS ID (void);  
  \
  +  const char *CALL_ATTRS call_##ID (void) { return ID () + 1; }
  +
  +#define EXTERN_CALL(ID, TARGET_ATTRS, CALL_ATTRS)  \
  +  const char *TARGET_ATTRS noinline ID (void) { return 

Re: [PATCH, PR63742][ARM] Fix arm *movhi_insn_arch4 pattern for big-endian

2014-11-18 Thread Yangfei (Felix)
 On 18/11/14 11:02, Yangfei (Felix) wrote:
  On 06/11/14 08:35, Yangfei (Felix) wrote:
 The idea is simple: Use movw for certain const source
  operand instead of
  ldrh.  And exclude the const values which cannot be handled by
  mov/mvn/movw.
 I am doing regression test for this patch.  Assuming no
  issue pops up,
  OK for trunk?
 
  So, doesn't that makes the bug latent for architectures older than
  armv6t2 and big endian and only fixed this in ARM state ? I'd
  prefer a complete solution please. What about *thumb2_movhi_insn in
  thumb2.md ?
 
 
  Actually, we fix the bug by excluding the const values which cannot be
 handled.
  The patch still works even without the adding of movw here.
  The new movw alternative here is just an small code optimization
  for certain
  arch. We can handle consts by movw instead of ldrh and this better
  for performance.
  We find that this bug is not reproducible for *thumb2_movhi_insn.
  The reason
  is that this pattern can always move consts using movw.
 
  Please fix the PR number in your final commit with PR 59593.
 
  Index: gcc/config/arm/predicates.md
 
 
 =
  ==
  --- gcc/config/arm/predicates.md  (revision 216838)
  +++ gcc/config/arm/predicates.md  (working copy)
  @@ -144,6 +144,11 @@
  (and (match_code const_int)
   (match_test INTVAL (op) == 0)))
 
  +(define_predicate arm_movw_immediate_operand
  +  (and (match_test TARGET_32BIT  arm_arch_thumb2)
  +   (and (match_code const_int)
  + (match_test (INTVAL (op)  0x) == 0
  +
;; Something valid on the RHS of an ARM data-processing
  instruction (define_predicate arm_rhs_operand
  (ior (match_operand 0 s_register_operand) @@ -211,6 +216,11 @@
  (ior (match_operand 0 arm_rhs_operand)
   (match_operand 0 arm_not_immediate_operand)))
 
  +(define_predicate arm_hi_operand
  +  (ior (match_operand 0 arm_rhsm_operand)
  +   (ior (match_operand 0 arm_not_immediate_operand)
  +(match_operand 0 arm_movw_immediate_operand
  +
 
  I think you don't need any of these predicates.
 
 
(define_predicate arm_di_operand
  (ior (match_operand 0 s_register_operand)
   (match_operand 0 arm_immediate_di_operand)))
  Index: gcc/config/arm/arm.md
 
 
 =
  ==
  --- gcc/config/arm/arm.md (revision 216838)
  +++ gcc/config/arm/arm.md (working copy)
  @@ -6285,8 +6285,8 @@
 
;; Pattern to recognize insn generated default case above
  (define_insn *movhi_insn_arch4
  -  [(set (match_operand:HI 0 nonimmediate_operand =r,r,m,r)
  - (match_operand:HI 1 general_operand  rIk,K,r,mi))]
  +  [(set (match_operand:HI 0 nonimmediate_operand =r,r,r,m,r)
  + (match_operand:HI 1 arm_hi_operand rIk,K,j,r,mi))]
 
  Use `n' instead of `j' - movw can handle all of the numerical
  constants for HImode values. And the predicate can remain general_operand.
 
 
 Did you read my comment about set_attr arch further down in the thread ?
 
  Look at the set_attr arch alternative and set this to t2 for the movw
 alternative. I would expect that to be enough to sort this out instead of 
 adding all
 this code.
 

Sorry for missing the point.  It seems to me that 't2' here will conflict with the 
condition of the pattern *movhi_insn_arch4: 
  "TARGET_ARM
   && arm_arch4
   && (register_operand (operands[0], HImode)
       || register_operand (operands[1], HImode))"

#define TARGET_ARM  (! TARGET_THUMB)
/* 32-bit Thumb-2 code.  */
#define TARGET_THUMB2   (TARGET_THUMB && arm_arch_thumb2)

Right? Thanks.



Re: [PING ^ 2][RFC PATCH, AARCH64] Add support for -mlong-calls option

2014-11-18 Thread Yangfei (Felix)
 On Tue, Nov 18, 2014 at 11:51 AM, Yangfei (Felix) felix.y...@huawei.com
 wrote:
  Ping again?  Any comment please?
 
 
 Pinging daily is only going to irritate people. Please desist from doing so.
 
 Ramana


Oh, thanks for reminding me.  And sorry if this bothers you guys.  
The approaching end of stage 1 for GCC 5.0 is causing me to push this a little bit :-)  


 
 
 
  Ping?  I hope this patch can catch up with stage 1 of GCC-5.0.  Thanks.
 
 
 
 
Hi Felix,
   
Sorry for the delay responding, I've been out of the office
recently and I'm only just catching up on a backlog of GCC related 
emails.
   
I'm in two minds about this; I can potentially see the need for
attributes to enable long calls for specific calls, and maybe
also for pragmas that can be used to efficiently mark a group of
functions in that way; but I don't really see the value in adding
a -mlong-calls option to do
   this globally.
   
The reasoning is as follows: long calls are generally very
expensive and relatively few functions should need them in most
applications (since code that needs to span more than a single
block of 128Mbytes
- the span of a BL or B instruction - will be very rare in reality).
   
The best way to handle very large branches for those rare cases
where you do have a very large contiguous block of code more than
128MB is by having the linker insert veneers when needed; the
code will branch to the veneer which will insert an indirect
branch at that point (the ABI guarantees that at function call
boundaries IP0 and IP1 will not contain live values, making them 
available
 for such purposes).
   
In a very small number of cases it might be desirable to mark
specific functions as being too far away to reach; in those cases
the attributes and pragma methods can be used to mark such calls
as being
  far calls.
   
Aside: The reason -mlong-calls was added to GCC for ARM is that
the code there pre-dates the EABI, which introduced the concept
of link-time veneering of calls - the option should be
unnecessary now that almost everyone uses the EABI as the basis
for their platform ABI.  We don't have such a legacy for AArch64
and I'd need to see strong
   justification for its use before adding the option there as well.
   
So please can you rework the patch to remove the -mlong-calls
option and just leave the attribute and pragma interfaces.
   
R.
  
  
   Hello Richard,
  
 Thanks for the comments.  I agree with the idea.
 And I updated the patch with the -mlong-calls option removed and
   use short call by default.
 Reg-tested for aarch64-linux-gnu with qemu.  Is it OK for trunk?
  
  
   Index: gcc/ChangeLog
  
 
 =
   ==
   --- gcc/ChangeLog   (revision 217394)
   +++ gcc/ChangeLog   (working copy)
   @@ -1,3 +1,25 @@
   +2014-11-12  Felix Yang  felix.y...@huawei.com
   +   Haijian Zhang  z.zhanghaij...@huawei.com
   +
   +   * config/aarch64/aarch64.h (REGISTER_TARGET_PRAGMAS): Define.
   +   * config/aarch64/aarch64.c (aarch64_set_default_type_attributes,
   +   aarch64_attribute_table, aarch64_comp_type_attributes,
   +   aarch64_decl_is_long_call_p, aarch64_function_in_section_p,
   +   aarch64_pr_long_calls, aarch64_pr_no_long_calls,
   +   aarch64_pr_long_calls_off): New functions.
   +   (TARGET_SET_DEFAULT_TYPE_ATTRIBUTES): Define as
   +   aarch64_set_default_type_attributes.
   +   (TARGET_ATTRIBUTE_TABLE): Define as aarch64_attribute_table.
    +   (TARGET_COMP_TYPE_ATTRIBUTES): Define as aarch64_comp_type_attributes.
   +   (aarch64_pragma_enum): New enum.
   +   (aarch64_attribute_table): New attribute table.
    +   * config/aarch64/aarch64-protos.h (aarch64_pr_long_calls,
    +   aarch64_pr_no_long_calls, aarch64_pr_long_calls_off): New declarations.
   +   * config/aarch64/aarch64.md (sibcall, sibcall_value): Modified to
   +   generate indirect call for sibling call when needed.
    +   * config/aarch64/predicate.md (aarch64_call_insn_operand): Modified to
    +   exclude a symbol_ref for an indirect call.
   +
2014-11-11  Andrew Pinski  apin...@cavium.com
  
   Bug target/61997
   Index: gcc/testsuite/gcc.target/aarch64/long-calls-1.c
  
 
    ===================================================================
   --- gcc/testsuite/gcc.target/aarch64/long-calls-1.c (revision 0)
   +++ gcc/testsuite/gcc.target/aarch64/long-calls-1.c (revision 0)
   @@ -0,0 +1,133 @@
    +/* Check that long calls to different sections are not optimized
    +   to bl.  */
    +/* { dg-do compile } */
    +/* { dg-options "-O2" } */
    +/* This test expects that short calls are the default.  */
    +/* { dg-skip-if "-mlong-calls in use" { *-*-* } { "-mlong-calls" } { "" } } */
    +
    +#define section(S) __attribute__((section(S)))
    +#define weak __attribute__((weak))
    +#define noinline __attribute__((noinline))
    +#define

Re: [PING][PATCH] [AARCH64, NEON] Improve vcls(q?) vcnt(q?) and vld1(q?)_dup intrinsics

2014-11-18 Thread Yangfei (Felix)
 On 17 November 2014 06:58, Yangfei (Felix) felix.y...@huawei.com wrote:
  PING?
  BTW: It seems that Alan's way of improving vld1(q?)_dup intrinsic is more
 elegant.
  So is the improvement of vcls(q?) vcnt(q?) OK for trunk?  Thanks.
 
 Please rebase over Alan's patch and repost, thank you /Marcus


I rebased the patch on the latest trunk. 
Regtested for aarch64-linux-gnu with qemu. 
OK for the trunk? 


Index: gcc/ChangeLog
===================================================================
--- gcc/ChangeLog   (revision 217717)
+++ gcc/ChangeLog   (working copy)
@@ -1,3 +1,14 @@
+2014-11-13  Felix Yang  felix.y...@huawei.com
+   Shanyao Chen  chenshan...@huawei.com
+
+   * config/aarch64/aarch64-simd.md (clrsb<mode>2, popcount<mode>2): New
+   patterns.
+   * config/aarch64/aarch64-simd-builtins.def (clrsb, popcount): New
+   builtins.
+   * config/aarch64/arm_neon.h (vcls_s8, vcls_s16, vcls_s32, vclsq_s8,
+   vclsq_s16, vclsq_s32, vcnt_p8, vcnt_s8, vcnt_u8, vcntq_p8, vcntq_s8,
+   vcntq_u8): Rewrite using builtin functions.
+
 2014-11-18  Felix Yang  felix.y...@huawei.com
 
* config/aarch64/aarch64.md (doloop_end): New pattern.
Index: gcc/config/aarch64/arm_neon.h
===================================================================
--- gcc/config/aarch64/arm_neon.h   (revision 217717)
+++ gcc/config/aarch64/arm_neon.h   (working copy)
@@ -5317,138 +5317,6 @@ vaddlvq_u32 (uint32x4_t a)
   return result;
 }
 
-__extension__ static __inline int8x8_t __attribute__ ((__always_inline__))
-vcls_s8 (int8x8_t a)
-{
-  int8x8_t result;
-  __asm__ ("cls %0.8b,%1.8b"
-   : "=w"(result)
-   : "w"(a)
-   : /* No clobbers */);
-  return result;
-}
-
-__extension__ static __inline int16x4_t __attribute__ ((__always_inline__))
-vcls_s16 (int16x4_t a)
-{
-  int16x4_t result;
-  __asm__ ("cls %0.4h,%1.4h"
-   : "=w"(result)
-   : "w"(a)
-   : /* No clobbers */);
-  return result;
-}
-
-__extension__ static __inline int32x2_t __attribute__ ((__always_inline__))
-vcls_s32 (int32x2_t a)
-{
-  int32x2_t result;
-  __asm__ ("cls %0.2s,%1.2s"
-   : "=w"(result)
-   : "w"(a)
-   : /* No clobbers */);
-  return result;
-}
-
-__extension__ static __inline int8x16_t __attribute__ ((__always_inline__))
-vclsq_s8 (int8x16_t a)
-{
-  int8x16_t result;
-  __asm__ ("cls %0.16b,%1.16b"
-   : "=w"(result)
-   : "w"(a)
-   : /* No clobbers */);
-  return result;
-}
-
-__extension__ static __inline int16x8_t __attribute__ ((__always_inline__))
-vclsq_s16 (int16x8_t a)
-{
-  int16x8_t result;
-  __asm__ ("cls %0.8h,%1.8h"
-   : "=w"(result)
-   : "w"(a)
-   : /* No clobbers */);
-  return result;
-}
-
-__extension__ static __inline int32x4_t __attribute__ ((__always_inline__))
-vclsq_s32 (int32x4_t a)
-{
-  int32x4_t result;
-  __asm__ ("cls %0.4s,%1.4s"
-   : "=w"(result)
-   : "w"(a)
-   : /* No clobbers */);
-  return result;
-}
-
-__extension__ static __inline poly8x8_t __attribute__ ((__always_inline__))
-vcnt_p8 (poly8x8_t a)
-{
-  poly8x8_t result;
-  __asm__ ("cnt %0.8b,%1.8b"
-   : "=w"(result)
-   : "w"(a)
-   : /* No clobbers */);
-  return result;
-}
-
-__extension__ static __inline int8x8_t __attribute__ ((__always_inline__))
-vcnt_s8 (int8x8_t a)
-{
-  int8x8_t result;
-  __asm__ ("cnt %0.8b,%1.8b"
-   : "=w"(result)
-   : "w"(a)
-   : /* No clobbers */);
-  return result;
-}
-
-__extension__ static __inline uint8x8_t __attribute__ ((__always_inline__))
-vcnt_u8 (uint8x8_t a)
-{
-  uint8x8_t result;
-  __asm__ ("cnt %0.8b,%1.8b"
-   : "=w"(result)
-   : "w"(a)
-   : /* No clobbers */);
-  return result;
-}
-
-__extension__ static __inline poly8x16_t __attribute__ ((__always_inline__))
-vcntq_p8 (poly8x16_t a)
-{
-  poly8x16_t result;
-  __asm__ ("cnt %0.16b,%1.16b"
-   : "=w"(result)
-   : "w"(a)
-   : /* No clobbers */);
-  return result;
-}
-
-__extension__ static __inline int8x16_t __attribute__ ((__always_inline__))
-vcntq_s8 (int8x16_t a)
-{
-  int8x16_t result;
-  __asm__ ("cnt %0.16b,%1.16b"
-   : "=w"(result)
-   : "w"(a)
-   : /* No clobbers */);
-  return result;
-}
-
-__extension__ static __inline uint8x16_t __attribute__ ((__always_inline__))
-vcntq_u8 (uint8x16_t a)
-{
-  uint8x16_t result;
-  __asm__ ("cnt %0.16b,%1.16b"
-   : "=w"(result)
-   : "w"(a)
-   : /* No clobbers */);
-  return result;
-}
-
 #define vcopyq_lane_f32(a, b, c, d) \
   __extension__ \
 ({  \
@@ -14082,6 +13950,44 @@ vcltzd_f64 (float64_t __a)
   return __a < 0.0 ? -1ll : 0ll;
 }
 
+/* vcls.  */
+
+__extension__ static __inline int8x8_t __attribute__ ((__always_inline__))
+vcls_s8 (int8x8_t __a)
+{
+  return
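
For reference, each rewritten intrinsic simply forwards to the
corresponding builtin.  A minimal sketch of the expected shape for
vcls_s8 and vcnt_u8 follows; the __builtin_aarch64_clrsbv8qi and
__builtin_aarch64_popcountv8qi names are inferred from the clrsb and
popcount entries added to aarch64-simd-builtins.def and are an
assumption rather than a quote from the patch.

__extension__ static __inline int8x8_t __attribute__ ((__always_inline__))
vcls_s8 (int8x8_t __a)
{
  /* Count leading sign bits via the new clrsb<mode>2 pattern.  */
  return __builtin_aarch64_clrsbv8qi (__a);
}

__extension__ static __inline uint8x8_t __attribute__ ((__always_inline__))
vcnt_u8 (uint8x8_t __a)
{
  /* Population count via the new popcount<mode>2 pattern; the casts only
     reinterpret the vector elements, they do not change any bits.  */
  return (uint8x8_t) __builtin_aarch64_popcountv8qi ((int8x8_t) __a);
}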

[PING ^ 3][PATCH, AArch64] Add doloop_end pattern for -fmodulo-sched

2014-11-17 Thread Yangfei (Felix)
PING?  Is it OK for me to apply this patch?  Thanks.


 
  On 11/12/2014 11:01 AM, Yangfei (Felix) wrote:
    +(define_expand "doloop_end"
    +  [(use (match_operand 0 "" ""))  ; loop pseudo
    +   (use (match_operand 1 "" ""))] ; label
    +  ""
    +  "
    +{
 
   Drop the "" surrounding the { }.
 
 
  r~
 
 
 Hello,
  I updated the patch with the redundant "" removed.  Is it OK for trunk
 now?
 I hope this patch can catch up with stage 1 of GCC-5.0.  Thanks.
 
 
 Index: gcc/ChangeLog
  ===================================================================
 --- gcc/ChangeLog (revision 217394)
 +++ gcc/ChangeLog (working copy)
 @@ -1,3 +1,7 @@
 +2014-11-13  Felix Yang  felix.y...@huawei.com
 +
 + * config/aarch64/aarch64.md (doloop_end): New pattern.
 +
  2014-11-11  Andrew Pinski  apin...@cavium.com
 
   Bug target/61997
 Index: gcc/config/aarch64/aarch64.md
  ===================================================================
 --- gcc/config/aarch64/aarch64.md (revision 217394)
 +++ gcc/config/aarch64/aarch64.md (working copy)
 @@ -4087,6 +4087,47 @@
    [(set_attr "type" "mrs")])
 
 
  +;; Define the subtract-one-and-jump insns so loop.c
  +;; knows what to generate.
  +(define_expand "doloop_end"
  +  [(use (match_operand 0 "" ""))  ; loop pseudo
  +   (use (match_operand 1 "" ""))] ; label
  +  ""
 +{
 +  /* Currently SMS relies on the do-loop pattern to recognize loops
 + where (1) the control part consists of all insns defining and/or
 + using a certain 'count' register and (2) the loop count can be
 + adjusted by modifying this register prior to the loop.
 + ??? The possible introduction of a new block to initialize the
 + new IV can potentially affect branch optimizations.  */
  +  if (optimize > 0 && flag_modulo_sched)
 +{
 +  rtx s0;
 +  rtx bcomp;
 +  rtx loc_ref;
 +  rtx cc_reg;
 +  rtx insn;
 +  rtx cmp;
 +
 +  if (GET_MODE (operands[0]) != DImode)
 + FAIL;
 +
 +  s0 = operands [0];
 +  insn = emit_insn (gen_adddi3_compare0 (s0, s0, GEN_INT (-1)));
 +
 +  cmp = XVECEXP (PATTERN (insn), 0, 0);
 +  cc_reg = SET_DEST (cmp);
 +  bcomp = gen_rtx_NE (VOIDmode, cc_reg, const0_rtx);
 +  loc_ref = gen_rtx_LABEL_REF (VOIDmode, operands [1]);
 +  emit_jump_insn (gen_rtx_SET (VOIDmode, pc_rtx,
 +gen_rtx_IF_THEN_ELSE (VOIDmode, bcomp,
 +  loc_ref, pc_rtx)));
 +  DONE;
 +}
 +  else
 +FAIL;
 +})
 +
  ;; AdvSIMD Stuff
 (include "aarch64-simd.md")
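
As a usage sketch, the expander above only comes into play when modulo
scheduling looks for a do-loop pattern; a simple counted loop such as
the one below, built with -O2 -fmodulo-sched on aarch64, is the kind of
code it targets.  The function is illustrative and not part of the
patch or its testsuite.

/* Illustrative counted loop: the DImode trip count lets SMS use the
   doloop_end pattern when compiled with -O2 -fmodulo-sched on aarch64.  */
void
scale_add (float *__restrict x, const float *__restrict y, float a, long n)
{
  for (long i = 0; i < n; i++)
    x[i] += a * y[i];
}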


