RE: [PATCH PR95696] regrename creates overlapping register allocations for vliw
Hi Richard,

Thanks for reviewing this fix and the detailed suggestions :-)  Looks like my
colleague Yunde was having some issue setting up his local repo.  I have
prepared one for him.  Attached please find the patch.

Bootstrapped and tested on aarch64-linux-gnu.  Please help install if it's
good to go.

Felix

> -----Original Message-----
> From: Richard Sandiford [mailto:richard.sandif...@arm.com]
> Sent: Friday, July 31, 2020 5:33 PM
> To: Zhongyunde
> Cc: gcc-patches@gcc.gnu.org; Yangfei (Felix)
> Subject: Re: [PATCH PR95696] regrename creates overlapping register
> allocations for vliw
>
> Thanks for the update, looks good.  Could you post a changelog too so that I
> can use it when committing?
>
> The changelog was the only reason I didn't just push the patch, but FWIW, a
> couple of very minor things…
>
> Zhongyunde writes:
> > diff --git a/gcc/regrename.c b/gcc/regrename.c
> > old mode 100644
> > new mode 100755
> > index 637b3cbe6d7..815ed22805d
> > --- a/gcc/regrename.c
> > +++ b/gcc/regrename.c
> > @@ -684,10 +684,12 @@ merge_chains (du_head_p c1, du_head_p c2)
> >    c1->cannot_rename |= c2->cannot_rename;
> > }
> >
> > -/* Analyze the current function and build chains for renaming.  */
> > +/* Analyze the current function and build chains for renaming.
> > +   If INCLUDE_ALL_BLOCKS_P is set to true, should process all blocks,
> > +   ignoring BB_DISABLE_SCHEDULE.  The default value is true.  */
>
> I think s/should// here, since GCC comments usually use an imperative style.
>
> > @@ -737,6 +739,14 @@ regrename_analyze (bitmap bb_mask)
> >    if (dump_file)
> >      fprintf (dump_file, "\nprocessing block %d:\n", bb1->index);
> >
> > +  if (!include_all_block_p && (bb1->flags & BB_DISABLE_SCHEDULE) != 0)
> > +    {
> > +      if (dump_file)
> > +        fprintf (dump_file, "avoid disrupting the sms schedule of bb %d\n",
> > +                 bb1->index);
>
> bb1->index should be indented below "dump_file".
>
> Richard

pr95696-v0.diff
Description: pr95696-v0.diff
RE: [PATCH PR95961] vect: ICE: in exact_div, at poly-int.h:2182
Hi,

> -----Original Message-----
> From: Richard Sandiford [mailto:richard.sandif...@arm.com]
> Sent: Thursday, July 2, 2020 5:17 PM
> To: Yangfei (Felix)
> Cc: Richard Biener ; Richard Biener ; gcc-patches@gcc.gnu.org
> Subject: Re: [PATCH PR95961] vect: ICE: in exact_div, at poly-int.h:2182
>
> Cut...
>
> Thanks, pushed to master with a minor whitespace fix for…

Thanks for finding it.

> > +	  nscalars = (STMT_SLP_TYPE (stmt_info)
> > +		      ? vf * DR_GROUP_SIZE (stmt_info) : vf);
>
> …the indentation on this line.  Hope you don't mind, but I also "reflowed"
> the commit message to make it fit within 72 chars.
> (The text itself is the same.)

It's OK. :-)  BTW: is this the rule for gcc git commit message format?
72 chars instead of 80 chars?

Felix
RE: [PATCH PR95961] vect: ICE: in exact_div, at poly-int.h:2182
Hi,

> -----Original Message-----
> From: Richard Sandiford [mailto:richard.sandif...@arm.com]
> Sent: Wednesday, July 1, 2020 9:03 PM
> To: Yangfei (Felix)
> Cc: Richard Biener ; Richard Biener ; gcc-patches@gcc.gnu.org
> Subject: Re: [PATCH PR95961] vect: ICE: in exact_div, at poly-int.h:2182
>
> "Yangfei (Felix)" writes:
> >> On June 30, 2020 4:23:03 PM GMT+02:00, Richard Sandiford wrote:
> >> >Richard Biener writes:
> >> >> So it seems odd to somehow put in the number of vectors... so to
> >> >> me it would have made sense if it did
> >> >>
> >> >>   possible_npeel_number = lower_bound (nscalars);
> >> >>
> >> >> or whatever is necessary to make the polys happy.  Thus simply
> >> >> elide the vect_get_num_vectors call?  But it's been very long
> >> >> since I've dived into the alignment peeling code...
> >> >
> >> >Ah, I see what you mean.  So rather than:
> >> >
> >> >  /* Save info about DR in the hash table.  Also include peeling
> >> >     amounts according to the explanation above.  */
> >> >  for (j = 0; j < possible_npeel_number; j++)
> >> >    {
> >> >      vect_peeling_hash_insert (&peeling_htab, loop_vinfo,
> >> >                                dr_info, npeel_tmp);
> >> >      npeel_tmp += target_align / dr_size;
> >> >    }
> >> >
> >> >just have something like:
> >> >
> >> >  while (known_le (npeel_tmp, nscalars))
> >> >    {
> >> >      …
> >> >    }
> >> >
> >> >?
> >>
> >> Yeah.
> >
> > Not sure if I understand correctly.  I am supposing the following check
> > in the original code is not necessary if we go like that.
> >
> > 1822       if (unlimited_cost_model (LOOP_VINFO_LOOP (loop_vinfo)))
> >
> > Is that correct?
>
> I think we still need it.  I guess there are two choices:
>
> - make nscalars default to npeel_tmp before the "if" above.

I think this will be simpler.  How about the v2 patch?  Bootstrapped and
tested on aarch64-linux-gnu & x86_64-linux-gnu.

Thanks,
Felix

pr95961-v2.diff
Description: pr95961-v2.diff
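The loop shape discussed in the thread above can be sketched in plain C.  This is a hypothetical scalar stand-in (no poly_int, made-up function name), not GCC code: keep adding peeling candidates while the peel amount stays within the number of scalars, instead of precomputing the candidate count with vect_get_num_vectors.

```c
#include <assert.h>

/* Hypothetical scalar sketch of the suggested rewrite: STEP plays the
   role of TARGET_ALIGN / DR_SIZE, and candidates are generated while
   npeel_tmp stays within NSCALARS (the known_le test in the poly_int
   version).  Returns how many candidates would be inserted.  */
static unsigned
count_peel_candidates (unsigned nscalars, unsigned step)
{
  unsigned count = 0;
  for (unsigned npeel_tmp = step; npeel_tmp <= nscalars; npeel_tmp += step)
    count++;  /* stand-in for vect_peeling_hash_insert */
  return count;
}
```

The point of the rewrite is that no exact division is ever needed: when nscalars is smaller than one step, the loop body simply never runs.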
RE: [PATCH PR95961] vect: ICE: in exact_div, at poly-int.h:2182
Hi,

> -----Original Message-----
> From: Richard Biener [mailto:rguent...@suse.de]
> Sent: Tuesday, June 30, 2020 10:50 PM
> To: Richard Sandiford
> Cc: Richard Biener ; Yangfei (Felix) ; gcc-patches@gcc.gnu.org
> Subject: Re: [PATCH PR95961] vect: ICE: in exact_div, at poly-int.h:2182
>
> On June 30, 2020 4:23:03 PM GMT+02:00, Richard Sandiford wrote:
> >Richard Biener writes:
> >> On Tue, 30 Jun 2020, Richard Sandiford wrote:
> >>
> >>> Richard Biener writes:
> >>> > On Tue, Jun 30, 2020 at 2:18 PM Richard Sandiford wrote:
> >>> >>
> >>> >> "Yangfei (Felix)" writes:
> >>> >> > diff --git a/gcc/testsuite/gcc.dg/vect/vect-ctor-1.c b/gcc/testsuite/gcc.dg/vect/vect-ctor-1.c
> >>> >> > index e050db1a2e4..ea39fcac0e0 100644
> >>> >> > --- a/gcc/testsuite/gcc.dg/vect/vect-ctor-1.c
> >>> >> > +++ b/gcc/testsuite/gcc.dg/vect/vect-ctor-1.c
> >>> >> > @@ -1,6 +1,7 @@
> >>> >> >  /* { dg-do compile } */
> >>> >> >  /* { dg-additional-options "-O3" } */
> >>> >> >  /* { dg-additional-options "-mavx2" { target { i?86-*-* x86_64-*-* } } } */
> >>> >> > +/* { dg-additional-options "-march=armv8.2-a+sve -fno-vect-cost-model" { target aarch64*-*-* } } */
> >>> >> >
> >>> >> >  typedef struct {
> >>> >> >    unsigned short mprr_2[5][16][16];
> >>> >>
> >>> >> This test is useful for Advanced SIMD too, so I think we should continue
> >>> >> to test it with whatever options the person running the testsuite chose.
> >>> >> Instead we could duplicate the test in gcc.target/aarch64/sve with
> >>> >> appropriate options.
> >>> >>
> >>> >> > diff --git a/gcc/tree-vect-data-refs.c b/gcc/tree-vect-data-refs.c
> >>> >> > index eb8288e7a85..b30a7d8a3bb 100644
> >>> >> > --- a/gcc/tree-vect-data-refs.c
> >>> >> > +++ b/gcc/tree-vect-data-refs.c
> >>> >> > @@ -1823,8 +1823,11 @@ vect_enhance_data_refs_alignment (loop_vec_info loop_vinfo)
> >>> >> >              {
> >>> >> >                poly_uint64 nscalars = (STMT_SLP_TYPE (stmt_info)
> >>> >> >                                        ? vf * DR_GROUP_SIZE (stmt_info) : vf);
> >>> >> > -              possible_npeel_number
> >>> >> > -                = vect_get_num_vectors (nscalars, vectype);
> >>> >> > +              if (maybe_lt (nscalars, TYPE_VECTOR_SUBPARTS (vectype)))
> >>> >> > +                possible_npeel_number = 0;
> >>> >> > +              else
> >>> >> > +                possible_npeel_number
> >>> >> > +                  = vect_get_num_vectors (nscalars, vectype);
> >>> >> >
> >>> >> >                /* NPEEL_TMP is 0 when there is no misalignment, but also
> >>> >> >                   allow peeling NELEMENTS.  */
> >>> >>
> >>> >> OK, so this is coming from:
> >>> >>
> >>> >>   int s[16][2];
> >>> >>   …
> >>> >>   … = s[j][1];
> >>> >>
> >>> >> and an SLP node containing 16 instances of "s[j][1]".  The DR_GROUP_SIZE
> >>> >> is 2 because that's the inner dimension of "s".
> >>> >>
> >>> >> I don't think maybe_lt is right here though.  The same problem could in
> >>> >> principle happen for cases in which NSCALARS > TYPE_VECTOR_SUBPARTS,
> >>> >> e.g. for different inner dimensions of "s".
> >>> >>
> >>> >> I think the testcase shows that using DR_GROUP_SIZE in this calculation
> >>> >> is flawed.  I'm not sure whether we can really do better given the current
> >>> >> representation though.  This is one case where having a separate dr_vec_info
> >>> >> per SLP node would help.
> >>> >
> >>> > I guess what the code likes to know is what we now have in SLP_TREE_LANES
> >>> > (or formerly group_size) but t
[PATCH PR95961] vect: ICE: in exact_div, at poly-int.h:2182
Hi,

PR: https://gcc.gnu.org/bugzilla/show_bug.cgi?id=95961

In the test case for PR95961, the vectorization factor computed by
vect_determine_vectorization_factor is [8,8].  But this is updated to [1,1]
later by vect_update_vf_for_slp.  When we call vect_get_num_vectors in
vect_enhance_data_refs_alignment, the number of scalars, which is based on
the vectorization factor, is not a multiple of the number of elements in the
vector type.  This leads to the ICE.  We should check that before calling
vect_get_num_vectors and set the local variable 'possible_npeel_number' to
zero if there are too few scalars.

Bootstrapped and tested on aarch64-linux-gnu.  ChangeLog updates are
contained in the patch.  Comments?

Thanks,
Felix

pr95961-v1.diff
Description: pr95961-v1.diff
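The guard proposed above can be distilled into plain C.  This is a hypothetical scalar sketch (made-up function name, ordinary integers rather than GCC's poly_int types): asking for the number of vectors is an exact division, so when the SLP-updated scalar count falls below the number of vector elements we must report zero peeling candidates instead of dividing.

```c
#include <assert.h>

/* Scalar model of the v1 fix: NSCALARS is the number of scalars implied
   by the vectorization factor, NUNITS the number of elements in the
   vector type.  vect_get_num_vectors requires NSCALARS to divide
   exactly; with too few scalars we return 0 rather than ICE.  */
static unsigned
possible_npeel (unsigned nscalars, unsigned nunits)
{
  if (nscalars < nunits)
    return 0;                        /* too few scalars: no peeling */
  assert (nscalars % nunits == 0);   /* exact_div precondition */
  return nscalars / nunits;
}
```

In the PR, the VF drops from [8,8] to [1,1] after vect_update_vf_for_slp, which is exactly the "too few scalars" case the first branch handles.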
[PATCH] vect: Use vect_relevant_for_alignment_p consistently
Hi,

Noticed two places in tree-vect-data-refs.c where we can use the function
vect_relevant_for_alignment_p.  Looks like these two were missed when the
function was introduced.

Bootstrapped and tested on aarch64-linux-gnu.  OK to go?  The ChangeLog
modification is contained in the patch.

Thanks,
Felix

patch-v1.diff
Description: patch-v1.diff
RE: [PATCH] vect: Use LOOP_VINFO_DATAREFS and LOOP_VINFO_DDRS consistently
Hi,

> -----Original Message-----
> From: Richard Biener [mailto:richard.guent...@gmail.com]
> Sent: Monday, June 15, 2020 5:12 PM
> To: Yangfei (Felix)
> Cc: gcc-patches@gcc.gnu.org
> Subject: Re: [PATCH] vect: Use LOOP_VINFO_DATAREFS and
> LOOP_VINFO_DDRS consistently
>
> > Thanks for reviewing this.  Could you please help install it?
>
> Pushed.  Please remember to verify the ChangeLog - I needed to replace
> leading 8 spaces with tabs.

Thanks for the effort.  Looks like the issue was caused by copy-and-paste;
I will pay attention next time.

Felix
RE: [PATCH] vect: Use LOOP_VINFO_DATAREFS and LOOP_VINFO_DDRS consistently
Hi Richard,

> -----Original Message-----
> From: Richard Biener [mailto:richard.guent...@gmail.com]
> Sent: Monday, June 15, 2020 3:25 PM
> To: Yangfei (Felix)
> Cc: gcc-patches@gcc.gnu.org
> Subject: Re: [PATCH] vect: Use LOOP_VINFO_DATAREFS and
> LOOP_VINFO_DDRS consistently
>
> On Sat, Jun 13, 2020 at 4:46 AM Yangfei (Felix) wrote:
> >
> > Hi,
> >
> > This is a minor code refactoring in tree-vect-data-refs.c and
> > tree-vect-loop.c.  Use LOOP_VINFO_DATAREFS and LOOP_VINFO_DDRS when
> > possible and rename several parameters to make the code more consistent.
> >
> > Bootstrapped and tested on aarch64-linux-gnu.  OK?
>
> OK.

Thanks for reviewing this.  Could you please help install it?

Regards,
Felix
[PATCH] vect: Use LOOP_VINFO_DATAREFS and LOOP_VINFO_DDRS consistently
Hi,

This is a minor code refactoring in tree-vect-data-refs.c and
tree-vect-loop.c.  Use LOOP_VINFO_DATAREFS and LOOP_VINFO_DDRS when possible
and rename several parameters to make the code more consistent.

Bootstrapped and tested on aarch64-linux-gnu.  OK?

Thanks,
Felix

gcc/

2020-06-13  Felix Yang

	* tree-vect-data-refs.c (vect_verify_datarefs_alignment): Rename
	parameter to loop_vinfo and update uses.  Use LOOP_VINFO_DATAREFS
	when possible.
	(vect_analyze_data_refs_alignment): Likewise, and use LOOP_VINFO_DDRS
	when possible.
	* tree-vect-loop.c (vect_dissolve_slp_only_groups): Use
	LOOP_VINFO_DATAREFS when possible.
	(update_epilogue_loop_vinfo): Likewise.

patch-v1.diff
Description: patch-v1.diff
RE: [PATCH PR95570] vect: ICE: Segmentation fault in vect_loop_versioning
Hi,

> -----Original Message-----
> From: Richard Sandiford [mailto:richard.sandif...@arm.com]
> Sent: Friday, June 12, 2020 2:29 AM
> To: Yangfei (Felix)
> Cc: gcc-patches@gcc.gnu.org
> Subject: Re: [PATCH PR95570] vect: ICE: Segmentation fault in
> vect_loop_versioning
>
> "Yangfei (Felix)" writes:
> > From 30a0196b0afd45bae9291cfab3fee4ad6b90cbcb Mon Sep 17 00:00:00 2001
> > From: Fei Yang
> > Date: Thu, 11 Jun 2020 19:33:22 +0800
> > Subject: [PATCH] vect: Fix an ICE in vect_loop_versioning [PR95570]
> >
> > In the test case for PR95570, the only data reference in the loop is a
> > gather-scatter access.  Scalar evolution analysis for this data
> > reference failed, so DR_STEP is NULL_TREE.  This leads to the segmentation
> > fault.  We should filter out gather-scatter accesses in
> > vect_enhance_data_refs_alignment.
>
> Looks good, just a couple of very minor details…
>
> > 2020-06-11  Felix Yang
> >
> > gcc/
> > 	PR tree-optimization/95570
> > 	* tree-vect-data-refs.c (vect_relevant_for_alignment_p): New function.
> > 	(vect_verify_datarefs_alignment): Call it to filter out data references
> > 	in the loop whose alignment is irrelevant.
> > 	(vect_get_peeling_costs_all_drs): Likewise.
> > 	(vect_peeling_supportable): Likewise.
> > 	(vect_enhance_data_refs_alignment): Likewise.
> >
> > gcc/testsuite/
> >
> > 	PR tree-optimization/95570
> > 	* gcc.dg/vect/pr95570.c: New test.
> > ---
> >  gcc/testsuite/gcc.dg/vect/pr95570.c | 11
> >  gcc/tree-vect-data-refs.c           | 83 -
> >  2 files changed, 45 insertions(+), 49 deletions(-)
> >  create mode 100644 gcc/testsuite/gcc.dg/vect/pr95570.c
> >
> > diff --git a/gcc/testsuite/gcc.dg/vect/pr95570.c b/gcc/testsuite/gcc.dg/vect/pr95570.c
> > new file mode 100644
> > index 000..b9362614004
> > --- /dev/null
> > +++ b/gcc/testsuite/gcc.dg/vect/pr95570.c
> > @@ -0,0 +1,11 @@
> > +/* { dg-do compile } */
> > +/* { dg-additional-options "-march=armv8.2-a+sve -msve-vector-bits=256
> > +-mstrict-align -fwrapv" { target aarch64*-*-* } } */
> > +
> > +int x[8][32];
> > +
> > +void
> > +foo (int start)
> > +{
> > +  for (int i = start; i < start + 16; i++)
> > +    x[start][i] = i;
> > +}
> > diff --git a/gcc/tree-vect-data-refs.c b/gcc/tree-vect-data-refs.c
> > index 39d5a1b554c..98f6fb76ff9 100644
> > --- a/gcc/tree-vect-data-refs.c
> > +++ b/gcc/tree-vect-data-refs.c
> > @@ -1129,6 +1129,35 @@ vect_update_misalignment_for_peel (dr_vec_info *dr_info,
> >    SET_DR_MISALIGNMENT (dr_info, DR_MISALIGNMENT_UNKNOWN);
> > }
> >
> > +/* Return TRUE if alignment is relevant for DR_INFO.  */
>
> Just "Return true …" for new code.  TRUE is a hold-over from C days.

OK.

> > +static bool
> > +vect_relevant_for_alignment_p (dr_vec_info *dr_info)
> > +{
> > +  stmt_vec_info stmt_info = dr_info->stmt;
> > +
> > +  if (!STMT_VINFO_RELEVANT_P (stmt_info))
> > +    return false;
> > +
> > +  /* For interleaving, only the alignment of the first access
> > +     matters.  */
> > +  if (STMT_VINFO_GROUPED_ACCESS (stmt_info)
> > +      && DR_GROUP_FIRST_ELEMENT (stmt_info) != stmt_info)
> > +    return false;
> > +
> > +  /* For scatter-gather or invariant accesses, alignment is irrelevant
> > +     for them.  */
>
> Maybe:
>
>   /* Scatter-gather and invariant accesses continue to address individual
>      scalars, so vector-level alignment is irrelevant.  */

Much better :-)  Updated accordingly.  Also bootstrapped and tested on
x86_64-linux-gnu.

Thanks,
Felix

gcc/

2020-06-12  Felix Yang

	PR tree-optimization/95570
	* tree-vect-data-refs.c (vect_relevant_for_alignment_p): New function.
	(vect_verify_datarefs_alignment): Call it to filter out data references
	in the loop whose alignment is irrelevant.
	(vect_get_peeling_costs_all_drs): Likewise.
	(vect_peeling_supportable): Likewise.
	(vect_enhance_data_refs_alignment): Likewise.

gcc/testsuite/

2020-06-12  Felix Yang

	PR tree-optimization/95570
	* gcc.dg/vect/pr95570.c: New test.

pr95570-v3.diff
Description: pr95570-v3.diff
RE: [PATCH PR95570] vect: ICE: Segmentation fault in vect_loop_versioning
Hi,

> -----Original Message-----
> From: Richard Sandiford [mailto:richard.sandif...@arm.com]
> Sent: Thursday, June 11, 2020 12:23 AM
> To: Yangfei (Felix)
> Cc: gcc-patches@gcc.gnu.org
> Subject: Re: [PATCH PR95570] vect: ICE: Segmentation fault in
> vect_loop_versioning
>
> "Yangfei (Felix)" writes:
> > Hi,
> >
> > PR: https://gcc.gnu.org/bugzilla/show_bug.cgi?id=95570
> >
> > Here, we are doing loop versioning for alignment.  The only dr here is a
> > gather-scatter operation: x[start][i].  Scalar evolution analysis for
> > this dr failed, so DR_STEP is NULL_TREE, which leads to the segfault.
> > But gather-scatter operations should be filtered out in
> > vect_enhance_data_refs_alignment.  There are similar issues in
> > vect_verify_datarefs_alignment, vect_get_peeling_costs_all_drs and
> > vect_peeling_supportable.  The proposed patch adds back the necessary
> > tests.  Bootstrapped and tested on aarch64-linux-gnu & x86_64-linux-gnu.
> >
> > Test coverage:
> > Existing tests [1] and the newly added test ensure coverage for all the
> > changes except for the changes in vect_peeling_supportable.  Currently I
> > don't have a test to cover the changes in vect_peeling_supportable.
> > Should we keep them?
>
> Rather than add several instances of the new test, I think it would make
> sense to split the (hopefully) correct conditions in
> vect_enhance_data_refs_alignment out into a subroutine and use it in the
> other sites.  Doing that for vect_peeling_supportable would then be
> justifiable as a clean-up.

OK.

> How about something like vect_relevant_for_alignment_p as the name of
> the subroutine?

Nice name.  Does the v2 patch look better?  Bootstrapped and tested on
aarch64-linux-gnu.  The newly added test fails without the fix and passes
otherwise.

gcc/

2020-06-11  Felix Yang

	PR tree-optimization/95570
	* tree-vect-data-refs.c (vect_relevant_for_alignment_p): New function.
	(vect_verify_datarefs_alignment): Call it to filter out data references
	in the loop whose alignment is irrelevant.
	(vect_get_peeling_costs_all_drs): Likewise.
	(vect_peeling_supportable): Likewise.
	(vect_enhance_data_refs_alignment): Likewise.

gcc/testsuite/

2020-06-11  Felix Yang

	PR tree-optimization/95570
	* gcc.dg/vect/pr95570.c: New test.

Thanks,
Felix

pr95570-v2.diff
Description: pr95570-v2.diff
[PATCH PR95570] vect: ICE: Segmentation fault in vect_loop_versioning
Hi,

PR: https://gcc.gnu.org/bugzilla/show_bug.cgi?id=95570

Here, we are doing loop versioning for alignment.  The only dr here is a
gather-scatter operation: x[start][i].  Scalar evolution analysis for this dr
failed, so DR_STEP is NULL_TREE, which leads to the segfault.  But
gather-scatter operations should be filtered out in
vect_enhance_data_refs_alignment.  There are similar issues in
vect_verify_datarefs_alignment, vect_get_peeling_costs_all_drs and
vect_peeling_supportable.  The proposed patch adds back the necessary tests.
Bootstrapped and tested on aarch64-linux-gnu & x86_64-linux-gnu.

Test coverage: existing tests [1] and the newly added test ensure coverage
for all the changes except for those in vect_peeling_supportable.  Currently
I don't have a test to cover the changes in vect_peeling_supportable.
Should we keep them?

Thanks,
Felix

gcc:

2020-06-10  Felix Yang

	PR tree-optimization/95570
	* tree-vect-data-refs.c (vect_verify_datarefs_alignment): Filter
	out gather-scatter and invariant accesses when iterating over all
	data references.
	(vect_get_peeling_costs_all_drs): Likewise.
	(vect_peeling_supportable): Likewise, and filter out irrelevant data
	references.
	(vect_enhance_data_refs_alignment): Likewise when checking if
	versioning for alignment is needed.

gcc/testsuite:

2020-06-10  Felix Yang

	PR tree-optimization/95570
	* gcc.dg/vect/pr95570.c: New test.

[1] gcc.dg/pr50133.c
    gcc.dg/vect/pr49771.c
    gcc.dg/pr87894.c
    gcc.dg/vect/bb-slp-42.c
    gcc.target/aarch64/sve/mask_gather_load_1.c

pr95570-v1.diff
Description: pr95570-v1.diff
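The filter that later became vect_relevant_for_alignment_p can be modelled in standalone C.  This is a simplified sketch with hypothetical flag fields, not GCC's stmt_vec_info: a data reference only matters for alignment if it is relevant, is the first access of its interleaving group, and actually addresses whole vectors rather than individual scalars.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical flattened model of a data reference's properties.  */
struct dr_model
{
  bool relevant_p;        /* STMT_VINFO_RELEVANT_P */
  bool grouped_p;         /* STMT_VINFO_GROUPED_ACCESS */
  bool first_in_group_p;  /* DR_GROUP_FIRST_ELEMENT == stmt */
  bool gather_scatter_p;  /* gather-scatter access (DR_STEP may be NULL) */
  bool invariant_p;       /* loop-invariant access */
};

static bool
relevant_for_alignment_p (const struct dr_model *dr)
{
  if (!dr->relevant_p)
    return false;
  /* For interleaving, only the first access's alignment matters.  */
  if (dr->grouped_p && !dr->first_in_group_p)
    return false;
  /* Gather-scatter and invariant accesses address individual scalars,
     so vector-level alignment is irrelevant.  */
  if (dr->gather_scatter_p || dr->invariant_p)
    return false;
  return true;
}
```

The PR's crash corresponds to the gather_scatter_p case: without the filter, alignment code dereferences a NULL DR_STEP.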
RE: [PATCH PR95254] aarch64: gcc generate inefficient code with fixed sve vector length
Hi,

> -----Original Message-----
> From: Richard Sandiford [mailto:richard.sandif...@arm.com]
> Sent: Tuesday, June 2, 2020 7:17 PM
> To: Yangfei (Felix)
> Cc: gcc-patches@gcc.gnu.org; Uros Bizjak ; Jakub Jelinek ; Hongtao Liu ; H.J. Lu
> Subject: Re: [PATCH PR95254] aarch64: gcc generate inefficient code with
> fixed sve vector length
>
> Snip...
>
> > >> FAIL: gcc.target/i386/avx512f-vcvtps2ph-2.c (test for excess errors)
> > >> UNRESOLVED: gcc.target/i386/avx512f-vcvtps2ph-2.c compilation failed
> > >> to produce executable
> > 154803c154803
>
> Looks good.  (I know I said that last time too :-))  I've also tested it on
> arm-linux-gnueabihf and powerpc64le-linux-gnu without problems.

Thanks for reviewing and testing the patch :-)

> As before, I'll hold off applying until the AVX512 problem is fixed.

Looks like the AVX512 problem is fixed with:
https://gcc.gnu.org/git/gitweb.cgi?p=gcc.git;h=43088bb4dadd3d14b6b594c5f9363fe879f3d7f7

I'm using:
$ runtest --tool gcc i386.exp=avx512f-vcvtps2ph-2.c

Thanks,
Felix
RE: [PATCH PR95459] aarch64: ICE in aarch64_short_vector_p, at config/aarch64/aarch64.c:16803
Hi Richard,

> -----Original Message-----
> From: Richard Sandiford [mailto:richard.sandif...@arm.com]
> Sent: Wednesday, June 3, 2020 1:19 AM
> To: Yangfei (Felix)
> Cc: gcc-patches@gcc.gnu.org
> Subject: Re: [PATCH PR95459] aarch64: ICE in aarch64_short_vector_p, at
> config/aarch64/aarch64.c:16803
>
> "Yangfei (Felix)" writes:
> > Hi,
> >
> > Please review this trivial patch fixing an ICE in aarch64_short_vector_p.
> > Bug: https://gcc.gnu.org/bugzilla/show_bug.cgi?id=95459
> >
> > In aarch64_short_vector_p, we are simply checking whether a type (and a
> > mode) is a 64/128-bit short vector or not.  This should not be affected
> > by the value of TARGET_SVE.  Simply leave later code to report an error
> > if SVE is disabled.  Bootstrapped and tested on aarch64-linux-gnu.  OK?
>
> OK, thanks.  Pushed to master.

Thanks for reviewing and installing the patch :-)

Felix
[PATCH PR95459] aarch64: ICE in aarch64_short_vector_p, at config/aarch64/aarch64.c:16803
Hi,

Please review this trivial patch fixing an ICE in aarch64_short_vector_p.
Bug: https://gcc.gnu.org/bugzilla/show_bug.cgi?id=95459

In aarch64_short_vector_p, we are simply checking whether a type (and a mode)
is a 64/128-bit short vector or not.  This should not be affected by the
value of TARGET_SVE.  Simply leave later code to report an error if SVE is
disabled.

Bootstrapped and tested on aarch64-linux-gnu.  OK?

gcc/ChangeLog

2020-06-02  Felix Yang

	PR target/95459
	* config/aarch64/aarch64.c (aarch64_short_vector_p):
	Leave later code to report an error if SVE is disabled.

gcc/testsuite/ChangeLog

2020-06-02  Felix Yang

	PR target/95459
	* gcc.target/aarch64/mgeneral-regs_6.c: New test.

Thanks,
Felix

pr95459-v0.diff
Description: pr95459-v0.diff
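The argument above reduces to a size check.  This hypothetical distillation (made-up function name, not the real aarch64_short_vector_p) shows the intended behaviour: whether something is a 64- or 128-bit short vector depends only on its size, and the SVE flag is deliberately ignored so that later code, not the predicate, can diagnose a disabled extension.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical sketch: classify a mode purely by size.  Checking
   sve_enabled here (as the pre-patch code effectively did) is what
   caused the ICE; the predicate must stay a pure size question.  */
static bool
short_vector_size_p (unsigned size_in_bytes, bool sve_enabled)
{
  (void) sve_enabled;  /* deliberately not consulted */
  return size_in_bytes == 8 || size_in_bytes == 16;
}
```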
PING: RE: [PATCH PR94026] combine missed opportunity to simplify comparisons with zero
Gentle ping ...

> -----Original Message-----
> From: Yangfei (Felix)
> Sent: Wednesday, May 27, 2020 11:52 AM
> To: 'Segher Boessenkool'
> Cc: gcc-patches@gcc.gnu.org; Zhanghaijian (A)
> Subject: RE: [PATCH PR94026] combine missed opportunity to simplify
> comparisons with zero
>
> Hi,
>
> > -----Original Message-----
> > From: Segher Boessenkool [mailto:seg...@kernel.crashing.org]
> > Sent: Tuesday, May 26, 2020 11:32 PM
> > To: Yangfei (Felix)
> > Cc: gcc-patches@gcc.gnu.org; Zhanghaijian (A)
> > Subject: Re: [PATCH PR94026] combine missed opportunity to simplify
> > comparisons with zero
>
> Snip...
>
> > Yes, please try to get this sorted somehow.  Maybe you can ask other
> > people in your company that have this same problem?
>
> Will try and see.
>
> > > > > +	  new_rtx = gen_rtx_AND (mode, new_rtx,
> > > > > +				 gen_int_mode (mask << real_pos, mode));
> > > > > +	}
> > > >
> > > > So this changes
> > > >   ((X >> C) & M) == ...
> > > > to
> > > >   (X & (M << C)) == ...
> > > > ?
> > > >
> > > > Where then does it check what ... is?  This is only valid like this
> > > > if that is zero.
> > > >
> > > > Why should this go in combine and not in simplify-rtx instead?
> > >
> > > True.  This is only valid when ... is zero.
> > > That's why we need the "&& equality_comparison" condition here.
> >
> > But that doesn't test if the other side of the comparison is 0.
>
> Well, the caller has ensured that.
>
> Here, the local variable "equality_comparison" in
> make_compound_operation_int depends on the parameter "in_code":
>
> 8088   if (in_code == EQ)
> 8089     {
> 8090       equality_comparison = true;
> 8091       in_code = COMPARE;
> 8092     }
>
> The only caller of make_compound_operation_int is make_compound_operation.
> The comment of the caller says something about "in_code":
>
> 8512    IN_CODE says what kind of expression we are processing.  Normally, it is
> 8513    SET.  In a memory address it is MEM.  When processing the arguments of
> 8514    a comparison or a COMPARE against zero, it is COMPARE, or EQ if more
> 8515    precisely it is an equality comparison against zero.  */
>
> For the given test case, we have a call trace of:
>
> (gdb) bt
> #0  make_compound_operation_int (mode=..., x_ptr=0xbd08,
>     in_code=COMPARE, next_code_ptr=0xbd1c) at ../../gcc-git/gcc/combine.c:8248
> #1  0x0208983c in make_compound_operation (x=0xb211c768,
>     in_code=EQ) at ../../gcc-git/gcc/combine.c:8539
> #2  0x020970fc in simplify_comparison (code=NE, pop0=0xc1e8,
>     pop1=0xc1e0) at ../../gcc-git/gcc/combine.c:13032
> #3  0x02084544 in simplify_set (x=0xb211c240) at ../../gcc-git/gcc/combine.c:6932
> #4  0x02082688 in combine_simplify_rtx (x=0xb211c240,
>     op0_mode=E_VOIDmode, in_dest=0, in_cond=0) at ../../gcc-git/gcc/combine.c:6445
> #5  0x0208025c in subst (x=0xb211c240, from=0xb211c138,
>     to=0xb211c150, in_dest=0, in_cond=0, unique_copy=0)
>     at ../../gcc-git/gcc/combine.c:5724
> #6  0x02079110 in try_combine (i3=0xb22cc3c0, i2=0xb22cc340,
>     i1=0x0, i0=0x0, new_direct_jump_p=0xceb4,
>     last_combined_insn=0xb22cc3c0) at ../../gcc-git/gcc/combine.c:3413
> #7  0x02073004 in combine_instructions (f=0xb211d038,
>     nregs=103) at ../../gcc-git/gcc/combine.c:1305
> #8  0x0209cc50 in rest_of_handle_combine () at ../../gcc-git/gcc/combine.c:15088
>
> In simplify_comparison (combine.c:13032):
>
> 13028   rtx_code op0_mco_code = SET;
> 13029   if (op1 == const0_rtx)
> 13030     op0_mco_code = code == NE || code == EQ ? EQ : COMPARE;
> 13031
> 13032   op0 = make_compound_operation (op0, op0_mco_code);
> 13033   op1 = make_compound_operation (op1, SET);
>
> > > > > --- /dev/null
> > > > > +++ b/gcc/testsuite/gcc.dg/pr94026.c
> > > > > @@ -0,0 +1,21 @@
> > > > > +/* { dg-do compile { target aarch64*-*-* i?86-*-* x86_64-*-* } } */
> > > >
> > > > Why restrict this to only some targets?
> > >
> > > That's because I only have these targets for verification.
> > > But I think this can work on other targets.  Removed from the v4 patch.
> > > Could you please help check
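The rewrite debated above is easy to sanity-check outside the compiler.  This illustrative C snippet (not GCC code, hypothetical helper names) verifies that ((X >> C) & M) == 0 and (X & (M << C)) == 0 always agree: both test the same bits of X, just at different positions, which is exactly why the transform is only valid as an equality comparison against zero and not as a value-preserving rewrite.

```c
#include <assert.h>
#include <stdint.h>

/* Left-hand side of the transform: shift first, then mask.  */
static int
lhs_is_zero (uint32_t x, unsigned c, uint32_t m)
{
  return ((x >> c) & m) == 0;
}

/* Right-hand side: mask against the shifted mask instead.  Note the
   two expressions are NOT equal as values -- (x & (m << c)) is the
   left-hand value shifted back up by c -- but their zero-ness is.  */
static int
rhs_is_zero (uint32_t x, unsigned c, uint32_t m)
{
  return (x & (m << c)) == 0;
}
```

An exhaustive check over small x, c, and m confirms the two predicates never disagree, matching Segher's point that the other side of the comparison must be zero for the rewrite to be safe.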
RE: [PATCH PR95254] aarch64: gcc generate inefficient code with fixed sve vector length
Hi,

> -----Original Message-----
> From: Richard Sandiford [mailto:richard.sandif...@arm.com]
> Sent: Monday, June 1, 2020 4:47 PM
> To: Yangfei (Felix)
> Cc: gcc-patches@gcc.gnu.org; Uros Bizjak ; Jakub Jelinek ; Hongtao Liu ; H.J. Lu
> Subject: Re: [PATCH PR95254] aarch64: gcc generate inefficient code with
> fixed sve vector length
>
> Snip...
>
> Sounds good.  Maybe at this point the x_inner and y_inner code is getting
> complicated enough to put into a lambda too:
>
>   x_inner = ... (x);
>   y_inner = ... (y);
>
> Just a suggestion though.

Yes, that's a good suggestion.  I see the code becomes cleaner with another
lambda.

> Yeah, looks good.
>
> Formatting nit though: multi-line conditions should be wrapped in (...),
> i.e.:
>
>   return (...
>           && ...
>           && ...);

Done.

The v6 patch is based on trunk 20200601.  Bootstrapped and tested on
aarch64-linux-gnu.  Also bootstrapped on x86_64-linux-gnu with
--enable-multilib (for building -m32 x86 libgcc).  Regression test on
x86_64-linux-gnu looks good except for the following failures, which have
been confirmed by the x86 devs:

> FAIL: gcc.target/i386/avx512f-vcvtps2ph-2.c (test for excess errors)
> UNRESOLVED: gcc.target/i386/avx512f-vcvtps2ph-2.c compilation failed to
> produce executable
154803c154803

Thanks,
Felix

pr95254-v6.diff
Description: pr95254-v6.diff
RE: [PATCH PR95254] aarch64: gcc generate inefficient code with fixed sve vector length
Hi,

> -----Original Message-----
> From: Richard Sandiford [mailto:richard.sandif...@arm.com]
> Sent: Sunday, May 31, 2020 12:01 AM
> To: Yangfei (Felix)
> Cc: gcc-patches@gcc.gnu.org; Uros Bizjak ; Jakub Jelinek ; Hongtao Liu ; H.J. Lu
> Subject: Re: [PATCH PR95254] aarch64: gcc generate inefficient code with
> fixed sve vector length
>
> Snip...
>
> > The v5 patch attached addressed this issue.
> >
> > There are two added changes compared with the v4 patch:
> > 1. In candidate_mem_p, mov_optab for innermode should be available.
> >    In this case, mov_optab for SDmode is not there and subregs are added
> >    back by emit_move_insn_1.  So we won't get the benefit with the patch.
>
> I agree we should have this check.  I think the rule applies to all of the
> transforms though, not just the mem one, so we should add the check to the
> register and constant cases too.

OK.  I changed to make this an extra condition for calculating x_inner &
y_inner.

> > 2. Instead of using adjust_address, I changed to use adjust_address_nv to
> >    avoid the emit of invalid insn 13.  The latter call to validize_mem ()
> >    in emit_move_insn will take care of the address for us.
>
> The validation performed by validize_mem is the same as that performed by
> adjust_address, so the only case this should make a difference is for
> push_operands:

True.

>   /* If X or Y are memory references, verify that their addresses are valid
>      for the machine.  */
>   if (MEM_P (x)
>       && (! memory_address_addr_space_p (GET_MODE (x), XEXP (x, 0),
>                                          MEM_ADDR_SPACE (x))
>           && ! push_operand (x, GET_MODE (x))))
>     x = validize_mem (x);
>
>   if (MEM_P (y)
>       && ! memory_address_addr_space_p (GET_MODE (y), XEXP (y, 0),
>                                         MEM_ADDR_SPACE (y)))
>     y = validize_mem (y);
>
> So I think the fix is to punt on push_operands instead (and continue to use
> adjust_address rather than adjust_address_nv).

Not sure if I understand it correctly.  Do you mean excluding push_operand
in candidate_mem_p?  Like:

  auto candidate_mem_p = [&](machine_mode innermode, rtx mem) {
    return (!targetm.can_change_mode_class (innermode, GET_MODE (mem), ALL_REGS)
	    && !push_operand (mem, GET_MODE (mem))
	    /* Not a candidate if innermode requires too much alignment.  */
	    && (MEM_ALIGN (mem) >= GET_MODE_ALIGNMENT (innermode)
		|| targetm.slow_unaligned_access (GET_MODE (mem),
						  MEM_ALIGN (mem))
		|| !targetm.slow_unaligned_access (innermode, MEM_ALIGN (mem))));
  };

Thanks,
Felix
RE: [PATCH PR95254] aarch64: gcc generate inefficient code with fixed sve vector length
Hi,

> -----Original Message-----
> From: Yangfei (Felix)
> Sent: Friday, May 29, 2020 2:56 PM
> To: 'Hongtao Liu' ; H.J. Lu
> Cc: gcc-patches@gcc.gnu.org; Uros Bizjak ; Jakub Jelinek ; Richard Sandiford
> Subject: RE: [PATCH PR95254] aarch64: gcc generate inefficient code with
> fixed sve vector length
>
> Snip...
>
> Yes, I tried your configure and reproduced the error.  Thanks for pointing
> this out.
> The patch can pass bootstrap on x86_64 with the following configure
> options.  Surprised to see that it failed to build with your configure.

Turns out that this ICE triggers under x86 -m32.  Command to reproduce:

~/build-gcc/x86_64-pc-linux-gnu/32/libgcc$ gcc -g -O2 -m32 -O2 -g -O2 -DIN_GCC -W -Wall -Wno-narrowing -Wwrite-strings -Wcast-qual -Wstrict-prototypes -Wmissing-prototypes -Wold-style-definition -isystem ./include -fpic -mlong-double-80 -DUSE_ELF_SYMVER -fcf-protection -mshstk -g -DIN_LIBGCC2 -fbuilding-libgcc -fno-stack-protector -fpic -mlong-double-80 -DUSE_ELF_SYMVER -fcf-protection -mshstk -I. -I. -I../../.././gcc -I../../../../gcc-git/libgcc -I../../../../gcc-git/libgcc/. -I../../../../gcc-git/libgcc/../gcc -I../../../../gcc-git/libgcc/../include -I../../../../gcc-git/libgcc/config/libbid -DENABLE_DECIMAL_BID_FORMAT -DHAVE_CC_TLS -DUSE_TLS -o _isinfd32.o -MT _isinfd32.o -MD -MP -MF _isinfd32.dep -c ../../../../gcc-git/libgcc/config/libbid/_isinfd32.c

Source:

int
isinfd32 (_Decimal32 x) {
  int res;
  UINT64 x64;
  union decimal32 ux;

  ux.d = x;
  x64 = __bid32_to_bid64 (ux.i);
  res = __bid64_isInf (x64);
  return (res);
}

On the crash site (emit_single_push_insn), we have three insns:

(gdb) p debug_rtx (prev)
(insn 12 0 13 (parallel [
            (set (reg/f:SI 7 sp)
                (plus:SI (reg/f:SI 7 sp)
                    (const_int -12 [0xfff4])))
            (clobber (reg:CC 17 flags))
        ]) "../../../../gcc-git/libgcc/config/libbid/_isinfd32.c":35:9 -1
     (expr_list:REG_ARGS_SIZE (const_int 12 [0xc])
        (nil)))
$1 = void
(gdb) p debug_rtx (last)
(insn 14 13 0 (set (mem:SI (reg/f:SI 87) [0 S4 A32])
        (subreg:SI (reg/v:SD 86 [ x ]) 0)) "../../../../gcc-git/libgcc/config/libbid/_isinfd32.c":35:9 -1
     (nil))
$2 = void
(gdb) p debug_rtx (PREV_INSN (last))
(insn 13 12 14 (set (reg/f:SI 87)
        (pre_dec:SI (reg/f:SI 7 sp))) "../../../../gcc-git/libgcc/config/libbid/_isinfd32.c":35:9 -1
     (nil))
$3 = void

Here, insn 13 is invalid.  It is emitted by:

  x = adjust_address (x, GET_MODE (y_inner), 0);

The v5 patch attached addressed this issue.  There are two added changes
compared with the v4 patch:

1. In candidate_mem_p, mov_optab for innermode should be available.  In this
   case, mov_optab for SDmode is not there and subregs are added back by
   emit_move_insn_1.  So we won't get the benefit with the patch.
2. Instead of using adjust_address, I changed to use adjust_address_nv to
   avoid the emit of the invalid insn 13.  The latter call to validize_mem ()
   in emit_move_insn will take care of the address for us.

Bootstrapped and tested on aarch64-linux-gnu.  I am running another test on
x86.  Richard, could you please take a further look?

Thanks,
Felix

pr95254-v5.diff
Description: pr95254-v5.diff
RE: [PATCH PR95254] aarch64: gcc generate inefficient code with fixed sve vector length
Hi, > -Original Message- > From: Hongtao Liu [mailto:crazy...@gmail.com] > Sent: Friday, May 29, 2020 11:24 AM > To: H.J. Lu > Cc: Yangfei (Felix) ; gcc-patches@gcc.gnu.org; > Uros Bizjak ; Jakub Jelinek ; > Richard Sandiford > Subject: Re: [PATCH PR95254] aarch64: gcc generate inefficient code with > fixed sve vector length > Snip... > > > > > > This is due to define_subst magic. The generators automatically > > > create a vec_merge form of the instruction based on the information > > > in the attributes. > > > > > > AFAICT the rtl above is for the line-125 instruction, which looks ok. > > > The problem is the line-126 instruction, since vcvtps2ph doesn't > > > AIUI allow zero masking. > > > > > zero masking is not allowed for mem_operand here, but available for > register_operand. > there's something wrong in the pattern, we need to fix it. > (define_insn "avx512f_vcvtps2ph512" Thanks for confirming that :-) > > > > The "mask" define_subst allows both zeroing and merging, so I guess > > > this means that the pattern should either be using a different > > > define_subst, or should be enforcing merging in some other way. > > > Please could one of the x86 devs take a look? > > > > > > > Hongtao, can you take a look? > > > > Thanks. > > > > > > -- > > H.J. > > BTW, i failed to build gcc when apply pr95254-v4.txt. > > gcc configure: > > Using built-in specs. > COLLECT_GCC=./gcc/xgcc > Target: x86_64-pc-linux-gnu > Configured with: ../../gcc/gnu-toolchain/gcc/configure > --enable-languages=c,c++,fortran --disable-bootstrap Thread model: posix > Supported LTO compression algorithms: zlib gcc version 11.0.0 20200528 > (experimental) (GCC) > > host on x86_64 rel8. Yes, I tried your configure and reproduced the error. Thanks for pointing this out. The patch can pass bootstrap on x86_64 with the following configure options. Surprised to see that it failed to build with your configure. Using built-in specs. 
COLLECT_GCC=gcc COLLECT_LTO_WRAPPER=/home/yangfei/gcc-hacking/install-gcc/libexec/gcc/x86_64-pc-linux-gnu/11/lto-wrapper Target: x86_64-pc-linux-gnu Configured with: ../gcc-git/configure --prefix=/home/yangfei/gcc-hacking/install-gcc --enable-languages=c,c++,objc,obj-c++,fortran,lto --enable-shared --enable-threads=posix --enable-checking=yes --disable-multilib --with-system-zlib --enable-__cxa_atexit --disable-libunwind-exceptions --enable-gnu-unique-object --enable-linker-build-id --with-gcc-major-version-only --enable-plugin --enable-initfini-array --without-isl --disable-libmpx --enable-gnu-indirect-function Thread model: posix Supported LTO compression algorithms: zlib gcc version 11.0.0 20200526 (experimental) (GCC) Felix
RE: [PATCH PR95254] aarch64: gcc generate inefficient code with fixed sve vector length
Hi, > -Original Message- > From: Richard Sandiford [mailto:richard.sandif...@arm.com] > Sent: Thursday, May 28, 2020 12:07 AM > To: Yangfei (Felix) > Cc: gcc-patches@gcc.gnu.org > Subject: Re: [PATCH PR95254] aarch64: gcc generate inefficient code with > fixed sve vector length > Snip... > > Ah, OK. But in that case, shouldn't we allow the change if the original > unaligned MEM was also “slow”? > > I guess there might be cases in which both modes are slow enough for the > hook to return true for them, but one is worse than the other. > But I don't think there's much we can do about that as things stand: > changing the mode might move from a slow mode to a slower mode, but it > might move in the other direction too. Good point. > > +2020-05-27 Felix Yang > > + Richard Sandiford > > I appreciate the gesture, but I don't think it's appropriate to list me as an > author. I haven't written any of the code, I've just reviewed it. :-) OK. > > diff --git a/gcc/expr.c b/gcc/expr.c > > index dfbeae71518..3035791c764 100644 > > --- a/gcc/expr.c > > +++ b/gcc/expr.c > > @@ -3814,6 +3814,69 @@ emit_move_insn (rtx x, rtx y) > >gcc_assert (mode != BLKmode > > && (GET_MODE (y) == mode || GET_MODE (y) == VOIDmode)); > > > > + /* If we have a copy which looks like one of the following patterns: > > s/which/that/ (I think) OK. > > + (set (subreg:M1 (reg:M2 ...)) (subreg:M1 (reg:M2 ...))) > > + (set (subreg:M1 (reg:M2 ...)) (mem:M1 ADDR)) > > + (set (mem:M1 ADDR) (subreg:M1 (reg:M2 ...))) > > + (set (subreg:M1 (reg:M2 ...)) (constant C)) > > + where mode M1 is equal in size to M2 and target hook > can_change_mode_class > > + (M1, M2, ALL_REGS) returns false, try to remove the subreg. This > avoids > > + an implicit round trip through memory. */ > > How about: > > where mode M1 is equal in size to M2, try to detect whether the > mode change involves an implicit round trip through memory. > If so, see if we can avoid that by removing the subregs and > doing the move in mode M2 instead. 
*/

> > > + else if (x_inner != NULL_RTX
> > > +          && MEM_P (y)
> > > +          && ! targetm.can_change_mode_class (GET_MODE (x_inner),
> > > +                                              mode, ALL_REGS)
> > > +          /* Stop if the inner mode requires too much alignment.  */
> > > +          && (! targetm.slow_unaligned_access (GET_MODE (x_inner),
> > > +                                               MEM_ALIGN (y))
> > > +              || MEM_ALIGN (y) >= GET_MODE_ALIGNMENT (GET_MODE (x_inner
>
> It's better to check the alignment first, since it's cheaper.
> So taking the comment above into account, I think this ends up as:
>
>   && (MEM_ALIGN (y) >= GET_MODE_ALIGNMENT (GET_MODE (x_inner))
>       || targetm.slow_unaligned_access (mode, MEM_ALIGN (y)
>       || !targetm.slow_unaligned_access (GET_MODE (x_inner), MEM_ALIGN (y))
>
> (Note: no space after "!", although the sources aren't as consistent about
> that as they could be.)

OK.

> TBH I think it would be good to avoid duplicating such a complicated condition
> in both directions, so at the risk of getting flamed, how about using a
> lambda?
>
>   auto candidate_mem_p = [&](machine_mode inner_mode, rtx mem) {
>     return ...;
>   };
>
> with ... containing everything after the MEM_P check?

Yes, this avoids duplicating code.

> Looks good otherwise, thanks,

Thanks for reviewing this. Attached please find the v5 patch.

Note: we also need to modify the local variable "mode" once we catch one of these cases. I see a test failure without this change.

Bootstrapped and tested on aarch64-linux-gnu. Also bootstrapped on x86_64-linux-gnu. Regression tests show two failures on that platform:

1> FAIL: gcc.target/i386/avx512f-vcvtps2ph-2.c (test for excess errors)
2> FAIL: gcc.target/i386/pr67609.c scan-assembler movdqa

I have adjusted the second one in the v4 patch. But the first one looks strange to me.

I see gcc emitting invalid x86 vcvtps2ph instructions that look like:

125 vcvtps2ph $0, %zmm0, -112(%rbp){%k1}
126 vcvtps2ph $0, %zmm0, -80(%rbp){%k1}{z}

This happens in the combine phase, where a combined insn looks like:

1989 Trying 31 -> 33:
1990 31: r106:V16HI=vec_merge(unspec[r103:V16SF,0] 133,[frame:DI-0x60],r109:HI)
1991 33: [frame:DI-0x60]=r106:V16HI
1992       REG_DEAD r106:V16HI
1993 Successfully matched this instruction:
1994 (set (mem/j/c:V16HI (plus:DI (reg/f:DI 19 frame)
1995
RE: [PATCH PR95254] aarch64: gcc generate inefficient code with fixed sve vector length
Hi, > -Original Message- > From: Richard Sandiford [mailto:richard.sandif...@arm.com] > Sent: Tuesday, May 26, 2020 11:58 PM > To: Yangfei (Felix) > Cc: gcc-patches@gcc.gnu.org > Subject: Re: [PATCH PR95254] aarch64: gcc generate inefficient code with > fixed sve vector length > > Sorry for the slow reply, was off for a few days. > > I think the new code ought to happen earlier in emit_move_insn, before: > > if (CONSTANT_P (y)) > { > > That way, all the canonicalisation happens on the mode we actually want the > move to have. OK. That’s a good point. > "Yangfei (Felix)" writes: > > diff --git a/gcc/expr.c b/gcc/expr.c > > index dfbeae71518..4442fb83367 100644 > > --- a/gcc/expr.c > > +++ b/gcc/expr.c > > @@ -3852,6 +3852,62 @@ emit_move_insn (rtx x, rtx y) > > > >gcc_assert (mode != BLKmode); > > > > + rtx x_inner = NULL_RTX; > > + rtx y_inner = NULL_RTX; > > + machine_mode x_inner_mode, y_inner_mode; > > + > > + if (SUBREG_P (x) > > + && REG_P (SUBREG_REG (x)) > > + && known_eq (SUBREG_BYTE (x), 0)) > > +{ > > + x_inner = SUBREG_REG (x); > > + x_inner_mode = GET_MODE (x_inner); > > +} > > + if (SUBREG_P (y) > > + && REG_P (SUBREG_REG (y)) > > + && known_eq (SUBREG_BYTE (y), 0)) > > +{ > > + y_inner = SUBREG_REG (y); > > + y_inner_mode = GET_MODE (y_inner); > > +} > > The later code is only interested in SUBREG_REGs that are the same size as > "mode", so I think it would make sense to check that in the "if"s above > instead of checking SUBREG_BYTE. (SUBREG_BYTE is always zero when the > modes are the same size, but the reverse is not true.) > > It might also be better to avoid [xy]_inner_mode and just use GET_MODE > where necessary. > > It would be good to have a block comment above the code to explain what > we're doing. Good suggestion. Done. > > + if (x_inner != NULL_RTX > > + && y_inner != NULL_RTX > > + && x_inner_mode == y_inner_mode > > + && known_eq (GET_MODE_SIZE (x_inner_mode), GET_MODE_SIZE > (mode)) > > + && ! 
targetm.can_change_mode_class (x_inner_mode, mode, ALL_REGS))
> > +    {
> > +      x = x_inner;
> > +      y = y_inner;
> > +    }
> > +  else if (x_inner != NULL_RTX && CONSTANT_P (y)
>
> Formatting nit: one subcondition per line when the condition spans multiple
> lines.

OK.

> > +       && known_eq (GET_MODE_SIZE (x_inner_mode), GET_MODE_SIZE (mode))
> > +       && ! targetm.can_change_mode_class (x_inner_mode, mode, ALL_REGS)
> > +       && targetm.legitimate_constant_p (x_inner_mode, y))
>
> This call isn't valid, since the mode has to match the rtx. ("y" still has mode
> "mode" at this point.) I think instead we should just do:
>
>   && (y_inner = simplify_subreg (GET_MODE (x_inner), y, mode, 0))
>
> to convert the constant, and use it if the result is nonnull.
> The existing CONSTANT_P emit_move_insn code will handle cases in which
> the new constant isn't legitimate.

Good catch. Done.

> > +    {
> > +      x = x_inner;
> > +    }
> > +  else if (x_inner != NULL_RTX && MEM_P (y)
> > +       && known_eq (GET_MODE_SIZE (x_inner_mode), GET_MODE_SIZE (mode))
> > +       && ! targetm.can_change_mode_class (x_inner_mode, mode, ALL_REGS)
> > +       && (! targetm.slow_unaligned_access (x_inner_mode, MEM_ALIGN (y))
> > +           || MEM_ALIGN (y) >= GET_MODE_ALIGNMENT (x_inner_mode)))
>
> What is the last condition protecting against? Seems worth a comment.

Comment added. Here I intend to avoid generating a slow unaligned memory access. Machine modes like VNx2HImode may have a smaller alignment than modes like V4HI. For the given test case, SLP forces the alignment of the VNx2HImode memory access to be 32 bytes. In theory, we may have other cases where the alignment of the inner mode is bigger than that of the outer mode.

Attached please find the v3 patch. Bootstrapped and tested on aarch64-linux-gnu. Does it look better?
gcc/ChangeLog:

+2020-05-27  Felix Yang
+            Richard Sandiford
+
+	PR target/95254
+	* expr.c (emit_move_insn): If we have a copy of which src and/or dest
+	is a subreg, try to remove the subreg when innermode and outermode are
+	equal in size and targetm.can_change_mode_class (outermode, innermode,
+	ALL_REGS) returns false.

testsuite/ChangeLog:

+2020-05-27  Felix Yang
+            Richard Sandiford
+
+	PR target/95254
+	* gcc.target/aarch64/pr95254.c: New test.

Thanks,
Felix

pr95254-v3.diff
Description: pr95254-v3.diff
RE: [PATCH PR94026] combine missed opportunity to simplify comparisons with zero
Hi,

> -Original Message-
> From: Segher Boessenkool [mailto:seg...@kernel.crashing.org]
> Sent: Tuesday, May 26, 2020 11:32 PM
> To: Yangfei (Felix)
> Cc: gcc-patches@gcc.gnu.org; Zhanghaijian (A)
> Subject: Re: [PATCH PR94026] combine missed opportunity to simplify
> comparisons with zero

Snip...

> > Yes, please try to get this sorted somehow. Maybe you can ask other people
> in your company that have this same problem?

Will try and see.

> > > > + new_rtx = gen_rtx_AND (mode, new_rtx,
> > > > +                        gen_int_mode (mask << real_pos, mode));
> > > > + }
> > >
> > > So this changes
> > >   ((X >> C) & M) == ...
> > > to
> > >   (X & (M << C)) == ...
> > > ?
> > >
> > > Where then does it check what ... is? This is only valid like this if
> > > that is zero.
> > >
> > > Why should this go in combine and not in simplify-rtx instead?
> >
> > True. This is only valid when ... is zero.
> > That's why we need the "&& equality_comparison" condition here.
>
> But that doesn't test if the other side of the comparison is 0.

Well, the caller has ensured that. Here, the local variable "equality_comparison" in make_compound_operation_int depends on the parameter "in_code":

8088       if (in_code == EQ)
8089         {
8090           equality_comparison = true;
8091           in_code = COMPARE;
8092         }

The only caller of make_compound_operation_int is make_compound_operation, whose comment says this about "in_code":

8512    IN_CODE says what kind of expression we are processing. Normally, it is
8513    SET. In a memory address it is MEM. When processing the arguments of
8514    a comparison or a COMPARE against zero, it is COMPARE, or EQ if more
8515    precisely it is an equality comparison against zero.  */

For the given test case, we have a call trace of:

(gdb) bt
#0  make_compound_operation_int (mode=..., x_ptr=0xbd08, in_code=COMPARE, next_code_ptr=0xbd1c) at ../../gcc-git/gcc/combine.c:8248
#1  0x0208983c in make_compound_operation (x=0xb211c768, in_code=EQ) at ../../gcc-git/gcc/combine.c:8539
#2  0x020970fc in simplify_comparison (code=NE, pop0=0xc1e8, pop1=0xc1e0) at ../../gcc-git/gcc/combine.c:13032
#3  0x02084544 in simplify_set (x=0xb211c240) at ../../gcc-git/gcc/combine.c:6932
#4  0x02082688 in combine_simplify_rtx (x=0xb211c240, op0_mode=E_VOIDmode, in_dest=0, in_cond=0) at ../../gcc-git/gcc/combine.c:6445
#5  0x0208025c in subst (x=0xb211c240, from=0xb211c138, to=0xb211c150, in_dest=0, in_cond=0, unique_copy=0) at ../../gcc-git/gcc/combine.c:5724
#6  0x02079110 in try_combine (i3=0xb22cc3c0, i2=0xb22cc340, i1=0x0, i0=0x0, new_direct_jump_p=0xceb4, last_combined_insn=0xb22cc3c0) at ../../gcc-git/gcc/combine.c:3413
#7  0x02073004 in combine_instructions (f=0xb211d038, nregs=103) at ../../gcc-git/gcc/combine.c:1305
#8  0x0209cc50 in rest_of_handle_combine () at ../../gcc-git/gcc/combine.c:15088

In simplify_comparison (combine.c:13032):

13028   rtx_code op0_mco_code = SET;
13029   if (op1 == const0_rtx)
13030     op0_mco_code = code == NE || code == EQ ? EQ : COMPARE;
13031
13032   op0 = make_compound_operation (op0, op0_mco_code);
13033   op1 = make_compound_operation (op1, SET);

> > > > --- /dev/null
> > > > +++ b/gcc/testsuite/gcc.dg/pr94026.c
> > > > @@ -0,0 +1,21 @@
> > > > +/* { dg-do compile { target aarch64*-*-* i?86-*-* x86_64-*-* } } */
> > >
> > > Why restrict this to only some targets?
> >
> > That's because I only have these targets for verification.
> > But I think this can work on other targets. Removed from the v4 patch.
> > Could you please help check the other ports?
>
> In general, you should never restrict anything to some targets simply
> because you haven't tested it on other targets.
> > If it is a good test it will just work on those other targets. Traffic on > gcc- > testresults@ will show you if it actually does. OK. Thanks for pointing this out :-) > > > > +/* { dg-options "-O2 -fdump-rtl-combine" } */ > > > > + > > > > +int > > > > +foo (int c) > > > > +{ > > > > + int a = (c >> 8) & 7; > > > > + > > > > + if (a >= 2) { > > > > +return 1; > > > > + } > > > > + > > > > + return 0; > > > > +} > > > > + > > > > +/* The combine phas
RE: [PATCH PR94026] combine missed opportunity to simplify comparisons with zero
Hi, > -Original Message- > From: Segher Boessenkool [mailto:seg...@kernel.crashing.org] > Sent: Tuesday, May 26, 2020 12:27 AM > To: Yangfei (Felix) > Cc: gcc-patches@gcc.gnu.org; Zhanghaijian (A) > Subject: Re: [PATCH PR94026] combine missed opportunity to simplify > comparisons with zero Snip... > > I am using Outlook and I didn't find the place to change the MIME type > > : - ( > > The simplest option is to use a different email client, one that plays nicely > with others. You use git, maybe you could even use git-send-email? The bad news is that it would be hard to switch to a different email client with my company's IT policy :-( But I think I can ask IT if that is possible. Sorry for the trouble. > I'll paste things manually... > > > From a19238c02c1e6ab9593a14a13e1e3dff90ed Mon Sep 17 00:00:00 > 2001 > > From: Fei Yang > > Date: Mon, 25 May 2020 10:19:30 +0800 > > Subject: [PATCH] combine: missed opportunity to simplify comparisons > > with zero [PR94026] > > (Capital "M" on "Missed" please) > > But, the subject should say what the patch *does*. So maybe > combine: Simplify more comparisons with zero (PR94026) OK. > > If we have (and (lshiftrt X C) M) and M is a constant that would > > select a field of bits within an item, but not the entire word, fold > > this into a simple AND if we are in an equality comparison against zero. > > But that subject doesn't really describe what the patch does, anyway? OK. Modified in the v4 patch. Does it look better? > > gcc/ > > PR rtl-optimization/94026 > > * combine.c (make_compound_operation_int): If we have (and > > (lshiftrt X C) M) and M is a constant that would select a field > > of bits within an item, but not the entire word, fold this into > > a simple AND if we are in an equality comparison. > > > > gcc/testsuite/ > > PR rtl-optimization/94026 > > * gcc.dg/pr94026.c: New test. 
> > > --- a/gcc/ChangeLog
> > > +++ b/gcc/ChangeLog
> > > @@ -1,3 +1,11 @@
> > > +2020-05-25  Felix Yang
> > > +
> > > +	PR rtl-optimization/94026
> > > +	* combine.c (make_compound_operation_int): If we have (and
> > > +	(lshiftrt X C) M) and M is a constant that would select a field
> > > +	of bits within an item, but not the entire word, fold this into
> > > +	a simple AND if we are in an equality comparison.
>
> Don't put the changelog in the patch.

OK. I paste it here:

gcc/ChangeLog

+2020-05-26  Felix Yang
+
+	PR rtl-optimization/94026
+	* combine.c (make_compound_operation_int): If we have (and
+	(lshiftrt X C) M) and M is a constant that would select a field
+	of bits within an item, but not the entire word, fold this into
+	a simple AND if we are in an equality comparison.

gcc/testsuite/ChangeLog

+2020-05-26  Felix Yang
+
+	PR rtl-optimization/94026
+	* gcc.dg/pr94026.c: New test.

> > diff --git a/gcc/combine.c b/gcc/combine.c
> > index b044f29fd36..76d62b0bd17 100644
> > --- a/gcc/combine.c
> > +++ b/gcc/combine.c
> > @@ -8178,6 +8178,10 @@ make_compound_operation_int (scalar_int_mode mode, rtx *x_ptr,
> >    if (!CONST_INT_P (XEXP (x, 1)))
> >      break;
> >
> > +  HOST_WIDE_INT pos;
> > +  unsigned HOST_WIDE_INT len;
> > +  pos = get_pos_from_mask (UINTVAL (XEXP (x, 1)), );
>
>   unsigned HOST_WIDE_INT len;
>   HOST_WIDE_INT pos = get_pos_from_mask (UINTVAL (XEXP (x, 1)), );

> > @@ -8231,6 +8235,22 @@ make_compound_operation_int (scalar_int_mode mode, rtx *x_ptr,
> >      new_rtx = make_compound_operation (new_rtx, in_code);
> >    }
> >
> > +  /* If we have (and (lshiftrt X C) M) and M is a constant that would select
> > +     a field of bits within an item, but not the entire word, this might be
> > +     representable by a simple AND if we are in an equality comparison.  */
> > +  else if (pos > 0 && equality_comparison
>
> That "&& equality_comparison" should be on a separate line as well.

OK.
> > +	   && GET_CODE (XEXP (x, 0)) == LSHIFTRT
> > +	   && CONST_INT_P (XEXP (XEXP (x, 0), 1))
> > +	   && pos + UINTVAL (XEXP (XEXP (x, 0), 1))
> > +	      <= GET_MODE_BITSIZE (mode))
> > +	{
> > +	  new_rtx = make_compound_operation (XEXP (XEXP (x, 0), 0), next_code);
> > +	  HOST_WIDE_INT real_pos = pos + UINTVAL (XEXP (XEXP (x, 0), 1));
> > +	  unsigned HOST_WIDE_INT mask = ((unsigned HOST_WIDE_INT)1 << len) - 1;
>
> Space after cast
RE: [PATCH PR94026] combine missed opportunity to simplify comparisons with zero
Hi, > -Original Message- > From: Segher Boessenkool [mailto:seg...@kernel.crashing.org] > Sent: Saturday, May 23, 2020 10:57 PM > To: Yangfei (Felix) > Cc: gcc-patches@gcc.gnu.org; Zhanghaijian (A) > Subject: Re: [PATCH PR94026] combine missed opportunity to simplify > comparisons with zero > > Hi! > > Sorry this is taking so long. > > On Wed, May 06, 2020 at 08:57:52AM +, Yangfei (Felix) wrote: > > > On Tue, Mar 24, 2020 at 06:30:12AM +, Yangfei (Felix) wrote: > > > > I modified combine emitting a simple AND operation instead of > > > > making one > > > zero_extract for this scenario. > > > > Attached please find the new patch. Hope this solves both of our > concerns. > > > > > > This looks promising. I'll try it out, see what it does on other > > > targets. (It will have to wait for GCC 11 stage 1, of course). > > It creates better code on all targets :-) A quite small improvement, but not > entirely trivial. Thanks for the effort. It's great to hear that :- ) Attached please find the v3 patch. Rebased on the latest trunk. Bootstrapped and tested on aarch64-linux-gnu. Could you please help install it? > > > p.s. Please use a correct mime type? application/octet-stream > > > isn't something I can reply to. Just text/plain is fine :-) > > > > I have using plain text now, hope that works for you. :-) > > Nope: > > [-- Attachment #2: pr94026-v2.diff --] > [-- Type: application/octet-stream, Encoding: base64, Size: 5.9K --] This time I switched to use UUEncode type for the attachment. Does it work? I am using Outlook and I didn't find the place to change the MIME type : - ( Felix pr94026-v3.diff Description: pr94026-v3.diff
RE: [PATCH PR95254] aarch64: gcc generate inefficient code with fixed sve vector length
Hi Richard, Thanks for the suggestions. > -Original Message- > From: Richard Sandiford [mailto:richard.sandif...@arm.com] > Sent: Thursday, May 21, 2020 5:22 PM > To: Yangfei (Felix) > Cc: gcc-patches@gcc.gnu.org > Subject: Re: [PATCH PR95254] aarch64: gcc generate inefficient code with > fixed sve vector length > > "Yangfei (Felix)" writes: > > Hi, > > > > Notice a tiny SVE-related performance issue: > > https://gcc.gnu.org/bugzilla/show_bug.cgi?id=95254 > > > > For the given test case, SLP succeeds with VNx8HImode with or without > option -msve-vector-bits=256. > > The root cause for the difference is that we choose a different mode in > aarch64_vectorize_related_mode under -msve-vector-bits=256: > VNx2HImode instead of V4HImode. > > Then in the final tree ssa forwprop pass, we need to do a VIEW_CONVERT > from V4HImode to VNx2HImode. > > > > PATCH catch and simplify the pattern in > aarch64_expand_sve_mem_move, emitting a mov pattern of V4HImode > instead. > > I am assuming endianness does not make a difference here considering > this simplification. > > Bootstrap and tested on aarch64-linux-gnu. Comments? > > I think we should try to catch this at the gimple level, possibly during SLP > itself. Agreed. It's better if this can be handled in SLP. For the given test case, the difference reflect itself after the final ssa forwprop. So I guess it might be hard for SLP to catch and evaluate in the vect cost model. > Although the patch handles the case in which the V4HI is stored directly to > memory, I assume it won't help if the code instead does: > > for (i = 0; i < 4; i++) > b[i] = u.a[i] + 1; SLP failed if you modify like that. > targetm.can_change_mode_class (..., ALL_REGS) would be a good indicator > of whether the needed VIEW_CONVERT_EXPR is cheap or expensive. > > I agree it might still be worth handling this in the move patterns too. 
> It feels like a target-independent optimisation though, and for example > should also apply to V4HI moves involving subregs of VNx2HIs. > > So I think it would be worth trying to do this in emit_move_insn. > In principle it would be useful for: > > // M1 and M2 equal size, !targetm.can_change_mode_class (M1, M2, > ALL_REGS) > > (set (subreg:M1 (reg:M2 ...)) (subreg:M1 (reg:M2 ...))) > (set (subreg:M1 (reg:M2 ...)) (mem:M1 ADDR)) > (set (mem:M1 ADDR) (subreg:M1 (reg:M2 ...))) > (set (subreg:M1 (reg:M2 ...)) (constant C)) > > It would be nice if we could do this even without the > can_change_mode_class condition, provided that it doesn't turn valid M1 > constants or MEMs into invalid M2 ones (or at least, M2 ones that need to be > validated). > Unfortunately, going that far is likely to interfere with target-specific > choices, > so it's probably too dangerous. > > With the can_change_mode_class condition it should be fine though, since > it's avoiding an implicit round trip through memory. The change should be a > win even if the MEMs or constants need legitimising for M2. Good suggestion. I worked out a v2 patch with the logic moved to in emit_move_insn. For (set (subreg:M1 (reg:M2 ...)) (constant C)) case, I am simply requiring C should also be a legitimate constant for M2. It works for the test case. I will do a full test for the v2 patch if you like it. > > [...] 
> > + if (inner != NULL_RTX > > + && aarch64_classify_vector_mode (inner_mode) == VEC_ADVSIMD > > + && GET_MODE_INNER (mode) == GET_MODE_INNER (inner_mode) > > + && known_eq (GET_MODE_SIZE (mode), GET_MODE_SIZE > (inner_mode)) > > + && GET_MODE_BITSIZE (inner_mode).to_constant () <= alignment) > > +{ > > + rtx addr, mem; > > + if (MEM_P (src)) > > + { > > + addr = XEXP (src, 0); > > + mem = gen_rtx_MEM (inner_mode, addr); > > + emit_move_insn (inner, mem); > > + } > > + else > > + { > > + addr = XEXP (dest, 0); > > + mem = gen_rtx_MEM (inner_mode, addr); > > + emit_move_insn (mem, inner); > > gen_rtx_MEM shouldn't be used to create new versions of existing MEMs, > since it drops all the attributes. It's better to use something like > adjust_address instead. That will also take care of making sure that the > address is valid for the new mode. It was my fault. I should have realized that difference. :-) Felix pr95254-v2.diff Description: pr95254-v2.diff
[PATCH PR95254] aarch64: gcc generate inefficient code with fixed sve vector length
Hi,

I noticed a tiny SVE-related performance issue: https://gcc.gnu.org/bugzilla/show_bug.cgi?id=95254

For the given test case, SLP succeeds with VNx8HImode with or without the option -msve-vector-bits=256. The root cause for the difference is that we choose a different mode in aarch64_vectorize_related_mode under -msve-vector-bits=256: VNx2HImode instead of V4HImode. Then, in the final tree ssa forwprop pass, we need to do a VIEW_CONVERT from V4HImode to VNx2HImode.

The patch catches and simplifies the pattern in aarch64_expand_sve_mem_move, emitting a mov pattern of V4HImode instead. I am assuming endianness does not make a difference here, considering this simplification.

Bootstrapped and tested on aarch64-linux-gnu. Comments?

Thanks,
Felix

pr95254-v1.diff
Description: pr95254-v1.diff
RE: [PATCH PR94991] aarch64: ICE: Segmentation fault with option -mgeneral-regs-only
Hi,

> -Original Message-
> From: Richard Sandiford [mailto:richard.sandif...@arm.com]
> Sent: Monday, May 11, 2020 10:27 PM
> To: Yangfei (Felix)
> Cc: gcc-patches@gcc.gnu.org
> Subject: Re: [PATCH PR94991] aarch64: ICE: Segmentation fault with option
> -mgeneral-regs-only
>
> LGTM. Pushed with one minor formatting fix:
>
> > @@ -1364,7 +1364,11 @@
> >     if (!TARGET_FLOAT)
> >       {
> >         aarch64_err_no_fpadvsimd (mode);
> > -       FAIL;
> > +       machine_mode intmode
> > +         = int_mode_for_size (GET_MODE_BITSIZE (mode), 0).require ();
>
> The "=" should only be indented by two spaces relative to the first line.

Thanks for fixing the bad formatting :-) I was expecting issues like that to be reported by contrib/check_GNU_style.sh. Will be more careful.

Felix
[PATCH PR94991] aarch64: ICE: Segmentation fault with option -mgeneral-regs-only
Hi,

I witnessed another ICE with the option -mgeneral-regs-only and have created a bug for it: https://gcc.gnu.org/bugzilla/show_bug.cgi?id=94991

For the given testcase, we FAIL in the scalar floating-point move expand pattern, since TARGET_FLOAT is false with -mgeneral-regs-only. But a move expand pattern cannot FAIL. It would be better to replace the FAIL with code that bitcasts to the equivalent integer mode, using gen_lowpart.

Bootstrapped and tested on aarch64-linux-gnu. Comments?

Thanks,
Felix

pr94991-v1.diff
Description: pr94991-v1.diff
RE: [PATCH PR94026] combine missed opportunity to simplify comparisons with zero
Hi,

> -Original Message-
> From: Segher Boessenkool [mailto:seg...@kernel.crashing.org]
> Sent: Tuesday, March 24, 2020 10:58 PM
> To: Yangfei (Felix)
> Cc: gcc-patches@gcc.gnu.org; Zhanghaijian (A)
> Subject: Re: [PATCH PR94026] combine missed opportunity to simplify
> comparisons with zero
>
> On Tue, Mar 24, 2020 at 06:30:12AM +, Yangfei (Felix) wrote:
> > I modified combine emitting a simple AND operation instead of making one
> > zero_extract for this scenario.
> > Attached please find the new patch. Hope this solves both of our concerns.
>
> This looks promising. I'll try it out, see what it does on other targets. (It will
> have to wait for GCC 11 stage 1, of course).

I see GCC 11 stage 1 is open for commits now. I have rebased the patch on the latest repo. Attached please find the v2 patch. Bootstrapped and tested on x86-64-linux-gnu and aarch64-linux-gnu. Is this good to go?

> p.s. Please use a correct mime type? application/octet-stream isn't
> something I can reply to. Just text/plain is fine :-)

I am using plain text now, hope that works for you. :-)

Thanks,
Felix

pr94026-v2.diff
Description: pr94026-v2.diff
RE: [PATCH PR94784] ICE: in simplify_vector_constructor, at tree-ssa-forwprop.c:2482
> -Original Message-
> From: Richard Sandiford [mailto:richard.sandif...@arm.com]
> Sent: Monday, April 27, 2020 6:10 PM
> To: Yangfei (Felix)
> Cc: gcc-patches@gcc.gnu.org
> Subject: Re: [PATCH PR94784] ICE: in simplify_vector_constructor, at
> tree-ssa-forwprop.c:2482
>
> > Good suggestion. Modified accordingly.
> >
> >> LGTM otherwise, and sorry for the breakage.
> >
> > Does the v2 patch look better?
> >
> > Manually ran the following three tests with runtest:
> > gcc.target/aarch64/sve/acle/general/pr94683.c
> > gcc.target/aarch64/sve/acle/general/pr94700.c
> > gcc.dg/pr94784.c
>
> Thanks, pushed to master after testing on aarch64-linux-gnu.

My local bootstrap and regression tests for the v2 patch have just finished. The results look good too. Thanks for the efforts :-)

Best regards,
Felix
RE: [PATCH PR94784] ICE: in simplify_vector_constructor, at tree-ssa-forwprop.c:2482
> -Original Message- > From: Richard Sandiford [mailto:richard.sandif...@arm.com] > Sent: Monday, April 27, 2020 3:51 PM > To: Yangfei (Felix) > Cc: gcc-patches@gcc.gnu.org > Subject: Re: [PATCH PR94784] ICE: in simplify_vector_constructor, at tree- > ssa-forwprop.c:2482 > > "Yangfei (Felix)" writes: > > Hi, > > > > PR tree-optimization/94269 > > * tree-ssa-math-opts.c (convert_plusminus_to_widen): Restrict > > - this > > - operation to single basic block. > > + this operation to single basic block. > > Not wrong, but might as well leave the entry as-is. OK. Leave it there. > > > We don't need to fall back to constant comparisons here. > We should assert for known_eq between the TYPE_VECTOR_SUBPARTS > instead. > > Same for the other assert. Good suggestion. Modified accordingly. > LGTM otherwise, and sorry for the breakage. Does the v2 patch look better? Manually run the following three tests with runtest: gcc.target/aarch64/sve/acle/general/pr94683.c gcc.target/aarch64/sve/acle/general/pr94700.c gcc.dg/pr94784.c Thanks for your help, Felix pr94784-v2.diff Description: pr94784-v2.diff
[PATCH PR94784] ICE: in simplify_vector_constructor, at tree-ssa-forwprop.c:2482
Hi,

I see one gcc_assert was introduced in: https://gcc.gnu.org/pipermail/gcc-patches/2020-April/544271.html

This is causing an ICE in certain cases. I have created a PR for it: https://gcc.gnu.org/bugzilla/show_bug.cgi?id=94784

I did some checking and it looks like everything works fine before the ICE. In the testcase we have two vectors with the same ABI identity but with different TYPE_MODEs. The proposed patch flips the assert around so that it checks that the two vectors have equal TYPE_VECTOR_SUBPARTS and that converting the corresponding element types is a useless_type_conversion_p.

Bootstrapped and tested on aarch64-linux-gnu. OK?

Thanks,
Felix

pr94784-v1.diff
Description: pr94784-v1.diff
[PATCH] aarch64: add tests for CPP predefines under -mgeneral-regs-only
Hi, I noticed that gcc.target/aarch64/pragma_cpp_predefs_1.c performs testing for -mgeneral-regs-only. This patch adds similar testing to the following two tests to make sure CPP predefine redefinitions via #pragma work as expected when the -mgeneral-regs-only option is specified (see PR94678): gcc.target/aarch64/pragma_cpp_predefs_2.c gcc.target/aarch64/pragma_cpp_predefs_3.c The two tests pass with the modification. OK? gcc/testsuite/ PR target/94678 * gcc.target/aarch64/pragma_cpp_predefs_2.c: Fix typos, pop_pragma -> pop_options. Add tests for general-regs-only. * gcc.target/aarch64/pragma_cpp_predefs_3.c: Add tests for general-regs-only. Thanks for your help, Felix add-sve-predef-tests-v1.diff Description: add-sve-predef-tests-v1.diff
RE: [PATCH PR94678] aarch64: unexpected result with -mgeneral-regs-only and sve
Hi, > -Original Message- > From: Richard Sandiford [mailto:richard.sandif...@arm.com] > Sent: Wednesday, April 22, 2020 6:03 PM > To: Yangfei (Felix) > Cc: gcc-patches@gcc.gnu.org > Subject: Re: [PATCH PR94678] aarch64: unexpected result with -mgeneral- > regs-only and sve > > Mostly LGTM, just a couple of minor points: Thanks for the very careful code review. :-) I think the revised patch fixed these points. GCC builds OK and the newly added test still works. Please help push if it's good to go. Felix 0001-aarch64-unexpected-result-with-mgeneral-regs-only-an.patch Description: 0001-aarch64-unexpected-result-with-mgeneral-regs-only-an.patch
RE: [PATCH PR94678] aarch64: unexpected result with -mgeneral-regs-only and sve
> -Original Message- > From: Richard Sandiford [mailto:richard.sandif...@arm.com] > Sent: Tuesday, April 21, 2020 6:11 PM > To: Yangfei (Felix) > Cc: gcc-patches@gcc.gnu.org > Subject: Re: [PATCH PR94678] aarch64: unexpected result with -mgeneral- > regs-only and sve > > Should aarch64_sve::init_builtins ()/aarch64_sve::handle_arm_sve_h () be > guarded by TARGET_SVE? > > > > Could you please confirm that? > > Yeah, that's right. As Jakub says, the SVE stuff is (deliberately) registered > unconditionally because it's possible to switch SVE on and off later. Also, > protecting it with just TARGET_SVE would mean that we'd continue to > register SVE2 functions even if SVE2 isn't currently enabled. Well, I was thinking maybe we can call aarch64_sve::init_builtins () in aarch64_pragma_target_parse when SVE is switched on. But I think I will leave this alone. > So what matters is whether SVE is enabled at the point of use, not the point > of the #include. FWIW, arm_neon.h works the same way: the same > functions are defined regardless of what the current prevailing architecture > is, > and what matters is whether the necessary features are enabled when the > functions are called. (Inlining fails if they're not.) But because we're > implementing arm_sve.h directly in the compiler, we don't need the > overhead of a full target switch when defining the functions. > > And like you say, the second of the above tests makes sure that turning SVE > on later does indeed work, while the first makes sure that we get an > appropriate error if SVE is disabled at the point of use. Thanks for the details, it helps :-) I have modified accordingly and attached please find the adapted v2 patch. Bootstrap and tested on aarch64 Linux platform. Does it look better? Note that we need to disable TARGET_GENERAL_REGS_ONLY before register_builtin_types. 
Otherwise we got ICEs like: : internal compiler error: in register_builtin_types, at config/aarch64/aarch64-sve-builtins.cc:3336 0x185bcfb register_builtin_types ../../gcc-git/gcc/config/aarch64/aarch64-sve-builtins.cc:3332 0x185c467 aarch64_sve::init_builtins() ../../gcc-git/gcc/config/aarch64/aarch64-sve-builtins.cc:3375 0x17c075b aarch64_init_builtins ../../gcc-git/gcc/config/aarch64/aarch64.c:13086 , where TYPE_MODE (vectype) is BLKmode. The reason is targetm.vector_mode_supported_p (mode) and have_regs_of_mode[mode] are false when TARGET_GENERAL_REGS_ONLY is enabled. (gdb) bt #0 vector_type_mode (t=0xb7aa2f18) at ../../gcc-git/gcc/tree.c:13825 #1 0x01297ee0 in layout_type (type=0xb7aa2f18) at ../../gcc-git/gcc/stor-layout.c:2400 #2 0x016e19d8 in make_vector_type (innertype=0xb7aa2a80, nunits=..., mode=E_VNx16BImode) at ../../gcc-git/gcc/tree.c:9984 #3 0x016e79b4 in build_truth_vector_type_for_mode (nunits=..., mask_mode=E_VNx16BImode) at ../../gcc-git/gcc/tree.c:10929 #4 0x0185bb00 in aarch64_sve::register_builtin_types () at ../../gcc-git/gcc/config/aarch64/aarch64-sve-builtins.cc:3331 #5 0x0185c468 in aarch64_sve::init_builtins () at ../../gcc-git/gcc/config/aarch64/aarch64-sve-builtins.cc:3375 #6 0x017c075c in aarch64_init_builtins () at ../../gcc-git/gcc/config/aarch64/aarch64.c:13086 #7 0x00a69cb4 in c_define_builtins (va_list_ref_type_node=0xb79ea540, va_list_arg_type_node=0xb79e79d8) at ../../gcc-git/gcc/c-family/c-common.c:3973 Felix pr94678-v2.patch Description: pr94678-v2.patch
RE: [PATCH PR94678] aarch64: unexpected result with -mgeneral-regs-only and sve
> -Original Message- > From: Richard Sandiford [mailto:richard.sandif...@arm.com] > Sent: Tuesday, April 21, 2020 4:01 PM > To: Yangfei (Felix) > Cc: gcc-patches@gcc.gnu.org > Subject: Re: [PATCH PR94678] aarch64: unexpected result with -mgeneral- > regs-only and sve > > "Yangfei (Felix)" writes: > > Hi, > > > > It looks like there are several issues out there for sve codegen with - > mgeneral-regs-only. > > I have created a bug for that: > > https://gcc.gnu.org/bugzilla/show_bug.cgi?id=94678 > > > > We do ISA extension checks for SVE in > check_required_extensions(aarch64-sve-builtins.cc). > > I think we may also need to check -mgeneral-regs-only there and issue an > error message when this option is specified. > > This would be cheaper as compared with adding && > TARGET_GENERAL_REGS_ONLY to TARGET_SVE and similar macros. > > We should probably do both. > > The middle end should never try to use vector patterns when the vector > modes have been disabled by !have_regs_of_mode. But it's still wrong for > the target to provide patterns that would inevitably lead to spill failure > due to > lack of registers. So I think we should check !TARGET_GENERAL_REGS_ONLY > in TARGET_SVE. Yes, that's right. And I have a question here: Should aarch64_sve::init_builtins ()/aarch64_sve::handle_arm_sve_h () be guarded by TARGET_SVE? I mean in aarch64_init_builtins: if (TARGET_SVE) aarch64_sve::init_builtins (); and in aarch64_pragma_aarch64: if (TARGET_SVE) aarch64_sve::handle_arm_sve_h (); It looks to me that this is not wanted from the following two tests: ./gcc.target/aarch64/sve/acle/general/nosve_1.c ./gcc.target/aarch64/sve/acle/general/nosve_2.c Could you please confirm that? > I guess the main danger is for instructions like ADDVL, ADDPL and CNT[BHWD] > that do actually operate on general registers. Perhaps there'll be weird > corner cases in which the compiler wants to know the VL even for -mgeneral- > regs. I can't think of one off-hand though. 
> > If that becomes a problem, we can introduce a second macro to control the > general register operations. But I think we can safely assume for now that > one macro is enough. OK, let's see. > > Attached please find the proposed patch. Bootstrap and tested on > aarch64 Linux platform. > > Suggestions? > > > > Thanks, > > Felix > > > > diff --git a/gcc/ChangeLog b/gcc/ChangeLog index 721928d..c109d7b > > 100644 > > --- a/gcc/ChangeLog > > +++ b/gcc/ChangeLog > > @@ -1,3 +1,11 @@ > > +2020-04-21 Felix Yang > > + > > + PR target/94678 > > + * config/aarch64/aarch64-sve-builtins.cc > (check_required_extensions): > > + Add check for TARGET_GENERAL_REGS_ONLY. > > + (report_missing_extension): Print different error message under > > + TARGET_GENERAL_REGS_ONLY. > > + > > 2020-04-20 Andreas Krebbel > > > > * config/s390/vector.md ("popcountv8hi2_vx", "popcountv4si2_vx") > > diff --git a/gcc/config/aarch64/aarch64-sve-builtins.cc > > b/gcc/config/aarch64/aarch64-sve-builtins.cc > > index ca4a0eb..4a77db7 100644 > > --- a/gcc/config/aarch64/aarch64-sve-builtins.cc > > +++ b/gcc/config/aarch64/aarch64-sve-builtins.cc > > @@ -649,11 +649,21 @@ report_missing_extension (location_t location, > tree fndecl, > >if (reported_missing_extension_p) > > return; > > > > - error_at (location, "ACLE function %qD requires ISA extension %qs", > > - fndecl, extension); > > - inform (location, "you can enable %qs using the command-line" > > - " option %<-march%>, or by using the %" > > - " attribute or pragma", extension); > > + if (TARGET_GENERAL_REGS_ONLY) > > +{ > > + error_at (location, "ACLE function %qD requires ISA extension %qs" > > + " which is incompatible with the use of %qs", > > + fndecl, extension, "-mgeneral-regs-only"); > > +} > > + else > > +{ > > + error_at (location, "ACLE function %qD requires ISA extension %qs", > > + fndecl, extension); > > + inform (location, "you can enable %qs using the command-line" > > + " option %<-march%>, or by using the %" > > + " attribute or 
pragma", extension); > > +} > > + > >reported_missing_extension_p = true; } > > > > @@ -666,7 +676,14 @@ check_required_extensions (location_t location, > > tree fndecl, { > >uint64_t m
[PATCH PR94678] aarch64: unexpected result with -mgeneral-regs-only and sve
Hi, It looks like there are several issues out there for SVE codegen with -mgeneral-regs-only. I have created a bug for that: https://gcc.gnu.org/bugzilla/show_bug.cgi?id=94678 We do ISA extension checks for SVE in check_required_extensions (aarch64-sve-builtins.cc). I think we may also need to check -mgeneral-regs-only there and issue an error message when this option is specified. This would be cheaper compared with adding && !TARGET_GENERAL_REGS_ONLY to TARGET_SVE and similar macros. Attached please find the proposed patch. Bootstrapped and tested on the aarch64 Linux platform. Suggestions? Thanks, Felix pr94678-v1.patch Description: pr94678-v1.patch
RE: [PATCH] ICE: in vectorizable_load, at tree-vect-stmts.c:9173
Hi! > -Original Message- > From: Richard Sandiford [mailto:richard.sandif...@arm.com] > Sent: Tuesday, March 31, 2020 4:55 PM > To: Yangfei (Felix) > Cc: gcc-patches@gcc.gnu.org; rguent...@suse.de > Subject: Re: [PATCH] ICE: in vectorizable_load, at tree-vect-stmts.c:9173 > > > > Yes, I have modified accordingly. Attached please find the adapted patch. > > Bootstrapped and tested on aarch64-linux-gnu. Newly add test will fail > without the patch and pass otherwise. > > Looks good. OK for master. Thanks for reviewing this. > > I think I need a sponsor if this patch can go separately. > > Yeah, please fill in the form on: > >https://sourceware.org/cgi-bin/pdw/ps_form.cgi > > listing me as sponsor. Hmm, I already have an account : - ) But my networking does not work well and I am having some trouble committing the patch. Could you please help commit this patch? Felix
RE: [PATCH] ICE: in vectorizable_load, at tree-vect-stmts.c:9173
Hi! > -Original Message- > From: Richard Sandiford [mailto:richard.sandif...@arm.com] > Sent: Monday, March 30, 2020 8:08 PM > To: Yangfei (Felix) > Cc: gcc-patches@gcc.gnu.org; rguent...@suse.de > Subject: Re: [PATCH] ICE: in vectorizable_load, at tree-vect-stmts.c:9173 > > "Yangfei (Felix)" writes: > > Hi! > >> -----Original Message- > >> From: Yangfei (Felix) > >> Sent: Monday, March 30, 2020 5:28 PM > >> To: gcc-patches@gcc.gnu.org > >> Cc: 'rguent...@suse.de' > >> Subject: [PATCH] ICE: in vectorizable_load, at tree-vect-stmts.c:9173 > >> > >> Hi, > >> > >> New bug: https://gcc.gnu.org/bugzilla/show_bug.cgi?id=94398 > >> > >> With -mstrict-align, aarch64_builtin_support_vector_misalignment will > >> returns false when misalignment factor is unknown at compile time. > >> Then vect_supportable_dr_alignment returns dr_unaligned_unsupported, > >> which triggers the ICE. I have pasted the call trace on the bug report. > >> > >> vect_supportable_dr_alignment is expected to return either dr_aligned > >> or dr_unaligned_supported for masked operations. > >> But it seems that this function only catches internal_fn > >> IFN_MASK_LOAD & IFN_MASK_STORE. > >> We are emitting a mask gather load here for this test case. > >> As backends have their own vector misalignment support policy, I am > >> supposing this should be better handled in the auto-vect shared code. > >> > > > > I can only reply to comment on the bug here as my account got locked by the > bugzilla system for now. > > Sorry to hear that. What was the reason? Looks like it got filtered by spamassassin. Admin has helped unlocked it. > > The way Richard (rsandifo) suggests also works for me and I agree it should > be more consistent and better for compile time. > > One question here: when will a IFN_MASK_LOAD/IFN_MASK_STORE be > passed to vect_supportable_dr_alignment? > > I'd expect that to happen in the same cases as for unmasked load/stores. 
> It certainly will for unconditional loads and stores that get masked via > full-loop > masking. > > In principle, the same rules apply to both masked and unmasked accesses. > But for SVE, both masked and unmasked accesses should support misalignment, > which is why I think the current target hook is probably wrong for SVE + > -mstrict-align. I stopped looking into the backend further when I saw no distinction for different type of access in the target hook aarch64_builtin_support_vector_misalignment. > > @@ -8051,8 +8051,15 @@ vectorizable_store (stmt_vec_info stmt_info, > gimple_stmt_iterator *gsi, > >auto_vec dr_chain (group_size); > >oprnds.create (group_size); > > > > - alignment_support_scheme > > -= vect_supportable_dr_alignment (first_dr_info, false); > > + /* Strided accesses perform only component accesses, alignment > > + is irrelevant for them. */ > > + if (STMT_VINFO_STRIDED_P (first_dr_info->stmt) > > + && !STMT_VINFO_GROUPED_ACCESS (first_dr_info->stmt)) > > I think this should be based on memory_access_type == > VMAT_GATHER_SCATTER instead. At this point, STMT_VINFO_* describes > properties of the original scalar access(es) while memory_access_type > describes the vector implementation strategy. It's the latter that matters > here. > > Same thing for loads. Yes, I have modified accordingly. Attached please find the adapted patch. Bootstrapped and tested on aarch64-linux-gnu. Newly add test will fail without the patch and pass otherwise. I think I need a sponsor if this patch can go separately. Thanks, Felix pr94398-v1.patch Description: pr94398-v1.patch
RE: [PATCH] ICE: in vectorizable_load, at tree-vect-stmts.c:9173
Hi! > -Original Message- > From: Yangfei (Felix) > Sent: Monday, March 30, 2020 5:28 PM > To: gcc-patches@gcc.gnu.org > Cc: 'rguent...@suse.de' > Subject: [PATCH] ICE: in vectorizable_load, at tree-vect-stmts.c:9173 > > Hi, > > New bug: https://gcc.gnu.org/bugzilla/show_bug.cgi?id=94398 > > With -mstrict-align, aarch64_builtin_support_vector_misalignment will returns > false when misalignment factor is unknown at compile time. > Then vect_supportable_dr_alignment returns dr_unaligned_unsupported, > which triggers the ICE. I have pasted the call trace on the bug report. > > vect_supportable_dr_alignment is expected to return either dr_aligned or > dr_unaligned_supported for masked operations. > But it seems that this function only catches internal_fn IFN_MASK_LOAD & > IFN_MASK_STORE. > We are emitting a mask gather load here for this test case. > As backends have their own vector misalignment support policy, I am supposing > this should be better handled in the auto-vect shared code. > I can only reply to comment on the bug here as my account got locked by the bugzilla system for now. The way Richard (rsandifo) suggests also works for me and I agree it should be more consistent and better for compile time. One question here: when will a IFN_MASK_LOAD/IFN_MASK_STORE be passed to vect_supportable_dr_alignment? New patch: diff --git a/gcc/tree-vect-stmts.c b/gcc/tree-vect-stmts.c index 12beef6..2825023 100644 --- a/gcc/tree-vect-stmts.c +++ b/gcc/tree-vect-stmts.c @@ -8051,8 +8051,15 @@ vectorizable_store (stmt_vec_info stmt_info, gimple_stmt_iterator *gsi, auto_vec dr_chain (group_size); oprnds.create (group_size); - alignment_support_scheme -= vect_supportable_dr_alignment (first_dr_info, false); + /* Strided accesses perform only component accesses, alignment + is irrelevant for them. 
*/ + if (STMT_VINFO_STRIDED_P (first_dr_info->stmt) + && !STMT_VINFO_GROUPED_ACCESS (first_dr_info->stmt)) +alignment_support_scheme = dr_unaligned_supported; + else +alignment_support_scheme + = vect_supportable_dr_alignment (first_dr_info, false); + gcc_assert (alignment_support_scheme); vec_loop_masks *loop_masks = (loop_vinfo && LOOP_VINFO_FULLY_MASKED_P (loop_vinfo) @@ -9168,8 +9175,15 @@ vectorizable_load (stmt_vec_info stmt_info, gimple_stmt_iterator *gsi, ref_type = reference_alias_ptr_type (DR_REF (first_dr_info->dr)); } - alignment_support_scheme -= vect_supportable_dr_alignment (first_dr_info, false); + /* Strided accesses perform only component accesses, alignment + is irrelevant for them. */ + if (STMT_VINFO_STRIDED_P (first_dr_info->stmt) + && !STMT_VINFO_GROUPED_ACCESS (first_dr_info->stmt)) +alignment_support_scheme = dr_unaligned_supported; + else +alignment_support_scheme + = vect_supportable_dr_alignment (first_dr_info, false); + gcc_assert (alignment_support_scheme); vec_loop_masks *loop_masks = (loop_vinfo && LOOP_VINFO_FULLY_MASKED_P (loop_vinfo)
[PATCH] ICE: in vectorizable_load, at tree-vect-stmts.c:9173
Hi, New bug: https://gcc.gnu.org/bugzilla/show_bug.cgi?id=94398 With -mstrict-align, aarch64_builtin_support_vector_misalignment will returns false when misalignment factor is unknown at compile time. Then vect_supportable_dr_alignment returns dr_unaligned_unsupported, which triggers the ICE. I have pasted the call trace on the bug report. vect_supportable_dr_alignment is expected to return either dr_aligned or dr_unaligned_supported for masked operations. But it seems that this function only catches internal_fn IFN_MASK_LOAD & IFN_MASK_STORE. We are emitting a mask gather load here for this test case. As backends have their own vector misalignment support policy, I am supposing this should be better handled in the auto-vect shared code. Proposed fix: diff --git a/gcc/tree-vect-data-refs.c b/gcc/tree-vect-data-refs.c index 0192aa6..67d3345 100644 --- a/gcc/tree-vect-data-refs.c +++ b/gcc/tree-vect-data-refs.c @@ -6509,11 +6509,26 @@ vect_supportable_dr_alignment (dr_vec_info *dr_info, /* For now assume all conditional loads/stores support unaligned access without any special code. */ - if (gcall *stmt = dyn_cast (stmt_info->stmt)) -if (gimple_call_internal_p (stmt) - && (gimple_call_internal_fn (stmt) == IFN_MASK_LOAD - || gimple_call_internal_fn (stmt) == IFN_MASK_STORE)) - return dr_unaligned_supported; + gcall *call = dyn_cast (stmt_info->stmt); + if (call && gimple_call_internal_p (call)) +{ + internal_fn ifn = gimple_call_internal_fn (call); + switch (ifn) + { + case IFN_MASK_LOAD: + case IFN_MASK_LOAD_LANES: + case IFN_MASK_GATHER_LOAD: + case IFN_MASK_STORE: + case IFN_MASK_STORE_LANES: + case IFN_MASK_SCATTER_STORE: + return dr_unaligned_supported; + default: + break; + } +} + + if (loop_vinfo && LOOP_VINFO_FULLY_MASKED_P (loop_vinfo)) +return dr_unaligned_supported; if (loop_vinfo) { Suggestions? Thanks, Felix
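For readers following along, the kind of loop involved can be sketched in plain C. This is my own hypothetical reduction, not the testcase attached to PR94398: a conditional load through a runtime index is the shape the vectorizer can turn into a masked gather load (IFN_MASK_GATHER_LOAD under SVE full-loop masking), and its misalignment is unknown at compile time, which is what trips the strict-align hook.

```c
/* Hypothetical reduction (not the PR testcase): a conditional
   indexed load.  Under SVE with full-loop masking the load of
   src[idx[i]] can become a masked gather, whose misalignment is
   unknown at compile time.  */
void
gather_cond (int *restrict dst, const int *restrict src,
             const int *restrict idx, int n)
{
  for (int i = 0; i < n; i++)
    if (idx[i] >= 0)
      dst[i] = src[idx[i]];
}
```

Assuming a loop of this shape, compiling with something like -O3 -march=armv8.2-a+sve -mstrict-align would exercise the path discussed above.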
RE: [RFC] Should widening_mul consider block frequency?
> -Original Message- > From: Richard Biener [mailto:richard.guent...@gmail.com] > Sent: Thursday, March 26, 2020 3:37 PM > To: Yangfei (Felix) > Cc: gcc-patches@gcc.gnu.org > Subject: Re: [RFC] Should widening_mul should consider block frequency? > > > > > That's a good point. I have attached the v2 patch. > > Also did a spec2017 test on aarch64, no obvious impact witnessed with this. > > Can you sponsor this patch please? My networking does not work well > > and I am having some trouble pushing it : - ( > > Pushed. For the future can you please attach patches suitable for git am? Sure, will do. Thanks for the help : - ) Felix
RE: [RFC] Should widening_mul consider block frequency?
Hi! > -Original Message- > From: Richard Biener [mailto:richard.guent...@gmail.com] > Sent: Tuesday, March 24, 2020 10:14 PM > To: Yangfei (Felix) > Cc: gcc-patches@gcc.gnu.org > Subject: Re: [RFC] Should widening_mul should consider block frequency? > > > > As written in the PR I'd follow fma generation and restrict defs to the > > > same > BB. > > > > Thanks for the suggestion. That should be more consistent. > > Attached please find the adapted patch. > > Bootstrap and tested on both x86_64 and aarch64 Linux platform. > > OK with moving the BB check before the is_widening_mult_p call since it's way > cheaper. That's a good point. I have attached the v2 patch. Also did a spec2017 test on aarch64, no obvious impact witnessed with this. Can you sponsor this patch please? My networking does not work well and I am having some trouble pushing it : - ( git commit msg: widening_mul: restrict ops to be defined in the same basic-block when convert plusminus to widen In the testcase for PR94269, widening_mul moves two multiply instructions from outside the loop to inside the loop, merging with two add instructions separately. This increases the cost of the loop. Like FMA detection in the same pass, simply restrict ops to be defined in the same basic-block to avoid possibly moving multiply to a different block with a higher execution frequency. 2020-03-26 Felix Yang gcc/ PR tree-optimization/94269 * tree-ssa-math-opts.c (convert_plusminus_to_widen): Restrict this operation to single basic block. gcc/testsuite/ PR tree-optimization/94269 * gcc.dg/pr94269.c: New test. change log: gcc: +2020-03-26 Felix Yang + + PR tree-optimization/94269 + * tree-ssa-math-opts.c (convert_plusminus_to_widen): Restrict this + operation to single basic block. gcc/testsuite: +2020-03-26 Felix Yang + + PR tree-optimization/94269 + * gcc.dg/pr94269.c: New test. + pr94269-v2.patch Description: pr94269-v2.patch
RE: [RFC] Should widening_mul consider block frequency?
Hi! > -Original Message- > From: Richard Biener [mailto:richard.guent...@gmail.com] > Sent: Monday, March 23, 2020 11:25 PM > To: Yangfei (Felix) > Cc: gcc-patches@gcc.gnu.org > Subject: Re: [RFC] Should widening_mul should consider block frequency? > > On Mon, Mar 23, 2020 at 10:53 AM Yangfei (Felix) > wrote: > > > > Hi, > > > > I created a bug for this issue: > https://gcc.gnu.org/bugzilla/show_bug.cgi?id=94269 > > Looks like widening_mul phase may move multiply instruction from outside > the loop to inside the loop, merging with one add instruction inside the loop. > > This will increase the cost of the loop at least on aarch64 (4 cycles vs 1 > cycle). I think widening_mul should consider block frequency when doing such > a combination. > > I mean something like: > > As written in the PR I'd follow fma generation and restrict defs to the same > BB. Thanks for the suggestion. That should be more consistent. Attached please find the adapted patch. Bootstrap and tested on both x86_64 and aarch64 Linux platform. gcc: +2020-03-24 Felix Yang + + PR tree-optimization/94269 + * tree-ssa-math-opts.c (convert_plusminus_to_widen): Restrict this + operation to single basic block. gcc/testsuite: +2020-03-24 Felix Yang + + PR tree-optimization/94269 + * gcc.dg/pr94269.c: New test. + Thanks, Felix pr94269-v1.patch Description: pr94269-v1.patch
RE: [PATCH PR94026] combine missed opportunity to simplify comparisons with zero
Hi! > -Original Message- > From: Segher Boessenkool [mailto:seg...@kernel.crashing.org] > Sent: Monday, March 23, 2020 8:10 PM > To: Yangfei (Felix) > Cc: gcc-patches@gcc.gnu.org; Zhanghaijian (A) > Subject: Re: [PATCH PR94026] combine missed opportunity to simplify > comparisons with zero > > Yeah, maybe not in simplify-rtx.c, hrm. There is SELECT_CC_MODE for these > things, and combine knows about that (not many passes do). I modified combine to emit a simple AND operation instead of making a zero_extract for this scenario. Attached please find the new patch. Hope this solves both of our concerns. gcc +2020-03-24 Felix Yang + + PR rtl-optimization/94026 + * combine.c (make_compound_operation_int): If we have (and + (lshiftrt X C) M) and M is a constant that would select a field + of bits within an item, but not the entire word, fold this into + a simple AND if we are in an equality comparison. gcc/testsuite +2020-03-24 Felix Yang + + PR rtl-optimization/94026 + * gcc.dg/pr94026.c: New test. Thanks, Felix pr94026-v1.patch Description: pr94026-v1.patch
[RFC] Should widening_mul consider block frequency?
Hi, I created a bug for this issue: https://gcc.gnu.org/bugzilla/show_bug.cgi?id=94269 Looks like widening_mul phase may move multiply instruction from outside the loop to inside the loop, merging with one add instruction inside the loop. This will increase the cost of the loop at least on aarch64 (4 cycles vs 1 cycle). I think widening_mul should consider block frequency when doing such a combination. I mean something like: diff --git a/gcc/tree-ssa-math-opts.c b/gcc/tree-ssa-math-opts.c index 54ba035..4439452 100644 --- a/gcc/tree-ssa-math-opts.c +++ b/gcc/tree-ssa-math-opts.c @@ -2721,7 +2721,10 @@ convert_plusminus_to_widen (gimple_stmt_iterator *gsi, gimple *stmt, { if (!has_single_use (rhs1) || !is_widening_mult_p (rhs1_stmt, , _rhs1, - , _rhs2)) + , _rhs2) + || (gimple_bb (rhs1_stmt) != gimple_bb (stmt) + && gimple_bb (rhs1_stmt)->count.to_frequency(cfun) +< gimple_bb (stmt)->count.to_frequency(cfun))) return false; add_rhs = rhs2; conv_stmt = conv1_stmt; @@ -2730,7 +2733,10 @@ convert_plusminus_to_widen (gimple_stmt_iterator *gsi, gimple *stmt, { if (!has_single_use (rhs2) || !is_widening_mult_p (rhs2_stmt, , _rhs1, - , _rhs2)) + , _rhs2) + || (gimple_bb (rhs2_stmt) != gimple_bb (stmt) + && gimple_bb (rhs2_stmt)->count.to_frequency(cfun) +< gimple_bb (stmt)->count.to_frequency(cfun))) return false; add_rhs = rhs1; conv_stmt = conv2_stmt; Comments? Thanks, Felix
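To make the cost concern concrete, here is a hypothetical reduction of my own (not the reproducer attached to PR94269): a loop-invariant widening multiply feeding an add inside the loop. If convert_plusminus_to_widen merges the two into a widening multiply-add, the multiply effectively moves from the preheader into the loop body, so it is re-executed on every iteration.

```c
/* Hypothetical reduction (not the PR testcase).  The widening
   multiply computing `bias` is loop-invariant; merging it with the
   add inside the loop into a widening multiply-add would move the
   multiply into the higher-frequency block.  */
long long
sum_biased (int a, int b, const long long *x, int n)
{
  long long bias = (long long) a * b;   /* multiply outside the loop */
  long long s = 0;
  for (int i = 0; i < n; i++)
    s += bias + x[i];                   /* add inside the loop */
  return s;
}
```

On a target where the widening multiply-add has higher latency than a plain add (4 cycles vs 1 on the aarch64 core mentioned above), the merged form is a per-iteration loss even though it saves one instruction overall.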
RE: [PATCH PR94026] combine missed opportunity to simplify comparisons with zero
> -Original Message- > From: Segher Boessenkool [mailto:seg...@kernel.crashing.org] > Sent: Friday, March 20, 2020 9:38 AM > To: Yangfei (Felix) > Cc: gcc-patches@gcc.gnu.org; Zhanghaijian (A) > Subject: Re: [PATCH PR94026] combine missed opportunity to simplify > comparisons with zero > > On Thu, Mar 19, 2020 at 01:43:40AM +, Yangfei (Felix) wrote: > > 2. Given that the patterns for ubfx and ubfiz are already not simple, I am > afraid the pattern we got by combining the three would be much complex. > > And even more complex when further merged with insn 14 here in order to > make sure that we are doing a equality comparison with zero. > > It will be just as simple as with the other approach: I think the problem here is how to make sure we are doing an ***equality*** comparison with zero. We can only do the transformation under this condition. Then I see combine tries the following pattern: 173 Failed to match this instruction: 174 (set (reg:SI 101) 175 (ne:SI (and:SI (lshiftrt:SI (reg:SI 102) 176 (const_int 8 [0x8])) 177 (const_int 6 [0x6])) 178 (const_int 0 [0]))) But this cannot match a 'tst' instruction as the above pattern does not clobber the CC flag register. Also, this means that we depend on specific use cases, so we may need different patterns to match all possible cases.
> > This should be simplified to > (set (reg:CC_NZ 66 cc) > (compare:CC_NZ (and:SI (reg:SI 103) > (const_int 1536)) > (const_int 0))) > (but it isn't), and that is just *and3nr_compare0, which is a "tst" > instruction. If this is fixed (in simplify-rtx.c), it will work as you want. But I don't think it's correct for the logic in simplify-rtx.c to further simplify this rtl: (compare:CC_NZ (and:SI (lshiftrt:SI (reg:SI 102) (const_int 8 [0x8])) (const_int 6 [0x6])) (const_int 0 [0])) The reason is that it knows nothing about CC_NZ. CC_NZ is aarch64-port specific and it does not necessarily mean an equality comparison with zero. Correct me if I missed anything. Thanks, Felix
RE: [PATCH PR94026] combine missed opportunity to simplify comparisons with zero
Hi, > -Original Message- > From: Segher Boessenkool [mailto:seg...@kernel.crashing.org] > Sent: Thursday, March 19, 2020 7:52 AM > To: Yangfei (Felix) > Cc: gcc-patches@gcc.gnu.org; Zhanghaijian (A) > Subject: Re: [PATCH PR94026] combine missed opportunity to simplify > comparisons with zero > > Hi! > > On Tue, Mar 17, 2020 at 02:05:19AM +, Yangfei (Felix) wrote: > > > Trying 7 -> 8: > > > 7: r99:SI=r103:SI>>0x8 > > > REG_DEAD r103:SI > > > 8: r100:SI=r99:SI&0x6 > > > REG_DEAD r99:SI > > > Failed to match this instruction: > > > (set (reg:SI 100) > > > (and:SI (lshiftrt:SI (reg:SI 103) > > > (const_int 8 [0x8])) > > > (const_int 6 [0x6]))) > > > > > > That should match already, perhaps with a splitter. aarch64 does > > > not have very generic rotate-and-mask (or -insert) instructions, so > > > the > > > aarch64 backend needs to help combine with the less trivial cases. > > > > > > If you have a splitter for *this* one, all else will probably work > > > "automatically": you split it to two ubfm, and the second of those > > > can then merge into the compare instruction, and everything works out. > > > > Do you mean splitting the above pattern into a combination of ubfx and > > ubfiz? > (Both are aliases of ubfm). > > Sure. The problem with aarch's bitfield instruction is that either the > source or > the dest has to be right-aligned, which isn't natural for the compiler. > > > I still don't see how the benefit can be achieved. > > The following is the expected assembly for the test case: > > tst x0, 1536 > > csetw0, ne > > ret > > This may not happen when the remaining ubfx is there. Also what > instruction be matched when ubfiz is merged into the compare? > > Anything I missed? > > The second insn could combine with the compare, and then that can combine > back further. 
Let me paste the RTL input to the combine phase: /* (insn 6 3 7 2 (set (reg:SI 98) (ashiftrt:SI (reg:SI 102) (const_int 8 [0x8]))) "foo.c":3:16 742 {*aarch64_ashr_sisd_or_int_si3} (expr_list:REG_DEAD (reg:SI 102) (nil))) (note 7 6 8 2 NOTE_INSN_DELETED) (insn 8 7 9 2 (set (reg:CC_NZ 66 cc) (compare:CC_NZ (and:SI (reg:SI 98) (const_int 6 [0x6])) (const_int 0 [0]))) "foo.c":5:8 698 {*andsi3nr_compare0} (expr_list:REG_DEAD (reg:SI 98) (nil))) (note 9 8 14 2 NOTE_INSN_DELETED) (insn 14 9 15 2 (set (reg/i:SI 0 x0) (ne:SI (reg:CC_NZ 66 cc) (const_int 0 [0]))) "foo.c":10:1 494 {aarch64_cstoresi} (expr_list:REG_DEAD (reg:CC 66 cc) (nil))) */ Two issues that I can see here: 1. When the ubfiz is combined with the compare, the combined insn does not necessarily mean an equality comparison with zero. This is also the case when all three insns (ubfx & ubfiz & compare) are combined together. 2. Given that the patterns for ubfx and ubfiz are already not simple, I am afraid the pattern we get by combining the three would be much more complex. And even more complex when further merged with insn 14 here in order to make sure that we are doing an equality comparison with zero. So it looks difficult when we go this port-specific way without matching a "zero_extract".
RE: [PATCH PR94026] combine missed opportunity to simplify comparisons with zero
Hi,

> -Original Message-
> From: Segher Boessenkool [mailto:seg...@kernel.crashing.org]
> Sent: Tuesday, March 17, 2020 1:58 AM
> To: Yangfei (Felix)
> Cc: gcc-patches@gcc.gnu.org; Zhanghaijian (A)
> Subject: Re: [PATCH PR94026] combine missed opportunity to simplify
> comparisons with zero
>
> On Mon, Mar 16, 2020 at 06:29:39AM +0000, Yangfei (Felix) wrote:
> > Sorry for not getting your point here.
>
> Trying 7 -> 8:
> 7: r99:SI=r103:SI>>0x8
>    REG_DEAD r103:SI
> 8: r100:SI=r99:SI&0x6
>    REG_DEAD r99:SI
> Failed to match this instruction:
> (set (reg:SI 100)
>     (and:SI (lshiftrt:SI (reg:SI 103)
>             (const_int 8 [0x8]))
>         (const_int 6 [0x6])))
>
> That should match already, perhaps with a splitter.  aarch64 does not have
> very generic rotate-and-mask (or -insert) instructions, so the
> aarch64 backend needs to help combine with the less trivial cases.
>
> If you have a splitter for *this* one, all else will probably work
> "automatically": you split it to two ubfm, and the second of those can then
> merge into the compare instruction, and everything works out.

Do you mean splitting the above pattern into a combination of ubfx and ubfiz? (Both are aliases of ubfm.) I still don't see how the benefit can be achieved. The following is the expected assembly for the test case:

	tst	x0, 1536
	cset	w0, ne
	ret

This may not happen when the remaining ubfx is there. Also, what instruction would be matched when the ubfiz is merged into the compare? Anything I missed?

> > Also, this issue is there for ports like x86.  If we go that way, then
> > we need to handle each affected port.
>
> Yes, you need to do target-specific stuff in every backend separately.
>
> > So I am inclined to handle this in an arch-independent way.
>
> But you don't.  The transformation you do looks to be actively harmful on
> many targets.  (I haven't tested it yet, that takes 8h currently.)

Now I know your concern about zero_extract. Maybe this should be mentioned in docs like gccint.
Also, it would be interesting to see how this may affect those archs.

Thanks,
Felix
RE: [PATCH PR94026] combine missed opportunity to simplify comparisons with zero
Hi, > -Original Message- > From: Segher Boessenkool [mailto:seg...@kernel.crashing.org] > Sent: Saturday, March 14, 2020 12:07 AM > To: Yangfei (Felix) > Cc: gcc-patches@gcc.gnu.org; Zhanghaijian (A) > Subject: Re: [PATCH PR94026] combine missed opportunity to simplify > comparisons with zero > > On Fri, Mar 13, 2020 at 03:21:18AM +, Yangfei (Felix) wrote: > > > On Wed, Mar 04, 2020 at 08:39:36AM +, Yangfei (Felix) wrote: > > > > This is a simple fix for PR94026. > > > > With this fix, combine will try make an extraction if we are in > > > > a equality > > > comparison and this is an AND > > > > with a constant which is power of two minus one. Shift here > > > > should be an > > > constant. For example, combine > > > > will transform (compare (and (lshiftrt x 8) 6) 0) to (compare > > > > (zero_extract > > > (x 2 9)) 0). > > > > > > Why is that a good thing? > > > > The reported test case is reduced from spec2017 541.leela_r. I have pasted > original code snippet on the bugzilla. > > We found other compilers like aocc/llvm can catch this pattern and simplify > > it. > > That wasn't my question, let me rephrase: why would writing it as zero_extract > (instead of as a more canonical form) be wanted? Sorry for not getting your point here. > The aarch backend only has zero_extract formulations for most of the bitfield > instructions. If you fix that problem, all of this should go away? Like, the > testcase in the PR starts with > > Trying 7 -> 8: > 7: r99:SI=r103:SI>>r104:SI#0 > REG_DEAD r104:SI > REG_DEAD r103:SI > 8: r100:SI=r99:SI&0x6 > REG_DEAD r99:SI > Failed to match this instruction: > (set (reg:SI 100) > (and:SI (ashiftrt:SI (reg:SI 103) > (subreg:QI (reg:SI 104) 0)) > (const_int 6 [0x6]))) > > and that should match already (that's an ubfm (ubfx))? For aarch64, if we use "ubfm/ubfx" for the reduced test case, then we still need to do a compare with zero. Then we won't get the benefit. For aarch64, we need to emit a "tst" instruction here. 
So we need to catch something like:

(set (reg:CC_NZ 66 cc)
    (compare:CC_NZ (and:SI (lshiftrt:SI (reg:SI 102)
                (const_int 8 [0x8]))
            (const_int 6 [0x6]))
        (const_int 0 [0])))

But this pattern is not accurate enough: we can only accept an equality comparison with zero here (as indicated by the checking of equality_comparison in my original patch). Also, this issue is there for ports like x86. If we go that way, then we need to handle each affected port. So I am inclined to handle this in an arch-independent way. I looked into tree phases like fwprop & fold-const before, but didn't see an appropriate point to catch this opportunity. Then I came to the combine phase.

> > > (There should be thorough tests on many archs, showing it helps on
> > > average, and it doesn't regress anything.  I can do that for you, but
> > > not right now.)
> >
> > I only have aarch64 & x86_64 linux available and have tested this patch
> > with spec17 on both platforms.  No obvious improvement & regression
> > witnessed.  This is expected as only one instruction is reduced here.
>
> What should be tested is what new combinations are done, and which are *no
> longer* done.

In theory, we won't lose any combinations; we will just emit more zero_extract patterns with my patch.

> > > In general, we should have *fewer* zero_extract, not more.
>
> Some reasons for that:
> 1) All those can be expressed with simpler operations as well;
> 2) Most very similar expressions cannot be expressed as zero_extract,
> although many architectures can handle (some of) those just fine;
> 3) The optimizers do not handle zero_extract very well at all (this
> includes simplify-rtx, to start with).
>
> sign_extract is nastier -- we really want to have a sign_extend that works
> on separate bits, not as coarse as address units as we have now -- but it
> currently isn't handled much either.

Thanks for explaining this. I have to admit that I didn't realize this issue when I was creating my original patch.

Felix
RE: [PATCH PR94026] combine missed opportunity to simplify comparisons with zero
> -Original Message-
> From: Segher Boessenkool [mailto:seg...@kernel.crashing.org]
> Sent: Friday, March 13, 2020 7:50 AM
> To: Yangfei (Felix)
> Cc: gcc-patches@gcc.gnu.org; Zhanghaijian (A)
> Subject: Re: [PATCH PR94026] combine missed opportunity to simplify
> comparisons with zero
>
> Hi!
>
> Please Cc: me on combine patches; you sent it nine days ago, and I didn't
> see it until now.

OK.

> On Wed, Mar 04, 2020 at 08:39:36AM +0000, Yangfei (Felix) wrote:
> > This is a simple fix for PR94026.
> > With this fix, combine will try to make an extraction if we are in an
> > equality comparison and this is an AND with a constant which is a power
> > of two minus one.  The shift here should be a constant.  For example,
> > combine will transform (compare (and (lshiftrt x 8) 6) 0) to
> > (compare (zero_extract (x 2 9)) 0).
>
> Why is that a good thing?

The reported test case is reduced from spec2017 541.leela_r. I have pasted the original code snippet on the bugzilla. We found that other compilers like aocc/llvm can catch this pattern and simplify it.

> (There should be thorough tests on many archs, showing it helps on
> average, and it doesn't regress anything.  I can do that for you, but not
> right now.)

I only have aarch64 & x86_64 linux available and have tested this patch with spec17 on both platforms. No obvious improvement & regression witnessed. This is expected as only one instruction is reduced here. It would be appreciated if this could be tested on other archs.

> The code needs more comments, and the commit message should say what is
> done and why you made those choices.
> In general, we should have *fewer* zero_extract, not more.

OK. I can add more comments & a commit message if we finally decide to go with this patch.

Thanks,
Felix
RE: [PATCH PR94026] combine missed opportunity to simplify comparisons with zero
> -Original Message- > From: Jeff Law [mailto:l...@redhat.com] > Sent: Thursday, March 5, 2020 11:37 PM > To: Yangfei (Felix) ; gcc-patches@gcc.gnu.org > Cc: Zhanghaijian (A) > Subject: Re: [PATCH PR94026] combine missed opportunity to simplify > comparisons with zero > > On Wed, 2020-03-04 at 08:39 +, Yangfei (Felix) wrote: > > Hi, > > > > This is a simple fix for PR94026. > > With this fix, combine will try make an extraction if we are in a > > equality comparison and this is an AND > > with a constant which is power of two minus one. Shift here should > > be an constant. For example, combine > > will transform (compare (and (lshiftrt x 8) 6) 0) to (compare > > (zero_extract (x 2 9)) 0). > > > > Added one test case for this. Bootstrap and tested on both x86_64 > > and > > aarch64 Linux platform. > > Any suggestion? > > > > Thanks, > > Felix > > > > gcc: > > +2020-03-04 Felix Yang > > + > > + PR rtl-optimization/94026 > > + * combine.c (make_compound_operation_int): Make an extraction > > + if we are in a equality comparison and this is an AND with a > > + constant which is power of two minus one. > > + > > > > gcc/testsuite: > > +2020-03-04 Felix Yang > > + > > + PR rtl-optimization/94026 > > + * gcc.dg/pr94026.c: New test. > Just a note. We're in stage4 of our development cycle, meaning we focus on > regression bugfixes. I've queued this for evaluation in gcc-11. > jeff Sure, this is intended for 11. Thanks for doing that : - ) Best regards, Felix
[PATCH PR94026] combine missed opportunity to simplify comparisons with zero
Hi,

This is a simple fix for PR94026. With this fix, combine will try to make an extraction if we are in an equality comparison and this is an AND with a constant which is a power of two minus one. The shift here should be a constant. For example, combine will transform (compare (and (lshiftrt x 8) 6) 0) to (compare (zero_extract (x 2 9)) 0).

Added one test case for this. Bootstrapped and tested on both the x86_64 and aarch64 Linux platforms. Any suggestion?

Thanks,
Felix

gcc:
+2020-03-04  Felix Yang
+
+	PR rtl-optimization/94026
+	* combine.c (make_compound_operation_int): Make an extraction
+	if we are in an equality comparison and this is an AND with a
+	constant which is a power of two minus one.
+

gcc/testsuite:
+2020-03-04  Felix Yang
+
+	PR rtl-optimization/94026
+	* gcc.dg/pr94026.c: New test.
+

diff --git a/gcc/combine.c b/gcc/combine.c
index 58366a6d331..c05064fc333 100644
--- a/gcc/combine.c
+++ b/gcc/combine.c
@@ -8170,14 +8170,31 @@ make_compound_operation_int (scalar_int_mode mode, rtx *x_ptr,
       if (!CONST_INT_P (XEXP (x, 1)))
 	break;
 
+      HOST_WIDE_INT pos;
+      unsigned HOST_WIDE_INT len;
+      pos = get_pos_from_mask (UINTVAL (XEXP (x, 1)), &len);
+
       /* If the constant is a power of two minus one and the first operand
-	 is a logical right shift, make an extraction.  */
+	 is a logical right shift, make an extraction.
+	 If we are in an equality comparison and this is an AND with a constant
+	 which is a power of two minus one, also make an extraction.  */
       if (GET_CODE (XEXP (x, 0)) == LSHIFTRT
-	  && (i = exact_log2 (UINTVAL (XEXP (x, 1)) + 1)) >= 0)
+	  && (pos == 0 || (pos > 0 && equality_comparison
+			   && CONST_INT_P (XEXP (XEXP (x, 0), 1)))))
 	{
 	  new_rtx = make_compound_operation (XEXP (XEXP (x, 0), 0), next_code);
-	  new_rtx = make_extraction (mode, new_rtx, 0, XEXP (XEXP (x, 0), 1),
-				     i, 1, 0, in_code == COMPARE);
+	  if (pos == 0)
+	    {
+	      new_rtx = make_extraction (mode, new_rtx, 0,
+					 XEXP (XEXP (x, 0), 1), len, 1, 0,
+					 in_code == COMPARE);
+	    }
+	  else
+	    {
+	      int real_pos = pos + UINTVAL (XEXP (XEXP (x, 0), 1));
+	      new_rtx = make_extraction (mode, new_rtx, real_pos, NULL_RTX,
+					 len, 1, 0, in_code == COMPARE);
+	    }
 	}
 
       /* Same as previous, but for (subreg (lshiftrt ...)) in first op.  */
@@ -8186,13 +8203,25 @@ make_compound_operation_int (scalar_int_mode mode, rtx *x_ptr,
	       && is_a <scalar_int_mode> (GET_MODE (SUBREG_REG (XEXP (x, 0))),
					  &inner_mode)
	       && GET_CODE (SUBREG_REG (XEXP (x, 0))) == LSHIFTRT
-	       && (i = exact_log2 (UINTVAL (XEXP (x, 1)) + 1)) >= 0)
+	       && (pos == 0
+		   || (pos > 0 && equality_comparison
+		       && CONST_INT_P (XEXP (SUBREG_REG (XEXP (x, 0)), 1)))))
 	{
 	  rtx inner_x0 = SUBREG_REG (XEXP (x, 0));
 	  new_rtx = make_compound_operation (XEXP (inner_x0, 0), next_code);
-	  new_rtx = make_extraction (inner_mode, new_rtx, 0,
-				     XEXP (inner_x0, 1),
-				     i, 1, 0, in_code == COMPARE);
+	  if (pos == 0)
+	    {
+	      new_rtx = make_extraction (inner_mode, new_rtx, 0,
+					 XEXP (inner_x0, 1),
+					 len, 1, 0, in_code == COMPARE);
+	    }
+	  else
+	    {
+	      int real_pos = pos + UINTVAL (XEXP (inner_x0, 1));
+	      new_rtx = make_extraction (inner_mode, new_rtx, real_pos,
+					 NULL_RTX, len, 1, 0,
+					 in_code == COMPARE);
+	    }
 
	  /* If we narrowed the mode when dropping the subreg, then we lose.  */
	  if (GET_MODE_SIZE (inner_mode) < GET_MODE_SIZE (mode))
@@ -8200,10 +8229,10 @@ make_compound_operation_int (scalar_int_mode mode, rtx *x_ptr,
 
	  /* If that didn't give anything, see if the AND simplifies on
	     its own.  */
-	  if (!new_rtx && i >= 0)
+	  if (!new_rtx)
	    {
	      new_rtx = make_compound_operation (XEXP (x, 0), next_code);
-	      new_rtx = make_extraction (mode, new_rtx, 0, NULL_RTX, i, 1,
+	      new_rtx = make_extraction (mode, new_rtx, pos, NULL_RTX, len, 1,
					 0, in_code == COMPARE);
	    }
	}
@@ -8212,7 +8241,7 @@ make_compound_operation_int (scalar_int_mode mode, rtx *x_ptr,
	  || GET_CODE (XEXP (x, 0)) == IOR)
	 && GET_CODE (XEXP (XEXP (x, 0), 0)) == LSHIFTRT
	 && GET_CODE (XEXP
Re: [PATCH, AArch64] atomics: prefetch the destination for write prior to ldxr/stxr loops
> On Mon, Mar 7, 2016 at 7:27 PM, Yangfei (Felix) felix.y...@huawei.com wrote:
> > Hi,
> >
> > As discussed in LKML:
> > http://lists.infradead.org/pipermail/linux-arm-kernel/2015-July/355996.html,
> > the cost of changing a cache line from shared to exclusive state can be
> > significant on aarch64 cores, especially when this is triggered by an
> > exclusive store, since it may result in having to retry the transaction.
> > This patch makes use of the "prfm PSTL1STRM" instruction to prefetch
> > cache lines for write prior to ldxr/stxr loops generated by the ll/sc
> > atomic routines.  Bootstrapped on AArch64 server, is it OK?
>
> I don't think this is a good thing in general.  For an example on ThunderX,
> the prefetch just adds a cycle for no benefit.  This really depends on the
> micro-architecture of the core and how LDXR/STXR are implemented.  So after
> this patch, it will slow down ThunderX.
>
> Thanks,
> Andrew Pinski

Hi Andrew,

I am not quite clear about the ThunderX micro-arch. But, yes, I agree it depends on the micro-architecture of the core. As the mentioned kernel patch is merged upstream, I think the added prefetch instruction in atomic routines is good for most of the AArch64 cores in the market. If it does nothing good for ThunderX, then how about adding some checking here? I mean disabling the generation of the prfm if we are tuning for ThunderX.

Thanks,
Felix
[PATCH, AArch64] atomics: prefetch the destination for write prior to ldxr/stxr loops
Hi, As discussed in LKML: http://lists.infradead.org/pipermail/linux-arm-kernel/2015-July/355996.html, the cost of changing a cache line from shared to exclusive state can be significant on aarch64 cores, especially when this is triggered by an exclusive store, since it may result in having to retry the transaction. This patch makes use of the "prfm PSTL1STRM" instruction to prefetch cache lines for write prior to ldxr/stxr loops generated by the ll/sc atomic routines. Bootstrapped on AArch64 server, is it OK? Thanks, Felix aarch64-atomics-v0.diff Description: aarch64-atomics-v0.diff
Re: [PATCH] Only accept BUILT_IN_NORMAL stringops for interesting_stringop_to_profile_p
, stmt, HIST_TYPE_SINGLE_VALUE);
   if (!histogram)
     return false;
 
+  val = histogram->hvalue.counters[0];
   count = histogram->hvalue.counters[1];
   all = histogram->hvalue.counters[2];
   gimple_remove_histogram_value (cfun, stmt, histogram);
+
   /* We require that count is at least half of all; this means that
      for the transformation to fire the value must be constant at
      least 80% of time.  */
@@ -1719,8 +1712,10 @@ gimple_stringops_transform (gimple_stmt_iterator *
     prob = GCOV_COMPUTE_SCALE (count, all);
   else
     prob = 0;
+
   dest = gimple_call_arg (stmt, 0);
   dest_align = get_pointer_alignment (dest);
+
   fcode = DECL_FUNCTION_CODE (gimple_call_fndecl (stmt));
   switch (fcode)
     {
     case BUILT_IN_MEMCPY:
@@ -1811,6 +1806,7 @@ stringop_block_profile (gimple stmt, unsigned int
 
 /* Find values inside STMT for that we want to measure histograms for
    division/modulo optimization.  */
+
 static void
 gimple_divmod_values_to_profile (gimple stmt, histogram_values *values)
 {
@@ -1891,21 +1887,18 @@ gimple_indirect_call_to_profile (gimple stmt, hist
 
 /* Find values inside STMT for that we want to measure histograms for
    string operations.
    */
+
 static void
 gimple_stringops_values_to_profile (gimple stmt, histogram_values *values)
 {
-  tree fndecl;
-  tree blck_size;
   tree dest;
+  tree blck_size;
   int size_arg;
 
-  if (gimple_code (stmt) != GIMPLE_CALL)
+  if (!gimple_call_builtin_p (stmt, BUILT_IN_NORMAL))
     return;
-  fndecl = gimple_call_fndecl (stmt);
-  if (!fndecl)
-    return;
-  if (!interesting_stringop_to_profile_p (fndecl, stmt, &size_arg))
+  if (!interesting_stringop_to_profile_p (stmt, &size_arg))
     return;
 
   dest = gimple_call_arg (stmt, 0);
@@ -1919,6 +1912,7 @@ gimple_stringops_values_to_profile (gimple stmt, h
       values->safe_push (gimple_alloc_histogram_value (cfun, HIST_TYPE_AVERAGE,
						       stmt, blck_size));
     }
+
   if (TREE_CODE (blck_size) != INTEGER_CST)
     values->safe_push (gimple_alloc_histogram_value (cfun, HIST_TYPE_IOR,
						     stmt, dest));

Index: gcc/ChangeLog
===
--- gcc/ChangeLog	(revision 141081)
+++ gcc/ChangeLog	(working copy)
@@ -1,3 +1,12 @@
+2014-08-20  Felix Yang  felix.y...@huawei.com
+	    Jiji Jiang  jiangj...@huawei.com
+
+	* value-prof.c (interesting_stringop_to_profile_p): Removed FNDECL
+	argument and get builtin function code directly from CALL.
+	(gimple_stringop_fixed_value): Modified accordingly.
+	(gimple_stringops_transform, gimple_stringops_values_to_profile):
+	Modified accordingly and only accept BUILT_IN_NORMAL string operations.
+
 2015-08-18  Segher Boessenkool  seg...@kernel.crashing.org
 
	Backport from mainline:

On Thu, Aug 20, 2015 at 5:17 AM, Yangfei (Felix) felix.y...@huawei.com wrote:
> Hi,
>
> As DECL_FUNCTION_CODE is overloaded for builtin functions in different
> classes, we need to check the builtin class before using fcode.  Patch
> posted below.  Bootstrapped on x86_64-suse-linux, OK for trunk?  Thanks.

Ugh.
The code in the callers already looks like it could have some TLC, like instead of

  fndecl = gimple_call_fndecl (stmt);
  if (!fndecl)
    return false;
  fcode = DECL_FUNCTION_CODE (fndecl);
  if (!interesting_stringop_to_profile_p (fndecl, stmt, size_arg))
    return false;

simply do

  if (!gimple_call_builtin_p (stmt, BUILT_IN_NORMAL))
    return false;
  if (!interesting_stringop_to_profile_p (gimple_call_fndecl (stmt), &size_arg))

similar for the other caller. interesting_stringop_to_profile_p can also get the function code directly from stmt, removing the redundant first argument, or even do the gimple_call_builtin_p call itself.

Mind reworking the patch accordingly?

Thanks,
Richard.

patch-v2.diff
Description: patch-v2.diff
[PATCH] Only accept BUILT_IN_NORMAL stringops for interesting_stringop_to_profile_p
Hi,

As DECL_FUNCTION_CODE is overloaded for builtin functions in different classes, we need to check the builtin class before using fcode. Patch posted below. Bootstrapped on x86_64-suse-linux, OK for trunk? Thanks.

Index: gcc/value-prof.c
===
--- gcc/value-prof.c	(revision 141081)
+++ gcc/value-prof.c	(working copy)
@@ -1547,8 +1547,12 @@ gimple_ic_transform (gimple_stmt_iterator *gsi)
 static bool
 interesting_stringop_to_profile_p (tree fndecl, gimple call, int *size_arg)
 {
-  enum built_in_function fcode = DECL_FUNCTION_CODE (fndecl);
+  enum built_in_function fcode;
 
+  if (DECL_BUILT_IN_CLASS (fndecl) != BUILT_IN_NORMAL)
+    return false;
+
+  fcode = DECL_FUNCTION_CODE (fndecl);
   if (fcode != BUILT_IN_MEMCPY && fcode != BUILT_IN_MEMPCPY
       && fcode != BUILT_IN_MEMSET && fcode != BUILT_IN_BZERO)
     return false;

Index: gcc/ChangeLog
===
--- gcc/ChangeLog	(revision 141081)
+++ gcc/ChangeLog	(working copy)
@@ -1,3 +1,9 @@
+2015-08-20  Felix Yang  felix.y...@huawei.com
+	    Jiji Jiang  jiangj...@huawei.com
+
+	* value-prof.c (interesting_stringop_to_profile_p): Only accept string
+	operations which belong to the BUILT_IN_NORMAL builtin class.
+
 2015-08-18  Segher Boessenkool  seg...@kernel.crashing.org
 
	Backport from mainline:
[PING^2, AArch64] Add long-call attribute and pragma interfaces
Patch ping ... On 04/02/2015 11:59 PM, Yangfei (Felix) wrote: Patch ping: https://gcc.gnu.org/ml/gcc-patches/2014-11/msg01148.html This patch needs documentation for the new attributes and pragmas before it can be committed. (Since this is a new feature I think it has to wait until stage 1, too, but that's not my call.) -Sandra Sure, I will update the docs when this patch is approved by the port maintainers. I didn't get any feedback from Richard for the v2 patch. Thanks.
Re: [PING, AArch64] Add long-call attribute and pragma interfaces
On 04/02/2015 11:59 PM, Yangfei (Felix) wrote: Patch ping: https://gcc.gnu.org/ml/gcc-patches/2014-11/msg01148.html This patch needs documentation for the new attributes and pragmas before it can be committed. (Since this is a new feature I think it has to wait until stage 1, too, but that's not my call.) -Sandra Sure, I will update the docs when this patch is approved by the port maintainers. I didn't get any feedback from Richard for the v2 patch. Thanks.
Re: [RFC AArch64] Implement TARGET_PROMOTE_FUNCTION_MODE for ILP32 code generation
Hi Andrew,

Sorry for the late reply. Seems that I misunderstood the AAPCS64 specification. Thanks for the clarification.

> On Mar 16, 2015, at 2:28 AM, Yangfei (Felix) felix.y...@huawei.com wrote:
> > Hi,
> >
> > For this trivial testcase:
> >
> > extern int bar (int, int);
> >
> > int foo (int *a, int *b)
> > {
> >   return bar (*a, *b);
> > }
> >
> > I noticed that GCC generates redundant zero-extension instructions under
> > ILP32 (aarch64-linux-gnu-gcc -S -O2 -mabi=ilp32).  Assembly code:
> >
> > 	.arch armv8-a+fp+simd
> > 	.file	"1.c"
> > 	.text
> > 	.align	2
> > 	.p2align 3,,7
> > 	.global	foo
> > 	.type	foo, %function
> > foo:
> > 	uxtw	x0, w0
> > 	uxtw	x1, w1
> > 	ldr	w0, [x0]
> > 	ldr	w1, [x1]
> > 	b	bar
> > 	.size	foo, .-foo
> > 	.ident	"GCC: (20140403) 5.0.0 20150312 (experimental)"
> >
> > According to the ILP32 ABI, the two uxtw instructions here are not
> > necessary.  The following is a proposed patch to fix this issue, i.e.
> > promoting pointer function arguments to word_mode.  But I don't know
> > whether it's a good idea to do this for pointer return values.  Any
> > comments?
>
> Actually they are required.  The abi says the upper 32bits are undefined
> for arguments smaller than 64 bits.  I had this discussion a year or more
> ago about this case.  A simple struct like
>
> struct a { int *b; int c; };
>
> will break the code if we don't have the zero extends.  Try
>
> void f(int *);
> void g(struct a d)
> {
>   f(d.b);
> }
>
> and see that there is no zero extends inside g.  I saw this exact thing
> when working on getting gccgo working.  It also means the assembly
> functions in glibc are broken and need to be fixed.
Thanks, Andrew Index: gcc/config/aarch64/aarch64.c = == --- gcc/config/aarch64/aarch64.c(revision 221393) +++ gcc/config/aarch64/aarch64.c(working copy) @@ -1587,7 +1587,7 @@ aarch64_function_value (const_tree type, const_tre machine_mode ag_mode; mode = TYPE_MODE (type); - if (INTEGRAL_TYPE_P (type)) + if (INTEGRAL_TYPE_P (type) || POINTER_TYPE_P (type)) mode = promote_function_mode (type, mode, unsignedp, func, 1); if (aarch64_return_in_msb (type)) @@ -1650,6 +1650,24 @@ aarch64_function_value_regno_p (const unsigned int return false; } +/* Implement TARGET_PROMOTE_FUNCTION_MODE. */ + +static machine_mode +aarch64_promote_function_mode (const_tree type, machine_mode mode, + int *punsignedp, const_tree fntype, + int for_return) +{ + /* Pointer function arguments and return values are promoted to +word_mode. */ + if (type != NULL_TREE POINTER_TYPE_P (type)) +{ + *punsignedp = POINTERS_EXTEND_UNSIGNED; + return word_mode; +} + + return default_promote_function_mode (type, mode, punsignedp, fntype, +for_return); } + /* Implement TARGET_RETURN_IN_MEMORY. If the type T of the result of a function is such that @@ -11329,6 +11347,9 @@ aarch64_gen_adjusted_ldpstp (rtx *operands, bool l #define TARGET_OVERRIDE_OPTIONS_AFTER_CHANGE \ aarch64_override_options_after_change +#undef TARGET_PROMOTE_FUNCTION_MODE +#define TARGET_PROMOTE_FUNCTION_MODE aarch64_promote_function_mode + #undef TARGET_PASS_BY_REFERENCE #define TARGET_PASS_BY_REFERENCE aarch64_pass_by_reference aarch64-promote-v2.diff
[PING, AArch64] Add long-call attribute and pragma interfaces
Patch ping: https://gcc.gnu.org/ml/gcc-patches/2014-11/msg01148.html Thanks.
[RFC AArch64] Implement TARGET_PROMOTE_FUNCTION_MODE for ILP32 code generation
Hi,

For this trivial testcase:

extern int bar (int, int);

int foo (int *a, int *b)
{
  return bar (*a, *b);
}

I noticed that GCC generates redundant zero-extension instructions under ILP32 (aarch64-linux-gnu-gcc -S -O2 -mabi=ilp32). Assembly code:

	.arch armv8-a+fp+simd
	.file	"1.c"
	.text
	.align	2
	.p2align 3,,7
	.global	foo
	.type	foo, %function
foo:
	uxtw	x0, w0
	uxtw	x1, w1
	ldr	w0, [x0]
	ldr	w1, [x1]
	b	bar
	.size	foo, .-foo
	.ident	"GCC: (20140403) 5.0.0 20150312 (experimental)"

According to the ILP32 ABI, the two uxtw instructions here are not necessary. The following is a proposed patch to fix this issue, i.e. promoting pointer function arguments to word_mode. But I don't know whether it's a good idea to do this for pointer return values. Any comments?

Index: gcc/config/aarch64/aarch64.c
===
--- gcc/config/aarch64/aarch64.c	(revision 221393)
+++ gcc/config/aarch64/aarch64.c	(working copy)
@@ -1587,7 +1587,7 @@ aarch64_function_value (const_tree type, const_tre
   machine_mode ag_mode;
 
   mode = TYPE_MODE (type);
-  if (INTEGRAL_TYPE_P (type))
+  if (INTEGRAL_TYPE_P (type) || POINTER_TYPE_P (type))
     mode = promote_function_mode (type, mode, unsignedp, func, 1);
 
   if (aarch64_return_in_msb (type))
@@ -1650,6 +1650,24 @@ aarch64_function_value_regno_p (const unsigned int
   return false;
 }
 
+/* Implement TARGET_PROMOTE_FUNCTION_MODE.  */
+
+static machine_mode
+aarch64_promote_function_mode (const_tree type, machine_mode mode,
+			       int *punsignedp, const_tree fntype,
+			       int for_return)
+{
+  /* Pointer function arguments and return values are promoted to
+     word_mode.  */
+  if (type != NULL_TREE && POINTER_TYPE_P (type))
+    {
+      *punsignedp = POINTERS_EXTEND_UNSIGNED;
+      return word_mode;
+    }
+
+  return default_promote_function_mode (type, mode, punsignedp, fntype,
+					for_return);
+}
+
 /* Implement TARGET_RETURN_IN_MEMORY.
If the type T of the result of a function is such that @@ -11329,6 +11347,9 @@ aarch64_gen_adjusted_ldpstp (rtx *operands, bool l #define TARGET_OVERRIDE_OPTIONS_AFTER_CHANGE \ aarch64_override_options_after_change +#undef TARGET_PROMOTE_FUNCTION_MODE +#define TARGET_PROMOTE_FUNCTION_MODE aarch64_promote_function_mode + #undef TARGET_PASS_BY_REFERENCE #define TARGET_PASS_BY_REFERENCE aarch64_pass_by_reference aarch64-promote-v2.diff Description: aarch64-promote-v2.diff
[PING ^ 5] [PATCH, AARCH64] Add support for -mlong-calls option
Ping ... Patch ping: https://gcc.gnu.org/ml/gcc-patches/2014-11/msg02258.html Any comments, Richard? Thanks.
Re: [PING] [PATCH] [AArch64, NEON] Add vfms_n_f32, vfmsq_n_f32 and vfmsq_n_f64 specified by the ACLE
On 21/01/15 09:22, Yangfei (Felix) wrote: This is a ping for: https://gcc.gnu.org/ml/gcc-patches/2014-12/msg01008.html I updated the testcase adding test for vfmsq_n_f64 intrinsic. Test OK for both aarch64-linux-gnu and aarch64_be-linux-gnu-gcc. OK for the trunk? Thanks. Index: gcc/ChangeLog = == --- gcc/ChangeLog (revision 219845) +++ gcc/ChangeLog (working copy) @@ -1,3 +1,8 @@ +2015-01-21 Felix Yang felix.y...@huawei.com + + * config/aarch64/arm_neon.h (vfms_n_f32, vfmsq_n_f32, vfmsq_n_f64): New + intrinsics. + Hi Felix, Thanks for the the patch. It LGTM apart from one point - you seem to have missed out vfms_n_f64? Thanks, Tejas. Hello Tejas, You are right, the vfms_n_f64 is missing here. And I find that vfma_n_f64 is not there too. I would like to compose up another patch to add them. So is this patch OK for the trunk? Thanks.
[PING] [PATCH] [AArch64, NEON] Add vfms_n_f32, vfmsq_n_f32 and vfmsq_n_f64 specified by the ACLE
This is a ping for: https://gcc.gnu.org/ml/gcc-patches/2014-12/msg01008.html

I updated the testcase, adding a test for the vfmsq_n_f64 intrinsic. Test OK for both aarch64-linux-gnu and aarch64_be-linux-gnu-gcc. OK for the trunk? Thanks.

Index: gcc/ChangeLog
===
--- gcc/ChangeLog	(revision 219845)
+++ gcc/ChangeLog	(working copy)
@@ -1,3 +1,8 @@
+2015-01-21  Felix Yang  felix.y...@huawei.com
+
+	* config/aarch64/arm_neon.h (vfms_n_f32, vfmsq_n_f32, vfmsq_n_f64): New
+	intrinsics.
+
 2015-01-19  Jiong Wang  jiong.w...@arm.com
	    Andrew Pinski  apin...@cavium.com

Index: gcc/testsuite/gcc.target/aarch64/advsimd-intrinsics/vfms_n.c
===
--- gcc/testsuite/gcc.target/aarch64/advsimd-intrinsics/vfms_n.c	(revision 0)
+++ gcc/testsuite/gcc.target/aarch64/advsimd-intrinsics/vfms_n.c	(revision 0)
@@ -0,0 +1,74 @@
+#include <arm_neon.h>
+#include "arm-neon-ref.h"
+#include "compute-ref-data.h"
+
+#if defined(__aarch64__) && defined(__ARM_FEATURE_FMA)
+/* Expected results.  */
+VECT_VAR_DECL(expected,hfloat,32,2) [] = { 0x4438ca3d, 0x44390a3d };
+VECT_VAR_DECL(expected,hfloat,32,4) [] = { 0x44869eb8, 0x4486beb8, 0x4486deb8, 0x4486feb8 };
+VECT_VAR_DECL(expected,hfloat,64,2) [] = { 0x408906e1532b8520, 0x40890ee1532b8520 };
+
+#define VECT_VAR_ASSIGN(S,Q,T1,W) S##Q##_##T1##W
+#define ASSIGN(S, Q, T, W, V) T##W##_t S##Q##_##T##W = V
+#define TEST_MSG "VFMS_N/VFMSQ_N"
+
+void exec_vfms_n (void)
+{
+  /* Basic test: v4=vfms_n(v1,v2), then store the result.
     */
+#define TEST_VFMS_N(Q, T1, T2, W, N)					\
+  VECT_VAR(vector_res, T1, W, N) =					\
+    vfms##Q##_n_##T2##W(VECT_VAR(vector1, T1, W, N),			\
+			VECT_VAR(vector2, T1, W, N),			\
+			VECT_VAR_ASSIGN(scalar, Q, T1, W));		\
+  vst1##Q##_##T2##W(VECT_VAR(result, T1, W, N), VECT_VAR(vector_res, T1, W, N))
+
+#define CHECK_VFMS_N_RESULTS(test_name,comment)				\
+  {									\
+    CHECK_FP(test_name, float, 32, 2, PRIx32, expected, comment);	\
+    CHECK_FP(test_name, float, 32, 4, PRIx32, expected, comment);	\
+    CHECK_FP(test_name, float, 64, 2, PRIx64, expected, comment);	\
+  }
+
+#define DECL_VFMS_N_VAR(VAR)			\
+  DECL_VARIABLE(VAR, float, 32, 2);		\
+  DECL_VARIABLE(VAR, float, 32, 4);		\
+  DECL_VARIABLE(VAR, float, 64, 2);
+
+  DECL_VFMS_N_VAR(vector1);
+  DECL_VFMS_N_VAR(vector2);
+  DECL_VFMS_N_VAR(vector3);
+  DECL_VFMS_N_VAR(vector_res);
+
+  clean_results ();
+
+  /* Initialize input vector1 from buffer.  */
+  VLOAD(vector1, buffer, , float, f, 32, 2);
+  VLOAD(vector1, buffer, q, float, f, 32, 4);
+  VLOAD(vector1, buffer, q, float, f, 64, 2);
+
+  /* Choose init value arbitrarily.  */
+  VDUP(vector2, , float, f, 32, 2, -9.3f);
+  VDUP(vector2, q, float, f, 32, 4, -29.7f);
+  VDUP(vector2, q, float, f, 64, 2, -15.8f);
+
+  /* Choose init value arbitrarily.  */
+  ASSIGN(scalar, , float, 32, 81.2f);
+  ASSIGN(scalar, q, float, 32, 36.8f);
+  ASSIGN(scalar, q, float, 64, 51.7f);
+
+  /* Execute the tests.  */
+  TEST_VFMS_N(, float, f, 32, 2);
+  TEST_VFMS_N(q, float, f, 32, 4);
+  TEST_VFMS_N(q, float, f, 64, 2);
+
+  CHECK_VFMS_N_RESULTS (TEST_MSG, "");
+}
+#endif
+
+int main (void)
+{
+#if defined(__aarch64__) && defined(__ARM_FEATURE_FMA)
+  exec_vfms_n ();
+#endif
+  return 0;
+}

Index: gcc/testsuite/ChangeLog
===
--- gcc/testsuite/ChangeLog	(revision 219845)
+++ gcc/testsuite/ChangeLog	(working copy)
@@ -1,3 +1,7 @@
+2015-01-21  Felix Yang  felix.y...@huawei.com
+
+	* gcc.target/aarch64/advsimd-intrinsics/vfms_n.c: New test.
+ 2015-01-19 Felix Yang felix.y...@huawei.com Haijian Zhang z.zhanghaij...@huawei.com Index: gcc/config/aarch64/arm_neon.h === --- gcc/config/aarch64/arm_neon.h (revision 219845) +++ gcc/config/aarch64/arm_neon.h (working copy) @@ -14774,7 +14774,24 @@ vfmsq_f64 (float64x2_t __a, float64x2_t __b, float return __builtin_aarch64_fmav2df (-__b, __c, __a); } +__extension__ static __inline float32x2_t __attribute__ ((__always_inline__)) +vfms_n_f32 (float32x2_t __a, float32x2_t __b, float32_t __c) +{ + return __builtin_aarch64_fmav2sf (-__b, vdup_n_f32 (__c), __a); +} +__extension__ static __inline float32x4_t __attribute__ ((__always_inline__)) +vfmsq_n_f32 (float32x4_t __a, float32x4_t __b, float32_t __c) +{ + return __builtin_aarch64_fmav4sf (-__b, vdupq_n_f32 (__c), __a); +} + +__extension__ static __inline float64x2_t __attribute__ ((__always_inline__)) +vfmsq_n_f64 (float64x2_t __a, float64x2_t __b,
[PING] [PATCH] [AArch64, NEON] Fix testcases added by r218484
Hi, This is a ping for: https://gcc.gnu.org/ml/gcc-patches/2014-12/msg01328.html OK for the trunk? Thanks.
[PING ^ 4] [RFC PATCH, AARCH64] Add support for -mlong-calls option
Patch ping: https://gcc.gnu.org/ml/gcc-patches/2014-11/msg02258.html Any comments, Richard? Thanks.
Re: [PATCH, autofdo] Some code cleanup
Hi, I updated the patch adding one ChangeLog entry. OK for the trunk? Thanks. Index: gcc/ChangeLog === --- gcc/ChangeLog (revision 219297) +++ gcc/ChangeLog (working copy) @@ -1,3 +1,12 @@ +2015-01-17 Felix Yang felix.y...@huawei.com + + * auto-profile.c (afdo_find_equiv_class): Remove unnecessary code. + (autofdo_source_profile::get_callsite_total_count, + function_instance::get_function_instance_by_decl, + string_table::get_index, string_table::get_index_by_decl, + afdo_vpt_for_early_inline, afdo_callsite_hot_enough_for_early_inline): + Fix comment typos and formatting. + 2015-01-06 Sandra Loosemore san...@codesourcery.com * doc/invoke.texi (RS/6000 and PowerPC Options): Tidy formatting Index: gcc/auto-profile.c === --- gcc/auto-profile.c (revision 219297) +++ gcc/auto-profile.c (working copy) @@ -96,7 +96,7 @@ along with GCC; see the file COPYING3. If not see standalone symbol, or a clone of a function that is inlined into another function. - Phase 2: Early inline + valur profile transformation. + Phase 2: Early inline + value profile transformation. Early inline uses autofdo_source_profile to find if a callsite is: * inlined in the profiled binary. * callee body is hot in the profiling run. @@ -361,7 +361,7 @@ get_original_name (const char *name) /* Return the combined location, which is a 32bit integer in which higher 16 bits stores the line offset of LOC to the start lineno - of DECL, The lower 16 bits stores the discrimnator. */ + of DECL, The lower 16 bits stores the discriminator. */ static unsigned get_combined_location (location_t loc, tree decl) @@ -424,7 +424,7 @@ get_inline_stack (location_t locus, inline_stack * /* Return STMT's combined location, which is a 32bit integer in which higher 16 bits stores the line offset of LOC to the start lineno - of DECL, The lower 16 bits stores the discrimnator. */ + of DECL, The lower 16 bits stores the discriminator. 
*/ static unsigned get_relative_location_for_stmt (gimple stmt) @@ -481,8 +481,8 @@ string_table::get_index (const char *name) const string_index_map::const_iterator iter = map_.find (name); if (iter == map_.end ()) return -1; - else -return iter->second; + + return iter->second; } /* Return the index of a given function DECL. Return -1 if DECL is not @@ -502,8 +502,8 @@ string_table::get_index_by_decl (tree decl) const return ret; if (DECL_ABSTRACT_ORIGIN (decl)) return get_index_by_decl (DECL_ABSTRACT_ORIGIN (decl)); - else -return -1; + + return -1; } /* Return the function name of a given INDEX. */ @@ -569,8 +569,8 @@ function_instance::get_function_instance_by_decl ( } if (DECL_ABSTRACT_ORIGIN (decl)) return get_function_instance_by_decl (lineno, DECL_ABSTRACT_ORIGIN (decl)); - else -return NULL; + + return NULL; } /* Store the profile info for LOC in INFO. Return TRUE if profile info @@ -597,7 +597,7 @@ function_instance::mark_annotated (location_t loc) iter->second.annotated = true; } -/* Read the inlinied indirect call target profile for STMT and store it in +/* Read the inlined indirect call target profile for STMT and store it in MAP, return the total count for all inlined indirect calls. */ gcov_type @@ -824,8 +824,8 @@ autofdo_source_profile::get_callsite_total_count ( || afdo_string_table->get_index (IDENTIFIER_POINTER ( DECL_ASSEMBLER_NAME (edge->callee->decl))) != s->name ()) return 0; - else -return s->total_count (); + + return s->total_count (); } /* Read AutoFDO profile and returns TRUE on success. */ @@ -956,9 +956,9 @@ read_profile (void) histograms for indirect-call optimization. This function is actually served for 2 purposes: - * before annotation, we need to mark histogram, promote and inline - * after annotation, we just need to mark, and let follow-up logic to - decide if it needs to promote and inline. 
*/ + * before annotation, we need to mark histogram, promote and inline + * after annotation, we just need to mark, and let follow-up logic to + decide if it needs to promote and inline. */ static void afdo_indirect_call (gimple_stmt_iterator *gsi, const icall_target_map &map, @@ -1054,7 +1054,7 @@ set_edge_annotated (edge e, edge_set *annotated) } /* For a given BB, set its execution count. Attach value profile if a stmt - is not in PROMOTED, because we only want to promot an indirect call once. + is not in PROMOTED, because we only want to promote an indirect call once. Return TRUE if BB is annotated. */ static bool @@ -1138,7 +1138,7 @@ afdo_find_equiv_class (bb_set *annotated_bb) bb1->aux = bb; if (bb1->count > bb->count && is_bb_annotated (bb1,
Re: [PATCH] [AArch64, NEON] Improve vpmaxX vpminX intrinsics
On 09/12/14 08:17, Yangfei (Felix) wrote: On 28 November 2014 at 09:23, Yangfei (Felix) felix.y...@huawei.com wrote: Hi, This patch converts vpmaxX vpminX intrinsics to use builtin functions instead of the previous inline assembly syntax. Regtested with aarch64-linux-gnu on QEMU. Also passed the glorious testsuite of Christophe Lyon. OK for the trunk? Hi Felix, We know from experience that the advsimd intrinsics tend to be fragile for big endian and in general it is fairly easy to break the big endian case. For these advsimd improvements that you are working on (that we very much appreciate) it is important to run both little endian and big endian regressions. Thanks /Marcus Okay. Any plan for the advsimd big-endian improvement? I rebased this patch over Alan Lawrence's patch: https://gcc.gnu.org/ml/gcc-patches/2014-12/msg00279.html No regressions for aarch64_be-linux-gnu target too. OK for the trunk? Index: gcc/ChangeLog === --- gcc/ChangeLog (revision 218464) +++ gcc/ChangeLog (working copy) @@ -1,3 +1,18 @@ +2014-12-09 Felix Yang felix.y...@huawei.com + + * config/aarch64/aarch64-simd.md (aarch64_<maxmin_uns>p<mode>): New + pattern. + * config/aarch64/aarch64-simd-builtins.def (smaxp, sminp, umaxp, + uminp, smax_nanp, smin_nanp): New builtins. 
+ * config/aarch64/arm_neon.h (vpmax_s8, vpmax_s16, vpmax_s32, + vpmax_u8, vpmax_u16, vpmax_u32, vpmaxq_s8, vpmaxq_s16, vpmaxq_s32, + vpmaxq_u8, vpmaxq_u16, vpmaxq_u32, vpmax_f32, vpmaxq_f32, vpmaxq_f64, + vpmaxqd_f64, vpmaxs_f32, vpmaxnm_f32, vpmaxnmq_f32, vpmaxnmq_f64, + vpmaxnmqd_f64, vpmaxnms_f32, vpmin_s8, vpmin_s16, vpmin_s32, vpmin_u8, + vpmin_u16, vpmin_u32, vpminq_s8, vpminq_s16, vpminq_s32, vpminq_u8, + vpminq_u16, vpminq_u32, vpmin_f32, vpminq_f32, vpminq_f64, vpminqd_f64, + vpmins_f32, vpminnm_f32, vpminnmq_f32, vpminnmq_f64, + vpminnmqd_f64, + __extension__ static __inline float32x2_t __attribute__ ((__always_inline__)) Index: gcc/config/aarch64/aarch64-simd.md === --- gcc/config/aarch64/aarch64-simd.md (revision 218464) +++ gcc/config/aarch64/aarch64-simd.md (working copy) @@ -1017,6 +1017,28 @@ DONE; }) +;; Pairwise Integer Max/Min operations. +(define_insn "aarch64_<maxmin_uns>p<mode>" + [(set (match_operand:VDQ_BHSI 0 "register_operand" "=w") + (unspec:VDQ_BHSI [(match_operand:VDQ_BHSI 1 "register_operand" "w") +(match_operand:VDQ_BHSI 2 "register_operand" "w")] + MAXMINV))] + "TARGET_SIMD" + "<maxmin_uns_op>p\t%0.<Vtype>, %1.<Vtype>, %2.<Vtype>" + [(set_attr "type" "neon_minmax<q>")] +) + Hi Felix, Sorry for the delay in getting back to you on this. If you've rolled aarch64_reduc_<maxmin_uns>_internalv2si into the above pattern, do you still need it? For all its call points, just point them to aarch64_<maxmin_uns>p<mode>? Thanks, Tejas. Hello Tejas, I didn't do this yet. Currently the aarch64_reduc_<maxmin_uns>_internalv2si is only called by reduc_<maxmin_uns>_scal_<mode>. I find it kind of troublesome to handle this due to the use of iterators in the caller pattern. Are you going to rework this part?
[PATCH, autofdo] Some code cleanup
Hi, The attached patch does some code cleanup for auto-profile.c: fix typos and remove some unnecessary MAX/MIN checks plus some else. OK for the trunk? Index: gcc/auto-profile.c === --- gcc/auto-profile.c (revision 219297) +++ gcc/auto-profile.c (working copy) @@ -96,7 +96,7 @@ along with GCC; see the file COPYING3. If not see standalone symbol, or a clone of a function that is inlined into another function. - Phase 2: Early inline + valur profile transformation. + Phase 2: Early inline + value profile transformation. Early inline uses autofdo_source_profile to find if a callsite is: * inlined in the profiled binary. * callee body is hot in the profiling run. @@ -361,7 +361,7 @@ get_original_name (const char *name) /* Return the combined location, which is a 32bit integer in which higher 16 bits stores the line offset of LOC to the start lineno - of DECL, The lower 16 bits stores the discrimnator. */ + of DECL, The lower 16 bits stores the discriminator. */ static unsigned get_combined_location (location_t loc, tree decl) @@ -424,7 +424,7 @@ get_inline_stack (location_t locus, inline_stack * /* Return STMT's combined location, which is a 32bit integer in which higher 16 bits stores the line offset of LOC to the start lineno - of DECL, The lower 16 bits stores the discrimnator. */ + of DECL, The lower 16 bits stores the discriminator. */ static unsigned get_relative_location_for_stmt (gimple stmt) @@ -481,8 +481,8 @@ string_table::get_index (const char *name) const string_index_map::const_iterator iter = map_.find (name); if (iter == map_.end ()) return -1; - else -return iter-second; + + return iter-second; } /* Return the index of a given function DECL. Return -1 if DECL is not @@ -502,8 +502,8 @@ string_table::get_index_by_decl (tree decl) const return ret; if (DECL_ABSTRACT_ORIGIN (decl)) return get_index_by_decl (DECL_ABSTRACT_ORIGIN (decl)); - else -return -1; + + return -1; } /* Return the function name of a given INDEX. 
*/ @@ -569,8 +569,8 @@ function_instance::get_function_instance_by_decl ( } if (DECL_ABSTRACT_ORIGIN (decl)) return get_function_instance_by_decl (lineno, DECL_ABSTRACT_ORIGIN (decl)); - else -return NULL; + + return NULL; } /* Store the profile info for LOC in INFO. Return TRUE if profile info @@ -597,7 +597,7 @@ function_instance::mark_annotated (location_t loc) iter->second.annotated = true; } -/* Read the inlinied indirect call target profile for STMT and store it in +/* Read the inlined indirect call target profile for STMT and store it in MAP, return the total count for all inlined indirect calls. */ gcov_type @@ -824,8 +824,8 @@ autofdo_source_profile::get_callsite_total_count ( || afdo_string_table->get_index (IDENTIFIER_POINTER ( DECL_ASSEMBLER_NAME (edge->callee->decl))) != s->name ()) return 0; - else -return s->total_count (); + + return s->total_count (); } /* Read AutoFDO profile and returns TRUE on success. */ @@ -956,9 +956,9 @@ read_profile (void) histograms for indirect-call optimization. This function is actually served for 2 purposes: - * before annotation, we need to mark histogram, promote and inline - * after annotation, we just need to mark, and let follow-up logic to - decide if it needs to promote and inline. */ + * before annotation, we need to mark histogram, promote and inline + * after annotation, we just need to mark, and let follow-up logic to + decide if it needs to promote and inline. */ static void afdo_indirect_call (gimple_stmt_iterator *gsi, const icall_target_map &map, @@ -1054,7 +1054,7 @@ set_edge_annotated (edge e, edge_set *annotated) } /* For a given BB, set its execution count. Attach value profile if a stmt - is not in PROMOTED, because we only want to promot an indirect call once. + is not in PROMOTED, because we only want to promote an indirect call once. Return TRUE if BB is annotated. 
*/ static bool @@ -1138,7 +1138,7 @@ afdo_find_equiv_class (bb_set *annotated_bb) bb1->aux = bb; if (bb1->count > bb->count && is_bb_annotated (bb1, *annotated_bb)) { -bb->count = MAX (bb->count, bb1->count); +bb->count = bb1->count; set_bb_annotated (bb, annotated_bb); } } @@ -1150,7 +1150,7 @@ afdo_find_equiv_class (bb_set *annotated_bb) bb1->aux = bb; if (bb1->count > bb->count && is_bb_annotated (bb1, *annotated_bb)) { -bb->count = MAX (bb->count, bb1->count); +bb->count = bb1->count; set_bb_annotated (bb, annotated_bb); } } @@ -1455,13 +1455,14 @@ afdo_vpt_for_early_inline (stmt_set *promoted_stmt } } } + if (has_vpt) { optimize_inline_calls
[PATCH] Fix PR64240
Hi, This patch fixes an obvious typo which may affect the DDG creation of SMS and make this optimization produce buggy code. Bootstrapped on x86_64-suse-linux. Also passed check-gcc test for aarch64-linux-gnu. OK for the trunk? Index: gcc/ddg.c === --- gcc/ddg.c (revision 218582) +++ gcc/ddg.c (working copy) @@ -77,7 +77,7 @@ mark_mem_use (rtx *x, void *) { subrtx_iterator::array_type array; FOR_EACH_SUBRTX (iter, array, *x, NONCONST) -if (MEM_P (*x)) +if (MEM_P (*iter)) { mem_ref_p = true; break; Index: gcc/ChangeLog === --- gcc/ChangeLog (revision 218582) +++ gcc/ChangeLog (working copy) @@ -1,3 +1,8 @@ +2014-12-16 Felix Yang felix.y...@huawei.com + + PR rtl-optimization/64240 + * ddg.c (mark_mem_use): Check *iter instead of *x. + 2014-12-10 Felix Yang felix.y...@huawei.com * config/aarch64/aarch64-protos.h (aarch64_function_profiler): Remove pr64240.diff Description: pr64240.diff
Re: [PATCH] Fix PR64240
On December 16, 2014 9:51:25 AM CET, Yangfei (Felix) felix.y...@huawei.com wrote: Hi, This patch fixes an obvious typo which may affect the DDG creation of SMS and make this optimization produce buggy code. Bootstrapped on x86_64-suse-linux. Also passed check-gcc test for aarch64-linux-gnu. OK for the trunk? Do you have a testcase? If so please add it. OK Yes, the patch is updated with the testcase added. Index: gcc/ddg.c === --- gcc/ddg.c (revision 218582) +++ gcc/ddg.c (working copy) @@ -77,7 +77,7 @@ mark_mem_use (rtx *x, void *) { subrtx_iterator::array_type array; FOR_EACH_SUBRTX (iter, array, *x, NONCONST) -if (MEM_P (*x)) +if (MEM_P (*iter)) { mem_ref_p = true; break; Index: gcc/ChangeLog === --- gcc/ChangeLog (revision 218582) +++ gcc/ChangeLog (working copy) @@ -1,3 +1,8 @@ +2014-12-16 Felix Yang felix.y...@huawei.com + + PR rtl-optimization/64240 + * ddg.c (mark_mem_use): Check *iter instead of *x. + 2014-12-10 Felix Yang felix.y...@huawei.com * config/aarch64/aarch64-protos.h (aarch64_function_profiler): Remove Index: gcc/testsuite/gcc.dg/sms-12.c === --- gcc/testsuite/gcc.dg/sms-12.c (revision 0) +++ gcc/testsuite/gcc.dg/sms-12.c (revision 0) @@ -0,0 +1,43 @@ +/* { dg-do run } */ +/* { dg-skip-if "" { ! { aarch64-*-* } } { "*" } { "" } } */ +/* { dg-options "-O2 -fmodulo-sched -funroll-loops -fdump-rtl-sms --param sms-min-sc=1 -fmodulo-sched-allow-regmoves -fPIC" } */ + +extern void abort (void); + +int X[1000]={0}; +int Y[1000]={0}; + +extern void abort (void); + +__attribute__ ((noinline)) +int +foo (int len, long a) +{ + int i; + long res = a; + + len = 1000; + for (i = 0; i < len; i++) +res += X[i] * Y[i]; + + if (res != 601) +abort (); + +} + +int +main () +{ + X[0] = Y[1] = 2; + Y[0] = X[1] = 21; + X[2] = Y[3] = 3; + Y[2] = X[3] = 31; + X[4] = Y[5] = 4; + Y[4] = X[5] = 41; + + foo (6, 3); + return 0; +} + +/* { dg-final { cleanup-rtl-dump "sms" } } */ + Property changes on: gcc/testsuite/gcc.dg/sms-12.c ___ Added: svn:executable + * Index: gcc/testsuite/ChangeLog === --- gcc/testsuite/ChangeLog (revision 218582) +++ gcc/testsuite/ChangeLog (working copy) @@ -1,3 +1,8 @@ +2014-12-16 Felix Yang felix.y...@huawei.com + + PR rtl-optimization/64240 + * gcc.dg/sms-12.c: New test. + 2014-12-10 Martin Liska mli...@suse.cz * gcc.dg/ipa/pr63909.c: New test. pr64240-v2.diff Description: pr64240-v2.diff
Re: [PATCH] [AArch64, NEON] Fix testcases added by r218484
#define DECL_VABD_VAR(VAR) \ be careful with your cut and paste. VABD should probably be VFMA_N here, although it's purely a naming convention :-) The v3 patch attached fixed this minor issue. Thanks. It's OK for me with that change, but I'm not a maintainer. One more question: are there any corner-cases we would want to check? (for instance, rounding, nan, infinity, ...) We don't see any testsuite covers the test of these intrinsics. So we are adding these testcases to test the basic functionality. For now, I don't see any corner-cases that need to be checked for this patch. Index: gcc/testsuite/gcc.target/aarch64/advsimd-intrinsics/compute-ref-data.h === --- gcc/testsuite/gcc.target/aarch64/advsimd-intrinsics/compute-ref-data.h (revision 218582) +++ gcc/testsuite/gcc.target/aarch64/advsimd-intrinsics/compute-ref-data.h (working copy) @@ -142,6 +142,10 @@ VECT_VAR_DECL_INIT(buffer, poly, 16, 8); PAD(buffer_pad, poly, 16, 8); VECT_VAR_DECL_INIT(buffer, float, 32, 4); PAD(buffer_pad, float, 32, 4); +#ifdef __aarch64__ +VECT_VAR_DECL_INIT(buffer, float, 64, 2); +PAD(buffer_pad, float, 64, 2); +#endif /* The tests for vld1_dup and vdup expect at least 4 entries in the input buffer, so force 1- and 2-elements initializers to have 4 Index: gcc/testsuite/gcc.target/aarch64/advsimd-intrinsics/vfma_n.c === --- gcc/testsuite/gcc.target/aarch64/advsimd-intrinsics/vfma_n.c (revision 218582) +++ gcc/testsuite/gcc.target/aarch64/advsimd-intrinsics/vfma_n.c (working copy) @@ -2,6 +2,7 @@ #include arm-neon-ref.h #include compute-ref-data.h +#if defined(__aarch64__) defined(__ARM_FEATURE_FMA) /* Expected results. 
*/ VECT_VAR_DECL(expected,hfloat,32,2) [] = { 0x4438ca3d, 0x44390a3d }; VECT_VAR_DECL(expected,hfloat,32,4) [] = { 0x44869eb8, 0x4486beb8, 0x4486deb8, 0x4486feb8 }; @@ -9,33 +10,34 @@ VECT_VAR_DECL(expected,hfloat,64,2) [] = { 0x40890 #define VECT_VAR_ASSIGN(S,Q,T1,W) S##Q##_##T1##W #define ASSIGN(S, Q, T, W, V) T##W##_t S##Q##_##T##W = V -#define TEST_MSG VFMA/VFMAQ +#define TEST_MSG VFMA_N/VFMAQ_N + void exec_vfma_n (void) { /* Basic test: v4=vfma_n(v1,v2), then store the result. */ -#define TEST_VFMA(Q, T1, T2, W, N) \ +#define TEST_VFMA_N(Q, T1, T2, W, N) \ VECT_VAR(vector_res, T1, W, N) = \ vfma##Q##_n_##T2##W(VECT_VAR(vector1, T1, W, N), \ - VECT_VAR(vector2, T1, W, N), \ - VECT_VAR_ASSIGN(Scalar, Q, T1, W)); \ + VECT_VAR(vector2, T1, W, N),\ + VECT_VAR_ASSIGN(scalar, Q, T1, W)); \ vst1##Q##_##T2##W(VECT_VAR(result, T1, W, N), VECT_VAR(vector_res, T1, W, N)) -#define CHECK_VFMA_RESULTS(test_name,comment) \ +#define CHECK_VFMA_N_RESULTS(test_name,comment) \ {\ CHECK_FP(test_name, float, 32, 2, PRIx32, expected, comment); \ CHECK_FP(test_name, float, 32, 4, PRIx32, expected, comment); \ - CHECK_FP(test_name, float, 64, 2, PRIx64, expected, comment); \ - } +CHECK_FP(test_name, float, 64, 2, PRIx64, expected, comment); \ + } -#define DECL_VABD_VAR(VAR) \ +#define DECL_VFMA_N_VAR(VAR) \ DECL_VARIABLE(VAR, float, 32, 2);\ DECL_VARIABLE(VAR, float, 32, 4);\ - DECL_VARIABLE(VAR, float, 64, 2); + DECL_VARIABLE(VAR, float, 64, 2); - DECL_VABD_VAR(vector1); - DECL_VABD_VAR(vector2); - DECL_VABD_VAR(vector3); - DECL_VABD_VAR(vector_res); + DECL_VFMA_N_VAR(vector1); + DECL_VFMA_N_VAR(vector2); + DECL_VFMA_N_VAR(vector3); + DECL_VFMA_N_VAR(vector_res); clean_results (); @@ -50,20 +52,23 @@ void exec_vfma_n (void) VDUP(vector2, q, float, f, 64, 2, 15.8f); /* Choose init value arbitrarily. 
*/ - ASSIGN(Scalar, , float, 32, 81.2f); - ASSIGN(Scalar, q, float, 32, 36.8f); - ASSIGN(Scalar, q, float, 64, 51.7f); + ASSIGN(scalar, , float, 32, 81.2f); + ASSIGN(scalar, q, float, 32, 36.8f); + ASSIGN(scalar, q, float, 64, 51.7f); /* Execute the tests. */ - TEST_VFMA(, float, f, 32, 2); - TEST_VFMA(q, float, f, 32, 4); - TEST_VFMA(q, float, f, 64, 2); + TEST_VFMA_N(, float, f, 32, 2); + TEST_VFMA_N(q, float, f, 32, 4); + TEST_VFMA_N(q, float, f, 64, 2); - CHECK_VFMA_RESULTS (TEST_MSG, ""); + CHECK_VFMA_N_RESULTS (TEST_MSG, ""); } +#endif int main (void) { +#if defined(__aarch64__) && defined(__ARM_FEATURE_FMA) exec_vfma_n (); +#endif return 0; } Index: gcc/testsuite/gcc.target/aarch64/advsimd-intrinsics/vfma.c
Re: [PATCH] [AArch64, NEON] Fix testcases added by r218484
Thanks for reviewing the patch. See my comments inlined: This patch fix this two issues. Three changes: 1. vfma_f32, vfmaq_f32, vfms_f32, vfmsq_f32 are only available for arm*-*-* target with the FMA feature, we take care of this through the macro __ARM_FEATURE_FMA. 2. vfma_n_f32 and vfmaq_n_f32 are only available for aarch64 target, we take care of this through the macro __aarch64__. 3. vfmaq_f64, vfmaq_n_f64 and vfmsq_f64 are only available for aarch64 target, we just exclude test for them to keep the testcases clean. (Note: They also pass on aarch64 aarch64_be target and we can add test for them if needed). I would prefer to have all the available variants tested. OK, the v2 patch attached have all the available variants added. +#ifdef __aarch64__ /* Expected results. */ VECT_VAR_DECL(expected,hfloat,32,2) [] = { 0x4438ca3d, 0x44390a3d }; VECT_VAR_DECL(expected,hfloat,32,4) [] = { 0x44869eb8, 0x4486beb8, 0x4486deb8, 0x4486feb8 }; -VECT_VAR_DECL(expected,hfloat,64,2) [] = { 0x408906e1532b8520, 0x40890ee1532b8520 }; Why do you remove this one? We need to make some changes to the header files for this test. Initially, I don't want to touch the header files, so I reduced this testcase to a minimal one. int main (void) { +#ifdef __ARM_FEATURE_FMA exec_vfms (); +#endif return 0; } In the other tests, I try to put as much code in common as possible, between the 'a' and 's' variants (e.g. vmla/vmls). Maybe you can do that as a follow-up? Yes, I think we can handle this with a follow-on patch. The v2 patch is tested on armeb-linux-gnueabi, arm-linux-gnueabi, aarch64-linux-gnu and aarch64_be-linux-gnu. How about this one? Thanks. 
Index: gcc/testsuite/gcc.target/aarch64/advsimd-intrinsics/compute-ref-data.h === --- gcc/testsuite/gcc.target/aarch64/advsimd-intrinsics/compute-ref-data.h (revision 218582) +++ gcc/testsuite/gcc.target/aarch64/advsimd-intrinsics/compute-ref-data.h (working copy) @@ -142,6 +142,10 @@ VECT_VAR_DECL_INIT(buffer, poly, 16, 8); PAD(buffer_pad, poly, 16, 8); VECT_VAR_DECL_INIT(buffer, float, 32, 4); PAD(buffer_pad, float, 32, 4); +#ifdef __aarch64__ +VECT_VAR_DECL_INIT(buffer, float, 64, 2); +PAD(buffer_pad, float, 64, 2); +#endif /* The tests for vld1_dup and vdup expect at least 4 entries in the input buffer, so force 1- and 2-elements initializers to have 4 Index: gcc/testsuite/gcc.target/aarch64/advsimd-intrinsics/vfma_n.c === --- gcc/testsuite/gcc.target/aarch64/advsimd-intrinsics/vfma_n.c (revision 218582) +++ gcc/testsuite/gcc.target/aarch64/advsimd-intrinsics/vfma_n.c (working copy) @@ -2,6 +2,7 @@ #include arm-neon-ref.h #include compute-ref-data.h +#if defined(__aarch64__) defined(__ARM_FEATURE_FMA) /* Expected results. */ VECT_VAR_DECL(expected,hfloat,32,2) [] = { 0x4438ca3d, 0x44390a3d }; VECT_VAR_DECL(expected,hfloat,32,4) [] = { 0x44869eb8, 0x4486beb8, 0x4486deb8, 0x4486feb8 }; @@ -9,28 +10,29 @@ VECT_VAR_DECL(expected,hfloat,64,2) [] = { 0x40890 #define VECT_VAR_ASSIGN(S,Q,T1,W) S##Q##_##T1##W #define ASSIGN(S, Q, T, W, V) T##W##_t S##Q##_##T##W = V -#define TEST_MSG VFMA/VFMAQ +#define TEST_MSG VFMA_N/VFMAQ_N + void exec_vfma_n (void) { /* Basic test: v4=vfma_n(v1,v2), then store the result. 
*/ -#define TEST_VFMA(Q, T1, T2, W, N) \ +#define TEST_VFMA_N(Q, T1, T2, W, N) \ VECT_VAR(vector_res, T1, W, N) = \ vfma##Q##_n_##T2##W(VECT_VAR(vector1, T1, W, N), \ - VECT_VAR(vector2, T1, W, N), \ - VECT_VAR_ASSIGN(Scalar, Q, T1, W)); \ + VECT_VAR(vector2, T1, W, N),\ + VECT_VAR_ASSIGN(scalar, Q, T1, W)); \ vst1##Q##_##T2##W(VECT_VAR(result, T1, W, N), VECT_VAR(vector_res, T1, W, N)) -#define CHECK_VFMA_RESULTS(test_name,comment) \ +#define CHECK_VFMA_N_RESULTS(test_name,comment) \ {\ CHECK_FP(test_name, float, 32, 2, PRIx32, expected, comment); \ CHECK_FP(test_name, float, 32, 4, PRIx32, expected, comment); \ - CHECK_FP(test_name, float, 64, 2, PRIx64, expected, comment); \ - } +CHECK_FP(test_name, float, 64, 2, PRIx64, expected, comment); \ + } #define DECL_VABD_VAR(VAR) \ DECL_VARIABLE(VAR, float, 32, 2);\ DECL_VARIABLE(VAR, float, 32, 4);\ - DECL_VARIABLE(VAR, float, 64, 2); + DECL_VARIABLE(VAR, float, 64, 2); DECL_VABD_VAR(vector1); DECL_VABD_VAR(vector2); @@ -50,20 +52,23
[PATCH] [AArch64, NEON] Add vfms_n_f32, vfmsq_n_f32 and vfmsq_n_f64 specified by the ACLE
Hi, This patch add three intrinsics that are required by the ACLE specification. A new testcase is added which covers vfms_n_f32 and vfmsq_n_f32. Tested on both aarch64-linux-gnu and aarch64_be-linux-gnu. OK? Index: gcc/ChangeLog === --- gcc/ChangeLog (revision 218582) +++ gcc/ChangeLog (working copy) @@ -1,3 +1,8 @@ +2014-12-11 Felix Yang felix.y...@huawei.com + + * config/aarch64/arm_neon.h (vfms_n_f32, vfmsq_n_f32, vfmsq_n_f64): New + intrinsics. + 2014-12-10 Felix Yang felix.y...@huawei.com * config/aarch64/aarch64-protos.h (aarch64_function_profiler): Remove Index: gcc/testsuite/gcc.target/aarch64/advsimd-intrinsics/vfms_n.c === --- gcc/testsuite/gcc.target/aarch64/advsimd-intrinsics/vfms_n.c (revision 0) +++ gcc/testsuite/gcc.target/aarch64/advsimd-intrinsics/vfms_n.c (revision 0) @@ -0,0 +1,67 @@ +#include arm_neon.h +#include arm-neon-ref.h +#include compute-ref-data.h + +#ifdef __aarch64__ +/* Expected results. */ +VECT_VAR_DECL(expected,hfloat,32,2) [] = { 0x4438ca3d, 0x44390a3d }; +VECT_VAR_DECL(expected,hfloat,32,4) [] = { 0x44869eb8, 0x4486beb8, 0x4486deb8, 0x4486feb8 }; + +#define VECT_VAR_ASSIGN(S,Q,T1,W) S##Q##_##T1##W +#define ASSIGN(S, Q, T, W, V) T##W##_t S##Q##_##T##W = V +#define TEST_MSG VFMS_N/VFMSQ_N + +void exec_vfms_n (void) +{ + /* Basic test: v4=vfms_n(v1,v2), then store the result. 
*/ +#define TEST_VFMS(Q, T1, T2, W, N) \ + VECT_VAR(vector_res, T1, W, N) = \ +vfms##Q##_n_##T2##W(VECT_VAR(vector1, T1, W, N), \ + VECT_VAR(vector2, T1, W, N),\ + VECT_VAR_ASSIGN(scalar, Q, T1, W)); \ + vst1##Q##_##T2##W(VECT_VAR(result, T1, W, N), VECT_VAR(vector_res, T1, W, N)) + +#define CHECK_VFMS_RESULTS(test_name,comment) \ + {\ +CHECK_FP(test_name, float, 32, 2, PRIx32, expected, comment); \ +CHECK_FP(test_name, float, 32, 4, PRIx32, expected, comment); \ + } + +#define DECL_VABD_VAR(VAR) \ + DECL_VARIABLE(VAR, float, 32, 2);\ + DECL_VARIABLE(VAR, float, 32, 4);\ + + DECL_VABD_VAR(vector1); + DECL_VABD_VAR(vector2); + DECL_VABD_VAR(vector3); + DECL_VABD_VAR(vector_res); + + clean_results (); + + /* Initialize input vector1 from buffer. */ + VLOAD(vector1, buffer, , float, f, 32, 2); + VLOAD(vector1, buffer, q, float, f, 32, 4); + + /* Choose init value arbitrarily. */ + VDUP(vector2, , float, f, 32, 2, -9.3f); + VDUP(vector2, q, float, f, 32, 4, -29.7f); + + /* Choose init value arbitrarily. */ + ASSIGN(scalar, , float, 32, 81.2f); + ASSIGN(scalar, q, float, 32, 36.8f); + + /* Execute the tests. */ + TEST_VFMS(, float, f, 32, 2); + TEST_VFMS(q, float, f, 32, 4); + + CHECK_VFMS_RESULTS (TEST_MSG, ); +} +#endif + +int main (void) +{ +#ifdef __aarch64__ + exec_vfms_n (); +#endif + return 0; +} Index: gcc/testsuite/ChangeLog === --- gcc/testsuite/ChangeLog (revision 218582) +++ gcc/testsuite/ChangeLog (working copy) @@ -1,3 +1,7 @@ +2014-12-08 Felix Yang felix.y...@huawei.com + + * gcc.target/aarch64/advsimd-intrinsics/vfms_n.c: New test. + 2014-12-10 Martin Liska mli...@suse.cz * gcc.dg/ipa/pr63909.c: New test. 
Index: gcc/config/aarch64/arm_neon.h === --- gcc/config/aarch64/arm_neon.h (revision 218582) +++ gcc/config/aarch64/arm_neon.h (working copy) @@ -15254,7 +15254,24 @@ vfmsq_f64 (float64x2_t __a, float64x2_t __b, float return __builtin_aarch64_fmav2df (-__b, __c, __a); } +__extension__ static __inline float32x2_t __attribute__ ((__always_inline__)) +vfms_n_f32 (float32x2_t __a, float32x2_t __b, float32_t __c) +{ + return __builtin_aarch64_fmav2sf (-__b, vdup_n_f32 (__c), __a); +} +__extension__ static __inline float32x4_t __attribute__ ((__always_inline__)) +vfmsq_n_f32 (float32x4_t __a, float32x4_t __b, float32_t __c) +{ + return __builtin_aarch64_fmav4sf (-__b, vdupq_n_f32 (__c), __a); +} + +__extension__ static __inline float64x2_t __attribute__ ((__always_inline__)) +vfmsq_n_f64 (float64x2_t __a, float64x2_t __b, float64_t __c) +{ + return __builtin_aarch64_fmav2df (-__b, vdupq_n_f64 (__c), __a); +} + /* vfms_lane */ __extension__ static __inline float32x2_t __attribute__ ((__always_inline__)) add-vfms_n-v1.diff Description: add-vfms_n-v1.diff
Re: [PING ^ 3][PATCH, AArch64] Add doloop_end pattern for -fmodulo-sched
--- gcc/config/aarch64/aarch64.c(revision 217394) +++ gcc/config/aarch64/aarch64.c(working copy) @@ -10224,6 +10224,9 @@ aarch64_use_by_pieces_infrastructure_p (unsigned i #define TARGET_USE_BY_PIECES_INFRASTRUCTURE_P \ aarch64_use_by_pieces_infrastructure_p +#undef TARGET_CAN_USE_DOLOOP_P +#define TARGET_CAN_USE_DOLOOP_P can_use_doloop_if_innermost + struct gcc_target targetm = TARGET_INITIALIZER; #include gt-aarch64.h Hi Felix, This patch causes https://gcc.gnu.org/bugzilla/show_bug.cgi?id=64240 when sms-3 is tested with -fPIC. It runs fine when I reverse this patch out. Please could you have a look? Thanks, Tejas. OK, I have reproduced with -fPIC option. Will take a look.
Re: [COMMITTED] [PING] [PATCH] [AArch64, NEON] More NEON intrinsics improvement
+__extension__ static __inline float32x2_t __attribute__ +((__always_inline__)) +vfms_f32 (float32x2_t __a, float32x2_t __b, float32x2_t __c) { + return __builtin_aarch64_fmav2sf (-__b, __c, __a); } + +__extension__ static __inline float32x4_t __attribute__ +((__always_inline__)) +vfmsq_f32 (float32x4_t __a, float32x4_t __b, float32x4_t __c) { + return __builtin_aarch64_fmav4sf (-__b, __c, __a); } + +__extension__ static __inline float64x2_t __attribute__ +((__always_inline__)) +vfmsq_f64 (float64x2_t __a, float64x2_t __b, float64x2_t __c) { + return __builtin_aarch64_fmav2df (-__b, __c, __a); } + + Thanks, the patch looks good. Just one comment: You could also add float32x2_t vfms_n_f32(float32x2_t a, float32x2_t b, float32_t n) and its Q-variant. You can, if you wish, deal with Tejas' comment with a follow on patch rather than re-spinning this one. Provided this patch has no regressions on a big endian and a little endian test run then you can commit it. Thanks /Marcus No regressions for aarch64_be-linux-gnu target. Committed as r218484. Will come up with a new patch to deal with Tejas' comment. Thanks. My validations of trunk show that your new tests are incorrect: none of them compiles because the hfloat64_t type isn't defined. Also, keep in mind that the tests in this directory are executed by the aarch32 target too. Christophe It seems that some code for the newly added testcases is missing when the patch is generated. I will fix them soon. Thanks for pointing this out.
[PATCH] [AArch64, NEON] Fix testcases added by r218484
Hi, We find that the committed patch is not correctly generated from our local branch. This caused some code necessary for the testcases missing. As pointed out by Christophe in https://gcc.gnu.org/ml/gcc-patches/2014-12/msg00778.html, we need to rework the testcases so that it can work for AArch32 target too. This patch fix this two issues. Three changes: 1. vfma_f32, vfmaq_f32, vfms_f32, vfmsq_f32 are only available for arm*-*-* target with the FMA feature, we take care of this through the macro __ARM_FEATURE_FMA. 2. vfma_n_f32 and vfmaq_n_f32 are only available for aarch64 target, we take care of this through the macro __aarch64__. 3. vfmaq_f64, vfmaq_n_f64 and vfmsq_f64 are only available for aarch64 target, we just exclude test for them to keep the testcases clean. (Note: They also pass on aarch64 aarch64_be target and we can add test for them if needed). Tested on armeb-linux-gnueabi, arm-linux-gnueabi, aarch64-linux-gnu and aarch64_be-linux-gnu. OK for the trunk? Sorry if this cause you guys any trouble, we will be more carefull in our future work. Index: gcc/testsuite/gcc.target/aarch64/advsimd-intrinsics/vfma_n.c === --- gcc/testsuite/gcc.target/aarch64/advsimd-intrinsics/vfma_n.c (revision 218582) +++ gcc/testsuite/gcc.target/aarch64/advsimd-intrinsics/vfma_n.c (working copy) @@ -2,35 +2,34 @@ #include arm-neon-ref.h #include compute-ref-data.h +#ifdef __aarch64__ /* Expected results. */ VECT_VAR_DECL(expected,hfloat,32,2) [] = { 0x4438ca3d, 0x44390a3d }; VECT_VAR_DECL(expected,hfloat,32,4) [] = { 0x44869eb8, 0x4486beb8, 0x4486deb8, 0x4486feb8 }; -VECT_VAR_DECL(expected,hfloat,64,2) [] = { 0x408906e1532b8520, 0x40890ee1532b8520 }; #define VECT_VAR_ASSIGN(S,Q,T1,W) S##Q##_##T1##W #define ASSIGN(S, Q, T, W, V) T##W##_t S##Q##_##T##W = V -#define TEST_MSG VFMA/VFMAQ +#define TEST_MSG VFMA_N/VFMAQ_N + void exec_vfma_n (void) { /* Basic test: v4=vfma_n(v1,v2), then store the result. 
*/ #define TEST_VFMA(Q, T1, T2, W, N) \ VECT_VAR(vector_res, T1, W, N) = \ vfma##Q##_n_##T2##W(VECT_VAR(vector1, T1, W, N), \ - VECT_VAR(vector2, T1, W, N), \ - VECT_VAR_ASSIGN(Scalar, Q, T1, W)); \ + VECT_VAR(vector2, T1, W, N),\ + VECT_VAR_ASSIGN(scalar, Q, T1, W)); \ vst1##Q##_##T2##W(VECT_VAR(result, T1, W, N), VECT_VAR(vector_res, T1, W, N)) #define CHECK_VFMA_RESULTS(test_name,comment) \ {\ CHECK_FP(test_name, float, 32, 2, PRIx32, expected, comment); \ CHECK_FP(test_name, float, 32, 4, PRIx32, expected, comment); \ - CHECK_FP(test_name, float, 64, 2, PRIx64, expected, comment); \ - } + } #define DECL_VABD_VAR(VAR) \ DECL_VARIABLE(VAR, float, 32, 2);\ - DECL_VARIABLE(VAR, float, 32, 4);\ - DECL_VARIABLE(VAR, float, 64, 2); + DECL_VARIABLE(VAR, float, 32, 4); DECL_VABD_VAR(vector1); DECL_VABD_VAR(vector2); @@ -42,28 +41,27 @@ void exec_vfma_n (void) /* Initialize input vector1 from buffer. */ VLOAD(vector1, buffer, , float, f, 32, 2); VLOAD(vector1, buffer, q, float, f, 32, 4); - VLOAD(vector1, buffer, q, float, f, 64, 2); /* Choose init value arbitrarily. */ VDUP(vector2, , float, f, 32, 2, 9.3f); VDUP(vector2, q, float, f, 32, 4, 29.7f); - VDUP(vector2, q, float, f, 64, 2, 15.8f); /* Choose init value arbitrarily. */ - ASSIGN(Scalar, , float, 32, 81.2f); - ASSIGN(Scalar, q, float, 32, 36.8f); - ASSIGN(Scalar, q, float, 64, 51.7f); + ASSIGN(scalar, , float, 32, 81.2f); + ASSIGN(scalar, q, float, 32, 36.8f); /* Execute the tests. 
*/ TEST_VFMA(, float, f, 32, 2); TEST_VFMA(q, float, f, 32, 4); - TEST_VFMA(q, float, f, 64, 2); CHECK_VFMA_RESULTS (TEST_MSG, ); } +#endif int main (void) { +#ifdef __aarch64__ exec_vfma_n (); +#endif return 0; } Index: gcc/testsuite/gcc.target/aarch64/advsimd-intrinsics/vfma.c === --- gcc/testsuite/gcc.target/aarch64/advsimd-intrinsics/vfma.c (revision 218582) +++ gcc/testsuite/gcc.target/aarch64/advsimd-intrinsics/vfma.c (working copy) @@ -2,12 +2,13 @@ #include arm-neon-ref.h #include compute-ref-data.h +#ifdef __ARM_FEATURE_FMA /* Expected results. */ VECT_VAR_DECL(expected,hfloat,32,2) [] = { 0x4438ca3d, 0x44390a3d }; VECT_VAR_DECL(expected,hfloat,32,4) [] = { 0x44869eb8, 0x4486beb8, 0x4486deb8, 0x4486feb8 }; -VECT_VAR_DECL(expected,hfloat,64,2) [] = { 0x408906e1532b8520, 0x40890ee1532b8520 }; #define TEST_MSG VFMA/VFMAQ + void exec_vfma
Re: [PATCH] [AArch64, NEON] Improve vpmaxX vpminX intrinsics
On 28 November 2014 at 09:23, Yangfei (Felix) felix.y...@huawei.com wrote: Hi, This patch converts vpmaxX vpminX intrinsics to use builtin functions instead of the previous inline assembly syntax. Regtested with aarch64-linux-gnu on QEMU. Also passed the glorious testsuite of Christophe Lyon. OK for the trunk? Hi Felix, We know from experience that the advsimd intrinsics tend to be fragile for big endian and in general it is fairly easy to break the big endian case. For these advsimd improvements that you are working on (that we very much appreciate) it is important to run both little endian and big endian regressions. Thanks /Marcus Okay. Any plan for the advsimd big-endian improvement? I rebased this patch over Alan Lawrence's patch: https://gcc.gnu.org/ml/gcc-patches/2014-12/msg00279.html No regressions for the aarch64_be-linux-gnu target either. OK for the trunk? Index: gcc/ChangeLog === --- gcc/ChangeLog (revision 218464) +++ gcc/ChangeLog (working copy) @@ -1,3 +1,18 @@ +2014-12-09 Felix Yang felix.y...@huawei.com + + * config/aarch64/aarch64-simd.md (aarch64_maxmin_unspmode): New + pattern. + * config/aarch64/aarch64-simd-builtins.def (smaxp, sminp, umaxp, + uminp, smax_nanp, smin_nanp): New builtins. 
+ * config/aarch64/arm_neon.h (vpmax_s8, vpmax_s16, vpmax_s32, + vpmax_u8, vpmax_u16, vpmax_u32, vpmaxq_s8, vpmaxq_s16, vpmaxq_s32, + vpmaxq_u8, vpmaxq_u16, vpmaxq_u32, vpmax_f32, vpmaxq_f32, vpmaxq_f64, + vpmaxqd_f64, vpmaxs_f32, vpmaxnm_f32, vpmaxnmq_f32, vpmaxnmq_f64, + vpmaxnmqd_f64, vpmaxnms_f32, vpmin_s8, vpmin_s16, vpmin_s32, vpmin_u8, + vpmin_u16, vpmin_u32, vpminq_s8, vpminq_s16, vpminq_s32, vpminq_u8, + vpminq_u16, vpminq_u32, vpmin_f32, vpminq_f32, vpminq_f64, vpminqd_f64, + vpmins_f32, vpminnm_f32, vpminnmq_f32, vpminnmq_f64, vpminnmqd_f64, + 2014-12-07 Felix Yang felix.y...@huawei.com Shanyao Chen chenshan...@huawei.com Index: gcc/config/aarch64/arm_neon.h === --- gcc/config/aarch64/arm_neon.h (revision 218464) +++ gcc/config/aarch64/arm_neon.h (working copy) @@ -8843,491 +8843,7 @@ vpadds_f32 (float32x2_t a) return result; } -__extension__ static __inline float32x2_t __attribute__ ((__always_inline__)) -vpmax_f32 (float32x2_t a, float32x2_t b) -{ - float32x2_t result; - __asm__ (fmaxp %0.2s, %1.2s, %2.2s - : =w(result) - : w(a), w(b) - : /* No clobbers */); - return result; -} - -__extension__ static __inline int8x8_t __attribute__ ((__always_inline__)) -vpmax_s8 (int8x8_t a, int8x8_t b) -{ - int8x8_t result; - __asm__ (smaxp %0.8b, %1.8b, %2.8b - : =w(result) - : w(a), w(b) - : /* No clobbers */); - return result; -} - __extension__ static __inline int16x4_t __attribute__ ((__always_inline__)) -vpmax_s16 (int16x4_t a, int16x4_t b) -{ - int16x4_t result; - __asm__ (smaxp %0.4h, %1.4h, %2.4h - : =w(result) - : w(a), w(b) - : /* No clobbers */); - return result; -} - -__extension__ static __inline int32x2_t __attribute__ ((__always_inline__)) -vpmax_s32 (int32x2_t a, int32x2_t b) -{ - int32x2_t result; - __asm__ (smaxp %0.2s, %1.2s, %2.2s - : =w(result) - : w(a), w(b) - : /* No clobbers */); - return result; -} - -__extension__ static __inline uint8x8_t __attribute__ ((__always_inline__)) -vpmax_u8 (uint8x8_t a, uint8x8_t b) -{ - uint8x8_t result; - 
__asm__ (umaxp %0.8b, %1.8b, %2.8b - : =w(result) - : w(a), w(b) - : /* No clobbers */); - return result; -} - -__extension__ static __inline uint16x4_t __attribute__ ((__always_inline__)) -vpmax_u16 (uint16x4_t a, uint16x4_t b) -{ - uint16x4_t result; - __asm__ (umaxp %0.4h, %1.4h, %2.4h - : =w(result) - : w(a), w(b) - : /* No clobbers */); - return result; -} - -__extension__ static __inline uint32x2_t __attribute__ ((__always_inline__)) -vpmax_u32 (uint32x2_t a, uint32x2_t b) -{ - uint32x2_t result; - __asm__ (umaxp %0.2s, %1.2s, %2.2s - : =w(result) - : w(a), w(b) - : /* No clobbers */); - return result; -} - -__extension__ static __inline float32x2_t __attribute__ ((__always_inline__)) -vpmaxnm_f32 (float32x2_t a, float32x2_t b) -{ - float32x2_t result; - __asm__ (fmaxnmp %0.2s,%1.2s,%2.2s - : =w(result) - : w(a), w(b) - : /* No clobbers */); - return result; -} - -__extension__ static __inline float32x4_t __attribute__ ((__always_inline__)) -vpmaxnmq_f32 (float32x4_t a, float32x4_t b) -{ - float32x4_t result; - __asm__ (fmaxnmp %0.4s,%1.4s,%2.4s - : =w(result) - : w(a), w(b) - : /* No clobbers */); - return result; -} - -__extension__ static __inline float64x2_t
Re: [PATCH] [AArch64, NEON] Improve vpmaxX vpminX intrinsics
You'll need to rebase over Alan Lawrence's patch. https://gcc.gnu.org/ml/gcc-patches/2014-12/msg00279.html Yes, see my new patch: https://gcc.gnu.org/ml/gcc-patches/2014-12/msg00750.html +;; Pairwise Integer Max/Min operations. +(define_insn aarch64_maxmin_unspmode + [(set (match_operand:VQ_S 0 register_operand =w) + (unspec:VQ_S [(match_operand:VQ_S 1 register_operand w) +(match_operand:VQ_S 2 register_operand w)] + MAXMINV))] + TARGET_SIMD + maxmin_uns_opp\t%0.Vtype, %1.Vtype, %2.Vtype + [(set_attr type neon_minmaxq)] +) + Could you roll aarch64_reduc_maxmin_uns_internalv2si into this pattern? Will come up with another patch to fix this. Thanks for pointing this out. Thanks, Tejas.
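For readers less familiar with the pairwise forms this pattern emits (smaxp/umaxp/fmaxp): the two input vectors are concatenated and adjacent lane pairs are reduced, so the low half of the result comes from the first operand and the high half from the second. A scalar sketch of vpmax on 4-lane int vectors — my own reference model, not GCC code:

```c
#include <assert.h>

/* Pairwise max: r = { max pairs of a, max pairs of b }.  */
static void
vpmax_ref (int *r, const int *a, const int *b, int lanes)
{
  for (int i = 0; i < lanes / 2; i++)
    {
      r[i] = a[2 * i] > a[2 * i + 1] ? a[2 * i] : a[2 * i + 1];
      r[lanes / 2 + i] = b[2 * i] > b[2 * i + 1] ? b[2 * i] : b[2 * i + 1];
    }
}
```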
[PING ^ 3] [RFC PATCH, AARCH64] Add support for -mlong-calls option
Hi, This is a ping for: https://gcc.gnu.org/ml/gcc-patches/2014-11/msg02258.html Thanks.
[PATCH, trivial] [AArch64] Remove declaration of removed function from aarch64-protos.h
The definition of the function aarch64_function_profiler was removed back in GCC 4.9, but its declaration is still there in aarch64-protos.h, so remove it. OK for the trunk? Index: gcc/ChangeLog === --- gcc/ChangeLog (revision 218464) +++ gcc/ChangeLog (working copy) @@ -1,3 +1,8 @@ +2014-12-09 Felix Yang felix.y...@huawei.com + + * config/aarch64/aarch64-protos.h (aarch64_function_profiler): Remove + declaration of removed function. + 2014-12-07 Felix Yang felix.y...@huawei.com Shanyao Chen chenshan...@huawei.com Index: gcc/config/aarch64/aarch64-protos.h === --- gcc/config/aarch64/aarch64-protos.h (revision 218464) +++ gcc/config/aarch64/aarch64-protos.h (working copy) @@ -247,7 +247,6 @@ void aarch64_expand_epilogue (bool); void aarch64_expand_mov_immediate (rtx, rtx); void aarch64_expand_prologue (void); void aarch64_expand_vector_init (rtx, rtx); -void aarch64_function_profiler (FILE *, int); void aarch64_init_cumulative_args (CUMULATIVE_ARGS *, const_tree, rtx, const_tree, unsigned); void aarch64_init_expanders (void);
[COMMITTED] [PING] [PATCH] [AArch64, NEON] More NEON intrinsics improvement
On 5 December 2014 at 18:44, Tejas Belagod tejas.bela...@arm.com wrote: +__extension__ static __inline float32x2_t __attribute__ +((__always_inline__)) +vfms_f32 (float32x2_t __a, float32x2_t __b, float32x2_t __c) { + return __builtin_aarch64_fmav2sf (-__b, __c, __a); } + +__extension__ static __inline float32x4_t __attribute__ +((__always_inline__)) +vfmsq_f32 (float32x4_t __a, float32x4_t __b, float32x4_t __c) { + return __builtin_aarch64_fmav4sf (-__b, __c, __a); } + +__extension__ static __inline float64x2_t __attribute__ +((__always_inline__)) +vfmsq_f64 (float64x2_t __a, float64x2_t __b, float64x2_t __c) { + return __builtin_aarch64_fmav2df (-__b, __c, __a); } + + Thanks, the patch looks good. Just one comment: You could also add float32x2_t vfms_n_f32(float32x2_t a, float32x2_t b, float32_t n) and its Q-variant. You can, if you wish, deal with Tejas' comment with a follow on patch rather than re-spinning this one. Provided this patch has no regressions on a big endian and a little endian test run then you can commit it. Thanks /Marcus No regressions for aarch64_be-linux-gnu target. Committed as r218484. Will come up with a new patch to deal with Tejas' comment. Thanks.
[PING] [PATCH] [AArch64, NEON] More NEON intrinsics improvement
Any comments? Thanks. Hi, This patch converts more intrinsics to use builtin functions instead of the previous inline assembly syntax. Passed the glorious testsuite of Christophe Lyon. Three testcases are added for the testing of intrinsics which are not covered by the testsuite: gcc.target/aarch64/vfma.c gcc.target/aarch64/vfma_n.c gcc.target/aarch64/vfms.c Regtested with aarch64-linux-gnu on QEMU. OK for the trunk? Index: gcc/ChangeLog === --- gcc/ChangeLog (revision 217394) +++ gcc/ChangeLog (working copy) @@ -1,3 +1,26 @@ +2014-11-18 Felix Yang felix.y...@huawei.com + Haijian Zhang z.zhanghaij...@huawei.com + Jiji Jiang jiangj...@huawei.com + Pengfei Sui suipeng...@huawei.com + + * config/aarch64/arm_neon.h (vrecpe_u32, vrecpeq_u32): Rewrite using + builtin functions. + (vfma_f32, vfmaq_f32, vfmaq_f64, vfma_n_f32, vfmaq_n_f32, vfmaq_n_f64, + vfms_f32, vfmsq_f32, vfmsq_f64): Likewise. + (vhsub_s8, vhsub_u8, vhsub_s16, vhsub_u16, vhsub_s32, vhsub_u32, + vhsubq_s8, vhsubq_u8, vhsubq_s16, vhsubq_u16, vhsubq_s32, vhsubq_u32, + vsubhn_s16, vsubhn_u16, vsubhn_s32, vsubhn_u32, vsubhn_s64, vsubhn_u64, + vrsubhn_s16, vrsubhn_u16, vrsubhn_s32, vrsubhn_u32, vrsubhn_s64, + vrsubhn_u64, vsubhn_high_s16, vsubhn_high_u16, vsubhn_high_s32, + vsubhn_high_u32, vsubhn_high_s64, vsubhn_high_u64, vrsubhn_high_s16, + vrsubhn_high_u16, vrsubhn_high_s32, vrsubhn_high_u32, vrsubhn_high_s64, + vrsubhn_high_u64): Likewise. + * config/aarch64/iterators.md (VDQ_SI): New mode iterator. + * config/aarch64/aarch64.md (define_c_enum unspec): Add UNSPEC_URECPE. + * config/aarch64/aarch64-simd.md (aarch64_urecpemode): New pattern. + * config/aarch64/aarch64-simd-builtins.def (shsub, uhsub, subhn, rsubhn, + subhn2, rsubhn2, urecpe): New builtins. 
+ 2014-11-11 Andrew Pinski apin...@cavium.com Bug target/61997 Index: gcc/testsuite/gcc.target/aarch64/narrow_high-intrinsics.c = == --- gcc/testsuite/gcc.target/aarch64/narrow_high-intrinsics.c (revision 217394) +++ gcc/testsuite/gcc.target/aarch64/narrow_high-intrinsics.c (working copy) @@ -107,9 +107,9 @@ ONE (vmovn_high, uint16x8_t, uint16x4_t, uint32x4_ ONE (vmovn_high, uint32x4_t, uint32x2_t, uint64x2_t, u64) -/* { dg-final { scan-assembler-times \\tsubhn2 v 6} } */ +/* { dg-final { scan-assembler-times \\tsubhn2\\tv 6} } */ /* { dg-final { scan-assembler-times \\taddhn2\\tv 6} } */ -/* { dg-final { scan-assembler-times rsubhn2 v 6} } */ +/* { dg-final { scan-assembler-times rsubhn2\\tv 6} } */ /* { dg-final { scan-assembler-times raddhn2\\tv 6} } */ /* { dg-final { scan-assembler-times \\trshrn2 v 6} } */ /* { dg-final { scan-assembler-times \\tshrn2 v 6} } */ Index: gcc/testsuite/gcc.target/aarch64/advsimd-intrinsics/vfma_n.c = == --- gcc/testsuite/gcc.target/aarch64/advsimd-intrinsics/vfma_n.c (revision 0) +++ gcc/testsuite/gcc.target/aarch64/advsimd-intrinsics/vfma_n.c (revision 0) @@ -0,0 +1,69 @@ +#include arm_neon.h +#include arm-neon-ref.h +#include compute-ref-data.h + +/* Expected results. */ +VECT_VAR_DECL(expected,hfloat,32,2) [] = { 0x4438ca3d, 0x44390a3d }; +VECT_VAR_DECL(expected,hfloat,32,4) [] = { 0x44869eb8, 0x4486beb8, +0x4486deb8, 0x4486feb8 }; +VECT_VAR_DECL(expected,hfloat,64,2) [] = { 0x408906e1532b8520, +0x40890ee1532b8520 }; + +#define VECT_VAR_ASSIGN(S,Q,T1,W) S##Q##_##T1##W #define ASSIGN(S, Q, +T, W, V) T##W##_t S##Q##_##T##W = V #define TEST_MSG VFMA/VFMAQ +void exec_vfma_n (void) +{ + /* Basic test: v4=vfma_n(v1,v2), then store the result. 
*/ +#define TEST_VFMA(Q, T1, T2, W, N) \ + VECT_VAR(vector_res, T1, W, N) = \ +vfma##Q##_n_##T2##W(VECT_VAR(vector1, T1, W, N), \ + VECT_VAR(vector2, T1, W, N), \ + VECT_VAR_ASSIGN(Scalar, Q, T1, W)); \ + vst1##Q##_##T2##W(VECT_VAR(result, T1, W, N), VECT_VAR(vector_res, +T1, W, N)) + +#define CHECK_VFMA_RESULTS(test_name,comment) \ + { \ +CHECK_FP(test_name, float, 32, 2, PRIx32, expected, comment);\ +CHECK_FP(test_name, float, 32, 4, PRIx32, expected, comment);\ + CHECK_FP(test_name, float, 64, 2, PRIx64, expected, comment); \ + } + +#define DECL_VABD_VAR(VAR) \ + DECL_VARIABLE(VAR, float, 32, 2); \ + DECL_VARIABLE(VAR, float, 32, 4); \ + DECL_VARIABLE(VAR, float, 64, 2); + + DECL_VABD_VAR(vector1); +
[PATCH] [AArch64, NEON] Improve vpmaxX vpminX intrinsics
Hi, This patch converts vpmaxX vpminX intrinsics to use builtin functions instead of the previous inline assembly syntax. Regtested with aarch64-linux-gnu on QEMU. Also passed the glorious testsuite of Christophe Lyon. OK for the trunk? Index: gcc/ChangeLog === --- gcc/ChangeLog (revision 218128) +++ gcc/ChangeLog (working copy) @@ -1,3 +1,19 @@ +2014-11-28 Felix Yang felix.y...@huawei.com + + * config/aarch64/aarch64-simd.md (aarch64_maxmin_unspmode): New + pattern. + * config/aarch64/aarch64-simd-builtins.def (smaxp, sminp, umaxp, + uminp, smax_nanp, smin_nanp): New builtins. + * config/aarch64/arm_neon.h (vpmax_s8, vpmax_s16, vpmax_s32, + vpmax_u8, vpmax_u16, vpmax_u32, vpmaxq_s8, vpmaxq_s16, vpmaxq_s32, + vpmaxq_u8, vpmaxq_u16, vpmaxq_u32, vpmax_f32, vpmaxq_f32, vpmaxq_f64, + vpmaxqd_f64, vpmaxs_f32, vpmaxnm_f32, vpmaxnmq_f32, vpmaxnmq_f64, + vpmaxnmqd_f64, vpmaxnms_f32, vpmin_s8, vpmin_s16, vpmin_s32, vpmin_u8, + vpmin_u16, vpmin_u32, vpminq_s8, vpminq_s16, vpminq_s32, vpminq_u8, + vpminq_u16, vpminq_u32, vpmin_f32, vpminq_f32, vpminq_f64, vpminqd_f64, + vpmins_f32, vpminnm_f32, vpminnmq_f32, vpminnmq_f64, vpminnmqd_f64, + vpminnms_f32): Rewrite using builtin functions. 
+ 2014-11-27 Richard Biener rguent...@suse.de PR middle-end/64088 Index: gcc/config/aarch64/arm_neon.h === --- gcc/config/aarch64/arm_neon.h (revision 218128) +++ gcc/config/aarch64/arm_neon.h (working copy) @@ -8975,491 +8975,7 @@ vpadds_f32 (float32x2_t a) return result; } -__extension__ static __inline float32x2_t __attribute__ ((__always_inline__)) -vpmax_f32 (float32x2_t a, float32x2_t b) -{ - float32x2_t result; - __asm__ (fmaxp %0.2s, %1.2s, %2.2s - : =w(result) - : w(a), w(b) - : /* No clobbers */); - return result; -} - -__extension__ static __inline int8x8_t __attribute__ ((__always_inline__)) -vpmax_s8 (int8x8_t a, int8x8_t b) -{ - int8x8_t result; - __asm__ (smaxp %0.8b, %1.8b, %2.8b - : =w(result) - : w(a), w(b) - : /* No clobbers */); - return result; -} - __extension__ static __inline int16x4_t __attribute__ ((__always_inline__)) -vpmax_s16 (int16x4_t a, int16x4_t b) -{ - int16x4_t result; - __asm__ (smaxp %0.4h, %1.4h, %2.4h - : =w(result) - : w(a), w(b) - : /* No clobbers */); - return result; -} - -__extension__ static __inline int32x2_t __attribute__ ((__always_inline__)) -vpmax_s32 (int32x2_t a, int32x2_t b) -{ - int32x2_t result; - __asm__ (smaxp %0.2s, %1.2s, %2.2s - : =w(result) - : w(a), w(b) - : /* No clobbers */); - return result; -} - -__extension__ static __inline uint8x8_t __attribute__ ((__always_inline__)) -vpmax_u8 (uint8x8_t a, uint8x8_t b) -{ - uint8x8_t result; - __asm__ (umaxp %0.8b, %1.8b, %2.8b - : =w(result) - : w(a), w(b) - : /* No clobbers */); - return result; -} - -__extension__ static __inline uint16x4_t __attribute__ ((__always_inline__)) -vpmax_u16 (uint16x4_t a, uint16x4_t b) -{ - uint16x4_t result; - __asm__ (umaxp %0.4h, %1.4h, %2.4h - : =w(result) - : w(a), w(b) - : /* No clobbers */); - return result; -} - -__extension__ static __inline uint32x2_t __attribute__ ((__always_inline__)) -vpmax_u32 (uint32x2_t a, uint32x2_t b) -{ - uint32x2_t result; - __asm__ (umaxp %0.2s, %1.2s, %2.2s - : =w(result) - : w(a), w(b) - : 
/* No clobbers */); - return result; -} - -__extension__ static __inline float32x2_t __attribute__ ((__always_inline__)) -vpmaxnm_f32 (float32x2_t a, float32x2_t b) -{ - float32x2_t result; - __asm__ (fmaxnmp %0.2s,%1.2s,%2.2s - : =w(result) - : w(a), w(b) - : /* No clobbers */); - return result; -} - -__extension__ static __inline float32x4_t __attribute__ ((__always_inline__)) -vpmaxnmq_f32 (float32x4_t a, float32x4_t b) -{ - float32x4_t result; - __asm__ (fmaxnmp %0.4s,%1.4s,%2.4s - : =w(result) - : w(a), w(b) - : /* No clobbers */); - return result; -} - -__extension__ static __inline float64x2_t __attribute__ ((__always_inline__)) -vpmaxnmq_f64 (float64x2_t a, float64x2_t b) -{ - float64x2_t result; - __asm__ (fmaxnmp %0.2d,%1.2d,%2.2d - : =w(result) - : w(a), w(b) - : /* No clobbers */); - return result; -} - -__extension__ static __inline float64_t __attribute__ ((__always_inline__)) -vpmaxnmqd_f64 (float64x2_t a) -{ - float64_t result; - __asm__ (fmaxnmp %d0,%1.2d - : =w(result) - : w(a) - : /* No clobbers */); - return result; -} - -__extension__ static __inline float32_t __attribute__ ((__always_inline__)) -vpmaxnms_f32 (float32x2_t a) -{ -
Re: [PATCH PR59593] [arm] Backport r217772 r217826 to 4.8 4.9
I've backported this fix to the 4.8 and 4.9 branches. These patches have been tested for armeb-none-eabi gcc/g++ with qemu, and both test results were OK. Looks OK to me. Ramana, is this OK for the 4.8 and 4.9 branches? Thanks.
Re: [PATCH, PR63742][ARM] Fix arm *movhi_insn_arch4 pattern for big-endian
On 19/11/14 09:29, Yangfei (Felix) wrote: Sorry for missing the point. It seems to me that 't2' here will conflict with condition of the pattern *movhi_insn_arch4: TARGET_ARM arm_arch4 (register_operand (operands[0], HImode) || register_operand (operands[1], HImode)) #define TARGET_ARM (! TARGET_THUMB) /* 32-bit Thumb-2 code. */ #define TARGET_THUMB2 (TARGET_THUMB arm_arch_thumb2) Bah, Indeed ! - I misremembered the t2 there, my mistake. Yes you are right there, but what I'd like you to do is to use that mechanism rather than putting all this logic in the predicate. So, I'd prefer you to add a v6t2 to the values for the arch attribute, don't forget to update the comments above. and in arch_enabled you need to enforce this with (and (eq_attr arch v6t2) (match_test TARGET_32BIT arm_arch6 arm_arch_thumb2)) (const_string yes) And in the pattern use v6t2 ... arm_arch_thumb2 implies that this is at the architecture level of v6t2. Therefore TARGET_ARM arm_arch_thumb2 implies ARM state. Hi Ramana, Thank you for your suggestions. I rebased the patch on the latest trunk and updated it accordingly. As this patch will not work for architectures older than armv6t2, I also prefer Thomas's patch to fix for them. I am currently performing test for this patch. Assuming no issues pops up, OK for the trunk? And is it necessary to backport this patch to the 4.8 4.9 branches? I've applied the following as obvious after Kugan mentioned on IRC this morning noticing a movwne r0, #-32768. Obviously this won't be accepted as is by the assembler and we should be using the %L character. Applied to trunk as obvious. Felix, How did you test this patch ? regards Ramana I regtested the patch for arm-eabi-gcc/g++ big-endian with qemu. The test result is OK. That's strange ... This issue can be reproduced by the following testcase. Thanks for fixing it. 
#include <stdio.h>

unsigned short v = 0x5678;
int i;
int j = 0;
int *ptr = &j;

int func ()
{
  for (i = 0; i < 1; ++i)
    {
      *ptr = -1;
      v = 0xF234;
    }
  return v;
}

2014-11-20  Ramana Radhakrishnan  ramana.radhakrish...@arm.com

	PR target/59593
	* config/arm/arm.md (*movhi_insn): Use right formatting for immediate.
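The fix is purely one of operand printing: movw takes an unsigned 16-bit immediate, so (assuming, as the thread says, that the ARM %L output modifier prints the low 16 bits of a constant) a value like -32768 must come out as 32768, never as a negative number the assembler rejects. A tiny model of that masking:

```c
#include <assert.h>

/* What printing an immediate through a low-16-bits modifier yields:
   the constant truncated to the unsigned halfword movw can encode.  */
static unsigned
movw_imm16 (int value)
{
  return (unsigned) value & 0xffffu;
}
```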
Re: [PATCH, PR63742][ARM] Fix arm *movhi_insn_arch4 pattern for big-endian
On 19/11/14 09:29, Yangfei (Felix) wrote: Sorry for missing the point. It seems to me that 't2' here will conflict with condition of the pattern *movhi_insn_arch4: TARGET_ARM arm_arch4 (register_operand (operands[0], HImode) || register_operand (operands[1], HImode)) #define TARGET_ARM (! TARGET_THUMB) /* 32-bit Thumb-2 code. */ #define TARGET_THUMB2 (TARGET_THUMB arm_arch_thumb2) Bah, Indeed ! - I misremembered the t2 there, my mistake. Yes you are right there, but what I'd like you to do is to use that mechanism rather than putting all this logic in the predicate. So, I'd prefer you to add a v6t2 to the values for the arch attribute, don't forget to update the comments above. and in arch_enabled you need to enforce this with (and (eq_attr arch v6t2) (match_test TARGET_32BIT arm_arch6 arm_arch_thumb2)) (const_string yes) And in the pattern use v6t2 ... arm_arch_thumb2 implies that this is at the architecture level of v6t2. Therefore TARGET_ARM arm_arch_thumb2 implies ARM state. Hi Ramana, Thank you for your suggestions. I rebased the patch on the latest trunk and updated it accordingly. As this patch will not work for architectures older than armv6t2, I also prefer Thomas's patch to fix for them. I am currently performing test for this patch. Assuming no issues pops up, OK for the trunk? And is it necessary to backport this patch to the 4.8 4.9 branches? I've applied the following as obvious after Kugan mentioned on IRC this morning noticing a movwne r0, #-32768. Obviously this won't be accepted as is by the assembler and we should be using the %L character. Applied to trunk as obvious. Felix, How did you test this patch ? regards Ramana I regtested the patch for arm-eabi-gcc/g++ big-endian with qemu. The test result is OK. That's strange ... This issue can be reproduced by the following testcase. Thanks for fixing it. 
#include <stdio.h>

unsigned short v = 0x5678;
int i;
int j = 0;
int *ptr = &j;

int func ()
{
  for (i = 0; i < 1; ++i)
    {
      *ptr = -1;
      v = 0xF234;
    }
  return v;
}

And the architecture level is set to armv7-a by default when testing.
Re: [PATCH, PR63742][ARM] Fix arm *movhi_insn_arch4 pattern for big-endian
Sorry for missing the point. It seems to me that 't2' here will conflict with condition of the pattern *movhi_insn_arch4: TARGET_ARM arm_arch4 (register_operand (operands[0], HImode) || register_operand (operands[1], HImode)) #define TARGET_ARM (! TARGET_THUMB) /* 32-bit Thumb-2 code. */ #define TARGET_THUMB2 (TARGET_THUMB arm_arch_thumb2) Bah, Indeed ! - I misremembered the t2 there, my mistake. Yes you are right there, but what I'd like you to do is to use that mechanism rather than putting all this logic in the predicate. So, I'd prefer you to add a v6t2 to the values for the arch attribute, don't forget to update the comments above. and in arch_enabled you need to enforce this with (and (eq_attr arch v6t2) (match_test TARGET_32BIT arm_arch6 arm_arch_thumb2)) (const_string yes) And in the pattern use v6t2 ... arm_arch_thumb2 implies that this is at the architecture level of v6t2. Therefore TARGET_ARM arm_arch_thumb2 implies ARM state. Hi Ramana, Thank you for your suggestions. I rebased the patch on the latest trunk and updated it accordingly. As this patch will not work for architectures older than armv6t2, I also prefer Thomas's patch to fix for them. I am currently performing test for this patch. Assuming no issues pops up, OK for the trunk? And is it necessary to backport this patch to the 4.8 4.9 branches? Index: gcc/ChangeLog === --- gcc/ChangeLog (revision 217717) +++ gcc/ChangeLog (working copy) @@ -1,3 +1,11 @@ +2014-11-19 Felix Yang felix.y...@huawei.com + Shanyao Chen chenshan...@huawei.com + + PR target/59593 + * config/arm/arm.md (define_attr arch): Add v6t2. + (define_attr arch_enabled): Add test for the above. + (*movhi_insn_arch4): Add new alternative. + 2014-11-18 Felix Yang felix.y...@huawei.com * config/aarch64/aarch64.c (doloop_end): New pattern. 
Index: gcc/config/arm/arm.md === --- gcc/config/arm/arm.md (revision 217717) +++ gcc/config/arm/arm.md (working copy) @@ -125,9 +125,10 @@ ; This can be a for ARM, t for either of the Thumbs, 32 for ; TARGET_32BIT, t1 or t2 to specify a specific Thumb mode. v6 ; for ARM or Thumb-2 with arm_arch6, and nov6 for ARM without -; arm_arch6. This attribute is used to compute attribute enabled, -; use type any to enable an alternative in all cases. -(define_attr arch any,a,t,32,t1,t2,v6,nov6,neon_for_64bits,avoid_neon_for_64bits,iwmmxt,iwmmxt2,armv6_or_vfpv3 +; arm_arch6. v6t2 for Thumb-2 with arm_arch6. This attribute is +; used to compute attribute enabled, use type any to enable an +; alternative in all cases. +(define_attr arch any,a,t,32,t1,t2,v6,nov6,v6t2,neon_for_64bits,avoid_neon_for_64bits,iwmmxt,iwmmxt2,armv6_or_vfpv3 (const_string any)) (define_attr arch_enabled no,yes @@ -162,6 +163,10 @@ (match_test TARGET_32BIT !arm_arch6)) (const_string yes) +(and (eq_attr arch v6t2) + (match_test TARGET_32BIT arm_arch6 arm_arch_thumb2)) +(const_string yes) + (and (eq_attr arch avoid_neon_for_64bits) (match_test TARGET_NEON) (not (match_test TARGET_PREFER_NEON_64BITS))) @@ -6288,8 +6293,8 @@ ;; Pattern to recognize insn generated default case above (define_insn *movhi_insn_arch4 - [(set (match_operand:HI 0 nonimmediate_operand =r,r,m,r) - (match_operand:HI 1 general_operand rIk,K,r,mi))] + [(set (match_operand:HI 0 nonimmediate_operand =r,r,r,m,r) + (match_operand:HI 1 general_operand rIk,K,n,r,mi))] TARGET_ARM arm_arch4 (register_operand (operands[0], HImode) @@ -6297,16 +6302,19 @@ @ mov%?\\t%0, %1\\t%@ movhi mvn%?\\t%0, #%B1\\t%@ movhi + movw%?\\t%0, %1\\t%@ movhi str%(h%)\\t%1, %0\\t%@ movhi ldr%(h%)\\t%0, %1\\t%@ movhi [(set_attr predicable yes) - (set_attr pool_range *,*,*,256) - (set_attr neg_pool_range *,*,*,244) + (set_attr pool_range *,*,*,*,256) + (set_attr neg_pool_range *,*,*,*,244) + (set_attr arch *,*,v6t2,*,*) (set_attr_alternative type [(if_then_else 
(match_operand 1 const_int_operand ) (const_string mov_imm ) (const_string mov_reg)) (const_string mvn_imm) + (const_string mov_imm) (const_string store1) (const_string load1)])] ) arm-patch-v3.diff Description: arm-patch-v3.diff
[PATCH] [AArch64, NEON] More NEON intrinsics improvement
Hi, This patch converts more intrinsics to use builtin functions instead of the previous inline assembly syntax. Passed the glorious testsuite of Christophe Lyon. Three testcases are added for the testing of intrinsics which are not covered by the testsuite: gcc.target/aarch64/vfma.c gcc.target/aarch64/vfma_n.c gcc.target/aarch64/vfms.c Regtested with aarch64-linux-gnu on QEMU. OK for the trunk? Index: gcc/ChangeLog === --- gcc/ChangeLog (revision 217394) +++ gcc/ChangeLog (working copy) @@ -1,3 +1,26 @@ +2014-11-18 Felix Yang felix.y...@huawei.com + Haijian Zhang z.zhanghaij...@huawei.com + Jiji Jiang jiangj...@huawei.com + Pengfei Sui suipeng...@huawei.com + + * config/aarch64/arm_neon.h (vrecpe_u32, vrecpeq_u32): Rewrite using + builtin functions. + (vfma_f32, vfmaq_f32, vfmaq_f64, vfma_n_f32, vfmaq_n_f32, vfmaq_n_f64, + vfms_f32, vfmsq_f32, vfmsq_f64): Likewise. + (vhsub_s8, vhsub_u8, vhsub_s16, vhsub_u16, vhsub_s32, vhsub_u32, + vhsubq_s8, vhsubq_u8, vhsubq_s16, vhsubq_u16, vhsubq_s32, vhsubq_u32, + vsubhn_s16, vsubhn_u16, vsubhn_s32, vsubhn_u32, vsubhn_s64, vsubhn_u64, + vrsubhn_s16, vrsubhn_u16, vrsubhn_s32, vrsubhn_u32, vrsubhn_s64, + vrsubhn_u64, vsubhn_high_s16, vsubhn_high_u16, vsubhn_high_s32, + vsubhn_high_u32, vsubhn_high_s64, vsubhn_high_u64, vrsubhn_high_s16, + vrsubhn_high_u16, vrsubhn_high_s32, vrsubhn_high_u32, vrsubhn_high_s64, + vrsubhn_high_u64): Likewise. + * config/aarch64/iterators.md (VDQ_SI): New mode iterator. + * config/aarch64/aarch64.md (define_c_enum unspec): Add UNSPEC_URECPE. + * config/aarch64/aarch64-simd.md (aarch64_urecpemode): New pattern. + * config/aarch64/aarch64-simd-builtins.def (shsub, uhsub, subhn, rsubhn, + subhn2, rsubhn2, urecpe): New builtins. 
+ 2014-11-11 Andrew Pinski apin...@cavium.com Bug target/61997 Index: gcc/testsuite/gcc.target/aarch64/narrow_high-intrinsics.c === --- gcc/testsuite/gcc.target/aarch64/narrow_high-intrinsics.c (revision 217394) +++ gcc/testsuite/gcc.target/aarch64/narrow_high-intrinsics.c (working copy) @@ -107,9 +107,9 @@ ONE (vmovn_high, uint16x8_t, uint16x4_t, uint32x4_ ONE (vmovn_high, uint32x4_t, uint32x2_t, uint64x2_t, u64) -/* { dg-final { scan-assembler-times \\tsubhn2 v 6} } */ +/* { dg-final { scan-assembler-times \\tsubhn2\\tv 6} } */ /* { dg-final { scan-assembler-times \\taddhn2\\tv 6} } */ -/* { dg-final { scan-assembler-times rsubhn2 v 6} } */ +/* { dg-final { scan-assembler-times rsubhn2\\tv 6} } */ /* { dg-final { scan-assembler-times raddhn2\\tv 6} } */ /* { dg-final { scan-assembler-times \\trshrn2 v 6} } */ /* { dg-final { scan-assembler-times \\tshrn2 v 6} } */ Index: gcc/testsuite/gcc.target/aarch64/advsimd-intrinsics/vfma_n.c === --- gcc/testsuite/gcc.target/aarch64/advsimd-intrinsics/vfma_n.c (revision 0) +++ gcc/testsuite/gcc.target/aarch64/advsimd-intrinsics/vfma_n.c (revision 0) @@ -0,0 +1,69 @@ +#include arm_neon.h +#include arm-neon-ref.h +#include compute-ref-data.h + +/* Expected results. */ +VECT_VAR_DECL(expected,hfloat,32,2) [] = { 0x4438ca3d, 0x44390a3d }; +VECT_VAR_DECL(expected,hfloat,32,4) [] = { 0x44869eb8, 0x4486beb8, 0x4486deb8, 0x4486feb8 }; +VECT_VAR_DECL(expected,hfloat,64,2) [] = { 0x408906e1532b8520, 0x40890ee1532b8520 }; + +#define VECT_VAR_ASSIGN(S,Q,T1,W) S##Q##_##T1##W +#define ASSIGN(S, Q, T, W, V) T##W##_t S##Q##_##T##W = V +#define TEST_MSG VFMA/VFMAQ +void exec_vfma_n (void) +{ + /* Basic test: v4=vfma_n(v1,v2), then store the result. 
*/ +#define TEST_VFMA(Q, T1, T2, W, N) \ + VECT_VAR(vector_res, T1, W, N) = \ +vfma##Q##_n_##T2##W(VECT_VAR(vector1, T1, W, N), \ + VECT_VAR(vector2, T1, W, N), \ + VECT_VAR_ASSIGN(Scalar, Q, T1, W)); \ + vst1##Q##_##T2##W(VECT_VAR(result, T1, W, N), VECT_VAR(vector_res, T1, W, N)) + +#define CHECK_VFMA_RESULTS(test_name,comment) \ + {\ +CHECK_FP(test_name, float, 32, 2, PRIx32, expected, comment); \ +CHECK_FP(test_name, float, 32, 4, PRIx32, expected, comment); \ + CHECK_FP(test_name, float, 64, 2, PRIx64, expected, comment); \ + } + +#define DECL_VABD_VAR(VAR) \ + DECL_VARIABLE(VAR, float, 32, 2);\ + DECL_VARIABLE(VAR, float, 32, 4);\ + DECL_VARIABLE(VAR, float, 64, 2); + + DECL_VABD_VAR(vector1); + DECL_VABD_VAR(vector2); + DECL_VABD_VAR(vector3); +
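Among the intrinsics this patch converts, the vsubhnX family is the least self-describing: it subtracts, then keeps only the high half of each wide difference while narrowing the element size. A scalar sketch of vsubhn_s16 semantics — helper name and layout are my own, for illustration:

```c
#include <assert.h>
#include <stdint.h>

/* Model of vsubhn_s16: each 16-bit difference is narrowed by keeping
   its high 8 bits (arithmetic shift right by half the element width).  */
static void
vsubhn_ref (int8_t *r, const int16_t *a, const int16_t *b, int lanes)
{
  for (int i = 0; i < lanes; i++)
    r[i] = (int8_t) ((int16_t) (a[i] - b[i]) >> 8);
}
```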
Re: [PATCH 0/3][AArch64]More intrinsics/builtins improvements
Yeah, I agree that your approach is better. I missed this point. Thanks. Ah, sorry for the duplication of effort. And thanks for the heads-up about upcoming work! I don't think I have any plans for any of those others at the moment. In the case of vld1_dup, however, I'm going to argue that my approach (https://gcc.gnu.org/ml/gcc-patches/2014-11/msg01718.html) is better, in that a builtin is opaque (blocks optimization) for the midend, whereas gcc vector extensions (as in vdup_n_...) allows the midend to perform constant-folding, etc. Does that make sense? --Alan Yangfei (Felix) wrote: These three are logically independent, but all on a common theme, and I've tested them all together by bootstrapped + check-gcc on aarch64-none-elf cross-tested check-gcc on aarch64_be-none-elf Ok for trunk? Hi Alan, It seems that we are duplicating the work for the vld1_dup part. (Refer to my message: https://gcc.gnu.org/ml/gcc-patches/2014-11/msg01462.html) I have a plan to improve these intrinsics/builtins: vrsubhnX, vrsqrtX, vqrdmulX, vqmovX, vqdmulhqX, vqdmulhX, vpminX, vpmaxX, vpaddX, vpadaX vmvnX, vmulxX, vmovnX, vmlsX, vhsubX, vcvtX, vcopyX, vaddlvX, vabX, vfmX, vrecpeX, vcntX, vclsX And work for these intrinsics is in progress: vfmX, vrecpeX, vhsubX, vcntX, vclsX Please let me know if you guys want to work on any of them. Thanks.
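Alan's point can be seen with GCC's generic vector extensions: a dup written this way is transparent to the midend, so a constant argument folds to a constant vector, whereas an opaque builtin blocks that. A hypothetical sketch (not the arm_neon.h implementation):

```c
#include <assert.h>

/* A 2-lane int vector via GCC's vector_size extension; dup_ref (7)
   with a constant argument can be folded to { 7, 7 } at compile time.  */
typedef int v2si __attribute__ ((vector_size (8)));

static v2si
dup_ref (int x)
{
  return (v2si) { x, x };
}
```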
Re: [PING ^ 3][PATCH, AArch64] Add doloop_end pattern for -fmodulo-sched
Yes, most of the code is borrowed from the ARM port, with the expected mode of the loop pseudo changed to DImode. The function loop_canon_p, called from sms_schedule, makes sure that SMS is only performed for innermost loops. But I think it's a good idea to implement the TARGET_CAN_USE_DOLOOP_P hook here. See my updated patch below. How about this one?

Index: gcc/ChangeLog
===
--- gcc/ChangeLog	(revision 217394)
+++ gcc/ChangeLog	(working copy)
@@ -1,3 +1,8 @@
+2014-11-13  Felix Yang  felix.y...@huawei.com
+
+	* config/aarch64/aarch64.md (doloop_end): New pattern.
+	* config/aarch64/aarch64.c (TARGET_CAN_USE_DOLOOP_P): Implement.
+
 2014-11-11  Andrew Pinski  apin...@cavium.com
 
 	Bug target/61997
Index: gcc/config/aarch64/aarch64.md
===
--- gcc/config/aarch64/aarch64.md	(revision 217394)
+++ gcc/config/aarch64/aarch64.md	(working copy)
@@ -4087,6 +4087,47 @@
   [(set_attr "type" "mrs")])
 
+;; Define the subtract-one-and-jump insns so loop.c
+;; knows what to generate.
+(define_expand "doloop_end"
+  [(use (match_operand 0 "")) ; loop pseudo
+   (use (match_operand 1 ""))] ; label
+  ""
+{
+  /* Currently SMS relies on the do-loop pattern to recognize loops
+     where (1) the control part consists of all insns defining and/or
+     using a certain 'count' register and (2) the loop count can be
+     adjusted by modifying this register prior to the loop.
+     ??? The possible introduction of a new block to initialize the
+     new IV can potentially affect branch optimizations.  */
+  if (optimize > 0 && flag_modulo_sched)
+    {
+      rtx s0;
+      rtx bcomp;
+      rtx loc_ref;
+      rtx cc_reg;
+      rtx insn;
+      rtx cmp;
+
+      if (GET_MODE (operands[0]) != DImode)
+	FAIL;
+
+      s0 = operands[0];
+      insn = emit_insn (gen_adddi3_compare0 (s0, s0, GEN_INT (-1)));
+
+      cmp = XVECEXP (PATTERN (insn), 0, 0);
+      cc_reg = SET_DEST (cmp);
+      bcomp = gen_rtx_NE (VOIDmode, cc_reg, const0_rtx);
+      loc_ref = gen_rtx_LABEL_REF (VOIDmode, operands[1]);
+      emit_jump_insn (gen_rtx_SET (VOIDmode, pc_rtx,
+				   gen_rtx_IF_THEN_ELSE (VOIDmode, bcomp,
+							 loc_ref, pc_rtx)));
+      DONE;
+    }
+  else
+    FAIL;
+})
+
 ;; AdvSIMD Stuff
 (include "aarch64-simd.md")
Index: gcc/config/aarch64/aarch64.c
===
--- gcc/config/aarch64/aarch64.c	(revision 217394)
+++ gcc/config/aarch64/aarch64.c	(working copy)
@@ -10224,6 +10224,9 @@ aarch64_use_by_pieces_infrastructure_p (unsigned i
 #define TARGET_USE_BY_PIECES_INFRASTRUCTURE_P \
   aarch64_use_by_pieces_infrastructure_p
 
+#undef TARGET_CAN_USE_DOLOOP_P
+#define TARGET_CAN_USE_DOLOOP_P can_use_doloop_if_innermost
+
 struct gcc_target targetm = TARGET_INITIALIZER;
 
 #include "gt-aarch64.h"

On 17 November 2014 07:59, Yangfei (Felix) felix.y...@huawei.com wrote:

+2014-11-13  Felix Yang  felix.y...@huawei.com
+
+	* config/aarch64/aarch64.md (doloop_end): New pattern.
+

This looks like a straight copy of the ARM code, but without the TARGET_CAN_USE_DOLOOP_P definition. If the reason for including this code is for the benefit of modulo-sched, then should the hook be defined to limit the use of this pattern to innermost loops only? Cheers /Marcus

aarch64-doloop-v4.diff Description: aarch64-doloop-v4.diff
Re: [PATCH, PR63742][ARM] Fix arm *movhi_insn_arch4 pattern for big-endian
On 06/11/14 08:35, Yangfei (Felix) wrote: The idea is simple: use movw for certain const source operands instead of ldrh, and exclude the const values which cannot be handled by mov/mvn/movw. I am running regression tests for this patch. Assuming no issues pop up, OK for trunk?

So, doesn't that make the bug latent for architectures older than armv6t2 with big-endian, and only fix this in ARM state? I'd prefer a complete solution, please. What about *thumb2_movhi_insn in thumb2.md?

Actually, we fix the bug by excluding the const values which cannot be handled. The patch still works even without the addition of movw here. The new movw alternative is just a small code optimization for certain architectures: we can handle consts by movw instead of ldrh, and this is better for performance. We find that this bug is not reproducible for *thumb2_movhi_insn. The reason is that this pattern can always move consts using movw.

Please fix the PR number in your final commit with PR 59593.

Index: gcc/config/arm/predicates.md
===
--- gcc/config/arm/predicates.md	(revision 216838)
+++ gcc/config/arm/predicates.md	(working copy)
@@ -144,6 +144,11 @@
   (and (match_code "const_int")
        (match_test "INTVAL (op) == 0")))
 
+(define_predicate "arm_movw_immediate_operand"
+  (and (match_test "TARGET_32BIT && arm_arch_thumb2")
+       (and (match_code "const_int")
+	    (match_test "(INTVAL (op) & 0x) == 0"))))
+
 ;; Something valid on the RHS of an ARM data-processing instruction
 (define_predicate "arm_rhs_operand"
   (ior (match_operand 0 "s_register_operand")
@@ -211,6 +216,11 @@
   (ior (match_operand 0 "arm_rhs_operand")
        (match_operand 0 "arm_not_immediate_operand")))
 
+(define_predicate "arm_hi_operand"
+  (ior (match_operand 0 "arm_rhsm_operand")
+       (ior (match_operand 0 "arm_not_immediate_operand")
+	    (match_operand 0 "arm_movw_immediate_operand"))))
+

I think you don't need any of these predicates.

 (define_predicate "arm_di_operand"
   (ior (match_operand 0 "s_register_operand")
        (match_operand 0 "arm_immediate_di_operand")))
Index: gcc/config/arm/arm.md
===
--- gcc/config/arm/arm.md	(revision 216838)
+++ gcc/config/arm/arm.md	(working copy)
@@ -6285,8 +6285,8 @@
 ;; Pattern to recognize insn generated default case above
 (define_insn "*movhi_insn_arch4"
-  [(set (match_operand:HI 0 "nonimmediate_operand" "=r,r,m,r")
-	(match_operand:HI 1 "general_operand"      "rIk,K,r,mi"))]
+  [(set (match_operand:HI 0 "nonimmediate_operand" "=r,r,r,m,r")
+	(match_operand:HI 1 "arm_hi_operand"       "rIk,K,j,r,mi"))]

Use `n' instead of `j' - movw can handle all of the numerical constants for HImode values. And the predicate can remain general_operand.

Hello Ramana, We need to make sure that movw is only used for architectures which satisfy arm_arch_thumb2, as indicated in the following predicate added:

+(define_predicate "arm_movw_immediate_operand"
+  (and (match_test "TARGET_32BIT && arm_arch_thumb2")
+       (and (match_code "const_int")
+	    (match_test "(INTVAL (op) & 0x) == 0"))))

I am modifying the predicate in order to fix the issue for other older architectures. It seems we can't achieve this by simply using 'n' instead of 'j' here, right? Thanks.
Re: [PING ^ 3][PATCH, AArch64] Add doloop_end pattern for -fmodulo-sched
On 11/18/2014 11:48 AM, Yangfei (Felix) wrote:

+(define_expand "doloop_end"
+  [(use (match_operand 0 "")) ; loop pseudo
+   (use (match_operand 1 ""))] ; label
+  ""
+{
+  /* Currently SMS relies on the do-loop pattern to recognize loops
+     where (1) the control part consists of all insns defining and/or
+     using a certain 'count' register and (2) the loop count can be
+     adjusted by modifying this register prior to the loop.
+     ??? The possible introduction of a new block to initialize the
+     new IV can potentially affect branch optimizations.  */
+  if (optimize > 0 && flag_modulo_sched)

You'd be better off moving this condition into the expansion predicate (which is currently ""). This short-circuits a lot of unnecessary work. See pass_rtl_doloop::gate.

r~

Yeah, that's a good idea. See my updated patch :-)

Index: gcc/ChangeLog
===
--- gcc/ChangeLog	(revision 217394)
+++ gcc/ChangeLog	(working copy)
@@ -1,3 +1,8 @@
+2014-11-13  Felix Yang  felix.y...@huawei.com
+
+	* config/aarch64/aarch64.md (doloop_end): New pattern.
+	* config/aarch64/aarch64.c (TARGET_CAN_USE_DOLOOP_P): Implement.
+
 2014-11-11  Andrew Pinski  apin...@cavium.com
 
 	Bug target/61997
Index: gcc/config/aarch64/aarch64.md
===
--- gcc/config/aarch64/aarch64.md	(revision 217394)
+++ gcc/config/aarch64/aarch64.md	(working copy)
@@ -4087,6 +4087,43 @@
   [(set_attr "type" "mrs")])
 
+;; Define the subtract-one-and-jump insns so loop.c
+;; knows what to generate.
+(define_expand "doloop_end"
+  [(use (match_operand 0 "")) ; loop pseudo
+   (use (match_operand 1 ""))] ; label
+  "optimize > 0 && flag_modulo_sched"
+{
+  rtx s0;
+  rtx bcomp;
+  rtx loc_ref;
+  rtx cc_reg;
+  rtx insn;
+  rtx cmp;
+
+  /* Currently SMS relies on the do-loop pattern to recognize loops
+     where (1) the control part consists of all insns defining and/or
+     using a certain 'count' register and (2) the loop count can be
+     adjusted by modifying this register prior to the loop.
+     ??? The possible introduction of a new block to initialize the
+     new IV can potentially affect branch optimizations.  */
+
+  if (GET_MODE (operands[0]) != DImode)
+    FAIL;
+
+  s0 = operands[0];
+  insn = emit_insn (gen_adddi3_compare0 (s0, s0, GEN_INT (-1)));
+
+  cmp = XVECEXP (PATTERN (insn), 0, 0);
+  cc_reg = SET_DEST (cmp);
+  bcomp = gen_rtx_NE (VOIDmode, cc_reg, const0_rtx);
+  loc_ref = gen_rtx_LABEL_REF (VOIDmode, operands[1]);
+  emit_jump_insn (gen_rtx_SET (VOIDmode, pc_rtx,
+			       gen_rtx_IF_THEN_ELSE (VOIDmode, bcomp,
+						     loc_ref, pc_rtx)));
+  DONE;
+})
+
 ;; AdvSIMD Stuff
 (include "aarch64-simd.md")
Index: gcc/config/aarch64/aarch64.c
===
--- gcc/config/aarch64/aarch64.c	(revision 217394)
+++ gcc/config/aarch64/aarch64.c	(working copy)
@@ -10224,6 +10224,9 @@ aarch64_use_by_pieces_infrastructure_p (unsigned i
 #define TARGET_USE_BY_PIECES_INFRASTRUCTURE_P \
   aarch64_use_by_pieces_infrastructure_p
 
+#undef TARGET_CAN_USE_DOLOOP_P
+#define TARGET_CAN_USE_DOLOOP_P can_use_doloop_if_innermost
+
 struct gcc_target targetm = TARGET_INITIALIZER;
 
 #include "gt-aarch64.h"

aarch64-doloop-v5.diff Description: aarch64-doloop-v5.diff
[PING ^ 2][RFC PATCH, AARCH64] Add support for -mlong-calls option
Ping again? Any comment please? Ping? I hope this patch can catch up with stage 1 of GCC-5.0. Thanks. Hi Felix, Sorry for the delay responding, I've been out of the office recently and I'm only just catching up on a backlog of GCC related emails. I'm in two minds about this; I can potentially see the need for attributes to enable long calls for specific calls, and maybe also for pragmas that can be used to efficiently mark a group of functions in that way; but I don't really see the value in adding a -mlong-calls option to do this globally. The reasoning is as follows: long calls are generally very expensive and relatively few functions should need them in most applications (since code that needs to span more than a single block of 128Mbytes - the span of a BL or B instruction - will be very rare in reality). The best way to handle very large branches for those rare cases where you do have a very large contiguous block of code more than 128MB is by having the linker insert veneers when needed; the code will branch to the veneer which will insert an indirect branch at that point (the ABI guarantees that at function call boundaries IP0 and IP1 will not contain live values, making them available for such purposes). In a very small number of cases it might be desirable to mark specific functions as being too far away to reach; in those cases the attributes and pragma methods can be used to mark such calls as being far calls. Aside: The reason -mlong-calls was added to GCC for ARM is that the code there pre-dates the EABI, which introduced the concept of link-time veneering of calls - the option should be unnecessary now that almost everyone uses the EABI as the basis for their platform ABI. We don't have such a legacy for AArch64 and I'd need to see strong justification for its use before adding the option there as well. So please can you rework the patch to remove the -mlong-calls option and just leave the attribute and pragma interfaces. R. 
Hello Richard, Thanks for the comments. I agree with the idea. I updated the patch with the -mlong-calls option removed, making short calls the default. Reg-tested for aarch64-linux-gnu with qemu. Is it OK for trunk?

Index: gcc/ChangeLog
===
--- gcc/ChangeLog	(revision 217394)
+++ gcc/ChangeLog	(working copy)
@@ -1,3 +1,25 @@
+2014-11-12  Felix Yang  felix.y...@huawei.com
+	    Haijian Zhang  z.zhanghaij...@huawei.com
+
+	* config/aarch64/aarch64.h (REGISTER_TARGET_PRAGMAS): Define.
+	* config/aarch64/aarch64.c (aarch64_set_default_type_attributes,
+	aarch64_attribute_table, aarch64_comp_type_attributes,
+	aarch64_decl_is_long_call_p, aarch64_function_in_section_p,
+	aarch64_pr_long_calls, aarch64_pr_no_long_calls,
+	aarch64_pr_long_calls_off): New functions.
+	(TARGET_SET_DEFAULT_TYPE_ATTRIBUTES): Define as
+	aarch64_set_default_type_attributes.
+	(TARGET_ATTRIBUTE_TABLE): Define as aarch64_attribute_table.
+	(TARGET_COMP_TYPE_ATTRIBUTES): Define as aarch64_comp_type_attribute.
+	(aarch64_pragma_enum): New enum.
+	(aarch64_attribute_table): New attribute table.
+	* config/aarch64/aarch64-protos.h (aarch64_pr_long_calls,
+	aarch64_pr_no_long_calls, aarch64_pr_long_calls_off): New declarations.
+	* config/aarch64/aarch64.md (sibcall, sibcall_value): Modified to
+	generate indirect call for sibling call when needed.
+	* config/aarch64/predicates.md (aarch64_call_insn_operand): Modified to
+	exclude a symbol_ref for an indirect call.
+
 2014-11-11  Andrew Pinski  apin...@cavium.com
 
 	Bug target/61997
Index: gcc/testsuite/gcc.target/aarch64/long-calls-1.c
===
--- gcc/testsuite/gcc.target/aarch64/long-calls-1.c	(revision 0)
+++ gcc/testsuite/gcc.target/aarch64/long-calls-1.c	(revision 0)
@@ -0,0 +1,133 @@
+/* Check that long calls to different sections are not optimized to bl.  */
+/* { dg-do compile } */
+/* { dg-options "-O2" } */
+/* This test expects that short calls are the default.  */
+/* { dg-skip-if "-mlong-calls in use" { *-*-* } { "-mlong-calls" } { "" } } */
+
+#define section(S) __attribute__((section(S)))
+#define weak __attribute__((weak))
+#define noinline __attribute__((noinline))
+#define long_call __attribute__((long_call))
+#define short_call __attribute__((short_call))
+
+#define REMOTE_CALL(ID, TARGET_ATTRS, CALL_ATTRS)		\
+  const char *TARGET_ATTRS ID (void);				\
+  const char *CALL_ATTRS call_##ID (void) { return ID () + 1; }
+
+#define EXTERN_CALL(ID, TARGET_ATTRS, CALL_ATTRS)		\
+  const char *TARGET_ATTRS noinline ID (void) { return
Re: [PATCH, PR63742][ARM] Fix arm *movhi_insn_arch4 pattern for big-endian
On 18/11/14 11:02, Yangfei (Felix) wrote: On 06/11/14 08:35, Yangfei (Felix) wrote: The idea is simple: Use movw for certain const source operand instead of ldrh. And exclude the const values which cannot be handled by mov/mvn/movw. I am doing regression test for this patch. Assuming no issue pops up, OK for trunk? So, doesn't that makes the bug latent for architectures older than armv6t2 and big endian and only fixed this in ARM state ? I'd prefer a complete solution please. What about *thumb2_movhi_insn in thumb2.md ? Actually, we fix the bug by excluding the const values which cannot be handled. The patch still works even without the adding of movw here. The new movw alternative here is just an small code optimization for certain arch. We can handle consts by movw instead of ldrh and this better for performance. We find that this bug is not reproducible for *thumb2_movhi_insn. The reason is that this pattern can always move consts using movw. Please fix the PR number in your final commit with PR 59593. Index: gcc/config/arm/predicates.md = == --- gcc/config/arm/predicates.md (revision 216838) +++ gcc/config/arm/predicates.md (working copy) @@ -144,6 +144,11 @@ (and (match_code const_int) (match_test INTVAL (op) == 0))) +(define_predicate arm_movw_immediate_operand + (and (match_test TARGET_32BIT arm_arch_thumb2) + (and (match_code const_int) + (match_test (INTVAL (op) 0x) == 0 + ;; Something valid on the RHS of an ARM data-processing instruction (define_predicate arm_rhs_operand (ior (match_operand 0 s_register_operand) @@ -211,6 +216,11 @@ (ior (match_operand 0 arm_rhs_operand) (match_operand 0 arm_not_immediate_operand))) +(define_predicate arm_hi_operand + (ior (match_operand 0 arm_rhsm_operand) + (ior (match_operand 0 arm_not_immediate_operand) +(match_operand 0 arm_movw_immediate_operand + I think you don't need any of these predicates. 
 (define_predicate "arm_di_operand"
   (ior (match_operand 0 "s_register_operand")
        (match_operand 0 "arm_immediate_di_operand")))
Index: gcc/config/arm/arm.md
===
--- gcc/config/arm/arm.md	(revision 216838)
+++ gcc/config/arm/arm.md	(working copy)
@@ -6285,8 +6285,8 @@
 ;; Pattern to recognize insn generated default case above
 (define_insn "*movhi_insn_arch4"
-  [(set (match_operand:HI 0 "nonimmediate_operand" "=r,r,m,r")
-	(match_operand:HI 1 "general_operand"      "rIk,K,r,mi"))]
+  [(set (match_operand:HI 0 "nonimmediate_operand" "=r,r,r,m,r")
+	(match_operand:HI 1 "arm_hi_operand"       "rIk,K,j,r,mi"))]

Use `n' instead of `j' - movw can handle all of the numerical constants for HImode values. And the predicate can remain general_operand. Did you read my comment about set_attr "arch" further down in the thread? Look at the set_attr "arch" alternative and set this to "t2" for the movw alternative. I would expect that to be enough to sort this out instead of adding all this code.

Sorry for missing the point. It seems to me that "t2" here will conflict with the condition of the pattern *movhi_insn_arch4:

  TARGET_ARM && arm_arch4
  && (register_operand (operands[0], HImode)
      || register_operand (operands[1], HImode))

  #define TARGET_ARM (! TARGET_THUMB)
  /* 32-bit Thumb-2 code.  */
  #define TARGET_THUMB2 (TARGET_THUMB && arm_arch_thumb2)

Right? Thanks.
Re: [PING ^ 2][RFC PATCH, AARCH64] Add support for -mlong-calls option
On Tue, Nov 18, 2014 at 11:51 AM, Yangfei (Felix) felix.y...@huawei.com wrote: Ping again? Any comment please? Pinging daily is only going to irritate people. Please desist from doing so. Ramana Oh, thanks for reminding me. And sorry if this bothers you guys. The end of stage1 of GCC 5.0 causes me to push this a little bit :-) Ping? I hope this patch can catch up with stage 1 of GCC-5.0. Thanks. Hi Felix, Sorry for the delay responding, I've been out of the office recently and I'm only just catching up on a backlog of GCC related emails. I'm in two minds about this; I can potentially see the need for attributes to enable long calls for specific calls, and maybe also for pragmas that can be used to efficiently mark a group of functions in that way; but I don't really see the value in adding a -mlong-calls option to do this globally. The reasoning is as follows: long calls are generally very expensive and relatively few functions should need them in most applications (since code that needs to span more than a single block of 128Mbytes - the span of a BL or B instruction - will be very rare in reality). The best way to handle very large branches for those rare cases where you do have a very large contiguous block of code more than 128MB is by having the linker insert veneers when needed; the code will branch to the veneer which will insert an indirect branch at that point (the ABI guarantees that at function call boundaries IP0 and IP1 will not contain live values, making them available for such purposes). In a very small number of cases it might be desirable to mark specific functions as being too far away to reach; in those cases the attributes and pragma methods can be used to mark such calls as being far calls. 
Aside: The reason -mlong-calls was added to GCC for ARM is that the code there pre-dates the EABI, which introduced the concept of link-time veneering of calls - the option should be unnecessary now that almost everyone uses the EABI as the basis for their platform ABI. We don't have such a legacy for AArch64 and I'd need to see strong justification for its use before adding the option there as well. So please can you rework the patch to remove the -mlong-calls option and just leave the attribute and pragma interfaces. R. Hello Richard, Thanks for the comments. I agree with the idea. And I updated the patch with the -mlong-calls option removed and use short call by default. Reg-tested for aarch64-linux-gnu with qemu. Is it OK for trunk? Index: gcc/ChangeLog = == --- gcc/ChangeLog (revision 217394) +++ gcc/ChangeLog (working copy) @@ -1,3 +1,25 @@ +2014-11-12 Felix Yang felix.y...@huawei.com + Haijian Zhang z.zhanghaij...@huawei.com + + * config/aarch64/aarch64.h (REGISTER_TARGET_PRAGMAS): Define. + * config/aarch64/aarch64.c (aarch64_set_default_type_attributes, + aarch64_attribute_table, aarch64_comp_type_attributes, + aarch64_decl_is_long_call_p, aarch64_function_in_section_p, + aarch64_pr_long_calls, aarch64_pr_no_long_calls, + aarch64_pr_long_calls_off): New functions. + (TARGET_SET_DEFAULT_TYPE_ATTRIBUTES): Define as + aarch64_set_default_type_attributes. + (TARGET_ATTRIBUTE_TABLE): Define as aarch64_attribute_table. + (TARGET_COMP_TYPE_ATTRIBUTES): Define as aarch64_comp_type_attribute. + (aarch64_pragma_enum): New enum. + (aarch64_attribute_table): New attribute table. + * config/aarch64/aarch64-protos.h (aarch64_pr_long_calls, + aarch64_pr_no_long_calls, aarch64_pr_long_calls_off): New declarations. + * config/aarch64/aarch64.md (sibcall, sibcall_value): Modified to + generate indirect call for sibling call when needed. + * config/aarch64/predicate.md (aarch64_call_insn_operand): Modified to + exclude a symbol_ref for an indirect call. 
+ 2014-11-11 Andrew Pinski apin...@cavium.com Bug target/61997 Index: gcc/testsuite/gcc.target/aarch64/long-calls-1.c = == --- gcc/testsuite/gcc.target/aarch64/long-calls-1.c (revision 0) +++ gcc/testsuite/gcc.target/aarch64/long-calls-1.c (revision 0) @@ -0,0 +1,133 @@ +/* Check that long calls to different sections are not optimized +to bl. */ +/* { dg-do compile } */ +/* { dg-options -O2 } */ +/* This test expects that short calls are the default. */ +/* { dg-skip-if -mlong-calls in use { *-*-* } { -mlong-calls +} { } } */ + +#define section(S) __attribute__((section(S))) #define weak +__attribute__((weak)) #define noinline __attribute__((noinline)) +#define
Re: [PING][PATCH] [AARCH64, NEON] Improve vcls(q?) vcnt(q?) and vld1(q?)_dup intrinsics
On 17 November 2014 06:58, Yangfei (Felix) felix.y...@huawei.com wrote: PING? BTW: It seems that Alan's way of improving the vld1(q?)_dup intrinsic is more elegant. So is the improvement of vcls(q?) and vcnt(q?) OK for trunk? Thanks.

Please rebase over Alan's patch and repost, thank you /Marcus

I rebased the patch on the latest trunk. Regtested for aarch64-linux-gnu with qemu. OK for the trunk?

Index: gcc/ChangeLog
===
--- gcc/ChangeLog	(revision 217717)
+++ gcc/ChangeLog	(working copy)
@@ -1,3 +1,14 @@
+2014-11-13  Felix Yang  felix.y...@huawei.com
+	    Shanyao Chen  chenshan...@huawei.com
+
+	* config/aarch64/aarch64-simd.md (clrsb<mode>2, popcount<mode>2): New
+	patterns.
+	* config/aarch64/aarch64-simd-builtins.def (clrsb, popcount): New
+	builtins.
+	* config/aarch64/arm_neon.h (vcls_s8, vcls_s16, vcls_s32, vclsq_s8,
+	vclsq_s16, vclsq_s32, vcnt_p8, vcnt_s8, vcnt_u8, vcntq_p8, vcntq_s8,
+	vcntq_u8): Rewrite using builtin functions.
+
 2014-11-18  Felix Yang  felix.y...@huawei.com
 
 	* config/aarch64/aarch64.md (doloop_end): New pattern.
Index: gcc/config/aarch64/arm_neon.h
===
--- gcc/config/aarch64/arm_neon.h	(revision 217717)
+++ gcc/config/aarch64/arm_neon.h	(working copy)
@@ -5317,138 +5317,6 @@ vaddlvq_u32 (uint32x4_t a)
   return result;
 }
 
-__extension__ static __inline int8x8_t __attribute__ ((__always_inline__))
-vcls_s8 (int8x8_t a)
-{
-  int8x8_t result;
-  __asm__ ("cls %0.8b,%1.8b"
-           : "=w"(result)
-           : "w"(a)
-           : /* No clobbers */);
-  return result;
-}
-
-__extension__ static __inline int16x4_t __attribute__ ((__always_inline__))
-vcls_s16 (int16x4_t a)
-{
-  int16x4_t result;
-  __asm__ ("cls %0.4h,%1.4h"
-           : "=w"(result)
-           : "w"(a)
-           : /* No clobbers */);
-  return result;
-}
-
-__extension__ static __inline int32x2_t __attribute__ ((__always_inline__))
-vcls_s32 (int32x2_t a)
-{
-  int32x2_t result;
-  __asm__ ("cls %0.2s,%1.2s"
-           : "=w"(result)
-           : "w"(a)
-           : /* No clobbers */);
-  return result;
-}
-
-__extension__ static __inline int8x16_t __attribute__ ((__always_inline__))
-vclsq_s8 (int8x16_t a)
-{
-  int8x16_t result;
-  __asm__ ("cls %0.16b,%1.16b"
-           : "=w"(result)
-           : "w"(a)
-           : /* No clobbers */);
-  return result;
-}
-
-__extension__ static __inline int16x8_t __attribute__ ((__always_inline__))
-vclsq_s16 (int16x8_t a)
-{
-  int16x8_t result;
-  __asm__ ("cls %0.8h,%1.8h"
-           : "=w"(result)
-           : "w"(a)
-           : /* No clobbers */);
-  return result;
-}
-
-__extension__ static __inline int32x4_t __attribute__ ((__always_inline__))
-vclsq_s32 (int32x4_t a)
-{
-  int32x4_t result;
-  __asm__ ("cls %0.4s,%1.4s"
-           : "=w"(result)
-           : "w"(a)
-           : /* No clobbers */);
-  return result;
-}
-
-__extension__ static __inline poly8x8_t __attribute__ ((__always_inline__))
-vcnt_p8 (poly8x8_t a)
-{
-  poly8x8_t result;
-  __asm__ ("cnt %0.8b,%1.8b"
-           : "=w"(result)
-           : "w"(a)
-           : /* No clobbers */);
-  return result;
-}
-
-__extension__ static __inline int8x8_t __attribute__ ((__always_inline__))
-vcnt_s8 (int8x8_t a)
-{
-  int8x8_t result;
-  __asm__ ("cnt %0.8b,%1.8b"
-           : "=w"(result)
-           : "w"(a)
-           : /* No clobbers */);
-  return result;
-}
-
-__extension__ static __inline uint8x8_t __attribute__ ((__always_inline__))
-vcnt_u8 (uint8x8_t a)
-{
-  uint8x8_t result;
-  __asm__ ("cnt %0.8b,%1.8b"
-           : "=w"(result)
-           : "w"(a)
-           : /* No clobbers */);
-  return result;
-}
-
-__extension__ static __inline poly8x16_t __attribute__ ((__always_inline__))
-vcntq_p8 (poly8x16_t a)
-{
-  poly8x16_t result;
-  __asm__ ("cnt %0.16b,%1.16b"
-           : "=w"(result)
-           : "w"(a)
-           : /* No clobbers */);
-  return result;
-}
-
-__extension__ static __inline int8x16_t __attribute__ ((__always_inline__))
-vcntq_s8 (int8x16_t a)
-{
-  int8x16_t result;
-  __asm__ ("cnt %0.16b,%1.16b"
-           : "=w"(result)
-           : "w"(a)
-           : /* No clobbers */);
-  return result;
-}
-
-__extension__ static __inline uint8x16_t __attribute__ ((__always_inline__))
-vcntq_u8 (uint8x16_t a)
-{
-  uint8x16_t result;
-  __asm__ ("cnt %0.16b,%1.16b"
-           : "=w"(result)
-           : "w"(a)
-           : /* No clobbers */);
-  return result;
-}
-
 #define vcopyq_lane_f32(a, b, c, d) \
   __extension__ \
     ({ \
@@ -14082,6 +13950,44 @@ vcltzd_f64 (float64_t __a)
   return __a < 0.0 ? -1ll : 0ll;
 }
 
+/* vcls.  */
+
+__extension__ static __inline int8x8_t __attribute__ ((__always_inline__))
+vcls_s8 (int8x8_t __a)
+{
+  return
[PING ^ 3][PATCH, AArch64] Add doloop_end pattern for -fmodulo-sched
PING? Is it OK for me to apply this patch? Thanks.

On 11/12/2014 11:01 AM, Yangfei (Felix) wrote:

+(define_expand "doloop_end"
+  [(use (match_operand 0 "")) ; loop pseudo
+   (use (match_operand 1 ""))] ; label
+  ""
+  ""
+{

Drop the "" surrounding the { }.

r~

Hello, I updated the patch with the redundant "" removed. Is it OK for trunk now? I hope this patch can catch up with stage 1 of GCC-5.0. Thanks.

Index: gcc/ChangeLog
===
--- gcc/ChangeLog	(revision 217394)
+++ gcc/ChangeLog	(working copy)
@@ -1,3 +1,7 @@
+2014-11-13  Felix Yang  felix.y...@huawei.com
+
+	* config/aarch64/aarch64.md (doloop_end): New pattern.
+
 2014-11-11  Andrew Pinski  apin...@cavium.com
 
 	Bug target/61997
Index: gcc/config/aarch64/aarch64.md
===
--- gcc/config/aarch64/aarch64.md	(revision 217394)
+++ gcc/config/aarch64/aarch64.md	(working copy)
@@ -4087,6 +4087,47 @@
   [(set_attr "type" "mrs")])
 
+;; Define the subtract-one-and-jump insns so loop.c
+;; knows what to generate.
+(define_expand "doloop_end"
+  [(use (match_operand 0 "")) ; loop pseudo
+   (use (match_operand 1 ""))] ; label
+  ""
+{
+  /* Currently SMS relies on the do-loop pattern to recognize loops
+     where (1) the control part consists of all insns defining and/or
+     using a certain 'count' register and (2) the loop count can be
+     adjusted by modifying this register prior to the loop.
+     ??? The possible introduction of a new block to initialize the
+     new IV can potentially affect branch optimizations.  */
+  if (optimize > 0 && flag_modulo_sched)
+    {
+      rtx s0;
+      rtx bcomp;
+      rtx loc_ref;
+      rtx cc_reg;
+      rtx insn;
+      rtx cmp;
+
+      if (GET_MODE (operands[0]) != DImode)
+	FAIL;
+
+      s0 = operands[0];
+      insn = emit_insn (gen_adddi3_compare0 (s0, s0, GEN_INT (-1)));
+
+      cmp = XVECEXP (PATTERN (insn), 0, 0);
+      cc_reg = SET_DEST (cmp);
+      bcomp = gen_rtx_NE (VOIDmode, cc_reg, const0_rtx);
+      loc_ref = gen_rtx_LABEL_REF (VOIDmode, operands[1]);
+      emit_jump_insn (gen_rtx_SET (VOIDmode, pc_rtx,
+				   gen_rtx_IF_THEN_ELSE (VOIDmode, bcomp,
+							 loc_ref, pc_rtx)));
+      DONE;
+    }
+  else
+    FAIL;
+})
+
 ;; AdvSIMD Stuff
 (include "aarch64-simd.md")