Re: [PATCH] ppc: testsuite: pr79004 needs -mlong-double-128

2024-04-29 Thread Kewen.Lin
on 2024/4/29 15:20, Alexandre Oliva wrote:
> On Apr 28, 2024, "Kewen.Lin"  wrote:
> 
>> OK, from this perspective IMHO it seems more clear to adopt xfail
>> with effective target long_double_64bit?
> 
> That effective target is quite broken, alas.  I doubt it's used
> anywhere: it calls an undefined proc, and its memcmp call seems to have
> the size cut from the 128-bit functions.  (a patchlet that
> fixes these most glaring issues is below)
> 
> Furthermore, it doesn't really work.  Since it adds -mlong-double-64 for
> the effective target test, it overrides the default, so it sort of
> always passes, even on a 128-bit long double target.  But since the test
> itself doesn't add that option, any xfails on long_double_64bit would be
> flagged as XPASS.
> 
> There's no effective target test for 64-bit long double that doesn't
> override the default, so we'd have to add one.  Alas, the natural name
> for it is the one that's taken with overriding behavior, and the current
> option-overriding tests, that need to be used along with the
> corresponding add-options in testcases, might benefit from a renaming to
> make them fit the already-established (?) naming standards.  Yuck.
> 

Oops, this is really unexpected.  I just noticed that no test cases use
this effective target, and even the commit r12-3151-g4c5d76a655b9ab that
introduced it doesn't adopt it.  Thanks for catching this, and sorry that
I didn't check it before suggesting it.  I think we can aggressively drop
this effective target instead to avoid any possible confusion.

CC Mike for this.

How about the generic one "longdouble64"?  I did a grep and found it has
one existing use; I'd expect it to work here. :)

gcc/testsuite//gcc.target/powerpc/pr99708.c:/* { dg-xfail-run-if "unsupported type __ibm128 with long-double-64" { longdouble64 } } */
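If longdouble64 does work out here, the xfail marker in pr79004.c might hypothetically look something like the sketch below (illustration only; the scanned pattern is a placeholder, not taken from the actual test):

```c
/* Hypothetical sketch: xfail a scanned instruction when the default
   long double is 64-bit.  "some_insn" is a placeholder pattern.  */
/* { dg-final { scan-assembler {\msome_insn\M} { xfail longdouble64 } } } */
```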

BR,
Kewen

> 
> diff --git a/gcc/testsuite/lib/target-supports.exp b/gcc/testsuite/lib/target-supports.exp
> index 182d80129de9b..603da25c97d67 100644
> --- a/gcc/testsuite/lib/target-supports.exp
> +++ b/gcc/testsuite/lib/target-supports.exp
> @@ -2961,12 +2961,12 @@ proc check_effective_target_long_double_64bit { } {
> /* eliminate removing volatile cast warning.  */
> a2 = a;
> b2 = b;
> -   if (memcmp (, , 16) != 0)
> +   if (memcmp (, , 8) != 0)
>   return 1;
> sprintf (buffer, "%lg", b);
> return strcmp (buffer, "3") != 0;
>   }
> -}  [add_options_for_ppc_long_double_override_64bit ""]]
> +}  [add_options_for_long_double_64bit ""]]
>  }
>  
>  # Return the appropriate options to specify that long double uses the IEEE
> 
> 



Re: [PATCH] adjust vectorization expectations for ppc costmodel 76b

2024-04-29 Thread Kewen.Lin
on 2024/4/29 14:28, Alexandre Oliva wrote:
> On Apr 28, 2024, "Kewen.Lin"  wrote:
> 
>> Nit: Maybe add a prefix "testsuite: ".
> 
> ACK
> 
>>>
>>> From: Kewen Lin 
> 
>> Thanks, you can just drop this.  :)
> 
> I've turned it into Co-Authored-By, since you insist.
> 
> But unfortunately with the patch it still fails when testing for
> -mcpu=power7 on ppc64le-linux-gnu: it does vectorize the loop with 13
> iterations.  We need 16 iterations, as in an earlier version of this
> test, for it to pass for -mcpu=power7, but then it doesn't pass for
> -mcpu=power6.
> 
> It looks like we're going to have to adjust the expectations.
> 

I had a look at the failure; it's because "vect_no_align" is unexpectedly
evaluated as true.

  "selector_expression: ` vect_no_align || {! vector_alignment_reachable} ' 1"

Currently for powerpc* it checks check_p8vector_hw_available, and
ppc64le-linux-gnu has at least Power8 support (that is, the testing machine
supports running p8vector code), so it concludes that vect_no_align is true.

proc check_effective_target_vect_no_align { } {
return [check_cached_effective_target_indexed vect_no_align {
  expr { [istarget mipsisa64*-*-*]
 || [istarget mips-sde-elf]
 || [istarget sparc*-*-*]
 || [istarget ia64-*-*]
 || [check_effective_target_arm_vect_no_misalign]
 || ([istarget powerpc*-*-*] && [check_p8vector_hw_available])

I'll fix this in PR113535, which was filed previously for revisiting the
powerpc-specific checks in these vect* effective targets.  If the testing
just goes with the native cpu type, this issue stays invisible.  I think
you can still push the patch, as the testing just exposes another issue.

BR,
Kewen



Re: [PATCH] ppc: testsuite: pr79004 needs -mlong-double-128

2024-04-28 Thread Kewen.Lin
Hi,

on 2024/4/28 16:20, Alexandre Oliva wrote:
> On Apr 23, 2024, "Kewen.Lin"  wrote:
> 
>> This patch seemed to miss to CC gcc-patches list. :)
> 
> Oops, sorry, thanks for catching that.
> 
> Here it is.  FTR, you've already responded suggesting an apparent
> preference for addressing PR105359, but since I meant to contribute it,
> I'm reposting it to gcc-patches, now with a reference to the PR.

OK, from this perspective IMHO it seems more clear to adopt xfail
with effective target long_double_64bit?

BR,
Kewen

> 
> 
> ppc: testsuite: pr79004 needs -mlong-double-128
> 
> Some of the asm opcodes expected by pr79004 depend on
> -mlong-double-128 to be output.  E.g., without this flag, the
> conditions of patterns @extenddf2 and extendsf2 do not
> hold, and so GCC resorts to libcalls instead of even trying
> rs6000_expand_float128_convert.
> 
> Perhaps the conditions are too strict, and they could enable the use
> of conversion insns involving __ieee128/_Float128 even with 64-bit
> long doubles.  Alas, for now, we need this flag for the test to pass
> on target variants that use 64-bit long doubles.
> 
> 
> for  gcc/testsuite/ChangeLog
> 
>   * gcc.target/powerpc/pr79004.c: Add -mlong-double-128.
> ---
>  gcc/testsuite/gcc.target/powerpc/pr79004.c |2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)
> 
> diff --git a/gcc/testsuite/gcc.target/powerpc/pr79004.c b/gcc/testsuite/gcc.target/powerpc/pr79004.c
> index e411702dc98a9..061a0e83fe2ad 100644
> --- a/gcc/testsuite/gcc.target/powerpc/pr79004.c
> +++ b/gcc/testsuite/gcc.target/powerpc/pr79004.c
> @@ -1,6 +1,6 @@
>  /* { dg-do compile { target { powerpc*-*-* && lp64 } } } */
>  /* { dg-require-effective-target powerpc_p9vector_ok } */
> -/* { dg-options "-mdejagnu-cpu=power9 -O2 -mfloat128" } */
> +/* { dg-options "-mdejagnu-cpu=power9 -O2 -mfloat128 -mlong-double-128" } */
>  /* { dg-prune-output ".-mfloat128. option may not be fully supported" } */
>  
>  #include 
> 
> 





Re: [PATCH] adjust vectorization expectations for ppc costmodel 76b

2024-04-28 Thread Kewen.Lin
Hi,

on 2024/4/28 16:14, Alexandre Oliva wrote:
> On Apr 24, 2024, "Kewen.Lin"  wrote:
> 
>> For !has_arch_pwr7 case, it still adopts peeling but as the comment (one
>> line above) shows the original intention of this case is to expect not
>> profitable for peeling so it's not expected to be handled here, can we
>> just tweak the loop bound instead, such as:
> 
>> -#define N 14
>> +#define N 13
>>  #define OFF 4 
> 
>> ?, it can make this loop not profitable to be vectorized for
>> !vect_no_align with peeling (both pwr7 and pwr6) and keep consistent.
> 
> Like this?  I didn't feel I could claim authorship of this one-liner
> just because I turned it into a patch and tested it, so I took the
> liberty of turning your own words above into the commit message.  So

Feel free to do so!

> far, tested on ppc64le-linux-gnu (ppc9).  Testing with vxworks targets
> now.  Would you like to tweak the commit message to your liking?

OK, tweaked as below.

> Otherwise, is this ok to install?
> 
> Thanks,
> 
> 
> adjust iteration count for ppc costmodel 76b

Nit: Maybe add a prefix "testsuite: ".

> 
> From: Kewen Lin 

Thanks, you can just drop this.  :)

> 
> The original intention of this case is to expect not profitable for
> peeling.  Tweak the loop bound to make this loop not profitable to be
> vectorized for !vect_no_align with peeling (both pwr7 and pwr6) and
> keep consistent.

For some hardware which doesn't support unaligned vector memory access,
test case costmodel-vect-76b.c expects to see cost modeling would make
the decision that it's not profitable for peeling, according to the
commit history, test case comments and the way to check.

For now, the existing loop bound 14 works well for Power7, but it does
not for some targets on which the cost of the vec_perm operation differs
from Power7's, such as Power6 (3 vs. 1).  This difference further causes
a difference (10 vs. 12) in the minimum iteration count for profitability,
and hence the failure.  To keep the original test point, this patch tweaks
the loop bound to ensure it's not profitable to be vectorized for
!vect_no_align with peeling.

OK for trunk (assuming the testing runs well on p6/p7 too), thanks!

BR,
Kewen

> 
> 
> for  gcc/testsuite/ChangeLog
> 
>   * gcc.dg/vect/costmodel/ppc/costmodel-vect-76b.c (N): Tweak.
> ---
>  .../gcc.dg/vect/costmodel/ppc/costmodel-vect-76b.c |2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)
> 
> diff --git a/gcc/testsuite/gcc.dg/vect/costmodel/ppc/costmodel-vect-76b.c b/gcc/testsuite/gcc.dg/vect/costmodel/ppc/costmodel-vect-76b.c
> index cbbfbb24658f8..e48b0ab759e75 100644
> --- a/gcc/testsuite/gcc.dg/vect/costmodel/ppc/costmodel-vect-76b.c
> +++ b/gcc/testsuite/gcc.dg/vect/costmodel/ppc/costmodel-vect-76b.c
> @@ -6,7 +6,7 @@
>  
>  /* On Power7 without misalign vector support, this case is to check it's not
> profitable to perform vectorization by peeling to align the store.  */
> -#define N 14
> +#define N 13
>  #define OFF 4
>  
>  /* Check handling of accesses for which the "initial condition" -
> 
> 



Re: [PATCH] adjust vectorization expectations for ppc costmodel 76b

2024-04-24 Thread Kewen.Lin
Hi,

on 2024/4/22 17:28, Alexandre Oliva wrote:
> Ping?
> https://gcc.gnu.org/pipermail/gcc-patches/2021-March/566525.html
> 
> 
> This test expects vectorization at power8+ because strict alignment is
> not required for vectors.  For power7, vectorization is not to take
> place because it's not deemed profitable: 12 iterations would be
> required to make it so.
> 
> But for power6 and below, the test's 10 iterations are enough to make
> vectorization profitable, but the test doesn't expect this.  Assuming
> the decision is indeed appropriate, I'm adjusting the expectations.

For the record, the cost difference between power6 and power7 is the cost
for vec_perm; it's:

* p6 *

ic[i_23] 2 times vector_stmt costs 2 in prologue
ic[i_23] 1 times vector_stmt costs 1 in prologue
ic[i_23] 1 times vector_load costs 2 in body
ic[i_23] 1 times vec_perm costs 1 in body

vs.

* p7 *

ic[i_23] 2 times vector_stmt costs 2 in prologue
ic[i_23] 1 times vector_stmt costs 1 in prologue
ic[i_23] 1 times vector_load costs 2 in body
ic[i_23] 1 times vec_perm costs 3 in body

This further causes the difference in the minimum iterations required for
profitability.

> 
> 
> for  gcc/testsuite/ChangeLog
> 
>   * gcc.dg/vect/costmodel/ppc/costmodel-vect-76b.c: Adjust
>   expectations for cpus below power7.
> ---
>  .../gcc.dg/vect/costmodel/ppc/costmodel-vect-76b.c |9 +
>  1 file changed, 5 insertions(+), 4 deletions(-)
> 
> diff --git a/gcc/testsuite/gcc.dg/vect/costmodel/ppc/costmodel-vect-76b.c b/gcc/testsuite/gcc.dg/vect/costmodel/ppc/costmodel-vect-76b.c
> index cbbfbb24658f8..0dab2c08acdb4 100644
> --- a/gcc/testsuite/gcc.dg/vect/costmodel/ppc/costmodel-vect-76b.c
> +++ b/gcc/testsuite/gcc.dg/vect/costmodel/ppc/costmodel-vect-76b.c
> @@ -46,9 +46,10 @@ int main (void)
>return 0;
>  }
>  
> -/* Peeling to align the store is used. Overhead of peeling is too high.  */
> -/* { dg-final { scan-tree-dump-times "vectorized 1 loops" 0 "vect" { target { vector_alignment_reachable && {! vect_no_align} } } } } */
> -/* { dg-final { scan-tree-dump-times "vectorization not profitable" 1 "vect" { target { vector_alignment_reachable && {! vect_hw_misalign} } } } } */
> +/* Peeling to align the store is used. Overhead of peeling is too high
> +   for power7, but acceptable for earlier architectures.  */
> +/* { dg-final { scan-tree-dump-times "vectorized 1 loops" 0 "vect" { target { has_arch_pwr7 && { vector_alignment_reachable && {! vect_no_align} } } } } } */
> +/* { dg-final { scan-tree-dump-times "vectorization not profitable" 1 "vect" { target { has_arch_pwr7 && { vector_alignment_reachable && {! vect_hw_misalign} } } } } } */
>  
>  /* Versioning to align the store is used. Overhead of versioning is not too high.  */
> -/* { dg-final { scan-tree-dump-times "vectorized 1 loops" 1 "vect" { target { vect_no_align || {! vector_alignment_reachable} } } } } */
> +/* { dg-final { scan-tree-dump-times "vectorized 1 loops" 1 "vect" { target { vect_no_align || { {! vector_alignment_reachable} || {! has_arch_pwr7 } } } } } } */

For !has_arch_pwr7 case, it still adopts peeling but as the comment (one
line above) shows the original intention of this case is to expect not
profitable for peeling so it's not expected to be handled here, can we
just tweak the loop bound instead, such as:

-#define N 14
+#define N 13
 #define OFF 4 

?, it can make this loop not profitable to be vectorized for !vect_no_align with
peeling (both pwr7 and pwr6) and keep consistent.

BR,
Kewen

> 
> 



Re: [PATCH v2] [testsuite] require sqrt_insn effective target where needed

2024-04-23 Thread Kewen.Lin
Hi,

on 2024/4/22 17:56, Alexandre Oliva wrote:
> This patch takes feedback received for 3 earlier patches, and adopts a
> simpler approach to skip the still-failing tests, that I believe to be
> in line with ppc maintainers' expressed preferences.
> https://gcc.gnu.org/pipermail/gcc-patches/2021-February/565939.html
> https://gcc.gnu.org/pipermail/gcc-patches/2021-March/566617.html
> https://gcc.gnu.org/pipermail/gcc-patches/2021-March/566521.html
> Ping?-ish :-)
> 
> 
> Some tests fail on ppc and ppc64 when testing a compiler [with options
> for] for a CPU [emulator] that doesn't support the sqrt insn.
> 
> The gcc.dg/cdce3.c is one in which the expected shrink-wrap
> optimization only takes place when the target CPU supports a sqrt
> insn.
> 
> The gcc.target/powerpc/pr46728-1[0-4].c tests use -mpowerpc-gpopt and
> call sqrt(), which involves the sqrt insn that the target CPU under
> test may not support.
> 
> Require a sqrt_insn effective target for all the affected tests.
> 
> Regstrapped on x86_64-linux-gnu and ppc64el-linux-gnu.  Also testing
> with gcc-13 on ppc64-vx7r2 and ppc-vx7r2.  Ok to install?
> 
> 
> for  gcc/testsuite/ChangeLog
> 
>   * gcc.dg/cdce3.c: Require sqrt_insn effective target.
>   * gcc.target/powerpc/pr46728-10.c: Likewise.
>   * gcc.target/powerpc/pr46728-11.c: Likewise.
>   * gcc.target/powerpc/pr46728-13.c: Likewise.
>   * gcc.target/powerpc/pr46728-14.c: Likewise.
> ---
>  gcc/testsuite/gcc.dg/cdce3.c  |3 ++-
>  gcc/testsuite/gcc.target/powerpc/pr46728-10.c |1 +
>  gcc/testsuite/gcc.target/powerpc/pr46728-11.c |1 +
>  gcc/testsuite/gcc.target/powerpc/pr46728-13.c |1 +
>  gcc/testsuite/gcc.target/powerpc/pr46728-14.c |1 +
>  5 files changed, 6 insertions(+), 1 deletion(-)
> 
> diff --git a/gcc/testsuite/gcc.dg/cdce3.c b/gcc/testsuite/gcc.dg/cdce3.c
> index 601ddf055fd71..f759a95972e8b 100644
> --- a/gcc/testsuite/gcc.dg/cdce3.c
> +++ b/gcc/testsuite/gcc.dg/cdce3.c
> @@ -1,7 +1,8 @@
>  /* { dg-do compile } */
>  /* { dg-require-effective-target hard_float } */
> +/* { dg-require-effective-target sqrt_insn } */
>  /* { dg-options "-O2 -fmath-errno -fdump-tree-cdce-details -fdump-tree-optimized" } */
> -/* { dg-final { scan-tree-dump "cdce3.c:11: \[^\n\r]* function call is shrink-wrapped into error conditions\." "cdce" } } */
> +/* { dg-final { scan-tree-dump "cdce3.c:12: \[^\n\r]* function call is shrink-wrapped into error conditions\." "cdce" } } */
>  /* { dg-final { scan-tree-dump "sqrtf \\(\[^\n\r]*\\); \\\[tail call\\\]" "optimized" } } */
>  /* { dg-skip-if "doesn't have a sqrtf insn" { mmix-*-* } } */
> 

This change needs an approval from a global maintainer, as it touches a
generic test case?

> diff --git a/gcc/testsuite/gcc.target/powerpc/pr46728-10.c b/gcc/testsuite/gcc.target/powerpc/pr46728-10.c
> index 3be4728d333a4..7e9bb638106c2 100644
> --- a/gcc/testsuite/gcc.target/powerpc/pr46728-10.c
> +++ b/gcc/testsuite/gcc.target/powerpc/pr46728-10.c
> @@ -1,6 +1,7 @@
>  /* { dg-do run } */
>  /* { dg-skip-if "-mpowerpc-gpopt not supported" { powerpc*-*-darwin* } } */
>  /* { dg-options "-O2 -ffast-math -fno-inline -fno-unroll-loops -lm -mpowerpc-gpopt" } */
> +/* { dg-require-effective-target sqrt_insn } */

This change looks sensible to me.

Nit: With the proposed change, I'd expect that we can remove the line for 
powerpc*-*-darwin*.

CC Iain to confirm.

BR,
Kewen

> 
>  #include 
> 
> diff --git a/gcc/testsuite/gcc.target/powerpc/pr46728-11.c b/gcc/testsuite/gcc.target/powerpc/pr46728-11.c
> index 43b6728a4b812..5bfa25925675a 100644
> --- a/gcc/testsuite/gcc.target/powerpc/pr46728-11.c
> +++ b/gcc/testsuite/gcc.target/powerpc/pr46728-11.c
> @@ -1,6 +1,7 @@
>  /* { dg-do run } */
>  /* { dg-skip-if "-mpowerpc-gpopt not supported" { powerpc*-*-darwin* } } */
>  /* { dg-options "-O2 -ffast-math -fno-inline -fno-unroll-loops -lm -mpowerpc-gpopt" } */
> +/* { dg-require-effective-target sqrt_insn } */
> 
>  #include 
> 
> diff --git a/gcc/testsuite/gcc.target/powerpc/pr46728-13.c b/gcc/testsuite/gcc.target/powerpc/pr46728-13.c
> index b9fd63973b728..b66d0209a5e54 100644
> --- a/gcc/testsuite/gcc.target/powerpc/pr46728-13.c
> +++ b/gcc/testsuite/gcc.target/powerpc/pr46728-13.c
> @@ -1,6 +1,7 @@
>  /* { dg-do run } */
>  /* { dg-skip-if "-mpowerpc-gpopt not supported" { powerpc*-*-darwin* } } */
>  /* { dg-options "-O2 -ffast-math -fno-inline -fno-unroll-loops -lm -mpowerpc-gpopt" } */
> +/* { dg-require-effective-target sqrt_insn } */
> 
>  #include 
> 
> diff --git a/gcc/testsuite/gcc.target/powerpc/pr46728-14.c b/gcc/testsuite/gcc.target/powerpc/pr46728-14.c
> index 5a13bdb6c..71a1a70c4e7a2 100644
> --- a/gcc/testsuite/gcc.target/powerpc/pr46728-14.c
> +++ b/gcc/testsuite/gcc.target/powerpc/pr46728-14.c
> @@ -1,6 +1,7 @@
>  /* { dg-do run } */
>  /* { dg-skip-if "-mpowerpc-gpopt not supported" { powerpc*-*-darwin* } } */
>  /* { dg-options "-O2 -ffast-math -fno-inline 

Re: [PATCH v2] xfail fetestexcept test - ppc always uses fcmpu

2024-04-23 Thread Kewen.Lin
Hi,

on 2024/4/22 18:00, Alexandre Oliva wrote:
> On Mar 10, 2021, Joseph Myers  wrote:
> 
>> On Wed, 10 Mar 2021, Alexandre Oliva wrote:
>>> operand exception for quiet NaN.  I couldn't find any evidence that
>>> the rs6000 backend ever outputs fcmpo.  Therefore, I'm adding the same
>>> execution xfail marker to this test.
> 
>> In my view, such an XFAIL (for a GCC bug as opposed to an environmental 
>> issue) should have a comment pointing to a corresponding open bug in GCC 
>> Bugzilla.  In this case, that's bug 58684.
> 
> Thanks for the suggestion, yeah, that makes sense.  Fixed in v2 below.
> https://gcc.gnu.org/pipermail/gcc-patches/2021-March/566523.html
> Ping?-ish
> 
> 
> gcc.dg/torture/pr91323.c tests that a compare with NaNf doesn't set an
> exception using builtin compare intrinsics, and that it does when
> using regular compare operators.
> 
> That doesn't seem to be expected to work on powerpc targets.  It fails
> on GNU/Linux, it's marked to be skipped on AIX, and a similar test,
> gcc.dg/torture/pr93133.c, has the execution test xfailed for all of
> powerpc*-*-*.
> 
> In this test, the functions that use intrinsics for the compare end up
> with the same code as the one that uses compare operators, using
> fcmpu, a floating compare that, unlike fcmpo, does not set the invalid
> operand exception for quiet NaN.  I couldn't find any evidence that
> the rs6000 backend ever outputs fcmpo.  Therefore, I'm adding the same
> execution xfail marker to this test.
> 
> Regstrapped on x86_64-linux-gnu and ppc64el-linux-gnu.  Also tested with
> gcc-13 on ppc64-vx7r2 and ppc-vx7r2.  Ok to install?
> 
> 
> for  gcc/testsuite/ChangeLog
> 
>   PR target/58684
>   * gcc.dg/torture/pr91323.c: Expect execution fail on
>   powerpc*-*-*.
> ---
>  gcc/testsuite/gcc.dg/torture/pr91323.c |3 ++-
>  1 file changed, 2 insertions(+), 1 deletion(-)
> 
> diff --git a/gcc/testsuite/gcc.dg/torture/pr91323.c b/gcc/testsuite/gcc.dg/torture/pr91323.c
> index 1411fcaa3966c..f188faa3ccf47 100644
> --- a/gcc/testsuite/gcc.dg/torture/pr91323.c
> +++ b/gcc/testsuite/gcc.dg/torture/pr91323.c
> @@ -1,4 +1,5 @@
> -/* { dg-do run } */
> +/* { dg-do run { xfail powerpc*-*-* } } */
> +/* The ppc xfail is because of PR target/58684.  */

OK, though the proposed comment is slightly different from what's in
the related commit r8-6445-g86145a19abf39f. :)  Thanks!

BR,
Kewen

>  /* { dg-add-options ieee } */
>  /* { dg-require-effective-target fenv_exceptions } */
>  /* { dg-skip-if "fenv" { powerpc-ibm-aix* } } */
> 
> 



Re: [PATCH] ppc: testsuite: vec-mul requires vsx runtime

2024-04-23 Thread Kewen.Lin
on 2024/4/22 17:35, Alexandre Oliva wrote:
> Ping?
> https://gcc.gnu.org/pipermail/gcc-patches/2022-May/593947.html
> 
> 
> vec-mul is an execution test, but it only requires a powerpc_vsx_ok
> effective target, which is enough only for compile tests.  To check
> for runtime and execution environment support, we need to
> require vsx_hw.  Make that a condition for execution, but still
> perform a compile test if the condition is not satisfied.
> 
> Regstrapped on x86_64-linux-gnu and ppc64el-linux-gnu.  Also tested with
> gcc-13 on ppc64-vx7r2 and ppc-vx7r2.  Ok to install?
> 
> 
> for  gcc/testsuite/ChangeLog
> 
>   * gcc.target/powerpc/vec-mul.c: Run on target vsx_hw, just
>   compile otherwise.
> ---
>  gcc/testsuite/gcc.target/powerpc/vec-mul.c |3 ++-
>  1 file changed, 2 insertions(+), 1 deletion(-)
> 
> diff --git a/gcc/testsuite/gcc.target/powerpc/vec-mul.c b/gcc/testsuite/gcc.target/powerpc/vec-mul.c
> index bfcaf80719d1d..11da86159723f 100644
> --- a/gcc/testsuite/gcc.target/powerpc/vec-mul.c
> +++ b/gcc/testsuite/gcc.target/powerpc/vec-mul.c
> @@ -1,4 +1,5 @@
> -/* { dg-do run } */
> +/* { dg-do compile { target { ! vsx_hw } } } */
> +/* { dg-do run { target vsx_hw } } */
>  /* { dg-require-effective-target powerpc_vsx_ok } */

Nit: It's useless to check powerpc_vsx_ok when vsx_hw holds, so the
powerpc_vsx_ok check can be moved to go together with ! vsx_hw.

OK with this nit tweaked, thanks!
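Concretely, applying the nit might leave the directives looking something like this (a sketch of the suggestion, not the committed form):

```c
/* { dg-do compile { target { powerpc_vsx_ok && { ! vsx_hw } } } } */
/* { dg-do run { target vsx_hw } } */
/* { dg-options "-mvsx -O3" } */
```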

BR,
Kewen

>  /* { dg-options "-mvsx -O3" } */
> 
> 
> 


Re: [PATCH] Request check for hw support in ppc run tests with -maltivec/-mvsx

2024-04-23 Thread Kewen.Lin
on 2024/4/22 17:31, Alexandre Oliva wrote:
> 
> From: Olivier Hainque 
> 
> Regstrapped on x86_64-linux-gnu and ppc64el-linux-gnu.  Also tested with
> gcc-13 on ppc64-vx7r2 and ppc-vx7r2.  Ok to install?

OK, thanks!

BR,
Kewen

> 
> for  gcc/testsuite/ChangeLog
> 
>   * gcc.target/powerpc/swaps-p8-20.c: Change powerpc_altivec_ok
>   require-effective-target test into vmx_hw.
>   * gcc.target/powerpc/vsx-vector-5.c: Change powerpc_vsx_ok
>   require-effective-target test into vsx_hw.
> ---
>  gcc/testsuite/gcc.target/powerpc/swaps-p8-20.c  |2 +-
>  gcc/testsuite/gcc.target/powerpc/vsx-vector-5.c |5 +
>  2 files changed, 2 insertions(+), 5 deletions(-)
> 
> diff --git a/gcc/testsuite/gcc.target/powerpc/swaps-p8-20.c b/gcc/testsuite/gcc.target/powerpc/swaps-p8-20.c
> index 564e8acb1f421..755519bfe847d 100644
> --- a/gcc/testsuite/gcc.target/powerpc/swaps-p8-20.c
> +++ b/gcc/testsuite/gcc.target/powerpc/swaps-p8-20.c
> @@ -1,5 +1,5 @@
>  /* { dg-do run } */
> -/* { dg-require-effective-target powerpc_altivec_ok } */
> +/* { dg-require-effective-target vmx_hw } */
>  /* { dg-options "-O2 -mdejagnu-cpu=power8 -maltivec" } */
>  
>  /* The expansion for vector character multiply introduces a vperm operation.
> diff --git a/gcc/testsuite/gcc.target/powerpc/vsx-vector-5.c b/gcc/testsuite/gcc.target/powerpc/vsx-vector-5.c
> index dcc88b1f3a4c6..37a324b6f897d 100644
> --- a/gcc/testsuite/gcc.target/powerpc/vsx-vector-5.c
> +++ b/gcc/testsuite/gcc.target/powerpc/vsx-vector-5.c
> @@ -1,11 +1,8 @@
>  /* { dg-do run { target lp64 } } */
>  /* { dg-skip-if "" { powerpc*-*-darwin* } } */
> -/* { dg-require-effective-target powerpc_vsx_ok } */
> +/* { dg-require-effective-target vsx_hw } */
>  /* { dg-options "-mvsx -O2" } */
>  
> -/* This will run, and someday we should add the support to test whether we are
> -   running on VSX hardware.  */
> -
>  #include 
>  #include 
>  
> 



Re: [PATCH] disable ldist for test, to restore vectorizing-candidate loop

2024-04-23 Thread Kewen.Lin
on 2024/4/22 17:27, Alexandre Oliva wrote:
> Ping?
> https://gcc.gnu.org/pipermail/gcc-patches/2021-March/566524.html
> 
> The loop we're supposed to try to vectorize in
> gcc.dg/vect/costmodel/ppc/costmodel-vect-31a.c is turned into a memset
> before the vectorizer runs.
> 
> Various other tests in this set have already run into this, and the
> solution has been to disable this loop distribution transformation,
> enabled at -O2, so that the vectorizer gets a chance to transform the
> loop and, in this testcase, fail to do so.
> 
> Regstrapped on x86_64-linux-gnu and ppc64el-linux-gnu.  Also tested with
> gcc-13 on ppc64-vx7r2 and ppc-vx7r2.  Ok to install?

OK, thanks!

BR,
Kewen

> 
> 
> for  gcc/testsuite/ChangeLog
> 
>   * gcc.dg/vect/costmodel/ppc/costmodel-vect-31a.c: Disable
>   ldist.
> ---
>  .../gcc.dg/vect/costmodel/ppc/costmodel-vect-31a.c |1 +
>  1 file changed, 1 insertion(+)
> 
> diff --git a/gcc/testsuite/gcc.dg/vect/costmodel/ppc/costmodel-vect-31a.c b/gcc/testsuite/gcc.dg/vect/costmodel/ppc/costmodel-vect-31a.c
> index 454a714a30916..90b5d5a7f400b 100644
> --- a/gcc/testsuite/gcc.dg/vect/costmodel/ppc/costmodel-vect-31a.c
> +++ b/gcc/testsuite/gcc.dg/vect/costmodel/ppc/costmodel-vect-31a.c
> @@ -1,4 +1,5 @@
>  /* { dg-require-effective-target vect_int } */
> +/* { dg-additional-options "-fno-tree-loop-distribute-patterns" } */
> 
>  #include 
>  #include "../../tree-vect.h"
> 




Re: [PATCH] [testsuite] [ppc64] expect error on vxworks too

2024-04-23 Thread Kewen.Lin
on 2024/4/22 17:23, Alexandre Oliva wrote:
> 
> These ppc lp64 tests check for errors or warnings on -mno-powerpc64.
> On powerpc64-*-vxworks* we get the same errors as on most other
> covered platforms, but the tests did not mark them as expected for
> this target.  On powerpc-*-vxworks*, the tests are skipped because
> lp64 is not satisfied, so I'm naming powerpc*-*-vxworks* rather than
> something more specific.
> 
> Regstrapped on x86_64-linux-gnu and ppc64el-linux-gnu.  Also tested with
> gcc-13 on ppc64-vx7r2 and ppc-vx7r2.  Ok to install?

OK, thanks!

BR,
Kewen

> 
> 
> for  gcc/testsuite/ChangeLog
> 
>   * gcc.target/powerpc/pr106680-1.c: Error on vxworks too.
>   * gcc.target/powerpc/pr106680-2.c: Likewise.
>   * gcc.target/powerpc/pr106680-3.c: Likewise.
> ---
>  gcc/testsuite/gcc.target/powerpc/pr106680-1.c |2 +-
>  gcc/testsuite/gcc.target/powerpc/pr106680-2.c |2 +-
>  gcc/testsuite/gcc.target/powerpc/pr106680-3.c |2 +-
>  3 files changed, 3 insertions(+), 3 deletions(-)
> 
> diff --git a/gcc/testsuite/gcc.target/powerpc/pr106680-1.c b/gcc/testsuite/gcc.target/powerpc/pr106680-1.c
> index d624d43230a7a..aadaa614cfeba 100644
> --- a/gcc/testsuite/gcc.target/powerpc/pr106680-1.c
> +++ b/gcc/testsuite/gcc.target/powerpc/pr106680-1.c
> @@ -8,6 +8,6 @@ int foo ()
>return 1;
>  }
> 
> -/* { dg-error "'-m64' requires a PowerPC64 cpu" "PR106680" { target powerpc*-*-linux* powerpc*-*-freebsd* powerpc-*-rtems* } 0 } */
> +/* { dg-error "'-m64' requires a PowerPC64 cpu" "PR106680" { target powerpc*-*-linux* powerpc*-*-freebsd* powerpc-*-rtems* powerpc*-*-vxworks* } 0 } */
>  /* { dg-warning "'-m64' requires PowerPC64 architecture, enabling" "PR106680" { target powerpc*-*-darwin* } 0 } */
>  /* { dg-warning "'-maix64' requires PowerPC64 architecture remain enabled" "PR106680" { target powerpc*-*-aix* } 0 } */
> diff --git a/gcc/testsuite/gcc.target/powerpc/pr106680-2.c b/gcc/testsuite/gcc.target/powerpc/pr106680-2.c
> index a9ed73726ef0c..f0758e303350a 100644
> --- a/gcc/testsuite/gcc.target/powerpc/pr106680-2.c
> +++ b/gcc/testsuite/gcc.target/powerpc/pr106680-2.c
> @@ -9,6 +9,6 @@ int foo ()
>return 1;
>  }
> 
> -/* { dg-error "'-m64' requires a PowerPC64 cpu" "PR106680" { target powerpc*-*-linux* powerpc*-*-freebsd* powerpc-*-rtems* } 0 } */
> +/* { dg-error "'-m64' requires a PowerPC64 cpu" "PR106680" { target powerpc*-*-linux* powerpc*-*-freebsd* powerpc-*-rtems* powerpc*-*-vxworks* } 0 } */
>  /* { dg-warning "'-m64' requires PowerPC64 architecture, enabling" "PR106680" { target powerpc*-*-darwin* } 0 } */
>  /* { dg-warning "'-maix64' requires PowerPC64 architecture remain enabled" "PR106680" { target powerpc*-*-aix* } 0 } */
> diff --git a/gcc/testsuite/gcc.target/powerpc/pr106680-3.c b/gcc/testsuite/gcc.target/powerpc/pr106680-3.c
> index b642d5c7a008d..bca012e2cf663 100644
> --- a/gcc/testsuite/gcc.target/powerpc/pr106680-3.c
> +++ b/gcc/testsuite/gcc.target/powerpc/pr106680-3.c
> @@ -8,6 +8,6 @@ int foo ()
>return 1;
>  }
> 
> -/* { dg-error "'-m64' requires a PowerPC64 cpu" "PR106680" { target powerpc*-*-linux* powerpc*-*-freebsd* powerpc-*-rtems* } 0 } */
> +/* { dg-error "'-m64' requires a PowerPC64 cpu" "PR106680" { target powerpc*-*-linux* powerpc*-*-freebsd* powerpc-*-rtems* powerpc*-*-vxworks* } 0 } */
>  /* { dg-warning "'-m64' requires PowerPC64 architecture, enabling" "PR106680" { target powerpc*-*-darwin* } 0 } */
>  /* { dg-warning "'-maix64' requires PowerPC64 architecture remain enabled" "PR106680" { target powerpc*-*-aix* } 0 } */
> 



Re: [PATCH, rs6000] Use bcdsub. instead of bcdadd. for bcd invalid number checking

2024-04-17 Thread Kewen.Lin
Hi,

on 2024/4/18 10:01, HAO CHEN GUI wrote:
> Hi,
>   This patch replaces bcdadd. with bcdsub. for BCD invalid-number checking.
> bcdadd. on two identical numbers might cause overflow, which also sets the
> overflow/invalid bit, so we can't distinguish invalid from overflow.
> bcdsub. doesn't have this problem, as subtracting two identical numbers
> never causes overflow.
> 
>   Bootstrapped and tested on powerpc64-linux BE and LE with no
> regressions. Is it OK for the trunk?

Considering that this issue affects some basic functionality of bcd bifs
and the fix itself is simple and very safe, OK for trunk, thanks for fixing!

BR,
Kewen

> 
> Thanks
> Gui Haochen
> 
> ChangeLog
> rs6000: Use bcdsub. instead of bcdadd. for bcd invalid number checking
> 
> bcdadd. might cause overflow, which also sets the overflow/invalid bit.
> bcdsub. doesn't have the issue when subtracting two identical BCD numbers.
> 
> gcc/
>   * config/rs6000/altivec.md (*bcdinvalid_): Replace bcdadd
>   with bcdsub.
>   (bcdinvalid_): Likewise.
> 
> gcc/testsuite/
>   * gcc.target/powerpc/bcd-4.c: Adjust the number of bcdadd and
>   bcdsub.
> 
> patch.diff
> diff --git a/gcc/config/rs6000/altivec.md b/gcc/config/rs6000/altivec.md
> index 4d4c94ff0a0..bb20441c096 100644
> --- a/gcc/config/rs6000/altivec.md
> +++ b/gcc/config/rs6000/altivec.md
> @@ -4586,18 +4586,18 @@ (define_insn "*bcdinvalid_"
>[(set (reg:CCFP CR6_REGNO)
>   (compare:CCFP
>(unspec:V2DF [(match_operand:VBCD 1 "register_operand" "v")]
> -   UNSPEC_BCDADD)
> +   UNSPEC_BCDSUB)
>(match_operand:V2DF 2 "zero_constant" "j")))
> (clobber (match_scratch:VBCD 0 "=v"))]
>"TARGET_P8_VECTOR"
> -  "bcdadd. %0,%1,%1,0"
> +  "bcdsub. %0,%1,%1,0"
>[(set_attr "type" "vecsimple")])
> 
>  (define_expand "bcdinvalid_"
>[(parallel [(set (reg:CCFP CR6_REGNO)
>  (compare:CCFP
>   (unspec:V2DF [(match_operand:VBCD 1 "register_operand")]
> -  UNSPEC_BCDADD)
> +  UNSPEC_BCDSUB)
>   (match_dup 2)))
> (clobber (match_scratch:VBCD 3))])
> (set (match_operand:SI 0 "register_operand")
> diff --git a/gcc/testsuite/gcc.target/powerpc/bcd-4.c b/gcc/testsuite/gcc.target/powerpc/bcd-4.c
> index 2c7041c4d32..6d2c59ef792 100644
> --- a/gcc/testsuite/gcc.target/powerpc/bcd-4.c
> +++ b/gcc/testsuite/gcc.target/powerpc/bcd-4.c
> @@ -2,8 +2,8 @@
>  /* { dg-require-effective-target int128 } */
>  /* { dg-require-effective-target p9vector_hw } */
>  /* { dg-options "-mdejagnu-cpu=power9 -O2 -save-temps" } */
> -/* { dg-final { scan-assembler-times {\mbcdadd\M} 7 } } */
> -/* { dg-final { scan-assembler-times {\mbcdsub\M} 18 } } */
> +/* { dg-final { scan-assembler-times {\mbcdadd\M} 5 } } */
> +/* { dg-final { scan-assembler-times {\mbcdsub\M} 20 } } */
>  /* { dg-final { scan-assembler-times {\mbcds\M} 2 } } */
>  /* { dg-final { scan-assembler-times {\mdenbcdq\M} 1 } } */
> 



[PATCH] testsuite, rs6000: Fix builtins-6-p9-runnable.c for BE [PR114744]

2024-04-17 Thread Kewen.Lin
Hi,

Test case builtins-6-p9-runnable.c doesn't work well on BE
due to two problems:
  - When applying vec_xl_len onto data_128 and data_u128
    with length 8, it expects to load 128[01] from
    memory; but unfortunately, when assigning 128[01] to
    a {vector} {u,}int128 type variable, the value isn't
    guaranteed to be at the beginning of the storage (in the
    low part of memory), which means the loaded value can
    be unexpected (as shown on BE).  So this patch
    introduces getU128, which ensures the given value
    shows up as expected, and also updates some dumping
    code for debugging.
  - When applying vec_xl_len_r with length 16, on BE it's
    just like a normal vector load, so the expected data
    should not be reversed from the original.
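A standalone sketch of the getU128 idea (hypothetical names, plain C rather than the testcase's vector types): write the 64-bit payload into the low-address half of a 16-byte object and clear the other half, so a length-limited load of the first 8 bytes sees the payload on either endianness.

```c
#include <assert.h>
#include <string.h>

/* d[0] occupies the lower 8 addresses of the 16-byte object, so a
   partial load of the first 8 bytes always reads the payload,
   regardless of how a full 128-bit assignment would lay it out.  */
typedef union
{
  unsigned long long d[2];
  unsigned char bytes[16];
} u128;

static void
get_u128 (u128 *pu, unsigned long long value)
{
  pu->d[0] = value;	/* payload at the low addresses */
  pu->d[1] = 0;		/* clear the high-address half */
}
```

A plain `(__int128) value` assignment would instead put the payload at the high-address end on a big-endian target, which is the mismatch the patch removes.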

Tested well on both powerpc64{,le}-linux-gnu, I'm going to
push this soon.

BR,
Kewen
-
PR testsuite/114744

gcc/testsuite/ChangeLog:

* gcc.target/powerpc/builtins-6-p9-runnable.c: Adjust for BE by fixing
data_{u,}128, their uses and vec_uc_expected1, also adjust some formats.
---
 .../powerpc/builtins-6-p9-runnable.c  | 119 ++
 1 file changed, 64 insertions(+), 55 deletions(-)

diff --git a/gcc/testsuite/gcc.target/powerpc/builtins-6-p9-runnable.c 
b/gcc/testsuite/gcc.target/powerpc/builtins-6-p9-runnable.c
index 20fdd3bb4ec..36101c2b861 100644
--- a/gcc/testsuite/gcc.target/powerpc/builtins-6-p9-runnable.c
+++ b/gcc/testsuite/gcc.target/powerpc/builtins-6-p9-runnable.c
@@ -337,6 +337,30 @@ void print_f (vector float vec_expected,
 }
 #endif

+typedef union
+{
+  vector __int128_t i1;
+  __int128_t i2;
+  vector __uint128_t u1;
+  __uint128_t u2;
+  struct
+  {
+long long d1;
+long long d2;
+  };
+} U128;
+
+/* For a given long long VALUE, ensure it's stored from the beginning
+   of {u,}int128 memory storage (the low address); this avoids loading
+   unexpected data when less than the whole vector length is read.  */
+
+static inline void
+getU128 (U128 *pu, unsigned long long value)
+{
+  pu->d1 = value;
+  pu->d2 = 0;
+}
+
 int main() {
   int i, j;
   size_t len;
@@ -835,21 +859,24 @@ int main() {
 #endif
 }

+  U128 u_temp;
   /* vec_xl_len() tests */
   for (i = 0; i < 100; i++)
 {
-  data_c[i] = i;
-  data_uc[i] = i+1;
-  data_ssi[i] = i+10;
-  data_usi[i] = i+11;
-  data_si[i] = i+100;
-  data_ui[i] = i+101;
-  data_sll[i] = i+1000;
-  data_ull[i] = i+1001;
-  data_f[i] = i+10.0;
-  data_d[i] = i+100.0;
-  data_128[i] = i + 1280;
-  data_u128[i] = i + 1281;
+   data_c[i] = i;
+   data_uc[i] = i + 1;
+   data_ssi[i] = i + 10;
+   data_usi[i] = i + 11;
+   data_si[i] = i + 100;
+   data_ui[i] = i + 101;
+   data_sll[i] = i + 1000;
+   data_ull[i] = i + 1001;
+   data_f[i] = i + 10.0;
+   data_d[i] = i + 100.0;
+   getU128 (&u_temp, i + 1280);
+   data_128[i] = u_temp.i2;
+   getU128 (&u_temp, i + 1281);
+   data_u128[i] = u_temp.u2;
 }

   len = 16;
@@ -1160,34 +1187,38 @@ int main() {
 #endif
 }

-  vec_s128_expected1 = (vector __int128_t){1280};
+  getU128 (&u_temp, 1280);
+  vec_s128_expected1 = u_temp.i1;
   vec_s128_result1 = vec_xl_len (data_128, len);

   if (vec_s128_expected1[0] != vec_s128_result1[0])
 {
 #ifdef DEBUG
-   printf("Error: vec_xl_len(), len = %d, vec_s128_result1[0] = %lld %llu; 
",
- len, vec_s128_result1[0] >> 64,
- vec_s128_result1[0] & (__int128_t)0x);
-   printf("vec_s128_expected1[0] = %lld %llu\n",
- vec_s128_expected1[0] >> 64,
- vec_s128_expected1[0] & (__int128_t)0x);
+  U128 u1, u2;
+  u1.i1 = vec_s128_result1;
+  u2.i1 = vec_s128_expected1;
+  printf ("Error: vec_xl_len(), len = %d,"
+ "vec_s128_result1[0] = %llx %llx; ",
+ len, u1.d1, u1.d2);
+  printf ("vec_s128_expected1[0] = %llx %llx\n", u2.d1, u2.d2);
 #else
abort ();
 #endif
 }

   vec_u128_result1 = vec_xl_len (data_u128, len);
-  vec_u128_expected1 = (vector __uint128_t){1281};
+  getU128 (&u_temp, 1281);
+  vec_u128_expected1 = u_temp.u1;
   if (vec_u128_expected1[0] != vec_u128_result1[0])
 #ifdef DEBUG
 {
-   printf("Error: vec_xl_len(), len = %d, vec_u128_result1[0] = %lld; ",
- len, vec_u128_result1[0] >> 64,
- vec_u128_result1[0] & (__int128_t)0x);
-   printf("vec_u128_expected1[0] = %lld\n",
- vec_u128_expected1[0] >> 64,
- vec_u128_expected1[0] & (__int128_t)0x);
+  U128 u1, u2;
+  u1.u1 = vec_u128_result1;
+  u2.u1 = vec_u128_expected1;
+  printf ("Error: vec_xl_len(), len = %d,"
+ "vec_u128_result1[0] = %llx %llx; ",
+ len, u1.d1, u1.d2);
+  printf ("vec_u128_expected1[0] = %llx %llx\n", u2.d1, u2.d2);
 }
 #else
 abort ();
@@ -1421,8 

Re: [PATCH V3] rs6000: Don't ICE when compiling the __builtin_vsx_splat_2di built-in [PR113950]

2024-04-17 Thread Kewen.Lin
Hi,

on 2024/4/17 17:05, jeevitha wrote:
> Hi,
> 
> On 18/03/24 7:00 am, Kewen.Lin wrote:
> 
>>> The bogus vsx_splat_ code goes all the way back to GCC 8, so we
>>> should backport this fix.  Segher and Ke Wen, can we get an approval
>>> to backport this to all the open release branches (GCC 13, 12, 11)?
>>> Thanks.
>>
>> Sure, okay for backporting this to all active branches, thanks!
>>
> 
> I need clarification regarding the backporting of PR113950 to GCC 12.
> 
> We encountered an issue while resolving merge conflicts in GCC 12. The 
> problem lies in extra deletions in the diff due to cherry-picking. Now,
> we're unsure about the best approach for handling the backport.
> 
> To provide context, I have included the relevant diff snippet below,
> 
> diff --cc gcc/config/rs6000/vsx.md
> index c45794fb9ed,f135fa079bd..000
> --- a/gcc/config/rs6000/vsx.md
> +++ b/gcc/config/rs6000/vsx.md
> @@@ -4562,8 -4666,8 +4562,8 @@@
> rtx op1 = operands[1];
> if (MEM_P (op1))
>   operands[1] = rs6000_force_indexed_or_indirect_mem (op1);
> -   else if (!REG_P (op1))
> - op1 = force_reg (mode, op1);
> +   else
>  -operands[1] = force_reg (mode, op1);
> ++operands[1] = force_reg (mode, op1);
>   })
> 
> I'm seeking your advice on how to proceed with the backport. Do you
> think the above change is acceptable, or should we also backport Segher's
> commit e0e3ce634818b83965b87512938490df4d57f81d, which caused the conflict?

I prefer the former, which is the smaller modification: for release branches
let's introduce as few changes as possible, and the amendment for the conflict
is minor and straightforward.

BR,
Kewen

> There was no regression with both of these changes.
> 
> Jeevitha.
> 





Re: [PATCH, rs6000] Fix test case bcd4.c

2024-04-17 Thread Kewen.Lin
Hi,

on 2024/4/17 13:12, HAO CHEN GUI wrote:
> Hi,
>   This patch fixes the loss of a return statement in maxbcd of bcd-4.c.
> Without the return statement, it returns an invalid bcd number and makes the
> test ineffective. The patch also enables the test to run on Power9 and on Big
> Endian, as all bcd instructions are supported from Power9 onwards.
> 
>   Bootstrapped and tested on powerpc64-linux BE and LE with no
> regressions. Is it OK for the trunk?

OK with two nits below tweaked, thanks!

> 
> Thanks
> Gui Haochen
> 
> 
> ChangeLog
> rs6000: Fix bcd test case
> 
> gcc/testsuite/
>   * gcc.target/powerpc/bcd-4.c: Enable the case to be tested on Power9.
>   Enable the case to be run on big endian.  Fix function maxbcd and
>   other misc. problems.
> 
> 
> patch.diff
> diff --git a/gcc/testsuite/gcc.target/powerpc/bcd-4.c 
> b/gcc/testsuite/gcc.target/powerpc/bcd-4.c
> index 2c8554dfe82..8c0bac2720f 100644
> --- a/gcc/testsuite/gcc.target/powerpc/bcd-4.c
> +++ b/gcc/testsuite/gcc.target/powerpc/bcd-4.c
> @@ -1,7 +1,7 @@
>  /* { dg-do run } */
>  /* { dg-require-effective-target int128 } */
> -/* { dg-require-effective-target power10_hw } */
> -/* { dg-options "-mdejagnu-cpu=power10 -O2 -save-temps" } */
> +/* { dg-require-effective-target p9vector_hw } */
> +/* { dg-options "-mdejagnu-cpu=power9 -O2 -save-temps" } */
>  /* { dg-final { scan-assembler-times {\mbcdadd\M} 7 } } */
>  /* { dg-final { scan-assembler-times {\mbcdsub\M} 18 } } */
>  /* { dg-final { scan-assembler-times {\mbcds\M} 2 } } */
> @@ -44,10 +44,20 @@ vector unsigned char maxbcd(unsigned int sign)
>vector unsigned char result;
>int i;
> 
> +#ifdef _BIG_ENDIAN
> +  for (i = 0; i < 15; i++)
> +#else
>for (i = 15; i > 0; i--)
> +#endif
>  result[i] = 0x99;
> 
> -  result[0] = sign << 4 | 0x9;
> +#ifdef _BIG_ENDIAN

Nit: I'd prefer __BIG_ENDIAN__; although both work in most cases,
RS6000_CPU_CPP_ENDIAN_BUILTINS in netbsd.h doesn't define _BIG_ENDIAN.

So could you do a global replacement with __BIG_ENDIAN__?

> +  result[15] = 0x90 | sign;
> +#else
> +  result[0] = 0x90 | sign;
> +#endif
> +
> +  return result;
>  }
> 
>  vector unsigned char num2bcd(long int a, int encoding)
> @@ -70,9 +80,17 @@ vector unsigned char num2bcd(long int a, int encoding)
> 
>hi = a % 10;   // 1st digit
>a = a / 10;
> +#ifdef _BIG_ENDIAN
> +  result[15] = hi << 4| sign;
> +#else
>result[0] = hi << 4| sign;
> +#endif
> 
> +#ifdef _BIG_ENDIAN
> +  for (i = 14; i >= 0; i--)
> +#else
>for (i = 1; i < 16; i++)
> +#endif
>  {
>low = a % 10;
>a = a / 10;
> @@ -117,7 +135,11 @@ int main ()
>  }
> 
>/* result should be positive */
> +#ifdef _BIG_ENDIAN
> +  if ((result[15] & 0xF) != BCD_POS0)
> +#else
>if ((result[0] & 0xF) != BCD_POS0)
> +#endif
>  #if DEBUG
>printf("ERROR: __builtin_bcdadd sign of result is %d.  Does not match "
>"expected_result = %d\n",
> @@ -150,7 +172,11 @@ int main ()
>  }
> 
>/* Result should be positive, alternate encoding.  */
> +#ifdef _BIG_ENDIAN
> +  if ((result[15] & 0xF) != BCD_POS1)
> +#else
>if ((result[0] & 0xF) != BCD_POS1)
> +#endif
>  #if DEBUG
>  printf("ERROR: __builtin_bcdadd sign of result is %d.  Does not "
>  "match expected_result = %d\n",
> @@ -183,7 +209,11 @@ int main ()
>  }
> 
>/* result should be negative */
> +#ifdef _BIG_ENDIAN
> +  if ((result[15] & 0xF) != BCD_NEG)
> +#else
>if ((result[0] & 0xF) != BCD_NEG)
> +#endif
>  #if DEBUG
>  printf("ERROR: __builtin_bcdadd sign, neg of result is %d.  Does not "
>  "match expected_result = %d\n",
> @@ -217,7 +247,11 @@ int main ()
>  }
> 
>/* result should be positive, alt encoding */

Nit: This comment is inconsistent with the check below, maybe:

/* result should be negative.  */

BR,
Kewen

> +#ifdef _BIG_ENDIAN
> +  if ((result[15] & 0xF) != BCD_NEG)
> +#else
>if ((result[0] & 0xF) != BCD_NEG)
> +#endif
>  #if DEBUG
>  printf("ERROR: __builtin_bcdadd sign, of result is %d.  Does not match "
>  "expected_result = %d\n",
> @@ -250,7 +284,11 @@ int main ()
>  }
> 
>/* result should be positive */
> +#ifdef _BIG_ENDIAN
> +  if ((result[15] & 0xF) != BCD_POS1)
> +#else
>if ((result[0] & 0xF) != BCD_POS1)
> +#endif
>  #if DEBUG
>  printf("ERROR: __builtin_bcdsub sign, result is %d.  Does not match "
>  "expected_result = %d\n",
> @@ -283,7 +321,7 @@ int main ()
>  abort();
>  #endif
> 
> -  a = maxbcd(BCD_NEG);
> +  a = maxbcd(BCD_POS0);
>b = maxbcd(BCD_NEG);
> 
>if (__builtin_bcdsub_ofl (a, b, 0) == 0)
> @@ -462,8 +500,12 @@ int main ()
>  }
> 
>/* result should be positive */
> +#ifdef _BIG_ENDIAN
> +  if ((result[15] & 0xF) != BCD_POS0)
> +#else
>if ((result[0] & 0xF) != BCD_POS0)
> -#if 0
> +#endif
> +#if DEBUG
>  printf("ERROR: __builtin_bcdmul10 sign, result is %d.  Does not match "
>  "expected_result = %d\n",
>  result[0] & 0xF, BCD_POS1);
> @@ -492,7 
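The digit/sign packing that bcd-4.c builds (and that the patch above re-indexes for big endian) can be sketched portably; this is a hypothetical simplification using a plain byte array in the little-endian element order, where byte 0 carries the lowest digit in its high nibble and the sign code in its low nibble.

```c
#include <assert.h>
#include <string.h>

/* Pack a non-negative number into 16 bytes of BCD: byte 0 holds the
   lowest digit (high nibble) and the sign code (low nibble); each
   following byte packs the next two digits, two per byte.  On big
   endian the same layout starts at byte 15 and walks down, which is
   exactly what the #ifdef __BIG_ENDIAN__ indexing in the patch does.  */
static void
num2bcd (unsigned long long a, unsigned sign, unsigned char result[16])
{
  memset (result, 0, 16);
  result[0] = (unsigned char) (((a % 10) << 4) | (sign & 0xF));
  a /= 10;
  for (int i = 1; i < 16; i++)
    {
      unsigned low = a % 10;
      a /= 10;
      unsigned hi = a % 10;
      a /= 10;
      result[i] = (unsigned char) ((hi << 4) | low);
    }
}
```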

Re: [PATCH] rs6000: Add OPTION_MASK_POWER8 [PR101865]

2024-04-11 Thread Kewen.Lin
Hi,

on 2024/4/12 06:15, Peter Bergner wrote:
> FYI: This patch is an update to Will Schmidt's patches to fix PR101865:
> 
>   https://gcc.gnu.org/pipermail/gcc-patches/2022-September/601825.html
>   https://gcc.gnu.org/pipermail/gcc-patches/2022-September/601823.html
> 
> ...taking into consideration patch reviews received then.  I also found
> a few more locations that needed patching, as well as simplifying the
> testsuite test cases by removing the need to scan for the predefined macros.
> 
> 
> 
> The bug in PR101865 is the _ARCH_PWR8 predefined macro is conditional upon
> TARGET_DIRECT_MOVE, which can be false for some -mcpu=power8 compiles if the
> -mno-altivec or -mno-vsx options are used.  The solution here is to create
> a new OPTION_MASK_POWER8 mask that is true for -mcpu=power8, regardless of
> Altivec or VSX enablement.
> 
> Unfortunately, the only way to create an OPTION_MASK_* mask is to create
> a new option, which we have done here, but marked it as WarnRemoved since
> we do not want users using it.  For stage1, we will look into how we can
> create ISA mask flags for use in the compiler without the need for explicit
> options.
> 
> This passed bootstrap and regtest on powerpc64le-linux.  Ok for trunk?

Thanks for fixing this.  I guess it should go well on powerpc64-linux too,
but since it's very late stage4 now, could you also test this on a BE machine?

> 
> This is also broken on the release branches, so ok for backports after
> some burn-in time on trunk?
> 
> Peter
> 
> 
> 2024-04-11  Will Schmidt  
>   Peter Bergner  
> 
> gcc/
>   PR target/101865
>   * config/rs6000/rs6000-builtin.cc (rs6000_builtin_is_supported): Use
>   TARGET_POWER8.
>   * config/rs6000/rs6000-c.cc (rs6000_target_modify_macros): Use
>   OPTION_MASK_POWER8.
>   * config/rs6000/rs6000-cpus.def (POWERPC_MASKS): Add OPTION_MASK_POWER8.
>   (ISA_2_7_MASKS_SERVER): Likewise.
>   * config/rs6000/rs6000.cc (rs6000_option_override_internal): Update
>   comment.  Use OPTION_MASK_POWER8 and TARGET_POWER8.
>   * config/rs6000/rs6000.h (TARGET_SYNC_HI_QI): Use TARGET_POWER8.
>   * config/rs6000/rs6000.md (define_attr "isa"): Add p8.
>   (define_attr "enabled"): Handle it.
>   (define_insn "prefetch"): Use TARGET_POWER8.
>   * config/rs6000/rs6000.opt (mdo-not-use-this-option): New.
> 
> gcc/testsuite/
>   PR target/101865
>   * gcc.target/powerpc/predefined-p7-novsx.c: New test.
>   * gcc.target/powerpc/predefined-p8-noaltivec-novsx.c: New test.
>   * gcc.target/powerpc/predefined-p8-noaltivec.c: New test.
>   * gcc.target/powerpc/predefined-p8-novsx.c: New test.
>   * gcc.target/powerpc/predefined-p8-pragma-vsx.c: New test.
>   * gcc.target/powerpc/predefined-p9-novsx.c: New test.
> 
> diff --git a/gcc/config/rs6000/rs6000-builtin.cc 
> b/gcc/config/rs6000/rs6000-builtin.cc
> index e7d6204074c..320affd79e3 100644
> --- a/gcc/config/rs6000/rs6000-builtin.cc
> +++ b/gcc/config/rs6000/rs6000-builtin.cc
> @@ -165,7 +165,7 @@ rs6000_builtin_is_supported (enum rs6000_gen_builtins 
> fncode)
>  case ENB_P7_64:
>return TARGET_POPCNTD && TARGET_POWERPC64;
>  case ENB_P8:
> -  return TARGET_DIRECT_MOVE;
> +  return TARGET_POWER8;
>  case ENB_P8V:
>return TARGET_P8_VECTOR;
>  case ENB_P9:
> diff --git a/gcc/config/rs6000/rs6000-c.cc b/gcc/config/rs6000/rs6000-c.cc
> index 647f20de7f2..bd493ab87c5 100644
> --- a/gcc/config/rs6000/rs6000-c.cc
> +++ b/gcc/config/rs6000/rs6000-c.cc
> @@ -429,7 +429,7 @@ rs6000_target_modify_macros (bool define_p, HOST_WIDE_INT 
> flags)
>  rs6000_define_or_undefine_macro (define_p, "_ARCH_PWR6");
>if ((flags & OPTION_MASK_POPCNTD) != 0)
>  rs6000_define_or_undefine_macro (define_p, "_ARCH_PWR7");
> -  if ((flags & OPTION_MASK_P8_VECTOR) != 0)
> +  if ((flags & OPTION_MASK_POWER8) != 0)
>  rs6000_define_or_undefine_macro (define_p, "_ARCH_PWR8");
>if ((flags & OPTION_MASK_MODULO) != 0)
>  rs6000_define_or_undefine_macro (define_p, "_ARCH_PWR9");
> diff --git a/gcc/config/rs6000/rs6000-cpus.def 
> b/gcc/config/rs6000/rs6000-cpus.def
> index 45dd5a85901..6ee678e69c3 100644
> --- a/gcc/config/rs6000/rs6000-cpus.def
> +++ b/gcc/config/rs6000/rs6000-cpus.def
> @@ -47,6 +47,7 @@
> fusion here, instead set it in rs6000.cc if we are tuning for a power8
> system.  */
>  #define ISA_2_7_MASKS_SERVER (ISA_2_6_MASKS_SERVER   \
> +  | OPTION_MASK_POWER8   \
>| OPTION_MASK_P8_VECTOR\
>| OPTION_MASK_CRYPTO   \
>| OPTION_MASK_EFFICIENT_UNALIGNED_VSX  \
> @@ -130,6 +131,7 @@
>| OPTION_MASK_MODULO   \
>| OPTION_MASK_MULHW\
>| OPTION_MASK_NO_UPDATE 

Re: [PATCH] testsuite: Adjust pr113359-2_*.c with unsigned long long [PR114662]

2024-04-10 Thread Kewen.Lin
on 2024/4/10 15:11, Richard Biener wrote:
> On Wed, Apr 10, 2024 at 8:24 AM Kewen.Lin  wrote:
>>
>> Hi,
>>
>> pr113359-2_*.c define a struct having unsigned long type
>> members ay and az which have 4 bytes size at -m32, while
>> the related constants CL1 and CL2 used for equality check
>> are always 8 bytes, it makes compiler consider the below
>>
>>   69   if (a.ay != CL1)
>>   70 __builtin_abort ();
>>
>> always to abort and optimize away the following call to
>> getb, which leads to the expected wpa dumping on
>> "Semantic equality" missing.
>>
>> This patch is to modify the types with unsigned long long
>> accordingly.  Tested well on powerpc64-linux-gnu.
>>
>> Is it ok for trunk?
> 
> OK

Thanks!  Pushed as r14-9886.

BR,
Kewen

> 
>> BR,
>> Kewen
>> -
>> PR testsuite/114662
>>
>> gcc/testsuite/ChangeLog:
>>
>> * gcc.dg/lto/pr113359-2_0.c: Use unsigned long long instead of
>> unsigned long.
>> * gcc.dg/lto/pr113359-2_1.c: Likewise.
>> ---
>>  gcc/testsuite/gcc.dg/lto/pr113359-2_0.c | 8 
>>  gcc/testsuite/gcc.dg/lto/pr113359-2_1.c | 8 
>>  2 files changed, 8 insertions(+), 8 deletions(-)
>>
>> diff --git a/gcc/testsuite/gcc.dg/lto/pr113359-2_0.c 
>> b/gcc/testsuite/gcc.dg/lto/pr113359-2_0.c
>> index 8b2d5bdfab2..8495667599d 100644
>> --- a/gcc/testsuite/gcc.dg/lto/pr113359-2_0.c
>> +++ b/gcc/testsuite/gcc.dg/lto/pr113359-2_0.c
>> @@ -8,15 +8,15 @@
>>  struct SA
>>  {
>>unsigned int ax;
>> -  unsigned long ay;
>> -  unsigned long az;
>> +  unsigned long long ay;
>> +  unsigned long long az;
>>  };
>>
>>  struct SB
>>  {
>>unsigned int bx;
>> -  unsigned long by;
>> -  unsigned long bz;
>> +  unsigned long long by;
>> +  unsigned long long bz;
>>  };
>>
>>  struct ZA
>> diff --git a/gcc/testsuite/gcc.dg/lto/pr113359-2_1.c 
>> b/gcc/testsuite/gcc.dg/lto/pr113359-2_1.c
>> index 61bc0547981..8320f347efe 100644
>> --- a/gcc/testsuite/gcc.dg/lto/pr113359-2_1.c
>> +++ b/gcc/testsuite/gcc.dg/lto/pr113359-2_1.c
>> @@ -5,15 +5,15 @@
>>  struct SA
>>  {
>>unsigned int ax;
>> -  unsigned long ay;
>> -  unsigned long az;
>> +  unsigned long long ay;
>> +  unsigned long long az;
>>  };
>>
>>  struct SB
>>  {
>>unsigned int bx;
>> -  unsigned long by;
>> -  unsigned long bz;
>> +  unsigned long long by;
>> +  unsigned long long bz;
>>  };
>>
>>  struct ZA
>> --
>> 2.43.0



[PATCH] testsuite: Adjust pr113359-2_*.c with unsigned long long [PR114662]

2024-04-10 Thread Kewen.Lin
Hi,

pr113359-2_*.c define a struct having unsigned long type
members ay and az, which have 4-byte size at -m32, while
the related constants CL1 and CL2 used for the equality check
are always 8 bytes; this makes the compiler consider the below

  69   if (a.ay != CL1)
  70 __builtin_abort ();

always to abort and optimize away the following call to
getb, which makes the expected wpa dump on
"Semantic equality" go missing.
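The fold can be reproduced with a minimal sketch (hypothetical values; `narrow_t` plays the role of unsigned long at -m32): a 4-byte member compared against an 8-byte constant with nonzero high bits can never match, so the compiler is entitled to fold the branch.

```c
#include <assert.h>
#include <stdint.h>

typedef uint32_t narrow_t;	/* unsigned long at -m32 is 4 bytes */
#define CL1 0x123456789ULL	/* needs more than 32 bits */

/* ay is promoted to 64 bits for the comparison, but its high 32 bits
   are always zero, so the result is a compile-time constant false.  */
static int
can_match (narrow_t ay)
{
  return (uint64_t) ay == CL1;
}
```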

This patch modifies the types to use unsigned long long
accordingly.  Tested well on powerpc64-linux-gnu.

Is it ok for trunk?

BR,
Kewen
-
PR testsuite/114662

gcc/testsuite/ChangeLog:

* gcc.dg/lto/pr113359-2_0.c: Use unsigned long long instead of
unsigned long.
* gcc.dg/lto/pr113359-2_1.c: Likewise.
---
 gcc/testsuite/gcc.dg/lto/pr113359-2_0.c | 8 
 gcc/testsuite/gcc.dg/lto/pr113359-2_1.c | 8 
 2 files changed, 8 insertions(+), 8 deletions(-)

diff --git a/gcc/testsuite/gcc.dg/lto/pr113359-2_0.c 
b/gcc/testsuite/gcc.dg/lto/pr113359-2_0.c
index 8b2d5bdfab2..8495667599d 100644
--- a/gcc/testsuite/gcc.dg/lto/pr113359-2_0.c
+++ b/gcc/testsuite/gcc.dg/lto/pr113359-2_0.c
@@ -8,15 +8,15 @@
 struct SA
 {
   unsigned int ax;
-  unsigned long ay;
-  unsigned long az;
+  unsigned long long ay;
+  unsigned long long az;
 };

 struct SB
 {
   unsigned int bx;
-  unsigned long by;
-  unsigned long bz;
+  unsigned long long by;
+  unsigned long long bz;
 };

 struct ZA
diff --git a/gcc/testsuite/gcc.dg/lto/pr113359-2_1.c 
b/gcc/testsuite/gcc.dg/lto/pr113359-2_1.c
index 61bc0547981..8320f347efe 100644
--- a/gcc/testsuite/gcc.dg/lto/pr113359-2_1.c
+++ b/gcc/testsuite/gcc.dg/lto/pr113359-2_1.c
@@ -5,15 +5,15 @@
 struct SA
 {
   unsigned int ax;
-  unsigned long ay;
-  unsigned long az;
+  unsigned long long ay;
+  unsigned long long az;
 };

 struct SB
 {
   unsigned int bx;
-  unsigned long by;
-  unsigned long bz;
+  unsigned long long by;
+  unsigned long long bz;
 };

 struct ZA
--
2.43.0


Re: [PATCH] rs6000: Replace OPTION_MASK_DIRECT_MOVE with OPTION_MASK_P8_VECTOR [PR101865]

2024-04-08 Thread Kewen.Lin
on 2024/4/9 11:20, Peter Bergner wrote:
> On 4/8/24 9:37 PM, Kewen.Lin wrote:
>> on 2024/4/8 21:21, Peter Bergner wrote:
>> I prefer to remove it completely, that is:
>>
>>> -mdirect-move
>>> -Target Undocumented Mask(DIRECT_MOVE) Var(rs6000_isa_flags) WarnRemoved
>>
>> The reason why you still kept it is to keep a historical record here?
> 
> I believe we've never completely removed an option before.  I think the

Checking the history, we did remove some options completely: those for SPE,
paired single, xilinx-fpu etc., which went away along with the feature removal,
but also -maltivec={le,be} and -misel={yes,no}.

> thought was, if some software package explicitly used the option, then
> they shouldn't see an 'unrecognized command-line option' error, but
> rather either a warning that the option was removed or just silently
> ignore it.  Ie, we don't want to make a package that used to build with
> an old compiler now break its build because the option doesn't exist
> anymore.

I understand, but an argument is that emitting no errors (and even no warnings)
can imply the option still takes effect, which easily causes misunderstanding.
For the release in which we remove the support of an option, we can still mark
it as WarnRemoved, but after a release or so users should be aware of the
change and have modified their build scripts if needed; at that point it's
better to emit errors for them, to avoid the false appearance that the option
is still supported.

> 
>> Segher pointed out to me that this kind of option complete removal should be
>> stage 1 stuff, so let's defer to make it in a separated patch next release
>> (including some other options like mfpgpr you showed below etc.). :)
> 
> If we're going to completely remove it, then for sure, it's a stage1 thing.
> I'd like to hear Segher's thoughts on whether we should completely remove
> it or just silently ignore it.
> 
> 
> 
>> For the original patch,
>>
>>> +mno-direct-move
>>> +Target Undocumented WarnRemoved
>>
>> s/WarnRemoved/Ignore/ to match some other existing practice, there is no
>> warning now if specifying -mno-direct-move and it would be good to keep
>> the same behavior for users.
> 
> If we want to silently ignore -mdirect-move and -mno-direct-move, then we
> just need to do:
> 
> mdirect-move
> -Target Undocumented Mask(DIRECT_MOVE) Var(rs6000_isa_flags) WarnRemoved
> +Target Undocumented Ignore
> 

Since removing it completely is a stage1 thing, I prefer to keep the
-mdirect-move and -mno-direct-move handlings as before, WarnRemoved and
Ignore respectively.

> There's no need to mention -mno-direct-move at all then.  It was only in the
> case I thought we wanted to warn against it's use that I added 
> -mno-direct-move.
> 
> 

Not mentioning it is fine too; just keep the handlings as they are and defer
the removal to stage 1. :)

BR,
Kewen



Re: [PATCH] rs6000: Replace OPTION_MASK_DIRECT_MOVE with OPTION_MASK_P8_VECTOR [PR101865]

2024-04-08 Thread Kewen.Lin
Hi Peter,

on 2024/4/8 21:21, Peter Bergner wrote:
> On 4/8/24 3:55 AM, Kewen.Lin wrote:
>> on 2024/4/6 06:28, Peter Bergner wrote:
>>> +mno-direct-move
>>> +Target Undocumented WarnRemoved
>>> +
>>>  mdirect-move
>>> -Target Undocumented Mask(DIRECT_MOVE) Var(rs6000_isa_flags) WarnRemoved
>>> +Target Undocumented WarnRemoved
>>
>> When reviewing my previous patch to "neuter option -mpower{8,9}-vector",
>> Segher mentioned that we don't need to keep such option warning all the
>> time and can drop it like in a release later as users should be aware of
>> this information then, I agreed and considering that patch disabling
>> -m[no-]direct-move was r8-7845-g57f108f5a1e1b2, I think we can just remove
>> m[no-]direct-move here?  What do you think?
> 
> 
> I'm fine with that if that is what we want.  So something like the following?
> 
> +;; This option existed in the past, but now is always silently ignored.
> mdirect-move
> -Target Undocumented Mask(DIRECT_MOVE) Var(rs6000_isa_flags) WarnRemoved
> +Target Undocumented Ignore

I prefer to remove it completely, that is:

> -mdirect-move
> -Target Undocumented Mask(DIRECT_MOVE) Var(rs6000_isa_flags) WarnRemoved

The reason why you still kept it is to keep a historical record here?

Segher pointed out to me that this kind of option complete removal should be
stage 1 stuff, so let's defer to make it in a separated patch next release
(including some other options like mfpgpr you showed below etc.). :)

For the original patch,

> +mno-direct-move
> +Target Undocumented WarnRemoved

s/WarnRemoved/Ignore/ to match some other existing practice: there is no
warning now when specifying -mno-direct-move, and it would be good to keep
the same behavior for users.
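For reference, a sketch of the two .opt spellings under discussion (my understanding of the semantics; see GCC's option-file documentation for the authoritative description): WarnRemoved accepts the option but warns that it is no longer supported, while Ignore accepts it and silently drops it.

```
mdirect-move
Target Undocumented WarnRemoved

mno-direct-move
Target Undocumented Ignore
```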

OK for trunk and active branches with this tweaked, thanks!

> 
> 
> The above seems to silently ignore both -mdirect-move and -mno-direct-move
> which I think is what we want.  That said, it's not what we've done with
> other options, but maybe those just need to be changed too?

Yes, I think they need to be changed too (next release).

BR,
Kewen



Re: [PATCH] testsuite: Add profile_update_atomic check to gcov-20.c [PR114614]

2024-04-08 Thread Kewen.Lin
on 2024/4/8 18:47, Richard Biener wrote:
> On Mon, Apr 8, 2024 at 11:23 AM Kewen.Lin  wrote:
>>
>> Hi,
>>
>> As PR114614 shows, the newly added test case gcov-20.c by
>> commit r14-9789-g08a52331803f66 failed on targets which do
>> not support atomic profile update, there would be a message
>> like:
>>
>>   warning: target does not support atomic profile update,
>>single mode is selected
>>
>> Since the test case adopts -fprofile-update=atomic, it
>> requires effective target check profile_update_atomic, this
>> patch is to add the check accordingly.
>>
>> Tested well on x86_64-redhat-linux, powerpc64-linux-gnu P8/P9
>> and powerpc64le-linux-gnu P9/P10.
>>
>> Is it ok for trunk?
> 
> OK

Thanks, pushed as r14-9851.

BR,
Kewen

> 
>> BR,
>> Kewen
>> -
>> PR testsuite/114614
>>
>> gcc/testsuite/ChangeLog:
>>
>> * gcc.misc-tests/gcov-20.c: Add effective target check
>> profile_update_atomic.
>> ---
>>  gcc/testsuite/gcc.misc-tests/gcov-20.c | 1 +
>>  1 file changed, 1 insertion(+)
>>
>> diff --git a/gcc/testsuite/gcc.misc-tests/gcov-20.c 
>> b/gcc/testsuite/gcc.misc-tests/gcov-20.c
>> index 215faffc980..ca8c12aad2b 100644
>> --- a/gcc/testsuite/gcc.misc-tests/gcov-20.c
>> +++ b/gcc/testsuite/gcc.misc-tests/gcov-20.c
>> @@ -1,5 +1,6 @@
>>  /* { dg-options "-fcondition-coverage -ftest-coverage 
>> -fprofile-update=atomic" } */
>>  /* { dg-do run { target native } } */
>> +/* { dg-require-effective-target profile_update_atomic } */
>>
>>  /* Some side effect to stop branches from being pruned */
>>  int x = 0;
>> --
>> 2.43.0



Re: [PATCH] rs6000: Fix wrong align passed to build_aligned_type [PR88309]

2024-04-08 Thread Kewen.Lin
on 2024/4/8 18:47, Richard Biener wrote:
> On Mon, Apr 8, 2024 at 11:22 AM Kewen.Lin  wrote:
>>
>> Hi,
>>
>> As the comments in PR88309 show, there are two oversights
>> in rs6000_gimple_fold_builtin that pass align in bytes to
>> build_aligned_type but which actually requires align in
>> bits, it causes unexpected ICE or hanging in function
>> is_miss_rate_acceptable due to zero align_unit value.
>>
>> This patch is to fix them by converting bytes to bits, add
>> an assertion on positive align_unit value and notes function
>> build_aligned_type requires align measured in bits in its
>> function comment.
>>
>> Bootstrapped and regtested on x86_64-redhat-linux,
>> powerpc64-linux-gnu P8/P9 and powerpc64le-linux-gnu P9 and P10.
>>
>> Is it (the generic part code change) ok for trunk?
> 
> OK

Thanks, pushed as r14-9850, is it also ok to backport after burn-in time?

BR,
Kewen

> 
>> BR,
>> Kewen
>> -
>> PR target/88309
>>
>> Co-authored-by: Andrew Pinski 
>>
>> gcc/ChangeLog:
>>
>> * config/rs6000/rs6000-builtin.cc (rs6000_gimple_fold_builtin): Fix
>> wrong align passed to function build_aligned_type.
>> * tree-ssa-loop-prefetch.cc (is_miss_rate_acceptable): Add an
>> assertion to ensure align_unit should be positive.
>> * tree.cc (build_qualified_type): Update function comments.
>>
>> gcc/testsuite/ChangeLog:
>>
>> * gcc.target/powerpc/pr88309.c: New test.
>> ---
>>  gcc/config/rs6000/rs6000-builtin.cc|  4 ++--
>>  gcc/testsuite/gcc.target/powerpc/pr88309.c | 27 ++
>>  gcc/tree-ssa-loop-prefetch.cc  |  2 ++
>>  gcc/tree.cc|  3 ++-
>>  4 files changed, 33 insertions(+), 3 deletions(-)
>>  create mode 100644 gcc/testsuite/gcc.target/powerpc/pr88309.c
>>
>> diff --git a/gcc/config/rs6000/rs6000-builtin.cc 
>> b/gcc/config/rs6000/rs6000-builtin.cc
>> index 6698274031b..e7d6204074c 100644
>> --- a/gcc/config/rs6000/rs6000-builtin.cc
>> +++ b/gcc/config/rs6000/rs6000-builtin.cc
>> @@ -1900,7 +1900,7 @@ rs6000_gimple_fold_builtin (gimple_stmt_iterator *gsi)
>> tree lhs_type = TREE_TYPE (lhs);
>> /* In GIMPLE the type of the MEM_REF specifies the alignment.  The
>>   required alignment (power) is 4 bytes regardless of data type.  */
>> -   tree align_ltype = build_aligned_type (lhs_type, 4);
>> +   tree align_ltype = build_aligned_type (lhs_type, 32);
>> /* POINTER_PLUS_EXPR wants the offset to be of type 'sizetype'.  
>> Create
>>the tree using the value from arg0.  The resulting type will match
>>the type of arg1.  */
>> @@ -1944,7 +1944,7 @@ rs6000_gimple_fold_builtin (gimple_stmt_iterator *gsi)
>> tree arg2_type = ptr_type_node;
>> /* In GIMPLE the type of the MEM_REF specifies the alignment.  The
>>required alignment (power) is 4 bytes regardless of data type.  */
>> -   tree align_stype = build_aligned_type (arg0_type, 4);
>> +   tree align_stype = build_aligned_type (arg0_type, 32);
>> /* POINTER_PLUS_EXPR wants the offset to be of type 'sizetype'.  
>> Create
>>the tree using the value from arg1.  */
>> gimple_seq stmts = NULL;
>> diff --git a/gcc/testsuite/gcc.target/powerpc/pr88309.c 
>> b/gcc/testsuite/gcc.target/powerpc/pr88309.c
>> new file mode 100644
>> index 000..c0078cf2b8c
>> --- /dev/null
>> +++ b/gcc/testsuite/gcc.target/powerpc/pr88309.c
>> @@ -0,0 +1,27 @@
>> +/* { dg-require-effective-target powerpc_vsx_ok } */
>> +/* { dg-options "-mvsx -O2 -fprefetch-loop-arrays" } */
>> +
>> +/* Verify there is no ICE or hanging.  */
>> +
>> +#include 
>> +
>> +void b(float *c, vector float a, vector float, vector float)
>> +{
>> +  vector float d;
>> +  vector char ahbc;
>> +  vec_xst(vec_perm(a, d, ahbc), 0, c);
>> +}
>> +
>> +vector float e(vector unsigned);
>> +
>> +void f() {
>> +  float *dst;
>> +  int g = 0;
>> +  for (;; g += 16) {
>> +vector unsigned m, i;
>> +vector unsigned n, j;
>> +vector unsigned k, l;
>> +b(dst + g * 3, e(m), e(n), e(k));
>> +b(dst + (g + 4) * 3, e(i), e(j), e(l));
>> +  }
>> +}
>> diff --git a/gcc/tree-ssa-loop-prefetch.cc b/gcc/tree-ssa-loop-prefetch.cc
>> index bbd98e03254..70073cc4fe4 100644
>> --- a/gcc/tree-ssa-loop-prefetch.cc
>>

[PATCH] rs6000: Fix wrong align passed to build_aligned_type [PR88309]

2024-04-08 Thread Kewen.Lin
Hi,

As the comments in PR88309 show, there are two oversights
in rs6000_gimple_fold_builtin that pass align in bytes to
build_aligned_type, which actually requires align in bits;
this causes an unexpected ICE or hang in function
is_miss_rate_acceptable due to a zero align_unit value.

This patch fixes them by converting bytes to bits, adds an
assertion that align_unit is positive, and notes in the
function comment that build_aligned_type requires align
measured in bits.
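The arithmetic behind the fix can be sketched as follows (a minimal
illustration only; BITS_PER_UNIT's value and the helper names here are
stand-ins, not GCC's internal APIs):

```c
/* Illustrative sketch, not GCC code: build_aligned_type wants its
   ALIGN argument in bits, but the two folding spots passed 4 (bytes).
   With BITS_PER_UNIT == 8, a 4-byte alignment is 32 bits; read as
   bits, "4" yields an alignment unit of 4 / 8 == 0 bytes, which is
   the zero align_unit that later breaks is_miss_rate_acceptable.  */
enum { BITS_PER_UNIT = 8 };

static unsigned
bytes_to_bits (unsigned bytes)
{
  return bytes * BITS_PER_UNIT;   /* the fix: pass 32, not 4 */
}

static unsigned
align_unit_from_bits (unsigned bits)
{
  return bits / BITS_PER_UNIT;    /* what TYPE_ALIGN_UNIT-style code derives */
}
```

With the corrected calls passing 32, the derived alignment unit is
positive again, which is exactly what the new assertion enforces.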

Bootstrapped and regtested on x86_64-redhat-linux, 
powerpc64-linux-gnu P8/P9 and powerpc64le-linux-gnu P9 and P10.

Is it (the generic part code change) ok for trunk?

BR,
Kewen
-
PR target/88309

Co-authored-by: Andrew Pinski 

gcc/ChangeLog:

* config/rs6000/rs6000-builtin.cc (rs6000_gimple_fold_builtin): Fix
wrong align passed to function build_aligned_type.
* tree-ssa-loop-prefetch.cc (is_miss_rate_acceptable): Add an
assertion to ensure align_unit should be positive.
* tree.cc (build_qualified_type): Update function comments.

gcc/testsuite/ChangeLog:

* gcc.target/powerpc/pr88309.c: New test.
---
 gcc/config/rs6000/rs6000-builtin.cc        |  4 ++--
 gcc/testsuite/gcc.target/powerpc/pr88309.c | 27 +++++++++++++++++++++++
 gcc/tree-ssa-loop-prefetch.cc              |  2 ++
 gcc/tree.cc                                |  3 ++-
 4 files changed, 33 insertions(+), 3 deletions(-)
 create mode 100644 gcc/testsuite/gcc.target/powerpc/pr88309.c

diff --git a/gcc/config/rs6000/rs6000-builtin.cc 
b/gcc/config/rs6000/rs6000-builtin.cc
index 6698274031b..e7d6204074c 100644
--- a/gcc/config/rs6000/rs6000-builtin.cc
+++ b/gcc/config/rs6000/rs6000-builtin.cc
@@ -1900,7 +1900,7 @@ rs6000_gimple_fold_builtin (gimple_stmt_iterator *gsi)
tree lhs_type = TREE_TYPE (lhs);
/* In GIMPLE the type of the MEM_REF specifies the alignment.  The
  required alignment (power) is 4 bytes regardless of data type.  */
-   tree align_ltype = build_aligned_type (lhs_type, 4);
+   tree align_ltype = build_aligned_type (lhs_type, 32);
/* POINTER_PLUS_EXPR wants the offset to be of type 'sizetype'.  Create
   the tree using the value from arg0.  The resulting type will match
   the type of arg1.  */
@@ -1944,7 +1944,7 @@ rs6000_gimple_fold_builtin (gimple_stmt_iterator *gsi)
tree arg2_type = ptr_type_node;
/* In GIMPLE the type of the MEM_REF specifies the alignment.  The
   required alignment (power) is 4 bytes regardless of data type.  */
-   tree align_stype = build_aligned_type (arg0_type, 4);
+   tree align_stype = build_aligned_type (arg0_type, 32);
/* POINTER_PLUS_EXPR wants the offset to be of type 'sizetype'.  Create
   the tree using the value from arg1.  */
gimple_seq stmts = NULL;
diff --git a/gcc/testsuite/gcc.target/powerpc/pr88309.c 
b/gcc/testsuite/gcc.target/powerpc/pr88309.c
new file mode 100644
index 000..c0078cf2b8c
--- /dev/null
+++ b/gcc/testsuite/gcc.target/powerpc/pr88309.c
@@ -0,0 +1,27 @@
+/* { dg-require-effective-target powerpc_vsx_ok } */
+/* { dg-options "-mvsx -O2 -fprefetch-loop-arrays" } */
+
+/* Verify there is no ICE or hanging.  */
+
+#include <altivec.h>
+
+void b(float *c, vector float a, vector float, vector float)
+{
+  vector float d;
+  vector char ahbc;
+  vec_xst(vec_perm(a, d, ahbc), 0, c);
+}
+
+vector float e(vector unsigned);
+
+void f() {
+  float *dst;
+  int g = 0;
+  for (;; g += 16) {
+    vector unsigned m, i;
+    vector unsigned n, j;
+    vector unsigned k, l;
+    b(dst + g * 3, e(m), e(n), e(k));
+    b(dst + (g + 4) * 3, e(i), e(j), e(l));
+  }
+}
diff --git a/gcc/tree-ssa-loop-prefetch.cc b/gcc/tree-ssa-loop-prefetch.cc
index bbd98e03254..70073cc4fe4 100644
--- a/gcc/tree-ssa-loop-prefetch.cc
+++ b/gcc/tree-ssa-loop-prefetch.cc
@@ -739,6 +739,8 @@ is_miss_rate_acceptable (unsigned HOST_WIDE_INT 
cache_line_size,
   if (delta >= (HOST_WIDE_INT) cache_line_size)
 return false;

+  gcc_assert (align_unit > 0);
+
   miss_positions = 0;
   total_positions = (cache_line_size / align_unit) * distinct_iters;
   max_allowed_miss_positions = (ACCEPTABLE_MISS_RATE * total_positions) / 1000;
diff --git a/gcc/tree.cc b/gcc/tree.cc
index f801712c9dd..6f8400e6640 100644
--- a/gcc/tree.cc
+++ b/gcc/tree.cc
@@ -5689,7 +5689,8 @@ build_qualified_type (tree type, int type_quals 
MEM_STAT_DECL)
   return t;
 }

-/* Create a variant of type T with alignment ALIGN.  */
+/* Create a variant of type T with alignment ALIGN which
+   is measured in bits.  */

 tree
 build_aligned_type (tree type, unsigned int align)
--
2.43.0


[PATCH] testsuite: Add profile_update_atomic check to gcov-20.c [PR114614]

2024-04-08 Thread Kewen.Lin
Hi,

As PR114614 shows, the newly added test case gcov-20.c from
commit r14-9789-g08a52331803f66 failed on targets which do
not support atomic profile update, with a message like:

  warning: target does not support atomic profile update,
  single mode is selected

Since the test case adopts -fprofile-update=atomic, it
requires the effective target check profile_update_atomic;
this patch adds the check accordingly.
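For context, -fprofile-update=atomic asks the gcov instrumentation to
bump coverage counters with atomic read-modify-write operations so that
concurrent threads cannot lose updates; targets lacking the needed
atomics fall back to "single" mode with the warning quoted above.  A
rough user-level analogy (an illustration only, not libgcov's actual
implementation):

```c
#include <stdatomic.h>

/* Analogy for an atomically updated profile counter: concurrent calls
   to record_hit cannot lose increments, unlike a plain ++ on a long.  */
static _Atomic unsigned long counter;

static void
record_hit (void)
{
  atomic_fetch_add_explicit (&counter, 1, memory_order_relaxed);
}

static unsigned long
hits (void)
{
  return atomic_load (&counter);
}
```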

Tested well on x86_64-redhat-linux, powerpc64-linux-gnu P8/P9
and powerpc64le-linux-gnu P9/P10.

Is it ok for trunk?

BR,
Kewen
-
PR testsuite/114614

gcc/testsuite/ChangeLog:

* gcc.misc-tests/gcov-20.c: Add effective target check
profile_update_atomic.
---
 gcc/testsuite/gcc.misc-tests/gcov-20.c | 1 +
 1 file changed, 1 insertion(+)

diff --git a/gcc/testsuite/gcc.misc-tests/gcov-20.c 
b/gcc/testsuite/gcc.misc-tests/gcov-20.c
index 215faffc980..ca8c12aad2b 100644
--- a/gcc/testsuite/gcc.misc-tests/gcov-20.c
+++ b/gcc/testsuite/gcc.misc-tests/gcov-20.c
@@ -1,5 +1,6 @@
 /* { dg-options "-fcondition-coverage -ftest-coverage -fprofile-update=atomic" 
} */
 /* { dg-do run { target native } } */
+/* { dg-require-effective-target profile_update_atomic } */

 /* Some side effect to stop branches from being pruned */
 int x = 0;
--
2.43.0


Re: [PATCH 3/3] Add -mcpu=power11 tests

2024-04-08 Thread Kewen.Lin
Hi Mike,

on 2024/3/20 12:16, Michael Meissner wrote:
> This patch adds some simple tests for -mcpu=power11 support.  In order to run
> these tests, you need an assembler that supports the appropriate option for
> supporting the Power11 processor (-mpower11 under Linux or -mpwr11 under AIX).
> 
> I have tested this patch on a little endian power10 system and a big endian
> power9 system using the latest binutils which includes support for power11.
> There were no regressions, and the 3 power11 tests added ran on both systems.
> Can I check this patch into GCC 15 when it opens up for general patches?
> 
> 2024-03-18  Michael Meissner  
> 
> gcc/testsuite/
> 
>   * gcc.target/powerpc/power11-1.c: New test.
>   * gcc.target/powerpc/power11-2.c: Likewise.
>   * gcc.target/powerpc/power11-3.c: Likewise.
>   * lib/target-supports.exp (check_effective_target_power11_ok): Add new
>   effective target.
> ---
>  gcc/testsuite/gcc.target/powerpc/power11-1.c | 13 +
>  gcc/testsuite/gcc.target/powerpc/power11-2.c | 20 
>  gcc/testsuite/gcc.target/powerpc/power11-3.c | 10 ++
>  gcc/testsuite/lib/target-supports.exp| 17 +
>  4 files changed, 60 insertions(+)
>  create mode 100644 gcc/testsuite/gcc.target/powerpc/power11-1.c
>  create mode 100644 gcc/testsuite/gcc.target/powerpc/power11-2.c
>  create mode 100644 gcc/testsuite/gcc.target/powerpc/power11-3.c
> 
> diff --git a/gcc/testsuite/gcc.target/powerpc/power11-1.c 
> b/gcc/testsuite/gcc.target/powerpc/power11-1.c
> new file mode 100644
> index 000..6a2e802eedf
> --- /dev/null
> +++ b/gcc/testsuite/gcc.target/powerpc/power11-1.c
> @@ -0,0 +1,13 @@
> +/* { dg-do compile { target powerpc*-*-* } } */
> +/* { dg-require-effective-target power11_ok } */
> +/* { dg-options "-mdejagnu-cpu=power11 -O2" } */
> +
> +/* Basic check to see if the compiler supports -mcpu=power11.  */
> +
> +#ifndef _ARCH_PWR11
> +#error "-mcpu=power11 is not supported"
> +#endif
> +
> +void foo (void)
> +{
> +}
> diff --git a/gcc/testsuite/gcc.target/powerpc/power11-2.c 
> b/gcc/testsuite/gcc.target/powerpc/power11-2.c
> new file mode 100644
> index 000..7b9904c1d29
> --- /dev/null
> +++ b/gcc/testsuite/gcc.target/powerpc/power11-2.c
> @@ -0,0 +1,20 @@
> +/* { dg-do compile { target powerpc*-*-* } } */
> +/* { dg-require-effective-target power11_ok } */
> +/* { dg-options "-O2" } */
> +
> +/* Check if we can set the power11 target via a target attribute.  */
> +
> +__attribute__((__target__("cpu=power9")))
> +void foo_p9 (void)
> +{
> +}
> +
> +__attribute__((__target__("cpu=power10")))
> +void foo_p10 (void)
> +{
> +}
> +
> +__attribute__((__target__("cpu=power11")))
> +void foo_p11 (void)
> +{
> +}
> diff --git a/gcc/testsuite/gcc.target/powerpc/power11-3.c 
> b/gcc/testsuite/gcc.target/powerpc/power11-3.c
> new file mode 100644
> index 000..9b2d643cc0f
> --- /dev/null
> +++ b/gcc/testsuite/gcc.target/powerpc/power11-3.c
> @@ -0,0 +1,10 @@
> +/* { dg-do compile { target powerpc*-*-* } }  */
> +/* { dg-require-effective-target power11_ok } */
> +/* { dg-options "-mdejagnu-cpu=power8 -O2" }  */
> +
> +/* Check if we can set the power11 target via a target_clones attribute.  */
> +
> +__attribute__((__target_clones__("cpu=power11,cpu=power9,default")))
> +void foo (void)
> +{
> +}
> diff --git a/gcc/testsuite/lib/target-supports.exp 
> b/gcc/testsuite/lib/target-supports.exp
> index 467b539b20d..be80494be80 100644
> --- a/gcc/testsuite/lib/target-supports.exp
> +++ b/gcc/testsuite/lib/target-supports.exp
> @@ -7104,6 +7104,23 @@ proc check_effective_target_power10_ok { } {
>  }
>  }
>  
> +# Return 1 if this is a PowerPC target supporting -mcpu=power11.
> +
> +proc check_effective_target_power11_ok { } {
> +if { ([istarget powerpc*-*-*]) } {
> +	return [check_no_compiler_messages power11_ok object {
> +	    int main (void) {
> +	    #ifndef _ARCH_PWR11
> +	    #error "-mcpu=power11 is not supported"
> +	    #endif
> +	    return 0;
> +	    }
> +	} "-mcpu=power11"]
> +    } else {
> +	return 0
> +    }
> +}

Sorry that I didn't catch this before, but this effective target looks
useless, since its users power11-[123].c are all compile-only tests and
their compilation doesn't rely on assembler behavior.  power11-1.c already
checks for _ARCH_PWR11; maybe we want some cases with "dg-do assemble" to
adopt this?
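A hypothetical companion test along those lines might look like the
following (a sketch only; file name and exact directives would be up to
the patch author):

```c
/* { dg-do assemble { target powerpc*-*-* } } */
/* { dg-require-effective-target power11_ok } */
/* { dg-options "-mdejagnu-cpu=power11 -O2" } */

/* Hypothetical sketch: with "dg-do assemble" the compiler output is fed
   to the assembler, so the test exercises (and power11_ok guards) the
   assembler's power11 support, unlike the compile-only power11-*.c.  */
void
foo (void)
{
}
```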

btw, the other two sub-patches in this series look good to me; as I know
this series has been on Segher's TODO list, I'll leave the approvals to him.

BR,
Kewen



Re: [PATCH] rs6000: Replace OPTION_MASK_DIRECT_MOVE with OPTION_MASK_P8_VECTOR [PR101865]

2024-04-08 Thread Kewen.Lin
Hi Peter,

on 2024/4/6 06:28, Peter Bergner wrote:
> This is a cleanup patch in preparation to fixing the real bug in PR101865.
> TARGET_DIRECT_MOVE is redundant with TARGET_P8_VECTOR, so alias it to that.
> Also replace all usages of OPTION_MASK_DIRECT_MOVE with OPTION_MASK_P8_VECTOR
> and delete the now dead mask.
> 
> This passed bootstrap and retesting on powerpc64le-linux with no regressions.
> Ok for trunk?
> 
> Eventually we'll want to backport this along with the follow-on patch that
> actually fixes PR101865.
> 
> Peter
> 
> 
> gcc/
>   PR target/101865
>   * config/rs6000/rs6000.h (TARGET_DIRECT_MOVE): Define.
>   * config/rs6000/rs6000.cc (rs6000_option_override_internal): Replace
>   OPTION_MASK_DIRECT_MOVE with OPTION_MASK_P8_VECTOR.  Delete redundant
>   OPTION_MASK_DIRECT_MOVE usage.  Delete TARGET_DIRECT_MOVE dead code.
>   (rs6000_opt_masks): Neuter the "direct-move" option.
>   * config/rs6000/rs6000-c.cc (rs6000_target_modify_macros): Replace
>   OPTION_MASK_DIRECT_MOVE with OPTION_MASK_P8_VECTOR.  Delete useless
>   comment.
>   * config/rs6000/rs6000-cpus.def (ISA_2_7_MASKS_SERVER): Delete
>   OPTION_MASK_DIRECT_MOVE.
>   (OTHER_VSX_VECTOR_MASKS): Likewise.
>   (POWERPC_MASKS): Likewise.
>   * config/rs6000/rs6000.opt (mno-direct-move): New.
>   (mdirect-move): Remove Mask and Var.
> 
> 
> diff --git a/gcc/config/rs6000/rs6000.h b/gcc/config/rs6000/rs6000.h
> index 68bc45d65ba..77d045c9f6e 100644
> --- a/gcc/config/rs6000/rs6000.h
> +++ b/gcc/config/rs6000/rs6000.h
> @@ -471,6 +471,8 @@ extern int rs6000_vector_align[];
>  #define TARGET_EXTSWSLI  (TARGET_MODULO && TARGET_POWERPC64)
>  #define TARGET_MADDLD	TARGET_MODULO
>  
> +/* TARGET_DIRECT_MOVE is redundant to TARGET_P8_VECTOR, so alias it to that. 
>  */
> +#define TARGET_DIRECT_MOVE   TARGET_P8_VECTOR
>  #define TARGET_XSCVDPSPN (TARGET_DIRECT_MOVE || TARGET_P8_VECTOR)
>  #define TARGET_XSCVSPDPN (TARGET_DIRECT_MOVE || TARGET_P8_VECTOR)
>  #define TARGET_VADDUQM   (TARGET_P8_VECTOR && TARGET_POWERPC64)
> diff --git a/gcc/config/rs6000/rs6000.cc b/gcc/config/rs6000/rs6000.cc
> index 6ba9df4f02e..c241371147c 100644
> --- a/gcc/config/rs6000/rs6000.cc
> +++ b/gcc/config/rs6000/rs6000.cc
> @@ -3811,7 +3811,7 @@ rs6000_option_override_internal (bool global_init_p)
>   Testing for direct_move matches power8 and later.  */
>if (!BYTES_BIG_ENDIAN
>&& !(processor_target_table[tune_index].target_enable
> -& OPTION_MASK_DIRECT_MOVE))
> +& OPTION_MASK_P8_VECTOR))
>  rs6000_isa_flags |= ~rs6000_isa_flags_explicit & 
> OPTION_MASK_STRICT_ALIGN;
>  
>/* Add some warnings for VSX.  */
> @@ -3853,8 +3853,7 @@ rs6000_option_override_internal (bool global_init_p)
>&& (rs6000_isa_flags_explicit & (OPTION_MASK_SOFT_FLOAT
>  | OPTION_MASK_ALTIVEC
>  | OPTION_MASK_VSX)) != 0)
> -rs6000_isa_flags &= ~((OPTION_MASK_P8_VECTOR | OPTION_MASK_CRYPTO
> -| OPTION_MASK_DIRECT_MOVE)
> +rs6000_isa_flags &= ~((OPTION_MASK_P8_VECTOR | OPTION_MASK_CRYPTO)
>& ~rs6000_isa_flags_explicit);
>  
>if (TARGET_DEBUG_REG || TARGET_DEBUG_TARGET)
> @@ -3939,13 +3938,6 @@ rs6000_option_override_internal (bool global_init_p)
>rs6000_isa_flags &= ~OPTION_MASK_FPRND;
>  }
>  
> -  if (TARGET_DIRECT_MOVE && !TARGET_VSX)
> -{
> -  if (rs6000_isa_flags_explicit & OPTION_MASK_DIRECT_MOVE)
> - error ("%qs requires %qs", "-mdirect-move", "-mvsx");
> -  rs6000_isa_flags &= ~OPTION_MASK_DIRECT_MOVE;
> -}
> -
>if (TARGET_P8_VECTOR && !TARGET_ALTIVEC)
>  rs6000_isa_flags &= ~OPTION_MASK_P8_VECTOR;
>  
> @@ -24429,7 +24421,7 @@ static struct rs6000_opt_mask const 
> rs6000_opt_masks[] =
>   false, true  },
>{ "cmpb",  OPTION_MASK_CMPB,   false, true  },
>{ "crypto",OPTION_MASK_CRYPTO, false, 
> true  },
> -  { "direct-move",   OPTION_MASK_DIRECT_MOVE,false, true  },
> +  { "direct-move",   0,  false, true  },
>{ "dlmzb", OPTION_MASK_DLMZB,  false, true  },
>{ "efficient-unaligned-vsx",   OPTION_MASK_EFFICIENT_UNALIGNED_VSX,
>   false, true  },
> diff --git a/gcc/config/rs6000/rs6000-c.cc b/gcc/config/rs6000/rs6000-c.cc
> index ce0b14a8d37..647f20de7f2 100644
> --- a/gcc/config/rs6000/rs6000-c.cc
> +++ b/gcc/config/rs6000/rs6000-c.cc
> @@ -429,19 +429,7 @@ rs6000_target_modify_macros (bool define_p, 
> HOST_WIDE_INT flags)
>  rs6000_define_or_undefine_macro (define_p, "_ARCH_PWR6");
>if ((flags & OPTION_MASK_POPCNTD) != 0)
>  rs6000_define_or_undefine_macro (define_p, "_ARCH_PWR7");
> -  /* Note 

Re: [PATCH v2] rs6000: Stackoverflow in optimized code on PPC [PR100799]

2024-04-03 Thread Kewen.Lin
on 2024/4/3 19:18, Jakub Jelinek wrote:
> On Wed, Apr 03, 2024 at 07:01:50PM +0800, Kewen.Lin wrote:
>> Thanks for the details on debugging support, but IIUC with this workaround
>> being adopted, the debuggability on hidden args are already broken, aren't?
> 
> No.
> In the correct program case, which should be the usual case, the caller will
> pass the arguments and one should be able to see the values in the debugger
> even if the function doesn't actually use those arguments.
> If the caller is buggy and doesn't pass those arguments, one should be able
> to see garbage values for those arguments and perhaps that way figure out
> that the program is buggy and fix it.

But that's not true with Ajit's current implementation, which claims the
args are passed in r11 and onwards; so whether the caller is usual (value
in the argument save area) or not (no such value), the reported values
are all broken.

> 
>> Since with a usual caller, the actual argument is passed in argument save
>> area, but the callee debug information says the location is %r11 or some
>> other stack slot.
>>
>> I think the difficulty is that: with this workaround, for some arguments we
>> are lying they are not passed in argument save area, then we have to pretend
>> they are passed in r11,r12..., but in fact these registers are not valid to
>> pass arguments, so it's unreasonable and confusing.  With your explanation,
>> I agree that stripping DECL_ARGUMENTS chains isn't a good way to eliminate
>> this confusion, maybe always using GP_ARG_MIN_REG/GP_ARG_MAX_REG for things
>> exceeding GP_ARG_MAX_REG can reduce the unreasonableness (but still confusing
>> IMHO).
> 
> If those arguments aren't passed in r11/r12, but in memory, the workaround
> shouldn't pretend they are passed somewhere where they aren't actually
> passed.

Unfortunately the current implementation doesn't conform to this; I had
mistakenly assumed you knew that.

> Instead, it should load them from the memory where they are actually
> normally passed.
> What needs to be ensured though is that those arguments are for -O0 loaded
> from those stack slots and saved to different stack slots (inside of the
> callee frame, rather than in caller's frame), for -O1+ just not loaded at
> all and pretended to just live in the caller's frame, and most importantly
> ensure that the callee doesn't try to think there is a parameter save area
> in the caller's frame which it can use for random saving related or
> unrelated data.  So, e.g. REG_EQUAL/REG_EQUIV shouldn't be used, nor tell
> that the 1st-8th arguments could be saved to the parameter save area.
> So, for the 1st-8th arguments it really should act as if there is no
> parameter save area and for the DECL_HIDDEN_STRING_LENGTH ones after it
> as it those are passed in memory, but as if that memory is owned by the
> caller, not callee, so it is not correct to modify that memory.

Now I got your points.  I like this proposal and also believe it makes more
sense for both the resulting assembly and the debuggability support, though
it sounds like the implementation has to be more complicated than what's
done currently.

Thanks for all the inputs!!

BR,
Kewen



Re: [PATCH v2] rs6000: Stackoverflow in optimized code on PPC [PR100799]

2024-04-03 Thread Kewen.Lin
Hi!

on 2024/4/3 17:23, Jakub Jelinek wrote:
> On Wed, Apr 03, 2024 at 05:02:40PM +0800, Kewen.Lin wrote:
>> on 2024/4/3 16:35, Jakub Jelinek wrote:
>>> On Wed, Apr 03, 2024 at 01:18:54PM +0800, Kewen.Lin wrote:
>>>>> I'd prefer not to remove DECL_ARGUMENTS chains, they are valid arguments 
>>>>> that just some
>>>>> invalid code doesn't pass.  By removing them you basically always create 
>>>>> an
>>>>> invalid case, this time in the other direction, valid caller passes more
>>>>> arguments than the callee (invalidly) expects.
>>>>
>>>> Thanks for the comments, do you mean it can affect the arguments 
>>>> validation when there
>>>> is explicit function declaration with interface?  Then can we strip them 
>>>> when we are
>>>> going to expand them (like checking currently_expanding_function_start)?
>>>
>>> I'd prefer not stripping them at all; they are clearly marked as perhaps not
>>> passed in buggy programs (the DECL_HIDDEN_STRING_LENGTH argument) and
>>> removing them implies the decl is a throw away, that after expansion
>>
>> Yes, IMHO it's safe as they are unused.
> 
> But they are still passed in the usual case.
> 
>>> nothing will actually look at it anymore.  I believe that is the case of
>>> function bodies, we expand them into RTL and throw away the GIMPLE, and
>>> after final clear the bodies, but it is not the case of the FUNCTION_DECL
>>> or its DECL_ARGUMENTs etc.  E.g. GIMPLE optimizations or expansion of
>>> callers could be looking at those as well.
>>
>> At expand time GIMPLE optimizations should already finish, so it should be
>> safe to strip them at that time?
> 
> No.
> The IPA/post IPA behavior is that IPA optimizations are performed and then
> cgraph finalizes one function at a time, going there from modifications
> needed from IPA passes, post IPA GIMPLE optimizations, expansion to RTL,
> RTL optimizations, emitting assembly, throwing away the body, then picking
> another function and repeating that etc.
> So, when one function makes it to expansion, if you modify its
> DECL_ARGUMENTS etc., all the post IPA GIMPLE optimization passes of other
> functions might still see such changes.

Thanks for explaining, I agree it's risky from this perspective.

> 
>>  It would surprise me if expansions of
>> callers will look at callee's information, it's more like what should be
>> done in IPA analysis instead?
> 
> Depends on what exactly it is.  E.g. C non-prototyped functions have
> just DECL_ARGUMENTS to check how many arguments the call should have vs.
> what is actually passed.

OK.

> 
>> No, it's not what I was looking for.  Peter's comments made me feel it's not
>> good to have assembly at O0 like:
>>
>> std %r3,112(%r31)
>> std %r4,120(%r31)
>> std %r5,128(%r31)
>> std %r6,136(%r31)
>> std %r7,144(%r31)
>> std %r8,152(%r31)
>> std %r9,160(%r31)
>> std %r10,168(%r31)
>> std %r11,176(%r31) // this mislead people that we pass 9th arg via 
>> r11,
>>// it would be nice not to have it.
>>
>> so I was thinking if there is some way to get rid of it.
> 
> You want to optimize at -O0?  Don't.

I don't really want optimization, but rather to get rid of the unreasonable
assembly code.  :)

> That will screw up debugging.  The function does have that argument, it
> should show up in debug info; it should show up also at -O2 in debug info
> etc.  If you remove chains from DECL_ARGUMENTS, because we have early dwarf
> these days, DW_TAG_formal_parameter nodes should have been already created,
> but it would mean that DW_AT_location for those arguments likely isn't
> filled.  Now, for -O2 it might be the case that the argument has useful
> location only at the start of the function, could have
> DW_OP_entry_value(%r11) afterwards, but at -O0 it better should have some
> stack slot into which the argument is saved and DW_AT_location should be
> that stack slot.  All that should change with the workaround is that if the
> stack slot would be normally in the argument save area in the caller's
> frame, if such argument save area can't be counted on, then it needs to be
> saved in some other stack slot, like arguments are saved to when there are
> only <= 8 arguments.

Thanks for the details on debugging support, but IIUC with this workaround
being adopted, the debuggability on hidden args are already broken, aren't?
Since with a usual caller, the actual argument is passed in argument save
area, bu

Re: [PATCH v2] rs6000: Stackoverflow in optimized code on PPC [PR100799]

2024-04-03 Thread Kewen.Lin
Hi Jakub,

on 2024/4/3 16:35, Jakub Jelinek wrote:
> On Wed, Apr 03, 2024 at 01:18:54PM +0800, Kewen.Lin wrote:
>>> I'd prefer not to remove DECL_ARGUMENTS chains, they are valid arguments 
>>> that just some
>>> invalid code doesn't pass.  By removing them you basically always create an
>>> invalid case, this time in the other direction, valid caller passes more
>>> arguments than the callee (invalidly) expects.
>>
>> Thanks for the comments, do you mean it can affect the arguments validation 
>> when there
>> is explicit function declaration with interface?  Then can we strip them 
>> when we are
>> going to expand them (like checking currently_expanding_function_start)?
> 
> I'd prefer not stripping them at all; they are clearly marked as perhaps not
> passed in buggy programs (the DECL_HIDDEN_STRING_LENGTH argument) and
> removing them implies the decl is a throw away, that after expansion

Yes, IMHO it's safe as they are unused.

> nothing will actually look at it anymore.  I believe that is the case of
> function bodies, we expand them into RTL and throw away the GIMPLE, and
> after final clear the bodies, but it is not the case of the FUNCTION_DECL
> or its DECL_ARGUMENTs etc.  E.g. GIMPLE optimizations or expansion of
> callers could be looking at those as well.

At expand time GIMPLE optimizations should already finish, so it should be
safe to strip them at that time?  It would surprise me if expansions of
callers will look at callee's information, it's more like what should be
done in IPA analysis instead?

> 
>> since from the
>> perspective of resulted assembly, with this workaround, the callee can:
>>   1) pass the hidden args in unexpected GPR like r11, ... at -O0;
>>   2) get rid of such hidden args as they are unused at -O2;
>> This proposal aims to make the assembly at -O0 not to pass with r11... (same 
>> as -O2),
>> comparing to the assembly at O2, the mismatch isn't actually changed.
> 
> The aim for the workaround was just avoid assuming there is a argument save
> area in the caller stack when it is sometimes missing.

Yeah, understood.

> If you are looking for optimizations where nothing actually passes the
> unneeded arguments and nothing expects them to be passed, then it is a task
> for IPA optimizations and should be done solely if IPA determines that all
> callers can be adjusted together with the callee; I think IPA already does
> that in that case for years, regardless if it is DECL_HIDDEN_STRING_LENGTH
> PARM_DECL or not.

No, it's not what I was looking for.  Peter's comments made me feel it's not
good to have assembly at O0 like:

std %r3,112(%r31)
std %r4,120(%r31)
std %r5,128(%r31)
std %r6,136(%r31)
std %r7,144(%r31)
std %r8,152(%r31)
std %r9,160(%r31)
std %r10,168(%r31)
std %r11,176(%r31) // this mislead people that we pass 9th arg via r11,
   // it would be nice not to have it.

so I was thinking if there is some way to get rid of it.

BR,
Kewen



Re: [PATCH v2] rs6000: Stackoverflow in optimized code on PPC [PR100799]

2024-04-02 Thread Kewen.Lin
Hi Jakub,

on 2024/4/2 16:03, Jakub Jelinek wrote:
> On Tue, Apr 02, 2024 at 02:12:04PM +0800, Kewen.Lin wrote:
>>>>>> The old code for the unused hidden parameter (which was the 9th param) 
>>>>>> would
>>>>>> fall thru to the "return NULL_RTX;" which would make the callee assume 
>>>>>> there
>>>>>> was a parameter save area allocated.  Now instead, we'll return a reg 
>>>>>> rtx,
>>>>>> probably of r11 (r3 thru r10 are our param regs) and I'm guessing we'll 
>>>>>> now
>>>>>> see a copy of r11 into a pseudo like we do for the other param regs.
>>>>>> Is that a problem? Given it's an unused parameter, it'll probably get 
>>>>>> deleted
>>>>>> as dead code, but could it cause any issues?  What if we have more than 
>>>>>> one
>>
>> I think Peter raised one good point, not sure it would really cause some 
>> issues,
>> but the assigned reg goes beyond GP_ARG_MAX_REG, at least it is confusing to 
>> people
>> especially without DCE like at -O0.  Can we aggressively remove these 
>> candidates
>> from DECL_ARGUMENTS chain?  Does it cause any assertion to fail?
> 
> I'd prefer not to remove DECL_ARGUMENTS chains, they are valid arguments that 
> just some
> invalid code doesn't pass.  By removing them you basically always create an
> invalid case, this time in the other direction, valid caller passes more
> arguments than the callee (invalidly) expects.

Thanks for the comments, do you mean it can affect the arguments validation 
when there
is explicit function declaration with interface?  Then can we strip them when 
we are
going to expand them (like checking currently_expanding_function_start)?  since 
from the
perspective of resulted assembly, with this workaround, the callee can:
  1) pass the hidden args in unexpected GPR like r11, ... at -O0;
  2) get rid of such hidden args as they are unused at -O2;
This proposal aims to make the assembly at -O0 not to pass with r11... (same as 
-O2),
comparing to the assembly at O2, the mismatch isn't actually changed.

BR,
Kewen



Re: [PATCH v2] rs6000: Stackoverflow in optimized code on PPC [PR100799]

2024-04-02 Thread Kewen.Lin
Hi!

on 2024/3/24 02:37, Ajit Agarwal wrote:
> 
> 
> On 23/03/24 9:33 pm, Peter Bergner wrote:
>> On 3/23/24 4:33 AM, Ajit Agarwal wrote:
> -  else if (align_words < GP_ARG_NUM_REG)
> +  else if (align_words < GP_ARG_NUM_REG
> +|| (cum->hidden_string_length
> +&& cum->actual_parm_length <= GP_ARG_NUM_REG))
 {
   if (TARGET_32BIT && TARGET_POWERPC64)
 return rs6000_mixed_function_arg (mode, type, align_words);

   return gen_rtx_REG (mode, GP_ARG_MIN_REG + align_words);
 }
   else
 return NULL_RTX;

 The old code for the unused hidden parameter (which was the 9th param) 
 would
 fall thru to the "return NULL_RTX;" which would make the callee assume 
 there
 was a parameter save area allocated.  Now instead, we'll return a reg rtx,
 probably of r11 (r3 thru r10 are our param regs) and I'm guessing we'll now
 see a copy of r11 into a pseudo like we do for the other param regs.
 Is that a problem? Given it's an unused parameter, it'll probably get 
 deleted
 as dead code, but could it cause any issues?  What if we have more than one

I think Peter raised one good point, not sure it would really cause some issues,
but the assigned reg goes beyond GP_ARG_MAX_REG, at least it is confusing to 
people
especially without DCE like at -O0.  Can we aggressively remove these candidates
from DECL_ARGUMENTS chain?  Does it cause any assertion to fail?

BR,
Kewen


 unused hidden parameter and we return r12 and r13 which have specific uses
 in our ABIs (eg, r13 is our TCB pointer), so it may not actually look dead.
 Have you verified what the callee RTL looks like after expand for these
 unused hidden parameters?  Is there a rtx we can return that isn't a 
 NULL_RTX
 which triggers the assumption of a parameter save area, but isn't a reg rtx
 which might lead to some rtl being generated?  Would a (const_int 0) or
 something else work?


>>> For the above use case it will return 
>>>
>>> (reg:DI 5 %r5) and below check entry_parm = 
>>> (reg:DI 5 %r5) and the following check will not return TRUE and hence
>>>parameter save area will not be allocated.
>>
>> Why r5?!?!   The 8th (integer) param would return r10, so I'd assume if
>> the next param was a hidden param, then it'd get the next gpr, so r11.
>> How does it jump back to r5 which may have been used by the 3rd param?
>>
>>
> My mistake its r11 only for hidden param.
>>
>>
>>
>>> It will not generate any rtx in the callee rtl code but it just used to
>>> check whether to allocate parameter save area or not when number of args > 
>>> 8.
>>>
>>> /* If there is no incoming register, we need a stack.  */
>>>   entry_parm = rs6000_function_arg (args_so_far, arg);
>>>   if (entry_parm == NULL)
>>> return true;
>>>
>>>   /* Likewise if we need to pass both in registers and on the stack.  */
>>>   if (GET_CODE (entry_parm) == PARALLEL
>>>   && XEXP (XVECEXP (entry_parm, 0, 0), 0) == NULL_RTX)
>>> return true;
>>
>> Yes, this code in rs6000_parm_needs_stack() uses the rs6000_function_arg()
>> return value as a boolean to tell us whether a parameter save area is 
>> required
>> so what we return is unimportant other than to know it's not NULL_RTX.
>>
>> I'm more concerned about the use of the target hook 
>> targetm.calls.function_arg
>> used in the generic parts of the compiler.  What will that code do 
>> differently
>> now that we return a reg rtx rather than NULL_RTX?  Might that code use
>> the reg rtx to emit something?  I'd feel better if you could verify what
>> happens in that code when we return a reg rtx for that 9th hidden param which
>> isn't really being passed in a register.
>>
> 
> As per my understanding and from debugging the OpenBLAS testcase, I see
> that the reg_rtx returned inside the below IF condition is used to check
> whether the parameter save area is needed or not.
> 
> In the generic code in calls.cc where targetm.calls.function_arg is
> called, the returned rtx is used for the PARALLEL case so that we can
> check whether we need to pass the argument both in registers and on the
> stack, in which case a store is emitted with respect to the returned
> rtx.  If we identify that only registers are needed for the argument,
> nothing is emitted.
> 
> Thanks & Regards
> Ajit
>>
>> Peter
>>
>>



Re: [PATCH v1] rs6000: Stackoverflow in optimized code on PPC [PR100799]

2024-04-01 Thread Kewen.Lin
Hi!

on 2024/3/22 17:36, Jakub Jelinek wrote:
> On Fri, Mar 22, 2024 at 02:55:43PM +0530, Ajit Agarwal wrote:
>> rs6000: Stackoverflow in optimized code on PPC [PR100799]
>>
>> When using FlexiBLAS with OpenBLAS we noticed corruption of
>> the parameters passed to OpenBLAS functions. FlexiBLAS
>> basically provides a BLAS interface where each function
>> is a stub that forwards the arguments to a real BLAS lib,
>> like OpenBLAS.
>>
>> Fixes the corruption of the caller frame by checking that the number
>> of arguments is less than or equal to GP_ARG_NUM_REG (8), excluding
>> hidden unused DECLs.
> 
> Looks mostly good to me except some comment nits, but I'll defer
> the actual ack to the rs6000 maintainers.
> 
>> +  /* Workaround buggy C/C++ wrappers around Fortran routines with
>> + character(len=constant) arguments if the hidden string length arguments
>> + are passed on the stack; if the callers forget to pass those arguments,
>> + attempting to tail call in such routines leads to stack corruption.
> 
> I thought it isn't just tail calls, even normal calls.
> When the buggy C/C++ wrappers call the function with fewer arguments
> than it actually has and doesn't expect the parameter save area on the
> caller side because of that while the callee expects it and the callee
> actually stores something in the parameter save area, it corrupts whatever
> is in the caller stack frame at that location.

I agree it's not just tail calls, but currently the DECL_HIDDEN_STRING_LENGTH
setting is guarded with flag_tail_call_workaround, which was intended to
be only for tail calls.  So I wonder if we should update this option name,
or introduce another option aimed more at the C/Fortran interoperability
workaround, guard DECL_HIDDEN_STRING_LENGTH with it, and have it also
enable the existing flag_tail_call_workaround.

> 
>> + Avoid return stack space for parameters <= 8 excluding hidden string
>> + length argument is passed (partially or fully) on the stack in the
>> + caller and the callee needs to pass any arguments on the stack.  */
>> +  unsigned int num_args = 0;
>> +  unsigned int hidden_length = 0;
>> +
>> +  for (tree arg = DECL_ARGUMENTS (current_function_decl);
>> +   arg; arg = DECL_CHAIN (arg))
>> +{
>> +  num_args++;
>> +  if (DECL_HIDDEN_STRING_LENGTH (arg))
>> +{
>> +  tree parmdef = ssa_default_def (cfun, arg);
>> +  if (parmdef == NULL || has_zero_uses (parmdef))
>> +{
>> +  cum->hidden_string_length = 1;
>> +  hidden_length++;
>> +}
>> +}

As Fortran allows strings with unknown length, it's possible to have test
cases with mixed used and unused hidden lengths.  Since the used ones
matter, users may already have modified their C code to pass the required
used hidden lengths, and with this change such modified code could stop
working.  For example, if the 7th and 8th hidden lengths are unused but
the 9th argument is used, the caller passes the 9th on the stack while
the callee now expects it in r9 (as the 7th argument).  So IMHO we should
be more conservative and apply this workaround only to the contiguous run
of unused hidden lengths at the end of the argument list.  One could
argue that users who already know how to modify their C code to
interoperate with Fortran would have modified all of it and don't need
this workaround, but if the restricted form still covers all the
motivating test cases, IMHO staying conservative is good, as users then
only have to update some "broken" cases rather than all of them.

BR,
Kewen


>> +   }
>> +
>> +  cum->actual_parm_length = num_args - hidden_length;
>> +
>>/* Check for a longcall attribute.  */
>>if ((!fntype && rs6000_default_long_calls)
>>|| (fntype
>> @@ -1857,7 +1884,16 @@ rs6000_function_arg (cumulative_args_t cum_v, const 
>> function_arg_info )
>>  
>>return rs6000_finish_function_arg (mode, rvec, k);
>>  }
>> -  else if (align_words < GP_ARG_NUM_REG)
>> + /* Workaround buggy C/C++ wrappers around Fortran routines with
>> +character(len=constant) arguments if the hidden string length arguments
>> +are passed on the stack; if the callers forget to pass those arguments,
>> +attempting to tail call in such routines leads to stack corruption.
>> +Avoid return stack space for parameters <= 8 excluding hidden string
>> +length argument is passed (partially or fully) on the stack in the
>> +caller and the callee needs to pass any arguments on the stack.  */
>> +  else if (align_words < GP_ARG_NUM_REG
>> +   || (cum->hidden_string_length
>> +   && cum->actual_parm_length <= GP_ARG_NUM_REG))
>>  {
>>if (TARGET_32BIT && TARGET_POWERPC64)
>>  return rs6000_mixed_function_arg (mode, type, align_words);
>> diff --git a/gcc/config/rs6000/rs6000.h b/gcc/config/rs6000/rs6000.h
>> index 68bc45d65ba..a1d3ed00b14 100644
>> --- a/gcc/config/rs6000/rs6000.h
>> +++ b/gcc/config/rs6000/rs6000.h
>> @@ -1490,6 +1490,14 @@ typedef struct 

Re: [PATCH] rs6000: Fix up setup_incoming_varargs [PR114175]

2024-03-19 Thread Kewen.Lin
Hi Jakub,

on 2024/3/19 01:21, Jakub Jelinek wrote:
> Hi!
> 
> The c23-stdarg-8.c test (as well as the new test below added to cover even
> more cases) FAIL on powerpc64le-linux and presumably other powerpc* targets
> as well.
> Like in the r14-9503-g218d174961 change on x86-64 we need to advance
> next_cum after the hidden return pointer argument even in case where
> there are no user arguments before ... in C23.
> The following patch does that.
> 
> There is another TYPE_NO_NAMED_ARGS_STDARG_P use later on:
>   if (!TYPE_NO_NAMED_ARGS_STDARG_P (TREE_TYPE (current_function_decl))
>   && targetm.calls.must_pass_in_stack (arg))
> first_reg_offset += rs6000_arg_size (TYPE_MODE (arg.type), arg.type);
> but I believe it was added there in r13-3549-g4fe34cdc unnecessarily,
> when there is no hidden return pointer argument, arg.type is NULL and
> must_pass_in_stack_var_size as well as must_pass_in_stack_var_size_or_pad
> return false in that case, and for the TYPE_NO_NAMED_ARGS_STDARG_P
> case with hidden return pointer argument that argument should have pointer
> type and it is the first argument, so must_pass_in_stack shouldn't be true
> for it either.
> 
> Bootstrapped/regtested on powerpc64le-linux, bootstrap/regtest on
> powerpc64-linux running, ok for trunk?

Okay for trunk (I expect all the testing to go well), thanks for taking
care of this!

FWIW, I also tested c23-stdarg-* test cases on aix with this patch, all
of them worked well.

BR,
Kewen

> 
> 2024-03-18  Jakub Jelinek  
> 
>   PR target/114175
>   * config/rs6000/rs6000-call.cc (setup_incoming_varargs): Only skip
>   rs6000_function_arg_advance_1 for TYPE_NO_NAMED_ARGS_STDARG_P functions
>   if arg.type is NULL.
> 
>   * gcc.dg/c23-stdarg-9.c: New test.
> 
> --- gcc/config/rs6000/rs6000-call.cc.jj   2024-01-03 12:01:19.645532834 
> +0100
> +++ gcc/config/rs6000/rs6000-call.cc  2024-03-18 11:36:02.376846802 +0100
> @@ -2253,7 +2253,8 @@ setup_incoming_varargs (cumulative_args_
>  
>/* Skip the last named argument.  */
>next_cum = *get_cumulative_args (cum);
> -  if (!TYPE_NO_NAMED_ARGS_STDARG_P (TREE_TYPE (current_function_decl)))
> +  if (!TYPE_NO_NAMED_ARGS_STDARG_P (TREE_TYPE (current_function_decl))
> +  || arg.type != NULL_TREE)
>  rs6000_function_arg_advance_1 (_cum, arg.mode, arg.type, arg.named,
>  0);
>  
> --- gcc/testsuite/gcc.dg/c23-stdarg-9.c.jj2024-03-18 11:46:17.281200214 
> +0100
> +++ gcc/testsuite/gcc.dg/c23-stdarg-9.c   2024-03-18 11:46:26.826065998 
> +0100
> @@ -0,0 +1,284 @@
> +/* Test C23 variadic functions with no named parameters, or last named
> +   parameter with a declaration not allowed in C17.  Execution tests.  */
> +/* { dg-do run } */
> +/* { dg-options "-O2 -std=c23 -pedantic-errors" } */
> +
> +#include 
> +
> +struct S { int a[1024]; };
> +
> +int
> +f1 (...)
> +{
> +  int r = 0;
> +  va_list ap;
> +  va_start (ap);
> +  r += va_arg (ap, int);
> +  va_end (ap);
> +  return r;
> +}
> +
> +int
> +f2 (...)
> +{
> +  int r = 0;
> +  va_list ap;
> +  va_start (ap);
> +  r += va_arg (ap, int);
> +  r += va_arg (ap, int);
> +  va_end (ap);
> +  return r;
> +}
> +
> +int
> +f3 (...)
> +{
> +  int r = 0;
> +  va_list ap;
> +  va_start (ap);
> +  r += va_arg (ap, int);
> +  r += va_arg (ap, int);
> +  r += va_arg (ap, int);
> +  va_end (ap);
> +  return r;
> +}
> +
> +int
> +f4 (...)
> +{
> +  int r = 0;
> +  va_list ap;
> +  va_start (ap);
> +  r += va_arg (ap, int);
> +  r += va_arg (ap, int);
> +  r += va_arg (ap, int);
> +  r += va_arg (ap, int);
> +  va_end (ap);
> +  return r;
> +}
> +
> +int
> +f5 (...)
> +{
> +  int r = 0;
> +  va_list ap;
> +  va_start (ap);
> +  r += va_arg (ap, int);
> +  r += va_arg (ap, int);
> +  r += va_arg (ap, int);
> +  r += va_arg (ap, int);
> +  r += va_arg (ap, int);
> +  va_end (ap);
> +  return r;
> +}
> +
> +int
> +f6 (...)
> +{
> +  int r = 0;
> +  va_list ap;
> +  va_start (ap);
> +  r += va_arg (ap, int);
> +  r += va_arg (ap, int);
> +  r += va_arg (ap, int);
> +  r += va_arg (ap, int);
> +  r += va_arg (ap, int);
> +  r += va_arg (ap, int);
> +  va_end (ap);
> +  return r;
> +}
> +
> +int
> +f7 (...)
> +{
> +  int r = 0;
> +  va_list ap;
> +  va_start (ap);
> +  r += va_arg (ap, int);
> +  r += va_arg (ap, int);
> +  r += va_arg (ap, int);
> +  r += va_arg (ap, int);
> +  r += va_arg (ap, int);
> +  r += va_arg (ap, int);
> +  r += va_arg (ap, int);
> +  va_end (ap);
> +  return r;
> +}
> +
> +int
> +f8 (...)
> +{
> +  int r = 0;
> +  va_list ap;
> +  va_start (ap);
> +  r += va_arg (ap, int);
> +  r += va_arg (ap, int);
> +  r += va_arg (ap, int);
> +  r += va_arg (ap, int);
> +  r += va_arg (ap, int);
> +  r += va_arg (ap, int);
> +  r += va_arg (ap, int);
> +  r += va_arg (ap, int);
> +  va_end (ap);
> +  return r;
> +}
> +
> +struct S
> +s1 (...)
> +{
> +  int r = 0;
> +  va_list ap;
> +  va_start (ap);
> +  r += va_arg (ap, int);
> +  va_end (ap);
> +  struct S s = {};
> + 

Re: [PATCH V3] rs6000: Don't ICE when compiling the __builtin_vsx_splat_2di built-in [PR113950]

2024-03-17 Thread Kewen.Lin
Hi,

on 2024/3/16 04:34, Peter Bergner wrote:
> On 3/6/24 3:27 AM, Kewen.Lin wrote:
>> on 2024/3/4 02:55, jeevitha wrote:
>>> The following patch has been bootstrapped and regtested on 
>>> powerpc64le-linux.
>>> 
>>> When we expand the __builtin_vsx_splat_2di function, we were allowing 
>>> immediate
>>> value for second operand which causes an unrecognizable insn ICE. Even 
>>> though
>>> the immediate value was forced into a register, it wasn't correctly assigned
>>> to the second operand. So corrected the assignment of op1 to operands[1].
> [snip]
>> As the discussions in the thread of the previous versions, I think
>> Segher agreed this approach, so OK for trunk with the above nits
>> tweaked, thanks!
> 
> The bogus vsx_splat_ code goes all the way back to GCC 8, so we
> should backport this fix.  Segher and Ke Wen, can we get an approval
> to backport this to all the open release branches (GCC 13, 12, 11)?
> Thanks.

Sure, okay for backporting this to all active branches, thanks!

> 
> Jeevitha, once we get approval, please perform the backports.
> 
> Peter
> 
> 

BR,
Kewen



Re: [PATCH] fix PowerPC < 7 w/ Altivec not to default to power7

2024-03-10 Thread Kewen.Lin
Hi,

on 2024/3/8 19:33, Rene Rebe wrote:
> This might not be the best timing -short before a major release-,
> however, Sam just commented on the bug I filed years ago [1], so here
> we go:
> 
> Glibc uses .machine to determine assembler optimizations to use.
> However, since reworking the rs6000 .machine output selection in
> commit e154242724b084380e3221df7c08fcdbd8460674 22 May 2019, G5 as
> well as Cell, and even power4 w/ -maltivec currently resulted in
> power7. Mask _ALTIVEC away as the .machine selection already did for
> GFX and GPOPT.

Thanks for fixing this; the fix looks reasonable to me.  OPTION_MASK_ALTIVEC
is part of POWERPC_7400_MASK, so any specified cpu type that has
POWERPC_7400_MASK by default and isn't handled early in
rs6000_machine_from_flags can suffer from this issue.

> 
> powerpc64-t2-linux-gnu-gcc  test.c -S -o - -mcpu=G5
>   .file   "test.c"
>   .machine power7
>   .abiversion 2
>   .section".text"
>   .ident  "GCC: (GNU) 10.2.0"
>   .section.note.GNU-stack,"",@progbits
> 

Nit: Could you also add one test case for this?

btw, -mdejagnu-cpu=G5 can force the cpu type in dg-options.
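A minimal compile-only testcase along those lines might look like this (hypothetical file and directives; it only checks that power7 is no longer selected, leaving the exact expected .machine value open):

```c
/* { dg-do compile { target powerpc*-*-* } } */
/* { dg-options "-mdejagnu-cpu=G5" } */

int dummy;

/* G5 (likewise Cell, or power4 w/ -maltivec) must not fall through to
   .machine power7.  */
/* { dg-final { scan-assembler-not {\.machine power7} } } */
```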

> We ship this in T2/Linux [2] since 2020 and it is tested on G5, Cell
> and Power8.
> 
> Signed-off-by: René Rebe 
> 
> [1] https://gcc.gnu.org/bugzilla/show_bug.cgi?id=97367
> [2] https://t2sde.org
> 
> --- gcc-11.1.0-RC-20210423/gcc/config/rs6000/rs6000.cc.vanilla
> 2021-04-25 22:57:16.964223106 +0200
> +++ gcc-11.1.0-RC-20210423/gcc/config/rs6000/rs6000.cc2021-04-25 
> 22:57:27.193223841 +0200
> @@ -5765,7 +5765,7 @@
>HOST_WIDE_INT flags = rs6000_isa_flags;
>  
>/* Disable the flags that should never influence the .machine selection.  
> */
> -  flags &= ~(OPTION_MASK_PPC_GFXOPT | OPTION_MASK_PPC_GPOPT | 
> OPTION_MASK_ISEL);
> +  flags &= ~(OPTION_MASK_PPC_GFXOPT | OPTION_MASK_PPC_GPOPT | 
> OPTION_MASK_ALTIVEC | OPTION_MASK_ISEL);

Nit: This line is too long and needs re-formatting.

BR,
Kewen

>  
>if ((flags & (ISA_3_1_MASKS_SERVER & ~ISA_3_0_MASKS_SERVER)) != 0)
>  return "power10";
> 



Re: [PATCH V3] rs6000: Don't ICE when compiling the __builtin_vsx_splat_2di built-in [PR113950]

2024-03-06 Thread Kewen.Lin
Hi,

on 2024/3/4 02:55, jeevitha wrote:
> Hi All,
> 
> The following patch has been bootstrapped and regtested on powerpc64le-linux.
>   
> When we expand the __builtin_vsx_splat_2di function, we were allowing an
> immediate value for the second operand, which causes an unrecognizable
> insn ICE.  Even though the immediate value was forced into a register,
> it wasn't correctly assigned back to the second operand.  So correct the
> assignment of op1 to operands[1].
> 
> 2024-02-29  Jeevitha Palanisamy  
> 
> gcc/
>   PR target/113950
>   * config/rs6000/vsx.md (vsx_splat_): Corrected assignment to
>   operand1.

Nit: s/Corrected/Correct/, maybe add "and simplify else if with else.".

> 
> gcc/testsuite/
>   PR target/113950
>   * gcc.target/powerpc/pr113950.c: New testcase.
> 
> diff --git a/gcc/config/rs6000/vsx.md b/gcc/config/rs6000/vsx.md
> index 6111cc90eb7..f135fa079bd 100644
> --- a/gcc/config/rs6000/vsx.md
> +++ b/gcc/config/rs6000/vsx.md
> @@ -4666,8 +4666,8 @@
>rtx op1 = operands[1];
>if (MEM_P (op1))
>  operands[1] = rs6000_force_indexed_or_indirect_mem (op1);
> -  else if (!REG_P (op1))
> -op1 = force_reg (mode, op1);
> +  else
> +operands[1] = force_reg (mode, op1);
>  })
>  
>  (define_insn "vsx_splat__reg"
> diff --git a/gcc/testsuite/gcc.target/powerpc/pr113950.c 
> b/gcc/testsuite/gcc.target/powerpc/pr113950.c
> new file mode 100644
> index 000..64566a580d9
> --- /dev/null
> +++ b/gcc/testsuite/gcc.target/powerpc/pr113950.c
> @@ -0,0 +1,24 @@
> +/* PR target/113950 */
> +/* { dg-require-effective-target powerpc_vsx_ok } */
> +/* { dg-options "-O1 -mvsx" } */

Nit: s/-O1/-O2/.  -O2 is preferred when the failure can be reproduced
with -O2 (not just -O1); since an optimization may move to a different
level over time, -O2 is more general.

As the discussions in the thread of the previous versions, I think
Segher agreed this approach, so OK for trunk with the above nits
tweaked, thanks!

BR,
Kewen

> +
> +/* Verify we do not ICE on the following.  */
> +
> +void abort (void);
> +
> +int main ()
> +{
> +  int i;
> +  vector signed long long vsll_result, vsll_expected_result;
> +  signed long long sll_arg1;
> +
> +  sll_arg1 = 300;
> +  vsll_expected_result = (vector signed long long) {300, 300};
> +  vsll_result = __builtin_vsx_splat_2di (sll_arg1);  
> +
> +  for (i = 0; i < 2; i++)
> +if (vsll_result[i] != vsll_expected_result[i])
> +  abort();
> +
> +  return 0;
> +}
> 
> 



Re: [PATCH, rs6000] Add subreg patterns for SImode rotate and mask insert

2024-03-06 Thread Kewen.Lin
Hi,

on 2024/3/1 10:41, HAO CHEN GUI wrote:
> Hi,
>   This patch fixes regression cases in gcc.target/powerpc/rlwimi-2.c. In
> combine pass, SImode (subreg from DImode) lshiftrt is converted to DImode
> lshiftrt with an out AND. It matches a DImode rotate and mask insert on
> rs6000.
> 
> Trying 2 -> 7:
> 2: r122:DI=r129:DI
>   REG_DEAD r129:DI
> 7: r125:SI=r122:DI#0 0>>0x1f
>   REG_DEAD r122:DI
> Failed to match this instruction:
> (set (subreg:DI (reg:SI 125 [ x ]) 0)
> (zero_extract:DI (reg:DI 129)
> (const_int 32 [0x20])
> (const_int 1 [0x1])))
> Successfully matched this instruction:
> (set (subreg:DI (reg:SI 125 [ x ]) 0)
> (and:DI (lshiftrt:DI (reg:DI 129)
> (const_int 31 [0x1f]))
> (const_int 4294967295 [0x])))
> 
> This conversion blocks the further combination which combines to a SImode
> rotate and mask insert insn.
> 
> Trying 9, 7 -> 10:
> 9: r127:SI=r130:DI#0&0xfffe
>   REG_DEAD r130:DI
> 7: r125:SI#0=r129:DI 0>>0x1f&0x
>   REG_DEAD r129:DI
>10: r124:SI=r127:SI|r125:SI
>   REG_DEAD r125:SI
>   REG_DEAD r127:SI
> Failed to match this instruction:
> (set (reg:SI 124)
> (ior:SI (and:SI (subreg:SI (reg:DI 130) 0)
> (const_int -2 [0xfffe]))
> (subreg:SI (zero_extract:DI (reg:DI 129)
> (const_int 32 [0x20])
> (const_int 1 [0x1])) 0)))
> Failed to match this instruction:
> (set (reg:SI 124)
> (ior:SI (and:SI (subreg:SI (reg:DI 130) 0)
> (const_int -2 [0xfffe]))
> (subreg:SI (and:DI (lshiftrt:DI (reg:DI 129)
> (const_int 31 [0x1f]))
> (const_int 4294967295 [0x])) 0)))
> 
>   The root cause of the issue is if it's necessary to do the widen mode for
> lshiftrt when the target already has the narrow mode lshiftrt and its cost
> is not high. My former patch tried to fix the problem but not accepted yet.
> https://gcc.gnu.org/pipermail/gcc-patches/2023-July/624852.html

I hope Segher can chime in on this proposal to update the combine pass.  I
understand that the new proposal of introducing new patterns in target code
is able to fix the issue, but IMHO it's likely there are other
mis-optimizations which haven't been noticed yet and would need a similar
pattern extension (duplicating a pattern and adjusting it with subreg) to
optimize.  From this perspective, it would be nice to have a more general fix.

Some minor comments for this patch itself are inlined.

> 
>   As it's stage 4 now, I drafted this patch to fix the regression by adding
> subreg patterns of SImode rotate and mask insert. It actually does reversed
> things and narrow the mode for lshiftrt so that it can matches the SImode
> rotate and mask insert.
> 
>   The case "rlwimi-2.c" is fixed and restores the corresponding number of
> insns to the original ones. The case "rlwinm-0.c" is also changed: 9 "rlwinm"
> are replaced with 9 "rldicl" as the combine sequence changes. It's not
> a regression as the total number of insns isn't changed.
> 
>   Bootstrapped and tested on x86 and powerpc64-linux BE and LE with no
> regressions. Is it OK for the trunk?
> 
> Thanks
> Gui Haochen
> 
> 
> ChangeLog
> rs6000: Add subreg patterns for SImode rotate and mask insert
> 
> In combine pass, SImode (subreg from DImode) lshiftrt is converted to DImode
> lshiftrt with an AND.  The new pattern matches rotate and mask insert on
> rs6000.  Thus it blocks the pattern to be further combined to a SImode rotate
> and mask insert pattern.  This patch fixes the problem by adding two subreg
> pattern for SImode rotate and mask insert patterns.
> 
> gcc/
>   PR target/93738
>   * config/rs6000/rs6000.md (*rotlsi3_insert_9): New.
>   (*rotlsi3_insert_8): New.
> 
> gcc/testsuite/
>   PR target/93738
>   * gcc.target/powerpc/rlwimi-2.c: Adjust the number of 64bit and 32bit
>   rotate instructions.
>   * gcc.target/powerpc/rlwinm-0.c: Likewise.
> 
> patch.diff
> diff --git a/gcc/config/rs6000/rs6000.md b/gcc/config/rs6000/rs6000.md
> index bc8bc6ab060..b0b40f91e3e 100644
> --- a/gcc/config/rs6000/rs6000.md
> +++ b/gcc/config/rs6000/rs6000.md
> @@ -4253,6 +4253,36 @@ (define_insn "*rotl3_insert"
>  ; difference between rlwimi and rldimi.  We also might want dot forms,
>  ; but not for rlwimi on POWER4 and similar processors.
> 
> +; Subreg pattern of insn "*rotlsi3_insert"
> +(define_insn_and_split "*rotlsi3_insert_9"

Nit: "*rotlsi3_insert_subreg" seems a better name, ...

> +  [(set (match_operand:SI 0 "gpc_reg_operand" "=r")
> + (ior:SI (and:SI
> +  (match_operator:SI 8 "lowpart_subreg_operator"
> +   [(and:DI (match_operator:DI 4 "rotate_mask_operator"
> + [(match_operand:DI 1 "gpc_reg_operand" "r")
> +  (match_operand:SI 2 "const_int_operand" "n")])
> +(match_operand:DI 3 

Re: [PATCH 08/11] rs6000, add tests and documentation for various, built-ins

2024-02-28 Thread Kewen.Lin
Hi,

on 2024/2/21 01:57, Carl Love wrote:
>  
>  GCC maintainers:
> 
> The patch adds documentation a number of built-ins.
> 
> The patch has been tested on Power 10 with no regressions.
> 
> Please let me know if this patch is acceptable for mainline.  Thanks.
> 
>   Carl 
> 
>  rs6000, add tests and documentation for various built-ins
> 
> This patch adds a test case and documentation in extend.texi for the
> following built-ins:
> 
> __builtin_altivec_fix_sfsi
> __builtin_altivec_fixuns_sfsi
> __builtin_altivec_float_sisf
> __builtin_altivec_uns_float_sisf

I think these are covered by vec_{unsigned,signed,float}; could you
double-check?

> __builtin_altivec_vrsqrtfp

Similarly to how __builtin_altivec_vrsqrtefp is covered by vec_rsqrte,
this one is already covered by vec_rsqrt, which has the vf instance
__builtin_vsx_xvrsqrtsp, so it is useless and removable.


> __builtin_altivec_mask_for_load

This one is for internal use; I don't think we want to document it in the
user manual.

> __builtin_altivec_vsel_1ti
> __builtin_altivec_vsel_1ti_uns

I think we can extend the existing vec_sel for vsq and vuq, also update
the documentation.

> __builtin_vec_init_v16qi
> __builtin_vec_init_v4sf
> __builtin_vec_init_v4si
> __builtin_vec_init_v8hi

There are more vec_init variants __builtin_vec_init_{v2df,v2di,v1ti};
any reason not to include them here? ...

> __builtin_vec_set_v16qi
> __builtin_vec_set_v4sf
> __builtin_vec_set_v4si
> __builtin_vec_set_v8hi

... and some similar variants for this one?

it seems that users can just use something like:

  vector ... = {x, y} ...

for the vector initialization and something like:

  vector ... z;
  z[0] = ...;
  z[i] = ...;

for the vector set.  Can you check whether there are any differences
between the above style and the built-ins (on both BE and LE), and what
the historical reasons for adding them were?

If we really need them, I'd like to see us just have the corresponding
overloaded functions like vec_init and vec_set instead of exposing the
instances with different suffixes.

BR,
Kewen

> 
> gcc/ChangeLog:
>   * doc/extend.texi (__builtin_altivec_fix_sfsi,
>   __builtin_altivec_fixuns_sfsi, __builtin_altivec_float_sisf,
>   __builtin_altivec_uns_float_sisf, __builtin_altivec_vrsqrtfp,
>   __builtin_altivec_mask_for_load, __builtin_altivec_vsel_1ti,
>   __builtin_altivec_vsel_1ti_uns, __builtin_vec_init_v16qi,
>   __builtin_vec_init_v4sf, __builtin_vec_init_v4si,
>   __builtin_vec_init_v8hi, __builtin_vec_set_v16qi,
>   __builtin_vec_set_v4sf, __builtin_vec_set_v4si,
>   __builtin_vec_set_v8hi): Add documentation.
> 
> gcc/testsuite/ChangeLog:
>   * gcc.target/powerpc/altivec-38.c: New test case.
> ---
>  gcc/doc/extend.texi   |  98 
>  gcc/testsuite/gcc.target/powerpc/altivec-38.c | 503 ++
>  2 files changed, 601 insertions(+)
>  create mode 100644 gcc/testsuite/gcc.target/powerpc/altivec-38.c
> 
> diff --git a/gcc/doc/extend.texi b/gcc/doc/extend.texi
> index 87fd30bfa9e..89d0a1f77b0 100644
> --- a/gcc/doc/extend.texi
> +++ b/gcc/doc/extend.texi
> @@ -22678,6 +22678,104 @@ if the VSX instruction set is available.  The 
> @samp{vec_vsx_ld} and
>  @samp{LXVW4X}, @samp{STXVD2X}, and @samp{STXVW4X} instructions.
>  
>  
> +@smallexample
> +vector signed int __builtin_altivec_fix_sfsi (vector float);
> +vector signed int __builtin_altivec_fixuns_sfsi (vector float);
> +vector float __builtin_altivec_float_sisf (vector int);
> +vector float __builtin_altivec_uns_float_sisf (vector int);
> +vector float __builtin_altivec_vrsqrtfp (vector float);
> +@end smallexample
> +
> +The @code{__builtin_altivec_fix_sfsi} converts a vector of single precision
> +floating point values to a vector of signed integers with round to zero.
> +
> +The @code{__builtin_altivec_fixuns_sfsi} converts a vector of single 
> precision
> +floating point values to a vector of unsigned integers with round to zero.  
> If
> +the rounded floating point value is less then 0 the result is 0 and VXCVI
> +is set to 1.
> +
> +The @code{__builtin_altivec_float_sisf} converts a vector of single precision
> +signed integers to a vector of floating point values using the rounding mode
> +specified by RN.
> +
> +The @code{__builtin_altivec_uns_float_sisf} converts a vector of single
> +precision unsigned integers to a vector of floating point values using the
> +rounding mode specified by RN.
> +
> +The @code{__builtin_altivec_vrsqrtfp} returns a vector of floating point
> +estimates of the reciprical square root of each floating point source vector
> +element.
> +
> +@smallexample
> +vector signed char test_altivec_mask_for_load (const void *);
> +@end smallexample
> +
> +The @code{__builtin_altivec_mask_for_load} returns a vector mask based on the
> +bottom four bits of the argument.  Let X be the 32-byte value:
> +0x00 || 0x01 || 0x02 || ... || 0x1D || 0x1E || 0x1F.
> +Bytes sh 

Re: PATCH 11/11] rs6000, make test vec-cmpne.c a runnable test

2024-02-28 Thread Kewen.Lin
Hi,

on 2024/2/21 01:58, Carl Love wrote:
>  GCC maintainers:
> 
> The patch changes vec-cmpne.c from a compile-only test to a runnable
> test.  The macros to create the functions needed to test the built-ins and
> verify the results are all there in the include file.  The .c file just
> needed to have the macro definitions inserted and the header changed from
> compile to run.  The test can now do functional verification of the results
> in addition to verifying that the expected instructions are generated.
> 
> The patch has been tested on Power 10 with no regressions.
> 
> Please let me know if this patch is acceptable for mainline.  Thanks.
> 
>   Carl 
> 
> rs6000, make test vec-cmpne.c a runnable test
> 
> The macros in vec-cmpne.h define test functions.  They also set up
> test value functions, verification functions and execute-test functions.
> The test is set up as a compile-only test, so none of the verification and
> execute functions are being used.

But there is a test gcc/testsuite/gcc.target/powerpc/vec-cmpne-runnable.c
which aims to do the runtime verification.

BR,
Kewen

> 
> The patch adds the macro calls that create the initialization,
> verify and execute functions to a main program, so the test can not only
> verify that the correct instructions are generated but also run the
> tests and verify the results.  The test is then changed from a compile
> to a run test.
> 
> gcc/testsuite/ChangeLog:
>   * gcc.target/powerpc/vec-cmpne.c (main): Add main function with
>   macro calls to define the test functions, create the verify
>   functions and execute functions.
>   Update scan-assembler-times (vcmpequ): Updated count to include
>   instructions used to generate expected test results.
>   * gcc.target/powerpc/vec-cmpne.h (vector_tests_##NAME): Remove
>   line continuation after closing bracket.  Remove extra blank line.
> ---
>  gcc/testsuite/gcc.target/powerpc/vec-cmpne.c | 41 +++-
>  gcc/testsuite/gcc.target/powerpc/vec-cmpne.h |  3 +-
>  2 files changed, 32 insertions(+), 12 deletions(-)
> 
> diff --git a/gcc/testsuite/gcc.target/powerpc/vec-cmpne.c 
> b/gcc/testsuite/gcc.target/powerpc/vec-cmpne.c
> index b57e0ac8638..2c369976a44 100644
> --- a/gcc/testsuite/gcc.target/powerpc/vec-cmpne.c
> +++ b/gcc/testsuite/gcc.target/powerpc/vec-cmpne.c
> @@ -1,20 +1,41 @@
> -/* { dg-do compile } */
> +/* { dg-do run } */
>  /* { dg-require-effective-target powerpc_altivec_ok } */
> -/* { dg-options "-maltivec -O2" } */
> +/* { dg-options "-maltivec -O2 -save-temps" } */
>  
>  /* Test that the vec_cmpne builtin generates the expected Altivec
> instructions.  */
>  
>  #include "vec-cmpne.h"
>  
> -define_test_functions (int, signed int, signed int, si);
> -define_test_functions (int, unsigned int, unsigned int, ui);
> -define_test_functions (short, signed short, signed short, ss);
> -define_test_functions (short, unsigned short, unsigned short, us);
> -define_test_functions (char, signed char, signed char, sc);
> -define_test_functions (char, unsigned char, unsigned char, uc);
> -define_test_functions (int, signed int, float, ff);
> +int main ()
> +{
> +  define_test_functions (int, signed int, signed int, si);
> +  define_test_functions (int, unsigned int, unsigned int, ui);
> +  define_test_functions (short, signed short, signed short, ss);
> +  define_test_functions (short, unsigned short, unsigned short, us);
> +  define_test_functions (char, signed char, signed char, sc);
> +  define_test_functions (char, unsigned char, unsigned char, uc);
> +  define_test_functions (int, signed int, float, ff);
> +
> +  define_init_verify_functions (int, signed int, signed int, si);
> +  define_init_verify_functions (int, unsigned int, unsigned int, ui);
> +  define_init_verify_functions (short, signed short, signed short, ss);
> +  define_init_verify_functions (short, unsigned short, unsigned short, us);
> +  define_init_verify_functions (char, signed char, signed char, sc);
> +  define_init_verify_functions (char, unsigned char, unsigned char, uc);
> +  define_init_verify_functions (int, signed int, float, ff);
> +
> +  execute_test_functions (int, signed int, signed int, si);
> +  execute_test_functions (int, unsigned int, unsigned int, ui);
> +  execute_test_functions (short, signed short, signed short, ss);
> +  execute_test_functions (short, unsigned short, unsigned short, us);
> +  execute_test_functions (char, signed char, signed char, sc);
> +  execute_test_functions (char, unsigned char, unsigned char, uc);
> +  execute_test_functions (int, signed int, float, ff);
> +
> +  return 0;
> +}
>  
>  /* { dg-final { scan-assembler-times {\mvcmpequb\M}  2 } } */
>  /* { dg-final { scan-assembler-times {\mvcmpequh\M}  2 } } */
> -/* { dg-final { scan-assembler-times {\mvcmpequw\M}  2 } } */
> +/* { dg-final { scan-assembler-times {\mvcmpequw\M}  32 } } */
> diff --git 

Re: [PATCH 09/11] rs6000, add test cases for the vec_cmpne built-ins

2024-02-28 Thread Kewen.Lin
Hi,

on 2024/2/21 01:57, Carl Love wrote:
> GCC maintainers:
> 
> The patch adds test cases for the vec_cmpne of built-ins.
> 
> The patch has been tested on Power 10 with no regressions.
> 
> Please let me know if this patch is acceptable for mainline.  Thanks.
> 
>   Carl 
> 
> rs6000, add test cases for the vec_cmpne built-ins

The mail subject and this commit subject line are saying "vec_cmpne" ...

> 
> Add test cases for the signed int, unsigned int, signed short, unsigned
> short, signed char and unsigned char built-ins.
> 
> Note, the built-ins are documented in the Power Vector Intrinsic
> Programming reference manual.
> 
> gcc/testsuite/ChangeLog:
>   * gcc.target/powerpc/vec-cmple.c: New test case.
>   * gcc.target/powerpc/vec-cmple.h: New test case include file.


... But I think you meant "vec_cmple".

> ---
>  gcc/testsuite/gcc.target/powerpc/vec-cmple.c | 35 
>  gcc/testsuite/gcc.target/powerpc/vec-cmple.h | 84 
>  2 files changed, 119 insertions(+)
>  create mode 100644 gcc/testsuite/gcc.target/powerpc/vec-cmple.c
>  create mode 100644 gcc/testsuite/gcc.target/powerpc/vec-cmple.h
> 
> diff --git a/gcc/testsuite/gcc.target/powerpc/vec-cmple.c 
> b/gcc/testsuite/gcc.target/powerpc/vec-cmple.c
> new file mode 100644
> index 000..766a1c770e2
> --- /dev/null
> +++ b/gcc/testsuite/gcc.target/powerpc/vec-cmple.c
> @@ -0,0 +1,35 @@
> +/* { dg-do run } */
> +/* { dg-require-effective-target powerpc_altivec_ok } */

Should be "vmx_hw" for run test.

> +/* { dg-options "-maltivec -O2" } */
> +
> +/* Test that the vec_cmpne builtin generates the expected Altivec
> +   instructions.  */

It seems this file was copied from vec-cmpne.c?  As we have 
vec-cmpne-runnable.c, maybe we can rename it to vec-cmple-runnable.c.

And previously since we have vec-cmpne-runnable.c and vec-cmpne.c
to use vec-cmpne.h, so a header was introduced.  If you just want to
add one runnable test case, maybe just inline vec-cmple.h since
it's not used by others at all.

BR,
Kewen

> +
> +#include "vec-cmple.h"
> +
> +int main ()
> +{
> +  /* Note macro expansions for "signed long long int" and
> + "unsigned long long int" do not work for the vec_vsx_ld builtin.  */
> +  define_test_functions (int, signed int, signed int, si);
> +  define_test_functions (int, unsigned int, unsigned int, ui);
> +  define_test_functions (short, signed short, signed short, ss);
> +  define_test_functions (short, unsigned short, unsigned short, us);
> +  define_test_functions (char, signed char, signed char, sc);
> +  define_test_functions (char, unsigned char, unsigned char, uc);
> +
> +  define_init_verify_functions (int, signed int, signed int, si);
> +  define_init_verify_functions (int, unsigned int, unsigned int, ui);
> +  define_init_verify_functions (short, signed short, signed short, ss);
> +  define_init_verify_functions (short, unsigned short, unsigned short, us);
> +  define_init_verify_functions (char, signed char, signed char, sc);
> +  define_init_verify_functions (char, unsigned char, unsigned char, uc);
> +
> +  execute_test_functions (int, signed int, signed int, si);
> +  execute_test_functions (int, unsigned int, unsigned int, ui);
> +  execute_test_functions (short, signed short, signed short, ss);
> +  execute_test_functions (short, unsigned short, unsigned short, us);
> +  execute_test_functions (char, signed char, signed char, sc);
> +  execute_test_functions (char, unsigned char, unsigned char, uc);
> +  return 0;
> +}
> diff --git a/gcc/testsuite/gcc.target/powerpc/vec-cmple.h 
> b/gcc/testsuite/gcc.target/powerpc/vec-cmple.h
> new file mode 100644
> index 000..4126706b99a
> --- /dev/null
> +++ b/gcc/testsuite/gcc.target/powerpc/vec-cmple.h
> @@ -0,0 +1,84 @@
> +#include "altivec.h"
> +
> +#define N 4096
> +
> +#include 
> +void abort ();
> +
> +#define PRAGMA(X) _Pragma (#X)
> +#define UNROLL0 PRAGMA (GCC unroll 0)
> +
> +#define define_test_functions(VBTYPE, RTYPE, STYPE, NAME)\
> +\
> +RTYPE result_le_##NAME[N] __attribute__((aligned(16))); \
> +STYPE operand1_##NAME[N] __attribute__((aligned(16))); \
> +STYPE operand2_##NAME[N] __attribute__((aligned(16))); \
> +RTYPE expected_##NAME[N] __attribute__((aligned(16))); \
> +\
> +__attribute__((noinline)) void vector_tests_##NAME () \
> +{ \
> +  vector STYPE v1_##NAME, v2_##NAME; \
> +  vector bool VBTYPE tmp_##NAME; \
> +  int i; \
> +  UNROLL0 \
> +  for (i = 0; i < N; i+=16/sizeof (STYPE))   \
> +{ \
> +  /* result_le = operand1!=operand2.  */ \
> +  v1_##NAME = vec_vsx_ld (0, (const vector STYPE*)&operand1_##NAME[i]); \
> +  v2_##NAME = vec_vsx_ld (0, (const vector STYPE*)&operand2_##NAME[i]); \
> +\
> +  tmp_##NAME = vec_cmple (v1_##NAME, v2_##NAME); \
> +  vec_vsx_st (tmp_##NAME, 0, &result_le_##NAME[i]); \
> +} \
> +}
> +
> +#define define_init_verify_functions(VBTYPE, RTYPE, STYPE, NAME) \
> +__attribute__((noinline)) void 

Re: [PATCH 07/11] rs6000, __builtin_vsx_xvcmpeq[sp, dp, sp_p] add, documentation and test case

2024-02-28 Thread Kewen.Lin
Hi Carl,

on 2024/2/21 01:57, Carl Love wrote:
> 
>  GCC maintainers:
> 
> The patch adds documentation and test case for the  __builtin_vsx_xvcmpeq[sp, 
> dp, sp_p] built-ins.
> 
> The patch has been tested on Power 10 with no regressions.
> 
> Please let me know if this patch is acceptable for mainline.  Thanks.
> 
>   Carl 
> 
> 
> rs6000, __builtin_vsx_xvcmpeq[sp, dp, sp_p] add documentation and test case
> 
> Add a test case for the __builtin_vsx_xvcmpeqsp_p built-in.
> 
> Add documentation for the __builtin_vsx_xvcmpeqsp_p,
> __builtin_vsx_xvcmpeqdp, and __builtin_vsx_xvcmpeqsp builtins.

1) For __builtin_vsx_xvcmpeqsp_p, its functionality is already covered by
__builtin_altivec_vcmpeqfp_p, which is an instance of __builtin_vec_vcmpeq_p,
so it's useless and removable.

2) For __builtin_vsx_xvcmpeqdp, it's an instance of the overloaded PVIPR
function vec_cmpeq; it's not expected to be used directly, so we don't need
to document it.

3) For __builtin_vsx_xvcmpeqsp, it duplicates the existing vec_cmpeq instance
__builtin_altivec_vcmpeqfp, so it's useless and removable.
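
For reference, the any/all semantics of the vcmpeq*_p predicate family
mentioned in point 1 can be modeled in portable scalar C; this is only an
illustrative sketch, and the helper name is made up:

```c
/* Scalar model of the vcmpeq*_p predicate semantics: with a first
   argument of 1 the result is 1 iff at least one element pair compares
   equal ("any"); with a first argument of 0 the result is 1 iff no
   element pair compares equal ("all unequal").  Hypothetical helper,
   4-element single-precision version.  */
static int
model_vcmpeqsp_p (int pick, const float a[4], const float b[4])
{
  int n_equal = 0;
  for (int i = 0; i < 4; i++)
    if (a[i] == b[i])
      n_equal++;
  return pick ? (n_equal > 0) : (n_equal == 0);
}
```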

BR,
Kewen

> 
> gcc/ChangeLog:
>   * doc/extend.texi (__builtin_vsx_xvcmpeqsp_p,
>   __builtin_vsx_xvcmpeqdp, __builtin_vsx_xvcmpeqsp): Add
>   documentation.
> 
> gcc/testsuite/ChangeLog:
>   * gcc.target/powerpc/vsx-builtin-runnable-4.c: New test case.
> ---
>  gcc/doc/extend.texi   |  23 +++
>  .../powerpc/vsx-builtin-runnable-4.c  | 135 ++
>  2 files changed, 158 insertions(+)
>  create mode 100644 gcc/testsuite/gcc.target/powerpc/vsx-builtin-runnable-4.c
> 
> diff --git a/gcc/doc/extend.texi b/gcc/doc/extend.texi
> index 22f67ebab31..87fd30bfa9e 100644
> --- a/gcc/doc/extend.texi
> +++ b/gcc/doc/extend.texi
> @@ -22700,6 +22700,18 @@ vectors of their defined type.  The corresponding 
> result element is set to
>  all ones if the two argument elements are less than or equal and all zeros
>  otherwise.
>  
> +@smallexample
> +const vf __builtin_vsx_xvcmpeqsp (vf, vf);
> +const vd __builtin_vsx_xvcmpeqdp (vd, vd);
> +@end smallexample
> +
> +The built-ins @code{__builtin_vsx_xvcmpeqsp} and
> +@code{__builtin_vsx_xvcmpeqdp} compare two floating point vectors and return
> +a vector.  If the corresponding elements are equal then the corresponding
> +vector element of the result is set to all ones, it is set to all zeros
> +otherwise.
> +
> +
>  @node PowerPC AltiVec Built-in Functions Available on ISA 2.07
>  @subsubsection PowerPC AltiVec Built-in Functions Available on ISA 2.07
>  
> @@ -23989,6 +24001,17 @@ is larger than 128 bits, the result is undefined.
>  The result is the modulo result of dividing the first input  by the second
>  input.
>  
> +@smallexample
> +const signed int __builtin_vsx_xvcmpeqdp_p (signed int, vd, vd);
> +@end smallexample
> +
> +The first argument of the built-in @code{__builtin_vsx_xvcmpeqdp_p} is an
> +integer in the range of 0 to 1.  The second and third arguments are floating
> +point vectors to be compared.  The result is 1 if the first argument is a 1
> +and one or more of the corresponding vector elements are equal.  The result 
> is
> +1 if the first argument is 0 and all of the corresponding vector elements are
> +not equal.  The result is zero otherwise.
> +
>  The following builtins perform 128-bit vector comparisons.  The
>  @code{vec_all_xx}, @code{vec_any_xx}, and @code{vec_cmpxx}, where @code{xx} 
> is
>  one of the operations @code{eq, ne, gt, lt, ge, le} perform pairwise
> diff --git a/gcc/testsuite/gcc.target/powerpc/vsx-builtin-runnable-4.c 
> b/gcc/testsuite/gcc.target/powerpc/vsx-builtin-runnable-4.c
> new file mode 100644
> index 000..8ac07c7c807
> --- /dev/null
> +++ b/gcc/testsuite/gcc.target/powerpc/vsx-builtin-runnable-4.c
> @@ -0,0 +1,135 @@
> +/* { dg-do run { target { power10_hw } } } */
> +/* { dg-do link { target { ! power10_hw } } } */
> +/* { dg-options "-mdejagnu-cpu=power10 -O2 -save-temps" } */
> +/* { dg-require-effective-target power10_ok } */
> +
> +#define DEBUG 0
> +
> +#if DEBUG
> +#include 
> +#include 
> +#endif
> +
> +void abort (void);
> +
> +int main ()
> +{
> +  int i;
> +  int result;
> +  vector float vf_arg1, vf_arg2;
> +  vector double d_arg1, d_arg2;
> +
> +  /* Compare vectors with one equal element, check
> + for all elements unequal, i.e. first arg is 1.  */
> +  vf_arg1 = (vector float) {1.0, 2.0, 3.0, 4.0};
> +  vf_arg2 = (vector float) {1.0, 3.0, 2.0, 8.0};
> +  result = __builtin_vsx_xvcmpeqsp_p (1, vf_arg1, vf_arg2);
> +
> +#if DEBUG
> +  printf("result = 0x%x\n", (unsigned int) result);
> +#endif
> +
> +  if (result != 1)
> +for (i = 0; i < 4; i++)
> +#if DEBUG
> +  printf("ERROR, __builtin_vsx_xvcmpeqsp_p 1: arg 1 = 1, varg3[%d] = %f, 
> varg3[%d] = %f\n",
> +  i, vf_arg1[i], i, vf_arg2[i]);
> +#else
> +  abort();
> +#endif
> +  /* Compare vectors with one equal element, 

Re: [PATCH 06/11] rs6000, __builtin_vsx_xxpermdi_1ti add documentation, and test case

2024-02-28 Thread Kewen.Lin
Hi Carl,

on 2024/2/21 01:57, Carl Love wrote:
> GCC maintainers:
> 
> The patch adds documentation and test case for the __builtin_vsx_xxpermdi_1ti 
> built-in.
> 
> The patch has been tested on Power 10 with no regressions.
> 
> Please let me know if this patch is acceptable for mainline.  Thanks.
> 
>   Carl 
> 
> 
> rs6000, __builtin_vsx_xxpermdi_1ti add documentation and test case
> 
> Add documentation to the extend.texi file for the
> __builtin_vsx_xxpermdi_1ti built-in.

I think this one should be part of vec_xxpermdi (overload.def): we can
extend vec_xxpermdi with one more instance with type vsq, and also update
the documentation on vec_xxpermdi for this newly introduced instance.
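
For reference, the doubleword select described in the documentation hunk
below can be sketched in portable scalar C; the two-halves struct and the
helper name are made up for illustration:

```c
#include <stdint.h>

/* Portable model of the xxpermdi doubleword select: each 128-bit input is
   viewed as two 64-bit halves, and the two low bits of `sel' pick which
   half of each source lands in the result.  */
struct u128 { uint64_t hi, lo; };  /* hi = bits 127:64, lo = bits 63:0 */

static struct u128
model_xxpermdi (struct u128 srcA, struct u128 srcB, int sel)
{
  struct u128 r;
  r.hi = (sel & 2) ? srcB.lo : srcB.hi;  /* sel[1] selects the srcB half */
  r.lo = (sel & 1) ? srcA.lo : srcA.hi;  /* sel[0] selects the srcA half */
  return r;
}
```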

BR,
Kewen

> 
> Add test cases for the __builtin_vsx_xxpermdi_1ti built-in.
> 
> gcc/ChangeLog:
>   * doc/extend.texi (__builtin_vsx_xxpermdi_1ti): Add documentation.
> 
> gcc/testsuite/ChangeLog:
>   * gcc.target/powerpc/vsx-builtin-runnable-3.c: New test case.
> ---
>  gcc/doc/extend.texi   |  7 +++
>  .../powerpc/vsx-builtin-runnable-3.c  | 48 +++
>  2 files changed, 55 insertions(+)
>  create mode 100644 gcc/testsuite/gcc.target/powerpc/vsx-builtin-runnable-3.c
> 
> diff --git a/gcc/doc/extend.texi b/gcc/doc/extend.texi
> index 83eed9e334b..22f67ebab31 100644
> --- a/gcc/doc/extend.texi
> +++ b/gcc/doc/extend.texi
> @@ -21508,6 +21508,13 @@ vector __int128  __builtin_vsx_xxpermdi_1ti (vector 
> __int128, vector __int128,
>  const int);
>  
>  @end smallexample
> +
> +For @code{__builtin_vsx_xxpermdi_1ti}, let srcA[127:0] be the 128-bit first
> +argument and srcB[127:0] be the 128-bit second argument.  Let sel[1:0] be the
> +least significant bits of the const int argument (third input argument).  The
> +result bits [127:64] is srcB[127:64] if  sel[1] = 0, srcB[63:0] otherwise.  
> The
> +result bits [63:0] is srcA[127:64] if  sel[0] = 0, srcA[63:0] otherwise.
> +
>  @node Basic PowerPC Built-in Functions Available on ISA 2.07
>  @subsubsection Basic PowerPC Built-in Functions Available on ISA 2.07
>  
> diff --git a/gcc/testsuite/gcc.target/powerpc/vsx-builtin-runnable-3.c 
> b/gcc/testsuite/gcc.target/powerpc/vsx-builtin-runnable-3.c
> new file mode 100644
> index 000..ba287597cec
> --- /dev/null
> +++ b/gcc/testsuite/gcc.target/powerpc/vsx-builtin-runnable-3.c
> @@ -0,0 +1,48 @@
> +/* { dg-do run { target { lp64 } } } */
> +/* { dg-require-effective-target powerpc_vsx_ok } */
> +/* { dg-options "-O2 -mdejagnu-cpu=power7" } */
> +
> +#include 
> +
> +#define DEBUG 0
> +
> +#if DEBUG
> +#include 
> +#include 
> +#endif
> +
> +void abort (void);
> +
> +int main ()
> +{
> +  int i;
> +
> +  vector signed __int128 vsq_arg1, vsq_arg2, vsq_result, vsq_expected_result;
> +
> +  vsq_arg1[0] = (__int128) 0x;
> +  vsq_arg1[0] = vsq_arg1[0] << 64 | (__int128) 0x;
> +  vsq_arg2[0] = (__int128) 0x1100110011001100;
> +  vsq_arg2[0] = (vsq_arg2[0]  << 64) | (__int128) 0x;
> +
> +  vsq_expected_result[0] = (__int128) 0x;
> +  vsq_expected_result[0] = (vsq_expected_result[0] << 64)
> +| (__int128) 0x;
> +
> +  vsq_result = __builtin_vsx_xxpermdi_1ti (vsq_arg1, vsq_arg2, 2);
> +
> +  if (vsq_result[0] != vsq_expected_result[0])
> +{
> +#if DEBUG
> +   printf("ERROR, __builtin_vsx_xxpermdi_1ti: vsq_result = 0x%016llx 
> %016llx\n",
> +   (unsigned long long) (vsq_result[0] >> 64),
> +   (unsigned long long) vsq_result[0]);
> +   printf(" vsq_expected_resultd = 0x%016llx 
> %016llx\n",
> +   (unsigned long long)(vsq_expected_result[0] >> 64),
> +   (unsigned long long) vsq_expected_result[0]);
> +#else
> +  abort();
> +#endif
> + }
> +
> +  return 0;
> +}


Re: [PATCH 05/11] rs6000, __builtin_vsx_xvneg[sp,dp] add documentation, and test cases

2024-02-28 Thread Kewen.Lin
Hi,

on 2024/2/21 01:56, Carl Love wrote:
> GCC maintainers:
> 
> The patch adds documentation and test cases for the __builtin_vsx_xvnegsp, 
> __builtin_vsx_xvnegdp built-ins.
> 
> The patch has been tested on Power 10 with no regressions.
> 
> Please let me know if this patch is acceptable for mainline.  Thanks.
> 
>   Carl 
> 
> rs6000, __builtin_vsx_xvneg[sp,dp] add documentation and test cases
> 
> Add documentation to the extend.texi file for the two built-ins
> __builtin_vsx_xvnegsp, __builtin_vsx_xvnegdp.

I think these two are useless: the functionality is already covered by
vec_neg in PVIPR, so instead we should get rid of these definitions (the
bif def table entries, and test cases if there are any).
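
For reference, the PVIPR vec_neg semantics that already cover these
built-ins amount to a per-element negation; a scalar sketch with a made-up
helper name:

```c
/* Scalar model of the element-wise negation provided by the PVIPR vec_neg
   built-in, which already covers __builtin_vsx_xvneg[sp,dp].  */
static void
model_vec_neg_f32 (const float *in, float *out, int n)
{
  for (int i = 0; i < n; i++)
    out[i] = -in[i];  /* negate each element */
}
```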

BR,
Kewen

> 
> Add test cases for the two built-ins.
> 
> gcc/ChangeLog:
>   * doc/extend.texi (__builtin_vsx_xvnegsp, __builtin_vsx_xvnegdp):
>   Add documentation.
> 
> gcc/testsuite/ChangeLog:
>   * gcc.target/powerpc/vsx-builtin-runnable-2.c: New test case.
> ---
>  gcc/doc/extend.texi   | 13 +
>  .../powerpc/vsx-builtin-runnable-2.c  | 51 +++
>  2 files changed, 64 insertions(+)
>  create mode 100644 gcc/testsuite/gcc.target/powerpc/vsx-builtin-runnable-2.c
> 
> diff --git a/gcc/doc/extend.texi b/gcc/doc/extend.texi
> index 583b1d890bf..83eed9e334b 100644
> --- a/gcc/doc/extend.texi
> +++ b/gcc/doc/extend.texi
> @@ -21495,6 +21495,19 @@ The @code{__builtin_vsx_xvcvuxwdp} converts single 
> precision unsigned integer
>  value to a double precision floating point value.  Input element at index 2*i
>  is stored in the destination element i.
>  
> +@smallexample
> +vector float __builtin_vsx_xvnegsp (vector float);
> +vector double __builtin_vsx_xvnegdp (vector double);
> +@end smallexample
> +
> +The  @code{__builtin_vsx_xvnegsp} and @code{__builtin_vsx_xvnegdp} negate 
> each
> +vector element.
> +
> +@smallexample
> +vector __int128  __builtin_vsx_xxpermdi_1ti (vector __int128, vector 
> __int128,
> +const int);
> +
> +@end smallexample
>  @node Basic PowerPC Built-in Functions Available on ISA 2.07
>  @subsubsection Basic PowerPC Built-in Functions Available on ISA 2.07
>  
> diff --git a/gcc/testsuite/gcc.target/powerpc/vsx-builtin-runnable-2.c 
> b/gcc/testsuite/gcc.target/powerpc/vsx-builtin-runnable-2.c
> new file mode 100644
> index 000..7906a8e01d7
> --- /dev/null
> +++ b/gcc/testsuite/gcc.target/powerpc/vsx-builtin-runnable-2.c
> @@ -0,0 +1,51 @@
> +/* { dg-do run { target { lp64 } } } */
> +/* { dg-require-effective-target powerpc_vsx_ok } */
> +/* { dg-options "-O2 -mdejagnu-cpu=power7" } */
> +
> +#define DEBUG 0
> +
> +#if DEBUG
> +#include 
> +#include 
> +#endif
> +
> +void abort (void);
> +
> +int main ()
> +{
> +  int i;
> +  vector double vd_arg1, vd_result, vd_expected_result;
> +  vector float vf_arg1, vf_result, vf_expected_result;
> +
> +  /* VSX Vector Negate Single-Precision.  */
> +
> +  vf_arg1 = (vector float) {-1.0, 12345.98, -2.1234, 238.9};
> +  vf_result = __builtin_vsx_xvnegsp (vf_arg1);
> +  vf_expected_result = (vector float) {1.0, -12345.98, 2.1234, -238.9};
> +
> +  for (i = 0; i < 4; i++)
> +if (vf_result[i] != vf_expected_result[i])
> +#if DEBUG
> +  printf("ERROR, __builtin_vsx_xvnegsp: vf_result[%d] = %f, 
> vf_expected_result[%d] = %f\n",
> +  i, vf_result[i], i, vf_expected_result[i]);
> +#else
> +  abort();
> +#endif
> +
> +  /* VSX Vector Negate Double-Precision.  */
> +
> +  vd_arg1 = (vector double) {12345.98, -2.1234};
> +  vd_result = __builtin_vsx_xvnegdp (vd_arg1);
> +  vd_expected_result = (vector double) {-12345.98, 2.1234};
> +
> +  for (i = 0; i < 2; i++)
> +if (vd_result[i] != vd_expected_result[i])
> +#if DEBUG
> +  printf("ERROR, __builtin_vsx_xvnegdp: vd_result[%d] = %f, 
> vd_expected_result[%d] = %f\n",
> +  i, vd_result[i], i, vd_expected_result[i]);
> +#else
> +  abort();
> +#endif
> +
> +  return 0;
> +}



Re: [PATCH 03/11] rs6000, remove duplicated built-ins

2024-02-28 Thread Kewen.Lin
on 2024/2/21 01:56, Carl Love wrote:
> GCC maintainers:
> 
> There are a number of undocumented built-ins that are duplicates of other 
> documented built-ins.  This patch removes the duplicates so users will only 
> use the documented built-in.
> 
> The patch has been tested on Power 10 with no regressions.

Can you also test this on at least one BE machine?  The behaviors of some
built-ins may also depend on endianness.

> 
> Please let me know if this patch is acceptable for mainline.  Thanks.
> 
>   Carl 
> 
> -
> 
> rs6000, remove duplicated built-ins
> 
> The following undocumented built-ins are same as existing documented
> overloaded builtins.
> 
>   const vf __builtin_vsx_xxmrghw (vf, vf);
> same as  vf __builtin_vec_mergeh (vf, vf);  (overloaded vec_mergeh)
> 
>   const vsi __builtin_vsx_xxmrghw_4si (vsi, vsi);
> same as vsi __builtin_vec_mergeh (vsi, vsi);   (overloaded vec_mergeh)
> 
>   const vf __builtin_vsx_xxmrglw (vf, vf);
> same as vf __builtin_vec_mergel (vf, vf);  (overloaded vec_mergel)
> 
>   const vsi __builtin_vsx_xxmrglw_4si (vsi, vsi);
> same as vsi __builtin_vec_mergel (vsi, vsi);   (overloaded vec_mergel)
> 

With these builtin definitions removed, the corresponding expanders
vsx_xxmrg{h,l}w_v4s{f,i} look useless; please have a check, and if so they
should be removed together.  Please also put this part of the changes into
a separate patch (mainly vec merge) ...
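
For reference, the overloaded vec_mergeh that covers the removed xxmrghw
variants interleaves the elements from the high halves of the two sources;
a scalar sketch (big-endian element order and the helper name are
assumptions for illustration):

```c
#include <stdint.h>

/* Scalar model of vec_mergeh on four 32-bit lanes: interleave the
   elements of the high halves of the two sources.  */
static void
model_vec_mergeh_u32 (const uint32_t a[4], const uint32_t b[4],
                      uint32_t out[4])
{
  out[0] = a[0];  out[1] = b[0];  /* first high-half pair */
  out[2] = a[1];  out[3] = b[1];  /* second high-half pair */
}
```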


>   const vsc __builtin_vsx_xxsel_16qi (vsc, vsc, vsc);
> same as vsc __builtin_vec_sel (vsc, vsc, vuc);  (overloaded vec_sel)
> 
>   const vuc __builtin_vsx_xxsel_16qi_uns (vuc, vuc, vuc);
> same as vuc __builtin_vec_sel (vuc, vuc, vuc);  (overloaded vec_sel)
> 
>   const vd __builtin_vsx_xxsel_2df (vd, vd, vd);
> same as  vd __builtin_vec_sel (vd, vd, vull);   (overloaded vec_sel)
> 
>   const vsll __builtin_vsx_xxsel_2di (vsll, vsll, vsll);
> same as vsll __builtin_vec_sel (vsll, vsll, vsll);  (overloaded vec_sel)
> 
>   const vull __builtin_vsx_xxsel_2di_uns (vull, vull, vull);
> same as vull __builtin_vec_sel (vull, vull, vsll);  (overloaded vec_sel)
> 
>   const vf __builtin_vsx_xxsel_4sf (vf, vf, vf);
> same as vf __builtin_vec_sel (vf, vf, vsi)  (overloaded vec_sel)
> 
>   const vsi __builtin_vsx_xxsel_4si (vsi, vsi, vsi);
> same as vsi __builtin_vec_sel (vsi, vsi, vbi);  (overloaded vec_sel)
> 
>   const vui __builtin_vsx_xxsel_4si_uns (vui, vui, vui);
> same as vui __builtin_vec_sel (vui, vui, vui);  (overloaded vec_sel)
> 
>   const vss __builtin_vsx_xxsel_8hi (vss, vss, vss);
> same as vss __builtin_vec_sel (vss, vss, vbs);  (overloaded vec_sel)
> 
>   const vus __builtin_vsx_xxsel_8hi_uns (vus, vus, vus);
> same as vus __builtin_vec_sel (vus, vus, vus);  (overloaded vec_sel)

... and adopt another one for this part (vec_sel).
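
For reference, the overloaded vec_sel that covers all of the xxsel variants
is a plain bitwise select; a one-lane scalar sketch with a made-up helper
name:

```c
#include <stdint.h>

/* Scalar model of the bitwise select performed by the overloaded vec_sel
   built-in: each result bit comes from b where the mask bit is set and
   from a where it is clear.  One 32-bit lane.  */
static uint32_t
model_vec_sel_u32 (uint32_t a, uint32_t b, uint32_t mask)
{
  return (a & ~mask) | (b & mask);
}
```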

> 
> This patch removed the duplicate built-in definitions so only the
> documented built-ins will be available for use.  The case statements in
> rs6000_gimple_fold_builtin that are no longer needed are also removed.
> 
> gcc/ChangeLog:
>   * config/rs6000/rs6000-builtins.def (__builtin_vsx_xxmrghw,
>   __builtin_vsx_xxmrghw_4si, __builtin_vsx_xxmrglw,
>   __builtin_vsx_xxmrglw_4si, __builtin_vsx_xxsel_16qi,
>   __builtin_vsx_xxsel_16qi_uns, __builtin_vsx_xxsel_2df,
>   __builtin_vsx_xxsel_2di, __builtin_vsx_xxsel_2di_uns,
>   __builtin_vsx_xxsel_4sf, __builtin_vsx_xxsel_4si,
>   __builtin_vsx_xxsel_4si_uns, __builtin_vsx_xxsel_8hi,
>   __builtin_vsx_xxsel_8hi_uns): Removed built-in definition.

Nit: s/Removed/Remove/

>   * config/rs6000/rs6000-builtin.cc (rs6000_gimple_fold_builtin):
>   remove case entries RS6000_BIF_XXMRGLW_4SI,
>   RS6000_BIF_XXMRGLW_4SF, RS6000_BIF_XXMRGHW_4SI,
>   RS6000_BIF_XXMRGHW_4SF.

Nit: s/remove/Remove/

> 
> gcc/testsuite/ChangeLog:
>   * gcc.target/powerpc/vsx-builtin-3.c (__builtin_vsx_xxsel_4si,
>   __builtin_vsx_xxsel_8hi, __builtin_vsx_xxsel_16qi,
>   __builtin_vsx_xxsel_4sf, __builtin_vsx_xxsel_2df): Remove test
>   cases for removed built-ins.
> ---
>  gcc/config/rs6000/rs6000-builtin.cc   |  4 --
>  gcc/config/rs6000/rs6000-builtins.def | 42 ---
>  .../gcc.target/powerpc/vsx-builtin-3.c|  6 ---
>  3 files changed, 52 deletions(-)
> 
> diff --git a/gcc/config/rs6000/rs6000-builtin.cc 
> b/gcc/config/rs6000/rs6000-builtin.cc
> index 6698274031b..e436cbe4935 100644
> --- a/gcc/config/rs6000/rs6000-builtin.cc
> +++ b/gcc/config/rs6000/rs6000-builtin.cc
> @@ -2110,20 +2110,16 @@ rs6000_gimple_fold_builtin (gimple_stmt_iterator *gsi)
>  /* vec_mergel (integrals).  */
>  case RS6000_BIF_VMRGLH:
>  case RS6000_BIF_VMRGLW:
> -case RS6000_BIF_XXMRGLW_4SI:
>  case RS6000_BIF_VMRGLB:
>  case RS6000_BIF_VEC_MERGEL_V2DI:
> -case RS6000_BIF_XXMRGLW_4SF:
>  case RS6000_BIF_VEC_MERGEL_V2DF:
>fold_mergehl_helper (gsi, stmt, 1);
>   

Re: [PATCH 04/11] rs6000, Update comment for the __builtin_vsx_vper*, built-ins.

2024-02-28 Thread Kewen.Lin
Hi,

on 2024/2/21 01:56, Carl Love wrote:
> GCC maintainers:
> 
> The patch expands an existing comment to document that the duplicates are 
> covered by an overloaded built-in.  I am wondering if we should just go ahead 
> and remove the duplicates?

As the comments below that Bill placed before indicate, I think we should
remove them, since users should use the standard interface vec_perm, which
is defined by PVIPR.

They are not documented at all; in case some users are still using such
builtins, they should switch to using vec_perm instead.  So even though it's
stage 4 now, it still looks fine to drop them, IMHO.
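
For reference, the PVIPR vec_perm that users should switch to selects each
result byte from the 32-byte concatenation of the two sources; a scalar
sketch (big-endian element numbering and the helper name are assumptions
for illustration):

```c
#include <stdint.h>

/* Scalar model of the overloaded vec_perm built-in: each byte of the
   permute control vector selects, via its low five bits, one byte from
   the 32-byte concatenation of the two sources.  */
static void
model_vec_perm (const uint8_t a[16], const uint8_t b[16],
                const uint8_t pcv[16], uint8_t out[16])
{
  for (int i = 0; i < 16; i++)
    {
      int idx = pcv[i] & 0x1F;                   /* only low 5 bits used */
      out[i] = idx < 16 ? a[idx] : b[idx - 16];  /* 0-15 from a, 16-31 from b */
    }
}
```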

Segher & Peter, what do you think of this?

BR,
Kewen

> 
> The patch has been tested on Power 10 with no regressions.
> 
> Please let me know if this patch is acceptable for mainline.  Thanks.
> 
>   Carl 
> 
> -
> rs6000, Update comment for the __builtin_vsx_vper* built-ins.
> 
> There is a comment about the __builtin_vsx_vper* built-ins being
> duplicates of the __builtin_altivec_* built-ins.  The note says we
> should consider deprecation/removal of the __builtin_vsx_vper*.  Add a
> note that the _builtin_vsx_vper* built-ins are covered by the overloaded
> vec_perm built-ins which use the __builtin_altivec_* built-in definitions.
> 
> gcc/ChangeLog:
>   * config/rs6000/rs6000-builtins.def ( __builtin_vsx_vperm_*):
>   Add comment to existing comment about the built-ins.
> ---
>  gcc/config/rs6000/rs6000-builtins.def | 8 
>  1 file changed, 8 insertions(+)
> 
> diff --git a/gcc/config/rs6000/rs6000-builtins.def 
> b/gcc/config/rs6000/rs6000-builtins.def
> index 96d095da2cb..4c95429f137 100644
> --- a/gcc/config/rs6000/rs6000-builtins.def
> +++ b/gcc/config/rs6000/rs6000-builtins.def
> @@ -1556,6 +1556,14 @@
>  ; These are duplicates of __builtin_altivec_* counterparts, and are being
>  ; kept for backwards compatibility.  The reason for their existence is
>  ; unclear.  TODO: Consider deprecation/removal at some point.
> +; Note, __builtin_vsx_vperm_16qi, __builtin_vsx_vperm_16qi_uns,
> +; __builtin_vsx_vperm_1ti, __builtin_vsx_vperm_v1ti_uns,
> +; __builtin_vsx_vperm_2df, __builtin_vsx_vperm_2di, __builtin_vsx_vperm_2di,
> +; __builtin_vsx_vperm_2di_uns, __builtin_vsx_vperm_4sf,
> +; __builtin_vsx_vperm_4si, __builtin_vsx_vperm_4si_uns,
> +; __builtin_vsx_vperm_8hi, __builtin_altivec_vperm_8hi_uns
> +; are all covered by the overloaded vec_perm built-in which uses the
> +; __builtin_altivec_* built-in definitions.
>const vsc __builtin_vsx_vperm_16qi (vsc, vsc, vuc);
>  VPERM_16QI_X altivec_vperm_v16qi {}
>  


Re: [PATCH 02/11] rs6000, fix arguments, add documentation for vector, element conversions

2024-02-28 Thread Kewen.Lin
Hi,

on 2024/2/21 01:56, Carl Love wrote:
> 
> GCC maintainers:
> 
> This patch fixes the  return type for the __builtin_vsx_xvcvdpuxws and 
> __builtin_vsx_xvcvspuxds built-ins.  They were defined as signed but should 
> have been defined as unsigned.
> 
> The patch has been tested on Power 10 with no regressions.
> 
> Please let me know if this patch is acceptable for mainline.  Thanks.
> 
>   Carl 
> 
> -
> rs6000, fix arguments, add documentation for vector element conversions
> 
> The return type for the __builtin_vsx_xvcvdpuxws, __builtin_vsx_xvcvspuxds,
> __builtin_vsx_xvcvspuxws built-ins should be unsigned.  This patch changes
> the return values from signed to unsigned.
> 
> The documentation for the vector element conversion built-ins:
> 
> __builtin_vsx_xvcvspsxws
> __builtin_vsx_xvcvspsxds
> __builtin_vsx_xvcvspuxds
> __builtin_vsx_xvcvdpsxws
> __builtin_vsx_xvcvdpuxws
> __builtin_vsx_xvcvdpuxds_uns
> __builtin_vsx_xvcvspdp
> __builtin_vsx_xvcvdpsp
> __builtin_vsx_xvcvspuxws
> __builtin_vsx_xvcvsxwdp
> __builtin_vsx_xvcvuxddp_uns
> __builtin_vsx_xvcvuxwdp
> 
> is missing from extend.texi.  This patch adds the missing documentation.

I think we should recommend users adopt the built-ins recommended in
PVIPR.  Checking the corresponding mnemonics in PVIPR, I got:

__builtin_vsx_xvcvspsxws -> vec_signed
__builtin_vsx_xvcvspsxds -> N/A
__builtin_vsx_xvcvspuxds -> N/A
__builtin_vsx_xvcvdpsxws -> vec_signed{e,o}
__builtin_vsx_xvcvdpuxws -> vec_unsigned{e,o}
__builtin_vsx_xvcvdpuxds_uns -> vec_unsigned
__builtin_vsx_xvcvspdp   -> vec_double{e,o}
__builtin_vsx_xvcvdpsp   -> vec_float{e,o}
__builtin_vsx_xvcvspuxws -> vec_unsigned
__builtin_vsx_xvcvsxwdp  -> vec_double{e,o}
__builtin_vsx_xvcvuxddp_uns -> vec_double

For __builtin_vsx_xvcvspsxds and __builtin_vsx_xvcvspuxds, which don't have
corresponding PVIPR built-ins, we can extend the current vec_{un,}signed{e,o}
to cover them and document them following the section mentioning PVIPR.
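
For reference, the round-to-zero and saturation behavior of these
conversion built-ins (e.g. vec_signed / xvcvspsxws) can be modeled in
scalar C; the exact edge behavior here is an assumption, and the helper
name is made up for illustration:

```c
#include <stdint.h>

/* Scalar model of the float -> signed int element conversion: round
   toward zero, NaN converts to 0x80000000, and out-of-range values
   saturate.  */
static int32_t
model_convert_f32_to_i32 (float x)
{
  if (x != x)
    return INT32_MIN;            /* NaN input -> 0x80000000 */
  if (x >= 2147483648.0f)
    return INT32_MAX;            /* saturate on positive overflow */
  if (x < -2147483648.0f)
    return INT32_MIN;            /* saturate on negative overflow */
  return (int32_t) x;            /* the C cast truncates toward zero */
}
```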

BR,
Kewen

> 
> This patch also adds runnable test cases for each of the built-ins.
> 
> gcc/ChangeLog:
>   * config/rs6000/rs6000-builtins.def (__builtin_vsx_xvcvdpuxws,
>   __builtin_vsx_xvcvspuxds, __builtin_vsx_xvcvspuxws): Change
>   return type from signed to unsigned.
>   * doc/extend.texi (__builtin_vsx_xvcvspsxws,
>   __builtin_vsx_xvcvspsxds, __builtin_vsx_xvcvspuxds,
>   __builtin_vsx_xvcvdpsxws, __builtin_vsx_xvcvdpuxws,
>   __builtin_vsx_xvcvdpuxds_uns, __builtin_vsx_xvcvspdp,
>   __builtin_vsx_xvcvdpsp, __builtin_vsx_xvcvspuxws,
>   __builtin_vsx_xvcvsxwdp, __builtin_vsx_xvcvuxddp_uns,
>   __builtin_vsx_xvcvuxwdp): Add documentation for builtins.
> 
> gcc/testsuite/ChangeLog:
>   * gcc.target/powerpc/vsx-builtin-runnable-1.c: New test file.
> ---
>  gcc/config/rs6000/rs6000-builtins.def |   6 +-
>  gcc/doc/extend.texi   | 135 ++
>  .../powerpc/vsx-builtin-runnable-1.c  | 233 ++
>  3 files changed, 371 insertions(+), 3 deletions(-)
>  create mode 100644 gcc/testsuite/gcc.target/powerpc/vsx-builtin-runnable-1.c
> 
> diff --git a/gcc/config/rs6000/rs6000-builtins.def 
> b/gcc/config/rs6000/rs6000-builtins.def
> index d66a53a0fab..fd316f629e5 100644
> --- a/gcc/config/rs6000/rs6000-builtins.def
> +++ b/gcc/config/rs6000/rs6000-builtins.def
> @@ -1724,7 +1724,7 @@
>const vull __builtin_vsx_xvcvdpuxds_uns (vd);
>  XVCVDPUXDS_UNS vsx_fixuns_truncv2dfv2di2 {}
>  
> -  const vsi __builtin_vsx_xvcvdpuxws (vd);
> +  const vui __builtin_vsx_xvcvdpuxws (vd);
>  XVCVDPUXWS vsx_xvcvdpuxws {}
>  
>const vd __builtin_vsx_xvcvspdp (vf);
> @@ -1736,10 +1736,10 @@
>const vsi __builtin_vsx_xvcvspsxws (vf);
>  XVCVSPSXWS vsx_fix_truncv4sfv4si2 {}
>  
> -  const vsll __builtin_vsx_xvcvspuxds (vf);
> +  const vull __builtin_vsx_xvcvspuxds (vf);
>  XVCVSPUXDS vsx_xvcvspuxds {}
>  
> -  const vsi __builtin_vsx_xvcvspuxws (vf);
> +  const vui __builtin_vsx_xvcvspuxws (vf);
>  XVCVSPUXWS vsx_fixuns_truncv4sfv4si2 {}
>  
>const vd __builtin_vsx_xvcvsxddp (vsll);
> diff --git a/gcc/doc/extend.texi b/gcc/doc/extend.texi
> index 4d8610f6aa8..583b1d890bf 100644
> --- a/gcc/doc/extend.texi
> +++ b/gcc/doc/extend.texi
> @@ -21360,6 +21360,141 @@ __float128 __builtin_sqrtf128 (__float128);
>  __float128 __builtin_fmaf128 (__float128, __float128, __float128);
>  @end smallexample
>  
> +@smallexample
> +vector int __builtin_vsx_xvcvspsxws (vector float);
> +@end smallexample
> +
> +The @code{__builtin_vsx_xvcvspsxws} converts the single precision floating
> +point vector element i to a signed single-precision integer value using
> +round to zero storing the result in element i.  If the source element is NaN
> +the result is set to 0x8000 and VXCI is set to 1.  If the source
> +element is SNaN then VXSNAN is also set to 1.  If the rounded value is 
> greater
> +than 2^31 - 1 

Re: [PATCH 01/11] rs6000, Fix __builtin_vsx_cmple* args and documentation, builtins

2024-02-28 Thread Kewen.Lin
Hi,

on 2024/2/21 01:55, Carl Love wrote:
> 
> GCC maintainers:
> 
> This patch fixes the arguments and return type for the various 
> __builtin_vsx_cmple* built-ins.  They were defined as signed but should have 
> been defined as unsigned.
> 
> The patch has been tested on Power 10 with no regressions.
> 
> Please let me know if this patch is acceptable for mainline.  Thanks.
> 
>   Carl 
> 
> -
> 
> rs6000, Fix __builtin_vsx_cmple* args and documentation, builtins
> 
> The built-ins __builtin_vsx_cmple_u16qi, __builtin_vsx_cmple_u2di,
> __builtin_vsx_cmple_u4si and __builtin_vsx_cmple_u8hi should take
> unsigned arguments and return an unsigned result.  This patch changes
> the arguments and return type from signed to unsigned.

Apparently the types mismatch the corresponding bif names, but I wonder if
these __builtin_vsx_cmple* actually provide any value?

Users can just use vec_cmple as PVIPR defines; as altivec.h shows, vec_cmple
gets redefined in terms of vec_cmpge, so these built-ins are not needed for
the underlying implementation.  I also checked the documentation of openXL
(the xl compiler), which doesn't support these either, so they are not
needed for compatibility.

So can we just remove these bifs?
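
For reference, the altivec.h approach mentioned above, implementing
vec_cmple (a, b) as vec_cmpge (b, a), looks like this in one scalar lane
(helper names made up for illustration):

```c
#include <stdint.h>

/* Scalar model: comparisons return all ones on true and all zeros on
   false, and cmple is just cmpge with the operands swapped, so no
   dedicated cmple built-in is needed.  One unsigned 32-bit lane.  */
static uint32_t
model_cmpge_u32 (uint32_t a, uint32_t b)
{
  return a >= b ? 0xFFFFFFFFu : 0u;  /* all ones on true, all zeros on false */
}

static uint32_t
model_cmple_u32 (uint32_t a, uint32_t b)
{
  return model_cmpge_u32 (b, a);     /* cmple via swapped-operand cmpge */
}
```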

> 
> The documentation for the signed and unsigned versions of
> __builtin_vsx_cmple is missing from extend.texi.  This patch adds the
> missing documentation.
> 
> Test cases are added for each of the signed and unsigned built-ins.
> 
> gcc/ChangeLog:
>   * config/rs6000/rs6000-builtins.def (__builtin_vsx_cmple_u16qi,
>   __builtin_vsx_cmple_u2di, __builtin_vsx_cmple_u4si): Change
>   arguments and return from signed to unsigned.
>   * doc/extend.texi (__builtin_vsx_cmple_16qi,
>   __builtin_vsx_cmple_8hi, __builtin_vsx_cmple_4si,
>   __builtin_vsx_cmple_u16qi, __builtin_vsx_cmple_u8hi,
>   __builtin_vsx_cmple_u4si): Add documentation.
> 
> gcc/testsuite/ChangeLog:
>   * gcc.target/powerpc/vsx-cmple.c: New test file.
> ---
>  gcc/config/rs6000/rs6000-builtins.def|  10 +-
>  gcc/doc/extend.texi  |  23 
>  gcc/testsuite/gcc.target/powerpc/vsx-cmple.c | 127 +++
>  3 files changed, 155 insertions(+), 5 deletions(-)
>  create mode 100644 gcc/testsuite/gcc.target/powerpc/vsx-cmple.c
> 
> diff --git a/gcc/config/rs6000/rs6000-builtins.def 
> b/gcc/config/rs6000/rs6000-builtins.def
> index 3bc7fed6956..d66a53a0fab 100644
> --- a/gcc/config/rs6000/rs6000-builtins.def
> +++ b/gcc/config/rs6000/rs6000-builtins.def
> @@ -1349,16 +1349,16 @@
>const vss __builtin_vsx_cmple_8hi (vss, vss);
>  CMPLE_8HI vector_ngtv8hi {}
>  
> -  const vsc __builtin_vsx_cmple_u16qi (vsc, vsc);
> +  const vuc __builtin_vsx_cmple_u16qi (vuc, vuc);
>  CMPLE_U16QI vector_ngtuv16qi {}
>  
> -  const vsll __builtin_vsx_cmple_u2di (vsll, vsll);
> +  const vull __builtin_vsx_cmple_u2di (vull, vull);
>  CMPLE_U2DI vector_ngtuv2di {}
>  
> -  const vsi __builtin_vsx_cmple_u4si (vsi, vsi);
> +  const vui __builtin_vsx_cmple_u4si (vui, vui);
>  CMPLE_U4SI vector_ngtuv4si {}
>  
> -  const vss __builtin_vsx_cmple_u8hi (vss, vss);
> +  const vus __builtin_vsx_cmple_u8hi (vus, vus);
>  CMPLE_U8HI vector_ngtuv8hi {}
>  
>const vd __builtin_vsx_concat_2df (double, double);
> @@ -1769,7 +1769,7 @@
>const vf __builtin_vsx_xvcvuxdsp (vull);
>  XVCVUXDSP vsx_xvcvuxdsp {}
>  
> -  const vd __builtin_vsx_xvcvuxwdp (vsi);
> +  const vd __builtin_vsx_xvcvuxwdp (vui);
>  XVCVUXWDP vsx_xvcvuxwdp {}

This change is unexpected, it should not be in this sub-patch. :)

>  
>const vf __builtin_vsx_xvcvuxwsp (vsi);
> diff --git a/gcc/doc/extend.texi b/gcc/doc/extend.texi
> index 2b8ba1949bf..4d8610f6aa8 100644
> --- a/gcc/doc/extend.texi
> +++ b/gcc/doc/extend.texi
> @@ -22522,6 +22522,29 @@ if the VSX instruction set is available.  The 
> @samp{vec_vsx_ld} and
>  @samp{vec_vsx_st} built-in functions always generate the VSX @samp{LXVD2X},
>  @samp{LXVW4X}, @samp{STXVD2X}, and @samp{STXVW4X} instructions.
>  
> +
> +@smallexample
> +vector signed char __builtin_vsx_cmple_16qi (vector signed char,
> + vector signed char);
> +vector signed short __builtin_vsx_cmple_8hi (vector signed short,
> + vector signed short);
> +vector signed int __builtin_vsx_cmple_4si (vector signed int,
> + vector signed int);
> +vector unsigned char __builtin_vsx_cmple_u16qi (vector unsigned char,
> +vector unsigned char);
> +vector unsigned short __builtin_vsx_cmple_u8hi (vector unsigned short,
> +vector unsigned short);
> +vector unsigned int __builtin_vsx_cmple_u4si (vector unsigned int,
> +  vector unsigned int);
> +@end smallexample

We don't document any 

Re: [PATCH] rs6000: Don't allow immediate value in the vsx_splat pattern [PR113950]

2024-02-26 Thread Kewen.Lin
on 2024/2/27 10:13, Peter Bergner wrote:
> On 2/26/24 7:55 PM, Kewen.Lin wrote:
>> on 2024/2/26 23:07, Peter Bergner wrote:
>>> so I think we should use both Jeevitha's predicate change and
>>> your operands[1] change.
>>
>> Since either the original predicate change or operands[1] change
>> can fix this issue, I think it's implied that only either of them
>> is enough, so we can remove "else if (!REG_P (op1))" arm (or even
>> replaced with one else arm to assert REG_P (op1))?
> 
> splat_input_operand allows, mem, reg and subreg, so I don't think
> we can just assert on REG_P (op1), since op1 could be a subreg.

ah, you are right! I missed the "subreg".

> I do agree we can remove the "if (!REG_P (op1))" test on the else
> branch, since force_reg() has an early exit for regs, so a simple:
> 
>   ...
>   else
> operands[1] = force_reg (<MODE>mode, op1);
> 
> ..should work.

Yes!

> 
> 
> 
> 
>> Good point, or maybe just an explicit -mvsx like some existing ones, which
>> can avoid to only test some fixed cpu type.
> 
> If a simple "-O1 -mvsx" is enough to expose the ICE on an unpatched
> compiler and a PASS on a patched compiler, then I'm all for it.
> Jeevitha, can you try confirming that?

Jeevitha, can you also check why we have the different behavior on GCC 11 when
you get time?  GCC 12 has the new built-in framework, so this ICE gets exposed,
but IMHO it would still be good to double check whether the previous behavior is
due to some missing support or some other latent bug.  Thanks in advance!

BR,
Kewen



Re: [PATCH] rs6000: Don't allow immediate value in the vsx_splat pattern [PR113950]

2024-02-26 Thread Kewen.Lin
on 2024/2/26 23:07, Peter Bergner wrote:
> On 2/26/24 4:49 AM, Kewen.Lin wrote:
>> on 2024/2/26 14:18, jeevitha wrote:
>>> Hi All,
>>> diff --git a/gcc/config/rs6000/vsx.md b/gcc/config/rs6000/vsx.md
>>> index 6111cc90eb7..e5688ff972a 100644
>>> --- a/gcc/config/rs6000/vsx.md
>>> +++ b/gcc/config/rs6000/vsx.md
>>> @@ -4660,7 +4660,7 @@
>>>  (define_expand "vsx_splat_<mode>"
>>>[(set (match_operand:VSX_D 0 "vsx_register_operand")
>>> (vec_duplicate:VSX_D
>>> -(match_operand:<VEC_base> 1 "input_operand")))]
>>> +(match_operand:<VEC_base> 1 "splat_input_operand")))]
>>>"VECTOR_MEM_VSX_P (<MODE>mode)"
>>>  {
>>>rtx op1 = operands[1];
>>
>> This hunk actually does force_reg already:
>>
>> ...
>>   else if (!REG_P (op1))
>> op1 = force_reg (<MODE>mode, op1);
>>
>> but it's assigning to op1 unexpectedly (an omission IMHO), so just
>> simply fix it with:
>>
>>   else if (!REG_P (op1))
>> -op1 = force_reg (<MODE>mode, op1);
>> +operands[1] = force_reg (<MODE>mode, op1);
> 
> I agree op1 was an oversight and it should be operands[1].
> That said, I think using more precise predicates is a good thing,

Agreed.

> so I think we should use both Jeevitha's predicate change and
> your operands[1] change.

Since either the original predicate change or the operands[1] change
can fix this issue, I think it's implied that either of them alone
is enough, so we can remove the "else if (!REG_P (op1))" arm (or even
replace it with an else arm that asserts REG_P (op1))?

> 
> I'll note that Jeevitha originally had the operands[1] change, but I
> didn't look closely enough at the issue or the pattern and mentioned
> that these kinds of bugs can be caused by too loose constraints and
> predicates, which is when she found the updated predicate to use.
> I believe she already even bootstrapped and regtested the operands[1]
> only change.  Jeevitha???
> 

Good to know that. :)

> 
> 
> 
>>> +/* PR target/113950 */
>>> +/* { dg-do compile } */
>>
>> We need an effective target to ensure vsx support, for now it's 
>> powerpc_vsx_ok.
>> ie: /* { dg-require-effective-target powerpc_vsx_ok } */
> 
> Agreed.
> 
> 
>>> +/* { dg-options "-O1" } */
> 
> I think we should also use a -mcpu=XXX option to ensure VSX is enabled
> when compiling these VSX built-in functions.  I'm fine using any CPU
> (power7 or later) where the ICE exists with an unpatched compiler.
> Otherwise, testing will be limited to our server systems that have
> VSX enabled by default.

Good point, or maybe just an explicit -mvsx like some existing ones, which
avoids restricting the test to some fixed cpu type.

BR,
Kewen


Re: [PATCH] rs6000: Don't allow immediate value in the vsx_splat pattern [PR113950]

2024-02-26 Thread Kewen.Lin
Hi,

on 2024/2/26 14:18, jeevitha wrote:
> Hi All,
> 
> The following patch has been bootstrapped and regtested on powerpc64le-linux.
> 
> There is no immediate value splatting instruction in powerpc. Currently that
> needs to be stored in a register or memory. For addressing this I have updated
> the predicate for the second operand in vsx_splat to splat_input_operand,
> which will handle the operands appropriately.

The test case fails with an error message on GCC 11, but with an ICE from GCC
12 on; it's a kind of regression, so I think we can make such a fix at this stage.

Out of curiosity, did you check why it triggers error messages on GCC 11?  I
guess the difference is that Bill introduced the new built-in framework in
GCC 12, which adds support for this bif, but I'm curious what prevented this
from being supported before that.

> 
> 2024-02-26  Jeevitha Palanisamy  
> 
> gcc/
>   PR target/113950
>   * config/rs6000/vsx.md (vsx_splat_<mode>): Update the predicate
>   for the second operand.
> 
> gcc/testsuite/
>   PR target/113950
>   * gcc.target/powerpc/pr113950.c: New testcase.
> 
> diff --git a/gcc/config/rs6000/vsx.md b/gcc/config/rs6000/vsx.md
> index 6111cc90eb7..e5688ff972a 100644
> --- a/gcc/config/rs6000/vsx.md
> +++ b/gcc/config/rs6000/vsx.md
> @@ -4660,7 +4660,7 @@
>  (define_expand "vsx_splat_<mode>"
>[(set (match_operand:VSX_D 0 "vsx_register_operand")
>   (vec_duplicate:VSX_D
> -  (match_operand:<VEC_base> 1 "input_operand")))]
> +  (match_operand:<VEC_base> 1 "splat_input_operand")))]
>"VECTOR_MEM_VSX_P (<MODE>mode)"
>  {
>rtx op1 = operands[1];

This hunk actually does force_reg already:

...
  else if (!REG_P (op1))
op1 = force_reg (<MODE>mode, op1);

but it's assigning to op1 unexpectedly (an omission IMHO), so just
simply fix it with:

  else if (!REG_P (op1))
-op1 = force_reg (<MODE>mode, op1);
+operands[1] = force_reg (<MODE>mode, op1);

instead, can you verify?

> diff --git a/gcc/testsuite/gcc.target/powerpc/pr113950.c 
> b/gcc/testsuite/gcc.target/powerpc/pr113950.c
> new file mode 100644
> index 000..29ded29f683
> --- /dev/null
> +++ b/gcc/testsuite/gcc.target/powerpc/pr113950.c
> @@ -0,0 +1,24 @@
> +/* PR target/113950 */
> +/* { dg-do compile } */

We need an effective target to ensure vsx support, for now it's powerpc_vsx_ok.
ie: /* { dg-require-effective-target powerpc_vsx_ok } */

(most/all of its uses would be replaced with an enhanced powerpc_vsx in next 
stage 1).

BR,
Kewen


> +/* { dg-options "-O1" } */
> +
> +/* Verify we do not ICE on the following.  */
> +
> +void abort (void);
> +
> +int main ()
> +{
> +  int i;
> +  vector signed long long vsll_result, vsll_expected_result;
> +  signed long long sll_arg1;
> +
> +  sll_arg1 = 300;
> +  vsll_expected_result = (vector signed long long) {300, 300};
> +  vsll_result = __builtin_vsx_splat_2di (sll_arg1);  
> +
> +  for (i = 0; i < 2; i++)
> +if (vsll_result[i] != vsll_expected_result[i])
> +  abort();
> +
> +  return 0;
> +}
> 
> 



Re: Repost [PATCH 1/6] Add -mcpu=future

2024-02-26 Thread Kewen.Lin
on 2024/2/21 15:19, Michael Meissner wrote:
> On Tue, Feb 20, 2024 at 06:35:34PM +0800, Kewen.Lin wrote:
>> Hi Mike,
>>
>> Sorry for late reply (just back from vacation).
>>
>> on 2024/2/8 03:58, Michael Meissner wrote:
>>> On Wed, Feb 07, 2024 at 05:21:10PM +0800, Kewen.Lin wrote:
>>>> on 2024/2/6 14:01, Michael Meissner wrote:
>>>> Sorry for the possible confusion here, the "tune_proc" that I referred to 
>>>> is
>>>> the variable in the above else branch:
>>>>
>>>>enum processor_type tune_proc = (TARGET_POWERPC64 ? PROCESSOR_DEFAULT64 
>>>> : PROCESSOR_DEFAULT);
>>>>
>>>> It's either PROCESSOR_DEFAULT64 or PROCESSOR_DEFAULT, so it doesn't have a
>>>> chance to be PROCESSOR_FUTURE, so the checking "tune_proc == 
>>>> PROCESSOR_FUTURE"
>>>> is useless.
>>>
>>> PROCESSOR_DEFAULT can be PROCESSOR_FUTURE if somebody configures GCC with
>>> --with-cpu=future.  While in general it shouldn't occur, it is helpful to
>>> consider all of the corner cases.
>>
>> But it sounds not true, I think you meant TARGET_CPU_DEFAULT instead?
>>
>> On one local ppc64le machine I tried to configure with --with-cpu=power10,
>> I got {,OPTION_}TARGET_CPU_DEFAULT "power10" but PROCESSOR_DEFAULT is still
>> PROCESSOR_POWER7 (PROCESSOR_DEFAULT64 is PROCESSOR_POWER8).  I think these
>> PROCESSOR_DEFAULT{,64} are defined by various headers:
> 
> Yes, I was mistaken.  You are correct TARGET_CPU_DEFAULT is set.  I will 
> change
> the comments.

Thanks!

> 
>> gcc/config/rs6000/aix71.h:#define PROCESSOR_DEFAULT PROCESSOR_POWER7
>> gcc/config/rs6000/aix71.h:#define PROCESSOR_DEFAULT64 PROCESSOR_POWER7
>> gcc/config/rs6000/aix72.h:#define PROCESSOR_DEFAULT PROCESSOR_POWER7
>> gcc/config/rs6000/aix72.h:#define PROCESSOR_DEFAULT64 PROCESSOR_POWER7
>> gcc/config/rs6000/aix73.h:#define PROCESSOR_DEFAULT PROCESSOR_POWER8
>> gcc/config/rs6000/aix73.h:#define PROCESSOR_DEFAULT64 PROCESSOR_POWER8
>> gcc/config/rs6000/darwin.h:#define PROCESSOR_DEFAULT  PROCESSOR_PPC7400
>> gcc/config/rs6000/darwin.h:#define PROCESSOR_DEFAULT64  PROCESSOR_POWER4
>> gcc/config/rs6000/freebsd64.h:#define PROCESSOR_DEFAULT PROCESSOR_PPC7450
>> gcc/config/rs6000/freebsd64.h:#define PROCESSOR_DEFAULT64 PROCESSOR_POWER8
>> gcc/config/rs6000/linux64.h:#define PROCESSOR_DEFAULT PROCESSOR_POWER7
>> gcc/config/rs6000/linux64.h:#define PROCESSOR_DEFAULT64 PROCESSOR_POWER8
>> gcc/config/rs6000/rs6000.h:#define PROCESSOR_DEFAULT   PROCESSOR_PPC603
>> gcc/config/rs6000/rs6000.h:#define PROCESSOR_DEFAULT64 PROCESSOR_RS64A
>> gcc/config/rs6000/vxworks.h:#define PROCESSOR_DEFAULT PROCESSOR_PPC604
>>
>> , and they are unlikely to be updated later, no?
>>
>> btw, the given --with-cpu=future will make cpu_index never negative so
>>
>>   ...
>>   else if (cpu_index >= 0)
>> rs6000_tune_index = tune_index = cpu_index;
>>   else
>> ... 
>>
>> so there is no chance to enter "else" arm, that is, that arm only takes
>> effect when no cpu/tune is given (neither -m{cpu,tune} nor --with-cpu=).
> 
> Note, this is existing code.  I didn't modify it.  If we want to change it, we
> should do it as another patch.

Yes, I agree.  Just to clarify, I didn't suggest changing it; instead I
suggested mostly keeping things as they are, since we don't need any changes
in the "else" arm.  So instead of updating the "if" and "else if" arms for the
"future" cpu type, it seems a bit clearer to just check it afterwards, i.e.:



bool explicit_tune = false;
if (rs6000_tune_index >= 0)
  {
tune_index = rs6000_tune_index;
explicit_tune = true;
  }
else if (cpu_index >= 0)
  // as before
  rs6000_tune_index = tune_index = cpu_index;
else
  {
   //as before
   ...
  }

// Check tune_index here instead.

if (processor_target_table[tune_index].processor == PROCESSOR_FUTURE)
  {
tune_index = rs6000_cpu_index_lookup (PROCESSOR_POWER10);
if (explicit_tune)
  warn ...
  }

// as before
rs6000_tune = processor_target_table[tune_index].processor;



, copied from previous comment: 
https://gcc.gnu.org/pipermail/gcc-patches/2024-January/643681.html

BR,
Kewen



Re: [PATCH] rs6000: Neuter option -mpower{8,9}-vector [PR109987]

2024-02-20 Thread Kewen.Lin
on 2024/2/21 09:37, Peter Bergner wrote:
> On 2/20/24 3:27 AM, Kewen.Lin wrote:
>> on 2024/2/20 02:45, Segher Boessenkool wrote:
>>> On Tue, Jan 16, 2024 at 10:50:01AM +0800, Kewen.Lin wrote:
>>>> it consists of some aspects:
>>>>   - effective target powerpc_p{8,9}vector_ok are removed
>>>> and replaced with powerpc_vsx_ok.
>>>
>>> So all such testcases already arrange to have p8 or p9 some other way?
> 
> Shouldn't that be replaced with powerpc_vsx instead of powerpc_vsx_ok?
> That way we know VSX code gen is enabled for the options being used,
> even those in RUNTESTFLAGS.
> 
> I thought we agreed that powerpc_vsx_ok was almost always useless and
> we always want to use powerpc_vsx.  ...or did I miss that we removed
> the old powerpc_vsx_ok and renamed powerpc_vsx to powerpc_vsx_ok?

Yes, I think we all agreed that powerpc_vsx matches more with what we
expect, but I'm hesitant to make such a change at this stage because:

  1. if testing on an env without vsx support, the test results on these
 affected test cases may change a lot, as many test cases would
 become unsupported (they passed before with an explicit -mvsx).

  2. teaching the current powerpc_vsx to make use of current_compiler_flags,
 just like some existing practices on has_arch_*, may help mitigate
 this, as quite a few test cases already have an explicit -mvsx.  But AIUI
 current_compiler_flags requires the dg-options line to come before
 the effective target line for the options in dg-options to take
 effect, which means we'd need some work to adjust the line order of
 the affected test cases.  On the other hand, some enhancement is needed
 for current_compiler_flags, as powerpc_vsx (the old powerpc_vsx_ok) isn't
 only used in test cases but can also be used in some exp checks
 where no expected flags exist.

  3. there may be some other similar effective target checks which we
 want to update as well, which means we need to re-visit the existing
 (rs6000-specific) effective target checks.

  4. powerpc_vsx_ok has been there for a long, long time, and -mno-vsx
 is rarely used in RUNTESTFLAGS; this only affects testing, so it
 is not that urgent.

so I'm inclined to work on this in next stage 1.  What do you think?

> 
>>>>   - Some test cases are updated with explicit -mvsx.
>>>>   - Some test cases with those two option mixed are adjusted
>>>> to keep the test points, like -mpower8-vector
>>>> -mno-power9-vector are updated with -mdejagnu-cpu=power8
>>>> -mvsx etc.
>>>
>>> -mcpu=power8 implies -mvsx already.
> 
> Then we can omit the explicit -msx option, correct?  Ie, if the
> user forces -mno-vsx in RUNTESTFLAGS, then we'll just skip the
> test case as UNSUPPORTED rather than trying to compile some
> vsx test case with vsx disabled via the options.

Yes, we can strip any -mvsx then, but if we want the test case
to be tested whenever it can be, we can still append an extra
-mvsx.  Even if -mno-vsx is specified, if the option order
ends up like "-mno-vsx ... -mvsx", powerpc_vsx is satisfied
so the test case can still be tested well with -mvsx
enabled, while if the order is like "-mvsx ... -mno-vsx",
powerpc_vsx fails and the test becomes unsupported.

BR,
Kewen



Re: [PATCH] rs6000: Neuter option -mpower{8,9}-vector [PR109987]

2024-02-20 Thread Kewen.Lin
on 2024/2/20 19:19, Segher Boessenkool wrote:
> On Tue, Feb 20, 2024 at 05:27:07PM +0800, Kewen.Lin wrote:
>> Good question, it mainly follows the practice of option direct-move here.
>> IMHO at least for power8-vector we want WarnRemoved for now as it's
>> documented before, and we can probably make it (or them) removed later on
>> trunk once all active branch releases don't support it any more.
>>
>> What's your opinion on this?
> 
> Originally I did
>   Warn(%qs is deprecated)
> which already was a mistake.  It then changed to
>   Deprecated
> and then to
>   WarnRemoved
> which make it clearer that it is a bad plan.
> 
> If it is okay to remove an option, we should not talk about it at all
> anymore.  Well maybe warn about it for another release or so, but not
> longer.

OK, thanks for the suggestion.

> 
>>>>  (define_register_constraint "we" 
>>>> "rs6000_constraints[RS6000_CONSTRAINT_we]"
>>>> -  "@internal Like @code{wa}, if @option{-mpower9-vector} and 
>>>> @option{-m64} are
>>>> -   used; otherwise, @code{NO_REGS}.")
>>>> +  "@internal Like @code{wa}, if the cpu type is power9 or up, meanwhile
>>>> +   @option{-mvsx} and @option{-m64} are used; otherwise, @code{NO_REGS}.")
>>>
>>> "if this is a POWER9 or later and @option{-mvsx} and @option{-m64} are
>>> used".  How clumsy.  Maybe we should make the patterns that use "we"
>>> work without mtvsrdd as well?  Hrm, they will still require 64-bit GPRs
>>> of course, unless we can do something tricky.
>>>
>>> We do not need the special constraint at all of course (we can add these
>>> conditions to all patterns that use it: all *two* patterns).  So maybe
>>> that's what we should do :-)
>>
>> Not sure the original intention introducing it (Mike might know it best), but
>> removing it sounds doable.
> 
> It is for mtvsrdd.

Yes, I meant to say I'm not sure whether there was some obstacle that made us
introduce a new constraint, or whether it was just for simplicity.

> 
>>  btw, it seems more than two patterns using it?
>> like (if I didn't miss something):
>>   - vsx_concat_<mode>
>>   - vsx_splat_<mode>_reg
>>   - vsx_splat_v4si_di
>>   - vsx_mov<mode>_64bit
> 
> Yes, it isn't clear we should use this contraint in those last two.  It
> looks like those do not even need the restriction to 64 bit systems.
> Well the last one obviously has that already, but then it could just use
> "wa", no?

For vsx_splat_v4si_di, it's for mtvsrws; the ISA notes GPR[RA].bit[32:63], which
implies the context has 64-bit GPRs?  The last one still seems to need to
distinguish whether there is power9 support; just using "wa", which only
implies power7, doesn't fit with it?

btw, the actual guard for "we" is TARGET_POWERPC64 rather than TARGET_64BIT,
so the documentation isn't accurate enough.  Just filed internal issue #1345
for further tracking on this.

> 
>>> -mcpu=power8 implies -mvsx (power7 already).  You can disable VSX, or
>>> VMX as well, but by default it is enabled.
>>
>> Yes, it's meant to consider the explicitly -mno-vsx, which suffers the option
>> order issue.  But considering we raise error for -mno-vsx -mpower{8,9}-vector
>> before, without specifying -mvsx is closer to the previous.
>>
>> I'll adjust it and the below similar ones, thanks!
> 
> It is never supported to do unsupported things :-)
> 
> We need to be able to rely on defaults.  Otherwise, we will have to
> implement all of GCC recursively, in itself, in the testsuite, and in
> individual tests.  Let's not :-)

OK, fair enough.  Thanks!

BR,
Kewen



Re: Repost [PATCH 1/6] Add -mcpu=future

2024-02-20 Thread Kewen.Lin
Hi Mike,

Sorry for late reply (just back from vacation).

on 2024/2/8 03:58, Michael Meissner wrote:
> On Wed, Feb 07, 2024 at 05:21:10PM +0800, Kewen.Lin wrote:
>> on 2024/2/6 14:01, Michael Meissner wrote:
>> Sorry for the possible confusion here, the "tune_proc" that I referred to is
>> the variable in the above else branch:
>>
>>enum processor_type tune_proc = (TARGET_POWERPC64 ? PROCESSOR_DEFAULT64 : 
>> PROCESSOR_DEFAULT);
>>
>> It's either PROCESSOR_DEFAULT64 or PROCESSOR_DEFAULT, so it doesn't have a
>> chance to be PROCESSOR_FUTURE, so the checking "tune_proc == 
>> PROCESSOR_FUTURE"
>> is useless.
> 
> PROCESSOR_DEFAULT can be PROCESSOR_FUTURE if somebody configures GCC with
> --with-cpu=future.  While in general it shouldn't occur, it is helpful to
> consider all of the corner cases.

But it sounds not true, I think you meant TARGET_CPU_DEFAULT instead?

On one local ppc64le machine I tried to configure with --with-cpu=power10,
I got {,OPTION_}TARGET_CPU_DEFAULT "power10" but PROCESSOR_DEFAULT is still
PROCESSOR_POWER7 (PROCESSOR_DEFAULT64 is PROCESSOR_POWER8).  I think these
PROCESSOR_DEFAULT{,64} are defined by various headers:

$ grep -r "define PROCESSOR_DEFAULT" gcc/config/rs6000/
gcc/config/rs6000/aix71.h:#define PROCESSOR_DEFAULT PROCESSOR_POWER7
gcc/config/rs6000/aix71.h:#define PROCESSOR_DEFAULT64 PROCESSOR_POWER7
gcc/config/rs6000/aix72.h:#define PROCESSOR_DEFAULT PROCESSOR_POWER7
gcc/config/rs6000/aix72.h:#define PROCESSOR_DEFAULT64 PROCESSOR_POWER7
gcc/config/rs6000/aix73.h:#define PROCESSOR_DEFAULT PROCESSOR_POWER8
gcc/config/rs6000/aix73.h:#define PROCESSOR_DEFAULT64 PROCESSOR_POWER8
gcc/config/rs6000/darwin.h:#define PROCESSOR_DEFAULT  PROCESSOR_PPC7400
gcc/config/rs6000/darwin.h:#define PROCESSOR_DEFAULT64  PROCESSOR_POWER4
gcc/config/rs6000/freebsd64.h:#define PROCESSOR_DEFAULT PROCESSOR_PPC7450
gcc/config/rs6000/freebsd64.h:#define PROCESSOR_DEFAULT64 PROCESSOR_POWER8
gcc/config/rs6000/linux64.h:#define PROCESSOR_DEFAULT PROCESSOR_POWER7
gcc/config/rs6000/linux64.h:#define PROCESSOR_DEFAULT64 PROCESSOR_POWER8
gcc/config/rs6000/rs6000.h:#define PROCESSOR_DEFAULT   PROCESSOR_PPC603
gcc/config/rs6000/rs6000.h:#define PROCESSOR_DEFAULT64 PROCESSOR_RS64A
gcc/config/rs6000/vxworks.h:#define PROCESSOR_DEFAULT PROCESSOR_PPC604

, and they are unlikely to be updated later, no?

btw, the given --with-cpu=future will make cpu_index never negative so

  ...
  else if (cpu_index >= 0)
rs6000_tune_index = tune_index = cpu_index;
  else
... 

so there is no chance to enter "else" arm, that is, that arm only takes
effect when no cpu/tune is given (neither -m{cpu,tune} nor --with-cpu=).

BR,
Kewen



Re: [PATCH] rs6000: Update instruction counts due to combine changes [PR112103]

2024-02-20 Thread Kewen.Lin
Hi Peter,

on 2024/2/20 06:35, Peter Bergner wrote:
> rs6000: Update instruction counts due to combine changes [PR112103]
> 
> The PR91865 combine fix changed instruction counts slightly for rlwinm-0.c.
> Adjust expected instruction counts accordingly.
> 
> This passed on both powerpc64le-linux and powerpc64-linux running the
> testsuite in both 32-bit and 64-bit modes.  Ok for trunk?

OK for trunk, thanks for fixing!

> 
> FYI, I will open a new bug to track the removing of the superfluous
> insns detected in PR112103.

Hope this test case will no longer be fragile once this filed
issue gets fixed. :)

BR,
Kewen

> 
> 
> Peter
> 
> 
> gcc/testsuite/
>   PR target/112103
>   * gcc.target/powerpc/rlwinm-0.c: Adjust expected instruction counts.
> 
> diff --git a/gcc/testsuite/gcc.target/powerpc/rlwinm-0.c 
> b/gcc/testsuite/gcc.target/powerpc/rlwinm-0.c
> index 4f4fca2d8ef..a10d9174306 100644
> --- a/gcc/testsuite/gcc.target/powerpc/rlwinm-0.c
> +++ b/gcc/testsuite/gcc.target/powerpc/rlwinm-0.c
> @@ -4,10 +4,10 @@
>  /* { dg-final { scan-assembler-times {(?n)^\s+[a-z]} 6739 { target ilp32 } } 
> } */
>  /* { dg-final { scan-assembler-times {(?n)^\s+[a-z]} 9716 { target lp64 } } 
> } */
>  /* { dg-final { scan-assembler-times {(?n)^\s+blr} 3375 } } */
> -/* { dg-final { scan-assembler-times {(?n)^\s+rldicl} 3081 { target lp64 } } 
> } */
> +/* { dg-final { scan-assembler-times {(?n)^\s+rldicl} 3090 { target lp64 } } 
> } */
>  
>  /* { dg-final { scan-assembler-times {(?n)^\s+rlwinm} 3197 { target ilp32 } 
> } } */
> -/* { dg-final { scan-assembler-times {(?n)^\s+rlwinm} 3093 { target lp64 } } 
> } */
> +/* { dg-final { scan-assembler-times {(?n)^\s+rlwinm} 3084 { target lp64 } } 
> } */
>  /* { dg-final { scan-assembler-times {(?n)^\s+rotlwi} 154 } } */
>  /* { dg-final { scan-assembler-times {(?n)^\s+srwi} 13 { target ilp32 } } } 
> */
>  /* { dg-final { scan-assembler-times {(?n)^\s+srdi} 13 { target lp64 } } } */



Re: [PATCH] rs6000: Neuter option -mpower{8,9}-vector [PR109987]

2024-02-20 Thread Kewen.Lin
Hi Segher,

Thanks for the review comments!

on 2024/2/20 02:45, Segher Boessenkool wrote:
> Hi!
> 
> On Tue, Jan 16, 2024 at 10:50:01AM +0800, Kewen.Lin wrote:
>> As PR109987 and its duplicated bugs show, -mno-power8-vector
>> (and -mno-power9-vector) cause some problems and as Segher
>> pointed out in [1] they are workaround options, so this patch
>> is to remove -m{no,}-power{8,9}-options.
> 
> Excellent :-)
> 
>> Like what we did
>> for option -mdirect-move before, this patch still keep the
>> corresponding internal flags and they are automatically set
>> based on -mcpu.
> 
> Yup.  That makes the code nicer, and it what we already have anyway!
> 
>> The test suite update takes some efforts,
> 
> Yeah :-/
> 
>> it consists of some aspects:
>>   - effective target powerpc_p{8,9}vector_ok are removed
>> and replaced with powerpc_vsx_ok.
> 
> So all such testcases already arrange to have p8 or p9 some other way?

Some of them already have, but some of them don't; those
without any p8/p9 are adjusted according to the test points,
as explained below.

> 
>>   - Some cases having -mpower{8,9}-vector are updated with
>> -mvsx, some of them already have -mdejagnu-cpu.  For
>> those that don't have -mdejagnu-cpu, if -mdejagnu-cpu
>> is needed for the test point, then it's appended;
>> otherwise, add additional-options -mdejagnu-cpu=power{8,9}
>> if has_arch_pwr{8,9} isn't satisfied.
> 
> Yeah it's a judgement call every time.
> 
>>   - Some test cases are updated with explicit -mvsx.
>>   - Some test cases with those two option mixed are adjusted
>> to keep the test points, like -mpower8-vector
>> -mno-power9-vector are updated with -mdejagnu-cpu=power8
>> -mvsx etc.
> 
> -mcpu=power8 implies -mvsx already.

Yes, but users can specify -mno-vsx in RUNTESTFLAGS, and the dejagnu
framework can have different behaviors (option order) across
versions, so this explicit -mvsx is mainly for the
consistency between the checking and the actual testing.
But according to the discussion in an internal thread, the
current powerpc_vsx_ok doesn't work as we expect, so there
will be some changes later.

> 
>>   - Some test cases with -mno-power{8,9}-vector are updated
>> by replacing -mno-power{8,9}-vector with -mno-vsx, or
>> just removing it.
> 
> Okay.
> 
>>   - For some cases, we don't always specify -mdejagnu-cpu to
>> avoid to restrict the testing coverage, it would check
>> has_arch_pwr{8,9} and appended that as need.
> 
> That is in general how all tests should be.  Very sometimes we want to
> test for a specific CPU, for a regression test that exhibited just on a
> certain CPU for example.  But we should never have a -mcpu= (or a
> -mpowerN-vector nastiness thing) to test things on a new CPU!  Just do a
> testsuite ruyn with *that* CPU.  Not many years from now, *all* CPUs
> will have those new instructions anyway, so let's not put noise in the
> testcases that will be irrelevant soon.
> 
>>   - For vect test cases run, it doesn't specify -mcpu=power9
>> for power10 and up.
>>
>> Bootstrapped and regtested on:
>>   - powerpc64-linux-gnu P7/P8/P9 {-m32,-m64}
>>   - powerpc64le-linux-gnu P8/P9/P10
> 
> In general it is nice to test 970 as the lowest vector thing we have,
> abnd/or p4 as a target without anything vector, as well.  But I expect
> thoise will just work for this patch :-)

Thanks for the tips, I'll give them a shot before pushing it.

> 
>> Although it's stage4 now, as the discussion in PR113115 we
>> are still eager to neuter these two options.
> 
> It is mostly a testsuite patch, and testcase patches are fine (and much
> wanted!) in stage 4.  The actual compiler options remain, and behaviour
> does not change for anyone who used the option as intended,

Yes, except for one unexpected use: users whose cpu type
doesn't support the power8/power9 capability but who specify option
-mpower{8,9}-vector to gain that capability (as currently these
options can enable the corresponding flags).  But I don't think that's an
expected use case.

> 
> Okay for trunk.  Thanks!  Comments below:
> 
>>  * config/rs6000/rs6000.opt: Make option power{8,9}-vector as
>>  WarnRemoved.
> 
> Do we want this, or do we want it silent?  Should we remove the options
> later, if we now warn for it?

Good question, it mainly follows the practice of option direct-move here.
IMHO at least for power8-vector we want WarnRemoved for now as it's
documented before, and we can probably make it (or them) removed later on
trunk once all active branch releases don't support it any more.

What's your opinion on this?

Re: Repost [PATCH 3/6] PowerPC: Add support for accumulators in DMR registers.

2024-02-07 Thread Kewen.Lin
on 2024/2/7 08:06, Michael Meissner wrote:
> On Thu, Jan 25, 2024 at 05:28:49PM +0800, Kewen.Lin wrote:
>> Hi Mike,
>>
>> on 2024/1/6 07:38, Michael Meissner wrote:
>>> The MMA subsystem added the notion of accumulator registers as an optional
>>> feature of ISA 3.1 (power10).  In ISA 3.1, these accumulators overlapped 
>>> with
>>> the traditional floating point registers 0..31, but logically the 
>>> accumulator
>>> registers were separate from the FPR registers.  In ISA 3.1, it was 
>>> anticipated
>>
>> Using VSX register 0..31 rather than traditional floating point registers 
>> 0..31
>> seems more clear, since floating point registers imply 64 bit long registers.
> 
> Ok.
> 
>>> that in future systems, the accumulator registers may no overlap with the 
>>> FPR
>>> registers.  This patch adds the support for dense math registers as separate
>>> registers.
>>>
>>> This particular patch does not change the MMA support to use the 
>>> accumulators
>>> within the dense math registers.  This patch just adds the basic support for
>>> having separate DMRs.  The next patch will switch the MMA support to use the
>>> accumulators if -mcpu=future is used.
>>>
>>> For testing purposes, I added an undocumented option '-mdense-math' to 
>>> enable
>>> or disable the dense math support.
>>
>> Can we avoid this and use one macro for it instead?  As you might have 
>> noticed
>> that some previous temporary options like -mpower{8,9}-vector cause ICEs due 
>> to
>> some unexpected combination and we are going to neuter them, so let's try our
>> best to avoid it if possible.  I guess one macro TARGET_DENSE_MATH defined by
>> TARGET_FUTURE && TARGET_MMA matches all use places? and specifying 
>> -mcpu=future
>> can enable it while -mcpu=power10 can disable it.
> 
> That depends on whether there will be other things added in the future power
> that are not in the MMA+ instruction set.
> 
> But I can switch to defining TARGET_DENSE_MATH to testing TARGET_FUTURE and
> TARGET_MMA.  That way if/when a new cpu comes out, we will just have to change
> the definition of TARGET_DENSE_MATH and not all of the uses.

Yes, that's what I expected.  Thanks!

> 
> I will also add TARGET_MMA_NO_DENSE_MATH to handle the existing MMA code for
> assemble and disassemble when we don't have dense math instructions.

Nice, I also found that having such a macro can help when reviewing a later
patch, so I suggested a similar one there.

>>> -(define_insn_and_split "*movxo"
>>> +(define_insn_and_split "*movxo_nodm"
>>>[(set (match_operand:XO 0 "nonimmediate_operand" "=d,ZwO,d")
>>> (match_operand:XO 1 "input_operand" "ZwO,d,d"))]
>>> -  "TARGET_MMA
>>> +  "TARGET_MMA && !TARGET_DENSE_MATH
>>> && (gpc_reg_operand (operands[0], XOmode)
>>> || gpc_reg_operand (operands[1], XOmode))"
>>>"@
>>> @@ -366,6 +369,31 @@ (define_insn_and_split "*movxo"
>>> (set_attr "length" "*,*,16")
>>> (set_attr "max_prefixed_insns" "2,2,*")])
>>>  
>>> +(define_insn_and_split "*movxo_dm"
>>> +  [(set (match_operand:XO 0 "nonimmediate_operand" "=wa,QwO,wa,wD,wD,wa")
>>> +   (match_operand:XO 1 "input_operand""QwO,wa, wa,wa,wD,wD"))]
>>
>> Why not adopt ZwO rather than QwO?
> 
> You have to split the address into 2 addresses for loading or storing vector
> pairs (or 4 addresses for loading or storing vectors).  Z would allow
> register+register addresses, and you wouldn't be able to create the second 
> address by adding 128 to it.  Hence it uses 'Q' for register only and 'wo' for
> d-form addresses.

Thanks for clarifying.  But without this patch the define_insn_and_split *movxo
adopts "ZwO"; IMHO that would mean the current "*movxo" define_insn_and_split
has been problematic?  I thought adjust_address can ensure the new address is
still valid after adjusting the offset by 128, could you double check?

> 
>>
>>> +  "TARGET_DENSE_MATH
>>> +   && (gpc_reg_operand (operands[0], XOmode)
>>> +   || gpc_reg_operand (operands[1], XOmode))"
>>> +  "@
>>> +   #
>>> +   #
>>> +   #
>>> +   dmxxinstdmr512 %0,%1,%Y1,0
>>> +   dmmr %0,%1
>>> +   dmxxextfdmr512 %0,%Y0,%1,0"
>>> +  "&& reloa

Re: Repost [PATCH 1/6] Add -mcpu=future

2024-02-07 Thread Kewen.Lin
on 2024/2/6 14:01, Michael Meissner wrote:
> On Tue, Jan 23, 2024 at 04:44:32PM +0800, Kewen.Lin wrote:
...
>>> diff --git a/gcc/config/rs6000/rs6000-opts.h 
>>> b/gcc/config/rs6000/rs6000-opts.h
>>> index 33fd0efc936..25890ae3034 100644
>>> --- a/gcc/config/rs6000/rs6000-opts.h
>>> +++ b/gcc/config/rs6000/rs6000-opts.h
>>> @@ -67,7 +67,9 @@ enum processor_type
>>> PROCESSOR_MPCCORE,
>>> PROCESSOR_CELL,
>>> PROCESSOR_PPCA2,
>>> -   PROCESSOR_TITAN
>>> +   PROCESSOR_TITAN,
>>> +
>>
>> Nit: unintentional empty line?
>>
>>> +   PROCESSOR_FUTURE
>>>  };
> 
> It was more as a separation.  The MPCCORE, CELL, PPCA2, and TITAN are rather
> old processors.  I don't recall why we kept them after the POWER.
> 
> Logically we should re-order the list and move MPCCORE, etc. earlier, but I
> will delete the blank line in future patches.

Thanks for clarifying, the re-order thing can be done in a separate patch and
in this context one comment line would be better than a blank line. :)

...

>>> + power10 tuning until future tuning is added.  */
>>>if (rs6000_tune_index >= 0)
>>> -tune_index = rs6000_tune_index;
>>> +{
>>> +  enum processor_type cur_proc
>>> +   = processor_target_table[rs6000_tune_index].processor;
>>> +
>>> +  if (cur_proc == PROCESSOR_FUTURE)
>>> +   {
>>> + static bool issued_future_tune_warning = false;
>>> + if (!issued_future_tune_warning)
>>> +   {
>>> + issued_future_tune_warning = true;
>>
>> This seems to ensure we only warn this once, but I noticed that in rs6000/
>> only some OPT_Wpsabi related warnings adopt this way, I wonder if we don't
>> restrict it like this, for a tiny simple case, how many times it would warn?
> 
> In a simple case, you would only get the warning once.  But if you use
> __attribute__((__target__(...))) or #pragma target ... you might see it more
> than once.

OK, considering we only get this warning once in a simple case, I'm inclined
not to keep a static variable for it; that matches what we currently do for
option-conflict error emission.  But I'm fine with either.


>>>else
>>>  {
>>> -  size_t i;
>>>enum processor_type tune_proc
>>> = (TARGET_POWERPC64 ? PROCESSOR_DEFAULT64 : PROCESSOR_DEFAULT);
>>>  
>>> -  tune_index = -1;
>>> -  for (i = 0; i < ARRAY_SIZE (processor_target_table); i++)
>>> -   if (processor_target_table[i].processor == tune_proc)
>>> - {
>>> -   tune_index = i;
>>> -   break;
>>> - }
>>> +  tune_index = rs600_cpu_index_lookup (tune_proc == PROCESSOR_FUTURE
>>> +  ? PROCESSOR_POWER10
>>> +  : tune_proc);
>>
>> This part looks useless, as tune_proc is impossible to be PROCESSOR_FUTURE.
> 
> Well in theory, you could configure the compiler with --with-cpu=future or
> --with-tune=future.

Sorry for the possible confusion here, the "tune_proc" that I referred to is
the variable in the above else branch:

   enum processor_type tune_proc = (TARGET_POWERPC64 ? PROCESSOR_DEFAULT64 : 
PROCESSOR_DEFAULT);

It's either PROCESSOR_DEFAULT64 or PROCESSOR_DEFAULT, so it doesn't have a
chance to be PROCESSOR_FUTURE, so the checking "tune_proc == PROCESSOR_FUTURE"
is useless.

That's why I suggested the flow below: it does one final check outside of
those branches, which looks a bit clearer IMHO.

> 
>>>  }
>>
>> Maybe re-structure the above into:
>>
>> bool explicit_tune = false;
>> if (rs6000_tune_index >= 0)
>>   {
>> tune_index = rs6000_tune_index;
>> explicit_tune = true;
>>   }
>> else if (cpu_index >= 0)
>>   // as before
>>   rs6000_tune_index = tune_index = cpu_index;
>> else
>>   {
>>//as before
>>...
>>   }
>>
>> // Check tune_index here instead.
>>
>> if (processor_target_table[tune_index].processor == PROCESSOR_FUTURE)
>>   {
>> tune_index = rs6000_cpu_index_lookup (PROCESSOR_POWER10);
>> if (explicit_tune)
>>   warn ...
>>   }
>>
>> // as before
>> rs6000_tune = processor_target_table[tune_index].processor;
>>
>>>  


BR,
Kewen



Re: [PATCH v2] rs6000: Rework option -mpowerpc64 handling [PR106680]

2024-02-05 Thread Kewen.Lin
Hi Sebastian,

on 2024/2/5 18:38, Sebastian Huber wrote:
> Hello,
> 
> On 27.12.22 11:16, Kewen.Lin via Gcc-patches wrote:
>> Hi Segher,
>>
>> on 2022/12/24 04:26, Segher Boessenkool wrote:
>>> Hi!
>>>
>>> On Wed, Oct 12, 2022 at 04:12:21PM +0800, Kewen.Lin wrote:
>>>> PR106680 shows that -m32 -mpowerpc64 is different from
>>>> -mpowerpc64 -m32, this is determined by the way how we
>>>> handle option powerpc64 in rs6000_handle_option.
>>>>
>>>> Segher pointed out this difference should be taken as
>>>> a bug and we should ensure that option powerpc64 is
>>>> independent of -m32/-m64.  So this patch removes the
>>>> handlings in rs6000_handle_option and add some necessary
>>>> supports in rs6000_option_override_internal instead.
>>>
>>> Sorry for the late review.
>>>
>>>> +  /* Don't expect powerpc64 enabled on those OSes with 
>>>> OS_MISSING_POWERPC64,
>>>> + since they don't support saving the high part of 64-bit registers on
>>>> + context switch.  If the user explicitly specifies it, we won't 
>>>> interfere
>>>> + with the user's specification.  */
>>>
>>> It depends on the OS, and what you call "context switch".  For example
>>> on Linux the context switches done by the kernel are fine, only things
>>> done by setjmp/longjmp and getcontext/setcontext are not.  So just be a
>>> bit more vague here?  "Since they do not save and restore the high half
>>> of the GPRs correctly in all cases", something like that?
>>>
>>> Okay for trunk like that.  Thanks!
>>>
>>
>> Thanks!  Adjusted as you suggested and committed in r13-4894-gacc727cf02a144.
> 
> I am a bit late, however, this broke the 32-bit support for -mcpu=e6500. For 
> RTEMS, I have the following multilibs:
> 
> MULTILIB_REQUIRED += mcpu=e6500/m32
> MULTILIB_REQUIRED += mcpu=e6500/m32/mvrsave
> MULTILIB_REQUIRED += mcpu=e6500/m32/msoft-float/mno-altivec
> MULTILIB_REQUIRED += mcpu=e6500/m64
> MULTILIB_REQUIRED += mcpu=e6500/m64/mvrsave
> 
> I configured GCC as a bi-arch compiler (32-bit and 64-bit). It seems you 
> removed the -m32 handling, so I am not sure how to approach this issue. I 
> added a test case to the PR:
> 
> https://gcc.gnu.org/bugzilla/show_bug.cgi?id=106680

Thanks for reporting, I'll have a look at it (but I'm about to go on vacation,
so responses may be slow).

I'm not sure what's happened in bugzilla recently, but I didn't receive any
mail notifications on your comments #c5 and #c6 (sorry for the late response).
Since PR106680 is in state resolved, maybe it's good to file a new one for
further tracking. :)

BR,
Kewen



Re: Repost [PATCH 6/6] PowerPC: Add support for 1,024 bit DMR registers.

2024-02-04 Thread Kewen.Lin
Hi Mike,

on 2024/1/6 07:42, Michael Meissner wrote:
> This patch is a preliminary patch to add the full 1,024 bit dense math
> registers (DMRs) for -mcpu=future.  The MMA 512-bit accumulators map onto
> the top of the DMR register.
> 
> This patch only adds the new 1,024 bit register support.  It does not add
> support for any instructions that need 1,024 bit registers instead of 512 bit
> registers.
> 
> I used the new mode 'TDOmode' to be the opaque mode used for 1,204 bit

typo: 1,204

> registers.  The 'wD' constraint added in previous patches is used for these
> registers.  I added support to do load and store of DMRs via the VSX 
> registers,
> since there are no load/store dense math instructions.  I added the new 
> keyword
> '__dmr' to create 1,024 bit types that can be loaded into DMRs.  At present, I
> don't have aliases for __dmr512 and __dmr1024 that we've discussed internally.
> 
> The patches have been tested on both little and big endian systems.  Can I 
> check
> it into the master branch?
> 
> 2024-01-05   Michael Meissner  
> 
> gcc/
> 
>   * config/rs6000/mma.md (UNSPEC_DM_INSERT512_UPPER): New unspec.
>   (UNSPEC_DM_INSERT512_LOWER): Likewise.
>   (UNSPEC_DM_EXTRACT512): Likewise.
>   (UNSPEC_DMR_RELOAD_FROM_MEMORY): Likewise.
>   (UNSPEC_DMR_RELOAD_TO_MEMORY): Likewise.
>   (movtdo): New define_expand and define_insn_and_split to implement 1,024
>   bit DMR registers.
>   (movtdo_insert512_upper): New insn.
>   (movtdo_insert512_lower): Likewise.
>   (movtdo_extract512): Likewise.
>   (reload_dmr_from_memory): Likewise.
>   (reload_dmr_to_memory): Likewise.
>   * config/rs6000/rs6000-builtin.cc (rs6000_type_string): Add DMR
>   support.
>   (rs6000_init_builtins): Add support for __dmr keyword.
>   * config/rs6000/rs6000-call.cc (rs6000_return_in_memory): Add support
>   for TDOmode.
>   (rs6000_function_arg): Likewise.
>   * config/rs6000/rs6000-modes.def (TDOmode): New mode.
>   * config/rs6000/rs6000.cc (rs6000_hard_regno_nregs_internal): Add
>   support for TDOmode.
>   (rs6000_hard_regno_mode_ok_uncached): Likewise.
>   (rs6000_hard_regno_mode_ok): Likewise.
>   (rs6000_modes_tieable_p): Likewise.
>   (rs6000_debug_reg_global): Likewise.
>   (rs6000_setup_reg_addr_masks): Likewise.
>   (rs6000_init_hard_regno_mode_ok): Add support for TDOmode.  Setup reload
>   hooks for DMR mode.
>   (reg_offset_addressing_ok_p): Add support for TDOmode.
>   (rs6000_emit_move): Likewise.
>   (rs6000_secondary_reload_simple_move): Likewise.
>   (rs6000_secondary_reload_class): Likewise.
>   (rs6000_mangle_type): Add mangling for __dmr type.
>   (rs6000_dmr_register_move_cost): Add support for TDOmode.
>   (rs6000_split_multireg_move): Likewise.
>   (rs6000_invalid_conversion): Likewise.
>   * config/rs6000/rs6000.h (VECTOR_ALIGNMENT_P): Add TDOmode.
>   (enum rs6000_builtin_type_index): Add DMR type nodes.
>   (dmr_type_node): Likewise.
>   (ptr_dmr_type_node): Likewise.
> 
> gcc/testsuite/
> 
>   * gcc.target/powerpc/dm-1024bit.c: New test.
> ---
>  gcc/config/rs6000/mma.md  | 152 ++
>  gcc/config/rs6000/rs6000-builtin.cc   |  13 ++
>  gcc/config/rs6000/rs6000-call.cc  |  13 +-
>  gcc/config/rs6000/rs6000-modes.def|   4 +
>  gcc/config/rs6000/rs6000.cc   | 135 
>  gcc/config/rs6000/rs6000.h|   7 +-
>  gcc/testsuite/gcc.target/powerpc/dm-1024bit.c |  63 
>  7 files changed, 351 insertions(+), 36 deletions(-)
>  create mode 100644 gcc/testsuite/gcc.target/powerpc/dm-1024bit.c
> 
> diff --git a/gcc/config/rs6000/mma.md b/gcc/config/rs6000/mma.md
> index f06e6bbb184..37de9030903 100644
> --- a/gcc/config/rs6000/mma.md
> +++ b/gcc/config/rs6000/mma.md
> @@ -92,6 +92,11 @@ (define_c_enum "unspec"
> UNSPEC_MMA_XXMFACC
> UNSPEC_MMA_XXMTACC
> UNSPEC_DM_ASSEMBLE_ACC
> +   UNSPEC_DM_INSERT512_UPPER
> +   UNSPEC_DM_INSERT512_LOWER
> +   UNSPEC_DM_EXTRACT512
> +   UNSPEC_DMR_RELOAD_FROM_MEMORY
> +   UNSPEC_DMR_RELOAD_TO_MEMORY
>])
>  
>  (define_c_enum "unspecv"
> @@ -879,3 +884,150 @@ (define_insn "mma_"
>[(set_attr "type" "mma")
> (set_attr "prefixed" "yes")
> (set_attr "isa" "dm,not_dm,not_dm")])
> +
> +
> +;; TDOmode (i.e. __dmr).
> +(define_expand "movtdo"
> +  [(set (match_operand:TDO 0 "nonimmediate_operand")
> + (match_operand:TDO 1 "input_operand"))]
> +  "TARGET_DENSE_MATH"
> +{
> +  rs6000_emit_move (operands[0], operands[1], TDOmode);
> +  DONE;
> +})
> +
> +(define_insn_and_split "*movtdo"
> +  [(set (match_operand:TDO 0 "nonimmediate_operand" "=wa,m,wa,wD,wD,wa")
> + (match_operand:TDO 1 "input_operand" "m,wa,wa,wa,wD,wD"))]
> +  "TARGET_DENSE_MATH
> +   && (gpc_reg_operand (operands[0], TDOmode)
> +   || gpc_reg_operand (operands[1], TDOmode))"
> +  "@
> +   #

Re: Repost [PATCH 5/6] PowerPC: Switch to dense math names for all MMA operations.

2024-02-03 Thread Kewen.Lin
Hi Mike,

on 2024/1/6 07:40, Michael Meissner wrote:
> This patch changes the assembler instruction names for MMA instructions from
> the original name used in power10 to the new name when used with the dense 
> math
> system.  I.e. xvf64gerpp becomes dmxvf64gerpp.  The assembler will emit the
> same bits for either spelling.
> 
> The patches have been tested on both little and big endian systems.  Can I 
> check
> it into the master branch?
> 
> 2024-01-05   Michael Meissner  
> 
> gcc/
> 
>   * config/rs6000/mma.md (vvi4i4i8_dm): New int attribute.
>   (avvi4i4i8_dm): Likewise.
>   (vvi4i4i2_dm): Likewise.
>   (avvi4i4i2_dm): Likewise.
>   (vvi4i4_dm): Likewise.
>   (avvi4i4_dm): Likewise.
>   (pvi4i2_dm): Likewise.
>   (apvi4i2_dm): Likewise.
>   (vvi4i4i4_dm): Likewise.
>   (avvi4i4i4_dm): Likewise.
>   (mma_): Add support for running on DMF systems, generating the dense
>   math instruction and using the dense math accumulators.
>   (mma_): Likewise.
>   (mma_): Likewise.
>   (mma_): Likewise.
>   (mma_): Likewise.
>   (mma_): Likewise.
>   (mma_): Likewise.
>   (mma_): Likewise.
>   (mma_): Likewise.
>   (mma_   (mma_): Likewise.
>   (mma_   (mma_): Likewise.
>   (mma_): Likewise.
> 
> gcc/testsuite/
> 
>   * gcc.target/powerpc/dm-double-test.c: New test.
>   * lib/target-supports.exp (check_effective_target_ppc_dmr_ok): New
>   target test.
> ---
>  gcc/config/rs6000/mma.md  |  98 +++--
>  .../gcc.target/powerpc/dm-double-test.c   | 194 ++
>  gcc/testsuite/lib/target-supports.exp |  19 ++
>  3 files changed, 299 insertions(+), 12 deletions(-)
>  create mode 100644 gcc/testsuite/gcc.target/powerpc/dm-double-test.c
> 
> diff --git a/gcc/config/rs6000/mma.md b/gcc/config/rs6000/mma.md
> index 525a85146ff..f06e6bbb184 100644
> --- a/gcc/config/rs6000/mma.md
> +++ b/gcc/config/rs6000/mma.md
> @@ -227,13 +227,22 @@ (define_int_attr apv[(UNSPEC_MMA_XVF64GERPP 
> "xvf64gerpp")
>  
>  (define_int_attr vvi4i4i8[(UNSPEC_MMA_PMXVI4GER8 "pmxvi4ger8")])
>  
> +(define_int_attr vvi4i4i8_dm [(UNSPEC_MMA_PMXVI4GER8 
> "pmdmxvi4ger8")])

Can we update vvi4i4i8 to

(define_int_attr vvi4i4i8   [(UNSPEC_MMA_PMXVI4GER8 "xvi4ger8")])

by avoiding to introduce vvi4i4i8_dm, then its use places would be like:

-  " %A0,%x1,%x2,%3,%4,%5"
+  "@
+   pmdm %A0,%x1,%x2,%3,%4,%5
+   pm %A0,%x1,%x2,%3,%4,%5
+   pm %A0,%x1,%x2,%3,%4,%5"

and 

- define_insn "mma_"
+ define_insn "mma_pm"

(or updating its use in corresponding bif expander field)

?  

This comment is also applied for the other iterators changes.

> +
>  (define_int_attr avvi4i4i8   [(UNSPEC_MMA_PMXVI4GER8PP   
> "pmxvi4ger8pp")])
>  
> +(define_int_attr avvi4i4i8_dm[(UNSPEC_MMA_PMXVI4GER8PP   
> "pmdmxvi4ger8pp")])
> +
>  (define_int_attr vvi4i4i2[(UNSPEC_MMA_PMXVI16GER2"pmxvi16ger2")
>(UNSPEC_MMA_PMXVI16GER2S   "pmxvi16ger2s")
>(UNSPEC_MMA_PMXVF16GER2"pmxvf16ger2")
>(UNSPEC_MMA_PMXVBF16GER2   
> "pmxvbf16ger2")])
>  
> +(define_int_attr vvi4i4i2_dm [(UNSPEC_MMA_PMXVI16GER2"pmdmxvi16ger2")
> +  (UNSPEC_MMA_PMXVI16GER2S   
> "pmdmxvi16ger2s")
> +  (UNSPEC_MMA_PMXVF16GER2"pmdmxvf16ger2")
> +  (UNSPEC_MMA_PMXVBF16GER2   
> "pmdmxvbf16ger2")])
> +
>  (define_int_attr avvi4i4i2   [(UNSPEC_MMA_PMXVI16GER2PP  "pmxvi16ger2pp")
>(UNSPEC_MMA_PMXVI16GER2SPP 
> "pmxvi16ger2spp")
>(UNSPEC_MMA_PMXVF16GER2PP  "pmxvf16ger2pp")
> @@ -245,25 +254,54 @@ (define_int_attr avvi4i4i2  
> [(UNSPEC_MMA_PMXVI16GER2PP  "pmxvi16ger2pp")
>(UNSPEC_MMA_PMXVBF16GER2NP 
> "pmxvbf16ger2np")
>(UNSPEC_MMA_PMXVBF16GER2NN 
> "pmxvbf16ger2nn")])
>  
> +(define_int_attr avvi4i4i2_dm[(UNSPEC_MMA_PMXVI16GER2PP  
> "pmdmxvi16ger2pp")
> +  (UNSPEC_MMA_PMXVI16GER2SPP 
> "pmdmxvi16ger2spp")
> +  (UNSPEC_MMA_PMXVF16GER2PP  
> "pmdmxvf16ger2pp")
> +  (UNSPEC_MMA_PMXVF16GER2PN  
> "pmdmxvf16ger2pn")
> +  (UNSPEC_MMA_PMXVF16GER2NP  
> "pmdmxvf16ger2np")
> +  (UNSPEC_MMA_PMXVF16GER2NN  
> "pmdmxvf16ger2nn")
> +  (UNSPEC_MMA_PMXVBF16GER2PP 
> "pmdmxvbf16ger2pp")
> +  (UNSPEC_MMA_PMXVBF16GER2PN 
> "pmdmxvbf16ger2pn")
> +  (UNSPEC_MMA_PMXVBF16GER2NP 
> "pmdmxvbf16ger2np")
> +  (UNSPEC_MMA_PMXVBF16GER2NN 
> 

Re: Repost [PATCH 4/6] PowerPC: Make MMA insns support DMR registers.

2024-02-03 Thread Kewen.Lin
Hi Mike,

on 2024/1/6 07:39, Michael Meissner wrote:
> This patch changes the MMA instructions to use either FPR registers
> (-mcpu=power10) or DMRs (-mcpu=future).  In this patch, the existing MMA
> instruction names are used.
> 
> A macro (__PPC_DMR__) is defined if the MMA instructions use the DMRs.
> 
> The patches have been tested on both little and big endian systems.  Can I 
> check
> it into the master branch?
> 
> 2024-01-05   Michael Meissner  
> 
> gcc/
> 
>   * config/rs6000/mma.md (mma_): New define_expand to handle
>   mma_ for dense math and non dense math.
>   (mma_ insn): Restrict to non dense math.
>   (mma_xxsetaccz): Convert to define_expand to handle non dense math and
>   dense math.
>   (mma_xxsetaccz_vsx): Rename from mma_xxsetaccz and restrict usage to non
>   dense math.
>   (mma_xxsetaccz_dm): Dense math version of mma_xxsetaccz.
>   (mma_): Add support for dense math.
>   (mma_): Likewise.
>   (mma_): Likewise.
>   (mma_): Likewise.
>   (mma_): Likewise.
>   (mma_): Likewise.
>   (mma_): Likewise.
>   (mma_): Likewise.
>   (mma_): Likewise.
>   (mma_): Likewise.
>   (mma_): Likewise.
>   (mma_): Likewise.
>   (mma_): Likewise.
>   (mma_): Likewise.
>   * config/rs6000/rs6000-c.cc (rs6000_target_modify_macros): Define
>   __PPC_DMR__ if we have dense math instructions.
>   * config/rs6000/rs6000.cc (print_operand): Make %A handle only DMRs if
>   dense math and only FPRs if not dense math.
>   (rs6000_split_multireg_move): Do not generate the xxmtacc instruction to
>   prime the DMR registers or the xxmfacc instruction to de-prime
>   instructions if we have dense math register support.
> ---
>  gcc/config/rs6000/mma.md  | 247 +-
>  gcc/config/rs6000/rs6000-c.cc |   3 +
>  gcc/config/rs6000/rs6000.cc   |  35 ++---
>  3 files changed, 176 insertions(+), 109 deletions(-)
> 
> diff --git a/gcc/config/rs6000/mma.md b/gcc/config/rs6000/mma.md
> index bb898919ab5..525a85146ff 100644
> --- a/gcc/config/rs6000/mma.md
> +++ b/gcc/config/rs6000/mma.md
> @@ -559,190 +559,249 @@ (define_insn "*mma_disassemble_acc_dm"
>"dmxxextfdmr256 %0,%1,2"
>[(set_attr "type" "mma")])
>  
> -(define_insn "mma_"
> +;; MMA instructions that do not use their accumulators as an input, still 
> must
> +;; not allow their vector operands to overlap the registers used by the
> +;; accumulator.  We enforce this by marking the output as early clobber.  If 
> we
> +;; have dense math, we don't need the whole prime/de-prime action, so just 
> make
> +;; thse instructions be NOPs.

typo: thse.

> +
> +(define_expand "mma_"
> +  [(set (match_operand:XO 0 "register_operand")
> + (unspec:XO [(match_operand:XO 1 "register_operand")]

s/register_operand/accumulator_operand/?

> +MMA_ACC))]
> +  "TARGET_MMA"
> +{
> +  if (TARGET_DENSE_MATH)
> +{
> +  if (!rtx_equal_p (operands[0], operands[1]))
> + emit_move_insn (operands[0], operands[1]);
> +  DONE;
> +}
> +
> +  /* Generate the prime/de-prime code.  */
> +})
> +
> +(define_insn "*mma_"

May be better to name with "*mma__nodm"?

>[(set (match_operand:XO 0 "fpr_reg_operand" "=")
>   (unspec:XO [(match_operand:XO 1 "fpr_reg_operand" "0")]
>   MMA_ACC))]
> -  "TARGET_MMA"
> +  "TARGET_MMA && !TARGET_DENSE_MATH"

I found that "TARGET_MMA && !TARGET_DENSE_MATH" is used a lot (e.g. the changes
in function rs6000_split_multireg_move in this patch and some places in previous
patches); maybe we can introduce a macro named TARGET_MMA_NODM as shorthand for
it?

>" %A0"
>[(set_attr "type" "mma")])
>  
>  ;; We can't have integer constants in XOmode so we wrap this in an
> -;; UNSPEC_VOLATILE.
> +;; UNSPEC_VOLATILE for the non-dense math case.  For dense math, we don't 
> need
> +;; to disable optimization and we can do a normal UNSPEC.
>  
> -(define_insn "mma_xxsetaccz"
> -  [(set (match_operand:XO 0 "fpr_reg_operand" "=d")
> +(define_expand "mma_xxsetaccz"
> +  [(set (match_operand:XO 0 "register_operand")

s/register_operand/accumulator_operand/?

>   (unspec_volatile:XO [(const_int 0)]
>   UNSPECV_MMA_XXSETACCZ))]
>"TARGET_MMA"
> +{
> +  if (TARGET_DENSE_MATH)
> +{
> +  emit_insn (gen_mma_xxsetaccz_dm (operands[0]));
> +  DONE;
> +}
> +})
> +
> +(define_insn "*mma_xxsetaccz_vsx"

s/vsx/nodm/

> +  [(set (match_operand:XO 0 "fpr_reg_operand" "=d")
> + (unspec_volatile:XO [(const_int 0)]
> + UNSPECV_MMA_XXSETACCZ))]
> +  "TARGET_MMA && !TARGET_DENSE_MATH"
>"xxsetaccz %A0"
>[(set_attr "type" "mma")])
>  
> +
> +(define_insn "mma_xxsetaccz_dm"
> +  [(set (match_operand:XO 0 "dmr_operand" "=wD")
> + (unspec:XO [(const_int 0)]
> +UNSPECV_MMA_XXSETACCZ))]
> +  "TARGET_DENSE_MATH"
> +  "dmsetdmrz %0"
> +  [(set_attr "type" "mma")])
> +
>  (define_insn "mma_"
> -  

Re: [PATCH] testsuite: Fix vect_long_mult on Power [PR109705]

2024-01-28 Thread Kewen.Lin
on 2024/1/27 06:42, Andrew Pinski wrote:
> On Mon, Jan 15, 2024 at 6:43 PM Kewen.Lin  wrote:
>>
>> Hi,
>>
>> As pointed out by the discussion in PR109705, the current
>> vect_long_mult effective target check on Power is broken.
>> This patch is to fix it accordingly.
>>
>> With additional change by adding a guard vect_long_mult
>> in gcc.dg/vect/pr25413a.c , it's tested well on Power{8,9}
>> LE & BE (also on Power10 LE as before).
> 
> I see this is still broken for 32bit PowerPC where vect_long_mult
> should return true still since long there is 32bit and there is a
> 32bit vector multiply.
> Can someone test (and apply if approved) the attached patch to see if
> it fixes pr25413a.c for powerpc*-*-* for 32bit?

Thanks for fixing, it works perfectly as tested.

I just pushed it as r14-8485 (also updating with a tab and commit log).

BR,
Kewen

> 
> Thanks,
> Andrew Pinski
> 
>>
>> I'm going to push this soon.
>>
>> BR,
>> Kewen
>> -
>> PR testsuite/109705
>>
>> gcc/testsuite/ChangeLog:
>>
>> * lib/target-supports.exp (check_effective_target_vect_long_mult):
>> Fix powerpc*-*-* checks.
>> ---
>>  gcc/testsuite/lib/target-supports.exp | 6 +++---
>>  1 file changed, 3 insertions(+), 3 deletions(-)
>>
>> diff --git a/gcc/testsuite/lib/target-supports.exp 
>> b/gcc/testsuite/lib/target-supports.exp
>> index 81ae92a0266..fac32fb3d0e 100644
>> --- a/gcc/testsuite/lib/target-supports.exp
>> +++ b/gcc/testsuite/lib/target-supports.exp
>> @@ -9073,9 +9073,9 @@ proc check_effective_target_vect_int_mult { } {
>>
>>  proc check_effective_target_vect_long_mult { } {
>>  if { [istarget i?86-*-*] || [istarget x86_64-*-*]
>> -|| (([istarget powerpc*-*-*]
>> -  && ![istarget powerpc-*-linux*paired*])
>> -  && [check_effective_target_ilp32])
>> +|| ([istarget powerpc*-*-*]
>> + && [check_effective_target_powerpc_vsx_ok]
>> + && [check_effective_target_has_arch_pwr10])
>>  || [is-effective-target arm_neon]
>>  || ([istarget sparc*-*-*] && [check_effective_target_ilp32])
>>  || [istarget aarch64*-*-*]
>> --
>> 2.39.1


Re: Repost [PATCH 3/6] PowerPC: Add support for accumulators in DMR registers.

2024-01-25 Thread Kewen.Lin
Hi Mike,

on 2024/1/6 07:38, Michael Meissner wrote:
> The MMA subsystem added the notion of accumulator registers as an optional
> feature of ISA 3.1 (power10).  In ISA 3.1, these accumulators overlapped with
> the traditional floating point registers 0..31, but logically the accumulator
> registers were separate from the FPR registers.  In ISA 3.1, it was 
> anticipated

Using "VSX registers 0..31" rather than "traditional floating point registers
0..31" seems clearer, since "floating point registers" implies 64-bit registers.

> that in future systems, the accumulator registers may no overlap with the FPR
> registers.  This patch adds the support for dense math registers as separate
> registers.
> 
> This particular patch does not change the MMA support to use the accumulators
> within the dense math registers.  This patch just adds the basic support for
> having separate DMRs.  The next patch will switch the MMA support to use the
> accumulators if -mcpu=future is used.
> 
> For testing purposes, I added an undocumented option '-mdense-math' to enable
> or disable the dense math support.

Can we avoid this and use one macro for it instead?  As you might have noticed,
some previous temporary options like -mpower{8,9}-vector caused ICEs due to
unexpected option combinations and we are going to neuter them, so let's try
our best to avoid this if possible.  I guess one macro TARGET_DENSE_MATH
defined as TARGET_FUTURE && TARGET_MMA would match all use places?  Then
specifying -mcpu=future can enable it while -mcpu=power10 can disable it.

> 
> This patch adds a new constraint (wD).  If MMA is selected but dense math is
> not selected (i.e. -mcpu=power10), the wD constraint will allow access to
> accumulators that overlap with the VSX vector registers 0..31.  If both MMA 
> and

Sorry for nitpicking, it's more accurate with "VSX registers 0..31".

> dense math are selected (i.e. -mcpu=future), the wD constraint will only allow
> dense math registers.
> 
> This patch modifies the existing %A output modifier.  If MMA is selected but
> dense math is not selected, then %A output modifier converts the VSX register
> number to the accumulator number, by dividing it by 4.  If both MMA and dense
> math are selected, then %A will map the separate DMR registers into 0..7.
> 
> The intention is that user code using extended asm can be modified to run on
> both MMA without dense math and MMA with dense math:
> 
> 1)If possible, don't use extended asm, but instead use the MMA 
> built-in
>   functions;
> 
> 2)If you do need to write extended asm, change the d constraints
>   targetting accumulators should now use wD;
> 
> 3)Only use the built-in zero, assemble and disassemble functions 
> create
>   move data between vector quad types and dense math accumulators.
>   I.e. do not use the xxmfacc, xxmtacc, and xxsetaccz directly in the
>   extended asm code.  The reason is these instructions assume there is a
>   1-to-1 correspondence between 4 adjacent FPR registers and an
>   accumulator that overlaps with those instructions.  With accumulators
>   now being separate registers, there no longer is a 1-to-1
>   correspondence.
> 
> It is possible that the mangling for DMRs and the GDB register numbers may
> change in the future.
> 
> 2024-01-05   Michael Meissner  
> 
> gcc/
> 
>   * config/rs6000/constraints.md (wD constraint): New constraint.
>   * config/rs6000/mma.md (UNSPEC_DM_ASSEMBLE_ACC): New unspec.
>   (movxo): Convert into define_expand.
>   (movxo_vsx): Version of movxo where accumulators overlap with VSX vector
>   registers 0..31.
>   (movxo_dm): Verson of movxo that supports separate dense math
>   accumulators.
>   (mma_assemble_acc): Add dense math support to define_expand.
>   (mma_assemble_acc_vsx): Rename from mma_assemble_acc, and restrict it to
>   non dense math systems.
>   (mma_assemble_acc_dm): Dense math version of mma_assemble_acc.
>   (mma_disassemble_acc): Add dense math support to define_expand.
>   (mma_disassemble_acc_vsx): Rename from mma_disassemble_acc, and restrict
>   it to non dense math systems.
>   (mma_disassemble_acc_dm): Dense math version of mma_disassemble_acc.
>   * config/rs6000/predicates.md (dmr_operand): New predicate.
>   (accumulator_operand): Likewise.
>   * config/rs6000/rs6000-cpus.def (ISA_FUTURE_MASKS): Add -mdense-math.
>   (POWERPC_MASKS): Likewise.
>   * config/rs6000/rs6000.cc (enum rs6000_reg_type): Add DMR_REG_TYPE.
>   (enum rs6000_reload_reg_type): Add RELOAD_REG_DMR.
>   (LAST_RELOAD_REG_CLASS): Add support for DMR registers and the wD
>   constraint.
>   (reload_reg_map): Likewise.
>   (rs6000_reg_names): Likewise.
>   (alt_reg_names): Likewise.
>   (rs6000_hard_regno_nregs_internal): Likewise.
>   (rs6000_hard_regno_mode_ok_uncached): Likewise.
>   (rs6000_debug_reg_global): 

Re: [PATCH] testsuite: Make pr104992.c irrelated to target vector feature [PR113418]

2024-01-24 Thread Kewen.Lin
Hi,

Thanks for adjusting this.

on 2024/1/24 19:42, Xi Ruoyao wrote:
> On Wed, 2024-01-24 at 19:08 +0800, chenxiaolong wrote:
>> At 19:00 +0800 on Wednesday, 2024-01-24, Xi Ruoyao wrote:
>>> On Wed, 2024-01-24 at 18:32 +0800, chenxiaolong wrote:
 On 20:09 +0800 on Tuesday, 2024-01-23, Xi Ruoyao wrote:
> The vect_int_mod target selector is evaluated with the options in
> DEFAULT_VECTCFLAGS in effect, but these options are not
> automatically
> passed to tests out of the vect directories.  So this test fails
> on
> targets where integer vector modulo operation is supported but
> requiring
> an option to enable, for example LoongArch.
>
> In this test case, the only expected optimization not happened in
> original is in corge because it needs forward propogation.  So we
> can
> scan the forwprop2 dump (where the vector operation is not
> expanded
> to
> scalars yet) instead of optimized, then we don't need to consider
> vect_int_mod or not.
>
> gcc/testsuite/ChangeLog:
>
>   PR testsuite/113418
>   * gcc.dg/pr104992.c (dg-options): Use -fdump-tree-forwprop2
>   instead of -fdump-tree-optimized.
>   (dg-final): Scan forwprop2 dump instead of optimized, and
> remove
>   the use of vect_int_mod.
> ---
>
> This fixes the test failure on loongarch64-linux-gnu, and I've
> also
> tested it on x86_64-linux-gnu.  Ok for trunk?
>
>  gcc/testsuite/gcc.dg/pr104992.c | 5 ++---
>  1 file changed, 2 insertions(+), 3 deletions(-)
>
> diff --git a/gcc/testsuite/gcc.dg/pr104992.c
> b/gcc/testsuite/gcc.dg/pr104992.c
> index 82f8c75559c..6fd513d34b2 100644
> --- a/gcc/testsuite/gcc.dg/pr104992.c
> +++ b/gcc/testsuite/gcc.dg/pr104992.c
> @@ -1,6 +1,6 @@
>  /* PR tree-optimization/104992 */
>  /* { dg-do compile } */
> -/* { dg-options "-O2 -Wno-psabi -fdump-tree-optimized" } */
> +/* { dg-options "-O2 -Wno-psabi -fdump-tree-forwprop2" } */
>  
>  #define vector __attribute__((vector_size(4*sizeof(int
>  
> @@ -54,5 +54,4 @@ __attribute__((noipa)) unsigned waldo (unsigned
> x,
> unsigned y, unsigned z) {
>  return x / y * z == x;
>  }
>  
> -/* { dg-final { scan-tree-dump-times " % " 9 "optimized" {
> target {
> ! vect_int_mod } } } } */
> -/* { dg-final { scan-tree-dump-times " % " 6 "optimized" {
> target
> vect_int_mod } } } */
> +/* { dg-final { scan-tree-dump-times " % " 6 "forwprop2" } } */

 Hello, currently vect_int_mod vectorization operation detection
 only
 ppc,amd,riscv,LoongArch architecture support. When -fdump-tree-
 forwprop2 is used instead of -fdump-tree-optimized, The
 check_effective_target_vect_int_mod procedure defined in the
 target-
 supports.exp file will never be called. It will only be called on
 pr104992.c, should we consider supporting other architectures?
>>>
>>> Hmm, then we should remove check_effective_target_vect_int_mod.
>>>
>>> If we want to keep -fdump-tree-optimized for this test case and also
>>> make it correct, we'll at least have to move it into vect/, and write
>>> something like
>>>
>>> { dg-final { scan-tree-dump-times " % " 9 "optimized" { target { !
>>> vect_int_mod } } } }
>>> { dg-final { scan-tree-dump-times " % " 6 "optimized" { target {
>>> vect_int_mod && vect128 } } } }
>>> { dg-final { scan-tree-dump-times " % " 7 "optimized" { target {
>>> vect_int_mod && vect64 && !vect128 } } } }
>>>
>>> and how about vect256 etc?  This would be very nasty and deviating
>>> from
>>> the original purpose of this test case (against PR104992, which is a
>>> missed-optimization issue unrelated to vectors).
>>>
>> Ok, let me think about how to make the pr104992.c test case more
>> reasonable.
> 
> It *is* reasonable with -fdump-tree-forwprop2.  It's purposed to test a
> / b * b -> a - a % b simplification, not vector operations.

I agree.  Dropping check_effective_target_vect_int_mod is fine for Power,
as we have separate testing coverage for vector int mod capability in the
target specific testsuite.  But I'm not sure whether the same holds for the
other ports which enable vect_int_mod.  A conservative approach is to add
one new test case into vect/ as you proposed below.

BR,
Kewen

> 
> If we need a test to test vector int modulo operations we should write a
> new test in vect/, like
> 
> /* ... */
> 
> for (int i = 0; i < 4; i++)
>   x[i] %= y[i];
> 
> /* ... */
> 
> /* { dg-final { scan-tree-dump "vectorized 1 loops" "vect" { target { vect_int_mod } } } } */
> 


Re: [PATCH, V2] PR target/112886, Add %S to print_operand for vector pair support.

2024-01-24 Thread Kewen.Lin
on 2024/1/24 23:51, Peter Bergner wrote:
> On 1/24/24 12:04 AM, Kewen.Lin wrote:
>> on 2024/1/24 11:11, Peter Bergner wrote:
>>> But not with this.  The -mdejagnu-cpu=power10 option already enables -mvsx.
>>> If the user explicitly forces -mno-vsx via RUNTESTFLAGS, then let them.
>>> The options set in RUNTESTFLAGS come after the options in the dg-options
>>> line, so even adding -mvsx like the above won't help the test case PASS
>>
>> But this is NOT true, at least on one of our internal Power10 machines
>> (ltcden2-lp1).
>>
>> With the command below:
>>   
>>   make check-gcc-c RUNTESTFLAGS="--target_board=unix/-mno-vsx 
>> powerpc.exp=pr112886.c"
>>
>> this test case fails without the explicit -mvsx in dg-options.
>>
>> From the verbose dumping, the compilation command looks like:
>>
>> /home/linkw/gcc/build/gcc-test-debug/gcc/xgcc 
>> -B/home/linkw/gcc/build/gcc-test-debug/gcc/
>> /home/linkw/gcc/gcc-test/gcc/testsuite/gcc.target/powerpc/pr112886.c  
>> -mno-vsx 
>> -fdiagnostics-plain-output  -mdejagnu-cpu=power10 -O2 -ffat-lto-objects 
>> -fno-ident -S
>> -o pr112886.s
>>
>> "-mno-vsx" comes **before** "-mdejagnu-cpu=power10 -O2" rather than 
>> **after**.
>>
>> I guess it might be due to different behaviors of different versions of
>> the runtest framework?
> 
> That is confusing, unless as you say, the behavior changed.  The whole reason 
> we added
> -mdejagnu-cpu= (and the dg-skip usage before that) was due to encountering 
> problems
> when the test case wanted a specific -mcpu= value and the user overrode it in 
> their
> RUNTESTFLAGS and that can only happen when its options come last on the 
> command line.

I think the behavior changed, or more accurately: the behavior isn't
consistent across all test environments (I suspect due to different
versions of runtest).

> 
> Then again, why didn't the powerpc_vsx_ok test not save us here?

Same reason: -mno-vsx comes before the appended "-mvsx", so the check passes.

> 
> 
> 
>> So there can be two cases with user explicitly specified -mno-vsx:
>>
>> 1) RUNTESTFLAGS comes after dg-options (assuming same order for -mvsx in 
>> powerpc_vsx_ok)
>>
>>   powerpc_vsx_ok test failed, so UNSUPPORTED
>>
>>   // with explicit -mvsx does nothing as you said.
>>
>> 2) RUNTESTFLAGS comes before dg-options
>>
>>   powerpc_vsx_ok test succeeds, but FAIL.
>>   
>>  // with the suggested -mvsx, it matches the powerpc_vsx_ok check and the
>> case doesn't fail.
>>
>> As above, I think we still need to append "-mvsx" explicitly.  As
>> tested/verified, it does keep the case from failing on ltcden2-lp1.
> 
> I'd like to verify that the behavior did change before we enforce adding that 
> option.

Great.

> The problem is, there are a HUGE number of older test cases that would
> need updating to "fix" them too.  ...and not just adding -mno-vsx, but
> -maltivec and basically any -mfoo option where the test case is expecting
> the feature foo to be used/tested.  It would be a huge mess.

I agree; such "-mno-foo" options are rarely used in testing, so this
doesn't get enough notice.

BR,
Kewen



Re: [PATCH, V2] PR target/112886, Add %S to print_operand for vector pair support.

2024-01-23 Thread Kewen.Lin
on 2024/1/24 11:11, Peter Bergner wrote:
> On 1/23/24 8:30 PM, Kewen.Lin wrote:
>>> -   output_operand_lossage ("invalid %%x value");
>>> +   output_operand_lossage ("invalid %%%c value", (code == 'S' ? 'S' : 'x'));
>>
>> Nit: Seems simpler with
>>   
>>   output_operand_lossage ("invalid %%%c value", (char) code);
> 
> Agreed, good catch.
> 
> 
> 
> 
>>> diff --git a/gcc/testsuite/gcc.target/powerpc/pr112886.c 
>>> b/gcc/testsuite/gcc.target/powerpc/pr112886.c
>>> new file mode 100644
>>> index 000..4e59dcda6ea
>>> --- /dev/null
>>> +++ b/gcc/testsuite/gcc.target/powerpc/pr112886.c
>>> @@ -0,0 +1,29 @@
>>> +/* { dg-do compile } */
>>> +/* { dg-require-effective-target power10_ok } */
>>
>> I think this needs one more:
>>
>> /* { dg-require-effective-target powerpc_vsx_ok } */
> 
> I agree with this...
> 
> 
> 
>>> +/* { dg-options "-mdejagnu-cpu=power10 -O2" } */
>>
>> ... and
>>
>> /* { dg-options "-mdejagnu-cpu=power10 -O2 -mvsx" } */
>>
>> , otherwise with explicit -mno-vsx, this test case would fail.
> 
> But not with this.  The -mdejagnu-cpu=power10 option already enables -mvsx.
> If the user explcitly forces -mno-vsx via RUNTESTFLAGS, then let them.
> The options set in RUNTESTFLAGS come after the options in the dg-options
> line, so even adding -mvsx like the above won't help the test case PASS

But this is NOT true, at least on one of our internal Power10 machines (ltcden2-lp1).

With the command below:
  
  make check-gcc-c RUNTESTFLAGS="--target_board=unix/-mno-vsx 
powerpc.exp=pr112886.c"

this test case fails without the explicit -mvsx in dg-options.

From the verbose dumping, the compilation command looks like:

/home/linkw/gcc/build/gcc-test-debug/gcc/xgcc 
-B/home/linkw/gcc/build/gcc-test-debug/gcc/
/home/linkw/gcc/gcc-test/gcc/testsuite/gcc.target/powerpc/pr112886.c  -mno-vsx 
-fdiagnostics-plain-output  -mdejagnu-cpu=power10 -O2 -ffat-lto-objects 
-fno-ident -S
-o pr112886.s

"-mno-vsx" comes **before** "-mdejagnu-cpu=power10 -O2" rather than **after**.

I guess it might be due to different behaviors of different versions of the
runtest framework?

So there can be two cases with user explicitly specified -mno-vsx:

1) RUNTESTFLAGS comes after dg-options (assuming same order for -mvsx in 
powerpc_vsx_ok)

  powerpc_vsx_ok test failed, so UNSUPPORTED

  // with explicit -mvsx does nothing as you said.

2) RUNTESTFLAGS comes before dg-options

  powerpc_vsx_ok test succeeds, but FAIL.
  
 // with the suggested -mvsx, it matches the powerpc_vsx_ok check and the
case doesn't fail.

As above, I think we still need to append "-mvsx" explicitly.  As
tested/verified, it does keep the case from failing on ltcden2-lp1.

BR,
Kewen

> if we didn't have the powerpc_vsx_ok test.  In other words, the -mvsx option
> doesn't help with anything.
> 
> All we need is the new powerpc_vsx_ok check and that will guard against the 
> FAIL
> in the case the user forces -mno-vsx.  In that case, we'll just get an 
> UNSUPPORTED
> and that is fine.
> 
> Peter
> 
> 
> 



Re: [PATCH, V2] PR target/112886, Add %S to print_operand for vector pair support.

2024-01-23 Thread Kewen.Lin
Hi Mike,

on 2024/1/12 01:29, Michael Meissner wrote:
> This is version 2 of the patch.  The only difference is I made the test case
> simpler to read.
> 
> In looking at support for load vector pair and store vector pair for the
> PowerPC in GCC, I noticed that we were missing a print_operand output
> modifier for printing the 2nd register of a vector pair.
> 
> If the instruction inside of the asm used the Altivec encoding, then we could
> use the %L modifier:
> 
>   __vector_pair *p, *q, *r;
>   // ...
>   __asm__ ("vaddudm %0,%1,%2\n\tvaddudm %L0,%L1,%L2"
>: "=v" (*p)
>: "v" (*q), "v" (*r));
> 
> Likewise, if we know the value to be in a traditional FPR register, %L will
> work for instructions that use the VSX encoding:
> 
>   __vector_pair *p, *q, *r;
>   // ...
>   __asm__ ("xvadddp %x0,%x1,%x2\n\txvadddp %L0,%L1,%L2"
>: "=f" (*p)
>: "f" (*q), "f" (*r));
> 
> But if you have a value that is in a traditional Altivec register, and the
> instruction uses the VSX encoding, %L will give a value between 0 and 31,
> when it should give a value between 32 and 63.
> 
> This patch adds %S that acts like %x, except that it adds 1 to the
> register number.
> 
> I have tested this on power10 and power9 little endian systems and on a power9
> big endian system.  There were no regressions in the patch.  Can I apply it to
> the trunk?
> 
> It would be nice if I could apply it to the open branches.  Can I backport it
> after a burn-in period?
> 
> 2024-01-10  Michael Meissner  
> 
> gcc/
> 
>   PR target/112886
>   * config/rs6000/rs6000.cc (print_operand): Add %S output modifier.
>   * doc/md.texi (Modifiers): Mention %S can be used like %x.
> 
> gcc/testsuite/
> 
>   PR target/112886
>   * gcc.target/powerpc/pr112886.c: New test.
> ---
>  gcc/config/rs6000/rs6000.cc | 10 ---
>  gcc/doc/md.texi |  5 ++--
>  gcc/testsuite/gcc.target/powerpc/pr112886.c | 29 +
>  3 files changed, 39 insertions(+), 5 deletions(-)
>  create mode 100644 gcc/testsuite/gcc.target/powerpc/pr112886.c
> 
> diff --git a/gcc/config/rs6000/rs6000.cc b/gcc/config/rs6000/rs6000.cc
> index 872df3140e3..6353a7ccfb2 100644
> --- a/gcc/config/rs6000/rs6000.cc
> +++ b/gcc/config/rs6000/rs6000.cc
> @@ -14504,13 +14504,17 @@ print_operand (FILE *file, rtx x, int code)
>   print_operand (file, x, 0);
>return;
>  
> +case 'S':
>  case 'x':
> -  /* X is a FPR or Altivec register used in a VSX context.  */
> +  /* X is a FPR or Altivec register used in a VSX context.  %x prints
> +  the VSX register number, %S prints the 2nd register number for
> +  vector pair, decimal 128-bit floating and IBM 128-bit binary floating
> +  values.  */
>if (!REG_P (x) || !VSX_REGNO_P (REGNO (x)))
> - output_operand_lossage ("invalid %%x value");
> + output_operand_lossage ("invalid %%%c value", (code == 'S' ? 'S' : 'x'));

Nit: Seems simpler with
  
  output_operand_lossage ("invalid %%%c value", (char) code);

>else
>   {
> -   int reg = REGNO (x);
> +   int reg = REGNO (x) + (code == 'S' ? 1 : 0);
> int vsx_reg = (FP_REGNO_P (reg)
>? reg - 32
>: reg - FIRST_ALTIVEC_REGNO + 32);
> diff --git a/gcc/doc/md.texi b/gcc/doc/md.texi
> index 47a87d6ceec..53ec957cb23 100644
> --- a/gcc/doc/md.texi
> +++ b/gcc/doc/md.texi
> @@ -3386,8 +3386,9 @@ A VSX register (VSR), @code{vs0}@dots{}@code{vs63}.  
> This is either an
>  FPR (@code{vs0}@dots{}@code{vs31} are @code{f0}@dots{}@code{f31}) or a VR
>  (@code{vs32}@dots{}@code{vs63} are @code{v0}@dots{}@code{v31}).
>  
> -When using @code{wa}, you should use the @code{%x} output modifier, so that
> -the correct register number is printed.  For example:
> +When using @code{wa}, you should use either the @code{%x} or @code{%S}
> +output modifier, so that the correct register number is printed.  For
> +example:

As I questioned in [1], "It seems there is no Power specific documentation on
operand modifiers like this %L?", even if we document this new "%S" as above,
users can still be confused.

Does it sound better with something like " ... @code{%x} or @code{%S} (for the
next consecutive register number in the context like vector pair etc.) ... "?

[1] https://gcc.gnu.org/pipermail/gcc-patches/2024-January/642414.html

>  
>  @smallexample
>  asm ("xvadddp %x0,%x1,%x2"

And we can also extend this by adding one more example like your associated
test case (hopefully that makes it clearer).

> diff --git a/gcc/testsuite/gcc.target/powerpc/pr112886.c 
> b/gcc/testsuite/gcc.target/powerpc/pr112886.c
> new file mode 100644
> index 000..4e59dcda6ea
> --- /dev/null
> +++ b/gcc/testsuite/gcc.target/powerpc/pr112886.c
> @@ -0,0 +1,29 @@
> +/* { dg-do compile } */
> +/* { dg-require-effective-target 

Re: Repost [PATCH 2/6] PowerPC: Make -mcpu=future enable -mblock-ops-vector-pair.

2024-01-23 Thread Kewen.Lin
on 2024/1/6 07:37, Michael Meissner wrote:
> This patch re-enables generating load and store vector pair instructions when
> doing certain memory copy operations when -mcpu=future is used.
> 
> During power10 development, it was determined that using store vector pair
> instructions was problematic in a few cases, so we disabled generating load
> and store vector pair instructions for memory operations by default.  This
> patch re-enables generating these instructions if -mcpu=future is used.
> 
> The patches have been tested on both little and big endian systems.  Can I 
> check
> it into the master branch?
> 
> 2024-01-05   Michael Meissner  
> 
> gcc/
> 
>   * config/rs6000/rs6000-cpus.def (ISA_FUTURE_MASKS): Add
>   -mblock-ops-vector-pair.

Nit: s/-mblock-ops-vector-pair/OPTION_MASK_BLOCK_OPS_VECTOR_PAIR/

>   (POWERPC_MASKS): Likewise.
> ---
>  gcc/config/rs6000/rs6000-cpus.def | 2 ++
>  1 file changed, 2 insertions(+)
> 
> diff --git a/gcc/config/rs6000/rs6000-cpus.def 
> b/gcc/config/rs6000/rs6000-cpus.def
> index 8754635f3d9..b6cd6d8cc84 100644
> --- a/gcc/config/rs6000/rs6000-cpus.def
> +++ b/gcc/config/rs6000/rs6000-cpus.def
> @@ -90,6 +90,7 @@
>  
>  /* Flags for a potential future processor that may or may not be delivered.  
> */
>  #define ISA_FUTURE_MASKS (ISA_3_1_MASKS_SERVER   \
> +  | OPTION_MASK_BLOCK_OPS_VECTOR_PAIR\
>| OPTION_MASK_FUTURE)


OK with incorporating change s/ISA_FUTURE_MASKS/ISA_FUTURE_MASKS_SERVER/.  
Thanks!

BR,
Kewen

>  
>  /* Flags that need to be turned off if -mno-power9-vector.  */
> @@ -127,6 +128,7 @@
>  
>  /* Mask of all options to set the default isa flags based on -mcpu=.  */
>  #define POWERPC_MASKS(OPTION_MASK_ALTIVEC
> \
> +  | OPTION_MASK_BLOCK_OPS_VECTOR_PAIR\
>| OPTION_MASK_CMPB \
>| OPTION_MASK_CRYPTO   \
>| OPTION_MASK_DFP  \



Re: Repost [PATCH 1/6] Add -mcpu=future

2024-01-23 Thread Kewen.Lin
Hi Mike,

on 2024/1/6 07:35, Michael Meissner wrote:
> This patch implements support for a potential future PowerPC cpu.  Features
> added with -mcpu=future, may or may not be added to new PowerPC processors.
> 
> This patch adds support for the -mcpu=future option.  If you use -mcpu=future,
> the macro __ARCH_PWR_FUTURE__ is defined, and the assembler .machine directive
> "future" is used.  Future patches in this series will add support for new
> instructions that may be present in future PowerPC processors.
> 
> This particular patch does not add any new features.  It exists as
> groundwork for future patches to add support for a possible future PowerPC
> processor.
> 
> This patch does not implement any differences in tuning when -mcpu=future is
> used compared to -mcpu=power10.  If -mcpu=future is used, GCC will use power10
> tuning.  If you explicitly use -mtune=future, you will get a warning that
> -mtune=future is not supported, and default tuning will be set for power10.
> 
> The patches have been tested on both little and big endian systems.  Can I 
> check
> it into the master branch?
> 
> 2024-01-05   Michael Meissner  
> 
> gcc/
> 
>   * config/rs6000/rs6000-c.cc (rs6000_target_modify_macros): Define
>   __ARCH_PWR_FUTURE__ if -mcpu=future.
>   * config/rs6000/rs6000-cpus.def (ISA_FUTURE_MASKS): New macro.
>   (POWERPC_MASKS): Add -mcpu=future support.
>   * config/rs6000/rs6000-opts.h (enum processor_type): Add
>   PROCESSOR_FUTURE.
>   * config/rs6000/rs6000-tables.opt: Regenerate.
>   * config/rs6000/rs6000.cc (rs6000_cpu_index_lookup): New helper
>   function.
>   (rs6000_option_override_internal): Make -mcpu=future set
>   -mtune=power10.  If the user explicitly uses -mtune=future, give a
>   warning and reset the tuning to power10.
>   (rs6000_option_override_internal): Use power10 costs for future
>   machine.
>   (rs6000_machine_from_flags): Add support for -mcpu=future.
>   (rs6000_opt_masks): Likewise.
>   * config/rs6000/rs6000.h (ASM_CPU_SUPPORT): Likewise.
>   * config/rs6000/rs6000.md (cpu attribute): Likewise.
>   * config/rs6000/rs6000.opt (-mfuture): New undocumented debug switch.
>   * doc/invoke.texi (IBM RS/6000 and PowerPC Options): Document 
> -mcpu=future.
> ---
>  gcc/config/rs6000/rs6000-c.cc   |  2 +
>  gcc/config/rs6000/rs6000-cpus.def   |  6 +++
>  gcc/config/rs6000/rs6000-opts.h |  4 +-
>  gcc/config/rs6000/rs6000-tables.opt |  3 ++
>  gcc/config/rs6000/rs6000.cc | 58 -
>  gcc/config/rs6000/rs6000.h  |  1 +
>  gcc/config/rs6000/rs6000.md |  2 +-
>  gcc/config/rs6000/rs6000.opt|  4 ++
>  gcc/doc/invoke.texi |  2 +-
>  9 files changed, 69 insertions(+), 13 deletions(-)
> 
> diff --git a/gcc/config/rs6000/rs6000-c.cc b/gcc/config/rs6000/rs6000-c.cc
> index ce0b14a8d37..f2fb5bef678 100644
> --- a/gcc/config/rs6000/rs6000-c.cc
> +++ b/gcc/config/rs6000/rs6000-c.cc
> @@ -447,6 +447,8 @@ rs6000_target_modify_macros (bool define_p, HOST_WIDE_INT 
> flags)
>  rs6000_define_or_undefine_macro (define_p, "_ARCH_PWR9");
>if ((flags & OPTION_MASK_POWER10) != 0)
>  rs6000_define_or_undefine_macro (define_p, "_ARCH_PWR10");
> +  if ((flags & OPTION_MASK_FUTURE) != 0)
> +rs6000_define_or_undefine_macro (define_p, "_ARCH_PWR_FUTURE");
>if ((flags & OPTION_MASK_SOFT_FLOAT) != 0)
>  rs6000_define_or_undefine_macro (define_p, "_SOFT_FLOAT");
>if ((flags & OPTION_MASK_RECIP_PRECISION) != 0)
> diff --git a/gcc/config/rs6000/rs6000-cpus.def 
> b/gcc/config/rs6000/rs6000-cpus.def
> index d28cc87eb2a..8754635f3d9 100644
> --- a/gcc/config/rs6000/rs6000-cpus.def
> +++ b/gcc/config/rs6000/rs6000-cpus.def
> @@ -88,6 +88,10 @@
>| OPTION_MASK_POWER10  \
>| OTHER_POWER10_MASKS)
>  
> +/* Flags for a potential future processor that may or may not be delivered.  
> */
> +#define ISA_FUTURE_MASKS (ISA_3_1_MASKS_SERVER   \
> +  | OPTION_MASK_FUTURE)
> +

Nit: Naming it "ISA_FUTURE_MASKS_SERVER" seems more accurate, as it's
constituted from ISA_3_1_MASKS_**SERVER** ...

>  /* Flags that need to be turned off if -mno-power9-vector.  */
>  #define OTHER_P9_VECTOR_MASKS(OPTION_MASK_FLOAT128_HW
> \
>| OPTION_MASK_P9_MINMAX)
> @@ -135,6 +139,7 @@
>| OPTION_MASK_LOAD_VECTOR_PAIR \
>| OPTION_MASK_POWER10  \
>| OPTION_MASK_P10_FUSION   \
> +  | OPTION_MASK_FUTURE   \
>| OPTION_MASK_HTM  \
>| OPTION_MASK_ISEL \
>| OPTION_MASK_MFCRF

Re: [PATCH] strub: Only unbias stack point for SPARC_STACK_BOUNDARY_HACK [PR113100]

2024-01-17 Thread Kewen.Lin
Hi David,

on 2024/1/18 09:27, David Edelsohn wrote:
> If the fixes remove the failures on AIX, then the patch to disable the tests 
> also can be reverted.
> 

Since I didn't see strub-unsupported*.c fail on ppc64 linux, to ensure it's
related I reverted your commit r14-6838 and my fix r14-7089 locally,
expecting to see those test cases fail on AIX, but they passed.  Then I
tried resetting the repo to r14-6275, which added those test cases,
expecting to see them fail, but they still passed.  Not sure if I missed
something in the testing; could you kindly double-check whether those test
cases started to fail from r14-6275 in your environment, or from some other
specific commit?  Or maybe directly verify whether they pass on the latest
trunk with r14-6838 reverted.  Just to ensure the reverting matches our
expectation.  Thanks in advance!

btw, the command I used to test on aix is:
make check-gcc RUNTESTFLAGS="--target_board=unix'{-m64,-m32}' 
dg.exp=strub-unsupported*.c"

BR,
Kewen
 
> Thanks, David
> 
> 
> On Wed, Jan 17, 2024 at 8:06 PM Alexandre Oliva wrote:
> 
>     David,
> 
> On Jan  7, 2024, "Kewen.Lin" wrote:
> 
> > As PR113100 shows, the unbiasing introduced by r14-6737 can
> > cause the scrubbing to overrun and screw some critical data
> > on stack like saved toc base consequently cause segfault on
> > Power.
> 
> I suppose this problem that Kewen fixed (thanks) was what caused you to
> install commit r14-6838.  According to posted test results, strub worked
> on AIX until Dec 20, when the fixes for sparc that broke strub on ppc
> went in.
> 
> I can't seem to find the email in which you posted the patch, and I'd
> have appreciated if you'd copied me.  I wouldn't have missed it for so
> long if you had.  Since I couldn't find that patch, I'm responding in
> this thread instead.
> 
> The r14-6838 patch is actually very very broken.  Disabling strub on a
> target is not a matter of changing only the testsuite.  Your additions
> to the tests even broke the strub-unsupported testcases, that tested
> exactly the feature that enables ports to disable strub in a way that
> informs users in case they attempt to use it.
> 
> I'd thus like to revert that patch.
> 
> Kewen's patch needs a little additional cleanup, that I'm preparing now,
> to restore fully-functioning strub on sparc32.
> 
> Please let me know in case you observe any other problems related with
> strub.  I'd be happy to fix them, but I can only do so once I'm aware of
> them.
> 
> In case the reversal or the upcoming cleanup has any negative impact,
> please make sure you let me know.
> 
> Thanks,
> 
> Happy GNU Year!
> 
> -- 
> Alexandre Oliva, happy hacker            https://FSFLA.org/blogs/lxo/
>    Free Software Activist                   GNU Toolchain Engineer
> More tolerance and less prejudice are key for inclusion and diversity
> Excluding neuro-others for not behaving ""normal"" is *not* inclusive
> 


[committed] testsuite, rs6000: Adjust fold-vec-extract-char.p7.c [PR111850]

2024-01-17 Thread Kewen.Lin
Hi,

As PR101169 comment #c4 shows, the previous addi count
update on fold-vec-extract-char.p7.c covered up a sub-optimal
code generation issue.  On trunk, the fold-mem-offsets pass
helps to recover the best code sequence, so this patch
reverts the count back to the original, which matches the
optimal addi count.

Tested well on powerpc64-linux-gnu P8/P9,
powerpc64le-linux-gnu P9/P10 and powerpc-ibm-aix.

Pushed as r14-8201.

PR testsuite/111850

gcc/testsuite/ChangeLog:

* gcc.target/powerpc/fold-vec-extract-char.p7.c: Update the
checking count of addi to 6.
---
 gcc/testsuite/gcc.target/powerpc/fold-vec-extract-char.p7.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/gcc/testsuite/gcc.target/powerpc/fold-vec-extract-char.p7.c 
b/gcc/testsuite/gcc.target/powerpc/fold-vec-extract-char.p7.c
index 29a8aa84db2..42599c214e4 100644
--- a/gcc/testsuite/gcc.target/powerpc/fold-vec-extract-char.p7.c
+++ b/gcc/testsuite/gcc.target/powerpc/fold-vec-extract-char.p7.c
@@ -11,7 +11,7 @@
 /* one extsb (extend sign-bit) instruction generated for each test against
unsigned types */

-/* { dg-final { scan-assembler-times {\maddi\M} 9 } } */
+/* { dg-final { scan-assembler-times {\maddi\M} 6 } } */
 /* { dg-final { scan-assembler-times {\mli\M} 6 } } */
 /* { dg-final { scan-assembler-times {\mstxvw4x\M|\mstvx\M|\mstxv\M} 6 } } */
 /* -m32 target uses rlwinm in place of rldicl. */
--
2.34.1


Re: [PATCH V1] rs6000: New pass for replacement of adjacent (load) lxv with lxvp

2024-01-16 Thread Kewen.Lin
on 2024/1/16 06:22, Ajit Agarwal wrote:
> Hello Richard:
> 
> On 15/01/24 6:25 pm, Ajit Agarwal wrote:
>>
>>
>> On 15/01/24 6:14 pm, Ajit Agarwal wrote:
>>> Hello Richard:
>>>
>>> On 15/01/24 3:03 pm, Richard Biener wrote:
 On Sun, Jan 14, 2024 at 4:29 PM Ajit Agarwal  
 wrote:
>
> Hello All:
>
> This patch adds the vecload pass to replace adjacent lxv memory accesses
> with lxvp instructions.  This pass is added before the ira pass.
>
> The vecload pass removes one of the adjacent defined lxv (load)
> instructions and replaces it with lxvp.  Due to the removal of one of the
> defined loads, the allocno has only uses but no defs.
>
> Due to this, the IRA pass doesn't assign register pairs as sequential
> registers.  Changes are made in the IRA register allocator to assign
> sequential registers to adjacent loads.
>
> Some of the registers are cleared and not marked as profitable registers,
> because a zero cost is greater than negative costs; checks are added to
> compare positive costs.
>
> The LRA register allocator is changed not to reassign them to different
> registers, keeping the sequential register pairs intact.
>
> contrib/check_GNU_style.sh run on patch looks good.
>
> Bootstrapped and regtested for powerpc64-linux-gnu.
>
> Spec2017 benchmarks are run and I get impressive benefits for some of the 
> FP
> benchmarks.
 I want to point out the aarch64 target recently got a ld/st fusion
 pass which sounds related.  It would be nice to have at least common
 infrastructure for this (the aarch64 one also looks quite a bit more
 powerful).

Thanks Richi for pointing out this pass.  Yeah, it would be nice if we can
share something common.  CC'ing the author Alex as well in case he has more
insightful comments.

>>>
>>> The load/store fusion pass in aarch64 is scheduled before the peephole2
>>> pass and after the register allocator pass.  In our case, if we do it
>>> after the register allocator, then we should keep the register assigned
>>> to the lower-offset load, and the other load that is adjacent to the
>>> previous load with an offset difference of 16 is removed.
>>>
>>> Then we are left with one load with the lower offset, and the register
>>> assigned by the register allocator for the lower-offset load should be
>>> lower than that of the other adjacent load.  If not, we need to change
>>> it to the lower register and propagate it to all the uses of the
>>> variable.  Similarly for the other adjacent load that we are removing,
>>> the register needs to be propagated to all the uses.
>>>
>>> In that case we are doing the work of the register allocator.  In most
>>> of our example testcases the lower-offset load is assigned a greater
>>> register than the other adjacent load by the register allocator, and
>>> hence we are left with always propagating and almost redoing the
>>> register allocator's work.
>>>
>>> Is it okay to use a load/store fusion pass as on aarch64 for our cases,
>>> considering the above scenario?
>>>
>>> Please let me know what you think.
> 
> I have gone through the implementation of ld/st fusion in aarch64.
> 
> Here is my understanding:
> 
> First of all, it's my mistake that I mentioned in my earlier mail that
> this pass is done before peephole2 after the RA pass.
> 
> This pass runs early before RA (before early-remat) and also
> before peephole2 after the RA pass.
> 
> This pass fuses two ldr instructions with adjacent accesses into one ldp
> instruction.
> 
> The assembly syntax of ldp instruction is
> 
> ldp w3, w7, [x0]
> 
> It loads [X0] into w3 and [X0+4] into W7.
> 
> Both registers that form the pair are named in the ldp instruction and
> need not be sequential, i.e. if the first register is W3, the next
> register need not be W3+1.
> 
> That's why the pass works before the RA pass: it has both defs, and the
> registers are not required to be in sequential order like first_reg and
> then first_reg+1.  They can be any valid registers.
> 
> 
> But in lxvp instructions:
> 
> lxv vs32, 0(r2)
> lxv vs45, 16(r2)
> 
> When we combine above lxv instruction into lxvp, lxvp instruction
> becomes
> 
> lxvp vs32, 0(r2)
> 
> wherein [r2+0] is loaded into vs32 and [r2+16] is loaded into the vs33
> register (sequential registers).  vs33 is implicit in the lxvp
> instruction.  This is a mandatory requirement for the lxvp instruction
> and cannot be any other sequence; the register assignment difference
> must be 1.

Note that the first register number in the pair should be even, meaning
the so-called sequential order should be X, X + 1 (X even).  This is also
the reason why we prefer this pairing to be done before RA (it can catch
more opportunities).

> 
> All the uses of r45 has to be propagated with r33.

I think you meant s/r45/vs45/ and s/r33/vs33/.

> 
> And also register allocator can allocate two lxv instructions
> in the following registers.
> 
> lxv vs33, 0(r2)
> lxv vs32, 16(r2)
> 
> To generate lxvp for above lxv instructions 

[PATCH] testsuite: Fix vect_long_mult on Power [PR109705]

2024-01-15 Thread Kewen.Lin
Hi,

As pointed out by the discussion in PR109705, the current
vect_long_mult effective target check on Power is broken.
This patch fixes it accordingly.

With an additional change adding a vect_long_mult guard in
gcc.dg/vect/pr25413a.c, it was tested well on Power{8,9}
LE & BE (also on Power10 LE as before).

I'm going to push this soon.

BR,
Kewen
-
PR testsuite/109705

gcc/testsuite/ChangeLog:

* lib/target-supports.exp (check_effective_target_vect_long_mult):
Fix powerpc*-*-* checks.
---
 gcc/testsuite/lib/target-supports.exp | 6 +++---
 1 file changed, 3 insertions(+), 3 deletions(-)

diff --git a/gcc/testsuite/lib/target-supports.exp 
b/gcc/testsuite/lib/target-supports.exp
index 81ae92a0266..fac32fb3d0e 100644
--- a/gcc/testsuite/lib/target-supports.exp
+++ b/gcc/testsuite/lib/target-supports.exp
@@ -9073,9 +9073,9 @@ proc check_effective_target_vect_int_mult { } {

 proc check_effective_target_vect_long_mult { } {
 if { [istarget i?86-*-*] || [istarget x86_64-*-*]
-|| (([istarget powerpc*-*-*]
-  && ![istarget powerpc-*-linux*paired*])
-  && [check_effective_target_ilp32])
+|| ([istarget powerpc*-*-*]
+ && [check_effective_target_powerpc_vsx_ok]
+ && [check_effective_target_has_arch_pwr10])
 || [is-effective-target arm_neon]
 || ([istarget sparc*-*-*] && [check_effective_target_ilp32])
 || [istarget aarch64*-*-*]
--
2.39.1


Re: [Patch, rs6000] Eliminate unnecessary byte swaps for block clear on P8 LE [PR113325]

2024-01-14 Thread Kewen.Lin
Hi Haochen,

on 2024/1/11 16:28, HAO CHEN GUI wrote:
> Hi,
>   This patch eliminates unnecessary byte swaps for block clear on P8
> LE.  For block clear, all the bytes are set to zero, so the byte order
> doesn't matter.  Thus the alignment of the destination can be set to
> the store mode size instead of 1 byte in order to eliminate
> unnecessary byte swap instructions on P8 LE.  The test case shows the
> problem.

I agree with Richi's concern: a byte swap can be eliminated if the
byte-swapped result is known to be the same as before.  One typical case
is a vector constant matching the predicate const_vector_each_byte_same;
we can do some optimization for that.

BR,
Kewen

> 
>   Bootstrapped and tested on x86 and powerpc64-linux BE and LE with no
> regressions. Is this OK for trunk?
> 
> Thanks
> Gui Haochen
> 
> ChangeLog
> rs6000: Eliminate unnecessary byte swaps for block clear on P8 LE
> 
> gcc/
>   PR target/113325
>   * config/rs6000/rs6000-string.cc (expand_block_clear): Set the
>   alignment of destination to the size of mode.
> 
> gcc/testsuite/
>   PR target/113325
>   * gcc.target/powerpc/pr113325.c: New.
> 
> patch.diff
> diff --git a/gcc/config/rs6000/rs6000-string.cc 
> b/gcc/config/rs6000/rs6000-string.cc
> index 7f777666ba9..4c9b2cbeefc 100644
> --- a/gcc/config/rs6000/rs6000-string.cc
> +++ b/gcc/config/rs6000/rs6000-string.cc
> @@ -140,7 +140,9 @@ expand_block_clear (rtx operands[])
>   }
> 
>dest = adjust_address (orig_dest, mode, offset);
> -
> +  /* Set the alignment of dest to the size of mode in order to
> +  avoid unnecessary byte swaps on LE.  */
> +  set_mem_align (dest, GET_MODE_SIZE (mode) * BITS_PER_UNIT);
>emit_move_insn (dest, CONST0_RTX (mode));
>  }
> 
> diff --git a/gcc/testsuite/gcc.target/powerpc/pr113325.c 
> b/gcc/testsuite/gcc.target/powerpc/pr113325.c
> new file mode 100644
> index 000..4a3cae019c2
> --- /dev/null
> +++ b/gcc/testsuite/gcc.target/powerpc/pr113325.c
> @@ -0,0 +1,9 @@
> +/* { dg-do compile } */
> +/* { dg-options "-O2 -mdejagnu-cpu=power8" } */
> +/* { dg-require-effective-target powerpc_p8vector_ok } */
> +/* { dg-final { scan-assembler-not {\mxxpermdi\M} } } */
> +
> +void* foo (void* s1)
> +{
> +  return __builtin_memset (s1, 0, 32);
> +}



Re: [PATCH, rs6000] Enable block compare expand on P9 with m32 and mpowerpc64

2024-01-14 Thread Kewen.Lin
Hi Haochen,

on 2024/1/12 14:48, HAO CHEN GUI wrote:
> Hi,
>   On P9, "setb" is used to set the result of block compare, so it works
> with m32 and mpowerpc64.  On P8, the carry bit is used, so it can't work
> with m32 and mpowerpc64.  This patch enables block compare expand for
> m32 and mpowerpc64 on P9.
> 
>   Bootstrapped and tested on x86 and powerpc64-linux BE and LE with no
> regressions. Is this OK for trunk?

OK with two nits below tweaked.  Thanks!

BR,
Kewen

> 
> Thanks
> Gui Haochen
> 
> 
> ChangeLog
> rs6000: Enable block compare expand on P9 with m32 and mpowerpc64
> 
> gcc/
>   * config/rs6000/rs6000-string.cc (expand_block_compare): Enable
>   P9 with m32 and mpowerpc64.
> 
> gcc/testsuite/
>   * gcc.target/powerpc/block-cmp-1.c: Exclude m32 and mpowerpc64.
>   * gcc.target/powerpc/block-cmp-4.c: Likewise.
>   * gcc.target/powerpc/block-cmp-8.c: New.
> 
> patch.diff
> diff --git a/gcc/config/rs6000/rs6000-string.cc 
> b/gcc/config/rs6000/rs6000-string.cc
> index 018b87f2501..346708071b5 100644
> --- a/gcc/config/rs6000/rs6000-string.cc
> +++ b/gcc/config/rs6000/rs6000-string.cc
> @@ -1677,11 +1677,12 @@ expand_block_compare (rtx operands[])
>/* TARGET_POPCNTD is already guarded at expand cmpmemsi.  */
>gcc_assert (TARGET_POPCNTD);
> 
> -  /* This case is complicated to handle because the subtract
> - with carry instructions do not generate the 64-bit
> - carry and so we must emit code to calculate it ourselves.
> - We choose not to implement this yet.  */
> -  if (TARGET_32BIT && TARGET_POWERPC64)
> +  /* For P8, this case is complicated to handle because the subtract
> + with carry instructions do not generate the 64-bit carry and so
> + we must emit code to calculate it ourselves.  We skip it on P8
> + but setb works well on P9.  */
> +  if (TARGET_32BIT && TARGET_POWERPC64

Nit: Move "&& TARGET_POWERPC64" onto a separate line to make it read better.

> +  && !TARGET_P9_MISC)
>  return false;
> 
>/* Allow this param to shut off all expansion.  */
> diff --git a/gcc/testsuite/gcc.target/powerpc/block-cmp-1.c 
> b/gcc/testsuite/gcc.target/powerpc/block-cmp-1.c
> index bcf0cb2ab4f..cd076cf1dce 100644
> --- a/gcc/testsuite/gcc.target/powerpc/block-cmp-1.c
> +++ b/gcc/testsuite/gcc.target/powerpc/block-cmp-1.c
> @@ -1,5 +1,6 @@
>  /* { dg-do compile } */
>  /* { dg-options "-O2 -mdejagnu-cpu=power8 -mno-vsx" } */
> +/* { dg-skip-if "" { has_arch_ppc64 && ilp32 } } */
>  /* { dg-final { scan-assembler-not {\mb[l]? memcmp\M} } }  */
> 
>  /* Test that it still can do expand for memcmpsi instead of calling library
> diff --git a/gcc/testsuite/gcc.target/powerpc/block-cmp-4.c 
> b/gcc/testsuite/gcc.target/powerpc/block-cmp-4.c
> index c86febae68a..9373b53a3a4 100644
> --- a/gcc/testsuite/gcc.target/powerpc/block-cmp-4.c
> +++ b/gcc/testsuite/gcc.target/powerpc/block-cmp-4.c
> @@ -1,5 +1,6 @@
>  /* { dg-do compile { target be } } */
>  /* { dg-options "-O2 -mdejagnu-cpu=power7" } */
> +/* { dg-skip-if "" { has_arch_ppc64 && ilp32 } } */
>  /* { dg-final { scan-assembler-not {\mb[l]? memcmp\M} } }  */
> 
>  /* Test that it does expand for memcmpsi instead of calling library on
> diff --git a/gcc/testsuite/gcc.target/powerpc/block-cmp-8.c 
> b/gcc/testsuite/gcc.target/powerpc/block-cmp-8.c
> new file mode 100644
> index 000..b470f873973
> --- /dev/null
> +++ b/gcc/testsuite/gcc.target/powerpc/block-cmp-8.c
> @@ -0,0 +1,8 @@
> +/* { dg-do run { target ilp32 } } */
> +/* { dg-options "-O2 -m32 -mpowerpc64" } */

Nit: -m32 isn't needed.

> +/* { dg-require-effective-target has_arch_ppc64 } */
> +/* { dg-timeout-factor 2 } */
> +
> +/* Verify memcmp on m32 mpowerpc64 */
> +
> +#include "../../gcc.dg/memcmp-1.c"
BR,
Kewen


Re: [PATCH, rs6000] Refactor expand_compare_loop and split it to two functions

2024-01-14 Thread Kewen.Lin
Hi Haochen,

on 2024/1/10 09:35, HAO CHEN GUI wrote:
> Hi,
>   This patch refactors the function expand_compare_loop and splits it
> into two functions: one for fixed length and another for variable
> length.  These two functions share some low-level common helper functions.

I'd expect a refactoring not to introduce any functional changes, but
this patch includes some enhancements as described below, so I think the
subject is off; it's more like a rework.

> 
>   Besides the above changes, the patch also does the following:
> 1. Don't generate the load and compare loop when max_bytes is less than
> the loop bytes.
> 2. Remove do_load_mask_compare as it's not needed.  All sub-targets
> entering the function should support efficient overlapping load and
> compare.
> 3. Implement a variable-length overlapping load and compare for the
> case in which the remaining bytes are less than the loop bytes in a
> variable-length compare.  The 4k boundary test and the one-byte load
> and compare loop are removed as they're not needed now.
> 4. Remove the code for "bytes > max_bytes" with fixed length, as that
> case is already excluded by pre-checking.
> 5. Remove the run-time code for "bytes > max_bytes" with variable
> length, as it should jump to the library call at the beginning.
> 6. Enhance do_overlap_load_compare to avoid an overlapping load and
> compare when the remaining bytes can be loaded and compared by a
> smaller unit.

Considering it's stage 4 now and the impact of this patch, let's defer
this to the next stage 1.  If possible, could you organize the above
changes into patches:

1) Refactor expand_compare_loop by splitting it into two functions
   without any functional changes.
2) Remove some useless code, like items 2, 4 and 5.
3) Some more enhancements, like items 1, 3 and 6.

?  It would be helpful for the review.  Thanks!

BR,
Kewen

> 
>   Bootstrapped and tested on x86 and powerpc64-linux BE and LE with no
> regressions. Is this OK for trunk?
> 
> Thanks
> Gui Haochen
> 
> 
> ChangeLog
> rs6000: Refactor expand_compare_loop and split it to two functions
> 
> The original expand_compare_loop has a complicated logical as it's
> designed for both fixed and variable length.  This patch splits it to
> two functions and make these two functions share common help functions.
> Also the 4K boundary test and corresponding one byte load and compare
> are replaced by variable length overlapping load and compare.  The
> do_load_mask_compare is removed as all sub-targets entering the function
> has efficient overlapping load and compare so that mask load is no needed.
> 
> gcc/
>   * config/rs6000/rs6000-string.cc (do_isel): Remove.
>   (do_load_mask_compare): Remove.
>   (do_reg_compare): New.
>   (do_load_and_compare): New.
>   (do_overlap_load_compare): Do load and compare with a small unit
>   other than overlapping load and compare when the remain bytes can
>   be done by one instruction.
>   (expand_compare_loop): Remove.
>   (get_max_inline_loop_bytes): New.
>   (do_load_compare_rest_of_loop): New.
>   (generate_6432_conversion): Set it to a static function and move
>   ahead of gen_diff_handle.
>   (gen_diff_handle): New.
>   (gen_load_compare_loop): New.
>   (gen_library_call): New.
>   (expand_compare_with_fixed_length): New.
>   (expand_compare_with_variable_length): New.
>   (expand_block_compare): Call expand_compare_with_variable_length
>   to expand block compare for variable length.  Call
>   expand_compare_with_fixed_length to expand block compare loop for
>   fixed length.
> 
> gcc/testsuite/
>   * gcc.target/powerpc/block-cmp-5.c: New.
>   * gcc.target/powerpc/block-cmp-6.c: New.
>   * gcc.target/powerpc/block-cmp-7.c: New.
> 
> patch.diff
> diff --git a/gcc/config/rs6000/rs6000-string.cc 
> b/gcc/config/rs6000/rs6000-string.cc
> index f707bb2727e..018b87f2501 100644
> --- a/gcc/config/rs6000/rs6000-string.cc
> +++ b/gcc/config/rs6000/rs6000-string.cc
> @@ -404,21 +404,6 @@ do_ifelse (machine_mode cmpmode, rtx_code comparison,
>LABEL_NUSES (true_label) += 1;
>  }
> 
> -/* Emit an isel of the proper mode for DEST.
> -
> -   DEST is the isel destination register.
> -   SRC1 is the isel source if CR is true.
> -   SRC2 is the isel source if CR is false.
> -   CR is the condition for the isel.  */
> -static void
> -do_isel (rtx dest, rtx cmp, rtx src_t, rtx src_f, rtx cr)
> -{
> -  if (GET_MODE (dest) == DImode)
> -emit_insn (gen_isel_cc_di (dest, cmp, src_t, src_f, cr));
> -  else
> -emit_insn (gen_isel_cc_si (dest, cmp, src_t, src_f, cr));
> -}
> -
>  /* Emit a subtract of the proper mode for DEST.
> 
> DEST is the destination register for the subtract.
> @@ -499,65 +484,61 @@ do_rotl3 (rtx dest, rtx src1, rtx src2)
>  emit_insn (gen_rotlsi3 (dest, src1, src2));
>  }
> 
> -/* Generate rtl for a load, shift, and compare of less than a full word.
> -
> -   LOAD_MODE is the machine mode for the loads.
> -   DIFF is the reg for the difference.
> -   CMP_REM is the reg 

Re: [PATCH] strub: Only unbias stack point for SPARC_STACK_BOUNDARY_HACK [PR113100]

2024-01-14 Thread Kewen.Lin
on 2024/1/12 19:03, Alexandre Oliva wrote:
> On Jan 12, 2024, "Kewen.Lin"  wrote:
> 
>>>> By checking PR112917, IMHO we should keep this unbiasing
>>>> guarded under SPARC_STACK_BOUNDARY_HACK (TARGET_ARCH64 &&
>>>> TARGET_STACK_BIAS), similar to some existing code special
>>>> treating SPARC stack bias.
>>>
>>> I'm afraid this change will most certainly regress 32-bit sparc, because
>>> of the large register save area.
> 
>> Oh, I read the comments and commit logs in PR112917, mainly
>> https://gcc.gnu.org/bugzilla/show_bug.cgi?id=112917#{c4,c5,c6},
>> and the "sparc64" in subject of commit r14-6737 also implies
>> that this unbiasing is only required for sparc64, so I thought
>> it should be safe to guard with SPARC_STACK_BOUNDARY_HACK.
> 
> It is safe, in a way, because that protects potentially active stack
> areas, but it's unsafe in that it may leak data that stack scrubbing was
> supposed to scrub.  There's no conservative solution here, alas; we have
> to get it just right.
> 
> Specifically on sparc32, if __builtin_scrub_leave allocated its own
> frame (it doesn't) with the large register-save area for its potential
> (but inexistent) callees to use, it could overlap with a large chunk of
> the very stack frame that it's supposed to clear.

Thanks for the further explanation!

> 
> Unfortunately, this is slowly drifting away from the notion of stack
> address.  I mean, all of the following could conceivably be returned by
> __builtin_stack_address:
> 
> - the (biased) stack pointer
> 
> - the address of the red zone
> 
> - the unbiased stack pointer
> 
> - the address of the save area reserved by callees for potential callees
> 
> - the boundary between caller- and callee-used stack space
> 
> The last one is what we need for stack scrubbing, so that's what I'm
> planning to implement, but I'm pondering whether to change
> __builtin_stack_address() to take an extra argument to select among the
> different possibilities, or of other means to query these various
> offsets.  It feels like overthinking, so I'm trying to push these
> thoughts aside, but...  Does anyone think that would be a desirable
> feature?  We can always add it later.

One immature idea: maybe we can introduce a hook with a clear meaning
for the last one, whose default implementation still adopts
__builtin_stack_address directly; if that default implementation is
imperfect for some port, someone familiar with that port's ABIs can
further enhance it with the port's own hook implementation.

BR,
Kewen


Re: [PATCH] strub: Only unbias stack point for SPARC_STACK_BOUNDARY_HACK [PR113100]

2024-01-11 Thread Kewen.Lin
Hi Alexandre,

on 2024/1/11 17:05, Alexandre Oliva wrote:
> On Jan  7, 2024, "Kewen.Lin"  wrote:
> 
>> As PR113100 shows, the unbiasing introduced by r14-6737 can
>> cause the scrubbing to overrun and screw some critical data
>> on stack like saved toc base consequently cause segfault on
>> Power.
> 
> Ugh.  Sorry about the breakage, and thanks for addressing it during my
> absence.  Happy GNU Year! :-)
> 

No problem!  Happy New Year! :)

>> By checking PR112917, IMHO we should keep this unbiasing
>> guarded under SPARC_STACK_BOUNDARY_HACK (TARGET_ARCH64 &&
>> TARGET_STACK_BIAS), similar to some existing code special
>> treating SPARC stack bias.
> 
> I'm afraid this change will most certainly regress 32-bit sparc, because
> of the large register save area.

Oh, I read the comments and commit logs in PR112917, mainly
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=112917#{c4,c5,c6},
and the "sparc64" in subject of commit r14-6737 also implies
that this unbiasing is only required for sparc64, so I thought
it should be safe to guard with SPARC_STACK_BOUNDARY_HACK.

CC Rainer (reporter of PR112917); maybe he has already noticed whether
it regressed or not.

> 
> I had been hesitant to introduce yet another target configuration knob,
> but it looks like this is what we're going to have to do to accommodate
> all targets.
> 
>> I also expect the culprit commit can
>> affect those ports with nonzero STACK_POINTER_OFFSET.
> 
> IMHO it really shouldn't.  STACK_POINTER_OFFSET should be the "Offset
> from the stack pointer register to the first location at which outgoing
> arguments are placed", which suggests to me that no data that the callee
> couldn't change should go in the area below (or above) %sp+S_P_O.
> 
> ISTM that PPC sets up a save area between the outgoing args and the

Yes, taking 64-bit PowerPC ELF abi 1.9 as example:

  |   Parameter save area(SP + 48)
  |   TOC save area  (SP + 40)
  |   link editor doubleword (SP + 32)
  |   compiler doubleword(SP + 24)
  |   LR save area   (SP + 16)
  |   CR save area   (SP + 8)
SP  --->  +-- Back chain (SP + 0)

https://refspecs.linuxfoundation.org/ELF/ppc64/PPC-elf64abi.html#STACK

64-bit PowerPC ELF abi v2 drops "link editor doubleword" and "compiler
doubleword".  PR113100 failures mainly suffer from the TOC saved value
changed when it's used for TOC BASE restoring.

> stack pointer; I don't think that's very common, but I suppose other
> targets that do so would also define STACK_POINTER_OFFSET to nonzero so
> as to reserve those bits.  But whether they should be cleared by stack
> scrubbing, as on sparc, or preserved, as on ppc, depends on the ABI
> conventions, so we probably can't help yet another knob :-/

I agree, it really depends on the ABI conventions.  Taking the 64-bit
PowerPC ELF ABI as an example, some of the slots are safe to scrub,
like the "LR save area", while some depend on whether they are in use,
like the "TOC save area".  It reminds me that even if scrubbing them
causes no failures somewhere, that doesn't really mean it's always safe
to scrub, since there may not be enough test coverage.  For example,
the back chain gets cleared but no issue would be exposed if no test
case happens to do some job with the back chain.  From this
perspective, excepting the special need for sparc unbiasing, without
examining the specific ABIs, IMHO it's more conservative (less risky)
not to scrub this area than to scrub it?

> 
> I'll take care of that, and update the corresponding documentation.

Nice, thanks!  Welcome back. :-)

BR,
Kewen


Re: [PATCH] PR target/112886, Add %S to print_operand for vector pair support

2024-01-10 Thread Kewen.Lin
Hi Mike,

on 2024/1/6 06:18, Michael Meissner wrote:
> In looking at support for load vector pair and store vector pair for the
> PowerPC in GCC, I noticed that we were missing a print_operand output modifier
> if you are dealing with vector pairs to print the 2nd register in the vector
> pair.
> 
> If the instruction inside of the asm used the Altivec encoding, then we could
> use the %L modifier:

It seems there is no Power-specific documentation on operand modifiers
like this "%L"?

> 
>   __vector_pair *p, *q, *r;
>   // ...
>   __asm__ ("vaddudm %0,%1,%2\n\tvaddudm %L0,%L1,%L2"
>: "=v" (*p)
>: "v" (*q), "v" (*r));
> 
> Likewise if we know the value to be in a tradiational FPR register, %L will
> work for instructions that use the VSX encoding:
> 
>   __vector_pair *p, *q, *r;
>   // ...
>   __asm__ ("xvadddp %x0,%x1,%x2\n\txvadddp %L0,%L1,%L2"
>: "=f" (*p)
>: "f" (*q), "f" (*r));
> 
> But if have a value that is in a traditional Altivec register, and the
> instruction uses the VSX encoding, %L will a value between 0 and 31, when 
> it
> should give a value between 32 and 63.
> 
> This patch adds %S that acts like %x, except that it adds 1 to the
> register number.

Apart from Peter's comments: since the existing "%L" has different
handling for REG_P and MEM_P:
case 'L':
  /* Write second word of DImode or DFmode reference.  Works on register
 or non-indexed memory only.  */
  if (REG_P (x))
fputs (reg_names[REGNO (x) + 1], file);
  else if (MEM_P (x))
...

, maybe we can similarly extend the existing '%X' for this (as the
capital of %x it's easier to remember, and it's only used for MEM_P
now) instead of introducing a new "%S".  But one argument for a new
character is that it's clearer.  Thoughts?

BR,
Kewen

> 
> I have tested this on power10 and power9 little endian systems and on a power9
> big endian system.  There were no regressions in the patch.  Can I apply it to
> the trunk?
> 
> It would be nice if I could apply it to the open branches.  Can I backport it
> after a burn-in period?
> 
> 2024-01-04  Michael Meissner  
> 
> gcc/
> 
>   PR target/112886
>   * config/rs6000/rs6000.cc (print_operand): Add %S output modifier.
>   * doc/md.texi (Modifiers): Mention %S can be used like %x.
> 
> gcc/testsuite/
> 
>   PR target/112886
>   * /gcc.target/powerpc/pr112886.c: New test.
> ---
>  gcc/config/rs6000/rs6000.cc | 10 +++---
>  gcc/doc/md.texi |  5 +++--
>  gcc/testsuite/gcc.target/powerpc/pr112886.c | 19 +++
>  3 files changed, 29 insertions(+), 5 deletions(-)
>  create mode 100644 gcc/testsuite/gcc.target/powerpc/pr112886.c
> 
> diff --git a/gcc/config/rs6000/rs6000.cc b/gcc/config/rs6000/rs6000.cc
> index 5a7e00b03d1..ba89377c9ec 100644
> --- a/gcc/config/rs6000/rs6000.cc
> +++ b/gcc/config/rs6000/rs6000.cc
> @@ -14504,13 +14504,17 @@ print_operand (FILE *file, rtx x, int code)
>   print_operand (file, x, 0);
>return;
>  
> +case 'S':
>  case 'x':
> -  /* X is a FPR or Altivec register used in a VSX context.  */
> +  /* X is a FPR or Altivec register used in a VSX context.  %x prints
> +  the VSX register number, %S prints the 2nd register number for
> +  vector pair, decimal 128-bit floating and IBM 128-bit binary floating
> +  values.  */
>if (!REG_P (x) || !VSX_REGNO_P (REGNO (x)))
> - output_operand_lossage ("invalid %%x value");
> + output_operand_lossage ("invalid %%%c value", (code == 'S' ? 'S' : 
> 'x'));
>else
>   {
> -   int reg = REGNO (x);
> +   int reg = REGNO (x) + (code == 'S' ? 1 : 0);
> int vsx_reg = (FP_REGNO_P (reg)
>? reg - 32
>: reg - FIRST_ALTIVEC_REGNO + 32);
> diff --git a/gcc/doc/md.texi b/gcc/doc/md.texi
> index 47a87d6ceec..53ec957cb23 100644
> --- a/gcc/doc/md.texi
> +++ b/gcc/doc/md.texi
> @@ -3386,8 +3386,9 @@ A VSX register (VSR), @code{vs0}@dots{}@code{vs63}.  
> This is either an
>  FPR (@code{vs0}@dots{}@code{vs31} are @code{f0}@dots{}@code{f31}) or a VR
>  (@code{vs32}@dots{}@code{vs63} are @code{v0}@dots{}@code{v31}).
>  
> -When using @code{wa}, you should use the @code{%x} output modifier, so that
> -the correct register number is printed.  For example:
> +When using @code{wa}, you should use either the @code{%x} or @code{%S}
> +output modifier, so that the correct register number is printed.  For
> +example:
>  
>  @smallexample
>  asm ("xvadddp %x0,%x1,%x2"
> diff --git a/gcc/testsuite/gcc.target/powerpc/pr112886.c 
> b/gcc/testsuite/gcc.target/powerpc/pr112886.c
> new file mode 100644
> index 000..07196bdc220
> --- /dev/null
> +++ b/gcc/testsuite/gcc.target/powerpc/pr112886.c
> @@ -0,0 +1,19 @@
> +/* { dg-do compile } */
> +/* { dg-require-effective-target power10_ok } */
> +/* { dg-options 

Re: [PATCH] strub: Only unbias stack point for SPARC_STACK_BOUNDARY_HACK [PR113100]

2024-01-09 Thread Kewen.Lin
on 2024/1/8 19:44, Richard Biener wrote:
> On Mon, Jan 8, 2024 at 3:35 AM Kewen.Lin  wrote:
>>
>> Hi,
>>
>> As PR113100 shows, the unbiasing introduced by r14-6737 can
>> cause the scrubbing to overrun and screw some critical data
>> on stack like saved toc base consequently cause segfault on
>> Power.
>>
>> By checking PR112917, IMHO we should keep this unbiasing
>> guarded under SPARC_STACK_BOUNDARY_HACK (TARGET_ARCH64 &&
>> TARGET_STACK_BIAS), similar to some existing code special
>> treating SPARC stack bias.
>>
>> Bootstrapped and regtested on x86_64-redhat-linux and
>> powerpc64{,le}-linux-gnu.  All reported failures in
>> PR113100 are gone.  I also expect the culprit commit can
>> affect those ports with nonzero STACK_POINTER_OFFSET.
>>
>> Is it ok for trunk?
> 
> OK

Thanks!  Pushed as r14-7089.

BR,
Kewen

> 
>> BR,
>> Kewen
>> -
>> PR middle-end/113100
>>
>> gcc/ChangeLog:
>>
>> * builtins.cc (expand_builtin_stack_address): Guard stack point
>> adjustment with SPARC_STACK_BOUNDARY_HACK.
>> ---
>>  gcc/builtins.cc | 5 -
>>  1 file changed, 4 insertions(+), 1 deletion(-)
>>
>> diff --git a/gcc/builtins.cc b/gcc/builtins.cc
>> index 125ea158ebf..9bad1e962b4 100644
>> --- a/gcc/builtins.cc
>> +++ b/gcc/builtins.cc
>> @@ -5450,6 +5450,7 @@ expand_builtin_stack_address ()
>>rtx ret = convert_to_mode (ptr_mode, copy_to_reg (stack_pointer_rtx),
>>  STACK_UNSIGNED);
>>
>> +#ifdef SPARC_STACK_BOUNDARY_HACK
>>/* Unbias the stack pointer, bringing it to the boundary between the
>>   stack area claimed by the active function calling this builtin,
>>   and stack ranges that could get clobbered if it called another
>> @@ -5476,7 +5477,9 @@ expand_builtin_stack_address ()
>>   (caller) function's active area as well, whereas those pushed or
>>   allocated temporarily for a call are regarded as part of the
>>   callee's stack range, rather than the caller's.  */
>> -  ret = plus_constant (ptr_mode, ret, STACK_POINTER_OFFSET);
>> +  if (SPARC_STACK_BOUNDARY_HACK)
>> +ret = plus_constant (ptr_mode, ret, STACK_POINTER_OFFSET);
>> +#endif
>>
>>return force_reg (ptr_mode, ret);
>>  }
>> --
>> 2.39.3


[PATCH] rs6000: Make copysign (x, -1) back to -abs (x) for IEEE128 float [PR112606]

2024-01-07 Thread Kewen.Lin
Hi,

I noticed that commit r14-6192 can't help with PR112606 #c3, as it
only takes care of SF/DF, while TF/KF can still suffer from the
issue.  Similar to commit r14-6192, this patch takes care of
copysign<mode>3 with IEEE128 as well.

Bootstrapped and regtested on powerpc64-linux-gnu P8/P9
and powerpc64le-linux-gnu P9 and P10.

I'm going to push this soon.

BR,
Kewen
-
PR target/112606

gcc/ChangeLog:

* config/rs6000/rs6000.md (copysign<mode>3 IEEE128): Change predicate
of the last argument from altivec_register_operand to any_operand.  If
operands[2] is CONST_DOUBLE, emit abs or neg abs depending on its sign
otherwise if it doesn't satisfy altivec_register_operand, force it to
REG using copy_to_mode_reg.
---
 gcc/config/rs6000/rs6000.md | 20 +++-
 1 file changed, 19 insertions(+), 1 deletion(-)

diff --git a/gcc/config/rs6000/rs6000.md b/gcc/config/rs6000/rs6000.md
index c880cec33a2..bc8bc6ab060 100644
--- a/gcc/config/rs6000/rs6000.md
+++ b/gcc/config/rs6000/rs6000.md
(define_insn "sqrt<mode>2"
(define_expand "copysign<mode>3"
   [(use (match_operand:IEEE128 0 "altivec_register_operand"))
(use (match_operand:IEEE128 1 "altivec_register_operand"))
-   (use (match_operand:IEEE128 2 "altivec_register_operand"))]
+   (use (match_operand:IEEE128 2 "any_operand"))]
  "FLOAT128_IEEE_P (<MODE>mode)"
 {
+  /* Middle-end canonicalizes -fabs (x) to copysign (x, -1),
+ but PowerPC prefers -fabs (x).  */
+  if (CONST_DOUBLE_AS_FLOAT_P (operands[2]))
+{
+  if (real_isneg (CONST_DOUBLE_REAL_VALUE (operands[2])))
+   {
+ rtx abs_res = gen_reg_rtx (<MODE>mode);
+ emit_insn (gen_abs<mode>2 (abs_res, operands[1]));
+ emit_insn (gen_neg<mode>2 (operands[0], abs_res));
+   }
+  else
+   emit_insn (gen_abs<mode>2 (operands[0], operands[1]));
+  DONE;
+}
+
+  if (!altivec_register_operand (operands[2], <MODE>mode))
+    operands[2] = copy_to_mode_reg (<MODE>mode, operands[2]);
+
   if (TARGET_FLOAT128_HW)
    emit_insn (gen_copysign<mode>3_hard (operands[0], operands[1],
 operands[2]));
--
2.42.0


[PATCH] rs6000: Eliminate zext fed by vclzlsbb [PR111480]

2024-01-07 Thread Kewen.Lin
Hi,

As PR111480 shows, commit r14-4079 only optimizes the case
of vctzlsbb but not the similar vclzlsbb.  This patch
considers vclzlsbb as well and avoids the failure on the
reported test case.  It also simplifies the patterns
with an iterator and attribute.

Bootstrapped and regtested on powerpc64-linux-gnu P8/P9
and powerpc64le-linux-gnu P9 and P10.

I'm going to push this soon.

BR,
Kewen
-
PR target/111480

gcc/ChangeLog:

* config/rs6000/vsx.md (VCZLSBB): New int iterator.
(vczlsbb_char): New int attribute.
(vclzlsbb_<mode>, vctzlsbb_<mode>): Merge to ...
(vc<vczlsbb_char>zlsbb_<mode>): ... this.
(*vctzlsbb_zext_<mode>): Rename to ...
(*vc<vczlsbb_char>zlsbb_zext_<mode>): ... this, and extend it to
cover vclzlsbb.
---
 gcc/config/rs6000/vsx.md | 41 ++--
 1 file changed, 18 insertions(+), 23 deletions(-)

diff --git a/gcc/config/rs6000/vsx.md b/gcc/config/rs6000/vsx.md
index 4c1725a7ecd..6111cc90eb7 100644
--- a/gcc/config/rs6000/vsx.md
+++ b/gcc/config/rs6000/vsx.md
@@ -411,6 +411,12 @@ (define_mode_attr VM3_char [(V2DI "d")
   (V2DF  "d")
   (V4SF  "w")])

+;; Iterator and attribute for vector count leading/trailing
+;; zero least-significant bits byte
+(define_int_iterator VCZLSBB [UNSPEC_VCLZLSBB
+ UNSPEC_VCTZLSBB])
+(define_int_attr vczlsbb_char [(UNSPEC_VCLZLSBB "l")
+  (UNSPEC_VCTZLSBB "t")])

 ;; VSX moves

@@ -5855,35 +5861,24 @@ (define_insn "vcmpnezw"
   "vcmpnezw %0,%1,%2"
   [(set_attr "type" "vecsimple")])

-;; Vector Count Leading Zero Least-Significant Bits Byte
-(define_insn "vclzlsbb_<mode>"
-  [(set (match_operand:SI 0 "register_operand" "=r")
-   (unspec:SI
-[(match_operand:VSX_EXTRACT_I 1 "altivec_register_operand" "v")]
-UNSPEC_VCLZLSBB))]
-  "TARGET_P9_VECTOR"
-  "vclzlsbb %0,%1"
-  [(set_attr "type" "vecsimple")])
-
-;; Vector Count Trailing Zero Least-Significant Bits Byte
-(define_insn "*vctzlsbb_zext_<mode>"
+;; Vector Count Leading/Trailing Zero Least-Significant Bits Byte
+(define_insn "*vc<vczlsbb_char>zlsbb_zext_<mode>"
   [(set (match_operand:DI 0 "register_operand" "=r")
-   (zero_extend:DI
-   (unspec:SI
-[(match_operand:VSX_EXTRACT_I 1 "altivec_register_operand" "v")]
-UNSPEC_VCTZLSBB)))]
+ (zero_extend:DI
+   (unspec:SI
+ [(match_operand:VSX_EXTRACT_I 1 "altivec_register_operand" "v")]
+ VCZLSBB)))]
   "TARGET_P9_VECTOR"
-  "vctzlsbb %0,%1"
+  "vc<vczlsbb_char>zlsbb %0,%1"
   [(set_attr "type" "vecsimple")])

-;; Vector Count Trailing Zero Least-Significant Bits Byte
-(define_insn "vctzlsbb_<mode>"
+(define_insn "vc<vczlsbb_char>zlsbb_<mode>"
   [(set (match_operand:SI 0 "register_operand" "=r")
-(unspec:SI
- [(match_operand:VSX_EXTRACT_I 1 "altivec_register_operand" "v")]
- UNSPEC_VCTZLSBB))]
+ (unspec:SI
+   [(match_operand:VSX_EXTRACT_I 1 "altivec_register_operand" "v")]
+   VCZLSBB))]
   "TARGET_P9_VECTOR"
-  "vctzlsbb %0,%1"
+  "vc<vczlsbb_char>zlsbb %0,%1"
   [(set_attr "type" "vecsimple")])

 ;; Vector Extract Unsigned Byte Left-Indexed
--
2.42.0


[PATCH] testsuite, rs6000: Adjust pcrel-sibcall-1.c with noipa [PR112751]

2024-01-07 Thread Kewen.Lin
Hi,

As PR112751 shows, commit r14-5628 caused pcrel-sibcall-1.c
to fail, as it enables IPA-VRP, which makes the return values
of functions {x,y,xx} known and propagated.  This patch
adjusts the test to use noipa so that it is less fragile.

Tested well on powerpc64-linux-gnu P8/P9 and
powerpc64le-linux-gnu P9/P10.

I'm going to push this soon.

BR,
Kewen
-
PR testsuite/112751

gcc/testsuite/ChangeLog:

* gcc.target/powerpc/pcrel-sibcall-1.c: Replace noinline as noipa.
---
 gcc/testsuite/gcc.target/powerpc/pcrel-sibcall-1.c | 8 
 1 file changed, 4 insertions(+), 4 deletions(-)

diff --git a/gcc/testsuite/gcc.target/powerpc/pcrel-sibcall-1.c 
b/gcc/testsuite/gcc.target/powerpc/pcrel-sibcall-1.c
index 9197788f98f..1b6dffd6073 100644
--- a/gcc/testsuite/gcc.target/powerpc/pcrel-sibcall-1.c
+++ b/gcc/testsuite/gcc.target/powerpc/pcrel-sibcall-1.c
@@ -8,10 +8,10 @@
generated when the caller preserves the TOC but the callee does not.  */

 #pragma GCC target ("cpu=power10,pcrel")
-int x (void) __attribute__((noinline));
-int y (void) __attribute__((noinline));
-int xx (void) __attribute__((noinline));
-
+int x (void) __attribute__((noipa));
+int y (void) __attribute__((noipa));
+int xx (void) __attribute__((noipa));
+
 int x (void)
 {
   return 1;
--
2.39.3


[PATCH] strub: Only unbias stack point for SPARC_STACK_BOUNDARY_HACK [PR113100]

2024-01-07 Thread Kewen.Lin
Hi,

As PR113100 shows, the unbiasing introduced by r14-6737 can
cause the scrubbing to overrun and clobber some critical data
on the stack, like the saved TOC base, and consequently cause
a segfault on Power.

By checking PR112917, IMHO we should keep this unbiasing
guarded under SPARC_STACK_BOUNDARY_HACK (TARGET_ARCH64 &&
TARGET_STACK_BIAS), similar to some existing code special
treating SPARC stack bias.

Bootstrapped and regtested on x86_64-redhat-linux and
powerpc64{,le}-linux-gnu.  All reported failures in
PR113100 are gone.  I also expect the culprit commit can
affect those ports with nonzero STACK_POINTER_OFFSET.

Is it ok for trunk?

BR,
Kewen
-
PR middle-end/113100

gcc/ChangeLog:

* builtins.cc (expand_builtin_stack_address): Guard stack point
adjustment with SPARC_STACK_BOUNDARY_HACK.
---
 gcc/builtins.cc | 5 -
 1 file changed, 4 insertions(+), 1 deletion(-)

diff --git a/gcc/builtins.cc b/gcc/builtins.cc
index 125ea158ebf..9bad1e962b4 100644
--- a/gcc/builtins.cc
+++ b/gcc/builtins.cc
@@ -5450,6 +5450,7 @@ expand_builtin_stack_address ()
   rtx ret = convert_to_mode (ptr_mode, copy_to_reg (stack_pointer_rtx),
 STACK_UNSIGNED);

+#ifdef SPARC_STACK_BOUNDARY_HACK
   /* Unbias the stack pointer, bringing it to the boundary between the
  stack area claimed by the active function calling this builtin,
  and stack ranges that could get clobbered if it called another
@@ -5476,7 +5477,9 @@ expand_builtin_stack_address ()
  (caller) function's active area as well, whereas those pushed or
  allocated temporarily for a call are regarded as part of the
  callee's stack range, rather than the caller's.  */
-  ret = plus_constant (ptr_mode, ret, STACK_POINTER_OFFSET);
+  if (SPARC_STACK_BOUNDARY_HACK)
+ret = plus_constant (ptr_mode, ret, STACK_POINTER_OFFSET);
+#endif

   return force_reg (ptr_mode, ret);
 }
--
2.39.3


Re: [Patchv3, rs6000] Clean up pre-checkings of expand_block_compare

2023-12-20 Thread Kewen.Lin
Hi,

on 2023/12/21 09:37, HAO CHEN GUI wrote:
> Hi,
>   This patch cleans up the pre-checks of expand_block_compare.  It does:
> 1. Assert that only P7 and above can enter this function, as that's
> already guarded by the expander.
> 2. Remove the P7 processor test, as only P7 and above can enter this
> function and P7 LE is excluded by targetm.slow_unaligned_access.  On P7
> BE, the performance of the expansion is better than that of the library
> when the length is long.
> 
>   Compared to last version,
> https://gcc.gnu.org/pipermail/gcc-patches/2023-December/640833.html
> the main change is to split optimization for size to a separate patch
> and add a testcase for P7 BE.
> 
>   Bootstrapped and tested on x86 and powerpc64-linux BE and LE with no
> regressions. Is this OK for trunk?

OK, thanks!

BR,
Kewen

> 
> Thanks
> Gui Haochen
> 
> ChangeLog
> rs6000: Clean up the pre-checkings of expand_block_compare
> 
> Remove P7 CPU test as only P7 above can enter this function and P7 LE is
> excluded by the checking of targetm.slow_unaligned_access on word_mode.
> Also performance test shows the expand of block compare is better than
> library on P7 BE when the length is from 16 bytes to 64 bytes.
> 
> gcc/
>   * gcc/config/rs6000/rs6000-string.cc (expand_block_compare): Assert
>   only P7 above can enter this function.  Remove P7 CPU test and let
>   P7 BE do the expand.
> 
> gcc/testsuite/
>   * gcc.target/powerpc/block-cmp-4.c: New.
> 
> patch.diff
> diff --git a/gcc/config/rs6000/rs6000-string.cc 
> b/gcc/config/rs6000/rs6000-string.cc
> index 5149273b80e..09db57255fa 100644
> --- a/gcc/config/rs6000/rs6000-string.cc
> +++ b/gcc/config/rs6000/rs6000-string.cc
> @@ -1947,15 +1947,12 @@ expand_block_compare_gpr(unsigned HOST_WIDE_INT 
> bytes, unsigned int base_align,
>  bool
>  expand_block_compare (rtx operands[])
>  {
> +  /* TARGET_POPCNTD is already guarded at expand cmpmemsi.  */
> +  gcc_assert (TARGET_POPCNTD);
> +
>if (optimize_insn_for_size_p ())
>  return false;
> 
> -  rtx target = operands[0];
> -  rtx orig_src1 = operands[1];
> -  rtx orig_src2 = operands[2];
> -  rtx bytes_rtx = operands[3];
> -  rtx align_rtx = operands[4];
> -
>/* This case is complicated to handle because the subtract
>   with carry instructions do not generate the 64-bit
>   carry and so we must emit code to calculate it ourselves.
> @@ -1963,23 +1960,19 @@ expand_block_compare (rtx operands[])
>if (TARGET_32BIT && TARGET_POWERPC64)
>  return false;
> 
> -  bool isP7 = (rs6000_tune == PROCESSOR_POWER7);
> -
>/* Allow this param to shut off all expansion.  */
>if (rs6000_block_compare_inline_limit == 0)
>  return false;
> 
> -  /* targetm.slow_unaligned_access -- don't do unaligned stuff.
> - However slow_unaligned_access returns true on P7 even though the
> - performance of this code is good there.  */
> -  if (!isP7
> -  && (targetm.slow_unaligned_access (word_mode, MEM_ALIGN (orig_src1))
> -   || targetm.slow_unaligned_access (word_mode, MEM_ALIGN (orig_src2))))
> -return false;
> +  rtx target = operands[0];
> +  rtx orig_src1 = operands[1];
> +  rtx orig_src2 = operands[2];
> +  rtx bytes_rtx = operands[3];
> +  rtx align_rtx = operands[4];
> 
> -  /* Unaligned l*brx traps on P7 so don't do this.  However this should
> - not affect much because LE isn't really supported on P7 anyway.  */
> -  if (isP7 && !BYTES_BIG_ENDIAN)
> +  /* targetm.slow_unaligned_access -- don't do unaligned stuff.  */
> +  if (targetm.slow_unaligned_access (word_mode, MEM_ALIGN (orig_src1))
> +  || targetm.slow_unaligned_access (word_mode, MEM_ALIGN (orig_src2)))
>  return false;
> 
>/* If this is not a fixed size compare, try generating loop code and
> @@ -2027,14 +2020,6 @@ expand_block_compare (rtx operands[])
>if (!IN_RANGE (bytes, 1, max_bytes))
>  return expand_compare_loop (operands);
> 
> -  /* The code generated for p7 and older is not faster than glibc
> - memcmp if alignment is small and length is not short, so bail
> - out to avoid those conditions.  */
> -  if (targetm.slow_unaligned_access (word_mode, base_align * BITS_PER_UNIT)
> -  && ((base_align == 1 && bytes > 16)
> -   || (base_align == 2 && bytes > 32)))
> -return false;
> -
>rtx final_label = NULL;
> 
>if (use_vec)
> diff --git a/gcc/testsuite/gcc.target/powerpc/block-cmp-4.c 
> b/gcc/testsuite/gcc.target/powerpc/block-cmp-4.c
> new file mode 100644
> index 000..c86febae68a
> --- /dev/null
> +++ b/gcc/testsuite/gcc.target/powerpc/block-cmp-4.c
> @@ -0,0 +1,11 @@
> +/* { dg-do compile { target be } } */
> +/* { dg-options "-O2 -mdejagnu-cpu=power7" } */
> +/* { dg-final { scan-assembler-not {\mb[l]? memcmp\M} } }  */
> +
> +/* Test that it does expand for memcmpsi instead of calling library on
> +   P7 BE when length is less than 32 bytes.  */
> +
> +int foo (const char* s1, const char* s2)
> +{
> +  return __builtin_memcmp (s1, s2, 31);
> +}


Re: [Patch, rs6000] Call library for block memory compare when optimizing for size

2023-12-20 Thread Kewen.Lin
Hi Haochen,

on 2023/12/20 16:56, HAO CHEN GUI wrote:
> Hi,
>   This patch call library function for block memory compare when it's
> optimized for size.
> 
>   Bootstrapped and tested on x86 and powerpc64-linux BE and LE with no
> regressions. Is this OK for trunk?
> 
> Thanks
> Gui Haochen
> 
> ChangeLog
> rs6000: Call library for block memory compare when optimizing for size
> 
> gcc/
>   * config/rs6000/rs6000-string.cc (expand_block_compare): Return
>   false when optimizing for size.
> 
> gcc/testsuite/
>   * gcc.target/powerpc/block-cm-3.c: New.

Nit: s/cm/cmp/

> 
> patch.diff
> diff --git a/gcc/config/rs6000/rs6000-string.cc 
> b/gcc/config/rs6000/rs6000-string.cc
> index 05dc41622f4..5149273b80e 100644
> --- a/gcc/config/rs6000/rs6000-string.cc
> +++ b/gcc/config/rs6000/rs6000-string.cc
> @@ -1947,6 +1947,9 @@ expand_block_compare_gpr(unsigned HOST_WIDE_INT bytes, 
> unsigned int base_align,
>  bool
>  expand_block_compare (rtx operands[])
>  {
> +  if (optimize_insn_for_size_p ())
> +return false;

Thanks for separating this out.  I just noticed that for the expanders
cmpstrnsi and cmpstrsi, we put this check in the associated {} code.
Since it doesn't need extra checks here, I'd prefer to move this
check to cmpmemsi to keep consistent with the existing ones.

OK for trunk with this move (also tested well), thanks!

BR,
Kewen

> +
>rtx target = operands[0];
>rtx orig_src1 = operands[1];
>rtx orig_src2 = operands[2];
> diff --git a/gcc/testsuite/gcc.target/powerpc/block-cmp-3.c 
> b/gcc/testsuite/gcc.target/powerpc/block-cmp-3.c
> new file mode 100644
> index 000..c7e853ad593
> --- /dev/null
> +++ b/gcc/testsuite/gcc.target/powerpc/block-cmp-3.c
> @@ -0,0 +1,8 @@
> +/* { dg-do compile } */
> +/* { dg-options "-Os" } */
> +/* { dg-final { scan-assembler-times {\mb[l]? memcmp\M} 1 } }  */
> +
> +int foo (const char* s1, const char* s2)
> +{
> +  return __builtin_memcmp (s1, s2, 4);
> +}
> 



Re: [Patchv3, rs6000] Correct definition of macro of fixed point efficient unaligned

2023-12-20 Thread Kewen.Lin
Hi Haochen,

on 2023/12/20 16:51, HAO CHEN GUI wrote:
> Hi,
>   The patch corrects the definition of
> TARGET_EFFICIENT_OVERLAPPING_UNALIGNED and replace it with the call of
> slow_unaligned_access.
> 
>   Compared with last version,
> https://gcc.gnu.org/pipermail/gcc-patches/2023-December/640832.html
> the main change is to pass alignment measured by bits to
> slow_unaligned_access.

For the record, in case someone would be confused here: Haochen and I had
a discussion offlist.  The target hook slow_unaligned_access requires the
alignment in bits, while the previous version mainly adopted alignment in
bytes, except for the aforementioned "UINTVAL (align_rtx)".

> 
>   Bootstrapped and tested on x86 and powerpc64-linux BE and LE with no
> regressions. Is this OK for trunk?
> 
> Thanks
> Gui Haochen
> 
> ChangeLog
> rs6000: Correct definition of macro of fixed point efficient unaligned
> 
> Macro TARGET_EFFICIENT_OVERLAPPING_UNALIGNED is used in rs6000-string.cc to
> guard the platforms which are efficient on fixed point unaligned load/store.
> It's originally defined by TARGET_EFFICIENT_UNALIGNED_VSX, which is enabled
> from P8 and can be disabled by the -mno-vsx option, so the definition is wrong.
> This patch corrects the problem and calls slow_unaligned_access to judge
> whether fixed point unaligned load/store is efficient or not.
> 
> gcc/
>   * config/rs6000/rs6000.h (TARGET_EFFICIENT_OVERLAPPING_UNALIGNED):
>   Remove.
>   * config/rs6000/rs6000-string.cc (select_block_compare_mode):
>   Replace TARGET_EFFICIENT_OVERLAPPING_UNALIGNED with
>   targetm.slow_unaligned_access.
>   (expand_block_compare_gpr): Likewise.
>   (expand_block_compare): Likewise.
>   (expand_strncmp_gpr_sequence): Likewise.
> 
> gcc/testsuite/
>   * gcc.target/powerpc/block-cmp-1.c: New.
>   * gcc.target/powerpc/block-cmp-2.c: New.
> 
> patch.diff
> diff --git a/gcc/config/rs6000/rs6000-string.cc 
> b/gcc/config/rs6000/rs6000-string.cc
> index 44a946cd453..05dc41622f4 100644
> --- a/gcc/config/rs6000/rs6000-string.cc
> +++ b/gcc/config/rs6000/rs6000-string.cc
> @@ -305,7 +305,7 @@ select_block_compare_mode (unsigned HOST_WIDE_INT offset,
>else if (bytes == GET_MODE_SIZE (QImode))
>  return QImode;
>else if (bytes < GET_MODE_SIZE (SImode)
> -&& TARGET_EFFICIENT_OVERLAPPING_UNALIGNED
> +&& !targetm.slow_unaligned_access (SImode, align * BITS_PER_UNIT)
>  && offset >= GET_MODE_SIZE (SImode) - bytes)
>  /* This matches the case were we have SImode and 3 bytes
> and offset >= 1 and permits us to move back one and overlap
> @@ -313,7 +313,7 @@ select_block_compare_mode (unsigned HOST_WIDE_INT offset,
> unwanted bytes off of the input.  */
>  return SImode;
>else if (word_mode_ok && bytes < UNITS_PER_WORD
> -&& TARGET_EFFICIENT_OVERLAPPING_UNALIGNED
> +&& !targetm.slow_unaligned_access (word_mode, align * BITS_PER_UNIT)
>  && offset >= UNITS_PER_WORD-bytes)
>  /* Similarly, if we can use DImode it will get matched here and
> can do an overlapping read that ends at the end of the block.  */
> @@ -1749,7 +1749,8 @@ expand_block_compare_gpr(unsigned HOST_WIDE_INT bytes, 
> unsigned int base_align,
>load_mode_size = GET_MODE_SIZE (load_mode);
>if (bytes >= load_mode_size)
>   cmp_bytes = load_mode_size;
> -  else if (TARGET_EFFICIENT_OVERLAPPING_UNALIGNED)
> +  else if (!targetm.slow_unaligned_access (load_mode,
> +align * BITS_PER_UNIT))
>   {
> /* Move this load back so it doesn't go past the end.
>P8/P9 can do this efficiently.  */
> @@ -2026,7 +2027,7 @@ expand_block_compare (rtx operands[])
>/* The code generated for p7 and older is not faster than glibc
>   memcmp if alignment is small and length is not short, so bail
>   out to avoid those conditions.  */
> -  if (!TARGET_EFFICIENT_OVERLAPPING_UNALIGNED
> +  if (targetm.slow_unaligned_access (word_mode, base_align * BITS_PER_UNIT)

Nit: Maybe add one variable for UINTVAL (align_rtx) before and use it
for base_align and here.

OK for trunk with or without this nit tweaked.  Thanks!

BR,
Kewen

>&& ((base_align == 1 && bytes > 16)
> || (base_align == 2 && bytes > 32)))
>  return false;
> @@ -2168,7 +2169,8 @@ expand_strncmp_gpr_sequence (unsigned HOST_WIDE_INT 
> bytes_to_compare,
>load_mode_size = GET_MODE_SIZE (load_mode);
>if (bytes_to_compare >= load_mode_size)
>   cmp_bytes = load_mode_size;
> -  else if (TARGET_EFFICIENT_OVERLAPPING_UNALIGNED)
> +  else if (!targetm.slow_unaligned_access (load_mode,
> +align * BITS_PER_UNIT))
>   {
> /* Move this load back so it doesn't go past the end.
>P8/P9 can do this efficiently.  */
> diff --git a/gcc/config/rs6000/rs6000.h b/gcc/config/rs6000/rs6000.h
> index 326c45221e9..3971a56c588 100644
> --- 

Re: [PATCH] sel-sched: Verify change before replacing dest in EXPR_INSN_RTX [PR112995]

2023-12-20 Thread Kewen.Lin
Hi Jeff,

on 2023/12/21 04:30, Jeff Law wrote:
> 
> 
> On 12/15/23 01:52, Kewen.Lin wrote:
>> Hi,
>>
>> PR112995 exposed one issue in current try_replace_dest_reg
>> that the result rtx insn after replace_dest_with_reg_in_expr
>> is probably unable to match any constraints.  Although there
>> are some checks on the changes onto dest or src of orig_insn,
>> none is performed on the EXPR_INSN_RTX.
>>
>> This patch is to add the check before actually replacing dest
>> in expr with reg.
>>
>> Bootstrapped and regtested on x86_64-redhat-linux and
>> powerpc64{,le}-linux-gnu.
>>
>> Is it ok for trunk?
>>
>> BR,
>> Kewen
>> -
>> PR rtl-optimization/112995
>>
>> gcc/ChangeLog:
>>
>> * sel-sched.cc (try_replace_dest_reg): Check the validity of the
>> replaced insn before actually replacing dest in expr.
>>
>> gcc/testsuite/ChangeLog:
>>
>> * gcc.target/powerpc/pr112995.c: New test.
> Setting aside whether or not we should just deprecate/remove sel-sched for 
> now
> 
> 
> 
> From the PR:
>> with moving up, we have:
>>
>> (insn 46 0 0 (set (reg:DI 64 0 [135])
>>     (sign_extend:DI (reg/v:SI 64 0 [orig:119 c ] [119]))) 31 
>> {extendsidi2}
>>  (expr_list:REG_DEAD (reg/v:SI 9 9 [orig:119 c ] [119])
>>     (nil)))
>>
>> in try_replace_dest_reg, we updated the above EXPR_INSN_RTX to:
>>
>> (insn 48 0 0 (set (reg:DI 32 0)
>>     (sign_extend:DI (reg/v:SI 64 0 [orig:119 c ] [119]))) 31 
>> {extendsidi2}
>>  (nil))
>>
>> This doesn't match any constraint and it's an unexpected modification.
> 
> 
> It would have been helpful to include that in the patch, along with the fact 
> that (reg 32) and (reg 64) are FP and VREGs respectively.  That makes it 
> clearer why the constraints might not match after the change.
> 

Good idea!

> OK for the trunk.

Thanks for the review, as suggested I updated the commit log
as below and committed it as r14-6768-g5fbc77726f68a3.

-
PR112995 exposed one issue in current try_replace_dest_reg
that the result rtx insn after replace_dest_with_reg_in_expr
is probably unable to match any constraints.  Although there
are some checks on the changes onto dest or src of orig_insn,
none is performed on the EXPR_INSN_RTX.

Initially we have:

(insn 31 6 10 2 (set (reg/v:SI 9 9 [orig:119 c ] [119])
 (reg/v:SI 64 0 [orig:119 c ] [119]))
"test.i":5:5 555 {*movsi_internal1} ... )
...
(insn 25 10 27 2 (set (reg:DI 64 0 [135])
  (sign_extend:DI
 (reg/v:SI 9 9 [orig:119 c ] [119])))
 "test.i":6:5 31 {extendsidi2} ...)

with moving up, we have:

(insn 46 0 0 (set (reg:DI 64 0 [135])
  (sign_extend:DI
  (reg/v:SI 64 0 [orig:119 c ] [119])))
   31 {extendsidi2} ...)

in try_replace_dest_reg, we updated the above EXPR_INSN_RTX to:

(insn 48 0 0 (set (reg:DI 32 0)
  (sign_extend:DI
  (reg/v:SI 64 0 [orig:119 c ] [119])))
   31 {extendsidi2} ...)

The dest (reg 64) is a VR (also a VSX reg); the update turns it
into (reg 32), which is an FPR (also a VSX reg).  We have an
alternative to match "VR,VR" but none to match "FPR/VSX,
VR/VSX", so it fails with an ICE.

This patch is to add the check before actually replacing dest
in expr with reg.
-

BR,
Kewen


Re: PING^1 [PATCH] sched: Remove debug counter sched_block

2023-12-20 Thread Kewen.Lin
Hi Jeff,

on 2023/12/21 04:43, Jeff Law wrote:
> 
> 
> On 12/11/23 23:17, Kewen.Lin wrote:
>> Hi,
>>
>> Gentle ping this:
>>
>> https://gcc.gnu.org/pipermail/gcc-patches/2023-November/636597.html
>>
>> BR,
>> Kewen
>>
>> on 2023/11/15 17:01, Kewen.Lin wrote:
>>> Hi,
>>>
>>> on 2023/11/10 01:40, Alexander Monakov wrote:
>>>
>>>> I agree with the concern. I hoped that solving the problem by skipping the 
>>>> BB
>>>> like the (bit-rotted) debug code needs to would be a minor surgery. As 
>>>> things
>>>> look now, it may be better to remove the non-working sched_block debug 
>>>> counter
>>>> entirely and implement a good solution for the problem at hand.
>>>>
>>>
>>> According to this comment, I made and tested the below patch to remove the
>>> problematic debug counter:
>>>
>>> Subject: [PATCH] sched: Remove debug counter sched_block
>>>
>>> Currently the debug counter sched_block doesn't work well
>>> since we create dependencies for some insns and those
>>> dependencies are expected to be resolved during scheduling
>>> insns but they can get skipped once we are skipping some
>>> block while respecting sched_block debug counter.
>>>
>>> For example, for the below test case:
>>> -- 
>>> int a, b, c, e, f;
>>> float d;
>>>
>>> void
>>> g ()
>>> {
>>>    float h, i[1];
>>>    for (; f;)
>>>  if (c)
>>>    {
>>> d *e;
>>> if (b)
>>>   {
>>>     float *j = i;
>>>     j[0] = 0;
>>>   }
>>> h = d;
>>>    }
>>>    if (h)
>>>  a = i[0];
>>> }
>>> -- 
>>> ICE occurs with option "-O2 -fdbg-cnt=sched_block:1".
>>>
>>> As the discussion in [1], it seems that we think this debug
>>> counter is useless and can be removed.  It's also implied
>>> that if it's useful and used often, the above issue should
>>> have been cared about and resolved earlier.  So this patch
>>> is to remove this debug counter.
>>>
>>> Bootstrapped and regtested on x86_64-redhat-linux and
>>> powerpc64{,le}-linux-gnu.
>>>
>>> [1] https://gcc.gnu.org/pipermail/gcc-patches/2023-November/635852.html
>>>
>>> Is it ok for trunk?
>>>
>>> BR,
>>> Kewen
>>> -
>>>
>>> gcc/ChangeLog:
>>>
>>> * dbgcnt.def (sched_block): Remove.
>>> * sched-rgn.cc (schedule_region): Remove the support of debug count
>>> sched_block.
> OK.  SOrry about the delay.

Thanks for the review, pushed as r14-6766-gef259ebeb39501.

BR,
Kewen


[PATCH] sched: Don't skip empty block by removing no_real_insns_p [PR108273]

2023-12-20 Thread Kewen.Lin
Hi,

This patch follows Richi's suggestion "scheduling shouldn't
special case empty blocks as they usually do not appear" in
[1]; it removes the function no_real_insns_p and its uses
completely.

There is a case where a block initially has only one
INSN_P, but while scheduling some other blocks this only
INSN_P gets moved away and the block becomes empty, so
the remaining NOTE_P insn gets counted at scheduling time.
Since this block wasn't empty initially and any NOTE_P gets
skipped in a normal block, the to-be-scheduled count
doesn't include it, which can cause the assertion below to
fail:

  /* Sanity check: verify that all region insns were scheduled.  */
  gcc_assert (sched_rgn_n_insns == rgn_n_insns);

A bitmap rgn_init_empty_bb is proposed to detect such a case
by recording whether each block is empty initially, before
actual scheduling.  The other changes mainly handle NOTEs,
which weren't expected before but now have to be dealt
with.

Bootstrapped and regress-tested on:
  - powerpc64{,le}-linux-gnu
  - x86_64-redhat-linux
  - aarch64-linux-gnu

Also tested this with superblock scheduling (sched2) turned
on by default, bootstrapped and regress-tested again on the
above triples.  I tried to test with seletive-scheduling
1/2 enabled by default, it's bootstrapped & regress-tested
on x86_64-redhat-linux, but both failed to build on
powerpc64{,le}-linux-gnu and aarch64-linux-gnu even without
this patch (so it's unrelated, I've filed two PRs for
observed failures on Power separately).

[1] https://inbox.sourceware.org/gcc-patches/CAFiYyc2hMvbU_+
a47ytnbxf0yrcybwrhru2jdcw5a0px3+n...@mail.gmail.com/

Is it ok for trunk or next stage 1?

BR,
Kewen
-
PR rtl-optimization/108273

gcc/ChangeLog:

* config/aarch64/aarch64.cc (aarch64_sched_adjust_priority): Early
return for NOTE_P.
* haifa-sched.cc (recompute_todo_spec): Likewise.
(setup_insn_reg_pressure_info): Likewise.
(schedule_insn): Handle NOTE_P specially as we don't skip empty block
any more and adopt NONDEBUG_INSN_P somewhere appropriate.
(commit_schedule): Likewise.
(prune_ready_list): Likewise.
(schedule_block): Likewise.
(set_priorities): Likewise.
(fix_tick_ready): Likewise.
(no_real_insns_p): Remove.
* rtl.h (SCHED_GROUP_P): Add NOTE consideration.
* sched-ebb.cc (schedule_ebb): Skip leading labels like note to ensure
that we don't have the chance to have single label block, remove the
call to no_real_insns_p.
* sched-int.h (no_real_insns_p): Remove declaration.
* sched-rgn.cc (free_block_dependencies): Remove the call to
no_real_insns_p.
(compute_priorities): Likewise.
(schedule_region): Remove the call to no_real_insns_p, check
rgn_init_empty_bb and update rgn_n_insns if needed.
(sched_rgn_local_init): Init rgn_init_empty_bb.
(sched_rgn_local_free): Free rgn_init_empty_bb.
(rgn_init_empty_bb): New static bitmap.
* sel-sched.cc (sel_region_target_finish): Remove the call to
no_real_insns_p.
---
 gcc/config/aarch64/aarch64.cc |   4 +
 gcc/haifa-sched.cc| 180 ++
 gcc/rtl.h |   4 +-
 gcc/sched-ebb.cc  |  10 +-
 gcc/sched-int.h   |   1 -
 gcc/sched-rgn.cc  |  43 
 gcc/sel-sched.cc  |   3 -
 7 files changed, 125 insertions(+), 120 deletions(-)

diff --git a/gcc/config/aarch64/aarch64.cc b/gcc/config/aarch64/aarch64.cc
index 4fd8c2de43a..749eef7a7c5 100644
--- a/gcc/config/aarch64/aarch64.cc
+++ b/gcc/config/aarch64/aarch64.cc
@@ -24178,6 +24178,10 @@ aarch64_sched_fusion_priority (rtx_insn *insn, int 
max_pri,
 static int
 aarch64_sched_adjust_priority (rtx_insn *insn, int priority)
 {
+  /* Skip NOTE in empty block.  */
+  if (!INSN_P (insn))
+return priority;
+
   rtx x = PATTERN (insn);

   if (GET_CODE (x) == SET)
diff --git a/gcc/haifa-sched.cc b/gcc/haifa-sched.cc
index 8e8add709b3..6e4724a79f8 100644
--- a/gcc/haifa-sched.cc
+++ b/gcc/haifa-sched.cc
@@ -1207,6 +1207,11 @@ recompute_todo_spec (rtx_insn *next, bool for_backtrack)
   int n_replace = 0;
   bool first_p = true;

+  /* Since we don't skip empty block any more, it's possible
+ to meet NOTE insn now, early return if so.  */
+  if (NOTE_P (next))
+return 0;
+
   if (sd_lists_empty_p (next, SD_LIST_BACK))
 /* NEXT has all its dependencies resolved.  */
 return 0;
@@ -1726,6 +1731,11 @@ setup_insn_reg_pressure_info (rtx_insn *insn)
   int *max_reg_pressure;
   static int death[N_REG_CLASSES];

+  /* Since we don't skip empty block any more, it's possible to
+ schedule NOTE insn now, we should check for it first.  */
+  if (NOTE_P (insn))
+return;
+
   gcc_checking_assert (!DEBUG_INSN_P (insn));

   excess_cost_change = 0;
@@ -4017,10 +4027,10 @@ schedule_insn (rtx_insn *insn)

   /* Scheduling instruction should have all its 

Re: [Patchv2, rs6000] Clean up pre-checkings of expand_block_compare

2023-12-18 Thread Kewen.Lin
Hi Haochen,

on 2023/12/18 10:44, HAO CHEN GUI wrote:
> Hi,
>   This patch cleans up pre-checkings of expand_block_compare. It does
> 1. Assert only P7 above can enter this function as it's already guard
> by the expand.
> 2. Return false when optimizing for size.
> 3. Remove P7 processor test as only P7 above can enter this function and
> P7 LE is excluded by targetm.slow_unaligned_access. On P7 BE, the
> performance of expand is better than the performance of library when
> the length is long.

Maybe it's better to split the handling for optimizing for size out into a
separate patch, since it's not actually a cleanup.  Sorry, I should
have suggested this in the previous review.  For 3, as you have evaluated
the performance on Power7, I think it's safe to make this change now, so
this patch is ok for trunk, thanks!

> 
>   Compared to last version,
> https://gcc.gnu.org/pipermail/gcc-patches/2023-December/640082.html
> the main change is to add some comments and move the variable definition
> closed to its use.
> 
>   Bootstrapped and tested on x86 and powerpc64-linux BE and LE with no
> regressions. Is this OK for trunk?
> 
> Thanks
> Gui Haochen
> 
> ChangeLog
> rs6000: Clean up the pre-checkings of expand_block_compare
> 
> gcc/
>   * gcc/config/rs6000/rs6000-string.cc (expand_block_compare): Assert
>   only P7 above can enter this function.  Return false (call library)
>   when it's optimized for size.  Remove P7 CPU test as only P7 above
>   can enter this function and P7 LE is excluded by the checking of
>   targetm.slow_unaligned_access on word_mode.  Also performance test
>   shows the expand of block compare with 16 bytes to 64 bytes length
>   is better than library on P7 BE.

Nit: You can just describe "what's done" but not "why" here, and put "why"
into the commit log instead.

BR,
Kewen

> 
> gcc/testsuite/
>   * gcc.target/powerpc/block-cmp-3.c: New.
> 
> 
> patch.diff
> diff --git a/gcc/config/rs6000/rs6000-string.cc 
> b/gcc/config/rs6000/rs6000-string.cc
> index cb9eeef05d8..49670cef4d7 100644
> --- a/gcc/config/rs6000/rs6000-string.cc
> +++ b/gcc/config/rs6000/rs6000-string.cc
> @@ -1946,36 +1946,32 @@ expand_block_compare_gpr(unsigned HOST_WIDE_INT 
> bytes, unsigned int base_align,
>  bool
>  expand_block_compare (rtx operands[])
>  {
> -  rtx target = operands[0];
> -  rtx orig_src1 = operands[1];
> -  rtx orig_src2 = operands[2];
> -  rtx bytes_rtx = operands[3];
> -  rtx align_rtx = operands[4];
> +  /* TARGET_POPCNTD is already guarded at expand cmpmemsi.  */
> +  gcc_assert (TARGET_POPCNTD);
> 
> -  /* This case is complicated to handle because the subtract
> - with carry instructions do not generate the 64-bit
> - carry and so we must emit code to calculate it ourselves.
> - We choose not to implement this yet.  */
> -  if (TARGET_32BIT && TARGET_POWERPC64)
> +  if (optimize_insn_for_size_p ())
>  return false;
> 
> -  bool isP7 = (rs6000_tune == PROCESSOR_POWER7);
> -
>/* Allow this param to shut off all expansion.  */
>if (rs6000_block_compare_inline_limit == 0)
>  return false;
> 
> -  /* targetm.slow_unaligned_access -- don't do unaligned stuff.
> - However slow_unaligned_access returns true on P7 even though the
> - performance of this code is good there.  */
> -  if (!isP7
> -  && (targetm.slow_unaligned_access (word_mode, MEM_ALIGN (orig_src1))
> -   || targetm.slow_unaligned_access (word_mode, MEM_ALIGN (orig_src2))))
> +  /* This case is complicated to handle because the subtract
> + with carry instructions do not generate the 64-bit
> + carry and so we must emit code to calculate it ourselves.
> + We choose not to implement this yet.  */
> +  if (TARGET_32BIT && TARGET_POWERPC64)
>  return false;
> 
> -  /* Unaligned l*brx traps on P7 so don't do this.  However this should
> - not affect much because LE isn't really supported on P7 anyway.  */
> -  if (isP7 && !BYTES_BIG_ENDIAN)
> +  rtx target = operands[0];
> +  rtx orig_src1 = operands[1];
> +  rtx orig_src2 = operands[2];
> +  rtx bytes_rtx = operands[3];
> +  rtx align_rtx = operands[4];
> +
> +  /* targetm.slow_unaligned_access -- don't do unaligned stuff.  */
> +if (targetm.slow_unaligned_access (word_mode, MEM_ALIGN (orig_src1))
> + || targetm.slow_unaligned_access (word_mode, MEM_ALIGN (orig_src2)))
>  return false;
> 
>/* If this is not a fixed size compare, try generating loop code and
> @@ -2023,14 +2019,6 @@ expand_block_compare (rtx operands[])
>if (!IN_RANGE (bytes, 1, max_bytes))
>  return expand_compare_loop (operands);
> 
> -  /* The code generated for p7 and older is not faster than glibc
> - memcmp if alignment is small and length is not short, so bail
> - out to avoid those conditions.  */
> -  if (targetm.slow_unaligned_access (word_mode, UINTVAL (align_rtx))
> -  && ((base_align == 1 && bytes > 16)
> -   || (base_align == 2 && bytes > 32)))
> -return false;
> 

Re: [Patchv2, rs6000] Correct definition of macro of fixed point efficient unaligned

2023-12-18 Thread Kewen.Lin
Hi Haochen,

on 2023/12/18 10:43, HAO CHEN GUI wrote:
> Hi,
>   The patch corrects the definition of
> TARGET_EFFICIENT_OVERLAPPING_UNALIGNED and replace it with the call of
> slow_unaligned_access.
> 
>   Compared with last version,
> https://gcc.gnu.org/pipermail/gcc-patches/2023-December/640076.html
> the main change is to replace the macro with slow_unaligned_access.
> 
>   Bootstrapped and tested on x86 and powerpc64-linux BE and LE with no
> regressions. Is this OK for trunk?
> 
> Thanks
> Gui Haochen
> 
> ChangeLog
> rs6000: Correct definition of macro of fixed point efficient unaligned
> 
> Macro TARGET_EFFICIENT_OVERLAPPING_UNALIGNED is used in rs6000-string.cc to
> guard the platforms which are efficient on fixed point unaligned load/store.
> It's originally defined by TARGET_EFFICIENT_UNALIGNED_VSX, which is enabled
> from P8 and can be disabled by the -mno-vsx option, so the definition is wrong.
> This patch corrects the problem and calls slow_unaligned_access to judge
> whether fixed point unaligned load/store is efficient or not.
> 
> gcc/
>   * config/rs6000/rs6000.h (TARGET_EFFICIENT_OVERLAPPING_UNALIGNED):
>   Remove.
>   * config/rs6000/rs6000-string.cc (select_block_compare_mode):
>   Replace TARGET_EFFICIENT_OVERLAPPING_UNALIGNED with
>   targetm.slow_unaligned_access.
>   (expand_block_compare_gpr): Likewise.
>   (expand_block_compare): Likewise.
>   (expand_strncmp_gpr_sequence): Likewise.
> 
> gcc/testsuite/
>   * gcc.target/powerpc/block-cmp-1.c: New.
>   * gcc.target/powerpc/block-cmp-2.c: New.
> 
> patch.diff
> diff --git a/gcc/config/rs6000/rs6000-string.cc 
> b/gcc/config/rs6000/rs6000-string.cc
> index 44a946cd453..cb9eeef05d8 100644
> --- a/gcc/config/rs6000/rs6000-string.cc
> +++ b/gcc/config/rs6000/rs6000-string.cc
> @@ -305,7 +305,7 @@ select_block_compare_mode (unsigned HOST_WIDE_INT offset,
>else if (bytes == GET_MODE_SIZE (QImode))
>  return QImode;
>else if (bytes < GET_MODE_SIZE (SImode)
> -&& TARGET_EFFICIENT_OVERLAPPING_UNALIGNED
> +&& !targetm.slow_unaligned_access (SImode, align)
>  && offset >= GET_MODE_SIZE (SImode) - bytes)
>  /* This matches the case were we have SImode and 3 bytes
> and offset >= 1 and permits us to move back one and overlap
> @@ -313,7 +313,7 @@ select_block_compare_mode (unsigned HOST_WIDE_INT offset,
> unwanted bytes off of the input.  */
>  return SImode;
>else if (word_mode_ok && bytes < UNITS_PER_WORD
> -&& TARGET_EFFICIENT_OVERLAPPING_UNALIGNED
> +&& !targetm.slow_unaligned_access (word_mode, align)
>  && offset >= UNITS_PER_WORD-bytes)
>  /* Similarly, if we can use DImode it will get matched here and
> can do an overlapping read that ends at the end of the block.  */
> @@ -1749,7 +1749,7 @@ expand_block_compare_gpr(unsigned HOST_WIDE_INT bytes, 
> unsigned int base_align,
>load_mode_size = GET_MODE_SIZE (load_mode);
>if (bytes >= load_mode_size)
>   cmp_bytes = load_mode_size;
> -  else if (TARGET_EFFICIENT_OVERLAPPING_UNALIGNED)
> +  else if (!targetm.slow_unaligned_access (load_mode, align))
>   {
> /* Move this load back so it doesn't go past the end.
>P8/P9 can do this efficiently.  */
> @@ -2026,7 +2026,7 @@ expand_block_compare (rtx operands[])
>/* The code generated for p7 and older is not faster than glibc
>   memcmp if alignment is small and length is not short, so bail
>   out to avoid those conditions.  */
> -  if (!TARGET_EFFICIENT_OVERLAPPING_UNALIGNED
> +  if (targetm.slow_unaligned_access (word_mode, UINTVAL (align_rtx))

At first glance it looks like we can use base_align here instead,
but I noticed that base_align is computed with

  unsigned int base_align = UINTVAL (align_rtx) / BITS_PER_UNIT;

As per the internal documentation, isn't the alignment already passed
in bytes?  If so, the "/ BITS_PER_UNIT" looks unexpected, could you
have a check?  If it is the case, a separate patch for it is
appreciated (and please check some other related/similar places too).  Thanks!

>&& ((base_align == 1 && bytes > 16)
> || (base_align == 2 && bytes > 32)))
>  return false;
> @@ -2168,7 +2168,7 @@ expand_strncmp_gpr_sequence (unsigned HOST_WIDE_INT 
> bytes_to_compare,
>load_mode_size = GET_MODE_SIZE (load_mode);
>if (bytes_to_compare >= load_mode_size)
>   cmp_bytes = load_mode_size;
> -  else if (TARGET_EFFICIENT_OVERLAPPING_UNALIGNED)
> +  else if (!targetm.slow_unaligned_access (load_mode, align))
>   {
> /* Move this load back so it doesn't go past the end.
>P8/P9 can do this efficiently.  */
> diff --git a/gcc/config/rs6000/rs6000.h b/gcc/config/rs6000/rs6000.h
> index 326c45221e9..3971a56c588 100644
> --- a/gcc/config/rs6000/rs6000.h
> +++ b/gcc/config/rs6000/rs6000.h
> @@ -483,10 +483,6 @@ extern int rs6000_vector_align[];
>  #define TARGET_NO_SF_SUBREG  TARGET_DIRECT_MOVE_64BIT
>  

[PATCH] sel-sched: Verify change before replacing dest in EXPR_INSN_RTX [PR112995]

2023-12-15 Thread Kewen.Lin
Hi,

PR112995 exposed one issue in current try_replace_dest_reg
that the result rtx insn after replace_dest_with_reg_in_expr
is probably unable to match any constraints.  Although there
are some checks on the changes onto dest or src of orig_insn,
none is performed on the EXPR_INSN_RTX.

This patch is to add the check before actually replacing dest
in expr with reg.

Bootstrapped and regtested on x86_64-redhat-linux and
powerpc64{,le}-linux-gnu.

Is it ok for trunk?

BR,
Kewen
-
PR rtl-optimization/112995

gcc/ChangeLog:

* sel-sched.cc (try_replace_dest_reg): Check the validity of the
replaced insn before actually replacing dest in expr.

gcc/testsuite/ChangeLog:

* gcc.target/powerpc/pr112995.c: New test.
---
 gcc/sel-sched.cc| 10 +-
 gcc/testsuite/gcc.target/powerpc/pr112995.c | 14 ++
 2 files changed, 23 insertions(+), 1 deletion(-)
 create mode 100644 gcc/testsuite/gcc.target/powerpc/pr112995.c

diff --git a/gcc/sel-sched.cc b/gcc/sel-sched.cc
index 1925f4a9461..a35b5e16c91 100644
--- a/gcc/sel-sched.cc
+++ b/gcc/sel-sched.cc
@@ -1614,7 +1614,15 @@ try_replace_dest_reg (ilist_t orig_insns, rtx best_reg, 
expr_t expr)
   /* Make sure that EXPR has the right destination
  register.  */
   if (expr_dest_regno (expr) != REGNO (best_reg))
-replace_dest_with_reg_in_expr (expr, best_reg);
+{
+  rtx_insn *vinsn = EXPR_INSN_RTX (expr);
> +  validate_change (vinsn, &SET_DEST (PATTERN (vinsn)), best_reg, 1);
+  bool res = verify_changes (0);
+  cancel_changes (0);
+  if (!res)
+   return false;
+  replace_dest_with_reg_in_expr (expr, best_reg);
+}
   else
 EXPR_TARGET_AVAILABLE (expr) = 1;

diff --git a/gcc/testsuite/gcc.target/powerpc/pr112995.c 
b/gcc/testsuite/gcc.target/powerpc/pr112995.c
new file mode 100644
index 000..4adcb5f3851
--- /dev/null
+++ b/gcc/testsuite/gcc.target/powerpc/pr112995.c
@@ -0,0 +1,14 @@
+/* { dg-require-effective-target float128 } */
+/* { dg-options "-O2 -mdejagnu-cpu=power9 -fselective-scheduling2" } */
+
+/* Verify there is no ICE.  */
+
+int a[10];
+int b(_Float128 e) {
+  int c;
+  _Float128 d;
+  c = e;
+  d = c;
+  d = a[c] + d;
+  return d;
+}
--
2.39.3



Re: PING^1 [PATCH] range: Workaround different type precision issue between _Float128 and long double [PR112788]

2023-12-12 Thread Kewen.Lin
Hi Jakub & Andrew,

on 2023/12/12 22:42, Jakub Jelinek wrote:
> On Tue, Dec 12, 2023 at 09:33:38AM -0500, Andrew MacLeod wrote:
>> I leave this for the release managers, but I am not opposed to it for this
>> release... It would be nice to remove it for the next release
> 
> I can live with it for GCC 14, so ok, but it is very ugly.

Thanks, pushed as r14-6478-gfda8e2f8292a90.

And yes, I strongly agree that we should get rid of this in the next release.

> 
> We should fix it in a better way for GCC 15+.
> I think we shouldn't lie, both on the mode precisions and on type
> precisions.  The middle-end already contains some hacks to make it
> work to some extent on 2 different modes with same precision (for BFmode vs.
> HFmode), on the FE side if we need a target hook the C/C++ FE will use
> to choose type ranks and/or the type for binary operations, so be it.
> It would be also great if rs6000 backend had just 2 modes for 128-bit
> floats, one for IBM double double, one for IEEE quad, not 3 as it has now,
> perhaps with TFmode being a macro that conditionally expands to one or the
> other.  Or do some tweaks in target hooks to keep backwards compatibility
> with mode attribute and similar.

Thanks for all the insightful suggestions, I just filed PR112993 for
further tracking and self-assigned it.

BR,
Kewen


[PATCH draft v2] sched: Don't skip empty block in scheduling [PR108273]

2023-12-11 Thread Kewen.Lin
Hi,

on 2023/11/22 17:30, Kewen.Lin wrote:
> on 2023/11/17 20:55, Alexander Monakov wrote:
>>
>> On Fri, 17 Nov 2023, Kewen.Lin wrote:
>>>> I don't think you can run cleanup_cfg after sched_init. I would suggest
>>>> to put it early in schedule_insns.
>>>
>>> Thanks for the suggestion, I placed it at the beginning of haifa_sched_init
>>> instead, since schedule_insns invokes haifa_sched_init.  Although the
>>> calls to rgn_setup_common_sched_info and rgn_setup_sched_infos are
>>> executed ahead of it, they are all "setup" functions and shouldn't
>>> affect or be affected by this placement.
>>
>> I was worried because sched_init invokes df_analyze, and I'm not sure if
>> cfg_cleanup can invalidate it.
> 
> Thanks for further explaining!  By scanning cleanup_cfg, it seems that it
> considers df, like compact_blocks checks df, try_optimize_cfg invokes
> df_analyze etc., but I agree that moving cleanup_cfg before sched_init
> makes more sense.
> 
>>
>>>> I suspect this may be caused by invoking cleanup_cfg too late.
>>>
>>> By looking into some failures, I found that although cleanup_cfg is
>>> executed, there would still be some empty blocks left; by analyzing a
>>> few failures there are at least these cases:
>>>   1. empty function body
>>>   2. block holding a label for return.
>>>   3. block without any successor.
>>>   4. block which becomes empty after scheduling some other block.
>>>   5. block which looks mergeable with its always successor but left.
>>>   ...
>>>
>>> For 1 and 2, there is a single successor, the EXIT block, so I think they
>>> don't affect the state transition; for 3 it's the same.  For 4, it depends
>>> on whether we can assume such an empty block never carries a debug insn
>>> (any associated debug insn should be moved along with it); I'm not sure.
>>> For 5, a reduced test case is:
>>
>> Oh, I should have thought of cases like these, really sorry about the slip
>> of attention, and thanks for showing a testcase for item 5. As Richard as
>> saying in his response, cfg_cleanup cannot be a fix here. The thing to check
>> would be changing no_real_insns_p to always return false, and see if the
>> situation looks recoverable (if it breaks bootstrap, regtest statistics of
>> a non-bootstrapped compiler are still informative).
> 
> As you suggested, I forced no_real_insns_p to return false all the time, some
> issues got exposed, almost all of them are asserting NOTE_P insn shouldn't be
> encountered in those places, so the adjustments for most of them are just to
> consider NOTE_P or this kind of special block and so on.  One draft patch is
> attached, it can be bootstrapped and regress-tested on ppc64{,le} and x86.
> btw, it's without the previous cfg_cleanup adjustment (hope it can get more
> empty blocks and expose more issues).  The draft isn't qualified for code
> review but I hope it can provide some information on what kinds of changes
> are needed for the proposal.  If this is the direction which we all agree on,
> I'll further refine it and post a formal patch.  One thing I want to note is
> that this patch disable one assertion below:
> 
> diff --git a/gcc/sched-rgn.cc b/gcc/sched-rgn.cc
> index e5964f54ead..abd334864fb 100644
> --- a/gcc/sched-rgn.cc
> +++ b/gcc/sched-rgn.cc
> @@ -3219,7 +3219,7 @@ schedule_region (int rgn)
>  }
> 
>/* Sanity check: verify that all region insns were scheduled.  */
> -  gcc_assert (sched_rgn_n_insns == rgn_n_insns);
> +  // gcc_assert (sched_rgn_n_insns == rgn_n_insns);
> 
>sched_finish_ready_list ();
> 
> Some cases can cause this assertion to fail; it's due to a mismatch between
> the to-be-scheduled and scheduled insn counts.  It happens when a block
> previously had only one INSN_P insn, but that insn gets moved while
> scheduling some other block, so we end up with an empty block in which only
> the NOTE_P insn is counted.  Since this block wasn't empty initially and
> NOTE_P insns are skipped in a normal block, the to-be-scheduled count
> doesn't include it.  It can be fixed by special-casing this kind of block
> for counting, e.g. initially recording which blocks are empty and fixing up
> the count for any block not recorded before.  I'm not sure whether all this
> complication makes the proposal worse than the previous approach of
> special-casing debug insns; looking forward to more comments.
> 

The attached one is the improved draft patch v2 for skipping empty BB, agai

Re: [PATCH v2] rs6000: Add new pass for replacement of contiguous addresses vector load lxv with lxvp

2023-12-11 Thread Kewen.Lin
Hi Ajit,

on 2023/12/8 16:01, Ajit Agarwal wrote:
> Hello Kewen:
> 

[snip...]

> With UNSPEC_MMA_EXTRACT I could generate the register pair, but
> functionally the code below is incorrect.
> 
>  lxvp %vs0,0(%r4)
> xxlor %vs32,%vs0,%vs0
> xvf32ger 0,%vs34,%vs32
> xvf32gerpp 0,%vs34,%vs33
> xxmfacc 0
> stxvp %vs2,0(%r3)
> stxvp %vs0,32(%r3)
> blr
> 
> 
> Here is the RTL Code:
> 
> (insn 19 4 20 2 (set (reg:OO 124 [ *ptr_4(D) ])
> (mem:OO (reg/v/f:DI 122 [ ptr ]) [0 *ptr_4(D)+0 S16 A128])) -1
>  (nil))
> (insn 20 19 9 2 (set (reg:V16QI 129 [orig:124 *ptr_4(D) ] [124])
> (subreg:V16QI (reg:OO 124 [ *ptr_4(D) ]) 0)) -1
>  (nil))
> (insn 9 20 11 2 (set (reg:XO 119 [ _7 ])
> (unspec:XO [
> (reg/v:V16QI 123 [ src ])
> (reg:V16QI 129 [orig:124 *ptr_4(D) ] [124])
> ] UNSPEC_MMA_XVF32GER)) 2195 {mma_xvf32ger}
>  (expr_list:REG_DEAD (reg:OO 124 [ *ptr_4(D) ])
> (nil)))
> (insn 11 9 12 2 (set (reg:XO 120 [ _9 ])
> (unspec:XO [
> (reg:XO 119 [ _7 ])
> (reg/v:V16QI 123 [ src ])
> (reg:V16QI 125 [ MEM[(__vector unsigned char *)ptr_4(D) + 16B] ])
> ] UNSPEC_MMA_XVF32GERPP)) 2209 {mma_xvf32gerpp}
>  (expr_list:REG_DEAD (reg:V16QI 125 [ MEM[(__vector unsigned char *)ptr_4(D) + 16B] ])
> (expr_list:REG_DEAD (reg/v:V16QI 123 [ src ])
> (expr_list:REG_DEAD (reg:XO 119 [ _7 ])
> (nil)
> (insn 12 11 18 2 (set (mem:XO (reg:DI 126) [1 *dst_10(D)+0 S64 A128])
> (reg:XO 120 [ _9 ])) "../gcc/testsuite/g++.target/powerpc/vecload.C":13:8 2182 {*movxo}
>  (expr_list:REG_DEAD (reg:DI 126)
> (expr_list:REG_DEAD (reg:XO 120 [ _9 ])
> (nil
> (note 18 12 0 NOTE_INSN_DELETED)
> 
> r124 and r129 have conflicting live ranges, and IRA generates different
> registers, which will not serve our purpose.
> 
> Making r124 and r129 the same means IRA will not allocate a register,
> since r124 could have both OOmode and V16QImode.
> 
> Doing this pass before IRA has the above issues; we could solve them by
> running the pass after reload.

Could you also attach your latest WIP patch?  I'm going to look into the extra 
move issue with it.

Thanks!

BR,
Kewen


PING^1 [PATCH] sched: Remove debug counter sched_block

2023-12-11 Thread Kewen.Lin
Hi,

Gentle ping this:

https://gcc.gnu.org/pipermail/gcc-patches/2023-November/636597.html

BR,
Kewen

on 2023/11/15 17:01, Kewen.Lin wrote:
> Hi,
> 
> on 2023/11/10 01:40, Alexander Monakov wrote:
> 
>> I agree with the concern. I hoped that solving the problem by skipping the BB
>> like the (bit-rotted) debug code needs to would be a minor surgery. As things
>> look now, it may be better to remove the non-working sched_block debug 
>> counter
>> entirely and implement a good solution for the problem at hand.
>>
> 
> According to this comment, I made and tested the below patch to remove the
> problematic debug counter:
> 
> Subject: [PATCH] sched: Remove debug counter sched_block
> 
> Currently the debug counter sched_block doesn't work well
> since we create dependencies for some insns and those
> dependencies are expected to be resolved during scheduling
> insns but they can get skipped once we are skipping some
> block while respecting sched_block debug counter.
> 
> For example, for the below test case:
> --
> int a, b, c, e, f;
> float d;
> 
> void
> g ()
> {
>   float h, i[1];
>   for (; f;)
> if (c)
>   {
>   d *e;
>   if (b)
> {
>   float *j = i;
>   j[0] = 0;
> }
>   h = d;
>   }
>   if (h)
> a = i[0];
> }
> --
> ICE occurs with option "-O2 -fdbg-cnt=sched_block:1".
> 
> As discussed in [1], it seems that we think this debug
> counter is useless and can be removed.  It's also implied
> that if it's useful and used often, the above issue should
> have been cared about and resolved earlier.  So this patch
> is to remove this debug counter.
> 
> Bootstrapped and regtested on x86_64-redhat-linux and
> powerpc64{,le}-linux-gnu.
> 
> [1] https://gcc.gnu.org/pipermail/gcc-patches/2023-November/635852.html
> 
> Is it ok for trunk?
> 
> BR,
> Kewen
> -
> 
> gcc/ChangeLog:
> 
>   * dbgcnt.def (sched_block): Remove.
>   * sched-rgn.cc (schedule_region): Remove the support of debug count
>   sched_block.
> ---
>  gcc/dbgcnt.def   |  1 -
>  gcc/sched-rgn.cc | 19 ++-
>  2 files changed, 6 insertions(+), 14 deletions(-)
> 
> diff --git a/gcc/dbgcnt.def b/gcc/dbgcnt.def
> index 871cbf75d93..a8c4e61e13d 100644
> --- a/gcc/dbgcnt.def
> +++ b/gcc/dbgcnt.def
> @@ -198,7 +198,6 @@ DEBUG_COUNTER (pre_insn)
>  DEBUG_COUNTER (prefetch)
>  DEBUG_COUNTER (registered_jump_thread)
>  DEBUG_COUNTER (sched2_func)
> -DEBUG_COUNTER (sched_block)
>  DEBUG_COUNTER (sched_breakdep)
>  DEBUG_COUNTER (sched_func)
>  DEBUG_COUNTER (sched_insn)
> diff --git a/gcc/sched-rgn.cc b/gcc/sched-rgn.cc
> index e5964f54ead..1c8acf5068a 100644
> --- a/gcc/sched-rgn.cc
> +++ b/gcc/sched-rgn.cc
> @@ -3198,20 +3198,13 @@ schedule_region (int rgn)
>current_sched_info->queue_must_finish_empty = current_nr_blocks == 1;
> 
>curr_bb = first_bb;
> -  if (dbg_cnt (sched_block))
> -{
> -   int saved_last_basic_block = last_basic_block_for_fn (cfun);
> +  int saved_last_basic_block = last_basic_block_for_fn (cfun);
> 
> -   schedule_block (&curr_bb, bb_state[first_bb->index]);
> -   gcc_assert (EBB_FIRST_BB (bb) == first_bb);
> -   sched_rgn_n_insns += sched_n_insns;
> -   realloc_bb_state_array (saved_last_basic_block);
> -   save_state_for_fallthru_edge (last_bb, curr_state);
> -}
> -  else
> -{
> -  sched_rgn_n_insns += rgn_n_insns;
> -}
> +  schedule_block (&curr_bb, bb_state[first_bb->index]);
> +  gcc_assert (EBB_FIRST_BB (bb) == first_bb);
> +  sched_rgn_n_insns += sched_n_insns;
> +  realloc_bb_state_array (saved_last_basic_block);
> +  save_state_for_fallthru_edge (last_bb, curr_state);
> 
>/* Clean up.  */
>if (current_nr_blocks > 1)
> --
> 2.39.1


PING^1 [PATCH] rs6000: New pass to mitigate SP float load perf issue on Power10

2023-12-11 Thread Kewen.Lin
Hi,

Gentle ping:

https://gcc.gnu.org/pipermail/gcc-patches/2023-November/636599.html

BR,
Kewen

on 2023/11/15 17:16, Kewen.Lin wrote:
> Hi,
> 
> As Power ISA defines, when loading a scalar single precision (SP)
> floating point from memory, we have the double precision (DP) format
> in target register converted from SP, it's unlike some other
> architectures which supports SP and DP in registers with their
> separated formats.  The scalar SP instructions operates on DP format
> value in register and round the result to fit in SP (but still
> keeping the value in DP format).
> 
> On Power10, a scalar SP floating point load insn will be cracked into
> two internal operations, one is to load the value, the other is to
> convert SP to DP format.  Comparing to those uncracked load like
> vector SP load, it has extra 3 cycles load-to-use penalty.  When
> evaluating some critical workloads, we found that for some cases we
> don't really need the conversion if all the involved operations are
> only with SP format.  In this case, we can replace the scalar SP
> loads with vector SP load and splat (no conversion), replace all
> involved computation with the corresponding vector operations (with
> Power10 slice-based design, we expect the latency of scalar operation
> and its equivalent vector operation is the same), that is to promote
> the scalar SP loads and their affected computation to vector
> operations.
> 
> For example for the below case:
> 
> void saxpy (int n, float a, float * restrict x, float * restrict y)
> {
>   for (int i = 0; i < n; ++i)
>   y[i] = a*x[i] + y[i];
> }
> 
> At -O2, the loop body would end up with:
> 
> .L3:
> lfsx 12,6,9// conv
> lfsx 0,5,9 // conv
> fmadds 0,0,1,12
> stfsx 0,6,9
> addi 9,9,4
> bdnz .L3
> 
> but it can be implemented with:
> 
> .L3:
> lxvwsx 0,5,9   // load and splat
> lxvwsx 12,6,9
> xvmaddmsp 0,1,12
> stxsiwx 0,6,9  // just store word 1 (BE ordering)
> addi 9,9,4
> bdnz .L3
> 
> Evaluated on Power10, the latter can speed up 23% against the former.
> 
> So this patch is to introduce a pass to recognize such cases and
> replace the scalar SP operations with their vector SP equivalents
> where appropriate.
> 
> The processing of this pass starts from scalar SP loads, first it
> checks if it's valid, further checks all the stmts using its loaded
> result, then propagates from them.  This process of propagation
> mainly goes with function visit_stmt, which first checks the
> validity of the given stmt, then checks the feeders of use operands
> with visit_stmt recursively, finally checks all the stmts using the
> def with visit_stmt recursively.  The purpose is to ensure all
> propagated stmts are valid to be transformed with its equivalent
> vector operations.  For some special operands like constant or
> GIMPLE_NOP def ssa, record them as splatting candidates.  There are
> some validity checks like: if the addressing mode can satisfy index
> form with some adjustments, if there is the corresponding vector
> operation support, and so on.  Once all propagated stmts from one
> load are valid, they are transformed by function transform_stmt by
> respecting the information in stmt_info like sf_type, new_ops etc.
> 
> For example, for the below test case:
> 
>   _4 = MEM[(float *)x_13(D) + ivtmp.13_24 * 1];  // stmt1
>   _7 = MEM[(float *)y_15(D) + ivtmp.13_24 * 1];  // stmt2
>   _8 = .FMA (_4, a_14(D), _7);   // stmt3
>   MEM[(float *)y_15(D) + ivtmp.13_24 * 1] = _8;  // stmt4
> 
> The processing starts from stmt1, which is taken as valid, adds it
> into the chain, then processes its use stmt stmt3, which is also
> valid, iterating its operands _4 whose def is stmt1 (visited), a_14
> which needs splatting and _7 whose def stmt2 is to be processed.
> Then stmt2 is taken as a valid load and it's added into the chain.
> All operands _4, a_14 and _7 of stmt3 are processed well, then it's
> added into the chain.  Then it processes use stmts of _8 (result of
> stmt3), so checks stmt4 which is a valid store.  Since all these
> involved stmts are valid to be transformed, we get below finally:
> 
>   sf_5 = __builtin_vsx_lxvwsx (ivtmp.13_24, x_13(D));
>   sf_25 = __builtin_vsx_lxvwsx (ivtmp.13_24, y_15(D));
>   sf_22 = {a_14(D), a_14(D), a_14(D), a_14(D)};
>   sf_20 = .FMA (sf_5, sf_22, sf_25);
>   __builtin_vsx_stxsiwx (sf_20, ivtmp.13_24, y_15(D));
> 
> Since it needs to do some validity checks and adjustments if allowed,
> such as: check if some scalar operation has the corresponding vector
> support, considering scalar SP load can allow r

PING^6 [PATCH v2] rs6000: Don't use optimize_function_for_speed_p too early [PR108184]

2023-12-11 Thread Kewen.Lin
Hi,

Gentle ping this:

https://gcc.gnu.org/pipermail/gcc-patches/2023-January/609993.html

BR,
Kewen

>>>>> on 2023/1/16 17:08, Kewen.Lin via Gcc-patches wrote:
>>>>>> Hi,
>>>>>>
>>>>>> As Honza pointed out in [1], the current uses of function
>>>>>> optimize_function_for_speed_p in rs6000_option_override_internal
>>>>>> are too early, since the query results from the functions
>>>>>> optimize_function_for_{speed,size}_p could be changed later due
>>>>>> to profile feedback and some function attributes handlings etc.
>>>>>>
>>>>>> This patch is to move optimize_function_for_speed_p to all the
>>>>>> use places of the corresponding flags, which follows the existing
>>>>>> practices.  Maybe we can cache it somewhere at an appropriate
>>>>>> timing, but that's another thing.
>>>>>>
>>>>>> Comparing with v1[2], this version added one test case for
>>>>>> SAVE_TOC_INDIRECT as Segher questioned and suggested, and it
>>>>>> also considered the possibility of explicit option (see test
>>>>>> cases pr108184-2.c and pr108184-4.c).  I believe that excepting
>>>>>> for the intentional change on optimize_function_for_{speed,
>>>>>> size}_p, there is no other function change.
>>>>>>
>>>>>> [1] https://gcc.gnu.org/pipermail/gcc-patches/2022-November/607527.html
>>>>>> [2] https://gcc.gnu.org/pipermail/gcc-patches/2023-January/609379.html
>>>>>>
>>>>>> Bootstrapped and regtested on powerpc64-linux-gnu P8,
>>>>>> powerpc64le-linux-gnu P{9,10} and powerpc-ibm-aix.
>>>>>>
>>>>>> Is it ok for trunk?
>>>>>>
>>>>>> BR,
>>>>>> Kewen
>>>>>> -
>>>>>> gcc/ChangeLog:
>>>>>>
>>>>>>  * config/rs6000/rs6000.cc (rs6000_option_override_internal): Remove
>>>>>>  all optimize_function_for_speed_p uses.
>>>>>>  (fusion_gpr_load_p): Call optimize_function_for_speed_p along
>>>>>>  with TARGET_P8_FUSION_SIGN.
>>>>>>  (expand_fusion_gpr_load): Likewise.
>>>>>>  (rs6000_call_aix): Call optimize_function_for_speed_p along with
>>>>>>  TARGET_SAVE_TOC_INDIRECT.
>>>>>>  * config/rs6000/predicates.md (fusion_gpr_mem_load): Call
>>>>>>  optimize_function_for_speed_p along with TARGET_P8_FUSION_SIGN.
>>>>>>
>>>>>> gcc/testsuite/ChangeLog:
>>>>>>
>>>>>>  * gcc.target/powerpc/pr108184-1.c: New test.
>>>>>>  * gcc.target/powerpc/pr108184-2.c: New test.
>>>>>>  * gcc.target/powerpc/pr108184-3.c: New test.
>>>>>>  * gcc.target/powerpc/pr108184-4.c: New test.
>>>>>> ---
>>>>>>  gcc/config/rs6000/predicates.md   |  5 +++-
>>>>>>  gcc/config/rs6000/rs6000.cc   | 19 +-
>>>>>>  gcc/testsuite/gcc.target/powerpc/pr108184-1.c | 16 
>>>>>>  gcc/testsuite/gcc.target/powerpc/pr108184-2.c | 15 +++
>>>>>>  gcc/testsuite/gcc.target/powerpc/pr108184-3.c | 25 +++
>>>>>>  gcc/testsuite/gcc.target/powerpc/pr108184-4.c | 24 ++
>>>>>>  6 files changed, 97 insertions(+), 7 deletions(-)
>>>>>>  create mode 100644 gcc/testsuite/gcc.target/powerpc/pr108184-1.c
>>>>>>  create mode 100644 gcc/testsuite/gcc.target/powerpc/pr108184-2.c
>>>>>>  create mode 100644 gcc/testsuite/gcc.target/powerpc/pr108184-3.c
>>>>>>  create mode 100644 gcc/testsuite/gcc.target/powerpc/pr108184-4.c
>>>>>>
>>>>>> diff --git a/gcc/config/rs6000/predicates.md 
>>>>>> b/gcc/config/rs6000/predicates.md
>>>>>> index a1764018545..9f84468db84 100644
>>>>>> --- a/gcc/config/rs6000/predicates.md
>>>>>> +++ b/gcc/config/rs6000/predicates.md
>>>>>> @@ -1878,7 +1878,10 @@ (define_predicate "fusion_gpr_mem_load"
>>>>>>
>>>>>>/* Handle sign/zero extend.  */
>>>>>>if (GET_CODE (op) == ZERO_EXTEND
>>>>>> -  || (TARGET_P8_FUSION_SIGN && GET_CODE (op) == SIGN_EXTEND))
>>>>>> + 
