Re: Adding a new thread model to GCC

2022-10-31 Thread i.nixman--- via Gcc-patches

On 2022-10-31 09:18, Eric Botcazou wrote:

hello Eric!

This also changes libstdc++ to pass -D_WIN32_WINNT=0x0600, but only when
the switch --enable-libstdcxx-threads is passed, which means that C++11
threads are still disabled by default *unless* MinGW-W64 itself is
configured for Windows Vista/Server 2008 or later by default (this has
been the case in the development version since the end of 2020; for
earlier versions you can configure it --with-default-win32-winnt=0x0600
to get the same effect).


I have run into "#error Timed lock primitives are not supported on
Windows targets" and I'm not sure I understand the reason correctly.


as far as I understand, the definition for 
`_GTHREAD_USE_MUTEX_TIMEDLOCK` comes from libstdc++/configure as a 
result of some test.


why did I run into this error? what should I do to avoid it?
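
For what it's worth, a rough sketch of the kind of guard involved (purely
hypothetical -- the real check is done by the configure test and the
gthread/libstdc++ headers, and may look quite different):

/* Hypothetical illustration only: timed locks need Vista-or-later
   (_WIN32_WINNT >= 0x0600) APIs, so they are disabled below that.  */
#if defined (_WIN32_WINNT) && _WIN32_WINNT >= 0x0600
# define _GTHREAD_USE_MUTEX_TIMEDLOCK 1
#else
# error Timed lock primitives are not supported on Windows targets
#endif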



you can configure it --with-default-win32-winnt=0x0600 to get the same 
effect


are you talking about the `--with-default-win32-winnt=` option used in
the MinGW-builds script?





best!


Re: [RFC] propgation leap over memory copy for struct

2022-10-31 Thread Jiufu Guo via Gcc-patches
Segher Boessenkool  writes:

> On Mon, Oct 31, 2022 at 04:13:38PM -0600, Jeff Law wrote:
>> On 10/30/22 20:42, Jiufu Guo via Gcc-patches wrote:
>> >We know that for struct variable assignment, memory copy may be used.
>> >And for memcpy, we may load and store as many bytes as possible at one time.
>> >While it may not be best here:
>
>> So the first question in my mind is can we do better at the gimple 
>> phase?  For the second case in particular can't we just "return a" 
>> rather than copying a into <retval> then returning <retval>?  This feels 
>> a lot like the return value optimization from C++.  I'm not sure if it 
>> applies to the first case or not, it's been a long time since I looked 
>> at NRV optimizations, but it might be worth poking around in there a bit 
>> (tree-nrv.cc).
>
> If it is a bigger struct you end up with quite a lot of stuff in
> registers.  GCC will eventually put that all in memory so it will work
> out fine in the end, but you are likely to get inefficient code.
Yes.  We may need to use memory for a big struct to save registers,
while a small struct can practically use registers.  We may leverage
the idea that some kinds of small structs are passed to functions
through registers.

>
> OTOH, 8 bytes isn't as big as we would want these days, is it?  So it
> would be useful to put smaller temporaries, say 32 bytes and smaller,
> in registers instead of in memory.
I think you mean: we should try to use registers to avoid memory
accesses, and using registers would be OK for larger memcpys (32 bytes).
Great suggestion, thanks a lot!

Something like the idea below:
[r100:TI, r101:TI] = src;  // Or r100:OI/OO = src;
dest = [r100:TI, r101:TI];

Currently, for an 8-byte structure, we are using TImode for it, and the
subreg/fwprop/cse passes are able to optimize it as expected.  Two
concerns here: larger integer modes (OI/OO/...) may not be introduced
yet, and I'm not sure whether the current infrastructure supports using
two or more registers for one structure.

>
>> But even so, these kinds of things are still bound to happen, so it's 
>> probably worth thinking about if we can do better in RTL as well.
>
> Always.  It is a mistake to think that having better high-level
> optimisations means that you don't need good low-level optimisations
> anymore: in fact deficiencies there become more glaringly apparent if
> the early pipeline opts become better :-)
Understood, thanks :)

>
>> The first thing that comes to my mind is to annotate memcpy calls that 
>> are structure assignments.  The idea here is that we may want to expand 
>> a memcpy differently in those cases.   Changing how we expand an opaque 
>> memcpy call is unlikely to be beneficial in most cases.  But changing 
>> how we expand a structure copy may be beneficial by exposing the 
>> underlying field values.   This would roughly correspond to your method 
>> #1.
>> 
>> Or instead of changing how we expand, teach the optimizers about these 
>> annotated memcpy calls -- they're just a copy of each field.   That's 
>> how CSE and the propagators could treat them. After some point we'd 
>> lower them in the usual ways, but at least early in the RTL pipeline we 
>> could keep them as annotated memcpy calls.  This roughly corresponds to 
>> your second suggestion.
>
> Ideally this won't ever make it as far as RTL, if the structures do not
> need to go via memory.  All high-level optimisations should have been
> done earlier, and hopefully it was not expand itself that forced stuff
> into memory!  :-/
Currently, after early gimple optimizations, struct member accesses
may still need to go through memory (if the mode of the struct is BLK).
For example:

_Bool foo (const A a) { return a.a[0] > 1.0; }

The optimized gimple would be:
  _1 = a.a[0];
  _3 = _1 > 1.0e+0;
  return _3;

During expansion to RTL, parm 'a' is first stored to memory from the
argument registers, and "a.a[0]" is then read back from memory.  It may
be better to use "f1" for "a.a[0]" here.

Maybe method 3 is similar to your idea: use "parallel:BLK {DF;DF;DF;DF}"
for the struct (BLK may be changed), and use 4 DF registers to access
the structure in the expand pass.


Thanks again for your kind and helpful comments!

BR,
Jeff(Jiufu)

>
>
> Segher


Re: [RFC] propgation leap over memory copy for struct

2022-10-31 Thread Jiufu Guo via Gcc-patches
Jeff Law  writes:

> On 10/30/22 20:42, Jiufu Guo via Gcc-patches wrote:
>> Hi,
>>
>> We know that for struct variable assignment, memory copy may be used.
>> And for memcpy, we may load and store as many bytes as possible at one time.
>> While it may not be best here:
>> 1. Before/after a struct variable assignment, the variable may be operated
>> on, and it is hard for some optimizations to leap over the memcpy.  Then
>> some struct operations may be sub-optimal, like the issue in PR65421.
>> 2. The size of a struct is mostly constant, so the memcpy would be expanded.
>> Using a small size to load/store and executing in parallel may not be slower
>> than using a large size to load/store.  (Sure, more registers may be used
>> for smaller bytes.)
>>
>>
>> In PR65421, For source code as below:
>> t.c
>> #define FN 4
>> typedef struct { double a[FN]; } A;
>>
>> A foo (const A *a) { return *a; }
>> A bar (const A a) { return a; }
>
> So the first question in my mind is can we do better at the gimple
> phase?  For the second case in particular can't we just "return a"
> rather than copying a into <retval> then returning <retval>?  This
> feels a lot like the return value optimization from C++.  I'm not sure
> if it applies to the first case or not, it's been a long time since I
> looked at NRV optimizations, but it might be worth poking around in
> there a bit (tree-nrv.cc).
Thanks for pointing out this idea!

Currently the optimized gimple looks like:
  D.3334 = a;
  return D.3334;

and
  D.3336 = *a_2(D);
  return D.3336;

It may be better to have:
"return a;" and "return *a;"
-

If the code looks like:
typedef struct { double a[3]; long l;} A; //mix types
A foo (const A a) { return a; }
A bar (const A *a) { return *a; }

The current optimized gimple looks like:
  <retval> = a;
  return <retval>;
and
  <retval> = *a_2(D);
  return <retval>;

"return a;" and "return *a;" may be works here too.
>
>
> But even so, these kinds of things are still bound to happen, so it's
> probably worth thinking about if we can do better in RTL as well. 
>
Yeap, thanks!
>
> The first thing that comes to my mind is to annotate memcpy calls that
> are structure assignments.  The idea here is that we may want to
> expand a memcpy differently in those cases.   Changing how we expand
> an opaque memcpy call is unlikely to be beneficial in most cases.  But
> changing how we expand a structure copy may be beneficial by exposing
> the underlying field values.   This would roughly correspond to your
> method #1.
Right.  For a general memcpy, we would read/write larger chunks at one
time.  Reading/writing small fields may only be beneficial for structure
assignment.
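
To state the contrast in code (just an illustration, not from the patch):

typedef struct { double a[4]; } A;

/* A structure assignment: safe to expose as a field-wise copy.  */
void g1 (A *d, const A *s) { *d = *s; }

/* An opaque memcpy call: changing its expansion is unlikely to help.  */
void g2 (void *d, const void *s) { __builtin_memcpy (d, s, sizeof (A)); }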

>
> Or instead of changing how we expand, teach the optimizers about these
> annotated memcpy calls -- they're just a copy of each field.  
> That's how CSE and the propagators could treat them. After some point
> we'd lower them in the usual ways, but at least early in the RTL
> pipeline we could keep them as annotated memcpy calls.  This roughly
> corresponds to your second suggestion.
Thanks for your insights on this idea!  Use annotated memcpy for early
optimizations, and treat it as a general memcpy in later passes.


Thanks again for your very helpful comments and suggestions!

BR,
Jeff(Jiufu)

>
>
> jeff


Re: [RFC] propgation leap over memory copy for struct

2022-10-31 Thread Jiufu Guo via Gcc-patches
Segher Boessenkool  writes:

> Hi!
>
> On Mon, Oct 31, 2022 at 10:42:35AM +0800, Jiufu Guo wrote:
>> #define FN 4
>> typedef struct { double a[FN]; } A;
>> 
>> A foo (const A *a) { return *a; }
>> A bar (const A a) { return a; }
>> ///
>> 
>> If FN<=2; the size of "A" fits into TImode, then this code can be optimized 
>> (by subreg/cse/fwprop/cprop) as:
>> ---
>> foo:
>> .LFB0:
>> .cfi_startproc
>> blr
>> 
>> bar:
>> .LFB1:
>>  .cfi_startproc
>>  lfd 2,8(3)
>>  lfd 1,0(3)
>>  blr
>> 
>
> I think you swapped foo and bar here?
Oh, thanks!
>
>> If the size of "A" is larger than any INT mode size, RTL insns would be 
>> generated as:
>>13: r125:V2DI=[r112:DI+0x20]
>>14: r126:V2DI=[r112:DI+0x30]
>>15: [r112:DI]=r125:V2DI
>>16: [r112:DI+0x10]=r126:V2DI  /// memcpy for assignment: D.3338 = arg;
>>17: r127:DF=[r112:DI]
>>18: r128:DF=[r112:DI+0x8]
>>19: r129:DF=[r112:DI+0x10]
>>20: r130:DF=[r112:DI+0x18]
>> 
>> 
>> I'm thinking about ways to improve this.
>> Method1: One way may be changing the memory copy by referencing the type 
>> of the struct if the size of struct is not too big. And generate insns 
>> like the below:
>>13: r125:DF=[r112:DI+0x20]
>>15: r126:DF=[r112:DI+0x28]
>>17: r127:DF=[r112:DI+0x30]
>>19: r128:DF=[r112:DI+0x38]
>>14: [r112:DI]=r125:DF
>>16: [r112:DI+0x8]=r126:DF
>>18: [r112:DI+0x10]=r127:DF
>>20: [r112:DI+0x18]=r128:DF
>>21: r129:DF=[r112:DI]
>>22: r130:DF=[r112:DI+0x8]
>>23: r131:DF=[r112:DI+0x10]
>>24: r132:DF=[r112:DI+0x18]
>
> This is much worse though?  The expansion with memcpy used V2DI, which
> typically is close to 2x faster than DFmode accesses.
Using V2DI helps to access twice as many bytes at one time as DF/DI.
But since those loads can execute in parallel, using DF/DI would not be
too bad.

>
> Or are you trying to avoid small reads of large stores here?  Those
> aren't so bad, large reads of small stores is the nastiness we need to
> avoid.
Here, I want to use two DF loads, because the cse/fwprop/dse
optimizations could eliminate those memory accesses of the same size.
>
> The code we have now does
>
>15: [r112:DI]=r125:V2DI
> ...
>17: r127:DF=[r112:DI]
>18: r128:DF=[r112:DI+0x8]
>
> Can you make this optimised to not use a memory temporary at all, just
> immediately assign from r125 to r127 and r128?
r125 is not directly assigned to r127/r128, since 'insn 15' and 'insns
17/18' are expanded from different gimple stmts:
  D.3331 = a;  ==> 'insn 15' is generated for the struct assignment here.
  return D.3331; ==> 'insns 17/18' are prepared for the return registers.

I'm trying to eliminate those memory temporaries, and did not find a
good way yet.  Methods 1-3 are the ideas which I'm trying to use to
delete those temporaries.

>
>> Method2: One way may be enhancing CSE to make it able to treat one large
>> memory slot as two(or more) combined slots: 
>>13: r125:V2DI#0=[r112:DI+0x20]
>>13': r125:V2DI#8=[r112:DI+0x28]
>>15: [r112:DI]#0=r125:V2DI#0
>>15': [r112:DI]#8=r125:V2DI#8
>> 
>> This may seems more hack in CSE.
>
> The current CSE pass is the pass most in need of a full rewrite that we
> have, and has been for many, many years.  It does a lot of things,
> important things that we should not lose, but it does a pretty bad job
> of CSE.
>
>> Method3: For some record type, use "PARALLEL:BLK" instead "MEM:BLK".
>
> :BLK can never be optimised well.  It always has to live in memory, by
> definition.

Thanks for your suggestions!

BR,
Jeff (Jiufu)
>
>
> Segher


Re: [wwwdocs] [GCC13] Mention Intel __bf16 support in AVX512BF16 intrinsics.

2022-10-31 Thread Hongtao Liu via Gcc-patches
On Tue, Nov 1, 2022 at 9:21 AM Kong, Lingling via Gcc-patches
 wrote:
>
> Hi
>
> The patch mentions Intel __bf16 support in AVX512BF16 intrinsics.
> Ok for master ?
>
> Thanks,
> Lingling
>
> ---
>  htdocs/gcc-13/changes.html | 2 ++
>  1 file changed, 2 insertions(+)
>
> diff --git a/htdocs/gcc-13/changes.html b/htdocs/gcc-13/changes.html index 
> 7c6bfa6e..cd0282f1 100644
> --- a/htdocs/gcc-13/changes.html
> +++ b/htdocs/gcc-13/changes.html
> @@ -230,6 +230,8 @@ a work-in-progress.
>For both C and C++ the __bf16 type is supported on
>x86 systems with SSE2 and above enabled.
>
> +  Use __bf16 type for AVX512BF16 intrinsics.
Could you add more explanation?  For example: originally it was ..., now
it is ..., and what's the difference when users compile the same source
code (which contains AVX512BF16 intrinsics) with GCC 12 (and before) and
GCC 13.
> +  
>  
>
>  
> --
> 2.18.2
>


-- 
BR,
Hongtao


[pushed] c++: set TREE_NOTHROW after genericize

2022-10-31 Thread Jason Merrill via Gcc-patches
Tested x86_64-pc-linux-gnu, applying to trunk.

-- >8 --

genericize might introduce function calls (and does on the contracts
branch), so it's safer to set this flag later.

gcc/cp/ChangeLog:

* decl.cc (finish_function): Set TREE_NOTHROW later in the function.
---
 gcc/cp/decl.cc | 16 
 1 file changed, 8 insertions(+), 8 deletions(-)

diff --git a/gcc/cp/decl.cc b/gcc/cp/decl.cc
index 87cb7a6c3a4..6e98ea35a39 100644
--- a/gcc/cp/decl.cc
+++ b/gcc/cp/decl.cc
@@ -17867,14 +17867,6 @@ finish_function (bool inline_p)
 
   finish_fname_decls ();
 
-  /* If this function can't throw any exceptions, remember that.  */
-  if (!processing_template_decl
-  && !cp_function_chain->can_throw
-  && !flag_non_call_exceptions
-  && !decl_replaceable_p (fndecl,
- opt_for_fn (fndecl, flag_semantic_interposition)))
-TREE_NOTHROW (fndecl) = 1;
-
   /* This must come after expand_function_end because cleanups might
  have declarations (from inline functions) that need to go into
  this function's blocks.  */
@@ -18099,6 +18091,14 @@ finish_function (bool inline_p)
   && !DECL_OMP_DECLARE_REDUCTION_P (fndecl))
 cp_genericize (fndecl);
 
+  /* If this function can't throw any exceptions, remember that.  */
+  if (!processing_template_decl
+  && !cp_function_chain->can_throw
+  && !flag_non_call_exceptions
+  && !decl_replaceable_p (fndecl,
+ opt_for_fn (fndecl, flag_semantic_interposition)))
+TREE_NOTHROW (fndecl) = 1;
+
   /* Emit the resumer and destroyer functions now, providing that we have
  not encountered some fatal error.  */
   if (coro_emit_helpers)

base-commit: 6a1f27f45e44bcfbcc06a1aad74bb076e56eda36
-- 
2.31.1



[wwwdocs] [GCC13] Mention Intel __bf16 support in AVX512BF16 intrinsics.

2022-10-31 Thread Kong, Lingling via Gcc-patches
Hi

The patch mentions Intel __bf16 support in AVX512BF16 intrinsics.
Ok for master ?

Thanks,
Lingling

---
 htdocs/gcc-13/changes.html | 2 ++
 1 file changed, 2 insertions(+)

diff --git a/htdocs/gcc-13/changes.html b/htdocs/gcc-13/changes.html index 
7c6bfa6e..cd0282f1 100644
--- a/htdocs/gcc-13/changes.html
+++ b/htdocs/gcc-13/changes.html
@@ -230,6 +230,8 @@ a work-in-progress.
   For both C and C++ the __bf16 type is supported on
   x86 systems with SSE2 and above enabled.
   
+  Use __bf16 type for AVX512BF16 intrinsics.
+  
 
 
 
--
2.18.2



Re: [RFC] propgation leap over memory copy for struct

2022-10-31 Thread Segher Boessenkool
On Mon, Oct 31, 2022 at 04:13:38PM -0600, Jeff Law wrote:
> On 10/30/22 20:42, Jiufu Guo via Gcc-patches wrote:
> >We know that for struct variable assignment, memory copy may be used.
> >And for memcpy, we may load and store as many bytes as possible at one time.
> >While it may not be best here:

> So the first question in my mind is can we do better at the gimple 
> phase?  For the second case in particular can't we just "return a" 
> rather than copying a into <retval> then returning <retval>?  This feels 
> a lot like the return value optimization from C++.  I'm not sure if it 
> applies to the first case or not, it's been a long time since I looked 
> at NRV optimizations, but it might be worth poking around in there a bit 
> (tree-nrv.cc).

If it is a bigger struct you end up with quite a lot of stuff in
registers.  GCC will eventually put that all in memory so it will work
out fine in the end, but you are likely to get inefficient code.

OTOH, 8 bytes isn't as big as we would want these days, is it?  So it
would be useful to put smaller temporaries, say 32 bytes and smaller,
in registers instead of in memory.

> But even so, these kinds of things are still bound to happen, so it's 
> probably worth thinking about if we can do better in RTL as well.

Always.  It is a mistake to think that having better high-level
optimisations means that you don't need good low-level optimisations
anymore: in fact deficiencies there become more glaringly apparent if
the early pipeline opts become better :-)

> The first thing that comes to my mind is to annotate memcpy calls that 
> are structure assignments.  The idea here is that we may want to expand 
> a memcpy differently in those cases.   Changing how we expand an opaque 
> memcpy call is unlikely to be beneficial in most cases.  But changing 
> how we expand a structure copy may be beneficial by exposing the 
> underlying field values.   This would roughly correspond to your method 
> #1.
> 
> Or instead of changing how we expand, teach the optimizers about these 
> annotated memcpy calls -- they're just a copy of each field.   That's 
> how CSE and the propagators could treat them. After some point we'd 
> lower them in the usual ways, but at least early in the RTL pipeline we 
> could keep them as annotated memcpy calls.  This roughly corresponds to 
> your second suggestion.

Ideally this won't ever make it as far as RTL, if the structures do not
need to go via memory.  All high-level optimisations should have been
done earlier, and hopefully it was not expand itself that forced stuff
into memory!  :-/


Segher


Re: [RFC] propgation leap over memory copy for struct

2022-10-31 Thread Segher Boessenkool
Hi!

On Mon, Oct 31, 2022 at 10:42:35AM +0800, Jiufu Guo wrote:
> #define FN 4
> typedef struct { double a[FN]; } A;
> 
> A foo (const A *a) { return *a; }
> A bar (const A a) { return a; }
> ///
> 
> If FN<=2; the size of "A" fits into TImode, then this code can be optimized 
> (by subreg/cse/fwprop/cprop) as:
> ---
> foo:
> .LFB0:
> .cfi_startproc
> blr
> 
> bar:
> .LFB1:
>   .cfi_startproc
>   lfd 2,8(3)
>   lfd 1,0(3)
>   blr
> 

I think you swapped foo and bar here?

> If the size of "A" is larger than any INT mode size, RTL insns would be 
> generated as:
>13: r125:V2DI=[r112:DI+0x20]
>14: r126:V2DI=[r112:DI+0x30]
>15: [r112:DI]=r125:V2DI
>16: [r112:DI+0x10]=r126:V2DI  /// memcpy for assignment: D.3338 = arg;
>17: r127:DF=[r112:DI]
>18: r128:DF=[r112:DI+0x8]
>19: r129:DF=[r112:DI+0x10]
>20: r130:DF=[r112:DI+0x18]
> 
> 
> I'm thinking about ways to improve this.
> Method1: One way may be changing the memory copy by referencing the type 
> of the struct if the size of struct is not too big. And generate insns 
> like the below:
>13: r125:DF=[r112:DI+0x20]
>15: r126:DF=[r112:DI+0x28]
>17: r127:DF=[r112:DI+0x30]
>19: r128:DF=[r112:DI+0x38]
>14: [r112:DI]=r125:DF
>16: [r112:DI+0x8]=r126:DF
>18: [r112:DI+0x10]=r127:DF
>20: [r112:DI+0x18]=r128:DF
>21: r129:DF=[r112:DI]
>22: r130:DF=[r112:DI+0x8]
>23: r131:DF=[r112:DI+0x10]
>24: r132:DF=[r112:DI+0x18]

This is much worse though?  The expansion with memcpy used V2DI, which
typically is close to 2x faster than DFmode accesses.

Or are you trying to avoid small reads of large stores here?  Those
aren't so bad, large reads of small stores is the nastiness we need to
avoid.

The code we have now does

   15: [r112:DI]=r125:V2DI
...
   17: r127:DF=[r112:DI]
   18: r128:DF=[r112:DI+0x8]

Can you make this optimised to not use a memory temporary at all, just
immediately assign from r125 to r127 and r128?

> Method2: One way may be enhancing CSE to make it able to treat one large
> memory slot as two(or more) combined slots: 
>13: r125:V2DI#0=[r112:DI+0x20]
>13': r125:V2DI#8=[r112:DI+0x28]
>15: [r112:DI]#0=r125:V2DI#0
>15': [r112:DI]#8=r125:V2DI#8
> 
> This may seems more hack in CSE.

The current CSE pass is the pass most in need of a full rewrite that we
have, and has been for many, many years.  It does a lot of things,
important things that we should not lose, but it does a pretty bad job
of CSE.

> Method3: For some record type, use "PARALLEL:BLK" instead "MEM:BLK".

:BLK can never be optimised well.  It always has to live in memory, by
definition.


Segher


Re: Re: [PATCH] RISC-V: Fix RVV testcases.

2022-10-31 Thread 钟居哲
These cases actually don't care about -mabi; they just need 'v' in -march.
Can you tell me how to fix these testcases for "fails on targets without
ilp32d"?
These failures are bogus: if you specify -mabi=ilp32d while using a GNU
toolchain which is built with, let's say, "--arch=ilp32", it will fail and
report that there is no "ilp32d".  So I fixed these testcases by replacing
"ilp32d" with "ilp32".
Thank you.
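
For illustration, the kind of test header being changed looks roughly like
this (a made-up example, not one of the actual files):

/* { dg-do compile } */
/* { dg-options "-march=rv32gcv -mabi=ilp32 -O2" } */  /* was -mabi=ilp32d */

#include "riscv_vector.h"

void
f (void)
{
  vint32m1_t v;  /* only needs 'v' in -march, not a particular -mabi */
  (void) v;
}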



juzhe.zh...@rivai.ai
 
From: Palmer Dabbelt
Date: 2022-11-01 06:30
To: gcc-patches
CC: juzhe.zhong; gcc-patches; schwab; Kito Cheng
Subject: Re: [PATCH] RISC-V: Fix RVV testcases.
On Mon, 31 Oct 2022 15:00:49 PDT (-0700), gcc-patches@gcc.gnu.org wrote:
>
> On 10/30/22 19:40, juzhe.zh...@rivai.ai wrote:
>> From: Ju-Zhe Zhong 
>>
>> gcc/testsuite/ChangeLog:
>>
>>  * gcc.target/riscv/rvv/base/abi-2.c: Change ilp32d to ilp32.
>>  * gcc.target/riscv/rvv/base/abi-3.c: Ditto.
>>  * gcc.target/riscv/rvv/base/abi-4.c: Ditto.
>>  * gcc.target/riscv/rvv/base/abi-5.c: Ditto.
>>  * gcc.target/riscv/rvv/base/abi-6.c: Ditto.
>>  * gcc.target/riscv/rvv/base/abi-7.c: Ditto.
>>  * gcc.target/riscv/rvv/base/mov-1.c: Ditto.
>>  * gcc.target/riscv/rvv/base/mov-10.c: Ditto.
>>  * gcc.target/riscv/rvv/base/mov-11.c: Ditto.
>>  * gcc.target/riscv/rvv/base/mov-12.c: Ditto.
>>  * gcc.target/riscv/rvv/base/mov-13.c: Ditto.
>>  * gcc.target/riscv/rvv/base/mov-2.c: Ditto.
>>  * gcc.target/riscv/rvv/base/mov-3.c: Ditto.
>>  * gcc.target/riscv/rvv/base/mov-4.c: Ditto.
>>  * gcc.target/riscv/rvv/base/mov-5.c: Ditto.
>>  * gcc.target/riscv/rvv/base/mov-6.c: Ditto.
>>  * gcc.target/riscv/rvv/base/mov-7.c: Ditto.
>>  * gcc.target/riscv/rvv/base/mov-8.c: Ditto.
>>  * gcc.target/riscv/rvv/base/mov-9.c: Ditto.
>>  * gcc.target/riscv/rvv/base/pragma-1.c: Ditto.
>>  * gcc.target/riscv/rvv/base/user-1.c: Ditto.
>>  * gcc.target/riscv/rvv/base/user-2.c: Ditto.
>>  * gcc.target/riscv/rvv/base/user-3.c: Ditto.
>>  * gcc.target/riscv/rvv/base/user-4.c: Ditto.
>>  * gcc.target/riscv/rvv/base/user-5.c: Ditto.
>>  * gcc.target/riscv/rvv/base/user-6.c: Ditto.
>>  * gcc.target/riscv/rvv/base/vsetvl-1.c: Ditto.
>
> I'm pretty new to the RISC-V world, but don't some of the cases
> (particularly the abi-* tests) verify that the ABI specification does
> not override the arch specification WRT availability of types?
 
I think that depends on what the ABI specification says here, as it 
could really go many ways.  Most of the RISC-V targets just use -mabi to 
control how arguments end up passed in functions, not the availability 
of types.  I can't find the ABI spec for these, though, so I'm not 
entirely sure how they're supposed to work...
 
That said, I'm not sure why we need any of these -mabi changes?  Just 
from spot checking some of the examples it doesn't look like there 
should be any functional difference between ilp32 and ilp32d here: 
-march is always specified so ilp32d looks valid.  If this is just to 
fix the "fails on targets without ilp32d" [1], then IMO it's not really 
a fix: we're essentially just changing that to "fails on targets without 
ilp32", we either need some sort of automatic march/mabi setting or a 
dependency on the available multilibs.  Some of these can probably 
avoid linking, but we'll have execution tests at some point.
 
1: https://gcc.gnu.org/pipermail/gcc-patches/2022-October/604644.html
 


Re: [committed] More gimple const/copy propagation opportunities

2022-10-31 Thread Jeff Law via Gcc-patches



On 10/1/22 12:55, Bernhard Reutner-Fischer wrote:

On Fri, 30 Sep 2022 17:32:34 -0600
Jeff Law  wrote:


+  /* This looks good from a CFG standpoint.  Now look at the guts
+ of PRED.  Basically we want to verify there are no PHI nodes
+ and no real statements.  */
+  if (! gimple_seq_empty_p (phi_nodes (pred)))
+return false;

So, given the below, neither DEBUG nor labels do count towards an
empty seq [coming in from any PHI that is, otherwise it's a different
thing], which is a bit surprising but well, ok. It looks at PHI IL, so
probably yes. Allegedly that's what it is. Neat if that's true.


Right.  A forwarder block is allowed to have local labels, but not 
labels for nonlocal gotos, so we can and should ignore local labels.  
Debug statements must be ignored or you can end up with differing code
generation based on whether or not -g is enabled.


In some contexts PHIs are allowed in forwarders, in other contexts they 
are not.  In this specific case I doubt it matters because of the 
restrictions we put on the CFG, the predecessor block is restricted to a 
single incoming edge.  The only way that'll have a PHI is if the PHI 
became a degenerate during DOM.








+
+  gimple_stmt_iterator gsi;
+  for (gsi = gsi_last_bb (pred); !gsi_end_p (gsi); gsi_prev ())
+{
+  gimple *stmt = gsi_stmt (gsi);
+
+  switch (gimple_code (stmt))
+   {
+ case GIMPLE_LABEL:
+   if (DECL_NONLOCAL (gimple_label_label (as_a <glabel *> (stmt))))
+ return false;
+   break;
+
+ case GIMPLE_DEBUG:
+   break;
+
+ default:
+   return false;

don't like, sounds odd. Are we sure there's no other garbage that can
manifest here? int meow=42;, and meow unused won't survive?, pragmas
neither or stuff ?


That would generate a real statement and would thus be rejected. If 
there's anything other than a local label or debug statements in the 
block, then it's rejected.





@@ -583,6 +656,62 @@ record_edge_info (basic_block bb)
if (can_infer_simple_equiv && TREE_CODE (inverted) == EQ_EXPR)
edge_info->record_simple_equiv (op0, op1);
  }
+
+ /* If this block is a single block loop, then we may be able to
+record some equivalences on the loop's exit edge.  */
+ if (single_block_loop_p (bb))
+   {
+ /* We know it's a single block loop.  Now look at the loop
+exit condition.  What we're looking for is whether or not
+the exit condition is loop invariant which we can detect
+by checking if all the SSA_NAMEs referenced are defined
+outside the loop.  */
+ if ((TREE_CODE (op0) != SSA_NAME
+  || gimple_bb (SSA_NAME_DEF_STMT (op0)) != bb)
+ && (TREE_CODE (op1) != SSA_NAME
+ || gimple_bb (SSA_NAME_DEF_STMT (op1)) != bb))
+   {
+ /* At this point we know the exit condition is loop
+invariant.  The only way to get out of the loop is
+if never traverses the backedge to begin with.  This

s/if /if it /


Will fix.  Thanks.





+implies that any PHI nodes create equivalances we can

"that any" threw me off asking for "that if any". Would have been nicer,
I think?


All PHIs at the target of the loop backedge in this case create an 
equivalence.  That's the whole point of the patch, to prove a set of 
circumstances that ultimately require all the PHIs on the loop backedge 
to create an equivalence on the loop exit.
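
To make the scenario concrete, a sketch of the kind of single-block loop in
question (illustrative only; whether this exact shape is handled is my
assumption, not something taken from the patch):

/* The exit test only uses 'flag', which is defined outside the loop, so
   reaching the exit edge means the backedge was never traversed; the PHI
   for 'i' therefore equals its preheader value (0) on that edge, and the
   exit edge also implies flag == 0.  */
int
f (int flag)
{
  int i = 0;
  do
    i++;
  while (flag);
  return i;
}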







+attach to the loop exit edge.  */

attach it to


Updated, slightly differently, but should be clearer.





+ int alternative

bool


Sure.






+   = (EDGE_PRED (bb, 0)->flags & EDGE_DFS_BACK) ? 1 : 0;
+
+ gphi_iterator gsi;
+ for (gsi = gsi_start_phis (bb);
+  !gsi_end_p (gsi);
+  gsi_next ())
+   {
+ /* If the other alternative is the same as the result,
+then this is a degenerate and can be ignored.  */
+ if (dst == PHI_ARG_DEF (phi, !alternative))
+   continue;
+
+ /* Now get the EDGE_INFO class so we can append
+it to our list.  We want the successor edge
+where the destination is not the source of
+an incoming edge.  */
+ gphi *phi = gsi.phi ();
+ tree src = PHI_ARG_DEF (phi, alternative);
+ tree dst = PHI_RESULT (phi);
+
+ if (EDGE_SUCC (bb, 0)->dest
+ != EDGE_PRED (bb, !alternative)->src)

by now, alternative would be easier to grok if it would have been spelled
from_backedge_p or something. IMHO.


Agreed it's a bit 

Re: [PATCH] Add __builtin_iseqsig()

2022-10-31 Thread Joseph Myers
On Mon, 31 Oct 2022, FX via Gcc-patches wrote:

> - rounded conversions: converting, from an integer or floating point 
> type, into another floating point type, with specific rounding mode 
> passed as argument

These don't have standard C names.  The way to do these in C would be 
using the FENV_ROUND pragma around a conversion, but we don't support any 
of the standard pragmas including those from C99 and they would be a large 
project (cf. Marc Glisse's -ffenv-access patches from August 2020 - 
although some things in FENV_ACCESS are probably rather orthogonal to 
FENV_ROUND, I expect what's required in terms of preventing unwanted code 
movement across rounding mode changes is similar).
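
For concreteness, the C2X form would look like this (a sketch of the
standard pragma only -- as noted above, GCC does not implement it):

#include <fenv.h>

float
cvt_upward (long long x)
{
#pragma STDC FENV_ROUND FE_UPWARD
  return (float) x;  /* conversion performed with upward rounding */
}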

It might be possible to add built-in functions for such conversions 
without needing the FENV_ROUND machinery, if you make them expand to insn 
patterns (with temporary rounding mode changes) that are arranged so the 
compiler can't split them up.

(There's a principle of not introducing libm dependencies in code not 
using any ,  or  functions or corresponding 
built-in functions, which would be an issue for generating calls to 
fesetround, inline or in libgcc, from such an operation.  But arguably, 
even if FENV_ROUND shouldn't introduce such dependencies - my assumption 
being that FENV_ROUND should involve target-specific inline 
implementations of the required rounding mode changes - it might be OK to 
document some GCC-specific built-in function as doing so.)

> - conversion to integer: converting, from a floating point type, into an 
> integer type, with specific rounding mode passed as argument

See the fromfp functions (signed and unsigned versions, versions with and 
without raising inexact, rounding mode specified as one of the FP_INT_* 
macros from ).  The versions in TS 18661-1 produced results in an 
integer type (intmax_t / uintmax_t, with the actual width for the 
conversion passed as an argument).  *But* C2X changed that to return the 
result in a floating-point type instead (as part of reducing the use of 
intmax_t in interfaces) - I haven't yet implemented that change in glibc.  
So the way to do such a conversion to an integer type in C2X involves 
calling such a function and then converting its result to that integer 
type.
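
As a concrete illustration using the TS 18661-1 interfaces as they exist in
glibc today (i.e. not reflecting the C2X return-type change described
above):

#define _GNU_SOURCE
#include <limits.h>
#include <math.h>

/* Round X downward and convert, without raising "inexact"
   (fromfpx is the inexact-raising variant).  */
long
to_long_downward (double x)
{
  return (long) fromfp (x, FP_INT_DOWNWARD, CHAR_BIT * sizeof (long));
}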

GCC certainly knows about handling such a pair ( function, 
conversion of its result) as a built-in function (e.g. __builtin_iceil).  
My guess is that in most cases only selected calls would be expanded 
inline - calls where not only is there an appropriate conversion to an 
integer type (matching, or maybe wider than, the width passed to the 
function), but where also the function, rounding mode and width together 
match an operation for which there is a hardware instruction, with other 
cases (including ones where the rounding mode or width aren't constant) 
ending up as runtime calls (unless optimized for constant arguments).  So 
while the interfaces exist in C2X, the built-in function support in GCC 
may be fairly complicated, with the existence of the older TS 18661-1 
version of the functions complicating things further.

> - IEEE operations corresponding to nextDown and nextUp (or are those 
> already available? I have not checked the fine print)

nextdown and nextup have been in glibc since version 2.24.  I expect that 
adding built-in functions that optimize them for constant arguments would 
be straightforward (that doesn't help if what you actually want is some 
way to support those operations at runtime for targets without the 
functions in libm, of course).
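
(For completeness, a minimal use of those functions as provided by glibc:)

#define _GNU_SOURCE
#include <math.h>

/* Next representable values toward +inf and -inf respectively.  */
double hi_neighbor (double x) { return nextup (x); }
double lo_neighbor (double x) { return nextdown (x); }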

-- 
Joseph S. Myers
jos...@codesourcery.com


Re: [PATCH] RISC-V: Fix RVV testcases.

2022-10-31 Thread Palmer Dabbelt

On Mon, 31 Oct 2022 15:00:49 PDT (-0700), gcc-patches@gcc.gnu.org wrote:


On 10/30/22 19:40, juzhe.zh...@rivai.ai wrote:

From: Ju-Zhe Zhong 

gcc/testsuite/ChangeLog:

 * gcc.target/riscv/rvv/base/abi-2.c: Change ilp32d to ilp32.
 * gcc.target/riscv/rvv/base/abi-3.c: Ditto.
 * gcc.target/riscv/rvv/base/abi-4.c: Ditto.
 * gcc.target/riscv/rvv/base/abi-5.c: Ditto.
 * gcc.target/riscv/rvv/base/abi-6.c: Ditto.
 * gcc.target/riscv/rvv/base/abi-7.c: Ditto.
 * gcc.target/riscv/rvv/base/mov-1.c: Ditto.
 * gcc.target/riscv/rvv/base/mov-10.c: Ditto.
 * gcc.target/riscv/rvv/base/mov-11.c: Ditto.
 * gcc.target/riscv/rvv/base/mov-12.c: Ditto.
 * gcc.target/riscv/rvv/base/mov-13.c: Ditto.
 * gcc.target/riscv/rvv/base/mov-2.c: Ditto.
 * gcc.target/riscv/rvv/base/mov-3.c: Ditto.
 * gcc.target/riscv/rvv/base/mov-4.c: Ditto.
 * gcc.target/riscv/rvv/base/mov-5.c: Ditto.
 * gcc.target/riscv/rvv/base/mov-6.c: Ditto.
 * gcc.target/riscv/rvv/base/mov-7.c: Ditto.
 * gcc.target/riscv/rvv/base/mov-8.c: Ditto.
 * gcc.target/riscv/rvv/base/mov-9.c: Ditto.
 * gcc.target/riscv/rvv/base/pragma-1.c: Ditto.
 * gcc.target/riscv/rvv/base/user-1.c: Ditto.
 * gcc.target/riscv/rvv/base/user-2.c: Ditto.
 * gcc.target/riscv/rvv/base/user-3.c: Ditto.
 * gcc.target/riscv/rvv/base/user-4.c: Ditto.
 * gcc.target/riscv/rvv/base/user-5.c: Ditto.
 * gcc.target/riscv/rvv/base/user-6.c: Ditto.
 * gcc.target/riscv/rvv/base/vsetvl-1.c: Ditto.


I'm pretty new to the RISC-V world, but don't some of the cases
(particularly the abi-* tests) verify that the ABI specification does
not override the arch specification WRT availability of types?


I think that depends on what the ABI specification says here, as it 
could really go many ways.  Most of the RISC-V targets just use -mabi to 
control how arguments end up passed in functions, not the availability 
of types.  I can't find the ABI spec for these, though, so I'm not 
entirely sure how they're supposed to work...


That said, I'm not sure why we need any of these -mabi changes?  Just 
from spot checking some of the examples it doesn't look like there 
should be any functional difference between ilp32 and ilp32d here: 
-march is always specified so ilp32d looks valid.  If this is just to 
fix the "fails on targets without ilp32d" [1], then IMO it's not really 
a fix: we're essentially just changing that to "fails on targets without 
ilp32", we either need some sort of automatic march/mabi setting or a 
dependency on the available multilibs.  Some of these can probably 
avoid linking, but we'll have execution tests at some point.


1: https://gcc.gnu.org/pipermail/gcc-patches/2022-October/604644.html


Re: Re: [PATCH] RISC-V: Fix RVV testcases.

2022-10-31 Thread 钟居哲
These testcases do not depend on the ABI specification.
I picked the minimum ABI setting so that they won't fail.
The naming of the abi-* tests may be confusing; I can change the naming
next time.


juzhe.zh...@rivai.ai
 
From: Jeff Law
Date: 2022-11-01 06:00
To: juzhe.zhong; gcc-patches
CC: schwab; kito.cheng
Subject: Re: [PATCH] RISC-V: Fix RVV testcases.
 
On 10/30/22 19:40, juzhe.zh...@rivai.ai wrote:
> From: Ju-Zhe Zhong 
>
> gcc/testsuite/ChangeLog:
>
>  * gcc.target/riscv/rvv/base/abi-2.c: Change ilp32d to ilp32.
>  * gcc.target/riscv/rvv/base/abi-3.c: Ditto.
>  * gcc.target/riscv/rvv/base/abi-4.c: Ditto.
>  * gcc.target/riscv/rvv/base/abi-5.c: Ditto.
>  * gcc.target/riscv/rvv/base/abi-6.c: Ditto.
>  * gcc.target/riscv/rvv/base/abi-7.c: Ditto.
>  * gcc.target/riscv/rvv/base/mov-1.c: Ditto.
>  * gcc.target/riscv/rvv/base/mov-10.c: Ditto.
>  * gcc.target/riscv/rvv/base/mov-11.c: Ditto.
>  * gcc.target/riscv/rvv/base/mov-12.c: Ditto.
>  * gcc.target/riscv/rvv/base/mov-13.c: Ditto.
>  * gcc.target/riscv/rvv/base/mov-2.c: Ditto.
>  * gcc.target/riscv/rvv/base/mov-3.c: Ditto.
>  * gcc.target/riscv/rvv/base/mov-4.c: Ditto.
>  * gcc.target/riscv/rvv/base/mov-5.c: Ditto.
>  * gcc.target/riscv/rvv/base/mov-6.c: Ditto.
>  * gcc.target/riscv/rvv/base/mov-7.c: Ditto.
>  * gcc.target/riscv/rvv/base/mov-8.c: Ditto.
>  * gcc.target/riscv/rvv/base/mov-9.c: Ditto.
>  * gcc.target/riscv/rvv/base/pragma-1.c: Ditto.
>  * gcc.target/riscv/rvv/base/user-1.c: Ditto.
>  * gcc.target/riscv/rvv/base/user-2.c: Ditto.
>  * gcc.target/riscv/rvv/base/user-3.c: Ditto.
>  * gcc.target/riscv/rvv/base/user-4.c: Ditto.
>  * gcc.target/riscv/rvv/base/user-5.c: Ditto.
>  * gcc.target/riscv/rvv/base/user-6.c: Ditto.
>  * gcc.target/riscv/rvv/base/vsetvl-1.c: Ditto.
 
I'm pretty new to the RISC-V world, but don't some of the cases 
(particularly the abi-* tests) verify that the ABI specification does 
not override the arch specification WRT availability of types?
 
 
Jeff
 


Re: [PATCH v5] RISC-V: Libitm add RISC-V support.

2022-10-31 Thread Jeff Law via Gcc-patches



On 10/29/22 03:01, Xiongchuan Tan wrote:

Reviewed-by: Palmer Dabbelt
Acked-by: Palmer Dabbelt

libitm/ChangeLog:

 * configure.tgt: Add riscv support.
 * config/riscv/asm.h: New file.
 * config/riscv/sjlj.S: New file.
 * config/riscv/target.h: New file.


Pushed to the trunk.

jeff



Re: [RFC] propgation leap over memory copy for struct

2022-10-31 Thread Jeff Law via Gcc-patches



On 10/30/22 20:42, Jiufu Guo via Gcc-patches wrote:

Hi,

We know that for struct variable assignment, memory copy may be used.
And for memcpy, we may load and store as many bytes as possible at one time.
While it may not be best here:
1. Before/after a struct variable assignment, the variable may be operated on,
and it is hard for some optimizations to leap over the memcpy.  Then some
struct operations may be sub-optimal, like the issue in PR65421.
2. The size of a struct is mostly constant, so the memcpy would be expanded.
Using a small size to load/store and executing in parallel may not be slower
than using a large size to load/store.  (Sure, more registers may be used for
smaller bytes.)


In PR65421, For source code as below:
t.c
#define FN 4
typedef struct { double a[FN]; } A;

A foo (const A *a) { return *a; }
A bar (const A a) { return a; }


So the first question in my mind is can we do better at the gimple 
phase?  For the second case in particular can't we just "return a" 
rather than copying a into <retval> then returning <retval>?  This feels 
a lot like the return value optimization from C++.  I'm not sure if it 
applies to the first case or not, it's been a long time since I looked 
at NRV optimizations, but it might be worth poking around in there a bit 
(tree-nrv.cc).



But even so, these kinds of things are still bound to happen, so it's 
probably worth thinking about if we can do better in RTL as well.



The first thing that comes to my mind is to annotate memcpy calls that 
are structure assignments.  The idea here is that we may want to expand 
a memcpy differently in those cases.   Changing how we expand an opaque 
memcpy call is unlikely to be beneficial in most cases.  But changing 
how we expand a structure copy may be beneficial by exposing the 
underlying field values.   This would roughly correspond to your method #1.


Or instead of changing how we expand, teach the optimizers about these 
annotated memcpy calls -- they're just a copy of each field.   That's 
how CSE and the propagators could treat them. After some point we'd 
lower them in the usual ways, but at least early in the RTL pipeline we 
could keep them as annotated memcpy calls.  This roughly corresponds to 
your second suggestion.



jeff





Re: [PATCH] RISC-V: Fix RVV testcases.

2022-10-31 Thread Jeff Law via Gcc-patches



On 10/30/22 19:40, juzhe.zh...@rivai.ai wrote:

From: Ju-Zhe Zhong 

gcc/testsuite/ChangeLog:

 * gcc.target/riscv/rvv/base/abi-2.c: Change ilp32d to ilp32.
 * gcc.target/riscv/rvv/base/abi-3.c: Ditto.
 * gcc.target/riscv/rvv/base/abi-4.c: Ditto.
 * gcc.target/riscv/rvv/base/abi-5.c: Ditto.
 * gcc.target/riscv/rvv/base/abi-6.c: Ditto.
 * gcc.target/riscv/rvv/base/abi-7.c: Ditto.
 * gcc.target/riscv/rvv/base/mov-1.c: Ditto.
 * gcc.target/riscv/rvv/base/mov-10.c: Ditto.
 * gcc.target/riscv/rvv/base/mov-11.c: Ditto.
 * gcc.target/riscv/rvv/base/mov-12.c: Ditto.
 * gcc.target/riscv/rvv/base/mov-13.c: Ditto.
 * gcc.target/riscv/rvv/base/mov-2.c: Ditto.
 * gcc.target/riscv/rvv/base/mov-3.c: Ditto.
 * gcc.target/riscv/rvv/base/mov-4.c: Ditto.
 * gcc.target/riscv/rvv/base/mov-5.c: Ditto.
 * gcc.target/riscv/rvv/base/mov-6.c: Ditto.
 * gcc.target/riscv/rvv/base/mov-7.c: Ditto.
 * gcc.target/riscv/rvv/base/mov-8.c: Ditto.
 * gcc.target/riscv/rvv/base/mov-9.c: Ditto.
 * gcc.target/riscv/rvv/base/pragma-1.c: Ditto.
 * gcc.target/riscv/rvv/base/user-1.c: Ditto.
 * gcc.target/riscv/rvv/base/user-2.c: Ditto.
 * gcc.target/riscv/rvv/base/user-3.c: Ditto.
 * gcc.target/riscv/rvv/base/user-4.c: Ditto.
 * gcc.target/riscv/rvv/base/user-5.c: Ditto.
 * gcc.target/riscv/rvv/base/user-6.c: Ditto.
 * gcc.target/riscv/rvv/base/vsetvl-1.c: Ditto.


I'm pretty new to the RISC-V world, but don't some of the cases 
(particularly the abi-* tests) verify that the ABI specification does 
not override the arch specification WRT availability of types?



Jeff


Re: [PATCH 3/8]middle-end: Support extractions of subvectors from arbitrary element position inside a vector

2022-10-31 Thread Jeff Law via Gcc-patches



On 10/31/22 05:57, Tamar Christina wrote:

Hi All,

The current vector extract pattern can only extract from a vector when the
position to extract is a multiple of the vector bitsize as a whole.

That means extract something like a V2SI from a V4SI vector from position 32
isn't possible as 32 is not a multiple of 64.  Ideally this optab should have
worked on multiple of the element size, but too many targets rely on this
semantic now.

So instead add a new case which allows any extraction as long as the bit pos
is a multiple of the element size.  We use a VEC_PERM to shuffle the elements
into the bottom parts of the vector and then use a subreg to extract the values
out.  This now allows various vector operations that before were being
decomposed into very inefficient scalar operations.
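
To make the "V2SI from V4SI at position 32" case concrete, the kind of
source this helps looks like the following (my illustration, not the
contents of the new testcase):

typedef int v4si __attribute__ ((vector_size (16)));
typedef int v2si __attribute__ ((vector_size (8)));

/* Extract the middle two lanes, i.e. a 64-bit chunk starting at bit
   position 32, which is not a multiple of the V2SI width.  */
v2si
mid (v4si x)
{
  return (v2si) { x[1], x[2] };
}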

NOTE: I added 3 testcases, I only fixed the 3rd one.

The 1st one is missed because we don't optimize VEC_PERM expressions into
bitfields.  The 2nd one is missed because extract_bit_field only works on
vector modes.  In this case the intermediate extract is DImode.

On targets where the scalar mode is tieable to vector modes the extract should
work fine.

However I ran out of time to fix the first two and so will do so in GCC 14.
For now this catches the case that my pattern now introduces more easily.

Bootstrapped Regtested on aarch64-none-linux-gnu, x86_64-pc-linux-gnu
and no issues.

Ok for master?

Thanks,
Tamar

gcc/ChangeLog:

* expmed.cc (extract_bit_field_1): Add support for vector element
extracts.

gcc/testsuite/ChangeLog:

* gcc.target/aarch64/ext_1.c: New.


OK.

jeff




[r13-3570 Regression] FAIL: g++.dg/other/pr39060.C -std=c++98 (test for excess errors) on Linux/x86_64

2022-10-31 Thread haochen.jiang via Gcc-patches
On Linux/x86_64,

259a11555c90783e53c046c310080407ee54a31e is the first bad commit
commit 259a11555c90783e53c046c310080407ee54a31e
Author: Jakub Jelinek 
Date:   Mon Oct 31 09:09:48 2022 +0100

builtins: Add various complex builtins for _Float{16,32,64,128,32x,64x,128x}

caused

FAIL: g++.dg/other/pr39060.C  -std=c++98 (internal compiler error: canonical 
types differ for identical types 'void (A::)(void*)' and 'void (A::)(void*)')
FAIL: g++.dg/other/pr39060.C  -std=c++98 (test for excess errors)

with GCC configured with

../../gcc/configure 
--prefix=/export/users/haochenj/src/gcc-bisect/master/master/r13-3570/usr 
--enable-clocale=gnu --with-system-zlib --with-demangler-in-ld 
--with-fpmath=sse --enable-languages=c,c++,fortran --enable-cet --without-isl 
--enable-libmpx x86_64-linux --disable-bootstrap

To reproduce:

$ cd {build_dir}/gcc && make check RUNTESTFLAGS="dg.exp=g++.dg/other/pr39060.C 
--target_board='unix{-m64\ -march=cascadelake}'"

(Please do not reply to this email, for question about this report, contact me 
at haochen dot jiang at intel.com)


Re: [PATCH 2/8]middle-end: Recognize scalar widening reductions

2022-10-31 Thread Jeff Law via Gcc-patches



On 10/31/22 05:57, Tamar Christina wrote:

Hi All,

This adds a new optab and IFNs for REDUC_PLUS_WIDEN where the resulting
scalar reduction has twice the precision of the input elements.

At some point in a later patch I will also teach the vectorizer to recognize
this builtin once I figure out how the various bits of reductions work.

For now it's generated only by the match.pd pattern.
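
The shape of operation this describes is roughly the following
(illustrative C only, not the match.pd pattern itself): a plus reduction
whose scalar result is twice as wide as the vector elements.

typedef short v8hi __attribute__ ((vector_size (16)));

/* Sum of 16-bit elements accumulated in a 32-bit scalar.  */
int
sum_widen (v8hi x)
{
  int r = 0;
  for (int i = 0; i < 8; i++)
    r += x[i];
  return r;
}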

Bootstrapped Regtested on aarch64-none-linux-gnu, x86_64-pc-linux-gnu
and no issues.

Ok for master?

Thanks,
Tamar

gcc/ChangeLog:

* internal-fn.def (REDUC_PLUS_WIDEN): New.
* doc/md.texi: Document it.
* match.pd: Recognize widening plus.
* optabs.def (reduc_splus_widen_scal_optab,
reduc_uplus_widen_scal_optab): New.


OK

jeff




Re: [PATCH 1/8]middle-end: Recognize scalar reductions from bitfields and array_refs

2022-10-31 Thread Jeff Law via Gcc-patches



On 10/31/22 05:56, Tamar Christina wrote:

Hi All,

This patch series is to add recognition of pairwise operations (reductions)
in match.pd such that we can benefit from them even at -O1 when the vectorizer
isn't enabled.

The use of these allows for much simpler codegen on AArch64 and lets us
avoid quite a lot of codegen warts.

As an example a simple:

typedef float v4sf __attribute__((vector_size (16)));

float
foo3 (v4sf x)
{
   return x[1] + x[2];
}

currently generates:

foo3:
 dup s1, v0.s[1]
 dup s0, v0.s[2]
 fadds0, s1, s0
 ret

while with this patch series now generates:

foo3:
ext v0.16b, v0.16b, v0.16b, #4
faddp   s0, v0.2s
ret

This patch will not perform the operation if the source is not a gimple
register and leaves memory sources to the vectorizer as it's able to deal
correctly with clobbers.

The use of these instructions makes a significant difference in codegen
quality for AArch64 and Arm.

NOTE: The last entry in the series contains tests for all of the previous
patches as it's a bit of an all or nothing thing.

Bootstrapped Regtested on aarch64-none-linux-gnu, x86_64-pc-linux-gnu
and no issues.

Ok for master?

Thanks,
Tamar

gcc/ChangeLog:

* match.pd (adjacent_data_access_p): Import.
Add new pattern for bitwise plus, min, max, fmax, fmin.
* tree-cfg.cc (verify_gimple_call): Allow function arguments in IFNs.
* tree.cc (adjacent_data_access_p): New.
* tree.h (adjacent_data_access_p): New.


Nice stuff.  I'd pondered some similar stuff at Tachyum, but got dragged 
away before it could be implemented.







diff --git a/gcc/tree.cc b/gcc/tree.cc
index 
007c9325b17076f474e6681c49966c59cf6b91c7..5315af38a1ead89ca5f75dc4b19de9841e29d311
 100644
--- a/gcc/tree.cc
+++ b/gcc/tree.cc
@@ -10457,6 +10457,90 @@ bitmask_inv_cst_vector_p (tree t)
return builder.build ();
  }
  
+/* Returns base address if the two operands represent adjacent access of data

+   such that a pairwise operation can be used.  OP1 must be a lower subpart
+   than OP2.  If POS is not NULL then on return if a value is returned POS
+   will indicate the position of the lower address.  If COMMUTATIVE_P then
+   the operation is also tried by flipping op1 and op2.  */
+
+tree adjacent_data_access_p (tree op1, tree op2, poly_uint64 *pos,
+bool commutative_p)


Formatting nit.  Return type on a different line.


OK with that fixed.


jeff




Re: [PATCH]middle-end simplify complex if expressions where comparisons are inverse of one another.

2022-10-31 Thread Jeff Law via Gcc-patches



On 10/31/22 05:42, Tamar Christina via Gcc-patches wrote:

Hi,

This is a cleaned up version addressing all feedback.

Bootstrapped Regtested on aarch64-none-linux-gnu,
x86_64-pc-linux-gnu and no issues.

Ok for master?

Thanks,
Tamar

gcc/ChangeLog:

* match.pd: Add new rule.

gcc/testsuite/ChangeLog:

* gcc.target/aarch64/if-compare_1.c: New test.
* gcc.target/aarch64/if-compare_2.c: New test.


OK

jeff




Re: [PATCH 1/2]middle-end: Add new tbranch optab to add support for bit-test-and-branch operations

2022-10-31 Thread Jeff Law via Gcc-patches



On 10/31/22 05:53, Tamar Christina wrote:

Hi All,

This adds a new test-and-branch optab that can be used to do a conditional test
of a bit and branch.   This is similar to the cbranch optab but instead can
test any arbitrary bit inside the register.

This patch recognizes boolean comparisons and single bit mask tests.
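
As an illustration of the source patterns this targets (my example, not one
taken from the patch):

void ext (void);

/* A single-bit mask test that can branch directly on bit 5 of x.  */
void
f (unsigned int x)
{
  if (x & (1u << 5))
    ext ();
}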

Bootstrapped Regtested on aarch64-none-linux-gnu and no issues.

Ok for master?

Thanks,
Tamar

gcc/ChangeLog:

* dojump.cc (do_jump): Pass along value.
(do_jump_by_parts_greater_rtx): Likewise.
(do_jump_by_parts_zero_rtx): Likewise.
(do_jump_by_parts_equality_rtx): Likewise.
(do_compare_rtx_and_jump): Likewise.
(do_compare_and_jump): Likewise.
* dojump.h (do_compare_rtx_and_jump): New.
* optabs.cc (emit_cmp_and_jump_insn_1): Refactor to take optab to check.
(validate_test_and_branch): New.
(emit_cmp_and_jump_insns): Optiobally take a value, and when value is
supplied then check if it's suitable for tbranch.
* optabs.def (tbranch$a4): New.
* doc/md.texi (tbranch@var{mode}4): Document it.
* optabs.h (emit_cmp_and_jump_insns):
* tree.h (tree_zero_one_valued_p): New.

--- inline copy of patch --
diff --git a/gcc/doc/md.texi b/gcc/doc/md.texi
index 
c08691ab4c9a4bfe55ae81e5e228a414d6242d78..f8b32ec12f46d3fb3815f121a16b5a8a1819b66a
 100644
--- a/gcc/doc/md.texi
+++ b/gcc/doc/md.texi
@@ -6972,6 +6972,13 @@ case, you can and should make operand 1's predicate 
reject some operators
  in the @samp{cstore@var{mode}4} pattern, or remove the pattern altogether
  from the machine description.
  
+@cindex @code{tbranch@var{mode}4} instruction pattern

+@item @samp{tbranch@var{mode}4}
+Conditional branch instruction combined with a bit test-and-compare
+instruction. Operand 0 is a comparison operator.  Operand 1 is the
+operand of the comparison. Operand 2 is the bit position of Operand 1 to test.
+Operand 3 is the @code{code_label} to jump to.


Should we refine/document the set of comparison operators allowed?  Is
operand 1 an arbitrary RTL expression or more limited?  I'm guessing it's
relatively arbitrary given how you've massaged the existing
branch-on-bit patterns from the aarch64 backend.




+
+  if (TREE_CODE (val) != SSA_NAME)
+return false;
+
+  gimple *def = SSA_NAME_DEF_STMT (val);
+  if (!is_gimple_assign (def)
+  || gimple_assign_rhs_code (def) != BIT_AND_EXPR)
+return false;
+
+  tree cst = gimple_assign_rhs2 (def);
+
+  if (!tree_fits_uhwi_p (cst))
+return false;
+
+  tree op0 = gimple_assign_rhs1 (def);
+  if (TREE_CODE (op0) == SSA_NAME)
+{
+  def = SSA_NAME_DEF_STMT (op0);
+  if (gimple_assign_cast_p (def))
+   op0 = gimple_assign_rhs1 (def);
+}
+
+  wide_int wcst = wi::uhwi (tree_to_uhwi (cst),
+   TYPE_PRECISION (TREE_TYPE (op0)));
+  int bitpos;
+
+  if ((bitpos = wi::exact_log2 (wcst)) == -1)
+return false;


Do we have enough information lying around from Ranger to avoid the need 
to walk the def-use chain to discover that we're masking off all but one 
bit?




  


diff --git a/gcc/tree.h b/gcc/tree.h
index 
8f8a9660c9e0605eb516de194640b8c1b531b798..be3d2dee82f692e81082cf21c878c10f9fe9e1f1
 100644
--- a/gcc/tree.h
+++ b/gcc/tree.h
@@ -4690,6 +4690,7 @@ extern tree signed_or_unsigned_type_for (int, tree);
  extern tree signed_type_for (tree);
  extern tree unsigned_type_for (tree);
  extern bool is_truth_type_for (tree, tree);
+extern bool tree_zero_one_valued_p (tree);


I don't see a definition of this anywhere.


jeff




[PATCH] x86: Track converted/skipped registers in STV

2022-10-31 Thread H.J. Lu via Gcc-patches
When converting integer computations into vector ones, we build a chain
from an integer definition instruction together with all dependent use
instructions.  The integer computations on the chain are converted to
vector ones if the total vector costs are lower than the integer ones.
Since the same register may appear in multiple chains, if it has been
converted or skipped in one chain, its instances in the other chains
must also be converted or skipped, regardless if the total vector costs
are lower than integer ones.  Otherwise, we will get the unexpected
vector mode in integer instruction patterns.

To track skipped registers, we add a bitmap, skipped_regs, when converting
integer computations into vector ones.  When computing gain for vector
computations, we convert or skip a chain if any register on the chain has
been converted or skipped already.

Note: If 2 integer registers on a chain, one has been converted and the
other has been skipped already, it will lead to a compiler error since
we can't undo the conversion.

gcc/

PR target/106933
PR target/106959
* config/i386/i386-features.cc (scalar_chain::skipped_regs): New.
(scalar_chain::update_skipped_regs): Likewise.
(scalar_chain::check_convert_gain): Likewise.
(general_scalar_chain::compute_convert_gain ): Return gain if
check_convert_gain returns non-zero.
(general_scalar_chain::compute_convert_gain): Call
update_skipped_regs if a chain won't be converted.
(timode_scalar_chain::compute_convert_gain): Likewise.
(convert_scalars_to_vector): Initialize and release
scalar_chain::skipped_regs before and after its use.
* config/i386/i386-features.h (scalar_chain): Add
skipped_regs, check_convert_gain and update_skipped_regs.

gcc/testsuite/

* gcc.target/i386/pr106933.c: New test.
* gcc.target/i386/pr106959.c: Likewise.
---
 gcc/config/i386/i386-features.cc | 104 ++-
 gcc/config/i386/i386-features.h  |   5 ++
 gcc/testsuite/gcc.target/i386/pr106933.c |  17 
 gcc/testsuite/gcc.target/i386/pr106959.c |  13 +++
 4 files changed, 137 insertions(+), 2 deletions(-)
 create mode 100644 gcc/testsuite/gcc.target/i386/pr106933.c
 create mode 100644 gcc/testsuite/gcc.target/i386/pr106959.c

diff --git a/gcc/config/i386/i386-features.cc b/gcc/config/i386/i386-features.cc
index fd212262f50..d9d63cf8d22 100644
--- a/gcc/config/i386/i386-features.cc
+++ b/gcc/config/i386/i386-features.cc
@@ -273,6 +273,8 @@ xlogue_layout::get_stub_rtx (enum xlogue_stub stub)
 
 unsigned scalar_chain::max_id = 0;
 
+bitmap_head scalar_chain::skipped_regs;
+
 namespace {
 
 /* Initialize new chain.  */
@@ -477,6 +479,72 @@ scalar_chain::build (bitmap candidates, unsigned insn_uid)
   BITMAP_FREE (queue);
 }
 
+/* Add all scalar mode registers, which are set by INSN and not used in
+   both vector and scalar modes, to skipped register map. */
+
+void
+scalar_chain::update_skipped_regs (rtx_insn *insn)
+{
+  for (df_ref def = DF_INSN_DEFS (insn);
+   def;
+   def = DF_REF_NEXT_LOC (def))
+{
+  rtx reg = DF_REF_REG (def);
+  if (GET_MODE (reg) == smode
+ && !bitmap_bit_p (defs_conv, REGNO (reg)))
+   bitmap_set_bit (&skipped_regs, REGNO (reg));
+}
+}
+
+/* Check convert gain for INSN.  Return 1 if any registers, which are
+   set or used by INSN, have been converted to vector mode.  Return -1
+   if any registers set by INSN are skipped in other chains.  Return 0
+   otherwise.  */
+
+int
+scalar_chain::check_convert_gain (rtx_insn *insn)
+{
+  for (df_ref def = DF_INSN_DEFS (insn);
+   def;
+   def = DF_REF_NEXT_LOC (def))
+{
+  rtx reg = DF_REF_REG (def);
+  if (GET_MODE (reg) == vmode)
+   {
+ if (dump_file)
+   fprintf (dump_file,
+"  Gain 1 for converted register r%d\n",
+REGNO (reg));
+ return 1;
+   }
+  else if (bitmap_bit_p (&skipped_regs, REGNO (reg)))
+   {
+ if (dump_file)
+   fprintf (dump_file,
+"  Gain -1 for skipped register r%d\n",
+REGNO (reg));
+ return -1;
+   }
+}
+
+  for (df_ref ref = DF_INSN_USES (insn);
+   ref;
+   ref = DF_REF_NEXT_LOC (ref))
+{
+  rtx reg = DF_REF_REG (ref);
+  if (GET_MODE (reg) == vmode)
+   {
+ if (dump_file)
+   fprintf (dump_file,
+"  Gain 1 for converted register r%d\n",
+REGNO (reg));
+ return 1;
+   }
+}
+
+  return 0;
+}
+
 /* Return a cost of building a vector costant
instead of using a scalar one.  */
 
@@ -515,10 +583,15 @@ general_scalar_chain::compute_convert_gain ()
   EXECUTE_IF_SET_IN_BITMAP (insns, 0, insn_uid, bi)
 {
   rtx_insn *insn = DF_INSN_UID_GET (insn_uid)->insn;
+  /* If check_convert_gain returns non-zero on any INSN, the chain
+must be converted 

[PATCH] libstdc++: Implement ranges::as_rvalue_view from P2446R2

2022-10-31 Thread Patrick Palka via Gcc-patches
Tested on x86_64-pc-linux-gnu, does this look OK for trunk?

libstdc++-v3/ChangeLog:

* include/std/ranges (as_rvalue_view): Define.
(enable_borrowed_range): Define.
(views::__detail::__can_as_rvalue_view): Define.
(views::_AsRvalue, views::as_rvalue): Define.
* testsuite/std/ranges/adaptors/as_rvalue/1.cc: New test.
---
 libstdc++-v3/include/std/ranges   | 88 +++
 .../std/ranges/adaptors/as_rvalue/1.cc| 47 ++
 2 files changed, 135 insertions(+)
 create mode 100644 libstdc++-v3/testsuite/std/ranges/adaptors/as_rvalue/1.cc

diff --git a/libstdc++-v3/include/std/ranges b/libstdc++-v3/include/std/ranges
index 959886a1a55..239b3b61d30 100644
--- a/libstdc++-v3/include/std/ranges
+++ b/libstdc++-v3/include/std/ranges
@@ -8486,6 +8486,94 @@ namespace views::__adaptor
 
 inline constexpr _CartesianProduct cartesian_product;
   }
+
+  template
+requires view<_Vp>
+  class as_rvalue_view : public view_interface>
+  {
+_Vp _M_base = _Vp();
+
+  public:
+as_rvalue_view() requires default_initializable<_Vp> = default;
+
+constexpr explicit
+as_rvalue_view(_Vp __base)
+: _M_base(std::move(__base))
+{ }
+
+constexpr _Vp
+base() const& requires copy_constructible<_Vp> { return _M_base; }
+
+constexpr _Vp
+base() && { return std::move(_M_base); }
+
+constexpr auto
+begin() requires (!__detail::__simple_view<_Vp>)
+{ return move_iterator(ranges::begin(_M_base)); }
+
+constexpr auto
+begin() const requires range
+{ return move_iterator(ranges::begin(_M_base)); }
+
+constexpr auto
+end() requires (!__detail::__simple_view<_Vp>)
+{
+  if constexpr (common_range<_Vp>)
+   return move_iterator(ranges::end(_M_base));
+  else
+   return move_sentinel(ranges::end(_M_base));
+}
+
+constexpr auto
+end() const requires range
+{
+  if constexpr (common_range)
+   return move_iterator(ranges::end(_M_base));
+  else
+   return move_sentinel(ranges::end(_M_base));
+}
+
+constexpr auto
+size() requires sized_range<_Vp>
+{ return ranges::size(_M_base); }
+
+constexpr auto
+size() const requires sized_range
+{ return ranges::size(_M_base); }
+  };
+
+  template
+as_rvalue_view(_Range&&) -> as_rvalue_view>;
+
+  template
+inline constexpr bool enable_borrowed_range>
+  = enable_borrowed_range<_Tp>;
+
+  namespace views
+  {
+namespace __detail
+{
+  template
+   concept __can_as_rvalue_view = requires { 
as_rvalue_view(std::declval<_Tp>()); };
+}
+
+struct _AsRvalue : __adaptor::_RangeAdaptorClosure
+{
+  template
+   requires __detail::__can_as_rvalue_view<_Range>
+   constexpr auto
+   operator() [[nodiscard]] (_Range&& __r) const
+   {
+ if constexpr (same_as,
+   range_reference_t<_Range>>)
+   return views::all(std::forward<_Range>(__r));
+ else
+   return as_rvalue_view(std::forward<_Range>(__r));
+   }
+};
+
+inline constexpr _AsRvalue as_rvalue;
+  }
 #endif // C++23
 } // namespace ranges
 
diff --git a/libstdc++-v3/testsuite/std/ranges/adaptors/as_rvalue/1.cc 
b/libstdc++-v3/testsuite/std/ranges/adaptors/as_rvalue/1.cc
new file mode 100644
index 000..8ca4f50e9d2
--- /dev/null
+++ b/libstdc++-v3/testsuite/std/ranges/adaptors/as_rvalue/1.cc
@@ -0,0 +1,47 @@
+// { dg-options "-std=gnu++23" }
+// { dg-do run { target c++23 } }
+
+#include 
+#include 
+#include 
+#include 
+#include 
+
+namespace ranges = std::ranges;
+namespace views = std::views;
+
+constexpr bool
+test01()
+{
+
+  std::unique_ptr a[3] = { std::make_unique(1),
+   std::make_unique(2),
+   std::make_unique(3) };
+  std::unique_ptr b[3];
+  auto v = a | views::as_rvalue;
+  ranges::copy(v, b);
+  VERIFY( ranges::all_of(a, [](auto& p) { return p.get() == nullptr; }) );
+  VERIFY( ranges::equal(b | views::transform([](auto& p) { return *p; }), 
(int[]){1, 2, 3}) );
+
+  return true;
+}
+
+void
+test02()
+{
+  std::unique_ptr x = std::make_unique(42);
+  std::unique_ptr y;
+  __gnu_test::test_input_range rx(, +1);
+  auto v = rx | views::as_rvalue;
+  static_assert(!ranges::common_range);
+  ranges::copy(v, );
+  VERIFY( x.get() == nullptr );
+  VERIFY( *y == 42 );
+}
+
+int
+main()
+{
+  static_assert(test01());
+  test02();
+}
-- 
2.38.1.381.gc03801e19c



Re: [PATCH, v2] Fortran: ordering of hidden procedure arguments [PR107441]

2022-10-31 Thread Harald Anlauf via Gcc-patches

Hi Mikael,

thanks a lot, your testcases broke my initial (and incorrect) patch
in multiple ways.  I understand now that the right solution is much
simpler and smaller.

I've added your testcases, see attached, with a simple scan of the
dump for the generated order of hidden arguments in the function decl
for the last testcase.

Regtested again on x86_64-pc-linux-gnu.  OK now?

Thanks,
Harald


Am 31.10.22 um 10:57 schrieb Mikael Morin:

Le 30/10/2022 à 22:25, Mikael Morin a écrit :

Le 30/10/2022 à 20:23, Mikael Morin a écrit :

Another probable issue is your change to create_function_arglist
changes arglist/hidden_arglist without also changing
typelist/hidden_typelist accordingly.  I think a change to
gfc_get_function_type is also necessary: as the function decl is
changed, the decl type needs to be changed as well.

I will see whether I can manage to exhibit testcases for these issues.


Here is a test for the type vs decl mismatch.

! { dg-do run }
!
! PR fortran/107441
! Check that procedure types and procedure decls match when the procedure
! has both character-typed and optional value args.

program p
   interface
 subroutine i(c, o)
   character(*) :: c
   integer, optional, value :: o
 end subroutine i
   end interface
   procedure(i), pointer :: pp

A pointer initialization is missing here:
     pp => s

   call pp("abcd")
contains
   subroutine s(c, o)
 character(*) :: c
 integer, optional, value :: o
 if (present(o)) stop 1
 if (len(c) /= 4) stop 2
 if (c /= "abcd") stop 3
   end subroutine s
end program p



With the additional initialization, the test passes, so it's not very
useful.  The type mismatch is visible in the dump though, so maybe a
dump match can be used.

From 705628c89faa1135ed9a446b84e831bbead6095a Mon Sep 17 00:00:00 2001
From: Harald Anlauf 
Date: Fri, 28 Oct 2022 21:58:08 +0200
Subject: [PATCH] Fortran: ordering of hidden procedure arguments [PR107441]

gcc/fortran/ChangeLog:

	PR fortran/107441
	* trans-decl.cc (create_function_arglist): Adjust the ordering of
	automatically generated hidden procedure arguments to match the
	documented ABI for gfortran.  The present status for optional+value
	arguments is passed before character length and coarray token and
	offset.

gcc/testsuite/ChangeLog:

	PR fortran/107441
* gfortran.dg/coarray/pr107441-caf.f90: New test.
	* gfortran.dg/optional_absent_6.f90: New test.
	* gfortran.dg/optional_absent_7.f90: New test.
---
 gcc/fortran/trans-decl.cc |  8 ++-
 .../gfortran.dg/coarray/pr107441-caf.f90  | 27 +
 .../gfortran.dg/optional_absent_6.f90 | 60 +++
 .../gfortran.dg/optional_absent_7.f90 | 30 ++
 4 files changed, 123 insertions(+), 2 deletions(-)
 create mode 100644 gcc/testsuite/gfortran.dg/coarray/pr107441-caf.f90
 create mode 100644 gcc/testsuite/gfortran.dg/optional_absent_6.f90
 create mode 100644 gcc/testsuite/gfortran.dg/optional_absent_7.f90

diff --git a/gcc/fortran/trans-decl.cc b/gcc/fortran/trans-decl.cc
index 63515b9072a..64b35f054e5 100644
--- a/gcc/fortran/trans-decl.cc
+++ b/gcc/fortran/trans-decl.cc
@@ -2508,7 +2508,7 @@ create_function_arglist (gfc_symbol * sym)
   tree fndecl;
   gfc_formal_arglist *f;
   tree typelist, hidden_typelist;
-  tree arglist, hidden_arglist;
+  tree arglist, hidden_arglist, optval_arglist;
   tree type;
   tree parm;
 
@@ -2518,6 +2518,7 @@ create_function_arglist (gfc_symbol * sym)
  the new FUNCTION_DECL node.  */
   arglist = NULL_TREE;
   hidden_arglist = NULL_TREE;
+  optval_arglist = NULL_TREE;
   typelist = TYPE_ARG_TYPES (TREE_TYPE (fndecl));
 
   if (sym->attr.entry_master)
@@ -2712,7 +2713,7 @@ create_function_arglist (gfc_symbol * sym)
 			PARM_DECL, get_identifier (name),
 			boolean_type_node);
 
-  hidden_arglist = chainon (hidden_arglist, tmp);
+	  optval_arglist = chainon (optval_arglist, tmp);
   DECL_CONTEXT (tmp) = fndecl;
   DECL_ARTIFICIAL (tmp) = 1;
   DECL_ARG_TYPE (tmp) = boolean_type_node;
@@ -2863,6 +2864,9 @@ create_function_arglist (gfc_symbol * sym)
   typelist = TREE_CHAIN (typelist);
 }
 
+  /* Add hidden present status for optional+value arguments.  */
+  arglist = chainon (arglist, optval_arglist);
+
   /* Add the hidden string length parameters, unless the procedure
  is bind(C).  */
   if (!sym->attr.is_bind_c)
diff --git a/gcc/testsuite/gfortran.dg/coarray/pr107441-caf.f90 b/gcc/testsuite/gfortran.dg/coarray/pr107441-caf.f90
new file mode 100644
index 000..23b2242e217
--- /dev/null
+++ b/gcc/testsuite/gfortran.dg/coarray/pr107441-caf.f90
@@ -0,0 +1,27 @@
+! { dg-do run }
+!
+! PR fortran/107441
+! Check that with -fcoarray=lib, coarray metadata arguments are passed
+! in the right order to procedures.
+!
+! Contributed by M.Morin
+
+program p
+  integer :: ci[*]
+  ci = 17
+  call s(1, ci, "abcd")
+contains
+  subroutine s(ra, ca, c)
+integer :: ra, ca[*]
+

Re: [PATCH] Add __builtin_iseqsig()

2022-10-31 Thread FX via Gcc-patches
Hi,

Just adding, from the Fortran 2018 perspective, things we will need to 
implement for which I think support from the middle-end might be necessary:

- rounded conversions: converting, from an integer or floating point type, into 
another floating point type, with specific rounding mode passed as argument
- conversion to integer: converting, from a floating point type, into an 
integer type, with specific rounding mode passed as argument
- IEEE operations corresponding to nextDown and nextUp (or are those already 
available? I have not checked the fine print)

I would like to add them all for GCC 13.

FX

[PATCH] c, analyzer: support named constants in analyzer [PR106302]

2022-10-31 Thread David Malcolm via Gcc-patches
The analyzer's file-descriptor state machine tracks the access mode of
opened files, so that it can emit -Wanalyzer-fd-access-mode-mismatch.

To do this, its symbolic execution needs to "know" the values of the
constants "O_RDONLY", "O_WRONLY", and "O_ACCMODE".  Currently
analyzer/sm-fd.cc simply uses these values directly from the build-time
header files, but these are the values on the host, not those from the
target, which could be different (PR analyzer/106302).
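
For instance, the state machine needs the target's O_RDONLY value to flag
code along these lines (a minimal sketch of what
-Wanalyzer-fd-access-mode-mismatch is meant to catch):

#include <fcntl.h>
#include <unistd.h>

void read_only_then_write (const char *path)
{
  int fd = open (path, O_RDONLY);   /* opened for reading only */
  if (fd == -1)
    return;
  write (fd, "hi", 2);   /* -Wanalyzer-fd-access-mode-mismatch */
  close (fd);
}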

In an earlier discussion of this issue:
  https://gcc.gnu.org/pipermail/gcc/2022-June/238954.html
we talked about adding a target hook for this.

However, I've also been experimenting with extending the fd state
machine to track sockets (PR analyzer/106140).  For this, it's useful to
"know" the values of the constants "SOCK_STREAM" and "SOCK_DGRAM".
Unfortunately, these seem to have many arbitrary differences from target
to target.

For example: Linux/glibc generally has SOCK_STREAM == 1, SOCK_DGRAM == 2,
as does AIX, but annoyingly, e.g. Linux on MIPS has them the other way
around.

It seems to me that as the analyzer grows more ambitious modeling of the
behavior of APIs (perhaps via plugins) it's more likely that the
analyzer will need to know the values of named constants, which might
not even exist on the host.

For example, at LPC it was suggested to me that -fanalyzer could check
rules about memory management inside the Linux kernel (probably via a
plugin), but doing so involves a bunch of GFP_* flags (see PR 107472).

So rather than trying to capture all this knowledge in a target hook,
this patch attempts to get at named constant values from the user's
source code.

The patch adds an interface for frontends to call into the analyzer as
the translation unit finishes.  The analyzer can then call back into the
frontend to ask about the values of the named constants it cares about
whilst the frontend's data structures are still around.

The patch implements this for the C frontend, which looks up the names
by looking for named CONST_DECLs (which handles enum values).  Failing
that, it attempts to look up the values of macros but only the simplest
cases are supported (a non-traditional macro with a single CPP_NUMBER
token).  It does this by building a buffer containing the macro
definition and rerunning a lexer on it.

The analyzer gracefully handles the cases where named values aren't
found (such as anything more complicated than described above).
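
As a rough illustration of what that lookup covers (hypothetical names;
__analyzer_dump_named_constant is the debugging hook added by this patch,
with its signature assumed below):

extern void __analyzer_dump_named_constant (const char *name);

enum access_kind { KIND_READ = 0, KIND_WRITE = 1 }; /* enumerators are CONST_DECLs: found */
#define SIMPLE_FLAG 4           /* single CPP_NUMBER token: found */
#define COMBINED_FLAGS (4 | 8)  /* anything more complex: not found */

void dump_them (void)
{
  __analyzer_dump_named_constant ("KIND_WRITE");     /* expected to report 1 */
  __analyzer_dump_named_constant ("COMBINED_FLAGS"); /* expected to be reported as unknown */
}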

The patch ports the analyzer to use this mechanism for "O_RDONLY",
"O_WRONLY", and "O_ACCMODE".  I have successfully tested my socket patch
to also use this for "SOCK_STREAM" and "SOCK_DGRAM", so the technique
seems to work.

Successfully bootstrapped & regrtested on x86_64-pc-linux-gnu.

Are the C frontend parts OK for trunk?

Thanks
Dave

gcc/ChangeLog:
PR analyzer/106302
* Makefile.in (ANALYZER_OBJS): Add analyzer/analyzer-language.o.
(GTFILES): Add analyzer/analyzer-language.cc.

gcc/analyzer/ChangeLog:
PR analyzer/106302
* analyzer-language.cc: New file.
* analyzer-language.h: New file.
* analyzer.h (get_stashed_constant_by_name): New decl.
(log_stashed_constants): New decl.
* engine.cc (impl_run_checkers): Call log_stashed_constants.
* region-model-impl-calls.cc
(region_model::impl_call_analyzer_dump_named_constant): New.
* region-model.cc (region_model::on_stmt_pre): Handle
__analyzer_dump_named_constant.
* region-model.h
(region_model::impl_call_analyzer_dump_named_constant): New decl.
* sm-fd.cc (fd_state_machine::m_O_ACCMODE): New.
(fd_state_machine::m_O_RDONLY): New.
(fd_state_machine::m_O_WRONLY): New.
(fd_state_machine::fd_state_machine): Initialize the new fields.
(fd_state_machine::get_access_mode_from_flag): Use the new fields,
rather than using the host values.

gcc/c/ChangeLog:
PR analyzer/106302
* c-parser.cc: Include "analyzer/analyzer-language.h" and "toplev.h".
(class ana::c_translation_unit): New.
(c_parser_translation_unit): Call ana::on_finish_translation_unit.

gcc/ChangeLog:
* doc/analyzer.texi: Document __analyzer_dump_named_constant.

gcc/testsuite/ChangeLog:
* gcc.dg/analyzer/analyzer-decls.h
(__analyzer_dump_named_constant): New decl.
* gcc.dg/analyzer/fd-4.c (void): Likewise.
(O_ACCMODE): Define.
* gcc.dg/analyzer/fd-access-mode-enum.c: New test, based on .
* gcc.dg/analyzer/fd-5.c: ...this.  Rename to...
* gcc.dg/analyzer/fd-access-mode-macros.c: ...this.
(O_ACCMODE): Define.
* gcc.dg/analyzer/fd-access-mode-target-headers.c: New test, also
based on fd-5.c.
(test_sm_fd_constants): New.
* gcc.dg/analyzer/fd-dup-1.c (O_ACCMODE): Define.
* gcc.dg/analyzer/named-constants-via-enum.c: New test.
* gcc.dg/analyzer/named-constants-via-macros-2.c: New test.
  

Re: [PATCH v4] btf: Add support to BTF_KIND_ENUM64 type

2022-10-31 Thread Indu Bhagat via Gcc-patches

On 10/21/22 2:28 AM, Indu Bhagat via Gcc-patches wrote:

On 10/19/22 19:05, Guillermo E. Martinez wrote:

Hello,

The following is patch v4 to update BTF/CTF backend supporting
BTF_KIND_ENUM64 type. Changes from v3:

   + Remove `ctf_enum_binfo' structure.
   + Remove -m{little,big}-endian from dg-options in testcase.

Comments will be welcomed and appreciated!,

Kind regards,
guillermo
--



Thanks Guillermo.

LGTM.



Pushed on behalf of Guillermo.

Thanks


BTF supports 64-bit enumerators with the following encoding:

   struct btf_type:
 name_off: 0 or offset to a valid C identifier
 info.kind_flag: 0 for unsigned, 1 for signed
 info.kind: BTF_KIND_ENUM64
 info.vlen: number of enum values
 size: 1/2/4/8

The btf_type is followed by info.vlen number of:

 struct btf_enum64
 {
   uint32_t name_off;   /* Offset in string section of enumerator 
name.  */
   uint32_t val_lo32;   /* lower 32-bit value for a 64-bit value 
Enumerator */
   uint32_t val_hi32;   /* high 32-bit value for a 64-bit value 
Enumerator */

 };

So, a new btf_enum64 structure was added to represent BTF_KIND_ENUM64,
and a new field dtd_enum_unsigned was added to the ctf_dtdef structure to
distinguish whether a CTF enum is a signed or unsigned type; that
information is later used to encode the BTF enum type.
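
In other words, each 64-bit enumerator value is emitted as two 32-bit
halves; a C sketch of the split (helper name chosen for illustration):

#include <stdint.h>

static void
btf_split_enum64_value (uint64_t value, uint32_t *val_lo32, uint32_t *val_hi32)
{
  *val_lo32 = (uint32_t) (value & 0xffffffffULL);  /* lower 32 bits */
  *val_hi32 = (uint32_t) (value >> 32);            /* upper 32 bits */
}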

gcc/ChangeLog:

* btfout.cc (btf_calc_num_vbytes): Compute enumeration size depending
on enumerator type btf_enum{,64}.
(btf_asm_type): Update btf_kflag according to enumeration type sign
using dtd_enum_unsigned field for both:  BTF_KIND_ENUM{,64}.
(btf_asm_enum_const): New argument to represent the size of
the BTF enum type, writing the enumerator constant value for
32 bits, if it's 64 bits then explicitly writes lower 32-bits
value and higher 32-bits value.
(output_asm_btf_enum_list): Add enumeration size argument.
* ctfc.cc (ctf_add_enum): New argument to represent CTF enum
basic information.
(ctf_add_generic): Use of ei_{name. size, unsigned} to build the
dtd structure containing enumeration information.
(ctf_add_enumerator): Update comment mention support for BTF
enumeration in 64-bits.
* dwarf2ctf.cc (gen_ctf_enumeration_type): Extract signedness
for enumeration type and use it in ctf_add_enum.
* ctfc.h (ctf_dmdef): Update dmd_value to HOST_WIDE_INT to allow
use of 32/64-bit enumerators.
(ctf_dtdef): New field to describe enum signedness.

include/
* btf.h (btf_enum64): Add new definition and new symbolic
constant to BTF_KIND_ENUM64 and BTF_KF_ENUM_{UN,}SIGNED.

gcc/testsuite/ChangeLog:

* gcc.dg/debug/btf/btf-enum-1.c: Update testcase, with correct
info.kflags encoding.
* gcc.dg/debug/btf/btf-enum64-1.c: New testcase.
---
  gcc/btfout.cc | 30 ++---
  gcc/ctfc.cc   | 13 +++---
  gcc/ctfc.h    |  5 ++-
  gcc/dwarf2ctf.cc  |  5 ++-
  gcc/testsuite/gcc.dg/debug/btf/btf-enum-1.c   |  2 +-
  gcc/testsuite/gcc.dg/debug/btf/btf-enum64-1.c | 44 +++
  include/btf.h | 19 ++--
  7 files changed, 100 insertions(+), 18 deletions(-)
  create mode 100644 gcc/testsuite/gcc.dg/debug/btf/btf-enum64-1.c

diff --git a/gcc/btfout.cc b/gcc/btfout.cc
index 997a33fa089..aef9fd70a28 100644
--- a/gcc/btfout.cc
+++ b/gcc/btfout.cc
@@ -223,7 +223,9 @@ btf_calc_num_vbytes (ctf_dtdef_ref dtd)
    break;
  case BTF_KIND_ENUM:
-  vlen_bytes += vlen * sizeof (struct btf_enum);
+  vlen_bytes += (dtd->dtd_data.ctti_size == 0x8)
+    ? vlen * sizeof (struct btf_enum64)
+    : vlen * sizeof (struct btf_enum);
    break;
  case BTF_KIND_FUNC_PROTO:
@@ -622,6 +624,15 @@ btf_asm_type (ctf_container_ref ctfc, 
ctf_dtdef_ref dtd)

    btf_size_type = 0;
  }
+  if (btf_kind == BTF_KIND_ENUM)
+    {
+  btf_kflag = dtd->dtd_enum_unsigned
+    ? BTF_KF_ENUM_UNSIGNED
+    : BTF_KF_ENUM_SIGNED;
+  if (dtd->dtd_data.ctti_size == 0x8)
+    btf_kind = BTF_KIND_ENUM64;
+   }
+
    dw2_asm_output_data (4, dtd->dtd_data.ctti_name, "btt_name");
    dw2_asm_output_data (4, BTF_TYPE_INFO (btf_kind, btf_kflag, 
btf_vlen),

 "btt_info: kind=%u, kflag=%u, vlen=%u",
@@ -634,6 +645,7 @@ btf_asm_type (ctf_container_ref ctfc, 
ctf_dtdef_ref dtd)

  case BTF_KIND_UNION:
  case BTF_KIND_ENUM:
  case BTF_KIND_DATASEC:
+    case BTF_KIND_ENUM64:
    dw2_asm_output_data (4, dtd->dtd_data.ctti_size, "btt_size: %uB",
 dtd->dtd_data.ctti_size);
    return;
@@ -707,13 +719,19 @@ btf_asm_sou_member (ctf_container_ref ctfc, 
ctf_dmdef_t * dmd)

  }
  }
-/* Asm'out an enum constant following a BTF_KIND_ENUM.  */
+/* Asm'out an enum constant following a BTF_KIND_ENUM{,64}.  */
  static void
-btf_asm_enum_const (ctf_dmdef_t * dmd)
+btf_asm_enum_const (unsigned 

Re: [PATCH] Add __builtin_iseqsig()

2022-10-31 Thread Joseph Myers
On Fri, 28 Oct 2022, Jeff Law via Gcc-patches wrote:

> Joseph, do you have bits in this space that are going to be landing soon, or
> is your C2X work focused elsewhere?  Are there other C2X routines we need to
> be proving builtins for?

I don't have any builtins work planned for GCC 13 (maybe adjustments to 
__builtin_tgmath semantics to match changes in C2X, but that's a keyword, 
not a built-in function).

See  
for my comments on (the tests in) this patch.

Lots of  functions could sensibly have built-in versions, whether 
for inline expansion, optimization for constant arguments or both.  Note 
that for those added from TS 18661-4, it will be more convenient to add 
glibc support once MPFR 4.2 is out so that gen-auto-libm-tests doesn't 
depend on an unreleased MPFR version, and likewise MPFR 4.2 will be needed 
for optimizing those functions for constant arguments.  But other 
highlights for which built-in functions might make sense in some cases 
include: issubnormal, iszero (see bugs 77925 / 77926, where Tamar 
Christina's patch needed to be reverted); probably the fromfp functions 
(but note that the interface in C2X is different from that in TS 18661-1 
and I haven't yet implemented those changes in glibc); the functions that 
round their result to a narrower type (supported as hardware operations on 
current POWER, I think); the functions bound to new maximum / minimum 
operations from IEEE 754-2019 (some of which are supported by RISC-V 
instructions).  Also the  functions; I expect to implement those 
for glibc largely using existing built-in functions, but more direct 
built-in function support for the  names may make sense.

-- 
Joseph S. Myers
jos...@codesourcery.com


Re: [PATCH 1/4]middle-end Support not decomposing specific divisions during vectorization.

2022-10-31 Thread Jeff Law via Gcc-patches



On 10/31/22 05:34, Tamar Christina wrote:

The type of the expression should be available via the mode and the
signedness, no?  So maybe to avoid having both RTX and TREE on the target
hook pass it a wide_int instead for the divisor?


Done.

Bootstrapped Regtested on aarch64-none-linux-gnu, x86_64-pc-linux-gnu
and no issues.

Ok for master?

Thanks,
Tamar

gcc/ChangeLog:

* expmed.h (expand_divmod): Pass tree operands down in addition to RTX.
* expmed.cc (expand_divmod): Likewise.
* explow.cc (round_push, align_dynamic_address): Likewise.
* expr.cc (force_operand, expand_expr_divmod): Likewise.
* optabs.cc (expand_doubleword_mod, expand_doubleword_divmod):
Likewise.
* target.h: Include tree-core.
* target.def (can_special_div_by_const): New.
* targhooks.cc (default_can_special_div_by_const): New.
* targhooks.h (default_can_special_div_by_const): New.
* tree-vect-generic.cc (expand_vector_operation): Use it.
* doc/tm.texi.in: Document it.
* doc/tm.texi: Regenerate.
* tree-vect-patterns.cc (vect_recog_divmod_pattern): Check for support.
* tree-vect-stmts.cc (vectorizable_operation): Likewise.

gcc/testsuite/ChangeLog:

* gcc.dg/vect/vect-div-bitmask-1.c: New test.
* gcc.dg/vect/vect-div-bitmask-2.c: New test.
* gcc.dg/vect/vect-div-bitmask-3.c: New test.
* gcc.dg/vect/vect-div-bitmask.h: New file.

--- inline copy of patch ---


OK for the trunk.


Jeff



Re: [committed] libstdc++: Fix compare_exchange_padding.cc test for std::atomic_ref

2022-10-31 Thread Jonathan Wakely via Gcc-patches
On Mon, 31 Oct 2022 at 17:03, Eric Botcazou  wrote:
>
> > I suppose we could use memcmp on the as variable itself, to inspect
> > the actual stored padding rather than the returned copy of it.
>
> Yes, that's probably the only safe stance when optimization is enabled.


Strictly speaking, it's not safe, because it's undefined to use memcmp
on an object of a non-trivial type. But it should work.



Re: [PATCH] libstdc++-v3: support for extended floating point types

2022-10-31 Thread Jonathan Wakely via Gcc-patches
On Mon, 31 Oct 2022 at 16:57, Jakub Jelinek  wrote:
>
> On Mon, Oct 31, 2022 at 10:26:11AM +, Jonathan Wakely wrote:
> > > --- libstdc++-v3/include/std/complex.jj 2022-10-21 08:55:43.037675332 
> > > +0200
> > > +++ libstdc++-v3/include/std/complex2022-10-21 17:05:36.802243229 
> > > +0200
> > > @@ -142,8 +142,14 @@ _GLIBCXX_BEGIN_NAMESPACE_VERSION
> > >
> > >///  Converting constructor.
> > >template
> > > +#if __cplusplus > 202002L
> > > +   explicit(!requires(_Up __u) { _Tp{__u}; })
> > > +   constexpr complex(const complex<_Up>& __z)
> > > +   : _M_real(_Tp(__z.real())), _M_imag(_Tp(__z.imag())) { }
> >
> > Do these casts to _Tp do anything? The _M_real and _M_imag members are
> > already of type _Tp and we're using () initialization not {} so
> > there's no narrowing concern. _M_real(__z.real()) is already an
> > explicit conversion from decltype(__z.real()) to decltype(_M_real) so
> > the extra _Tp is redundant.
>
> If I use just
>: _M_real(__z.real()), _M_imag(__z.imag()) { }
> then without -Wsystem-headers there are no regressions, but compiling
> g++.dg/cpp23/ext-floating12.C with additional -Wsystem-headers
> -pedantic-errors results in lots of extra errors on that line:
> .../x86_64-pc-linux-gnu/libstdc++-v3/include/complex:149:27: error: 
> converting to ‘_Float32’ from ‘double’ with greater conversion rank
> .../x86_64-pc-linux-gnu/libstdc++-v3/include/complex:149:48: error: 
> converting to ‘_Float32’ from ‘double’ with greater conversion rank
> .../x86_64-pc-linux-gnu/libstdc++-v3/include/complex:149:27: error: 
> converting to ‘_Float32’ from ‘long double’ with greater conversion rank
> .../x86_64-pc-linux-gnu/libstdc++-v3/include/complex:149:48: error: 
> converting to ‘_Float32’ from ‘long double’ with greater conversion rank
> .../x86_64-pc-linux-gnu/libstdc++-v3/include/complex:149:27: error: 
> converting to ‘_Float32’ from ‘_Float64’ with greater conversion rank
> .../x86_64-pc-linux-gnu/libstdc++-v3/include/complex:149:48: error: 
> converting to ‘_Float32’ from ‘_Float64’ with greater conversion rank
> .../x86_64-pc-linux-gnu/libstdc++-v3/include/complex:149:27: error: 
> converting to ‘_Float32’ from ‘_Float128’ with greater conversion rank
> .../x86_64-pc-linux-gnu/libstdc++-v3/include/complex:149:48: error: 
> converting to ‘_Float32’ from ‘_Float128’ with greater conversion rank
> .../x86_64-pc-linux-gnu/libstdc++-v3/include/complex:149:27: error: 
> converting to ‘_Float64’ from ‘long double’ with greater conversion rank
> .../x86_64-pc-linux-gnu/libstdc++-v3/include/complex:149:48: error: 
> converting to ‘_Float64’ from ‘long double’ with greater conversion rank
> .../x86_64-pc-linux-gnu/libstdc++-v3/include/complex:149:27: error: 
> converting to ‘_Float64’ from ‘_Float128’ with greater conversion rank
> .../x86_64-pc-linux-gnu/libstdc++-v3/include/complex:149:48: error: 
> converting to ‘_Float64’ from ‘_Float128’ with greater conversion rank
> .../x86_64-pc-linux-gnu/libstdc++-v3/include/complex:149:27: error: 
> converting to ‘_Float16’ from ‘float’ with greater conversion rank
> .../x86_64-pc-linux-gnu/libstdc++-v3/include/complex:149:48: error: 
> converting to ‘_Float16’ from ‘float’ with greater conversion rank
> .../x86_64-pc-linux-gnu/libstdc++-v3/include/complex:149:27: error: 
> converting to ‘_Float16’ from ‘double’ with greater conversion rank
> .../x86_64-pc-linux-gnu/libstdc++-v3/include/complex:149:48: error: 
> converting to ‘_Float16’ from ‘double’ with greater conversion rank
> .../x86_64-pc-linux-gnu/libstdc++-v3/include/complex:149:27: error: 
> converting to ‘_Float16’ from ‘long double’ with greater conversion rank
> .../x86_64-pc-linux-gnu/libstdc++-v3/include/complex:149:48: error: 
> converting to ‘_Float16’ from ‘long double’ with greater conversion rank
> .../x86_64-pc-linux-gnu/libstdc++-v3/include/complex:149:27: error: 
> converting to ‘_Float16’ from ‘_Float32’ with greater conversion rank
> .../x86_64-pc-linux-gnu/libstdc++-v3/include/complex:149:48: error: 
> converting to ‘_Float16’ from ‘_Float32’ with greater conversion rank
> .../x86_64-pc-linux-gnu/libstdc++-v3/include/complex:149:27: error: 
> converting to ‘_Float16’ from ‘_Float64’ with greater conversion rank
> .../x86_64-pc-linux-gnu/libstdc++-v3/include/complex:149:48: error: 
> converting to ‘_Float16’ from ‘_Float64’ with greater conversion rank
> .../x86_64-pc-linux-gnu/libstdc++-v3/include/complex:149:27: error: 
> converting to ‘_Float16’ from ‘_Float128’ with greater conversion rank
> .../x86_64-pc-linux-gnu/libstdc++-v3/include/complex:149:48: error: 
> converting to ‘_Float16’ from ‘_Float128’ with greater conversion rank
> .../x86_64-pc-linux-gnu/libstdc++-v3/include/complex:149:27: error: 
> converting to ‘_Float16’ from ‘__bf16’ with unordered conversion ranks
> .../x86_64-pc-linux-gnu/libstdc++-v3/include/complex:149:48: error: 
> converting to ‘_Float16’ from ‘__bf16’ with unordered conversion ranks
> 

Re: [committed] libstdc++: Fix compare_exchange_padding.cc test for std::atomic_ref

2022-10-31 Thread Eric Botcazou via Gcc-patches
> I suppose we could use memcmp on the as variable itself, to inspect
> the actual stored padding rather than the returned copy of it.

Yes, that's probably the only safe stance when optimization is enabled.

-- 
Eric Botcazou




Re: [PATCH] libstdc++-v3: support for extended floating point types

2022-10-31 Thread Jakub Jelinek via Gcc-patches
On Mon, Oct 31, 2022 at 10:26:11AM +, Jonathan Wakely wrote:
> > --- libstdc++-v3/include/std/complex.jj 2022-10-21 08:55:43.037675332 +0200
> > +++ libstdc++-v3/include/std/complex2022-10-21 17:05:36.802243229 +0200
> > @@ -142,8 +142,14 @@ _GLIBCXX_BEGIN_NAMESPACE_VERSION
> >
> >///  Converting constructor.
> >template
> > +#if __cplusplus > 202002L
> > +   explicit(!requires(_Up __u) { _Tp{__u}; })
> > +   constexpr complex(const complex<_Up>& __z)
> > +   : _M_real(_Tp(__z.real())), _M_imag(_Tp(__z.imag())) { }
> 
> Do these casts to _Tp do anything? The _M_real and _M_imag members are
> already of type _Tp and we're using () initialization not {} so
> there's no narrowing concern. _M_real(__z.real()) is already an
> explicit conversion from decltype(__z.real()) to decltype(_M_real) so
> the extra _Tp is redundant.

If I use just
   : _M_real(__z.real()), _M_imag(__z.imag()) { }
then without -Wsystem-headers there are no regressions, but compiling
g++.dg/cpp23/ext-floating12.C with additional -Wsystem-headers
-pedantic-errors results in lots of extra errors on that line:
.../x86_64-pc-linux-gnu/libstdc++-v3/include/complex:149:27: error: converting 
to ‘_Float32’ from ‘double’ with greater conversion rank
.../x86_64-pc-linux-gnu/libstdc++-v3/include/complex:149:48: error: converting 
to ‘_Float32’ from ‘double’ with greater conversion rank
.../x86_64-pc-linux-gnu/libstdc++-v3/include/complex:149:27: error: converting 
to ‘_Float32’ from ‘long double’ with greater conversion rank
.../x86_64-pc-linux-gnu/libstdc++-v3/include/complex:149:48: error: converting 
to ‘_Float32’ from ‘long double’ with greater conversion rank
.../x86_64-pc-linux-gnu/libstdc++-v3/include/complex:149:27: error: converting 
to ‘_Float32’ from ‘_Float64’ with greater conversion rank
.../x86_64-pc-linux-gnu/libstdc++-v3/include/complex:149:48: error: converting 
to ‘_Float32’ from ‘_Float64’ with greater conversion rank
.../x86_64-pc-linux-gnu/libstdc++-v3/include/complex:149:27: error: converting 
to ‘_Float32’ from ‘_Float128’ with greater conversion rank
.../x86_64-pc-linux-gnu/libstdc++-v3/include/complex:149:48: error: converting 
to ‘_Float32’ from ‘_Float128’ with greater conversion rank
.../x86_64-pc-linux-gnu/libstdc++-v3/include/complex:149:27: error: converting 
to ‘_Float64’ from ‘long double’ with greater conversion rank
.../x86_64-pc-linux-gnu/libstdc++-v3/include/complex:149:48: error: converting 
to ‘_Float64’ from ‘long double’ with greater conversion rank
.../x86_64-pc-linux-gnu/libstdc++-v3/include/complex:149:27: error: converting 
to ‘_Float64’ from ‘_Float128’ with greater conversion rank
.../x86_64-pc-linux-gnu/libstdc++-v3/include/complex:149:48: error: converting 
to ‘_Float64’ from ‘_Float128’ with greater conversion rank
.../x86_64-pc-linux-gnu/libstdc++-v3/include/complex:149:27: error: converting 
to ‘_Float16’ from ‘float’ with greater conversion rank
.../x86_64-pc-linux-gnu/libstdc++-v3/include/complex:149:48: error: converting 
to ‘_Float16’ from ‘float’ with greater conversion rank
.../x86_64-pc-linux-gnu/libstdc++-v3/include/complex:149:27: error: converting 
to ‘_Float16’ from ‘double’ with greater conversion rank
.../x86_64-pc-linux-gnu/libstdc++-v3/include/complex:149:48: error: converting 
to ‘_Float16’ from ‘double’ with greater conversion rank
.../x86_64-pc-linux-gnu/libstdc++-v3/include/complex:149:27: error: converting 
to ‘_Float16’ from ‘long double’ with greater conversion rank
.../x86_64-pc-linux-gnu/libstdc++-v3/include/complex:149:48: error: converting 
to ‘_Float16’ from ‘long double’ with greater conversion rank
.../x86_64-pc-linux-gnu/libstdc++-v3/include/complex:149:27: error: converting 
to ‘_Float16’ from ‘_Float32’ with greater conversion rank
.../x86_64-pc-linux-gnu/libstdc++-v3/include/complex:149:48: error: converting 
to ‘_Float16’ from ‘_Float32’ with greater conversion rank
.../x86_64-pc-linux-gnu/libstdc++-v3/include/complex:149:27: error: converting 
to ‘_Float16’ from ‘_Float64’ with greater conversion rank
.../x86_64-pc-linux-gnu/libstdc++-v3/include/complex:149:48: error: converting 
to ‘_Float16’ from ‘_Float64’ with greater conversion rank
.../x86_64-pc-linux-gnu/libstdc++-v3/include/complex:149:27: error: converting 
to ‘_Float16’ from ‘_Float128’ with greater conversion rank
.../x86_64-pc-linux-gnu/libstdc++-v3/include/complex:149:48: error: converting 
to ‘_Float16’ from ‘_Float128’ with greater conversion rank
.../x86_64-pc-linux-gnu/libstdc++-v3/include/complex:149:27: error: converting 
to ‘_Float16’ from ‘__bf16’ with unordered conversion ranks
.../x86_64-pc-linux-gnu/libstdc++-v3/include/complex:149:48: error: converting 
to ‘_Float16’ from ‘__bf16’ with unordered conversion ranks
.../x86_64-pc-linux-gnu/libstdc++-v3/include/complex:149:27: error: converting 
to ‘__bf16’ from ‘float’ with greater conversion rank
.../x86_64-pc-linux-gnu/libstdc++-v3/include/complex:149:48: error: converting 
to ‘__bf16’ from ‘float’ 

Re: Ping^3 [PATCH V2] Add attribute hot judgement for INLINE_HINT_known_hot hint.

2022-10-31 Thread Jeff Law via Gcc-patches



On 10/30/22 19:44, Cui, Lili wrote:

On 10/20/22 19:52, Cui, Lili via Gcc-patches wrote:

Hi Honza,

Gentle ping
https://gcc.gnu.org/pipermail/gcc-patches/2022-September/601934.html

gcc/ChangeLog

* ipa-inline-analysis.cc (do_estimate_edge_time): Add function attribute
judgement for INLINE_HINT_known_hot hint.

gcc/testsuite/ChangeLog:

* gcc.dg/ipa/inlinehint-6.c: New test.
---
   gcc/ipa-inline-analysis.cc  | 13 ---
   gcc/testsuite/gcc.dg/ipa/inlinehint-6.c | 47

+

   2 files changed, 56 insertions(+), 4 deletions(-)
   create mode 100644 gcc/testsuite/gcc.dg/ipa/inlinehint-6.c

diff --git a/gcc/ipa-inline-analysis.cc b/gcc/ipa-inline-analysis.cc
index 1ca685d1b0e..7bd29c36590 100644
--- a/gcc/ipa-inline-analysis.cc
+++ b/gcc/ipa-inline-analysis.cc
@@ -48,6 +48,7 @@ along with GCC; see the file COPYING3.  If not see
   #include "ipa-utils.h"
   #include "cfgexpand.h"
   #include "gimplify.h"
+#include "attribs.h"

   /* Cached node/edge growths.  */
   fast_call_summary
*edge_growth_cache = NULL; @@ -249,15 +250,19 @@

do_estimate_edge_time (struct cgraph_edge *edge, sreal
*ret_nonspec_time)

 hints = estimates.hints;
   }

-  /* When we have profile feedback, we can quite safely identify hot
- edges and for those we disable size limits.  Don't do that when
- probability that caller will call the callee is low however, since it
+  /* When we have profile feedback or function attribute, we can quite

safely

+ identify hot edges and for those we disable size limits.  Don't do that
+ when probability that caller will call the callee is low
+ however, since it
may hurt optimization of the caller's hot path.  */
-  if (edge->count.ipa ().initialized_p () && edge->maybe_hot_p ()
+  if ((edge->count.ipa ().initialized_p () && edge->maybe_hot_p ()
 && (edge->count.ipa () * 2
  > (edge->caller->inlined_to
 ? edge->caller->inlined_to->count.ipa ()
 : edge->caller->count.ipa (
+  || (lookup_attribute ("hot", DECL_ATTRIBUTES (edge->caller->decl))
+ != NULL
+&& lookup_attribute ("hot", DECL_ATTRIBUTES (edge->callee->decl))
+ != NULL))
   hints |= INLINE_HINT_known_hot;

Is the theory here that if the user has marked the caller and callee as hot,
then we're going to assume an edge between them is hot too?  That's not
necessarily true, it could be they're both hot, but via other call chains.  But 
it's
probably a reasonable heuristic in practice.


Yes,  thanks Jeff.
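
For the record, a minimal sketch of the pattern being described, with both
caller and callee carrying the attribute (hypothetical functions):

__attribute__((hot)) static int
callee (int x)
{
  return x * x + 1;
}

__attribute__((hot)) int
caller (int x)
{
  /* With the patch, the caller -> callee edge gets INLINE_HINT_known_hot
     even without profile feedback, because both decls carry the attribute.  */
  return callee (x) + callee (x + 1);
}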


Thanks for the confirmation.  This is OK for the trunk.

jeff




Re: [PATCH 1/2]middle-end Fold BIT_FIELD_REF and Shifts into BIT_FIELD_REFs alone

2022-10-31 Thread Jeff Law via Gcc-patches



On 10/31/22 05:51, Tamar Christina via Gcc-patches wrote:

Hi All,

Here's a respin addressing review comments.

Bootstrapped Regtested on aarch64-none-linux-gnu, x86_64-pc-linux-gnu
and no issues.

Ok for master?

Thanks,
Tamar

gcc/ChangeLog:

* match.pd: Add bitfield and shift folding.

gcc/testsuite/ChangeLog:

* gcc.dg/bitshift_1.c: New.
* gcc.dg/bitshift_2.c: New.


OK

jeff




Re: [PATCH]middle-end Add optimized float addsub without needing VEC_PERM_EXPR.

2022-10-31 Thread Jeff Law via Gcc-patches



On 10/31/22 05:38, Tamar Christina via Gcc-patches wrote:

Hi All,

This is a respin with all feedback addressed.

Bootstrapped Regtested on aarch64-none-linux-gnu and no issues.

Ok for master?

Thanks,
Tamar

gcc/ChangeLog:

* match.pd: Add fneg/fadd rule.

gcc/testsuite/ChangeLog:

* gcc.target/aarch64/simd/addsub_1.c: New test.
* gcc.target/aarch64/sve/addsub_1.c: New test.


It's a pretty neat optimization.  I'd been watching it progress. Glad to 
see you were able to address the existing feedback before stage1 closed.



OK for the trunk.


Jeff




[PATCH v7 34/34] Add -mpure-code support to the CM0 functions.

2022-10-31 Thread Daniel Engel
gcc/libgcc/ChangeLog:
2022-10-09 Daniel Engel 

Makefile.in (MPURE_CODE): New macro defines __PURE_CODE__.
(gcc_compile): Appended MPURE_CODE.
lib1funcs.S (FUNC_START_SECTION): Set flags for __PURE_CODE__.
clz2.S (__clzsi2): Added -mpure-code compatible instructions.
ctz2.S (__ctzsi2): Same.
popcnt.S (__popcountsi2, __popcountdi2): Same.
---
 libgcc/Makefile.in|  5 -
 libgcc/config/arm/clz2.S  | 25 ++-
 libgcc/config/arm/ctz2.S  | 38 +--
 libgcc/config/arm/lib1funcs.S |  7 ++-
 libgcc/config/arm/popcnt.S| 33 +-
 5 files changed, 98 insertions(+), 10 deletions(-)

diff --git a/libgcc/Makefile.in b/libgcc/Makefile.in
index 1fe708a93f7..da2da7046cc 100644
--- a/libgcc/Makefile.in
+++ b/libgcc/Makefile.in
@@ -307,6 +307,9 @@ CRTSTUFF_CFLAGS = -O2 $(GCC_CFLAGS) $(INCLUDES) 
$(MULTILIB_CFLAGS) -g0 \
 # Extra flags to use when compiling crt{begin,end}.o.
 CRTSTUFF_T_CFLAGS =
 
+# Pass the -mpure-code flag into assembly for conditional compilation.
+MPURE_CODE = $(if $(findstring -mpure-code,$(CFLAGS)), -D__PURE_CODE__)
+
 MULTIDIR := $(shell $(CC) $(CFLAGS) -print-multi-directory)
 MULTIOSDIR := $(shell $(CC) $(CFLAGS) -print-multi-os-directory)
 
@@ -316,7 +319,7 @@ inst_slibdir = $(slibdir)$(MULTIOSSUBDIR)
 
 gcc_compile_bare = $(CC) $(INTERNAL_CFLAGS) $(CFLAGS-$(http://www.gnu.org/licenses/>.  */
 
 
+#if defined(L_popcountdi2) || defined(L_popcountsi2)
+
+.macro ldmask reg, temp, value
+#if defined(__PURE_CODE__) && (__PURE_CODE__)
+  #ifdef NOT_ISA_TARGET_32BIT
+movs\reg,   \value
+lsls\temp,  \reg,   #8
+orrs\reg,   \temp
+lsls\temp,  \reg,   #16
+orrs\reg,   \temp
+  #else
+// Assumption: __PURE_CODE__ only support M-profile.
+movw\reg,   ((\value) * 0x101)
+movt\reg,   ((\value) * 0x101)
+  #endif
+#else
+ldr \reg,   =((\value) * 0x1010101)
+#endif
+.endm
+
+#endif
+
+
 #ifdef L_popcountdi2
 
 // int __popcountdi2(int)
@@ -49,7 +72,7 @@ FUNC_START_SECTION popcountdi2 .text.sorted.libgcc.popcountdi2
 
   #else /* !__OPTIMIZE_SIZE__ */
 // Load the one-bit alternating mask.
-ldr r3, =0x55555555
+ldmask  r3, r2, 0x55
 
 // Reduce the second word.
 lsrsr2, r1, #1
@@ -62,7 +85,7 @@ FUNC_START_SECTION popcountdi2 .text.sorted.libgcc.popcountdi2
 subsr0, r2
 
 // Load the two-bit alternating mask.
-ldr r3, =0x33333333
+ldmask  r3, r2, 0x33
 
 // Reduce the second word.
 lsrsr2, r1, #2
@@ -140,7 +163,7 @@ FUNC_ENTRY popcountsi2
   #else /* !__OPTIMIZE_SIZE__ */
 
 // Load the one-bit alternating mask.
-ldr r3, =0x55555555
+ldmask  r3, r2, 0x55
 
 // Reduce the word.
 lsrsr1, r0, #1
@@ -148,7 +171,7 @@ FUNC_ENTRY popcountsi2
 subsr0, r1
 
 // Load the two-bit alternating mask.
-ldr r3, =0x33333333
+ldmask  r3, r2, 0x33
 
 // Reduce the word.
 lsrsr1, r0, #2
@@ -158,7 +181,7 @@ FUNC_ENTRY popcountsi2
 addsr0, r1
 
 // Load the four-bit alternating mask.
-ldr r3, =0x0F0F0F0F
+ldmask  r3, r2, 0x0F
 
 // Reduce the word.
 lsrsr1, r0, #4
-- 
2.34.1
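
For reference, the ldmask macro above materializes a byte pattern
replicated across all four bytes of a 32-bit word without using the
literal pool; the equivalent C arithmetic is roughly:

#include <stdint.h>

static uint32_t
replicate_byte (uint32_t value)   /* e.g. 0x55, 0x33 or 0x0F */
{
  uint32_t r = value;
  r |= r << 8;    /* 0x55   -> 0x5555      */
  r |= r << 16;   /* 0x5555 -> 0x55555555  */
  return r;       /* same result as value * 0x01010101 */
}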



[PATCH v7 32/34] Import float<->__fp16 conversion from the CM0 library

2022-10-31 Thread Daniel Engel
gcc/libgcc/ChangeLog:
2022-10-09 Daniel Engel 

* config/arm/eabi/fcast.S (__aeabi_h2f, __aeabi_f2h): Added functions.
* config/arm/fp16 (__gnu_f2h_ieee, __gnu_h2f_ieee, 
__gnu_f2h_alternative,
__gnu_h2f_alternative): Disable build for v6m multilibs.
* config/arm/t-bpabi (LIB1ASMFUNCS): Added _aeabi_f2h_ieee,
_aeabi_h2f_ieee, _aeabi_f2h_alt, and _aeabi_h2f_alt (v6m only).
---
 libgcc/config/arm/eabi/fcast.S | 277 +
 libgcc/config/arm/fp16.c   |   4 +
 libgcc/config/arm/t-bpabi  |   7 +
 3 files changed, 288 insertions(+)

diff --git a/libgcc/config/arm/eabi/fcast.S b/libgcc/config/arm/eabi/fcast.S
index f0d1373d31a..09876a95767 100644
--- a/libgcc/config/arm/eabi/fcast.S
+++ b/libgcc/config/arm/eabi/fcast.S
@@ -254,3 +254,280 @@ FUNC_END D2F_NAME
 
 #endif /* L_arm_d2f || L_arm_truncdfsf2 */
 
+
+#if defined(L_aeabi_h2f_ieee) || defined(L_aeabi_h2f_alt)
+
+#ifdef L_aeabi_h2f_ieee
+  #define H2F_NAME aeabi_h2f
+  #define H2F_ALIAS gnu_h2f_ieee
+#else
+  #define H2F_NAME aeabi_h2f_alt
+  #define H2F_ALIAS gnu_h2f_alternative
+#endif
+
+// float __aeabi_h2f(short hf)
+// float __aeabi_h2f_alt(short hf)
+// Converts a half-precision float in $r0 to single-precision.
+// Rounding, overflow, and underflow conditions are impossible.
+// In IEEE mode, INF, ZERO, and NAN are returned unmodified.
+FUNC_START_SECTION H2F_NAME .text.sorted.libgcc.h2f
+FUNC_ALIAS H2F_ALIAS H2F_NAME
+CFI_START_FUNCTION
+
+// Set up registers for __fp_normalize2().
+push{ rT, lr }
+.cfi_remember_state
+.cfi_adjust_cfa_offset 8
+.cfi_rel_offset rT, 0
+.cfi_rel_offset lr, 4
+
+// Save the mantissa and exponent.
+lslsr2, r0, #17
+
+// Isolate the sign.
+lsrsr0, #15
+lslsr0, #31
+
+// Align the exponent at bit[24] for normalization.
+// If zero, return the original sign.
+lsrsr2, #3
+
+  #ifdef __HAVE_FEATURE_IT
+do_it   eq
+RETc(eq)
+  #else
+beq LLSYM(__h2f_return)
+  #endif
+
+// Split the exponent and mantissa into separate registers.
+// This is the most efficient way to convert subnormals in the
+//  half-precision form into normals in single-precision.
+// This does add a leading implicit '1' to INF and NAN,
+//  but that will be absorbed when the value is re-assembled.
+bl  SYM(__fp_normalize2) __PLT__
+
+   #ifdef L_aeabi_h2f_ieee
+// Set up the exponent bias.  For INF/NAN values, the bias is 223,
+//  where the last '1' accounts for the implicit '1' in the mantissa.
+addsr2, #(255 - 31 - 1)
+
+// Test for INF/NAN.
+cmp r2, #254
+
+  #ifdef __HAVE_FEATURE_IT
+do_it   ne
+  #else
+beq LLSYM(__h2f_assemble)
+  #endif
+
+// For normal values, the bias should have been 111.
+// However, this offset must be adjusted per the INF check above.
+ IT(sub,ne) r2, #((255 - 31 - 1) - (127 - 15 - 1))
+
+#else /* L_aeabi_h2f_alt */
+// Set up the exponent bias.  All values are normal.
+addsr2, #(127 - 15 - 1)
+#endif
+
+LLSYM(__h2f_assemble):
+// Combine exponent and sign.
+lslsr2, #23
+addsr0, r2
+
+// Combine mantissa.
+lsrsr3, #8
+add r0, r3
+
+LLSYM(__h2f_return):
+pop { rT, pc }
+.cfi_restore_state
+
+CFI_END_FUNCTION
+FUNC_END H2F_NAME
+FUNC_END H2F_ALIAS
+
+#endif /* L_aeabi_h2f_ieee || L_aeabi_h2f_alt */
+
+
+#if defined(L_aeabi_f2h_ieee) || defined(L_aeabi_f2h_alt)
+
+#ifdef L_aeabi_f2h_ieee
+  #define F2H_NAME aeabi_f2h
+  #define F2H_ALIAS gnu_f2h_ieee
+#else
+  #define F2H_NAME aeabi_f2h_alt
+  #define F2H_ALIAS gnu_f2h_alternative
+#endif
+
+// short __aeabi_f2h(float f)
+// short __aeabi_f2h_alt(float f)
+// Converts a single-precision float in $r0 to half-precision,
+//  rounding to nearest, ties to even.
+// Values out of range are forced to either ZERO or INF.
+// In IEEE mode, the upper 12 bits of a NAN will be preserved.
+FUNC_START_SECTION F2H_NAME .text.sorted.libgcc.f2h
+FUNC_ALIAS F2H_ALIAS F2H_NAME
+CFI_START_FUNCTION
+
+// Set up the sign.
+lsrsr2, r0, #31
+lslsr2, #15
+
+// Save the exponent and mantissa.
+// If ZERO, return the original sign.
+lslsr0, #1
+
+  #ifdef __HAVE_FEATURE_IT
+do_it   ne,t
+addne   r0, r2
+RETc(ne)
+  #else
+beq LLSYM(__f2h_return)
+  #endif
+
+// Isolate the exponent.
+lsrsr1, r0, #24
+
+  #ifdef L_aeabi_f2h_ieee
+// Check for NAN.
+cmp r1, #255
+beq LLSYM(__f2h_indefinite)
+
+// 
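
Functionally, the IEEE __aeabi_h2f path above corresponds to roughly the
following C model (a sketch only: the assembly works on raw registers,
shares __fp_normalize2 for the subnormal case, and folds the implicit
leading 1 into its bias constants):

#include <stdint.h>

static uint32_t
h2f_bits (uint16_t h)
{
  uint32_t sign = (uint32_t) (h & 0x8000u) << 16;
  uint32_t exp  = (h >> 10) & 0x1fu;
  uint32_t mant = h & 0x3ffu;

  if (exp == 0)
    {
      if (mant == 0)
        return sign;                    /* signed zero */
      /* Subnormal half: renormalize into a normal float.  */
      int shift = 0;
      while (!(mant & 0x400u))
        {
          mant <<= 1;
          shift++;
        }
      mant &= 0x3ffu;                   /* drop the implicit leading 1 */
      return sign | ((uint32_t) (127 - 15 + 1 - shift) << 23) | (mant << 13);
    }

  if (exp == 31)                        /* INF or NAN */
    return sign | (0xffu << 23) | (mant << 13);

  return sign | ((exp + (127u - 15u)) << 23) | (mant << 13);
}

The function returns the single-precision bit pattern; the library routine
naturally returns it directly in $r0.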

[PATCH v7 30/34] Import float-to-integer conversion from the CM0 library

2022-10-31 Thread Daniel Engel
gcc/libgcc/ChangeLog:
2022-10-09 Daniel Engel 

* config/arm/bpabi-lib.h (muldi3): Removed duplicate.
(fixunssfsi) Removed obsolete RENAME_LIBRARY directive.
* config/arm/eabi/ffixed.S (__aeabi_f2iz, __aeabi_f2uiz,
__aeabi_f2lz, __aeabi_f2ulz): New file.
* config/arm/lib1funcs.S: #include eabi/ffixed.S (v6m only).
* config/arm/t-elf (LIB1ASMFUNCS): Added _internal_fixsfdi,
_internal_fixsfsi, _arm_fixsfdi, and _arm_fixunssfdi.
---
 libgcc/config/arm/bpabi-lib.h   |   6 -
 libgcc/config/arm/eabi/ffixed.S | 414 
 libgcc/config/arm/lib1funcs.S   |   1 +
 libgcc/config/arm/t-elf |   4 +
 4 files changed, 419 insertions(+), 6 deletions(-)
 create mode 100644 libgcc/config/arm/eabi/ffixed.S

diff --git a/libgcc/config/arm/bpabi-lib.h b/libgcc/config/arm/bpabi-lib.h
index 7dd78d5668f..6425c1bad2a 100644
--- a/libgcc/config/arm/bpabi-lib.h
+++ b/libgcc/config/arm/bpabi-lib.h
@@ -32,9 +32,6 @@
 #ifdef L_muldi3
 #define DECLARE_LIBRARY_RENAMES RENAME_LIBRARY (muldi3, lmul)
 #endif
-#ifdef L_muldi3
-#define DECLARE_LIBRARY_RENAMES RENAME_LIBRARY (muldi3, lmul)
-#endif
 #ifdef L_fixdfdi
 #define DECLARE_LIBRARY_RENAMES RENAME_LIBRARY (fixdfdi, d2lz) \
   extern DWtype __fixdfdi (DFtype) __attribute__((pcs("aapcs"))); \
@@ -62,9 +59,6 @@
 #ifdef L_fixunsdfsi
 #define DECLARE_LIBRARY_RENAMES RENAME_LIBRARY (fixunsdfsi, d2uiz)
 #endif
-#ifdef L_fixunssfsi
-#define DECLARE_LIBRARY_RENAMES RENAME_LIBRARY (fixunssfsi, f2uiz)
-#endif
 #ifdef L_floatundidf
 #define DECLARE_LIBRARY_RENAMES RENAME_LIBRARY (floatundidf, ul2d)
 #endif
diff --git a/libgcc/config/arm/eabi/ffixed.S b/libgcc/config/arm/eabi/ffixed.S
new file mode 100644
index 000..61c8a0fe1fd
--- /dev/null
+++ b/libgcc/config/arm/eabi/ffixed.S
@@ -0,0 +1,414 @@
+/* ffixed.S: Thumb-1 optimized float-to-integer conversion
+
+   Copyright (C) 2018-2022 Free Software Foundation, Inc.
+   Contributed by Daniel Engel, Senva Inc (g...@danielengel.com)
+
+   This file is free software; you can redistribute it and/or modify it
+   under the terms of the GNU General Public License as published by the
+   Free Software Foundation; either version 3, or (at your option) any
+   later version.
+
+   This file is distributed in the hope that it will be useful, but
+   WITHOUT ANY WARRANTY; without even the implied warranty of
+   MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU
+   General Public License for more details.
+
+   Under Section 7 of GPL version 3, you are granted additional
+   permissions described in the GCC Runtime Library Exception, version
+   3.1, as published by the Free Software Foundation.
+
+   You should have received a copy of the GNU General Public License and
+   a copy of the GCC Runtime Library Exception along with this program;
+   see the files COPYING3 and COPYING.RUNTIME respectively.  If not, see
+   <http://www.gnu.org/licenses/>.  */
+
+
+// The implementation of __aeabi_f2uiz() expects to tail call __internal_f2iz()
+//  with the flags register set for unsigned conversion.  The __internal_f2iz()
+//  symbol itself is unambiguous, but there is a remote risk that the linker
+//  will prefer some other symbol in place of __aeabi_f2iz().  Importing an
+//  archive file that exports __aeabi_f2iz() will throw an error in this case.
+// As a workaround, this block configures __aeabi_f2iz() for compilation twice.
+// The first version configures __internal_f2iz() as a WEAK standalone symbol,
+//  and the second exports __aeabi_f2iz() and __internal_f2iz() normally.
+// A small bonus: programs only using __aeabi_f2uiz() will be slightly smaller.
+// '_internal_fixsfsi' should appear before '_arm_fixsfsi' in LIB1ASMFUNCS.
+#if defined(L_arm_fixsfsi) || \
+   (defined(L_internal_fixsfsi) && \
+  !(defined(__OPTIMIZE_SIZE__) && __OPTIMIZE_SIZE__))
+
+// Subsection ordering within fpcore keeps conditional branches within range.
+#define F2IZ_SECTION .text.sorted.libgcc.fpcore.r.fixsfsi
+
+// int __aeabi_f2iz(float)
+// Converts a float in $r0 to signed integer, rounding toward 0.
+// Values out of range are forced to either INT_MAX or INT_MIN.
+// NAN becomes zero.
+#ifdef L_arm_fixsfsi
+FUNC_START_SECTION aeabi_f2iz F2IZ_SECTION
+FUNC_ALIAS fixsfsi aeabi_f2iz
+CFI_START_FUNCTION
+#endif
+
+  #if defined(__OPTIMIZE_SIZE__) && __OPTIMIZE_SIZE__
+// Flag for unsigned conversion.
+movsr1, #33
+b   SYM(__internal_fixsfdi)
+
+  #else /* !__OPTIMIZE_SIZE__ */
+
+#ifdef L_arm_fixsfsi
+// Flag for signed conversion.
+movsr3, #1
+
+// [unsigned] int internal_f2iz(float, int)
+// Internal function expects a boolean flag in $r1.
+// If the boolean flag is 0, the result is unsigned.
+// If the boolean flag is 1, the result is signed.
+FUNC_ENTRY internal_f2iz
+
+#else /* L_internal_fixsfsi */
+WEAK_START_SECTION internal_f2iz F2IZ_SECTION
+CFI_START_FUNCTION
+

[PATCH v7 29/34] Import integer-to-float conversion from the CM0 library

2022-10-31 Thread Daniel Engel
gcc/libgcc/ChangeLog:
2022-10-09 Daniel Engel 

* config/arm/bpabi-lib.h (__floatdisf, __floatundisf):
Remove obsolete RENAME_LIBRARY directives.
* config/arm/eabi/ffloat.S (__aeabi_i2f, __aeabi_l2f, __aeabi_ui2f,
__aeabi_ul2f): New file.
* config/arm/lib1funcs.S: #include eabi/ffloat.S (v6m only).
* config/arm/t-elf (LIB1ASMFUNCS): Added _arm_floatunsisf,
_arm_floatsisf, and _internal_floatundisf.
Moved _arm_floatundisf to the weak function group
---
 libgcc/config/arm/bpabi-lib.h   |   6 -
 libgcc/config/arm/eabi/ffloat.S | 247 
 libgcc/config/arm/lib1funcs.S   |   1 +
 libgcc/config/arm/t-elf |   5 +-
 4 files changed, 252 insertions(+), 7 deletions(-)
 create mode 100644 libgcc/config/arm/eabi/ffloat.S

diff --git a/libgcc/config/arm/bpabi-lib.h b/libgcc/config/arm/bpabi-lib.h
index 26ad5ffbe8b..7dd78d5668f 100644
--- a/libgcc/config/arm/bpabi-lib.h
+++ b/libgcc/config/arm/bpabi-lib.h
@@ -56,9 +56,6 @@
 #ifdef L_floatdidf
 #define DECLARE_LIBRARY_RENAMES RENAME_LIBRARY (floatdidf, l2d)
 #endif
-#ifdef L_floatdisf
-#define DECLARE_LIBRARY_RENAMES RENAME_LIBRARY (floatdisf, l2f)
-#endif
 
 /* These renames are needed on ARMv6M.  Other targets get them from
assembly routines.  */
@@ -71,9 +68,6 @@
 #ifdef L_floatundidf
 #define DECLARE_LIBRARY_RENAMES RENAME_LIBRARY (floatundidf, ul2d)
 #endif
-#ifdef L_floatundisf
-#define DECLARE_LIBRARY_RENAMES RENAME_LIBRARY (floatundisf, ul2f)
-#endif
 
 /* For ARM bpabi, we only want to use a "__gnu_" prefix for the fixed-point
helper functions - not everything in libgcc - in the interests of
diff --git a/libgcc/config/arm/eabi/ffloat.S b/libgcc/config/arm/eabi/ffloat.S
new file mode 100644
index 000..c8bc55a24b6
--- /dev/null
+++ b/libgcc/config/arm/eabi/ffloat.S
@@ -0,0 +1,247 @@
+/* ffixed.S: Thumb-1 optimized integer-to-float conversion
+
+   Copyright (C) 2018-2022 Free Software Foundation, Inc.
+   Contributed by Daniel Engel, Senva Inc (g...@danielengel.com)
+
+   This file is free software; you can redistribute it and/or modify it
+   under the terms of the GNU General Public License as published by the
+   Free Software Foundation; either version 3, or (at your option) any
+   later version.
+
+   This file is distributed in the hope that it will be useful, but
+   WITHOUT ANY WARRANTY; without even the implied warranty of
+   MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU
+   General Public License for more details.
+
+   Under Section 7 of GPL version 3, you are granted additional
+   permissions described in the GCC Runtime Library Exception, version
+   3.1, as published by the Free Software Foundation.
+
+   You should have received a copy of the GNU General Public License and
+   a copy of the GCC Runtime Library Exception along with this program;
+   see the files COPYING3 and COPYING.RUNTIME respectively.  If not, see
+   .  */
+
+
+#ifdef L_arm_floatsisf
+
+// float __aeabi_i2f(int)
+// Converts a signed integer in $r0 to float.
+
+// On little-endian cores (including all Cortex-M), __floatsisf() can be
+//  implemented as below in 5 instructions.  However, it can also be
+//  implemented by prefixing a single instruction to __floatdisf().
+// A memory savings of 4 instructions at a cost of only 2 execution cycles
+//  seems reasonable enough.  Plus, the trade-off only happens in programs
+//  that require both __floatsisf() and __floatdisf().  Programs only using
+//  __floatsisf() always get the smallest version.
+// When the combined version will be provided, this standalone version
+//  must be declared WEAK, so that the combined version can supersede it.
+// '_arm_floatsisf' should appear before '_arm_floatdisf' in LIB1ASMFUNCS.
+// Same parent section as __ul2f() to keep tail call branch within range.
+#if defined(__OPTIMIZE_SIZE__) && __OPTIMIZE_SIZE__
+WEAK_START_SECTION aeabi_i2f .text.sorted.libgcc.fpcore.p.floatsisf
+WEAK_ALIAS floatsisf aeabi_i2f
+CFI_START_FUNCTION
+
+#else /* !__OPTIMIZE_SIZE__ */
+FUNC_START_SECTION aeabi_i2f .text.sorted.libgcc.fpcore.p.floatsisf
+FUNC_ALIAS floatsisf aeabi_i2f
+CFI_START_FUNCTION
+
+#endif /* !__OPTIMIZE_SIZE__ */
+
+// Save the sign.
+asrsr3, r0, #31
+
+// Absolute value of the input.
+eorsr0, r3
+subsr0, r3
+
+// Sign extension to long long unsigned.
+eorsr1, r1
+b   SYM(__internal_floatundisf_noswap)
+
+CFI_END_FUNCTION
+FUNC_END floatsisf
+FUNC_END aeabi_i2f
+
+#endif /* L_arm_floatsisf */
+
+
+#ifdef L_arm_floatdisf
+
+// float __aeabi_l2f(long long)
+// Converts a signed 64-bit integer in $r1:$r0 to a float in $r0.
+// See build comments for __floatsisf() above.
+// Same parent section as __ul2f() to keep tail call branch within range.
+#if defined(__OPTIMIZE_SIZE__) && __OPTIMIZE_SIZE__
+FUNC_START_SECTION 

[PATCH v7 31/34] Import float<->double conversion from the CM0 library

2022-10-31 Thread Daniel Engel
gcc/libgcc/ChangeLog:
2022-10-09 Daniel Engel 

* config/arm/eabi/fcast.S (__aeabi_d2f, __aeabi_f2d): New file.
* config/arm/lib1funcs.S: #include eabi/fcast.S (v6m only).
* config/arm/t-elf (LIB1ASMFUNCS): Added _arm_d2f and _arm_f2d.
---
 libgcc/config/arm/eabi/fcast.S | 256 +
 libgcc/config/arm/lib1funcs.S  |   1 +
 libgcc/config/arm/t-elf|   2 +
 3 files changed, 259 insertions(+)
 create mode 100644 libgcc/config/arm/eabi/fcast.S

diff --git a/libgcc/config/arm/eabi/fcast.S b/libgcc/config/arm/eabi/fcast.S
new file mode 100644
index 000..f0d1373d31a
--- /dev/null
+++ b/libgcc/config/arm/eabi/fcast.S
@@ -0,0 +1,256 @@
+/* fcast.S: Thumb-1 optimized 32- and 64-bit float conversions
+
+   Copyright (C) 2018-2022 Free Software Foundation, Inc.
+   Contributed by Daniel Engel, Senva Inc (g...@danielengel.com)
+
+   This file is free software; you can redistribute it and/or modify it
+   under the terms of the GNU General Public License as published by the
+   Free Software Foundation; either version 3, or (at your option) any
+   later version.
+
+   This file is distributed in the hope that it will be useful, but
+   WITHOUT ANY WARRANTY; without even the implied warranty of
+   MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU
+   General Public License for more details.
+
+   Under Section 7 of GPL version 3, you are granted additional
+   permissions described in the GCC Runtime Library Exception, version
+   3.1, as published by the Free Software Foundation.
+
+   You should have received a copy of the GNU General Public License and
+   a copy of the GCC Runtime Library Exception along with this program;
+   see the files COPYING3 and COPYING.RUNTIME respectively.  If not, see
+   <http://www.gnu.org/licenses/>.  */
+
+
+#ifdef L_arm_f2d
+
+// double __aeabi_f2d(float)
+// Converts a single-precision float in $r0 to double-precision in $r1:$r0.
+// Rounding, overflow, and underflow are impossible.
+// INF and ZERO are returned unmodified.
+FUNC_START_SECTION aeabi_f2d .text.sorted.libgcc.fpcore.v.f2d
+FUNC_ALIAS extendsfdf2 aeabi_f2d
+CFI_START_FUNCTION
+
+// Save the sign.
+lsrs    r1, r0, #31
+lsls    r1, #31
+
+// Set up registers for __fp_normalize2().
+push    { rT, lr }
+.cfi_remember_state
+.cfi_adjust_cfa_offset 8
+.cfi_rel_offset rT, 0
+.cfi_rel_offset lr, 4
+
+// Test for zero.
+lsls    r0, #1
+beq LLSYM(__f2d_return)
+
+// Split the exponent and mantissa into separate registers.
+// This is the most efficient way to convert subnormals in the
+//  half-precision form into normals in single-precision.
+// This does add a leading implicit '1' to INF and NAN,
+//  but that will be absorbed when the value is re-assembled.
+movs    r2, r0
+bl  SYM(__fp_normalize2) __PLT__
+
+// Set up the exponent bias.  For INF/NAN values, the bias
+//  is 1791 (2047 - 255 - 1), where the last '1' accounts
+//  for the implicit '1' in the mantissa.
+movs    r0, #3
+lsls    r0, #9
+adds    r0, #255
+
+// Test for INF/NAN, promote exponent if necessary
+cmp r2, #255
+beq LLSYM(__f2d_indefinite)
+
+// For normal values, the exponent bias is 895 (1023 - 127 - 1),
+//  which is half of the prepared INF/NAN bias.
+lsrs    r0, #1
+
+LLSYM(__f2d_indefinite):
+// Assemble exponent with bias correction.
+adds    r2, r0
+lsls    r2, #20
+adds    r1, r2
+
+// Assemble the high word of the mantissa.
+lsrs    r0, r3, #11
+add r1, r0
+
+// Remainder of the mantissa in the low word of the result.
+lsls    r0, r3, #21
+
+LLSYM(__f2d_return):
+pop { rT, pc }
+.cfi_restore_state
+
+CFI_END_FUNCTION
+FUNC_END extendsfdf2
+FUNC_END aeabi_f2d
+
+#endif /* L_arm_f2d */
+
+
+#if defined(L_arm_d2f) || defined(L_arm_truncdfsf2)
+
+// HACK: Build two separate implementations:
+//  * __aeabi_d2f() rounds to nearest per traditional IEEE-754 rules.
+//  * __truncdfsf2() rounds towards zero per GCC specification.
+// Presumably, a program will consistently use one ABI or the other,
+//  which means that code size will not be duplicated in practice.
+// Merging two versions with dynamic rounding would be rather hard.
+#ifdef L_arm_truncdfsf2
+  #define D2F_NAME truncdfsf2
+  #define D2F_SECTION .text.sorted.libgcc.fpcore.x.truncdfsf2
+#else
+  #define D2F_NAME aeabi_d2f
+  #define D2F_SECTION .text.sorted.libgcc.fpcore.w.d2f
+#endif
+
+// float __aeabi_d2f(double)
+// Converts a double-precision float in $r1:$r0 to single-precision in $r0.
+// Values out of range become ZERO or 
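
For reference, the exponent re-biasing in __aeabi_f2d() above can be modeled in C
roughly as follows.  This is an illustrative sketch for normal values only, not the
library code; subnormals, INF and NAN take the extra paths shown in the assembly:

    #include <stdint.h>
    #include <string.h>

    /* Sketch: widen a normal single-precision value to double precision.
       The stored exponent is re-biased by 1023 - 127 = 896; the assembly
       uses 895 because it re-inserts an explicit leading '1' into the
       mantissa, which carries one extra unit into the exponent field. */
    static double f2d_model(float f)
    {
        uint32_t u;
        memcpy(&u, &f, sizeof (u));

        uint64_t sign = (uint64_t)(u >> 31) << 63;
        uint64_t exp  = (u >> 23) & 0xFF;          /* biased single exponent */
        uint64_t mant = u & 0x7FFFFF;              /* 23 fraction bits */

        uint64_t d = sign | ((exp + 896) << 52) | (mant << 29);

        double r;
        memcpy(&r, &d, sizeof (r));
        return r;
    }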

[PATCH v7 28/34] Import float division from the CM0 library

2022-10-31 Thread Daniel Engel
gcc/libgcc/ChangeLog:
2022-10-09 Daniel Engel 

* config/arm/eabi/fdiv.S (__divsf3, __fp_divloopf): New file.
* config/arm/lib1funcs.S: #include eabi/fdiv.S (v6m only).
* config/arm/t-elf (LIB1ASMFUNCS): Added _divsf3 and _fp_divloopf.
---
 libgcc/config/arm/eabi/fdiv.S | 261 ++
 libgcc/config/arm/lib1funcs.S |   1 +
 libgcc/config/arm/t-elf   |   2 +
 3 files changed, 264 insertions(+)
 create mode 100644 libgcc/config/arm/eabi/fdiv.S

diff --git a/libgcc/config/arm/eabi/fdiv.S b/libgcc/config/arm/eabi/fdiv.S
new file mode 100644
index 000..a6d73892b6d
--- /dev/null
+++ b/libgcc/config/arm/eabi/fdiv.S
@@ -0,0 +1,261 @@
+/* fdiv.S: Thumb-1 optimized 32-bit float division
+
+   Copyright (C) 2018-2022 Free Software Foundation, Inc.
+   Contributed by Daniel Engel, Senva Inc (g...@danielengel.com)
+
+   This file is free software; you can redistribute it and/or modify it
+   under the terms of the GNU General Public License as published by the
+   Free Software Foundation; either version 3, or (at your option) any
+   later version.
+
+   This file is distributed in the hope that it will be useful, but
+   WITHOUT ANY WARRANTY; without even the implied warranty of
+   MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU
+   General Public License for more details.
+
+   Under Section 7 of GPL version 3, you are granted additional
+   permissions described in the GCC Runtime Library Exception, version
+   3.1, as published by the Free Software Foundation.
+
+   You should have received a copy of the GNU General Public License and
+   a copy of the GCC Runtime Library Exception along with this program;
+   see the files COPYING3 and COPYING.RUNTIME respectively.  If not, see
+   .  */
+
+
+#ifdef L_arm_divsf3
+
+// float __aeabi_fdiv(float, float)
+// Returns $r0 after division by $r1.
+// Subsection ordering within fpcore keeps conditional branches within range.
+FUNC_START_SECTION aeabi_fdiv .text.sorted.libgcc.fpcore.n.fdiv
+FUNC_ALIAS divsf3 aeabi_fdiv
+CFI_START_FUNCTION
+
+// Standard registers, compatible with exception handling.
+push{ rT, lr }
+.cfi_remember_state
+.cfi_remember_state
+.cfi_adjust_cfa_offset 8
+.cfi_rel_offset rT, 0
+.cfi_rel_offset lr, 4
+
+// Save for the sign of the result.
+movs    r3, r1
+eors    r3, r0
+lsrs    rT, r3, #31
+lsls    rT, #31
+mov ip, rT
+
+// Set up INF for comparison.
+movs    rT, #255
+lsls    rT, #24
+
+// Check for divide by 0.  Automatically catches 0/0.
+lsls    r2, r1, #1
+beq LLSYM(__fdiv_by_zero)
+
+// Check for INF/INF, or a number divided by itself.
+lsls    r3, #1
+beq LLSYM(__fdiv_equal)
+
+// Check the numerator for INF/NAN.
+eors    r3, r2
+cmp r3, rT
+bhs LLSYM(__fdiv_special1)
+
+// Check the denominator for INF/NAN.
+cmp r2, rT
+bhs LLSYM(__fdiv_special2)
+
+// Check the numerator for zero.
+cmp r3, #0
+beq SYM(__fp_zero)
+
+// No action if the numerator is subnormal.
+//  The mantissa will normalize naturally in the division loop.
+lsls    r0, #9
+lsrs    r1, r3, #24
+beq LLSYM(__fdiv_denominator)
+
+// Restore the numerator's implicit '1'.
+adds    r0, #1
+rors    r0, r0
+
+LLSYM(__fdiv_denominator):
+// The denominator must be normalized and left aligned.
+bl  SYM(__fp_normalize2)
+
+// 25 bits of precision will be sufficient.
+movs    rT, #64
+
+// Run division.
+bl  SYM(__fp_divloopf)
+b   SYM(__fp_assemble)
+
+LLSYM(__fdiv_equal):
+  #if defined(EXCEPTION_CODES) && EXCEPTION_CODES
+movs    r3, #(DIVISION_INF_BY_INF)
+  #endif
+
+// The absolute value of both operands are equal, but not 0.
+// If both operands are INF, create a new NAN.
+cmp r2, rT
+beq SYM(__fp_exception)
+
+  #if defined(TRAP_NANS) && TRAP_NANS
+// If both operands are NAN, return the NAN in $r0.
+bhi SYM(__fp_check_nan)
+  #else
+bhi LLSYM(__fdiv_return)
+  #endif
+
+// Return 1.0f, with appropriate sign.
+movs    r0, #127
+lsls    r0, #23
+add r0, ip
+
+LLSYM(__fdiv_return):
+pop { rT, pc }
+.cfi_restore_state
+
+LLSYM(__fdiv_special2):
+// The denominator is either INF or NAN, numerator is neither.
+// Also, the denominator is not equal to 0.
+// If the denominator is INF, the result goes to 

[PATCH v7 27/34] Import float multiplication from the CM0 library

2022-10-31 Thread Daniel Engel
gcc/libgcc/ChangeLog:
2022-10-09 Daniel Engel 

* config/arm/eabi/fmul.S (__mulsf3): New file.
* config/arm/lib1funcs.S: #include eabi/fmul.S (v6m only).
* config/arm/t-elf (LIB1ASMFUNCS): Moved _mulsf3 to global scope
(this object was previously blocked on v6m builds).
---
 libgcc/config/arm/eabi/fmul.S | 215 ++
 libgcc/config/arm/lib1funcs.S |   1 +
 libgcc/config/arm/t-elf   |   3 +-
 3 files changed, 218 insertions(+), 1 deletion(-)
 create mode 100644 libgcc/config/arm/eabi/fmul.S

diff --git a/libgcc/config/arm/eabi/fmul.S b/libgcc/config/arm/eabi/fmul.S
new file mode 100644
index 000..4ebd5a66f47
--- /dev/null
+++ b/libgcc/config/arm/eabi/fmul.S
@@ -0,0 +1,215 @@
+/* fmul.S: Thumb-1 optimized 32-bit float multiplication
+
+   Copyright (C) 2018-2022 Free Software Foundation, Inc.
+   Contributed by Daniel Engel, Senva Inc (g...@danielengel.com)
+
+   This file is free software; you can redistribute it and/or modify it
+   under the terms of the GNU General Public License as published by the
+   Free Software Foundation; either version 3, or (at your option) any
+   later version.
+
+   This file is distributed in the hope that it will be useful, but
+   WITHOUT ANY WARRANTY; without even the implied warranty of
+   MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU
+   General Public License for more details.
+
+   Under Section 7 of GPL version 3, you are granted additional
+   permissions described in the GCC Runtime Library Exception, version
+   3.1, as published by the Free Software Foundation.
+
+   You should have received a copy of the GNU General Public License and
+   a copy of the GCC Runtime Library Exception along with this program;
+   see the files COPYING3 and COPYING.RUNTIME respectively.  If not, see
+   .  */
+
+
+#ifdef L_arm_mulsf3
+
+// float __aeabi_fmul(float, float)
+// Returns $r0 after multiplication by $r1.
+// Subsection ordering within fpcore keeps conditional branches within range.
+FUNC_START_SECTION aeabi_fmul .text.sorted.libgcc.fpcore.m.fmul
+FUNC_ALIAS mulsf3 aeabi_fmul
+CFI_START_FUNCTION
+
+// Standard registers, compatible with exception handling.
+push{ rT, lr }
+.cfi_remember_state
+.cfi_remember_state
+.cfi_adjust_cfa_offset 8
+.cfi_rel_offset rT, 0
+.cfi_rel_offset lr, 4
+
+// Save the sign of the result.
+movs    rT, r1
+eors    rT, r0
+lsrs    rT, #31
+lsls    rT, #31
+mov ip, rT
+
+// Set up INF for comparison.
+movs    rT, #255
+lsls    rT, #24
+
+// Check for multiplication by zero.
+lsls    r2, r0, #1
+beq LLSYM(__fmul_zero1)
+
+lsls    r3, r1, #1
+beq LLSYM(__fmul_zero2)
+
+// Check for INF/NAN.
+cmp r3, rT
+bhs LLSYM(__fmul_special2)
+
+cmp r2, rT
+bhs LLSYM(__fmul_special1)
+
+// Because neither operand is INF/NAN, the result will be finite.
+// It is now safe to modify the original operand registers.
+lsls    r0, #9
+
+// Isolate the first exponent.  When normal, add back the implicit '1'.
+// The result is always aligned with the MSB in bit [31].
+// Subnormal mantissas remain effectively multiplied by 2x relative to
+//  normals, but this works because the weight of a subnormal is -126.
+lsrs    r2, #24
+beq LLSYM(__fmul_normalize2)
+adds    r0, #1
+rors    r0, r0
+
+LLSYM(__fmul_normalize2):
+// IMPORTANT: exp10i() jumps in here!
+// Repeat for the mantissa of the second operand.
+// Short-circuit when the mantissa is 1.0, as the
+//  first mantissa is already prepared in $r0
+lsls    r1, #9
+
+// When normal, add back the implicit '1'.
+lsrs    r3, #24
+beq LLSYM(__fmul_go)
+adds    r1, #1
+rors    r1, r1
+
+LLSYM(__fmul_go):
+// Calculate the final exponent, relative to bit [30].
+adds    rT, r2, r3
+subs    rT, #127
+
+  #if !defined(__OPTIMIZE_SIZE__) || !__OPTIMIZE_SIZE__
+// Short-circuit on multiplication by powers of 2.
+lsls    r3, r0, #1
+beq LLSYM(__fmul_simple1)
+
+lsls    r3, r1, #1
+beq LLSYM(__fmul_simple2)
+  #endif
+
+// Save $ip across the call.
+// (Alternatively, could push/pop a separate register, but the four
+//  instructions here are equally fast without imposing on the stack.)
+add rT, ip
+
+// 32x32 unsigned multiplication, 64 bit result.
+bl  SYM(__umulsidi3) __PLT__
+
+// 
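
As a rough C model of the multiplication core above (normal, in-range values only,
truncation instead of round-to-nearest, and no sign-of-zero/NAN/overflow/underflow
handling), the steps are: align each 24-bit significand at the top of a 32-bit word,
multiply to a 64-bit product with __umulsidi3(), and add the exponents with a single
bias subtraction.  This is a sketch, not the library code:

    #include <stdint.h>

    /* 'a' and 'b' are raw single-precision bit patterns. */
    static uint32_t fmul_model(uint32_t a, uint32_t b)
    {
        uint32_t sign = (a ^ b) & 0x80000000u;
        uint32_t ma = ((a & 0x7FFFFFu) | 0x800000u) << 8;   /* significand at bit 31 */
        uint32_t mb = ((b & 0x7FFFFFu) | 0x800000u) << 8;
        int32_t  exp = (int32_t)((a >> 23) & 0xFF)
                     + (int32_t)((b >> 23) & 0xFF) - 127;

        uint64_t prod = (uint64_t)ma * mb;      /* what __umulsidi3() computes */

        if (prod & (1ull << 63)) {              /* significand product in [2,4) */
            prod >>= 1;
            exp += 1;
        }
        /* The leading '1' is now at bit 62; keep 24 bits, truncating the rest. */
        uint32_t frac = (uint32_t)(prod >> 39) & 0x7FFFFFu;

        return sign | ((uint32_t)exp << 23) | frac;
    }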

[PATCH v7 33/34] Drop single-precision Thumb-1 soft-float functions

2022-10-31 Thread Daniel Engel
With the complete CM0 library integrated, regression testing showed new
failures with the message "compilation failed to produce executable":

gcc.dg/fixed-point/convert-float-1.c
gcc.dg/fixed-point/convert-float-3.c
gcc.dg/fixed-point/convert-sat.c

Investigating, this appears to be caused by the linker.  I can't find a
comprehensive linker specification to claim this is actually a bug, but it
certainly doesn't match my expectations.  Digging further, I found issues
with the link order of these symbols:

  * __aeabi_fmul()
  * __aeabi_f2d()
  * __aeabi_f2iz()

Specifically, I expect the linker to import the _first_ definition of any
symbol.  This is the basic behavior that allows the soft-float library to
supply missing symbols on architectures without optimized routines.

Comparing the v6-m multilib with the default, I see symbol exports for all
of the affected symbols:

gcc-obj/gcc/libgcc.a:

// assembly routines

_arm_mulsf3.o:
 W __aeabi_fmul
 W __mulsf3

_arm_addsubdf3.o:
0368 T __aeabi_f2d
0368 T __extendsfdf2

_arm_fixsfsi.o:
 T __aeabi_f2iz
 T __fixsfsi

mulsf3.o:


fixsfsi.o:


extendsfdf2.o.o:


gcc-obj/gcc/thumb/v6-m/nofp/libgcc.a:

// assembly routines

_arm_mulsf3.o:
 T __aeabi_fmul
 U __fp_assemble
 U __fp_exception
 U __fp_infinity
 U __fp_zero
 T __mulsf3
 U __umulsidi3

_arm_fixsfsi.o:
 T __aeabi_f2iz
 T __fixsfsi
0002 T __internal_f2iz

_arm_f2d.o:
 T __aeabi_f2d
 T __extendsfdf2
 U __fp_normalize2

// soft-float library

mulsf3.o:
 T __aeabi_fmul

fixsfsi.o:
 T __aeabi_f2iz

extendsfdf2.o:
 T __aeabi_f2d

Given the order of the archive file, I expect the linker to import the affected
functions from the _arm_* archive elements.

For "convert-sat.c", all is well with -march=armv7-m.
...
(/home/mirdan/gcc-obj/gcc/libgcc.a)_arm_muldf3.o
OK> (/home/mirdan/gcc-obj/gcc/libgcc.a)_arm_mulsf3.o
(/home/mirdan/gcc-obj/gcc/libgcc.a)_arm_cmpsf2.o
(/home/mirdan/gcc-obj/gcc/libgcc.a)_arm_fixsfsi.o
(/home/mirdan/gcc-obj/gcc/libgcc.a)_arm_fixunssfsi.o
OK> (/home/mirdan/gcc-obj/gcc/libgcc.a)_arm_addsubdf3.o
(/home/mirdan/gcc-obj/gcc/libgcc.a)_arm_cmpdf2.o
(/home/mirdan/gcc-obj/gcc/libgcc.a)_arm_fixdfsi.o
(/home/mirdan/gcc-obj/gcc/libgcc.a)_arm_fixunsdfsi.o
OK> (/home/mirdan/gcc-obj/gcc/libgcc.a)_fixsfdi.o
(/home/mirdan/gcc-obj/gcc/libgcc.a)_fixdfdi.o
(/home/mirdan/gcc-obj/gcc/libgcc.a)_fixunssfdi.o
(/home/mirdan/gcc-obj/gcc/libgcc.a)_fixunsdfdi.o
...

However, with -march=armv6s-m, the linker imports these symbols from the soft-
float library.  (NOTE: The CM0 library only implements single-precision float
operations, so imports from muldf3.o, fixdfsi.o, etc are expected.)
...
??> (/home/mirdan/gcc-obj/gcc/thumb/v6-m/nofp/libgcc.a)mulsf3.o
??> (/home/mirdan/gcc-obj/gcc/thumb/v6-m/nofp/libgcc.a)fixsfsi.o
(/home/mirdan/gcc-obj/gcc/thumb/v6-m/nofp/libgcc.a)muldf3.o
(/home/mirdan/gcc-obj/gcc/thumb/v6-m/nofp/libgcc.a)fixdfsi.o
??> (/home/mirdan/gcc-obj/gcc/thumb/v6-m/nofp/libgcc.a)extendsfdf2.o
(/home/mirdan/gcc-obj/gcc/thumb/v6-m/nofp/libgcc.a)_clzsi2.o
(/home/mirdan/gcc-obj/gcc/thumb/v6-m/nofp/libgcc.a)_arm_fcmpge.o
(/home/mirdan/gcc-obj/gcc/thumb/v6-m/nofp/libgcc.a)_arm_fcmple.o
(/home/mirdan/gcc-obj/gcc/thumb/v6-m/nofp/libgcc.a)_fixsfdi.o
(/home/mirdan/gcc-obj/gcc/thumb/v6-m/nofp/libgcc.a)_fixunssfdi.o
(/home/mirdan/gcc-obj/gcc/thumb/v6-m/nofp/libgcc.a)_fixunssfsi.o
(/home/mirdan/gcc-obj/gcc/thumb/v6-m/nofp/libgcc.a)_arm_cmpdf2.o
(/home/mirdan/gcc-obj/gcc/thumb/v6-m/nofp/libgcc.a)_fixunsdfsi.o
(/home/mirdan/gcc-obj/gcc/thumb/v6-m/nofp/libgcc.a)_fixdfdi.o
(/home/mirdan/gcc-obj/gcc/thumb/v6-m/nofp/libgcc.a)_fixunsdfdi.o
(/home/mirdan/gcc-obj/gcc/thumb/v6-m/nofp/libgcc.a)eqdf2.o
(/home/mirdan/gcc-obj/gcc/thumb/v6-m/nofp/libgcc.a)gedf2.o
(/home/mirdan/gcc-obj/gcc/thumb/v6-m/nofp/libgcc.a)ledf2.o
(/home/mirdan/gcc-obj/gcc/thumb/v6-m/nofp/libgcc.a)subdf3.o
(/home/mirdan/gcc-obj/gcc/thumb/v6-m/nofp/libgcc.a)floatunsidf.o
(/home/mirdan/gcc-obj/gcc/thumb/v6-m/nofp/libgcc.a)_arm_cmpsf2.o
(/home/mirdan/gcc-obj/gcc/thumb/v6-m/nofp/libgcc.a)_fixsfsi.o
...

It seems that the order in which the linker resolves symbols matters.  In the
affected test cases, the linker begins searching for fixed-point function
symbols first: _subQQ.o, _cmpQQ.o, etc.  

[PATCH v7 26/34] Import float addition and subtraction from the CM0 library

2022-10-31 Thread Daniel Engel
Since this is the first import of single-precision functions, some common
parsing and formatting routines are also included.  These common routines
will be referenced by other functions in subsequent commits.
However, even if the size penalty is accounted entirely to __addsf3(),
the total compiled size is still less than half the size of soft-float.

gcc/libgcc/ChangeLog:
2022-10-09 Daniel Engel 

* config/arm/eabi/fadd.S (__addsf3, __subsf3): Added new functions.
* config/arm/eabi/fneg.S (__negsf2): Added new file.
* config/arm/eabi/futil.S (__fp_normalize2, __fp_lalign2, __fp_assemble,
__fp_overflow, __fp_zero, __fp_check_nan): Added new file with shared
helper functions.
* config/arm/lib1funcs.S: #include eabi/fneg.S and eabi/futil.S (v6m only).
* config/arm/t-elf (LIB1ASMFUNCS): Added _arm_addsf3, _arm_frsubsf3,
_fp_exceptionf, _fp_checknanf, _fp_assemblef, and _fp_normalizef.
---
 libgcc/config/arm/eabi/fadd.S  | 306 +++-
 libgcc/config/arm/eabi/fneg.S  |  76 ++
 libgcc/config/arm/eabi/fplib.h |   3 -
 libgcc/config/arm/eabi/futil.S | 418 +
 libgcc/config/arm/lib1funcs.S  |   2 +
 libgcc/config/arm/t-elf|   6 +
 6 files changed, 798 insertions(+), 13 deletions(-)
 create mode 100644 libgcc/config/arm/eabi/fneg.S
 create mode 100644 libgcc/config/arm/eabi/futil.S

diff --git a/libgcc/config/arm/eabi/fadd.S b/libgcc/config/arm/eabi/fadd.S
index fffbd91d1bc..176e330a1b6 100644
--- a/libgcc/config/arm/eabi/fadd.S
+++ b/libgcc/config/arm/eabi/fadd.S
@@ -1,5 +1,7 @@
-/* Copyright (C) 2006-2021 Free Software Foundation, Inc.
-   Contributed by CodeSourcery.
+/* fadd.S: Thumb-1 optimized 32-bit float addition and subtraction
+
+   Copyright (C) 2018-2022 Free Software Foundation, Inc.
+   Contributed by Daniel Engel, Senva Inc (g...@danielengel.com)
 
This file is free software; you can redistribute it and/or modify it
under the terms of the GNU General Public License as published by the
@@ -21,18 +23,302 @@
.  */
 
 
+#ifdef L_arm_frsubsf3
+
+// float __aeabi_frsub(float, float)
+// Returns the floating point difference of $r1 - $r0 in $r0.
+// Subsection ordering within fpcore keeps conditional branches within range.
+FUNC_START_SECTION aeabi_frsub .text.sorted.libgcc.fpcore.b.frsub
+CFI_START_FUNCTION
+
+  #if defined(STRICT_NANS) && STRICT_NANS
+// Check if $r0 is NAN before modifying.
+lsls    r2, r0, #1
+movs    r3, #255
+lsls    r3, #24
+
+// Let fadd() find the NAN in the normal course of operation,
+//  moving it to $r0 and checking the quiet/signaling bit.
+cmp r2, r3
+bhi SYM(__aeabi_fadd)
+  #endif
+
+// Flip sign and run through fadd().
+movs    r2, #1
+lsls    r2, #31
+adds    r0, r2
+b   SYM(__aeabi_fadd)
+
+CFI_END_FUNCTION
+FUNC_END aeabi_frsub
+
+#endif /* L_arm_frsubsf3 */
+
+
 #ifdef L_arm_addsubsf3
 
-FUNC_START aeabi_frsub
+// float __aeabi_fsub(float, float)
+// Returns the floating point difference of $r0 - $r1 in $r0.
+// Subsection ordering within fpcore keeps conditional branches within range.
+FUNC_START_SECTION aeabi_fsub .text.sorted.libgcc.fpcore.c.faddsub
+FUNC_ALIAS subsf3 aeabi_fsub
+CFI_START_FUNCTION
 
-  push {r4, lr}
-  movs r4, #1
-  lsls r4, #31
-  eors r0, r0, r4
-  bl   __aeabi_fadd
-  pop  {r4, pc}
+  #if defined(STRICT_NANS) && STRICT_NANS
+// Check if $r1 is NAN before modifying.
+lsls    r2, r1, #1
+movs    r3, #255
+lsls    r3, #24
 
-  FUNC_END aeabi_frsub
+// Let fadd() find the NAN in the normal course of operation,
+//  moving it to $r0 and checking the quiet/signaling bit.
+cmp r2, r3
+bhi SYM(__aeabi_fadd)
+  #endif
+
+// Flip sign and fall into fadd().
+movs    r2, #1
+lsls    r2, #31
+adds    r1, r2
 
 #endif /* L_arm_addsubsf3 */
 
+
+// The execution of __subsf3() flows directly into __addsf3(), such that
+//  instructions must appear consecutively in the same memory section.
+//  However, this construction inhibits the ability to discard __subsf3()
+//  when only using __addsf3().
+// Therefore, this block configures __addsf3() for compilation twice.
+// The first version is a minimal standalone implementation, and the second
+//  version is the continuation of __subsf3().  The standalone version must
+//  be declared WEAK, so that the combined version can supersede it and
+//  provide both symbols when required.
+// '_arm_addsf3' should appear before '_arm_addsubsf3' in LIB1ASMFUNCS.
+#if defined(L_arm_addsf3) || defined(L_arm_addsubsf3)
+
+#ifdef L_arm_addsf3
+// float __aeabi_fadd(float, float)
+// Returns the floating point 
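
The __aeabi_frsub() and __aeabi_fsub() entries above both reduce to the addition
routine by flipping the sign of one operand.  In C terms the frsub transformation is
roughly the following sketch (ignoring the STRICT_NANS pass-through), not the
library code itself:

    #include <stdint.h>
    #include <string.h>

    /* Sketch: b - a computed as (-a) + b, mirroring the sign flip plus
       tail call to __aeabi_fadd in the assembly above. */
    static float frsub_model(float a, float b)
    {
        uint32_t u;
        memcpy(&u, &a, sizeof (u));
        u ^= 0x80000000u;              /* flip the sign bit of 'a' */
        memcpy(&a, &u, sizeof (a));
        return a + b;                  /* stands in for __aeabi_fadd() */
    }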

[PATCH v7 23/34] Refactor Thumb-1 float comparison into a new file

2022-10-31 Thread Daniel Engel
gcc/libgcc/ChangeLog:
2022-10-09 Daniel Engel 

* config/arm/bpabi-v6m.S (__aeabi_cfcmpeq, __aeabi_cfcmple,
__aeabi_cfrcmple, __aeabi_fcmpeq, __aeabi_fcmple, aeabi_fcmple,
__aeabi_fcmpgt, aeabi_fcmpge): Moved to ...
* config/arm/eabi/fcmp.S: New file.
* config/arm/lib1funcs.S: #include eabi/fcmp.S (v6m only).
---
 libgcc/config/arm/bpabi-v6m.S | 63 -
 libgcc/config/arm/eabi/fcmp.S | 89 +++
 libgcc/config/arm/lib1funcs.S |  1 +
 3 files changed, 90 insertions(+), 63 deletions(-)
 create mode 100644 libgcc/config/arm/eabi/fcmp.S

diff --git a/libgcc/config/arm/bpabi-v6m.S b/libgcc/config/arm/bpabi-v6m.S
index d38a9208c60..8e0a45f4716 100644
--- a/libgcc/config/arm/bpabi-v6m.S
+++ b/libgcc/config/arm/bpabi-v6m.S
@@ -49,69 +49,6 @@ FUNC_START aeabi_frsub
 
 #endif /* L_arm_addsubsf3 */
 
-#ifdef L_arm_cmpsf2
-
-FUNC_START aeabi_cfrcmple
-
-   mov ip, r0
-   movsr0, r1
-   mov r1, ip
-   b   6f
-
-FUNC_START aeabi_cfcmpeq
-FUNC_ALIAS aeabi_cfcmple aeabi_cfcmpeq
-
-   @ The status-returning routines are required to preserve all
-   @ registers except ip, lr, and cpsr.
-6: push{r0, r1, r2, r3, r4, lr}
-   bl  __lesf2
-   @ Set the Z flag correctly, and the C flag unconditionally.
-   cmp r0, #0
-   @ Clear the C flag if the return value was -1, indicating
-   @ that the first operand was smaller than the second.
-   bmi 1f
-   movsr1, #0
-   cmn r0, r1
-1:
-   pop {r0, r1, r2, r3, r4, pc}
-
-   FUNC_END aeabi_cfcmple
-   FUNC_END aeabi_cfcmpeq
-   FUNC_END aeabi_cfrcmple
-
-FUNC_START aeabi_fcmpeq
-
-   push{r4, lr}
-   bl  __eqsf2
-   negsr0, r0
-   addsr0, r0, #1
-   pop {r4, pc}
-
-   FUNC_END aeabi_fcmpeq
-
-.macro COMPARISON cond, helper, mode=sf2
-FUNC_START aeabi_fcmp\cond
-
-   push{r4, lr}
-   bl  __\helper\mode
-   cmp r0, #0
-   b\cond  1f
-   movsr0, #0
-   pop {r4, pc}
-1:
-   movsr0, #1
-   pop {r4, pc}
-
-   FUNC_END aeabi_fcmp\cond
-.endm
-
-COMPARISON lt, le
-COMPARISON le, le
-COMPARISON gt, ge
-COMPARISON ge, ge
-
-#endif /* L_arm_cmpsf2 */
-
 #ifdef L_arm_addsubdf3
 
 FUNC_START aeabi_drsub
diff --git a/libgcc/config/arm/eabi/fcmp.S b/libgcc/config/arm/eabi/fcmp.S
new file mode 100644
index 000..96d627f1fea
--- /dev/null
+++ b/libgcc/config/arm/eabi/fcmp.S
@@ -0,0 +1,89 @@
+/* Miscellaneous BPABI functions.  Thumb-1 implementation, suitable for ARMv4T,
+   ARMv6-M and ARMv8-M Baseline like ISA variants.
+
+   Copyright (C) 2006-2020 Free Software Foundation, Inc.
+   Contributed by CodeSourcery.
+
+   This file is free software; you can redistribute it and/or modify it
+   under the terms of the GNU General Public License as published by the
+   Free Software Foundation; either version 3, or (at your option) any
+   later version.
+
+   This file is distributed in the hope that it will be useful, but
+   WITHOUT ANY WARRANTY; without even the implied warranty of
+   MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU
+   General Public License for more details.
+
+   Under Section 7 of GPL version 3, you are granted additional
+   permissions described in the GCC Runtime Library Exception, version
+   3.1, as published by the Free Software Foundation.
+
+   You should have received a copy of the GNU General Public License and
+   a copy of the GCC Runtime Library Exception along with this program;
+   see the files COPYING3 and COPYING.RUNTIME respectively.  If not, see
+   .  */
+
+
+#ifdef L_arm_cmpsf2
+
+FUNC_START aeabi_cfrcmple
+
+   mov ip, r0
+   movsr0, r1
+   mov r1, ip
+   b   6f
+
+FUNC_START aeabi_cfcmpeq
+FUNC_ALIAS aeabi_cfcmple aeabi_cfcmpeq
+
+   @ The status-returning routines are required to preserve all
+   @ registers except ip, lr, and cpsr.
+6: push{r0, r1, r2, r3, r4, lr}
+   bl  __lesf2
+   @ Set the Z flag correctly, and the C flag unconditionally.
+   cmp r0, #0
+   @ Clear the C flag if the return value was -1, indicating
+   @ that the first operand was smaller than the second.
+   bmi 1f
+   movsr1, #0
+   cmn r0, r1
+1:
+   pop {r0, r1, r2, r3, r4, pc}
+
+   FUNC_END aeabi_cfcmple
+   FUNC_END aeabi_cfcmpeq
+   FUNC_END aeabi_cfrcmple
+
+FUNC_START aeabi_fcmpeq
+
+   push{r4, lr}
+   bl  __eqsf2
+   negsr0, r0
+   addsr0, r0, #1
+   pop {r4, pc}
+
+   FUNC_END aeabi_fcmpeq
+
+.macro COMPARISON cond, helper, mode=sf2
+FUNC_START aeabi_fcmp\cond
+
+   push{r4, lr}
+   bl  __\helper\mode
+   cmp r0, #0
+   b\cond  1f
+   movsr0, #0
+   pop {r4, pc}
+1:
+   movsr0, #1
+  

[PATCH v7 20/34] Refactor Thumb-1 64-bit division into a new file

2022-10-31 Thread Daniel Engel
gcc/libgcc/ChangeLog:
2022-10-09 Daniel Engel 

* config/arm/bpabi-v6m.S (__aeabi_ldivmod/ldivmod): Moved to ...
* config/arm/eabi/ldiv.S: New file.
* config/arm/lib1funcs.S: #include eabi/ldiv.S (v6m only).
---
 libgcc/config/arm/bpabi-v6m.S |  81 -
 libgcc/config/arm/eabi/ldiv.S | 107 ++
 libgcc/config/arm/lib1funcs.S |   1 +
 3 files changed, 108 insertions(+), 81 deletions(-)
 create mode 100644 libgcc/config/arm/eabi/ldiv.S

diff --git a/libgcc/config/arm/bpabi-v6m.S b/libgcc/config/arm/bpabi-v6m.S
index 3757e99508e..d38a9208c60 100644
--- a/libgcc/config/arm/bpabi-v6m.S
+++ b/libgcc/config/arm/bpabi-v6m.S
@@ -34,87 +34,6 @@
 #endif /* __ARM_EABI__ */
 
 
-.macro test_div_by_zero signed
-   cmp yyh, #0
-   bne 7f
-   cmp yyl, #0
-   bne 7f
-   cmp xxh, #0
-   .ifc\signed, unsigned
-   bne 2f
-   cmp xxl, #0
-2:
-   beq 3f
-   movsxxh, #0
-   mvnsxxh, xxh@ 0x
-   movsxxl, xxh
-3:
-   .else
-   blt 6f
-   bgt 4f
-   cmp xxl, #0
-   beq 5f
-4: movsxxl, #0
-   mvnsxxl, xxl@ 0x
-   lsrsxxh, xxl, #1@ 0x7fff
-   b   5f
-6: movsxxh, #0x80
-   lslsxxh, xxh, #24   @ 0x8000
-   movsxxl, #0
-5:
-   .endif
-   @ tailcalls are tricky on v6-m.
-   push{r0, r1, r2}
-   ldr r0, 1f
-   adr r1, 1f
-   addsr0, r1
-   str r0, [sp, #8]
-   @ We know we are not on armv4t, so pop pc is safe.
-   pop {r0, r1, pc}
-   .align  2
-1:
-   .word   __aeabi_ldiv0 - 1b
-7:
-.endm
-
-#ifdef L_aeabi_ldivmod
-
-FUNC_START aeabi_ldivmod
-   test_div_by_zero signed
-
-   push{r0, r1}
-   mov r0, sp
-   push{r0, lr}
-   ldr r0, [sp, #8]
-   bl  SYM(__gnu_ldivmod_helper)
-   ldr r3, [sp, #4]
-   mov lr, r3
-   add sp, sp, #8
-   pop {r2, r3}
-   RET
-   FUNC_END aeabi_ldivmod
-
-#endif /* L_aeabi_ldivmod */
-
-#ifdef L_aeabi_uldivmod
-
-FUNC_START aeabi_uldivmod
-   test_div_by_zero unsigned
-
-   push{r0, r1}
-   mov r0, sp
-   push{r0, lr}
-   ldr r0, [sp, #8]
-   bl  SYM(__udivmoddi4)
-   ldr r3, [sp, #4]
-   mov lr, r3
-   add sp, sp, #8
-   pop {r2, r3}
-   RET
-   FUNC_END aeabi_uldivmod
-   
-#endif /* L_aeabi_uldivmod */
-
 #ifdef L_arm_addsubsf3
 
 FUNC_START aeabi_frsub
diff --git a/libgcc/config/arm/eabi/ldiv.S b/libgcc/config/arm/eabi/ldiv.S
new file mode 100644
index 000..3c8280ef580
--- /dev/null
+++ b/libgcc/config/arm/eabi/ldiv.S
@@ -0,0 +1,107 @@
+/* Miscellaneous BPABI functions.  Thumb-1 implementation, suitable for ARMv4T,
+   ARMv6-M and ARMv8-M Baseline like ISA variants.
+
+   Copyright (C) 2006-2020 Free Software Foundation, Inc.
+   Contributed by CodeSourcery.
+
+   This file is free software; you can redistribute it and/or modify it
+   under the terms of the GNU General Public License as published by the
+   Free Software Foundation; either version 3, or (at your option) any
+   later version.
+
+   This file is distributed in the hope that it will be useful, but
+   WITHOUT ANY WARRANTY; without even the implied warranty of
+   MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU
+   General Public License for more details.
+
+   Under Section 7 of GPL version 3, you are granted additional
+   permissions described in the GCC Runtime Library Exception, version
+   3.1, as published by the Free Software Foundation.
+
+   You should have received a copy of the GNU General Public License and
+   a copy of the GCC Runtime Library Exception along with this program;
+   see the files COPYING3 and COPYING.RUNTIME respectively.  If not, see
+   .  */
+
+
+.macro test_div_by_zero signed
+cmp yyh, #0
+bne 7f
+cmp yyl, #0
+bne 7f
+cmp xxh, #0
+.ifc\signed, unsigned
+bne 2f
+cmp xxl, #0
+2:
+beq 3f
+movs    xxh, #0
+mvns    xxh, xxh        @ 0x
+movs    xxl, xxh
+3:
+.else
+blt 6f
+bgt 4f
+cmp xxl, #0
+beq 5f
+4:  movs    xxl, #0
+mvns    xxl, xxl        @ 0x
+lsrs    xxh, xxl, #1    @ 0x7fff
+b   5f
+6:  movs    xxh, #0x80
+lsls    xxh, xxh, #24   @ 0x8000
+movs    xxl, #0
+5:
+.endif
+@ tailcalls are tricky on v6-m.
+push    {r0, r1, r2}
+ldr r0, 1f
+adr r1, 1f
+adds    r0, r1
+str r0, [sp, #8]
+@ We know we are not on armv4t, so pop 

[PATCH v7 24/34] Import float comparison from the CM0 library

2022-10-31 Thread Daniel Engel
These functions are significantly smaller and faster than the wrapper
functions and soft-float implementation they replace.  Using the first
comparison operator (e.g. '<=') in any program costs about 70 bytes
initially, but every additional operator incrementally adds just 4 bytes.

NOTE: It seems that the __aeabi_cfcmp*() routines formerly in bpabi-v6m.S
were not well tested, as they returned wrong results for the 'C' flag.
The replacement functions are fully tested.

gcc/libgcc/ChangeLog:
2022-10-09 Daniel Engel 

* config/arm/eabi/fcmp.S (__cmpsf2, __eqsf2, __gesf2,
__aeabi_fcmpne, __aeabi_fcmpun): Added new functions.
(__aeabi_fcmpeq, __aeabi_fcmpne, __aeabi_fcmplt, __aeabi_fcmple,
 __aeabi_fcmpge, __aeabi_fcmpgt, __aeabi_cfcmple, __aeabi_cfcmpeq,
 __aeabi_cfrcmple): Replaced with branches to __internal_cmpsf2().
* config/arm/eabi/fplib.h: New file with fcmp-specific constants
and general build configuration macros.
* config/arm/lib1funcs.S: #include eabi/fplib.h (v6m only).
* config/arm/t-elf (LIB1ASMFUNCS): Added _internal_cmpsf2,
_arm_cfcmpeq, _arm_cfcmple, _arm_cfrcmple, _arm_fcmpeq,
_arm_fcmpge, _arm_fcmpgt, _arm_fcmple, _arm_fcmplt, _arm_fcmpne,
_arm_eqsf2, and _arm_gesf2.
---
 libgcc/config/arm/eabi/fcmp.S  | 643 +
 libgcc/config/arm/eabi/fplib.h |  83 +
 libgcc/config/arm/lib1funcs.S  |   1 +
 libgcc/config/arm/t-elf|  18 +
 4 files changed, 681 insertions(+), 64 deletions(-)
 create mode 100644 libgcc/config/arm/eabi/fplib.h

diff --git a/libgcc/config/arm/eabi/fcmp.S b/libgcc/config/arm/eabi/fcmp.S
index 96d627f1fea..0c813fae8c5 100644
--- a/libgcc/config/arm/eabi/fcmp.S
+++ b/libgcc/config/arm/eabi/fcmp.S
@@ -1,8 +1,7 @@
-/* Miscellaneous BPABI functions.  Thumb-1 implementation, suitable for ARMv4T,
-   ARMv6-M and ARMv8-M Baseline like ISA variants.
+/* fcmp.S: Thumb-1 optimized 32-bit float comparison
 
-   Copyright (C) 2006-2020 Free Software Foundation, Inc.
-   Contributed by CodeSourcery.
+   Copyright (C) 2018-2022 Free Software Foundation, Inc.
+   Contributed by Daniel Engel, Senva Inc (g...@danielengel.com)
 
This file is free software; you can redistribute it and/or modify it
under the terms of the GNU General Public License as published by the
@@ -24,66 +23,582 @@
.  */
 
 
+// The various compare functions in this file all expect to tail call __cmpsf2()
+//  with flags set for a particular comparison mode.  The __internal_cmpsf2()
+//  symbol itself is unambiguous, but there is a remote risk that the linker
+//  will prefer some other symbol in place of __cmpsf2().  Importing an archive
+// file that also exports __cmpsf2() will throw an error in this case.
+// As a workaround, this block configures __aeabi_f2lz() for compilation twice.
+// The first version configures __internal_cmpsf2() as a WEAK standalone symbol,
+//  and the second exports __cmpsf2() and __internal_cmpsf2() normally.
+// A small bonus: programs not using __cmpsf2() itself will be slightly smaller.
+// 'L_internal_cmpsf2' should appear before 'L_arm_cmpsf2' in LIB1ASMFUNCS.
+#if defined(L_arm_cmpsf2) || defined(L_internal_cmpsf2)
+
+#define CMPSF2_SECTION .text.sorted.libgcc.fcmp.cmpsf2
+
+// int __cmpsf2(float, float)
+// 
+// Returns the three-way comparison result of $r0 with $r1:
+//  * +1 if ($r0 > $r1), or either argument is NAN
+//  *  0 if ($r0 == $r1)
+//  * -1 if ($r0 < $r1)
+// Uses $r2, $r3, and $ip as scratch space.
+#ifdef L_arm_cmpsf2
+FUNC_START_SECTION cmpsf2 CMPSF2_SECTION
+FUNC_ALIAS lesf2 cmpsf2
+FUNC_ALIAS ltsf2 cmpsf2
+CFI_START_FUNCTION
+
+// Assumption: The 'libgcc' functions should raise exceptions.
+movs    r2, #(FCMP_UN_POSITIVE + FCMP_RAISE_EXCEPTIONS + FCMP_3WAY)
+
+// int,int __internal_cmpsf2(float, float, int)
+// Internal function expects a set of control flags in $r2.
+// If ordered, returns a comparison type { 0, 1, 2 } in $r3
+FUNC_ENTRY internal_cmpsf2
+
+#else /* L_internal_cmpsf2 */
+WEAK_START_SECTION internal_cmpsf2 CMPSF2_SECTION
+CFI_START_FUNCTION
+
+#endif
+
+// When operand signs are considered, the comparison result falls
+//  within one of the following quadrants:
+//
+// $r0  $r1  $r0-$r1* flags  result
+//  ++  >  C=0 GT
+//  ++  =  Z=1 EQ
+//  ++  <  C=1 LT
+//  +-  >  C=1 GT
+//  +-  =  C=1 GT
+//  +-  <  C=1 GT
+//  -+  >  C=0 LT
+//  -+  =  C=0 LT
+//  -+  <  C=0 LT
+//  --  >  C=0 LT
+//  --  =  Z=1 EQ
+//  --  <  C=1 GT
+//
+   
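
The sign-quadrant table above corresponds roughly to the following C model of the
three-way comparison.  This is a sketch for ordered values only; the real code
screens NANs first and encodes the requested comparison mode in the flags passed
through $r2:

    #include <stdint.h>
    #include <string.h>

    static int cmpsf_model(float a, float b)
    {
        uint32_t ua, ub;
        memcpy(&ua, &a, sizeof (ua));
        memcpy(&ub, &b, sizeof (ub));

        /* Treat +0.0 and -0.0 as equal. */
        if (((ua | ub) << 1) == 0)
            return 0;

        int sa = ua >> 31, sb = ub >> 31;
        if (sa != sb)
            return sa ? -1 : 1;          /* mixed signs: the negative one is smaller */

        if (ua == ub)
            return 0;
        int less = (ua < ub);            /* same sign: compare magnitudes */
        if (sa)
            less = !less;                /* both negative: the order reverses */
        return less ? -1 : 1;
    }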

[PATCH v7 25/34] Refactor Thumb-1 float subtraction into a new file

2022-10-31 Thread Daniel Engel
This will make it easier to isolate changes in subsequent patches.

gcc/libgcc/ChangeLog:
2022-10-09 Daniel Engel 

* config/arm/bpabi-v6m.S (__aeabi_frsub): Moved to ...
* config/arm/eabi/fadd.S: New file.
* config/arm/lib1funcs.S: #include eabi/fadd.S (v6m only).
---
 libgcc/config/arm/bpabi-v6m.S | 16 ---
 libgcc/config/arm/eabi/fadd.S | 38 +++
 libgcc/config/arm/lib1funcs.S |  1 +
 3 files changed, 39 insertions(+), 16 deletions(-)
 create mode 100644 libgcc/config/arm/eabi/fadd.S

diff --git a/libgcc/config/arm/bpabi-v6m.S b/libgcc/config/arm/bpabi-v6m.S
index 8e0a45f4716..afba648ec57 100644
--- a/libgcc/config/arm/bpabi-v6m.S
+++ b/libgcc/config/arm/bpabi-v6m.S
@@ -33,22 +33,6 @@
.eabi_attribute 25, 1
 #endif /* __ARM_EABI__ */
 
-
-#ifdef L_arm_addsubsf3
-
-FUNC_START aeabi_frsub
-
-  push {r4, lr}
-  movs r4, #1
-  lsls r4, #31
-  eors r0, r0, r4
-  bl   __aeabi_fadd
-  pop  {r4, pc}
-
-  FUNC_END aeabi_frsub
-
-#endif /* L_arm_addsubsf3 */
-
 #ifdef L_arm_addsubdf3
 
 FUNC_START aeabi_drsub
diff --git a/libgcc/config/arm/eabi/fadd.S b/libgcc/config/arm/eabi/fadd.S
new file mode 100644
index 000..fffbd91d1bc
--- /dev/null
+++ b/libgcc/config/arm/eabi/fadd.S
@@ -0,0 +1,38 @@
+/* Copyright (C) 2006-2021 Free Software Foundation, Inc.
+   Contributed by CodeSourcery.
+
+   This file is free software; you can redistribute it and/or modify it
+   under the terms of the GNU General Public License as published by the
+   Free Software Foundation; either version 3, or (at your option) any
+   later version.
+
+   This file is distributed in the hope that it will be useful, but
+   WITHOUT ANY WARRANTY; without even the implied warranty of
+   MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU
+   General Public License for more details.
+
+   Under Section 7 of GPL version 3, you are granted additional
+   permissions described in the GCC Runtime Library Exception, version
+   3.1, as published by the Free Software Foundation.
+
+   You should have received a copy of the GNU General Public License and
+   a copy of the GCC Runtime Library Exception along with this program;
+   see the files COPYING3 and COPYING.RUNTIME respectively.  If not, see
+   .  */
+
+
+#ifdef L_arm_addsubsf3
+
+FUNC_START aeabi_frsub
+
+  push {r4, lr}
+  movs r4, #1
+  lsls r4, #31
+  eors r0, r0, r4
+  bl   __aeabi_fadd
+  pop  {r4, pc}
+
+  FUNC_END aeabi_frsub
+
+#endif /* L_arm_addsubsf3 */
+
diff --git a/libgcc/config/arm/lib1funcs.S b/libgcc/config/arm/lib1funcs.S
index 188d9d7ff47..d1a2d2f7908 100644
--- a/libgcc/config/arm/lib1funcs.S
+++ b/libgcc/config/arm/lib1funcs.S
@@ -2012,6 +2012,7 @@ LSYM(Lchange_\register):
 #include "bpabi-v6m.S"
 #include "eabi/fplib.h"
 #include "eabi/fcmp.S"
+#include "eabi/fadd.S"
 #endif /* NOT_ISA_TARGET_32BIT */
 #include "eabi/lcmp.S"
 #endif /* !__symbian__ */
-- 
2.34.1



[PATCH v7 22/34] Import integer multiplication from the CM0 library

2022-10-31 Thread Daniel Engel
gcc/libgcc/ChangeLog:
2022-10-09 Daniel Engel 

* config/arm/eabi/lmul.S: New file for __muldi3(), __mulsidi3(), and
 __umulsidi3().
* config/arm/lib1funcs.S: #eabi/lmul.S (v6m only).
* config/arm/t-elf: Add the new objects to LIB1ASMFUNCS.
---
 libgcc/config/arm/eabi/lmul.S | 218 ++
 libgcc/config/arm/lib1funcs.S |   1 +
 libgcc/config/arm/t-elf   |  13 +-
 3 files changed, 230 insertions(+), 2 deletions(-)
 create mode 100644 libgcc/config/arm/eabi/lmul.S

diff --git a/libgcc/config/arm/eabi/lmul.S b/libgcc/config/arm/eabi/lmul.S
new file mode 100644
index 000..377e571bf09
--- /dev/null
+++ b/libgcc/config/arm/eabi/lmul.S
@@ -0,0 +1,218 @@
+/* lmul.S: Thumb-1 optimized 64-bit integer multiplication
+
+   Copyright (C) 2018-2022 Free Software Foundation, Inc.
+   Contributed by Daniel Engel, Senva Inc (g...@danielengel.com)
+
+   This file is free software; you can redistribute it and/or modify it
+   under the terms of the GNU General Public License as published by the
+   Free Software Foundation; either version 3, or (at your option) any
+   later version.
+
+   This file is distributed in the hope that it will be useful, but
+   WITHOUT ANY WARRANTY; without even the implied warranty of
+   MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU
+   General Public License for more details.
+
+   Under Section 7 of GPL version 3, you are granted additional
+   permissions described in the GCC Runtime Library Exception, version
+   3.1, as published by the Free Software Foundation.
+
+   You should have received a copy of the GNU General Public License and
+   a copy of the GCC Runtime Library Exception along with this program;
+   see the files COPYING3 and COPYING.RUNTIME respectively.  If not, see
+   .  */
+
+
+#ifdef L_muldi3
+
+// long long __aeabi_lmul(long long, long long)
+// Returns the least significant 64 bits of a 64 bit multiplication.
+// Expects the two multiplicands in $r1:$r0 and $r3:$r2.
+// Returns the product in $r1:$r0 (does not distinguish signed types).
+// Uses $r4 and $r5 as scratch space.
+// Same parent section as __umulsidi3() to keep tail call branch within range.
+FUNC_START_SECTION muldi3 .text.sorted.libgcc.lmul.muldi3
+
+#ifndef __symbian__
+  FUNC_ALIAS aeabi_lmul muldi3
+#endif
+
+CFI_START_FUNCTION
+
+// $r1:$r0 = 0x
+// $r3:$r2 = 0x
+
+// The following operations that only affect the upper 64 bits
+//  can be safely discarded:
+//    * 
+//    * 
+//    * 
+//    * 
+//    * 
+//    * 
+
+// MAYBE: Test for multiply by ZERO on implementations with a 32-cycle
+//  'muls' instruction, and skip over the operation in that case.
+
+// (0x * 0x), free $r1
+mulsxxh,yyl
+
+// (0x * 0x), free $r3
+mulsyyh,xxl
+addsyyh,xxh
+
+// Put the parameters in the correct form for umulsidi3().
+movsxxh,yyl
+b   LLSYM(__mul_overflow)
+
+CFI_END_FUNCTION
+FUNC_END muldi3
+
+#ifndef __symbian__
+  FUNC_END aeabi_lmul
+#endif
+
+#endif /* L_muldi3 */
+
+
+// The following implementation of __umulsidi3() integrates with __muldi3()
+//  above to allow the fast tail call while still preserving the extra
+//  hi-shifted bits of the result.  However, these extra bits add a few
+//  instructions not otherwise required when using only __umulsidi3().
+// Therefore, this block configures __umulsidi3() for compilation twice.
+// The first version is a minimal standalone implementation, and the second
+//  version adds the hi bits of __muldi3().  The standalone version must
+//  be declared WEAK, so that the combined version can supersede it and
+//  provide both symbols in programs that multiply long doubles.
+// This means '_umulsidi3' should appear before '_muldi3' in LIB1ASMFUNCS.
+#if defined(L_muldi3) || defined(L_umulsidi3)
+
+#ifdef L_umulsidi3
+// unsigned long long __umulsidi3(unsigned int, unsigned int)
+// Returns all 64 bits of a 32 bit multiplication.
+// Expects the two multiplicands in $r0 and $r1.
+// Returns the product in $r1:$r0.
+// Uses $r3, $r4 and $ip as scratch space.
+WEAK_START_SECTION umulsidi3 .text.sorted.libgcc.lmul.umulsidi3
+CFI_START_FUNCTION
+
+#else /* L_muldi3 */
+FUNC_ENTRY umulsidi3
+CFI_START_FUNCTION
+
+// 32x32 multiply with 64 bit result.
+// Expand the multiply into 4 parts, since muls only returns 32 bits.
+// (a16h * b16h / 2^32)
+//   + (a16h * b16l / 2^48) + (a16l * b16h / 2^48)
+//   + (a16l * b16l / 2^64)
+
+// MAYBE: Test for multiply by 0 on implementations with a 32-cycle
+//  'muls' instruction, and skip over the operation in that case.
+
+ 
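
The expansion described in the comment above can be written in C roughly as follows
(an illustrative sketch; the assembly interleaves these partial products to stay
within the low registers and to feed __muldi3() when required):

    #include <stdint.h>

    /* a*b = (ah*bh << 32) + (ah*bl << 16) + (al*bh << 16) + al*bl,
       using only 32-bit multiplies, as Thumb-1 'muls' returns 32 bits. */
    static uint64_t umulsidi3_model(uint32_t a, uint32_t b)
    {
        uint32_t ah = a >> 16, al = a & 0xFFFF;
        uint32_t bh = b >> 16, bl = b & 0xFFFF;

        uint64_t result = (uint64_t)(ah * bh) << 32;   /* high partial product */
        result += (uint64_t)(ah * bl) << 16;           /* middle partial products */
        result += (uint64_t)(al * bh) << 16;
        result += al * bl;                             /* low partial product */
        return result;
    }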

[PATCH v7 19/34] Import 32-bit division from the CM0 library

2022-10-31 Thread Daniel Engel
gcc/libgcc/ChangeLog:
2022-10-09 Daniel Engel 

* config/arm/eabi/idiv.S: New file for __udivsi3() and __divsi3().
* config/arm/lib1funcs.S: #include eabi/idiv.S (v6m only).
---
 libgcc/config/arm/eabi/idiv.S | 299 ++
 libgcc/config/arm/lib1funcs.S |  19 ++-
 2 files changed, 317 insertions(+), 1 deletion(-)
 create mode 100644 libgcc/config/arm/eabi/idiv.S

diff --git a/libgcc/config/arm/eabi/idiv.S b/libgcc/config/arm/eabi/idiv.S
new file mode 100644
index 000..6e54863611a
--- /dev/null
+++ b/libgcc/config/arm/eabi/idiv.S
@@ -0,0 +1,299 @@
+/* div.S: Thumb-1 size-optimized 32-bit integer division
+
+   Copyright (C) 2018-2022 Free Software Foundation, Inc.
+   Contributed by Daniel Engel, Senva Inc (g...@danielengel.com)
+
+   This file is free software; you can redistribute it and/or modify it
+   under the terms of the GNU General Public License as published by the
+   Free Software Foundation; either version 3, or (at your option) any
+   later version.
+
+   This file is distributed in the hope that it will be useful, but
+   WITHOUT ANY WARRANTY; without even the implied warranty of
+   MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU
+   General Public License for more details.
+
+   Under Section 7 of GPL version 3, you are granted additional
+   permissions described in the GCC Runtime Library Exception, version
+   3.1, as published by the Free Software Foundation.
+
+   You should have received a copy of the GNU General Public License and
+   a copy of the GCC Runtime Library Exception along with this program;
+   see the files COPYING3 and COPYING.RUNTIME respectively.  If not, see
+   .  */
+
+
+#ifndef __GNUC__
+
+// int __aeabi_idiv0(int)
+// Helper function for division by 0.
+WEAK_START_SECTION aeabi_idiv0 .text.sorted.libgcc.idiv.idiv0
+FUNC_ALIAS cm0_idiv0 aeabi_idiv0
+CFI_START_FUNCTION
+
+  #if defined(TRAP_EXCEPTIONS) && TRAP_EXCEPTIONS
+svc #(SVC_DIVISION_BY_ZERO)
+  #endif
+
+RET
+
+CFI_END_FUNCTION
+FUNC_END cm0_idiv0
+FUNC_END aeabi_idiv0
+
+#endif /* !__GNUC__ */
+
+
+#ifdef L_divsi3
+
+// int __aeabi_idiv(int, int)
+// idiv_return __aeabi_idivmod(int, int)
+// Returns signed $r0 after division by $r1.
+// Also returns the signed remainder in $r1.
+// Same parent section as __divsi3() to keep branches within range.
+FUNC_START_SECTION divsi3 .text.sorted.libgcc.idiv.divsi3
+
+#ifndef __symbian__
+  FUNC_ALIAS aeabi_idiv divsi3
+  FUNC_ALIAS aeabi_idivmod divsi3
+#endif
+
+CFI_START_FUNCTION
+
+// Extend signs.
+asrs    r2, r0, #31
+asrs    r3, r1, #31
+
+// Absolute value of the denominator, abort on division by zero.
+eors    r1, r3
+subs    r1, r3
+  #if defined(PEDANTIC_DIV0) && PEDANTIC_DIV0
+beq LLSYM(__idivmod_zero)
+  #else
+beq SYM(__uidivmod_zero)
+  #endif
+
+// Absolute value of the numerator.
+eors    r0, r2
+subs    r0, r2
+
+// Keep the sign of the numerator in bit[31] (for the remainder).
+// Save the XOR of the signs in bits[15:0] (for the quotient).
+push    { rT, lr }
+.cfi_remember_state
+.cfi_adjust_cfa_offset 8
+.cfi_rel_offset rT, 0
+.cfi_rel_offset lr, 4
+
+lsrs    rT, r3, #16
+eors    rT, r2
+
+// Handle division as unsigned.
+bl  SYM(__uidivmod_nonzero) __PLT__
+
+// Set the sign of the remainder.
+asrs    r2, rT, #31
+eors    r1, r2
+subs    r1, r2
+
+// Set the sign of the quotient.
+sxth    r3, rT
+eors    r0, r3
+subs    r0, r3
+
+LLSYM(__idivmod_return):
+pop { rT, pc }
+.cfi_restore_state
+
+  #if defined(PEDANTIC_DIV0) && PEDANTIC_DIV0
+LLSYM(__idivmod_zero):
+// Set up the *div0() parameter specified in the ARM runtime ABI:
+//  * 0 if the numerator is 0,
+//  * Or, the largest value of the type manipulated by the calling
+// division function if the numerator is positive,
+//  * Or, the least value of the type manipulated by the calling
+// division function if the numerator is negative.
+subs    r1, r0
+orrs    r0, r1
+asrs    r0, #31
+lsrs    r0, #1
+eors    r0, r2
+
+// At least the __aeabi_idiv0() call is common.
+b   SYM(__uidivmod_zero2)
+  #endif /* PEDANTIC_DIV0 */
+
+CFI_END_FUNCTION
+FUNC_END divsi3
+
+#ifndef __symbian__
+  FUNC_END aeabi_idiv
+  FUNC_END aeabi_idivmod
+#endif 
+
+#endif /* L_divsi3 */
+
+
+#ifdef L_udivsi3
+
+// int __aeabi_uidiv(unsigned int, unsigned int)
+// idiv_return __aeabi_uidivmod(unsigned int, unsigned int)
+// Returns unsigned $r0 
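
The sign handling around the unsigned division call above can be summarized by the
following C sketch (illustrative only; division by zero is routed to __aeabi_idiv0 in
the real code, and __uidivmod_nonzero does the actual unsigned work):

    #include <stdint.h>

    /* Quotient sign = XOR of the operand signs; remainder sign = sign of
       the numerator (truncating division, as required by C and the EABI). */
    static void idivmod_model(int32_t n, int32_t d, int32_t *q, int32_t *r)
    {
        uint32_t un = (n < 0) ? 0u - (uint32_t)n : (uint32_t)n;
        uint32_t ud = (d < 0) ? 0u - (uint32_t)d : (uint32_t)d;

        uint32_t uq = un / ud;     /* stands in for __uidivmod_nonzero */
        uint32_t ur = un % ud;

        *q = (int32_t)(((n ^ d) < 0) ? 0u - uq : uq);
        *r = (int32_t)((n < 0) ? 0u - ur : ur);
    }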

[PATCH v7 14/34] Import 'parity' functions from the CM0 library

2022-10-31 Thread Daniel Engel
The functional overlap between the single- and double-word functions makes
this implementation about half the size of the C functions if both are
linked in the same application.

gcc/libgcc/ChangeLog:
2022-10-09 Daniel Engel 

* config/arm/parity.S: New file for __paritysi2/di2().
* config/arm/lib1funcs.S: #include bit/parity.S
* config/arm/t-elf (LIB1ASMFUNCS): Added _paritysi2/di2.
---
 libgcc/config/arm/lib1funcs.S |   1 +
 libgcc/config/arm/parity.S| 120 ++
 libgcc/config/arm/t-elf   |   2 +
 3 files changed, 123 insertions(+)
 create mode 100644 libgcc/config/arm/parity.S

diff --git a/libgcc/config/arm/lib1funcs.S b/libgcc/config/arm/lib1funcs.S
index aa5957b8399..3f7b9e739f0 100644
--- a/libgcc/config/arm/lib1funcs.S
+++ b/libgcc/config/arm/lib1funcs.S
@@ -1704,6 +1704,7 @@ LSYM(Lover12):
 
 #include "clz2.S"
 #include "ctz2.S"
+#include "parity.S"
 
 /*  */
 /* These next two sections are here despite the fact that they contain Thumb 
diff --git a/libgcc/config/arm/parity.S b/libgcc/config/arm/parity.S
new file mode 100644
index 000..1405bea93a3
--- /dev/null
+++ b/libgcc/config/arm/parity.S
@@ -0,0 +1,120 @@
+/* parity.S: ARM optimized parity functions
+
+   Copyright (C) 2020-2022 Free Software Foundation, Inc.
+   Contributed by Daniel Engel (g...@danielengel.com)
+
+   This file is free software; you can redistribute it and/or modify it
+   under the terms of the GNU General Public License as published by the
+   Free Software Foundation; either version 3, or (at your option) any
+   later version.
+
+   This file is distributed in the hope that it will be useful, but
+   WITHOUT ANY WARRANTY; without even the implied warranty of
+   MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU
+   General Public License for more details.
+
+   Under Section 7 of GPL version 3, you are granted additional
+   permissions described in the GCC Runtime Library Exception, version
+   3.1, as published by the Free Software Foundation.
+
+   You should have received a copy of the GNU General Public License and
+   a copy of the GCC Runtime Library Exception along with this program;
+   see the files COPYING3 and COPYING.RUNTIME respectively.  If not, see
+   .  */
+
+
+#ifdef L_paritydi2
+
+// int __paritydi2(int)
+// Returns '0' if the number of bits set in $r1:r0 is even, and '1' otherwise.
+// Returns the result in $r0.
+FUNC_START_SECTION paritydi2 .text.sorted.libgcc.paritydi2
+CFI_START_FUNCTION
+
+// Combine the upper and lower words, then fall through.
+// Byte-endianness does not matter for this function.
+eors    r0, r1
+
+#endif /* L_paritydi2 */
+
+
+// The implementation of __paritydi2() tightly couples with __paritysi2(),
+//  such that instructions must appear consecutively in the same memory
+//  section for proper flow control.  However, this construction inhibits
+//  the ability to discard __paritydi2() when only using __paritysi2().
+// Therefore, this block configures __paritysi2() for compilation twice.
+// The first version is a minimal standalone implementation, and the second
+//  version is the continuation of __paritydi2().  The standalone version must
+//  be declared WEAK, so that the combined version can supersede it and
+//  provide both symbols when required.
+// '_paritysi2' should appear before '_paritydi2' in LIB1ASMFUNCS.
+#if defined(L_paritysi2) || defined(L_paritydi2)
+
+#ifdef L_paritysi2
+// int __paritysi2(int)
+// Returns '0' if the number of bits set in $r0 is even, and '1' otherwise.
+// Returns the result in $r0.
+// Uses $r2 as scratch space.
+WEAK_START_SECTION paritysi2 .text.sorted.libgcc.paritysi2
+CFI_START_FUNCTION
+
+#else /* L_paritydi2 */
+FUNC_ENTRY paritysi2
+
+#endif
+
+  #if defined(__thumb__) && __thumb__
+#if defined(__OPTIMIZE_SIZE__) && __OPTIMIZE_SIZE__
+
+// Size optimized: 16 bytes, 40 cycles
+// Speed optimized: 24 bytes, 14 cycles
+movs    r2, #16
+
+LLSYM(__parity_loop):
+// Calculate the parity of successively smaller half-words into the MSB.
+movs    r1, r0
+lsls    r1, r2
+eors    r0, r1
+lsrs    r2, #1
+bne LLSYM(__parity_loop)
+
+#else /* !__OPTIMIZE_SIZE__ */
+
+// Unroll the loop.  The 'libgcc' reference C implementation replaces
+//  the x2 and the x1 shifts with a constant.  However, since it takes
+//  4 cycles to load, index, and mask the constant result, it doesn't
+//  cost anything to keep shifting (and saves a few bytes).
+lsls    r1, r0, #16
+eors    r0, r1
+lsls    r1, r0, #8
+eors    r0, r1
+lsls    r1, r0, #4
+eors    r0, r1
+lsls    r1, r0, 
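
Both variants above implement the same folding idea, which in C looks roughly like
the sketch below (the loop version folds toward the MSB instead of bit 0, but the
resulting parity is the same):

    #include <stdint.h>

    /* Each step XORs the upper half of the remaining bits into the lower
       half, so after folding by 16, 8, 4, 2 and 1 the parity of the whole
       word ends up in bit 0. */
    static int paritysi2_model(uint32_t x)
    {
        x ^= x >> 16;
        x ^= x >> 8;
        x ^= x >> 4;
        x ^= x >> 2;
        x ^= x >> 1;
        return x & 1;
    }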

[PATCH v7 21/34] Import 64-bit division from the CM0 library

2022-10-31 Thread Daniel Engel
gcc/libgcc/ChangeLog:
2022-10-09 Daniel Engel 

* config/arm/bpabi.c: Deleted unused file.
* config/arm/eabi/ldiv.S (__aeabi_ldivmod, __aeabi_uldivmod):
Replaced wrapper functions with a complete implementation.
* config/arm/t-bpabi (LIB2ADD_ST): Removed bpabi.c.
* config/arm/t-elf (LIB1ASMFUNCS): Added _divdi3 and _udivdi3.
---
 libgcc/config/arm/bpabi.c |  42 ---
 libgcc/config/arm/eabi/ldiv.S | 542 +-
 libgcc/config/arm/t-bpabi |   3 +-
 libgcc/config/arm/t-elf   |   9 +
 4 files changed, 474 insertions(+), 122 deletions(-)
 delete mode 100644 libgcc/config/arm/bpabi.c

diff --git a/libgcc/config/arm/bpabi.c b/libgcc/config/arm/bpabi.c
deleted file mode 100644
index d8ba940d1ff..000
--- a/libgcc/config/arm/bpabi.c
+++ /dev/null
@@ -1,42 +0,0 @@
-/* Miscellaneous BPABI functions.
-
-   Copyright (C) 2003-2022 Free Software Foundation, Inc.
-   Contributed by CodeSourcery, LLC.
-
-   This file is free software; you can redistribute it and/or modify it
-   under the terms of the GNU General Public License as published by the
-   Free Software Foundation; either version 3, or (at your option) any
-   later version.
-
-   This file is distributed in the hope that it will be useful, but
-   WITHOUT ANY WARRANTY; without even the implied warranty of
-   MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU
-   General Public License for more details.
-
-   Under Section 7 of GPL version 3, you are granted additional
-   permissions described in the GCC Runtime Library Exception, version
-   3.1, as published by the Free Software Foundation.
-
-   You should have received a copy of the GNU General Public License and
-   a copy of the GCC Runtime Library Exception along with this program;
-   see the files COPYING3 and COPYING.RUNTIME respectively.  If not, see
-   .  */
-
-extern long long __divdi3 (long long, long long);
-extern unsigned long long __udivdi3 (unsigned long long, 
-unsigned long long);
-extern long long __gnu_ldivmod_helper (long long, long long, long long *);
-
-
-long long
-__gnu_ldivmod_helper (long long a, 
- long long b, 
- long long *remainder)
-{
-  long long quotient;
-
-  quotient = __divdi3 (a, b);
-  *remainder = a - b * quotient;
-  return quotient;
-}
-
diff --git a/libgcc/config/arm/eabi/ldiv.S b/libgcc/config/arm/eabi/ldiv.S
index 3c8280ef580..e3ba6497761 100644
--- a/libgcc/config/arm/eabi/ldiv.S
+++ b/libgcc/config/arm/eabi/ldiv.S
@@ -1,8 +1,7 @@
-/* Miscellaneous BPABI functions.  Thumb-1 implementation, suitable for ARMv4T,
-   ARMv6-M and ARMv8-M Baseline like ISA variants.
+/* ldiv.S: Thumb-1 optimized 64-bit integer division
 
-   Copyright (C) 2006-2020 Free Software Foundation, Inc.
-   Contributed by CodeSourcery.
+   Copyright (C) 2018-2022 Free Software Foundation, Inc.
+   Contributed by Daniel Engel, Senva Inc (g...@danielengel.com)
 
This file is free software; you can redistribute it and/or modify it
under the terms of the GNU General Public License as published by the
@@ -24,84 +23,471 @@
.  */
 
 
-.macro test_div_by_zero signed
-cmp yyh, #0
-bne 7f
-cmp yyl, #0
-bne 7f
-cmp xxh, #0
-.ifc\signed, unsigned
-bne 2f
-cmp xxl, #0
-2:
-beq 3f
-movsxxh, #0
-mvnsxxh, xxh@ 0x
-movsxxl, xxh
-3:
-.else
-blt 6f
-bgt 4f
-cmp xxl, #0
-beq 5f
-4:  movsxxl, #0
-mvnsxxl, xxl@ 0x
-lsrsxxh, xxl, #1@ 0x7fff
-b   5f
-6:  movsxxh, #0x80
-lslsxxh, xxh, #24   @ 0x8000
-movsxxl, #0
-5:
-.endif
-@ tailcalls are tricky on v6-m.
-push{r0, r1, r2}
-ldr r0, 1f
-adr r1, 1f
-addsr0, r1
-str r0, [sp, #8]
-@ We know we are not on armv4t, so pop pc is safe.
-pop {r0, r1, pc}
-.align  2
-1:
-.word   __aeabi_ldiv0 - 1b
-7:
-.endm
-
-#ifdef L_aeabi_ldivmod
-
-FUNC_START aeabi_ldivmod
-test_div_by_zero signed
-
-push{r0, r1}
-mov r0, sp
-push{r0, lr}
-ldr r0, [sp, #8]
-bl  SYM(__gnu_ldivmod_helper)
-ldr r3, [sp, #4]
-mov lr, r3
-add sp, sp, #8
-pop {r2, r3}
+#ifndef __GNUC__
+
+// long long __aeabi_ldiv0(long long)
+// Helper function for division by 0.
+WEAK_START_SECTION aeabi_ldiv0 .text.sorted.libgcc.ldiv.ldiv0
+CFI_START_FUNCTION
+
+  #if defined(TRAP_EXCEPTIONS) && TRAP_EXCEPTIONS
+svc #(SVC_DIVISION_BY_ZERO)
+  #endif
+
 RET
-FUNC_END 

[PATCH v7 16/34] Refactor Thumb-1 64-bit comparison into a new file

2022-10-31 Thread Daniel Engel
This will make it easier to isolate changes in subsequent patches.

gcc/libgcc/ChangeLog:
2022-10-09 Daniel Engel 

* config/arm/bpabi-v6m.S (__aeabi_lcmp, __aeabi_ulcmp): Moved to ...
* config/arm/eabi/lcmp.S: New file.
* config/arm/lib1funcs.S: #include eabi/lcmp.S.
---
 libgcc/config/arm/bpabi-v6m.S | 46 --
 libgcc/config/arm/eabi/lcmp.S | 73 +++
 libgcc/config/arm/lib1funcs.S |  1 +
 3 files changed, 74 insertions(+), 46 deletions(-)
 create mode 100644 libgcc/config/arm/eabi/lcmp.S

diff --git a/libgcc/config/arm/bpabi-v6m.S b/libgcc/config/arm/bpabi-v6m.S
index ea01d3f4d5f..3757e99508e 100644
--- a/libgcc/config/arm/bpabi-v6m.S
+++ b/libgcc/config/arm/bpabi-v6m.S
@@ -33,52 +33,6 @@
.eabi_attribute 25, 1
 #endif /* __ARM_EABI__ */
 
-#ifdef L_aeabi_lcmp
-
-FUNC_START aeabi_lcmp
-   cmp xxh, yyh
-   beq 1f
-   bgt 2f
-   movsr0, #1
-   negsr0, r0
-   RET
-2:
-   movsr0, #1
-   RET
-1:
-   subsr0, xxl, yyl
-   beq 1f
-   bhi 2f
-   movsr0, #1
-   negsr0, r0
-   RET
-2:
-   movsr0, #1
-1:
-   RET
-   FUNC_END aeabi_lcmp
-
-#endif /* L_aeabi_lcmp */
-   
-#ifdef L_aeabi_ulcmp
-
-FUNC_START aeabi_ulcmp
-   cmp xxh, yyh
-   bne 1f
-   subsr0, xxl, yyl
-   beq 2f
-1:
-   bcs 1f
-   movsr0, #1
-   negsr0, r0
-   RET
-1:
-   movsr0, #1
-2:
-   RET
-   FUNC_END aeabi_ulcmp
-
-#endif /* L_aeabi_ulcmp */
 
 .macro test_div_by_zero signed
cmp yyh, #0
diff --git a/libgcc/config/arm/eabi/lcmp.S b/libgcc/config/arm/eabi/lcmp.S
new file mode 100644
index 000..336db1d398c
--- /dev/null
+++ b/libgcc/config/arm/eabi/lcmp.S
@@ -0,0 +1,73 @@
+/* Miscellaneous BPABI functions.  Thumb-1 implementation, suitable for ARMv4T,
+   ARMv6-M and ARMv8-M Baseline like ISA variants.
+
+   Copyright (C) 2006-2020 Free Software Foundation, Inc.
+   Contributed by CodeSourcery.
+
+   This file is free software; you can redistribute it and/or modify it
+   under the terms of the GNU General Public License as published by the
+   Free Software Foundation; either version 3, or (at your option) any
+   later version.
+
+   This file is distributed in the hope that it will be useful, but
+   WITHOUT ANY WARRANTY; without even the implied warranty of
+   MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU
+   General Public License for more details.
+
+   Under Section 7 of GPL version 3, you are granted additional
+   permissions described in the GCC Runtime Library Exception, version
+   3.1, as published by the Free Software Foundation.
+
+   You should have received a copy of the GNU General Public License and
+   a copy of the GCC Runtime Library Exception along with this program;
+   see the files COPYING3 and COPYING.RUNTIME respectively.  If not, see
+   .  */
+
+
+#ifdef L_aeabi_lcmp
+
+FUNC_START aeabi_lcmp
+cmp xxh, yyh
+beq 1f
+bgt 2f
+movs    r0, #1
+negs    r0, r0
+RET
+2:
+movs    r0, #1
+RET
+1:
+subs    r0, xxl, yyl
+beq 1f
+bhi 2f
+movs    r0, #1
+negs    r0, r0
+RET
+2:
+movs    r0, #1
+1:
+RET
+FUNC_END aeabi_lcmp
+
+#endif /* L_aeabi_lcmp */
+
+#ifdef L_aeabi_ulcmp
+
+FUNC_START aeabi_ulcmp
+cmp xxh, yyh
+bne 1f
+subs    r0, xxl, yyl
+beq 2f
+1:
+bcs 1f
+movs    r0, #1
+negs    r0, r0
+RET
+1:
+movs    r0, #1
+2:
+RET
+FUNC_END aeabi_ulcmp
+
+#endif /* L_aeabi_ulcmp */
+
diff --git a/libgcc/config/arm/lib1funcs.S b/libgcc/config/arm/lib1funcs.S
index 0eb6d1d52a7..d85a20252d9 100644
--- a/libgcc/config/arm/lib1funcs.S
+++ b/libgcc/config/arm/lib1funcs.S
@@ -1991,5 +1991,6 @@ LSYM(Lchange_\register):
 #include "bpabi.S"
 #else /* NOT_ISA_TARGET_32BIT */
 #include "bpabi-v6m.S"
+#include "eabi/lcmp.S"
 #endif /* NOT_ISA_TARGET_32BIT */
 #endif /* !__symbian__ */
-- 
2.34.1



[PATCH v7 12/34] Import 'clrsb' functions from the CM0 library

2022-10-31 Thread Daniel Engel
This implementation provides an efficient tail call to __clzsi2(), making the
functions rather smaller and faster than the C versions.
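
For reference, the operation being implemented is roughly the following C sketch
(assuming GCC's arithmetic right shift on signed int; this mirrors the tail call to
__clzsi2() and is not the library code itself):

    #include <limits.h>

    /* Invert negative inputs so the redundant sign bits become leading
       zeros, count them, and subtract one because the sign bit itself is
       not redundant. */
    static int clrsbsi2_model(int x)
    {
        unsigned u = (unsigned)(x ^ (x >> (int)(sizeof (int) * CHAR_BIT - 1)));
        /* __builtin_clz(0) is undefined, so handle 0 and -1 explicitly. */
        if (u == 0)
            return (int)(sizeof (int) * CHAR_BIT) - 1;   /* 31 on 32-bit int */
        return __builtin_clz(u) - 1;
    }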

gcc/libgcc/ChangeLog:
2022-10-09 Daniel Engel 

* config/arm/bits/clz2.S (__clrsbsi2, __clrsbdi2):
Added new functions.
* config/arm/t-elf (LIB1ASMFUNCS):
Added new function objects _clrsbsi2 and _clrsbdi2).
---
 libgcc/config/arm/clz2.S | 108 ++-
 libgcc/config/arm/t-elf  |   2 +
 2 files changed, 108 insertions(+), 2 deletions(-)

diff --git a/libgcc/config/arm/clz2.S b/libgcc/config/arm/clz2.S
index ed04698fef4..3d40811278b 100644
--- a/libgcc/config/arm/clz2.S
+++ b/libgcc/config/arm/clz2.S
@@ -1,4 +1,4 @@
-/* clz2.S: Cortex M0 optimized 'clz' functions
+/* clz2.S: ARM optimized 'clz' and related functions
 
Copyright (C) 2018-2022 Free Software Foundation, Inc.
Contributed by Daniel Engel (g...@danielengel.com)
@@ -23,7 +23,7 @@
.  */
 
 
-#if defined(__ARM_FEATURE_CLZ) && __ARM_FEATURE_CLZ
+#ifdef __ARM_FEATURE_CLZ
 
 #ifdef L_clzdi2
 
@@ -242,3 +242,107 @@ FUNC_END clzdi2
 
 #endif /* !__ARM_FEATURE_CLZ */
 
+
+#ifdef L_clrsbdi2
+
+// int __clrsbdi2(int)
+// Counts the number of "redundant sign bits" in $r1:$r0.
+// Returns the result in $r0.
+// Uses $r2 and $r3 as scratch space.
+FUNC_START_SECTION clrsbdi2 .text.sorted.libgcc.clz2.clrsbdi2
+CFI_START_FUNCTION
+
+  #if defined(__ARM_FEATURE_CLZ) && __ARM_FEATURE_CLZ
+// Invert negative signs to keep counting zeros.
+asrsr3, xxh,#31
+eorsxxl,r3
+eorsxxh,r3
+
+// Same as __clzdi2(), except that the 'C' flag is pre-calculated.
+// Also, the trailing 'subs', since the last bit is not redundant.
+do_it   eq, et
+clzeq   r0, xxl
+clzne   r0, xxh
+addeq   r0, #32
+subsr0, #1
+RET
+
+  #else  /* !__ARM_FEATURE_CLZ */
+// Result if all the bits in the argument are zero.
+// Set it here to keep the flags clean after 'eors' below.
+movsr2, #31
+
+// Invert negative signs to keep counting zeros.
+asrsr3, xxh,#31
+eorsxxh,r3
+
+#if defined(__ARMEB__) && __ARMEB__
+// If the upper word is non-zero, return '__clzsi2(upper) - 1'.
+bne SYM(__internal_clzsi2)
+
+// The upper word is zero, prepare the lower word.
+movsr0, r1
+eorsr0, r3
+
+#else /* !__ARMEB__ */
+// Save the lower word temporarily.
+// This somewhat awkward construction adds one cycle when the
+//  branch is not taken, but prevents a double-branch.
+eorsr3, r0
+
+// If the upper word is non-zero, return '__clzsi2(upper) - 1'.
+movsr0, r1
+bneSYM(__internal_clzsi2)
+
+// Restore the lower word.
+movsr0, r3
+
+#endif /* !__ARMEB__ */
+
+// The upper word is zero, return '31 + __clzsi2(lower)'.
+addsr2, #32
+b   SYM(__internal_clzsi2)
+
+  #endif /* !__ARM_FEATURE_CLZ */
+
+CFI_END_FUNCTION
+FUNC_END clrsbdi2
+
+#endif /* L_clrsbdi2 */
+
+
+#ifdef L_clrsbsi2
+
+// int __clrsbsi2(int)
+// Counts the number of "redundant sign bits" in $r0.
+// Returns the result in $r0.
+// Uses $r2 and possibly $r3 as scratch space.
+FUNC_START_SECTION clrsbsi2 .text.sorted.libgcc.clz2.clrsbsi2
+CFI_START_FUNCTION
+
+// Invert negative signs to keep counting zeros.
+asrsr2, r0,#31
+eorsr0, r2
+
+  #if defined(__ARM_FEATURE_CLZ) && __ARM_FEATURE_CLZ
+// Count.
+clz r0, r0
+
+// The result for a positive value will always be >= 1.
+// By definition, the last bit is not redundant.
+subsr0, #1
+RET
+
+  #else /* !__ARM_FEATURE_CLZ */
+// Result if all the bits in the argument are zero.
+// By definition, the last bit is not redundant.
+movsr2, #31
+b   SYM(__internal_clzsi2)
+
+  #endif  /* !__ARM_FEATURE_CLZ */
+
+CFI_END_FUNCTION
+FUNC_END clrsbsi2
+
+#endif /* L_clrsbsi2 */
+
diff --git a/libgcc/config/arm/t-elf b/libgcc/config/arm/t-elf
index 33b83ac4adf..89071cebe45 100644
--- a/libgcc/config/arm/t-elf
+++ b/libgcc/config/arm/t-elf
@@ -31,6 +31,8 @@ LIB1ASMFUNCS += \
_ashldi3 \
_ashrdi3 \
_lshrdi3 \
+   _clrsbsi2 \
+   _clrsbdi2 \
_clzdi2 \
_ctzdi2 \
_dvmd_tls \
-- 
2.34.1



[PATCH v7 13/34] Import 'ffs' functions from the CM0 library

2022-10-31 Thread Daniel Engel
This implementation provides an efficient tail call to __ctzsi2(), making the
functions rather smaller and faster than the C versions.
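
The identity the tail call relies on can be stated in C as follows (a sketch only;
ctz32() below is a local helper standing in for the shared __ctzsi2() code):

  #include <stdint.h>
  #include <stdio.h>

  /* Count trailing zeros; the caller guarantees x != 0.  */
  static int ctz32(uint32_t x)
  {
      int n = 0;
      while (!(x & 1u)) {
          x >>= 1;
          n++;
      }
      return n;
  }

  /* For non-zero x, ffs(x) == ctz(x) + 1; zero is the only special case.  */
  static int ffssi2(uint32_t x)
  {
      return x ? ctz32(x) + 1 : 0;
  }

  static int ffsdi2(uint64_t x)
  {
      uint32_t lo = (uint32_t)x, hi = (uint32_t)(x >> 32);
      if (lo)
          return ctz32(lo) + 1;      /* common case: low word non-zero */
      if (hi)
          return ctz32(hi) + 33;     /* bit index continues into the high word */
      return 0;
  }

  int main(void)
  {
      printf("%d %d %d\n", ffssi2(0), ffssi2(8u), ffsdi2(1ULL << 40)); /* 0 4 41 */
      return 0;
  }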

gcc/libgcc/ChangeLog:
2022-10-09 Daniel Engel 

* config/arm/ctz2.S (__ffssi2, __ffsdi2): New functions.
* config/arm/t-elf (LIB1ASMFUNCS): Added _ffssi2 and _ffsdi2.
---
 libgcc/config/arm/ctz2.S | 77 +++-
 libgcc/config/arm/t-elf  |  2 ++
 2 files changed, 78 insertions(+), 1 deletion(-)

diff --git a/libgcc/config/arm/ctz2.S b/libgcc/config/arm/ctz2.S
index 82c81c6ae11..d57acabae01 100644
--- a/libgcc/config/arm/ctz2.S
+++ b/libgcc/config/arm/ctz2.S
@@ -1,4 +1,4 @@
-/* ctz2.S: ARM optimized 'ctz' functions
+/* ctz2.S: ARM optimized 'ctz' and related functions
 
Copyright (C) 2020-2022 Free Software Foundation, Inc.
Contributed by Daniel Engel (g...@danielengel.com)
@@ -238,3 +238,78 @@ FUNC_END ctzdi2
 
 #endif /* L_ctzsi2 || L_ctzdi2 */
 
+
+#ifdef L_ffsdi2
+
+// int __ffsdi2(int)
+// Return the index of the least significant 1-bit in $r1:r0,
+//  or zero if $r1:r0 is zero.  The least significant bit is index 1.
+// Returns the result in $r0.
+// Uses $r2 and possibly $r3 as scratch space.
+// Same section as __ctzsi2() for sake of the tail call branches.
+FUNC_START_SECTION ffsdi2 .text.sorted.libgcc.ctz2.ffsdi2
+CFI_START_FUNCTION
+
+// Simplify branching by assuming a non-zero lower word.
+// For all such, ffssi2(x) == ctzsi2(x) + 1.
+movsr2,#(33 - CTZ_RESULT_OFFSET)
+
+  #if defined(__ARMEB__) && __ARMEB__
+// HACK: Save the upper word in a scratch register.
+movsr3, r0
+
+// Test the lower word.
+movsr0, r1
+bne SYM(__internal_ctzsi2)
+
+// Test the upper word.
+movsr2,#(65 - CTZ_RESULT_OFFSET)
+movsr0, r3
+bne SYM(__internal_ctzsi2)
+
+  #else /* !__ARMEB__ */
+// Test the lower word.
+cmp r0, #0
+bne SYM(__internal_ctzsi2)
+
+// Test the upper word.
+movsr2,#(65 - CTZ_RESULT_OFFSET)
+movsr0, r1
+bne SYM(__internal_ctzsi2)
+
+  #endif /* !__ARMEB__ */
+
+// Upper and lower words are both zero.
+RET
+
+CFI_END_FUNCTION
+FUNC_END ffsdi2
+
+#endif /* L_ffsdi2 */
+
+
+#ifdef L_ffssi2
+
+// int __ffssi2(int)
+// Return the index of the least significant 1-bit in $r0,
+//  or zero if $r0 is zero.  The least significant bit is index 1.
+// Returns the result in $r0.
+// Uses $r2 and possibly $r3 as scratch space.
+// Same section as __ctzsi2() for sake of the tail call branches.
+FUNC_START_SECTION ffssi2 .text.sorted.libgcc.ctz2.ffssi2
+CFI_START_FUNCTION
+
+// Simplify branching by assuming a non-zero argument.
+// For all such, ffssi2(x) == ctzsi2(x) + 1.
+movsr2,#(33 - CTZ_RESULT_OFFSET)
+
+// Test for zero, return unmodified.
+cmp r0, #0
+bne SYM(__internal_ctzsi2)
+RET
+
+CFI_END_FUNCTION
+FUNC_END ffssi2
+
+#endif /* L_ffssi2 */
+
diff --git a/libgcc/config/arm/t-elf b/libgcc/config/arm/t-elf
index 89071cebe45..346fc766f17 100644
--- a/libgcc/config/arm/t-elf
+++ b/libgcc/config/arm/t-elf
@@ -35,6 +35,8 @@ LIB1ASMFUNCS += \
_clrsbdi2 \
_clzdi2 \
_ctzdi2 \
+   _ffssi2 \
+   _ffsdi2 \
_dvmd_tls \
_divsi3 \
_modsi3 \
-- 
2.34.1



[PATCH v7 18/34] Merge Thumb-2 optimizations for 64-bit comparison

2022-10-31 Thread Daniel Engel
This effectively merges support for all architecture variants into a
common function path with appropriate build conditions.
ARM performance is 1-2 instructions faster; Thumb-2 is about 50% faster.
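
A small harness like the one below can spot-check the merged path on any of the
variants (illustrative only, assuming a build for ARM linked against this libgcc;
the prototypes follow the run-time ABI names used in this file):

  #include <stdio.h>

  /* EABI 64-bit comparison helpers provided by libgcc.  */
  extern int __aeabi_lcmp(long long a, long long b);                    /* {-1, 0, +1} */
  extern int __aeabi_ulcmp(unsigned long long a, unsigned long long b); /* {-1, 0, +1} */

  int main(void)
  {
      /* High words equal: the low words decide the result.  */
      printf("%d\n", __aeabi_lcmp(0x100000002LL, 0x100000001LL));       /* +1 */
      /* Same bit pattern, different answers for signed vs. unsigned.  */
      printf("%d\n", __aeabi_lcmp(-1LL, 1LL));                          /* -1 */
      printf("%d\n", __aeabi_ulcmp(0xFFFFFFFFFFFFFFFFULL, 1ULL));       /* +1 */
      return 0;
  }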

gcc/libgcc/ChangeLog:
2022-10-09 Daniel Engel 

* config/arm/bpabi.S (__aeabi_lcmp, __aeabi_ulcmp): Removed.
* config/arm/eabi/lcmp.S (__aeabi_lcmp, __aeabi_ulcmp): Added
conditional execution on supported architectures (__HAVE_FEATURE_IT).
* config/arm/lib1funcs.S: Moved #include scope of eabi/lcmp.S.
---
 libgcc/config/arm/bpabi.S | 42 ---
 libgcc/config/arm/eabi/lcmp.S | 47 ++-
 libgcc/config/arm/lib1funcs.S |  2 +-
 3 files changed, 47 insertions(+), 44 deletions(-)

diff --git a/libgcc/config/arm/bpabi.S b/libgcc/config/arm/bpabi.S
index 17fe707ddf3..531a64fa98d 100644
--- a/libgcc/config/arm/bpabi.S
+++ b/libgcc/config/arm/bpabi.S
@@ -34,48 +34,6 @@
.eabi_attribute 25, 1
 #endif /* __ARM_EABI__ */
 
-#ifdef L_aeabi_lcmp
-
-ARM_FUNC_START aeabi_lcmp
-   cmp xxh, yyh
-   do_it   lt
-   movlt   r0, #-1
-   do_it   gt
-   movgt   r0, #1
-   do_it   ne
-   RETc(ne)
-   subsr0, xxl, yyl
-   do_it   lo
-   movlo   r0, #-1
-   do_it   hi
-   movhi   r0, #1
-   RET
-   FUNC_END aeabi_lcmp
-
-#endif /* L_aeabi_lcmp */
-   
-#ifdef L_aeabi_ulcmp
-
-ARM_FUNC_START aeabi_ulcmp
-   cmp xxh, yyh
-   do_it   lo
-   movlo   r0, #-1
-   do_it   hi
-   movhi   r0, #1
-   do_it   ne
-   RETc(ne)
-   cmp xxl, yyl
-   do_it   lo
-   movlo   r0, #-1
-   do_it   hi
-   movhi   r0, #1
-   do_it   eq
-   moveq   r0, #0
-   RET
-   FUNC_END aeabi_ulcmp
-
-#endif /* L_aeabi_ulcmp */
-
 .macro test_div_by_zero signed
 /* Tail-call to divide-by-zero handlers which may be overridden by the user,
so unwinding works properly.  */
diff --git a/libgcc/config/arm/eabi/lcmp.S b/libgcc/config/arm/eabi/lcmp.S
index 99c7970ecba..d397325cbef 100644
--- a/libgcc/config/arm/eabi/lcmp.S
+++ b/libgcc/config/arm/eabi/lcmp.S
@@ -46,6 +46,19 @@ FUNC_START_SECTION LCMP_NAME LCMP_SECTION
 subsxxl,yyl
 sbcsxxh,yyh
 
+#ifdef __HAVE_FEATURE_IT
+do_it   lt,t
+
+  #ifdef L_aeabi_lcmp
+movlt   r0,#-1
+  #else
+movlt   r0,#0
+  #endif
+
+// Early return on '<'.
+RETc(lt)
+
+#else /* !__HAVE_FEATURE_IT */
 // With $r2 free, create a known offset value without affecting
 //  the N or Z flags.
 // BUG? The originally unified instruction for v6m was 'mov r2, r3'.
@@ -62,17 +75,27 @@ FUNC_START_SECTION LCMP_NAME LCMP_SECTION
 //  argument is larger, otherwise the offset value remains 0.
 addsr2, #2
 
+#endif
+
 // Check for zero (equality in 64 bits).
 // It doesn't matter which register was originally "hi".
 orrsr0,r1
 
+#ifdef __HAVE_FEATURE_IT
+// The result is already 0 on equality.
+// -1 already returned, so just force +1.
+do_it   ne
+movne   r0, #1
+
+#else /* !__HAVE_FEATURE_IT */
 // The result is already 0 on equality.
 beq LLSYM(__lcmp_return)
 
-LLSYM(__lcmp_lt):
+  LLSYM(__lcmp_lt):
 // Create +1 or -1 from the offset value defined earlier.
 addsr3, #1
 subsr0, r2, r3
+#endif
 
 LLSYM(__lcmp_return):
   #ifdef L_cmpdi2
@@ -111,21 +134,43 @@ FUNC_START_SECTION ULCMP_NAME ULCMP_SECTION
 subsxxl,yyl
 sbcsxxh,yyh
 
+#ifdef __HAVE_FEATURE_IT
+do_it   lo,t
+
+  #ifdef L_aeabi_ulcmp
+movlo   r0, -1
+  #else
+movlo   r0, #0
+  #endif
+
+// Early return on '<'.
+RETc(lo)
+
+#else
// Capture the carry flag.
 // $r2 will contain -1 if the first value is smaller,
 //  0 if the first value is larger or equal.
 sbcsr2, r2
+#endif
 
 // Check for zero (equality in 64 bits).
 // It doesn't matter which register was originally "hi".
 orrsr0, r1
 
+#ifdef __HAVE_FEATURE_IT
+// The result is already 0 on equality.
+// -1 already returned, so just force +1.
+do_it   ne
+movne   r0, #1
+
+#else /* !__HAVE_FEATURE_IT */
 // The result is already 0 on equality.
 beq LLSYM(__ulcmp_return)
 
 // Assume +1.  If -1 is correct, $r2 will override.
 movsr0, #1
 orrsr0, r2
+#endif
 
 LLSYM(__ulcmp_return):
   #ifdef L_ucmpdi2
diff --git a/libgcc/config/arm/lib1funcs.S b/libgcc/config/arm/lib1funcs.S
index d85a20252d9..796f6f30ed9 100644
--- a/libgcc/config/arm/lib1funcs.S
+++ b/libgcc/config/arm/lib1funcs.S
@@ -1991,6 +1991,6 @@ 

[PATCH v7 11/34] Import 64-bit shift functions from the CM0 library

2022-10-31 Thread Daniel Engel
The Thumb versions of these functions are each 1-2 instructions smaller
and faster, and branchless when the IT instruction is available.

The ARM versions were converted to the "xxl/xxh" big-endian register
naming convention, but are otherwise unchanged.
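
The structure of the Thumb path is easier to follow against a portable C sketch of
the same data flow (shift both words, then merge a remainder from the high word;
valid for counts 0..63 as stated in the file comments; llsr() is a local stand-in,
not the libgcc entry point):

  #include <stdint.h>
  #include <stdio.h>

  static uint64_t llsr(uint64_t x, unsigned n)   /* logical shift right */
  {
      uint32_t lo = (uint32_t)x, hi = (uint32_t)(x >> 32);

      if (n == 0)
          return x;                  /* avoid undefined 32-bit shift counts below */
      if (n < 32) {
          /* The remainder is the opposite shift, (32 - n) bits of the high word.  */
          lo = (lo >> n) | (hi << (32 - n));
          hi >>= n;
      } else {
          /* The shift consumes the whole low word.  */
          lo = hi >> (n - 32);
          hi = 0;
      }
      return ((uint64_t)hi << 32) | lo;
  }

  int main(void)
  {
      printf("%llx\n", (unsigned long long)llsr(0x123456789ABCDEF0ULL, 36)); /* 1234567 */
      return 0;
  }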

gcc/libgcc/ChangeLog:
2022-10-09 Daniel Engel 

* config/arm/eabi/lshift.S (__ashldi3, __ashrdi3, __lshrdi3):
Reduced code size on Thumb architectures;
updated big-endian register naming convention to "xxl/xxh".
---
 libgcc/config/arm/eabi/lshift.S | 338 +---
 1 file changed, 228 insertions(+), 110 deletions(-)

diff --git a/libgcc/config/arm/eabi/lshift.S b/libgcc/config/arm/eabi/lshift.S
index 6e79d96c118..365350dfb2d 100644
--- a/libgcc/config/arm/eabi/lshift.S
+++ b/libgcc/config/arm/eabi/lshift.S
@@ -1,123 +1,241 @@
-/* Copyright (C) 1995-2022 Free Software Foundation, Inc.
+/* lshift.S: ARM optimized 64-bit integer shift
 
-This file is free software; you can redistribute it and/or modify it
-under the terms of the GNU General Public License as published by the
-Free Software Foundation; either version 3, or (at your option) any
-later version.
+   Copyright (C) 2018-2022 Free Software Foundation, Inc.
+   Contributed by Daniel Engel, Senva Inc (g...@danielengel.com)
 
-This file is distributed in the hope that it will be useful, but
-WITHOUT ANY WARRANTY; without even the implied warranty of
-MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU
-General Public License for more details.
+   This file is free software; you can redistribute it and/or modify it
+   under the terms of the GNU General Public License as published by the
+   Free Software Foundation; either version 3, or (at your option) any
+   later version.
 
-Under Section 7 of GPL version 3, you are granted additional
-permissions described in the GCC Runtime Library Exception, version
-3.1, as published by the Free Software Foundation.
+   This file is distributed in the hope that it will be useful, but
+   WITHOUT ANY WARRANTY; without even the implied warranty of
+   MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU
+   General Public License for more details.
 
-You should have received a copy of the GNU General Public License and
-a copy of the GCC Runtime Library Exception along with this program;
-see the files COPYING3 and COPYING.RUNTIME respectively.  If not, see
-.  */
+   Under Section 7 of GPL version 3, you are granted additional
+   permissions described in the GCC Runtime Library Exception, version
+   3.1, as published by the Free Software Foundation.
+
+   You should have received a copy of the GNU General Public License and
+   a copy of the GCC Runtime Library Exception along with this program;
+   see the files COPYING3 and COPYING.RUNTIME respectively.  If not, see
+   .  */
 
 
 #ifdef L_lshrdi3
 
-   FUNC_START lshrdi3
-   FUNC_ALIAS aeabi_llsr lshrdi3
-   
-#ifdef __thumb__
-   lsrsal, r2
-   movsr3, ah
-   lsrsah, r2
-   mov ip, r3
-   subsr2, #32
-   lsrsr3, r2
-   orrsal, r3
-   negsr2, r2
-   mov r3, ip
-   lslsr3, r2
-   orrsal, r3
-   RET
-#else
-   subsr3, r2, #32
-   rsb ip, r2, #32
-   movmi   al, al, lsr r2
-   movpl   al, ah, lsr r3
-   orrmi   al, al, ah, lsl ip
-   mov ah, ah, lsr r2
-   RET
-#endif
-   FUNC_END aeabi_llsr
-   FUNC_END lshrdi3
-
-#endif
-   
+// long long __aeabi_llsr(long long, int)
+// Logical shift right the 64 bit value in $r1:$r0 by the count in $r2.
+// The result is only guaranteed for shifts in the range of '0' to '63'.
+// Uses $r3 as scratch space.
+FUNC_START_SECTION aeabi_llsr .text.sorted.libgcc.lshrdi3
+FUNC_ALIAS lshrdi3 aeabi_llsr
+CFI_START_FUNCTION
+
+  #if defined(__thumb__) && __thumb__
+
+// Save a copy for the remainder.
+movsr3, xxh
+
+// Assume a simple shift.
+lsrsxxl,r2
+lsrsxxh,r2
+
+// Test if the shift distance is larger than 1 word.
+subsr2, #32
+
+#ifdef __HAVE_FEATURE_IT
+do_it   lo,te
+
+// The remainder is opposite the main shift, (32 - x) bits.
+rsblo   r2, #0
+lsllo   r3, r2
+
+// The remainder shift extends into the hi word.
+lsrhs   r3, r2
+
+#else /* !__HAVE_FEATURE_IT */
+bhs LLSYM(__llsr_large)
+
+// The remainder is opposite the main shift, (32 - x) bits.
+rsbsr2, #0
+lslsr3, r2
+
+// Cancel any remaining shift.
+eorsr2, r2
+
+  LLSYM(__llsr_large):
+// Apply any remaining shift to the hi word.
+lsrsr3, r2
+
+#endif /* !__HAVE_FEATURE_IT */
+
+// Merge remainder and result.
+addsxxl,r3
+RET
+
+  #else /* 

[PATCH v7 10/34] Import 'ctz' functions from the CM0 library

2022-10-31 Thread Daniel Engel
This version combines __ctzdi2() with __ctzsi2() into a single object with
an efficient tail call.  The former implementation of __ctzdi2() was in C.

On architectures without __ARM_FEATURE_CLZ, this version merges the formerly
separate Thumb and ARM code sequences into a unified instruction sequence.
This change significantly improves Thumb performance without affecting ARM
performance.  Finally, this version adds a new __OPTIMIZE_SIZE__ build option.

On architectures with __ARM_FEATURE_CLZ, __ctzsi2(0) now returns 32.  Formerly,
__ctzsi2(0) would return -1.  Architectures without __ARM_FEATURE_CLZ have
always returned 32, so this change makes the return value consistent.
This change costs 2 extra instructions (branchless).

Likewise on architectures with __ARM_FEATURE_CLZ,  __ctzdi2(0) now returns
64 instead of 31.
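
The relationship the __ARM_FEATURE_CLZ path exploits, and the new zero conventions,
can be summarized in C (a sketch only; ctzsi2/ctzdi2/clz32 below are local stand-ins,
not the libgcc entry points):

  #include <stdint.h>
  #include <stdio.h>

  static int clz32(uint32_t x)       /* returns 32 for x == 0 */
  {
      int n = 0;
      if (x == 0)
          return 32;
      while (!(x & 0x80000000u)) {
          x <<= 1;
          n++;
      }
      return n;
  }

  /* x & -x isolates the least significant set bit, so for non-zero x
     ctz(x) == 31 - clz(x & -x).  Zero follows the new convention.  */
  static int ctzsi2(uint32_t x)
  {
      if (x == 0)
          return 32;
      return 31 - clz32(x & (0u - x));
  }

  static int ctzdi2(uint64_t x)
  {
      uint32_t lo = (uint32_t)x, hi = (uint32_t)(x >> 32);
      if (lo)
          return ctzsi2(lo);
      if (hi)
          return 32 + ctzsi2(hi);
      return 64;                     /* new convention for a zero argument */
  }

  int main(void)
  {
      printf("%d %d %d\n", ctzsi2(0), ctzsi2(40u), ctzdi2(1ULL << 33)); /* 32 3 33 */
      return 0;
  }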

gcc/libgcc/ChangeLog:
2022-10-09 Daniel Engel 

* config/arm/ctz2.S (__ctzdi2): Added a new function.
(__ctzsi2): Reduced size on architectures without __ARM_FEATURE_CLZ;
changed so __ctzsi2(0)=32 on architectures with __ARM_FEATURE_CLZ.
* config/arm/t-elf (LIB1ASMFUNCS): Added _ctzdi2;
moved _ctzsi2 to the weak function objects group.
---
 libgcc/config/arm/ctz2.S | 308 +--
 libgcc/config/arm/t-elf  |   3 +-
 2 files changed, 233 insertions(+), 78 deletions(-)

diff --git a/libgcc/config/arm/ctz2.S b/libgcc/config/arm/ctz2.S
index 1d885dcc71a..82c81c6ae11 100644
--- a/libgcc/config/arm/ctz2.S
+++ b/libgcc/config/arm/ctz2.S
@@ -1,86 +1,240 @@
-/* Copyright (C) 1995-2022 Free Software Foundation, Inc.
+/* ctz2.S: ARM optimized 'ctz' functions
 
-This file is free software; you can redistribute it and/or modify it
-under the terms of the GNU General Public License as published by the
-Free Software Foundation; either version 3, or (at your option) any
-later version.
+   Copyright (C) 2020-2022 Free Software Foundation, Inc.
+   Contributed by Daniel Engel (g...@danielengel.com)
 
-This file is distributed in the hope that it will be useful, but
-WITHOUT ANY WARRANTY; without even the implied warranty of
-MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU
-General Public License for more details.
+   This file is free software; you can redistribute it and/or modify it
+   under the terms of the GNU General Public License as published by the
+   Free Software Foundation; either version 3, or (at your option) any
+   later version.
 
-Under Section 7 of GPL version 3, you are granted additional
-permissions described in the GCC Runtime Library Exception, version
-3.1, as published by the Free Software Foundation.
+   This file is distributed in the hope that it will be useful, but
+   WITHOUT ANY WARRANTY; without even the implied warranty of
+   MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU
+   General Public License for more details.
 
-You should have received a copy of the GNU General Public License and
-a copy of the GCC Runtime Library Exception along with this program;
-see the files COPYING3 and COPYING.RUNTIME respectively.  If not, see
-.  */
+   Under Section 7 of GPL version 3, you are granted additional
+   permissions described in the GCC Runtime Library Exception, version
+   3.1, as published by the Free Software Foundation.
 
+   You should have received a copy of the GNU General Public License and
+   a copy of the GCC Runtime Library Exception along with this program;
+   see the files COPYING3 and COPYING.RUNTIME respectively.  If not, see
+   .  */
 
-#ifdef L_ctzsi2
-#ifdef NOT_ISA_TARGET_32BIT
-FUNC_START ctzsi2
-   negsr1, r0
-   andsr0, r0, r1
-   movsr1, #28
-   movsr3, #1
-   lslsr3, r3, #16
-   cmp r0, r3 /* 0x1 */
-   bcc 2f
-   lsrsr0, r0, #16
-   subsr1, r1, #16
-2: lsrsr3, r3, #8
-   cmp r0, r3 /* #0x100 */
-   bcc 2f
-   lsrsr0, r0, #8
-   subsr1, r1, #8
-2: lsrsr3, r3, #4
-   cmp r0, r3 /* #0x10 */
-   bcc 2f
-   lsrsr0, r0, #4
-   subsr1, r1, #4
-2: adr r2, 1f
-   ldrbr0, [r2, r0]
-   subsr0, r0, r1
-   bx lr
-.align 2
-1:
-.byte  27, 28, 29, 29, 30, 30, 30, 30, 31, 31, 31, 31, 31, 31, 31, 31
-   FUNC_END ctzsi2
+
+// When the hardware 'ctz' function is available, an efficient version
+//  of __ctzsi2(x) can be created by calculating '31 - __ctzsi2(lsb(x))',
+//  where lsb(x) is 'x' with only the least-significant '1' bit set.
+// The following offset applies to all of the functions in this file.
+#if defined(__ARM_FEATURE_CLZ) && __ARM_FEATURE_CLZ
+  #define CTZ_RESULT_OFFSET 1
 #else
-ARM_FUNC_START ctzsi2
-   rsb r1, r0, #0
-   and r0, r0, r1
-# if defined (__ARM_FEATURE_CLZ)
-   clz r0, r0
-   rsb r0, r0, #31
-   RET
-# else
-   mov r1, #28
-   cmp 

[PATCH v7 17/34] Import 64-bit comparison from CM0 library

2022-10-31 Thread Daniel Engel
These are 2-5 instructions smaller and just as fast.  Branches are
minimized, which will allow easier adaptation to Thumb-2/ARM mode.
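
The two return-value conventions produced from the shared source can be written in C
(illustrative only; lcmp/cmpdi2 below are local stand-ins, and the trailing +1 in the
cmpdi2 build is the only difference between the two object files):

  #include <stdio.h>

  /* __aeabi_lcmp convention: {-1, 0, +1} for {<, ==, >}.  */
  static int lcmp(long long a, long long b)
  {
      return (a < b) ? -1 : (a > b) ? 1 : 0;
  }

  /* __cmpdi2 convention: {0, 1, 2} for {<, ==, >}, i.e. lcmp shifted by one.  */
  static int cmpdi2(long long a, long long b)
  {
      return lcmp(a, b) + 1;
  }

  int main(void)
  {
      printf("%d %d\n", lcmp(-5LL, 3LL), cmpdi2(-5LL, 3LL)); /* -1 0 */
      printf("%d %d\n", lcmp(7LL, 7LL),  cmpdi2(7LL, 7LL));  /*  0 1 */
      return 0;
  }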

gcc/libgcc/ChangeLog:
2022-10-09 Daniel Engel 

* config/arm/eabi/lcmp.S (__aeabi_lcmp, __aeabi_ulcmp): Replaced;
add macro configuration to build __cmpdi2() and __ucmpdi2().
* config/arm/t-elf (LIB1ASMFUNCS): Added _cmpdi2 and _ucmpdi2.
---
 libgcc/config/arm/eabi/lcmp.S | 151 +-
 libgcc/config/arm/t-elf   |   2 +
 2 files changed, 112 insertions(+), 41 deletions(-)

diff --git a/libgcc/config/arm/eabi/lcmp.S b/libgcc/config/arm/eabi/lcmp.S
index 336db1d398c..99c7970ecba 100644
--- a/libgcc/config/arm/eabi/lcmp.S
+++ b/libgcc/config/arm/eabi/lcmp.S
@@ -1,8 +1,7 @@
-/* Miscellaneous BPABI functions.  Thumb-1 implementation, suitable for ARMv4T,
-   ARMv6-M and ARMv8-M Baseline like ISA variants.
+/* lcmp.S: Thumb-1 optimized 64-bit integer comparison
 
-   Copyright (C) 2006-2020 Free Software Foundation, Inc.
-   Contributed by CodeSourcery.
+   Copyright (C) 2018-2022 Free Software Foundation, Inc.
+   Contributed by Daniel Engel, Senva Inc (g...@danielengel.com)
 
This file is free software; you can redistribute it and/or modify it
under the terms of the GNU General Public License as published by the
@@ -24,50 +23,120 @@
.  */
 
 
+#if defined(L_aeabi_lcmp) || defined(L_cmpdi2)
+
 #ifdef L_aeabi_lcmp
+  #define LCMP_NAME aeabi_lcmp
+  #define LCMP_SECTION .text.sorted.libgcc.lcmp
+#else
+  #define LCMP_NAME cmpdi2
+  #define LCMP_SECTION .text.sorted.libgcc.cmpdi2
+#endif
+
+// int __aeabi_lcmp(long long, long long)
+// int __cmpdi2(long long, long long)
+// Compares the 64 bit signed values in $r1:$r0 and $r3:$r2.
+// lcmp() returns $r0 = { -1, 0, +1 } for orderings { <, ==, > } respectively.
+// cmpdi2() returns $r0 = { 0, 1, 2 } for orderings { <, ==, > } respectively.
+// Object file duplication assumes typical programs follow one runtime ABI.
+FUNC_START_SECTION LCMP_NAME LCMP_SECTION
+CFI_START_FUNCTION
+
+// Calculate the difference $r1:$r0 - $r3:$r2.
+subsxxl,yyl
+sbcsxxh,yyh
+
+// With $r2 free, create a known offset value without affecting
+//  the N or Z flags.
+// BUG? The originally unified instruction for v6m was 'mov r2, r3'.
+//  However, this resulted in a compile error with -mthumb:
+//"MOV Rd, Rs with two low registers not permitted".
+// Since unified syntax deprecates the "cpy" instruction, shouldn't
+//  there be a backwards-compatible translation available?
+cpy r2, r3
+
+// Evaluate the comparison result.
+blt LLSYM(__lcmp_lt)
+
+// The reference offset ($r2 - $r3) will be +2 iff the first
+//  argument is larger, otherwise the offset value remains 0.
+addsr2, #2
+
+// Check for zero (equality in 64 bits).
+// It doesn't matter which register was originally "hi".
+orrsr0,r1
+
+// The result is already 0 on equality.
+beq LLSYM(__lcmp_return)
+
+LLSYM(__lcmp_lt):
+// Create +1 or -1 from the offset value defined earlier.
+addsr3, #1
+subsr0, r2, r3
+
+LLSYM(__lcmp_return):
+  #ifdef L_cmpdi2
+// Offset to the correct output specification.
+addsr0, #1
+  #endif
 
-FUNC_START aeabi_lcmp
-cmp xxh, yyh
-beq 1f
-bgt 2f
-movsr0, #1
-negsr0, r0
-RET
-2:
-movsr0, #1
-RET
-1:
-subsr0, xxl, yyl
-beq 1f
-bhi 2f
-movsr0, #1
-negsr0, r0
-RET
-2:
-movsr0, #1
-1:
 RET
-FUNC_END aeabi_lcmp
 
-#endif /* L_aeabi_lcmp */
+CFI_END_FUNCTION
+FUNC_END LCMP_NAME
+
+#endif /* L_aeabi_lcmp || L_cmpdi2 */
+
+
+#if defined(L_aeabi_ulcmp) || defined(L_ucmpdi2)
 
 #ifdef L_aeabi_ulcmp
+  #define ULCMP_NAME aeabi_ulcmp
+  #define ULCMP_SECTION .text.sorted.libgcc.ulcmp
+#else
+  #define ULCMP_NAME ucmpdi2
+  #define ULCMP_SECTION .text.sorted.libgcc.ucmpdi2
+#endif
+
+// int __aeabi_ulcmp(unsigned long long, unsigned long long)
+// int __ucmpdi2(unsigned long long, unsigned long long)
+// Compares the 64 bit unsigned values in $r1:$r0 and $r3:$r2.
+// ulcmp() returns $r0 = { -1, 0, +1 } for orderings { <, ==, > } respectively.
+// ucmpdi2() returns $r0 = { 0, 1, 2 } for orderings { <, ==, > } respectively.
+// Object file duplication assumes typical programs follow one runtime ABI.
+FUNC_START_SECTION ULCMP_NAME ULCMP_SECTION
+CFI_START_FUNCTION
+
+// Calculate the 'C' flag.
+subsxxl,yyl
+sbcsxxh,yyh
+
+// Capture the carry flag.
+// $r2 will contain -1 if the first value is smaller,
+//  0 if the first value is larger or 

[PATCH v7 09/34] Import 'clz' functions from the CM0 library

2022-10-31 Thread Daniel Engel
On architectures without __ARM_FEATURE_CLZ, this version combines __clzdi2()
with __clzsi2() into a single object with an efficient tail call.  Also, this
version merges the formerly separate Thumb and ARM code implementations
into a unified instruction sequence.  This change significantly improves
Thumb performance without affecting ARM performance.  Finally, this version
adds a new __OPTIMIZE_SIZE__ build option (binary search loop).

There is no change to the code for architectures with __ARM_FEATURE_CLZ.
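
The general shape of a binary-search clz, and the way __clzdi2() now chains onto the
single-word code, corresponds to the following C sketch (illustrative only; the
library's actual step sequence differs, and clzsi2/clzdi2 below are local stand-ins,
not the libgcc entry points):

  #include <stdint.h>
  #include <stdio.h>

  static int clzsi2(uint32_t x)
  {
      int n = 0;
      if (x == 0)
          return 32;
      /* Binary search: halve the remaining width at each step.  */
      if (!(x & 0xFFFF0000u)) { n += 16; x <<= 16; }
      if (!(x & 0xFF000000u)) { n += 8;  x <<= 8;  }
      if (!(x & 0xF0000000u)) { n += 4;  x <<= 4;  }
      if (!(x & 0xC0000000u)) { n += 2;  x <<= 2;  }
      if (!(x & 0x80000000u)) { n += 1; }
      return n;
  }

  static int clzdi2(uint64_t x)
  {
      uint32_t lo = (uint32_t)x, hi = (uint32_t)(x >> 32);
      /* Either count the high word, or add 32 and count the low word.  */
      return hi ? clzsi2(hi) : 32 + clzsi2(lo);
  }

  int main(void)
  {
      printf("%d %d %d\n", clzsi2(1), clzdi2(1), clzdi2(1ULL << 40)); /* 31 63 23 */
      return 0;
  }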

gcc/libgcc/ChangeLog:
2022-10-09 Daniel Engel 

* config/arm/clz2.S (__clzsi2, __clzdi2): Reduced code size on
architectures without __ARM_FEATURE_CLZ.
* config/arm/t-elf (LIB1ASMFUNCS): Moved _clzsi2 to the new weak group.
---
 libgcc/config/arm/clz2.S | 363 +--
 libgcc/config/arm/t-elf  |   7 +-
 2 files changed, 237 insertions(+), 133 deletions(-)

diff --git a/libgcc/config/arm/clz2.S b/libgcc/config/arm/clz2.S
index 439341752ba..ed04698fef4 100644
--- a/libgcc/config/arm/clz2.S
+++ b/libgcc/config/arm/clz2.S
@@ -1,145 +1,244 @@
-/* Copyright (C) 1995-2022 Free Software Foundation, Inc.
+/* clz2.S: Cortex M0 optimized 'clz' functions
 
-This file is free software; you can redistribute it and/or modify it
-under the terms of the GNU General Public License as published by the
-Free Software Foundation; either version 3, or (at your option) any
-later version.
+   Copyright (C) 2018-2022 Free Software Foundation, Inc.
+   Contributed by Daniel Engel (g...@danielengel.com)
 
-This file is distributed in the hope that it will be useful, but
-WITHOUT ANY WARRANTY; without even the implied warranty of
-MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU
-General Public License for more details.
+   This file is free software; you can redistribute it and/or modify it
+   under the terms of the GNU General Public License as published by the
+   Free Software Foundation; either version 3, or (at your option) any
+   later version.
 
-Under Section 7 of GPL version 3, you are granted additional
-permissions described in the GCC Runtime Library Exception, version
-3.1, as published by the Free Software Foundation.
+   This file is distributed in the hope that it will be useful, but
+   WITHOUT ANY WARRANTY; without even the implied warranty of
+   MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU
+   General Public License for more details.
 
-You should have received a copy of the GNU General Public License and
-a copy of the GCC Runtime Library Exception along with this program;
-see the files COPYING3 and COPYING.RUNTIME respectively.  If not, see
-.  */
+   Under Section 7 of GPL version 3, you are granted additional
+   permissions described in the GCC Runtime Library Exception, version
+   3.1, as published by the Free Software Foundation.
+
+   You should have received a copy of the GNU General Public License and
+   a copy of the GCC Runtime Library Exception along with this program;
+   see the files COPYING3 and COPYING.RUNTIME respectively.  If not, see
+   .  */
+
+
+#if defined(__ARM_FEATURE_CLZ) && __ARM_FEATURE_CLZ
+
+#ifdef L_clzdi2
+
+// int __clzdi2(long long)
+// Counts leading zero bits in $r1:$r0.
+// Returns the result in $r0.
+FUNC_START_SECTION clzdi2 .text.sorted.libgcc.clz2.clzdi2
+CFI_START_FUNCTION
+
+// Moved here from lib1funcs.S
+cmp xxh,#0
+do_it   eq, et
+clzeq   r0, xxl
+clzne   r0, xxh
+addeq   r0, #32
+RET
+
+CFI_END_FUNCTION
+FUNC_END clzdi2
+
+#endif /* L_clzdi2 */
 
 
 #ifdef L_clzsi2
-#ifdef NOT_ISA_TARGET_32BIT
-FUNC_START clzsi2
-   movsr1, #28
-   movsr3, #1
-   lslsr3, r3, #16
-   cmp r0, r3 /* 0x1 */
-   bcc 2f
-   lsrsr0, r0, #16
-   subsr1, r1, #16
-2: lsrsr3, r3, #8
-   cmp r0, r3 /* #0x100 */
-   bcc 2f
-   lsrsr0, r0, #8
-   subsr1, r1, #8
-2: lsrsr3, r3, #4
-   cmp r0, r3 /* #0x10 */
-   bcc 2f
-   lsrsr0, r0, #4
-   subsr1, r1, #4
-2: adr r2, 1f
-   ldrbr0, [r2, r0]
-   addsr0, r0, r1
-   bx lr
-.align 2
-1:
-.byte 4, 3, 2, 2, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0
-   FUNC_END clzsi2
-#else
-ARM_FUNC_START clzsi2
-# if defined (__ARM_FEATURE_CLZ)
-   clz r0, r0
-   RET
-# else
-   mov r1, #28
-   cmp r0, #0x1
-   do_it   cs, t
-   movcs   r0, r0, lsr #16
-   subcs   r1, r1, #16
-   cmp r0, #0x100
-   do_it   cs, t
-   movcs   r0, r0, lsr #8
-   subcs   r1, r1, #8
-   cmp r0, #0x10
-   do_it   cs, t
-   movcs   r0, r0, lsr #4
-   subcs   r1, r1, #4
-   adr r2, 1f
-   ldrbr0, [r2, r0]
-   add r0, r0, r1
-   RET
-.align 2
-1:
-.byte 4, 3, 2, 2, 1, 1, 

[PATCH v7 03/34] Fix syntax warnings on conditional instructions

2022-10-31 Thread Daniel Engel
gcc/libgcc/ChangeLog:
2022-10-09 Daniel Engel 

* config/arm/lib1funcs.S (RETLDM, ARM_DIV_BODY, ARM_MOD_BODY,
_interwork_call_via_lr): Moved condition code after the flags
update specifier "s".
(ARM_FUNC_START, THUMB_LDIV0): Removed redundant ".syntax".
---
 libgcc/config/arm/lib1funcs.S | 12 +---
 1 file changed, 5 insertions(+), 7 deletions(-)

diff --git a/libgcc/config/arm/lib1funcs.S b/libgcc/config/arm/lib1funcs.S
index 726984a9d1d..f2f82f9d509 100644
--- a/libgcc/config/arm/lib1funcs.S
+++ b/libgcc/config/arm/lib1funcs.S
@@ -204,7 +204,7 @@ LSYM(Lend_fde):
 # if defined(__thumb2__)
pop\cond{\regs, lr}
 # else
-   ldm\cond\dirn   sp!, {\regs, lr}
+   ldm\dirn\cond   sp!, {\regs, lr}
 # endif
.endif
.ifnc "\unwind", ""
@@ -220,7 +220,7 @@ LSYM(Lend_fde):
 # if defined(__thumb2__)
pop\cond{\regs, pc}
 # else
-   ldm\cond\dirn   sp!, {\regs, pc}
+   ldm\dirn\cond   sp!, {\regs, pc}
 # endif
.endif
 #endif
@@ -292,7 +292,6 @@ LSYM(Lend_fde):
pop {r1, pc}
 
 #elif defined(__thumb2__)
-   .syntax unified
.ifc \signed, unsigned
cbz r0, 1f
mov r0, #0x
@@ -429,7 +428,6 @@ SYM (__\name):
 /* For Thumb-2 we build everything in thumb mode.  */
 .macro ARM_FUNC_START name
FUNC_START \name
-   .syntax unified
 .endm
 #define EQUIV .thumb_set
 .macro  ARM_CALL name
@@ -643,7 +641,7 @@ pc  .reqr15
orrhs   \result,   \result,   \curbit,  lsr #3
cmp \dividend, #0   @ Early termination?
do_it   ne, t
-   movnes  \curbit,   \curbit,  lsr #4 @ No, any more bits to do?
+   movsne  \curbit,   \curbit,  lsr #4 @ No, any more bits to do?
movne   \divisor,  \divisor, lsr #4
bne 1b
 
@@ -745,7 +743,7 @@ pc  .reqr15
subhs   \dividend, \dividend, \divisor, lsr #3
cmp \dividend, #1
mov \divisor, \divisor, lsr #4
-   subges  \order, \order, #4
+   subsge  \order, \order, #4
bge 1b
 
tst \order, #3
@@ -2093,7 +2091,7 @@ LSYM(Lchange_\register):
.globl .Lchange_lr
 .Lchange_lr:
tst lr, #1
-   stmeqdb r13!, {lr, pc}
+   stmdbeq r13!, {lr, pc}
mov ip, lr
adreq   lr, _arm_return
bx  ip
-- 
2.34.1



[PATCH v7 08/34] Refactor 64-bit shift functions into a new file

2022-10-31 Thread Daniel Engel
This will make it easier to isolate changes in subsequent patches.

gcc/libgcc/ChangeLog:
2022-10-09 Daniel Engel 

* config/arm/lib1funcs.S (__ashldi3, __ashrdi3, __lshrdi3): Moved to ...
* config/arm/eabi/lshift.S: New file.
---
 libgcc/config/arm/eabi/lshift.S | 123 
 libgcc/config/arm/lib1funcs.S   | 103 +-
 2 files changed, 124 insertions(+), 102 deletions(-)
 create mode 100644 libgcc/config/arm/eabi/lshift.S

diff --git a/libgcc/config/arm/eabi/lshift.S b/libgcc/config/arm/eabi/lshift.S
new file mode 100644
index 000..6e79d96c118
--- /dev/null
+++ b/libgcc/config/arm/eabi/lshift.S
@@ -0,0 +1,123 @@
+/* Copyright (C) 1995-2022 Free Software Foundation, Inc.
+
+This file is free software; you can redistribute it and/or modify it
+under the terms of the GNU General Public License as published by the
+Free Software Foundation; either version 3, or (at your option) any
+later version.
+
+This file is distributed in the hope that it will be useful, but
+WITHOUT ANY WARRANTY; without even the implied warranty of
+MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU
+General Public License for more details.
+
+Under Section 7 of GPL version 3, you are granted additional
+permissions described in the GCC Runtime Library Exception, version
+3.1, as published by the Free Software Foundation.
+
+You should have received a copy of the GNU General Public License and
+a copy of the GCC Runtime Library Exception along with this program;
+see the files COPYING3 and COPYING.RUNTIME respectively.  If not, see
+.  */
+
+
+#ifdef L_lshrdi3
+
+   FUNC_START lshrdi3
+   FUNC_ALIAS aeabi_llsr lshrdi3
+   
+#ifdef __thumb__
+   lsrsal, r2
+   movsr3, ah
+   lsrsah, r2
+   mov ip, r3
+   subsr2, #32
+   lsrsr3, r2
+   orrsal, r3
+   negsr2, r2
+   mov r3, ip
+   lslsr3, r2
+   orrsal, r3
+   RET
+#else
+   subsr3, r2, #32
+   rsb ip, r2, #32
+   movmi   al, al, lsr r2
+   movpl   al, ah, lsr r3
+   orrmi   al, al, ah, lsl ip
+   mov ah, ah, lsr r2
+   RET
+#endif
+   FUNC_END aeabi_llsr
+   FUNC_END lshrdi3
+
+#endif
+   
+#ifdef L_ashrdi3
+   
+   FUNC_START ashrdi3
+   FUNC_ALIAS aeabi_lasr ashrdi3
+   
+#ifdef __thumb__
+   lsrsal, r2
+   movsr3, ah
+   asrsah, r2
+   subsr2, #32
+   @ If r2 is negative at this point the following step would OR
+   @ the sign bit into all of AL.  That's not what we want...
+   bmi 1f
+   mov ip, r3
+   asrsr3, r2
+   orrsal, r3
+   mov r3, ip
+1:
+   negsr2, r2
+   lslsr3, r2
+   orrsal, r3
+   RET
+#else
+   subsr3, r2, #32
+   rsb ip, r2, #32
+   movmi   al, al, lsr r2
+   movpl   al, ah, asr r3
+   orrmi   al, al, ah, lsl ip
+   mov ah, ah, asr r2
+   RET
+#endif
+
+   FUNC_END aeabi_lasr
+   FUNC_END ashrdi3
+
+#endif
+
+#ifdef L_ashldi3
+
+   FUNC_START ashldi3
+   FUNC_ALIAS aeabi_llsl ashldi3
+   
+#ifdef __thumb__
+   lslsah, r2
+   movsr3, al
+   lslsal, r2
+   mov ip, r3
+   subsr2, #32
+   lslsr3, r2
+   orrsah, r3
+   negsr2, r2
+   mov r3, ip
+   lsrsr3, r2
+   orrsah, r3
+   RET
+#else
+   subsr3, r2, #32
+   rsb ip, r2, #32
+   movmi   ah, ah, lsl r2
+   movpl   ah, al, lsl r3
+   orrmi   ah, ah, al, lsr ip
+   mov al, al, lsl r2
+   RET
+#endif
+   FUNC_END aeabi_llsl
+   FUNC_END ashldi3
+
+#endif
+
diff --git a/libgcc/config/arm/lib1funcs.S b/libgcc/config/arm/lib1funcs.S
index 6cf7561835d..aa5957b8399 100644
--- a/libgcc/config/arm/lib1funcs.S
+++ b/libgcc/config/arm/lib1funcs.S
@@ -1699,108 +1699,7 @@ LSYM(Lover12):
 
 /* Prevent __aeabi double-word shifts from being produced on SymbianOS.  */
 #ifndef __symbian__
-
-#ifdef L_lshrdi3
-
-   FUNC_START lshrdi3
-   FUNC_ALIAS aeabi_llsr lshrdi3
-   
-#ifdef __thumb__
-   lsrsal, r2
-   movsr3, ah
-   lsrsah, r2
-   mov ip, r3
-   subsr2, #32
-   lsrsr3, r2
-   orrsal, r3
-   negsr2, r2
-   mov r3, ip
-   lslsr3, r2
-   orrsal, r3
-   RET
-#else
-   subsr3, r2, #32
-   rsb ip, r2, #32
-   movmi   al, al, lsr r2
-   movpl   al, ah, lsr r3
-   orrmi   al, al, ah, lsl ip
-   mov ah, ah, lsr r2
-   RET
-#endif
-   FUNC_END aeabi_llsr
-   FUNC_END lshrdi3
-
-#endif
-   
-#ifdef L_ashrdi3
-   
-   FUNC_START ashrdi3
-   FUNC_ALIAS aeabi_lasr ashrdi3
-   
-#ifdef __thumb__
-   lsrsal, r2
-   movsr3, ah
-   asrsah, r2
-   subsr2, #32
-

[PATCH v7 15/34] Import 'popcnt' functions from the CM0 library

2022-10-31 Thread Daniel Engel
The functional overlap between the single- and double-word functions
makes this implementation about 30% smaller than the C functions
if both functions are linked together in the same application.
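
The mask-and-add reduction referenced by the code comments ("one-bit alternating
mask", "two-bit alternating mask") is the classic SWAR popcount; the data flow can be
sketched in C as below (illustrative only: the final byte summation here uses a
multiply for brevity, and the assembly interleaves the two words of the 64-bit
argument rather than calling the single-word routine twice):

  #include <stdint.h>
  #include <stdio.h>

  static int popcountsi2(uint32_t x)
  {
      x -= (x >> 1) & 0x55555555u;                        /* 2-bit sums */
      x = (x & 0x33333333u) + ((x >> 2) & 0x33333333u);   /* 4-bit sums */
      x = (x + (x >> 4)) & 0x0F0F0F0Fu;                   /* 8-bit sums */
      return (int)((x * 0x01010101u) >> 24);              /* add the four bytes */
  }

  static int popcountdi2(uint64_t x)
  {
      return popcountsi2((uint32_t)x) + popcountsi2((uint32_t)(x >> 32));
  }

  int main(void)
  {
      printf("%d %d\n", popcountsi2(0xF0F0F0F0u), popcountdi2(~0ULL)); /* 16 64 */
      return 0;
  }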

gcc/libgcc/ChangeLog:
2022-10-09 Daniel Engel 

* config/arm/popcnt.S (__popcountsi2, __popcountdi2): New file.
* config/arm/lib1funcs.S: #include popcnt.S
* config/arm/t-elf (LIB1ASMFUNCS): Add _popcountsi2/di2.
---
 libgcc/config/arm/lib1funcs.S |   1 +
 libgcc/config/arm/popcnt.S| 189 ++
 libgcc/config/arm/t-elf   |   2 +
 3 files changed, 192 insertions(+)
 create mode 100644 libgcc/config/arm/popcnt.S

diff --git a/libgcc/config/arm/lib1funcs.S b/libgcc/config/arm/lib1funcs.S
index 3f7b9e739f0..0eb6d1d52a7 100644
--- a/libgcc/config/arm/lib1funcs.S
+++ b/libgcc/config/arm/lib1funcs.S
@@ -1705,6 +1705,7 @@ LSYM(Lover12):
 #include "clz2.S"
 #include "ctz2.S"
 #include "parity.S"
+#include "popcnt.S"
 
 /*  */
 /* These next two sections are here despite the fact that they contain Thumb 
diff --git a/libgcc/config/arm/popcnt.S b/libgcc/config/arm/popcnt.S
new file mode 100644
index 000..4613ea475b0
--- /dev/null
+++ b/libgcc/config/arm/popcnt.S
@@ -0,0 +1,189 @@
+/* popcnt.S: ARM optimized popcount functions
+
+   Copyright (C) 2020-2022 Free Software Foundation, Inc.
+   Contributed by Daniel Engel (g...@danielengel.com)
+
+   This file is free software; you can redistribute it and/or modify it
+   under the terms of the GNU General Public License as published by the
+   Free Software Foundation; either version 3, or (at your option) any
+   later version.
+
+   This file is distributed in the hope that it will be useful, but
+   WITHOUT ANY WARRANTY; without even the implied warranty of
+   MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU
+   General Public License for more details.
+
+   Under Section 7 of GPL version 3, you are granted additional
+   permissions described in the GCC Runtime Library Exception, version
+   3.1, as published by the Free Software Foundation.
+
+   You should have received a copy of the GNU General Public License and
+   a copy of the GCC Runtime Library Exception along with this program;
+   see the files COPYING3 and COPYING.RUNTIME respectively.  If not, see
+   .  */
+
+
+#ifdef L_popcountdi2
+
+// int __popcountdi2(int)
+// Returns the number of bits set in $r1:$r0.
+// Returns the result in $r0.
+FUNC_START_SECTION popcountdi2 .text.sorted.libgcc.popcountdi2
+CFI_START_FUNCTION
+
+  #if defined(__OPTIMIZE_SIZE__) && __OPTIMIZE_SIZE__
+// Initialize the result.
+// Compensate for the two extra loop (one for each word)
+//  required to detect zero arguments.
+movsr2, #2
+
+LLSYM(__popcountd_loop):
+// Same as __popcounts_loop below, except for $r1.
+subsr2, #1
+subsr3, r1, #1
+andsr1, r3
+bcs LLSYM(__popcountd_loop)
+
+// Repeat the operation for the second word.
+b   LLSYM(__popcounts_loop)
+
+  #else /* !__OPTIMIZE_SIZE__ */
+// Load the one-bit alternating mask.
+ldr r3, =0x
+
+// Reduce the second word.
+lsrsr2, r1, #1
+andsr2, r3
+subsr1, r2
+
+// Reduce the first word.
+lsrsr2, r0, #1
+andsr2, r3
+subsr0, r2
+
+// Load the two-bit alternating mask.
+ldr r3, =0x
+
+// Reduce the second word.
+lsrsr2, r1, #2
+andsr2, r3
+andsr1, r3
+addsr1, r2
+
+// Reduce the first word.
+lsrsr2, r0, #2
+andsr2, r3
+andsr0, r3
+addsr0, r2
+
+// There will be a maximum of 8 bits in each 4-bit field.
+// Jump into the single word flow to combine and complete.
+b   LLSYM(__popcounts_merge)
+
+  #endif /* !__OPTIMIZE_SIZE__ */
+#endif /* L_popcountdi2 */
+
+
+// The implementation of __popcountdi2() tightly couples with __popcountsi2(),
+//  such that instructions must appear consecutively in the same memory
+//  section for proper flow control.  However, this construction inhibits
+//  the ability to discard __popcountdi2() when only using __popcountsi2().
+// Therefore, this block configures __popcountsi2() for compilation twice.
+// The first version is a minimal standalone implementation, and the second
+//  version is the continuation of __popcountdi2().  The standalone version 
must
+//  be declared WEAK, so that the combined version can supersede it and
+//  provide both symbols when required.
+// '_popcountsi2' should appear before '_popcountdi2' in LIB1ASMFUNCS.
+#if defined(L_popcountsi2) || 

[PATCH v7 05/34] Add the __HAVE_FEATURE_IT and IT() macros

2022-10-31 Thread Daniel Engel
These macros complement and extend the existing do_it() macro.
Together, they streamline the process of optimizing short branchless
conditional sequences to support ARM, Thumb-2, and Thumb-1.

The inherent architecture limitations of Thumb-1 mean that writing
assembly code is somewhat more tedious.  And, while such code will run
unmodified in an ARM or Thumb-2 environment, it will lack one of the
key performance optimizations available there.

Initially, the first idea might be to split an instruction sequence
with #ifdef(s): one path for Thumb-1 and the other for ARM/Thumb-2.
This could suffice if conditional execution optimizations were rare.

However, #ifdef(s) break the flow of an algorithm and shift focus to the
architectural differences instead of the similarities.  On functions
with a high percentage of conditional execution, it starts to become
attractive to split everything into distinct architecture-specific
function objects -- even when the underlying algorithm is identical.

Additionally, duplicated code and comments (whether an individual
operand, a line, or a larger block) become a future maintenance
liability if the two versions aren't kept in sync.

See code comments for limitations and expected usage.

gcc/libgcc/ChangeLog:
2022-10-09 Daniel Engel 

* config/arm/lib1funcs.S (__HAVE_FEATURE_IT, IT): New macros.
---
 libgcc/config/arm/lib1funcs.S | 68 +++
 1 file changed, 68 insertions(+)

diff --git a/libgcc/config/arm/lib1funcs.S b/libgcc/config/arm/lib1funcs.S
index f2f82f9d509..7a941ee9fc8 100644
--- a/libgcc/config/arm/lib1funcs.S
+++ b/libgcc/config/arm/lib1funcs.S
@@ -230,6 +230,7 @@ LSYM(Lend_fde):
ARM and Thumb-2.  However this is only supported by recent gas, so define
a set of macros to allow ARM code on older assemblers.  */
 #if defined(__thumb2__)
+#define __HAVE_FEATURE_IT
 .macro do_it cond, suffix=""
it\suffix   \cond
 .endm
@@ -245,6 +246,9 @@ LSYM(Lend_fde):
\name \dest, \src1, \tmp
 .endm
 #else
+#if !defined(__thumb__)
+#define __HAVE_FEATURE_IT
+#endif
 .macro do_it cond, suffix=""
 .endm
 .macro shift1 op, arg0, arg1, arg2
@@ -259,6 +263,70 @@ LSYM(Lend_fde):
 
 #define COND(op1, op2, cond) op1 ## op2 ## cond
 
+
+/* The IT() macro streamlines the construction of short branchless conditional
+sequences that support ARM, Thumb-2, and Thumb-1.  It is intended as an
+extension to the .do_it macro defined above.  Code not written with the
+intent to support Thumb-1 need not use IT().
+
+   IT()'s main advantage is the minimization of syntax differences.  Unified
+functions can support Thumb-1 without imposiing an undue performance
+penalty on ARM and Thumb-2.  Writing code without duplicate instructions
+and operands keeps the high level function flow clearer and should reduce
+the incidence of maintenance bugs.
+
+   Where conditional execution is supported by ARM and Thumb-2, the specified
+instruction compiles with the conditional suffix 'c'.
+
+   Where Thumb-1 and v6m do not support IT, the given instruction compiles
+with the standard unified syntax suffix "s", and a preceding branch
+instruction is required to implement conditional behavior.
+
+   (Aside: The Thumb-1 "s"-suffix pattern is somewhat simplistic, since it
+does not support 'cmp' or 'tst' with a non-"s" suffix.  It also appends
+"s" to 'mov' and 'add' with high register operands which are otherwise
+legal on v6m.  Use of IT() will result in a compiler error for all of
+these exceptional cases, and a full #ifdef code split will be required.
+However, it is unlikely that code written with Thumb-1 compatibility
+in mind will use such patterns, so IT() still promises a good value.)
+
+   Typical if/then/else usage is:
+
+#ifdef __HAVE_FEATURE_IT
+// ARM and Thumb-2 'true' condition.
+do_it   c,  tee
+#else
+// Thumb-1 'false' condition.  This must be opposite the
+//  sense of the ARM and Thumb-2 condition, since the
+//  branch is taken to skip the 'true' instruction block.
+b!c else_label
+#endif
+
+// Conditional 'true' execution for all compile modes.
+ IT(ins1,c) op1,op2
+ IT(ins2,c) op1,op2
+
+#ifndef __HAVE_FEATURE_IT
+// Thumb-1 branch to skip the 'else' instruction block.
+// Omitted for if/then usage.
+b   end_label
+#endif
+
+   else_label:
+// Conditional 'false' execution for all compile modes.
+// Omitted for if/then usage.
+ IT(ins3,!c) op1,   op2
+ IT(ins4,!c) op1,   op2
+
+   end_label:
+// Unconditional execution resumes here.
+ */
+#ifdef __HAVE_FEATURE_IT
+  #define IT(ins,c) ins##c
+#else
+  #define IT(ins,c) ins##s
+#endif
+
 #ifdef __ARM_EABI__
 .macro ARM_LDIV0 name signed
cmp r0, #0
-- 
2.34.1



[PATCH v7 02/34] Rename THUMB_FUNC_START to THUMB_FUNC_ENTRY

2022-10-31 Thread Daniel Engel
Since THUMB_FUNC_START does not insert the ".text" directive, it aligns
more closely with the new FUNC_ENTRY macro and is renamed accordingly.

THUMB_FUNC_START usage has been universally synonymous with the
".force_thumb" directive, so this is now folded into the definition.
Usage of ".force_thumb" and ".thumb_func" is now tightly coupled
throughout the "arm" subdirectory.

gcc/libgcc/ChangeLog:
2022-10-09 Daniel Engel 

* config/arm/lib1funcs.S: (THUMB_FUNC_START): Renamed to ...
(THUMB_FUNC_ENTRY): for consistency; also added ".force_thumb".
(_call_via_r0): Removed redundant preceding ".force_thumb".
(__gnu_thumb1_case_sqi, __gnu_thumb1_case_uqi, __gnu_thumb1_case_shi,
__gnu_thumb1_case_si): Removed redundant ".force_thumb" and ".syntax".
---
 libgcc/config/arm/lib1funcs.S | 32 +++-
 1 file changed, 11 insertions(+), 21 deletions(-)

diff --git a/libgcc/config/arm/lib1funcs.S b/libgcc/config/arm/lib1funcs.S
index a4fa62b3832..726984a9d1d 100644
--- a/libgcc/config/arm/lib1funcs.S
+++ b/libgcc/config/arm/lib1funcs.S
@@ -358,10 +358,11 @@ LSYM(Ldiv0):
 #define THUMB_CODE
 #endif
 
-.macro THUMB_FUNC_START name
+.macro THUMB_FUNC_ENTRY name
.globl  SYM (\name)
TYPE(\name)
.thumb_func
+   .force_thumb
 SYM (\name):
 .endm
 
@@ -1944,10 +1945,9 @@ ARM_FUNC_START ctzsi2

.text
.align 0
-.force_thumb
 
 .macro call_via register
-   THUMB_FUNC_START _call_via_\register
+   THUMB_FUNC_ENTRY _call_via_\register
 
bx  \register
nop
@@ -2030,7 +2030,7 @@ _arm_return_r11:
 .macro interwork_with_frame frame, register, name, return
.code   16
 
-   THUMB_FUNC_START \name
+   THUMB_FUNC_ENTRY \name
 
bx  pc
nop
@@ -2047,7 +2047,7 @@ _arm_return_r11:
 .macro interwork register
.code   16
 
-   THUMB_FUNC_START _interwork_call_via_\register
+   THUMB_FUNC_ENTRY _interwork_call_via_\register
 
bx  pc
nop
@@ -2084,7 +2084,7 @@ LSYM(Lchange_\register):
/* The LR case has to be handled a little differently...  */
.code 16
 
-   THUMB_FUNC_START _interwork_call_via_lr
+   THUMB_FUNC_ENTRY _interwork_call_via_lr
 
bx  pc
nop
@@ -2112,9 +2112,7 @@ LSYM(Lchange_\register):

.text
.align 0
-.force_thumb
-   .syntax unified
-   THUMB_FUNC_START __gnu_thumb1_case_sqi
+   THUMB_FUNC_ENTRY __gnu_thumb1_case_sqi
push{r1}
mov r1, lr
lsrsr1, r1, #1
@@ -2131,9 +2129,7 @@ LSYM(Lchange_\register):

.text
.align 0
-.force_thumb
-   .syntax unified
-   THUMB_FUNC_START __gnu_thumb1_case_uqi
+   THUMB_FUNC_ENTRY __gnu_thumb1_case_uqi
push{r1}
mov r1, lr
lsrsr1, r1, #1
@@ -2150,9 +2146,7 @@ LSYM(Lchange_\register):

.text
.align 0
-.force_thumb
-   .syntax unified
-   THUMB_FUNC_START __gnu_thumb1_case_shi
+   THUMB_FUNC_ENTRY __gnu_thumb1_case_shi
push{r0, r1}
mov r1, lr
lsrsr1, r1, #1
@@ -2170,9 +2164,7 @@ LSYM(Lchange_\register):

.text
.align 0
-.force_thumb
-   .syntax unified
-   THUMB_FUNC_START __gnu_thumb1_case_uhi
+   THUMB_FUNC_ENTRY __gnu_thumb1_case_uhi
push{r0, r1}
mov r1, lr
lsrsr1, r1, #1
@@ -2190,9 +2182,7 @@ LSYM(Lchange_\register):

.text
.align 0
-.force_thumb
-   .syntax unified
-   THUMB_FUNC_START __gnu_thumb1_case_si
+   THUMB_FUNC_ENTRY __gnu_thumb1_case_si
push{r0, r1}
mov r1, lr
adds.n  r1, r1, #2  /* Align to word.  */
-- 
2.34.1



[PATCH v7 06/34] Refactor 'clz' functions into a new file

2022-10-31 Thread Daniel Engel
This will make it easier to isolate changes in subsequent patches.

gcc/libgcc/ChangeLog:
2022-10-09 Daniel Engel 

* config/arm/lib1funcs.S (__clzsi2, __clzdi2): Moved to ...
* config/arm/clz2.S: New file.
---
 libgcc/config/arm/clz2.S  | 145 ++
 libgcc/config/arm/lib1funcs.S | 123 +---
 2 files changed, 146 insertions(+), 122 deletions(-)
 create mode 100644 libgcc/config/arm/clz2.S

diff --git a/libgcc/config/arm/clz2.S b/libgcc/config/arm/clz2.S
new file mode 100644
index 000..439341752ba
--- /dev/null
+++ b/libgcc/config/arm/clz2.S
@@ -0,0 +1,145 @@
+/* Copyright (C) 1995-2022 Free Software Foundation, Inc.
+
+This file is free software; you can redistribute it and/or modify it
+under the terms of the GNU General Public License as published by the
+Free Software Foundation; either version 3, or (at your option) any
+later version.
+
+This file is distributed in the hope that it will be useful, but
+WITHOUT ANY WARRANTY; without even the implied warranty of
+MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU
+General Public License for more details.
+
+Under Section 7 of GPL version 3, you are granted additional
+permissions described in the GCC Runtime Library Exception, version
+3.1, as published by the Free Software Foundation.
+
+You should have received a copy of the GNU General Public License and
+a copy of the GCC Runtime Library Exception along with this program;
+see the files COPYING3 and COPYING.RUNTIME respectively.  If not, see
+.  */
+
+
+#ifdef L_clzsi2
+#ifdef NOT_ISA_TARGET_32BIT
+FUNC_START clzsi2
+   movsr1, #28
+   movsr3, #1
+   lslsr3, r3, #16
+   cmp r0, r3 /* 0x1 */
+   bcc 2f
+   lsrsr0, r0, #16
+   subsr1, r1, #16
+2: lsrsr3, r3, #8
+   cmp r0, r3 /* #0x100 */
+   bcc 2f
+   lsrsr0, r0, #8
+   subsr1, r1, #8
+2: lsrsr3, r3, #4
+   cmp r0, r3 /* #0x10 */
+   bcc 2f
+   lsrsr0, r0, #4
+   subsr1, r1, #4
+2: adr r2, 1f
+   ldrbr0, [r2, r0]
+   addsr0, r0, r1
+   bx lr
+.align 2
+1:
+.byte 4, 3, 2, 2, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0
+   FUNC_END clzsi2
+#else
+ARM_FUNC_START clzsi2
+# if defined (__ARM_FEATURE_CLZ)
+   clz r0, r0
+   RET
+# else
+   mov r1, #28
+   cmp r0, #0x1
+   do_it   cs, t
+   movcs   r0, r0, lsr #16
+   subcs   r1, r1, #16
+   cmp r0, #0x100
+   do_it   cs, t
+   movcs   r0, r0, lsr #8
+   subcs   r1, r1, #8
+   cmp r0, #0x10
+   do_it   cs, t
+   movcs   r0, r0, lsr #4
+   subcs   r1, r1, #4
+   adr r2, 1f
+   ldrbr0, [r2, r0]
+   add r0, r0, r1
+   RET
+.align 2
+1:
+.byte 4, 3, 2, 2, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0
+# endif /* !defined (__ARM_FEATURE_CLZ) */
+   FUNC_END clzsi2
+#endif
+#endif /* L_clzsi2 */
+
+#ifdef L_clzdi2
+#if !defined (__ARM_FEATURE_CLZ)
+
+# ifdef NOT_ISA_TARGET_32BIT
+FUNC_START clzdi2
+   push{r4, lr}
+   cmp xxh, #0
+   bne 1f
+#  ifdef __ARMEB__
+   movsr0, xxl
+   bl  __clzsi2
+   addsr0, r0, #32
+   b 2f
+1:
+   bl  __clzsi2
+#  else
+   bl  __clzsi2
+   addsr0, r0, #32
+   b 2f
+1:
+   movsr0, xxh
+   bl  __clzsi2
+#  endif
+2:
+   pop {r4, pc}
+# else /* NOT_ISA_TARGET_32BIT */
+ARM_FUNC_START clzdi2
+   do_push {r4, lr}
+   cmp xxh, #0
+   bne 1f
+#  ifdef __ARMEB__
+   mov r0, xxl
+   bl  __clzsi2
+   add r0, r0, #32
+   b 2f
+1:
+   bl  __clzsi2
+#  else
+   bl  __clzsi2
+   add r0, r0, #32
+   b 2f
+1:
+   mov r0, xxh
+   bl  __clzsi2
+#  endif
+2:
+   RETLDM  r4
+   FUNC_END clzdi2
+# endif /* NOT_ISA_TARGET_32BIT */
+
+#else /* defined (__ARM_FEATURE_CLZ) */
+
+ARM_FUNC_START clzdi2
+   cmp xxh, #0
+   do_it   eq, et
+   clzeq   r0, xxl
+   clzne   r0, xxh
+   addeq   r0, r0, #32
+   RET
+   FUNC_END clzdi2
+
+#endif
+#endif /* L_clzdi2 */
+
diff --git a/libgcc/config/arm/lib1funcs.S b/libgcc/config/arm/lib1funcs.S
index 7a941ee9fc8..469fea9ab5c 100644
--- a/libgcc/config/arm/lib1funcs.S
+++ b/libgcc/config/arm/lib1funcs.S
@@ -1803,128 +1803,7 @@ LSYM(Lover12):
 
 #endif /* __symbian__ */
 
-#ifdef L_clzsi2
-#ifdef NOT_ISA_TARGET_32BIT
-FUNC_START clzsi2
-   movsr1, #28
-   movsr3, #1
-   lslsr3, r3, #16
-   cmp r0, r3 /* 0x1 */
-   bcc 2f
-   lsrsr0, r0, #16
-   subsr1, r1, #16
-2: lsrsr3, r3, #8
-   cmp r0, r3 /* #0x100 */
-   bcc 2f
-   lsrsr0, r0, #8
-   subsr1, r1, #8
-2: lsrsr3, r3, #4
-   cmp r0, r3 /* #0x10 */
-   bcc 2f
-   lsrs

[PATCH v7 07/34] Refactor 'ctz' functions into a new file

2022-10-31 Thread Daniel Engel
This will make it easier to isolate changes in subsequent patches.

gcc/libgcc/ChangeLog:
2022-10-09 Daniel Engel 

* config/arm/lib1funcs.S (__ctzsi2): Moved to ...
* config/arm/ctz2.S: New file.
---
 libgcc/config/arm/ctz2.S  | 86 +++
 libgcc/config/arm/lib1funcs.S | 65 +-
 2 files changed, 87 insertions(+), 64 deletions(-)
 create mode 100644 libgcc/config/arm/ctz2.S

diff --git a/libgcc/config/arm/ctz2.S b/libgcc/config/arm/ctz2.S
new file mode 100644
index 000..1d885dcc71a
--- /dev/null
+++ b/libgcc/config/arm/ctz2.S
@@ -0,0 +1,86 @@
+/* Copyright (C) 1995-2022 Free Software Foundation, Inc.
+
+This file is free software; you can redistribute it and/or modify it
+under the terms of the GNU General Public License as published by the
+Free Software Foundation; either version 3, or (at your option) any
+later version.
+
+This file is distributed in the hope that it will be useful, but
+WITHOUT ANY WARRANTY; without even the implied warranty of
+MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU
+General Public License for more details.
+
+Under Section 7 of GPL version 3, you are granted additional
+permissions described in the GCC Runtime Library Exception, version
+3.1, as published by the Free Software Foundation.
+
+You should have received a copy of the GNU General Public License and
+a copy of the GCC Runtime Library Exception along with this program;
+see the files COPYING3 and COPYING.RUNTIME respectively.  If not, see
+.  */
+
+
+#ifdef L_ctzsi2
+#ifdef NOT_ISA_TARGET_32BIT
+FUNC_START ctzsi2
+   negsr1, r0
+   andsr0, r0, r1
+   movsr1, #28
+   movsr3, #1
+   lslsr3, r3, #16
+   cmp r0, r3 /* 0x1 */
+   bcc 2f
+   lsrsr0, r0, #16
+   subsr1, r1, #16
+2: lsrsr3, r3, #8
+   cmp r0, r3 /* #0x100 */
+   bcc 2f
+   lsrsr0, r0, #8
+   subsr1, r1, #8
+2: lsrsr3, r3, #4
+   cmp r0, r3 /* #0x10 */
+   bcc 2f
+   lsrsr0, r0, #4
+   subsr1, r1, #4
+2: adr r2, 1f
+   ldrbr0, [r2, r0]
+   subsr0, r0, r1
+   bx lr
+.align 2
+1:
+.byte  27, 28, 29, 29, 30, 30, 30, 30, 31, 31, 31, 31, 31, 31, 31, 31
+   FUNC_END ctzsi2
+#else
+ARM_FUNC_START ctzsi2
+   rsb r1, r0, #0
+   and r0, r0, r1
+# if defined (__ARM_FEATURE_CLZ)
+   clz r0, r0
+   rsb r0, r0, #31
+   RET
+# else
+   mov r1, #28
+   cmp r0, #0x1
+   do_it   cs, t
+   movcs   r0, r0, lsr #16
+   subcs   r1, r1, #16
+   cmp r0, #0x100
+   do_it   cs, t
+   movcs   r0, r0, lsr #8
+   subcs   r1, r1, #8
+   cmp r0, #0x10
+   do_it   cs, t
+   movcs   r0, r0, lsr #4
+   subcs   r1, r1, #4
+   adr r2, 1f
+   ldrbr0, [r2, r0]
+   sub r0, r0, r1
+   RET
+.align 2
+1:
+.byte  27, 28, 29, 29, 30, 30, 30, 30, 31, 31, 31, 31, 31, 31, 31, 31
+# endif /* !defined (__ARM_FEATURE_CLZ) */
+   FUNC_END ctzsi2
+#endif
+#endif /* L_clzsi2 */
+
diff --git a/libgcc/config/arm/lib1funcs.S b/libgcc/config/arm/lib1funcs.S
index 469fea9ab5c..6cf7561835d 100644
--- a/libgcc/config/arm/lib1funcs.S
+++ b/libgcc/config/arm/lib1funcs.S
@@ -1804,70 +1804,7 @@ LSYM(Lover12):
 #endif /* __symbian__ */
 
 #include "clz2.S"
-
-#ifdef L_ctzsi2
-#ifdef NOT_ISA_TARGET_32BIT
-FUNC_START ctzsi2
-   negsr1, r0
-   andsr0, r0, r1
-   movsr1, #28
-   movsr3, #1
-   lslsr3, r3, #16
-   cmp r0, r3 /* 0x1 */
-   bcc 2f
-   lsrsr0, r0, #16
-   subsr1, r1, #16
-2: lsrsr3, r3, #8
-   cmp r0, r3 /* #0x100 */
-   bcc 2f
-   lsrsr0, r0, #8
-   subsr1, r1, #8
-2: lsrsr3, r3, #4
-   cmp r0, r3 /* #0x10 */
-   bcc 2f
-   lsrsr0, r0, #4
-   subsr1, r1, #4
-2: adr r2, 1f
-   ldrbr0, [r2, r0]
-   subsr0, r0, r1
-   bx lr
-.align 2
-1:
-.byte  27, 28, 29, 29, 30, 30, 30, 30, 31, 31, 31, 31, 31, 31, 31, 31
-   FUNC_END ctzsi2
-#else
-ARM_FUNC_START ctzsi2
-   rsb r1, r0, #0
-   and r0, r0, r1
-# if defined (__ARM_FEATURE_CLZ)
-   clz r0, r0
-   rsb r0, r0, #31
-   RET
-# else
-   mov r1, #28
-   cmp r0, #0x1
-   do_it   cs, t
-   movcs   r0, r0, lsr #16
-   subcs   r1, r1, #16
-   cmp r0, #0x100
-   do_it   cs, t
-   movcs   r0, r0, lsr #8
-   subcs   r1, r1, #8
-   cmp r0, #0x10
-   do_it   cs, t
-   movcs   r0, r0, lsr #4
-   subcs   r1, r1, #4
-   adr r2, 1f
-   ldrbr0, [r2, r0]
-   sub r0, r0, r1
-   RET
-.align 2
-1:
-.byte  27, 28, 29, 29, 30, 30, 30, 30, 31, 31, 31, 31, 31, 31, 31, 31
-# endif /* !defined (__ARM_FEATURE_CLZ) */
-   

[PATCH v7 04/34] Reorganize LIB1ASMFUNCS object wrapper macros

2022-10-31 Thread Daniel Engel
This will make it easier to isolate changes in subsequent patches.

gcc/libgcc/ChangeLog:
2022-10-09 Daniel Engel 

* config/arm/t-elf (LIB1ASMFUNCS): Split macros into logical groups.
---
 libgcc/config/arm/t-elf | 66 +
 1 file changed, 53 insertions(+), 13 deletions(-)

diff --git a/libgcc/config/arm/t-elf b/libgcc/config/arm/t-elf
index 9da6cd37054..93ea1cd8f76 100644
--- a/libgcc/config/arm/t-elf
+++ b/libgcc/config/arm/t-elf
@@ -14,19 +14,59 @@ LIB1ASMFUNCS += _arm_muldf3 _arm_mulsf3
 endif
 endif # !__symbian__
 
-# For most CPUs we have an assembly soft-float implementations.
-# However this is not true for ARMv6M.  Here we want to use the soft-fp C
-# implementation.  The soft-fp code is only build for ARMv6M.  This pulls
-# in the asm implementation for other CPUs.
-LIB1ASMFUNCS += _udivsi3 _divsi3 _umodsi3 _modsi3 _dvmd_tls _bb_init_func \
-   _call_via_rX _interwork_call_via_rX \
-   _lshrdi3 _ashrdi3 _ashldi3 \
-   _arm_negdf2 _arm_addsubdf3 _arm_muldivdf3 _arm_cmpdf2 _arm_unorddf2 \
-   _arm_fixdfsi _arm_fixunsdfsi \
-   _arm_truncdfsf2 _arm_negsf2 _arm_addsubsf3 _arm_muldivsf3 \
-   _arm_cmpsf2 _arm_unordsf2 _arm_fixsfsi _arm_fixunssfsi \
-   _arm_floatdidf _arm_floatdisf _arm_floatundidf _arm_floatundisf \
-   _clzsi2 _clzdi2 _ctzsi2
+# This pulls in the available assembly function implementations.
+# The soft-fp code is only built for ARMv6M, since there is no
+# assembly implementation here for double-precision values.
+
+
+# Group 1: Integer function objects.
+LIB1ASMFUNCS += \
+   _ashldi3 \
+   _ashrdi3 \
+   _lshrdi3 \
+   _clzdi2 \
+   _clzsi2 \
+   _ctzsi2 \
+   _dvmd_tls \
+   _divsi3 \
+   _modsi3 \
+   _udivsi3 \
+   _umodsi3 \
+
+
+# Group 2: Single precision floating point function objects.
+LIB1ASMFUNCS += \
+   _arm_addsubsf3 \
+   _arm_cmpsf2 \
+   _arm_fixsfsi \
+   _arm_fixunssfsi \
+   _arm_floatdisf \
+   _arm_floatundisf \
+   _arm_muldivsf3 \
+   _arm_negsf2 \
+   _arm_unordsf2 \
+
+
+# Group 3: Double precision floating point function objects.
+LIB1ASMFUNCS += \
+   _arm_addsubdf3 \
+   _arm_cmpdf2 \
+   _arm_fixdfsi \
+   _arm_fixunsdfsi \
+   _arm_floatdidf \
+   _arm_floatundidf \
+   _arm_muldivdf3 \
+   _arm_negdf2 \
+   _arm_truncdfsf2 \
+   _arm_unorddf2 \
+
+
+# Group 4: Miscellaneous function objects.
+LIB1ASMFUNCS += \
+   _bb_init_func \
+   _call_via_rX \
+   _interwork_call_via_rX \
+
 
 # Currently there is a bug somewhere in GCC's alias analysis
 # or scheduling code that is breaking _fpmul_parts in fp-bit.c.
-- 
2.34.1



[PATCH v7 01/34] Add and restructure function declaration macros

2022-10-31 Thread Daniel Engel
Most of these changes support subsequent patches in this series.
Particularly, the FUNC_START macro becomes part of a new macro chain:

  * FUNC_ENTRY  Common global symbol directives
  * FUNC_START_SECTION  FUNC_ENTRY to start a new <section>
  * FUNC_START  FUNC_START_SECTION <".text">

The effective definition of FUNC_START is unchanged from the previous
version of lib1funcs.  See code comments for detailed usage.

The new names FUNC_ENTRY and FUNC_START_SECTION were chosen specifically
to complement the existing FUNC_START name.  Alternate name patterns are
possible (such as {FUNC_SYMBOL, FUNC_START_SECTION, FUNC_START_TEXT}),
but any change to FUNC_START would require refactoring much of libgcc.

Additionally, a parallel chain of new macros supports weak functions:

  * WEAK_ENTRY
  * WEAK_START_SECTION
  * WEAK_START
  * WEAK_ALIAS

Moving the CFI_* macros earlier in the file scope will increase their
scope for use in additional functions.

gcc/libgcc/ChangeLog:
2022-10-09 Daniel Engel 

* config/arm/lib1funcs.S:
(LLSYM): New macro prefix ".L" for strippable local symbols.
(CFI_START_FUNCTION, CFI_END_FUNCTION): Moved earlier in the file.
(FUNC_ENTRY): New macro for symbols with no ".section" directive.
(WEAK_ENTRY): New macro FUNC_ENTRY + ".weak".
(FUNC_START_SECTION): New macro FUNC_ENTRY with <section> argument.
(WEAK_START_SECTION): New macro FUNC_START_SECTION + ".weak".
(FUNC_START): Redefined in terms of FUNC_START_SECTION <".text">.
(WEAK_START): New macro FUNC_START + ".weak".
(WEAK_ALIAS): New macro FUNC_ALIAS + ".weak".
(FUNC_END): Moved after FUNC_START macro group.
(THUMB_FUNC_START): Moved near the other *FUNC* macros.
(THUMB_SYNTAX, ARM_SYM_START, SYM_END): Deleted unused macros.
---
 libgcc/config/arm/lib1funcs.S | 109 +-
 1 file changed, 69 insertions(+), 40 deletions(-)

diff --git a/libgcc/config/arm/lib1funcs.S b/libgcc/config/arm/lib1funcs.S
index 8c39c9f20a2..a4fa62b3832 100644
--- a/libgcc/config/arm/lib1funcs.S
+++ b/libgcc/config/arm/lib1funcs.S
@@ -69,11 +69,13 @@ see the files COPYING3 and COPYING.RUNTIME respectively.  
If not, see
 #define TYPE(x) .type SYM(x),function
 #define SIZE(x) .size SYM(x), . - SYM(x)
 #define LSYM(x) .x
+#define LLSYM(x) .L##x
 #else
 #define __PLT__
 #define TYPE(x)
 #define SIZE(x)
 #define LSYM(x) x
+#define LLSYM(x) x
 #endif
 
 /* Function end macros.  Variants for interworking.  */
@@ -182,6 +184,16 @@ LSYM(Lend_fde):
 #endif
 .endm
 
+.macro CFI_START_FUNCTION
+   .cfi_startproc
+   .cfi_remember_state
+.endm
+
+.macro CFI_END_FUNCTION
+   .cfi_restore_state
+   .cfi_endproc
+.endm
+
 /* Don't pass dirn, it's there just to get token pasting right.  */
 
 .macro RETLDM  regs=, cond=, unwind=, dirn=ia
@@ -324,10 +336,6 @@ LSYM(Lend_fde):
 .endm
 #endif
 
-.macro FUNC_END name
-   SIZE (__\name)
-.endm
-
 .macro DIV_FUNC_END name signed
cfi_start   __\name, LSYM(Lend_div0)
 LSYM(Ldiv0):
@@ -340,48 +348,76 @@ LSYM(Ldiv0):
FUNC_END \name
 .endm
 
-.macro THUMB_FUNC_START name
-   .globl  SYM (\name)
-   TYPE(\name)
-   .thumb_func
-SYM (\name):
-.endm
-
 /* Function start macros.  Variants for ARM and Thumb.  */
 
 #ifdef __thumb__
 #define THUMB_FUNC .thumb_func
 #define THUMB_CODE .force_thumb
-# if defined(__thumb2__)
-#define THUMB_SYNTAX
-# else
-#define THUMB_SYNTAX
-# endif
 #else
 #define THUMB_FUNC
 #define THUMB_CODE
-#define THUMB_SYNTAX
 #endif
 
+.macro THUMB_FUNC_START name
+   .globl  SYM (\name)
+   TYPE(\name)
+   .thumb_func
+SYM (\name):
+.endm
+
+/* Strong global symbol, ".text" section.
+   The default macro for function declarations. */
 .macro FUNC_START name
-   .text
+   FUNC_START_SECTION \name .text
+.endm
+
+/* Weak global symbol, ".text" section.
+   Use WEAK_* macros to declare a function/object that may be discarded by
+the linker when another library or object exports the same name.
+   Typically, functions declared with WEAK_* macros implement a subset of
+functionality provided by the overriding definition, and are discarded
+when the full functionality is required. */
+.macro WEAK_START name
+   .weak SYM(__\name)
+   FUNC_START_SECTION \name .text
+.endm
+
+/* Strong global symbol, alternate section.
+   Use the *_START_SECTION macros for declarations that the linker should
+place in a non-default section (e.g. ".rodata", ".text.subsection"). */
+.macro FUNC_START_SECTION name section
+   .section \section,"x"
+   .align 0
+   FUNC_ENTRY \name
+.endm
+
+/* Weak global symbol, alternate section. */
+.macro WEAK_START_SECTION name section
+   .weak SYM(__\name)
+   FUNC_START_SECTION \name \section
+.endm
+
+/* Strong global symbol.
+   Use *_ENTRY macros internal to a function/object body to declare a second
+or subsequent entry point 

[PATCH v7 00/34] libgcc: Thumb-1 Floating-Point Assembly for Cortex M0

2022-10-31 Thread Daniel Engel
Hi Richard,

I am re-submitting my libgcc patch from 2021:

https://gcc.gnu.org/pipermail/gcc-patches/2021-January/563585.html
https://gcc.gnu.org/pipermail/gcc-patches/2021-December/587383.html

I believe I have finally made the stage1 window. 

Regards,
Daniel

---

Changes since v6:

* Rebased and tested with gcc-13

There are no regressions for -march={armv4t,armv6s-m,armv7-m,armv7-a}.
Clean master:

# of expected passes		529397
# of unexpected failures	41160
# of unexpected successes   12
# of expected failures  3442
# of unresolved testcases   978
# of unsupported tests  28993

Patched master:

# of expected passes		529397
# of unexpected failures	41160
# of unexpected successes   12
# of expected failures  3442
# of unresolved testcases   978
# of unsupported tests  28993

---

This patch series adds an assembly-language implementation of IEEE-754 compliant
single-precision functions designed for the Cortex M0 (v6m) architecture.  There
are improvements to most of the EABI integer functions as well.  This is the
libgcc component of a larger library project originally proposed in 2018:

https://gcc.gnu.org/legacy-ml/gcc/2018-11/msg00043.html

As one point of comparison, a test program [1] links 916 bytes from libgcc with
the patched toolchain vs 10276 bytes with gcc-arm-none-eabi-9-2020-q2 toolchain.
That's a 90% size reduction.

I have extensive test vectors [2], and this patch passes all tests on an
STM32F051.
These vectors were derived from UCB [3], Testfloat [4], and IEEECC754 [5], plus
many of my own generation.

There may be some follow-on projects worth discussing:

* The library is currently integrated into the ARM v6s-m multilib only.  It
is likely that some other architectures would benefit from these routines.
However, I have NOT profiled the existing implementations (ieee754-sf.S) to
estimate where improvements may be found.

* GCC currently lacks test for some functions, such as __aeabi_[u]ldivmod().
There may be useful bits in [1] that can be integrated.

On Cortex M0, the library has (approximately) the following properties:

Function(s)                     Size (bytes)      Cycles            Stack   Accuracy
__clzsi2                        50                20                0       exact
__clzsi2 (OPTIMIZE_SIZE)        22                51                0       exact
__clzdi2                        8+__clzsi2        4+__clzsi2        0       exact

__clrsbsi2                      8+__clzsi2        6+__clzsi2        0       exact
__clrsbdi2                      18+__clzsi2       (8..10)+__clzsi2  0       exact

__ctzsi2                        52                21                0       exact
__ctzsi2 (OPTIMIZE_SIZE)        24                52                0       exact
__ctzdi2                        8+__ctzsi2        5+__ctzsi2        0       exact

__ffssi2                        8                 6..(5+__ctzsi2)   0       exact
__ffsdi2                        14+__ctzsi2       9..(8+__ctzsi2)   0       exact

__popcountsi2                   52                25                0       exact
__popcountsi2 (OPTIMIZE_SIZE)   14                9..201            0       exact
__popcountdi2                   34+__popcountsi2  46                0       exact
__popcountdi2 (OPTIMIZE_SIZE)   12+__popcountsi2  17..401           0       exact

__paritysi2                     24                14                0       exact
__paritysi2 (OPTIMIZE_SIZE)     16                38                0       exact
__paritydi2                     2+__paritysi2     1+__paritysi2     0       exact

__umulsidi3                     44                24                0       exact
__mulsidi3                      30+__umulsidi3    24+__umulsidi3    8       exact
__muldi3 (__aeabi_lmul)         10+__umulsidi3    6+__umulsidi3     0       exact
__ashldi3 (__aeabi_llsl)        22                13                0       exact
__lshrdi3 (__aeabi_llsr)        22                13                0       exact
__ashrdi3 (__aeabi_lasr)        22                13                0       exact

__aeabi_lcmp                    20                13                0       exact
__aeabi_ulcmp                   16                10                0       exact

__udivsi3 (__aeabi_uidiv)       56                72..385           0       < 1 lsb
__divsi3 (__aeabi_idiv)         38+__udivsi3      26+__udivsi3      8       < 1 lsb
__udivdi3 (__aeabi_uldiv)       164               103..1394         16      < 1 lsb
__udivdi3 (OPTIMIZE_SIZE)       142               120..1392         16      < 1 lsb
__divdi3 (__aeabi_ldiv)         54+__udivdi3      36+__udivdi3      32      < 1 lsb

__shared_float

optabs: Variable index vec_set

2022-10-31 Thread Robin Dapp via Gcc-patches
Hi,

I'm looking into vec_set with variable index on s390.  Uros posted a
patch [1] that did not make it upstream in Nov 2020.  It changed the
mode of the index operand to whatever the target supports in
can_vec_set_var_idx_p.  I missed it back then but we indeed do not make
proper use of vec_set with an index register.
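
For reference, here is the kind of source this is about (illustrative only, not
taken from Uros's patch or my s390 changes): a vector element store with a
run-time index, which expands through the vec_set optab and can use an index
register when the target advertises support for one.

typedef int v4si __attribute__ ((vector_size (16)));

v4si
set_elem (v4si v, int idx, int val)
{
  v[idx & 3] = val;   /* variable-index insertion; expanded via vec_set */
  return v;
}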

With the patch my local changes to make better use of vec_set work
nicely even though I haven't done a full bootstrap yet.  Were there
other issues with the patch or can it still be applied?

Regards
 Robin

[1]
https://gcc.gnu.org/pipermail/gcc-patches/2020-November/559213.html


Re: [committed] libstdc++: Fix compare_exchange_padding.cc test for std::atomic_ref

2022-10-31 Thread Jonathan Wakely via Gcc-patches
On Mon, 31 Oct 2022 at 15:34, Eric Botcazou  wrote:
>
> > The test was only failing for me with -m32 (and not -m64), so I didn't
> > notice until now. That probably means we should make the test fail more
> > reliably if the padding isn't being cleared.
>
> The tests fail randomly for me on SPARC64/Linux:
>
> FAIL: 29_atomics/atomic/compare_exchange_padding.cc execution test
> FAIL: 29_atomics/atomic_ref/compare_exchange_padding.cc execution test
>
> /home/ebotcazou/src/libstdc++-v3/testsuite/29_atomics/atomic_ref/
> compare_exchange_padding.cc:34: int main(): Assertion 'compare_struct(ts, es)'
> failed.
> FAIL: 29_atomics/atomic_ref/compare_exchange_padding.cc execution test
>
>   std::atomic as{ s };
>   auto ts = as.load();
>   VERIFY( !compare_struct(s, ts) ); // padding cleared on construction
>   as.exchange(s);
>   auto es = as.load();
>   VERIFY( compare_struct(ts, es) ); // padding cleared on exchange
>
> How is it supposed to pass exactly?  AFAICS you have no control on the padding
> bits of ts or es and, indeed, at -O2 the loads are scalarized:
>
>   __buf$c_81 = MEM[(struct S *)&__buf].c;
>   __buf$s_59 = MEM[(struct S *)&__buf].s;
>   __buf ={v} {CLOBBER(eol)};
>   ts.c = __buf$c_81;
>   ts.s = __buf$s_59;
> [...]
>   __buf$c_100 = MEM[(struct S *)&__buf].c;
>   __buf$s_35 = MEM[(struct S *)&__buf].s;
>   __buf ={v} {CLOBBER(eol)};
>   es.c = __buf$c_100;
>   es.s = __buf$s_35;
>   _66 = MEM  [(char * {ref-all})];
>   _101 = MEM  [(char * {ref-all})];
>   if (_66 != _101)
> goto ; [0.04%]
>   else
> goto ; [99.96%]
>
> so the result of the 4-byte comparison is random.


I suppose we could use memcmp on the as variable itself, to inspect
the actual stored padding rather than the returned copy of it.
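
Something along these lines, perhaps (only a sketch of the idea, not the actual
fix; the struct layout and the padding offset are assumptions):

#include <atomic>
#include <cstring>

struct S { char c; short s; };   // assumes one padding byte at offset 1

bool
padding_cleared (const std::atomic<S> &as)
{
  unsigned char bytes[sizeof (as)];
  std::memcpy (bytes, &as, sizeof (as));  // inspect the stored object itself
  return bytes[1] == 0;                   // padding byte, not a scalarized copy
}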



[GCC][PATCH v2] arm: Add pacbti related multilib support for armv8.1-m.main.

2022-10-31 Thread Srinath Parvathaneni via Gcc-patches
Hi,

This patch adds support for pacbti multilib linking by making
"-mbranch-protection=none" the default in the command line for all M-profile
targets and uses "-mbranch-protection=none" for multilib matching. If any
valid value is passed to "-mbranch-protection" in the command line, this
new value overwrites the default value in the command line and uses
"-mbranch-protection=standard" for multilib matching.

Eg 1.

If the passed command line flags are:
a) -march=armv8.1-m.main+mve -mfloat-abi=hard -mfpu=auto
b) -mcpu=cortex-m85+nopacbti -mfloat-abi=hard -mfpu=auto

After this patch the command line flags the compiler receives will be:
a) -march=armv8.1-m.main+mve -mfloat-abi=hard -mfpu=auto 
-mbranch-protection=none
b) -mcpu=cortex-m85+nopacbti -mfloat-abi=hard -mfpu=auto 
-mbranch-protection=none

"-mbranch-protection=none" will be used in the multilib matching.

Eg 2.

If the passed command line flags are:
a) -march=armv8.1-m.main+mve+pacbti -mfloat-abi=hard -mfpu=auto  
-mbranch-protection=pac-ret
b) -mcpu=cortex-m85 -mfloat-abi=hard -mfpu=auto  -mbranch-protection=pac-ret+bti

After this patch the command line flags the compiler receives will be:
a) -march=armv8.1-m.main+mve+pacbti -mfloat-abi=hard -mfpu=auto 
-mbranch-protection=pac-ret
b) -mcpu=cortex-m85 -mfloat-abi=hard -mfpu=auto -mbranch-protection=pac-ret+bti

"-mbranch-protection=standard" will be used in the multilib matching.

Eg 3.

For A-profile target, if the passed command line flags are:
-march=armv8-a+simd -mfloat-abi=hard -mfpu=auto

Even after this patch the command line flags the compiler receives will remain
the same:
-march=armv8-a+simd -mfloat-abi=hard -mfpu=auto

Regression tested on arm-none-eabi and bootstrapped on arm-none-linux-gnueabihf.

Ok for master?

Regards,
Srinath.

gcc/ChangeLog:

2022-10-28  Srinath Parvathaneni  

* common/config/arm/arm-common.cc
(arm_canon_branch_protection_option): Define new function.
* config/arm/arm-cpus.in (armv8.1-m.main): Move dsp option below pacbti
option.
* config/arm/arm.h (arm_canon_branch_protection_option): Define function
prototype.
(CANON_BRANCH_PROTECTION_SPEC_FUNCTION): Define macro.
(MBRANCH_PROTECTION_SPECS): Likewise.
* config/arm/t-rmprofile (MULTI_ARCH_OPTS_RM): Add new options.
(MULTI_ARCH_DIRS_RM): Add new directories.
(MULTILIB_REQUIRED): Add new option.
(MULTILIB_REUSE): Reuse existing multlibs.
(MULTILIB_MATCHES): Match multilib strings.

gcc/testsuite/ChangeLog:

2022-10-28  Srinath Parvathaneni  

* gcc.target/arm/multilib.exp (multilib_config "rmprofile"): Update
tests.
* gcc.target/arm/pac-10.c: New test.
* gcc.target/arm/pac-11.c: Likewise.
* gcc.target/arm/pac-12.c: Likewise.

rb16143.patch.gz
Description: application/gzip


Re: [committed] libstdc++: Fix compare_exchange_padding.cc test for std::atomic_ref

2022-10-31 Thread Eric Botcazou via Gcc-patches
> The test was only failing for me with -m32 (and not -m64), so I didn't
> notice until now. That probably means we should make the test fail more
> reliably if the padding isn't being cleared.

The tests fail randomly for me on SPARC64/Linux:

FAIL: 29_atomics/atomic/compare_exchange_padding.cc execution test
FAIL: 29_atomics/atomic_ref/compare_exchange_padding.cc execution test

/home/ebotcazou/src/libstdc++-v3/testsuite/29_atomics/atomic_ref/
compare_exchange_padding.cc:34: int main(): Assertion 'compare_struct(ts, es)' 
failed.
FAIL: 29_atomics/atomic_ref/compare_exchange_padding.cc execution test

  std::atomic as{ s };
  auto ts = as.load();
  VERIFY( !compare_struct(s, ts) ); // padding cleared on construction
  as.exchange(s);
  auto es = as.load();
  VERIFY( compare_struct(ts, es) ); // padding cleared on exchange

How is it supposed to pass exactly?  AFAICS you have no control on the padding 
bits of ts or es and, indeed, at -O2 the loads are scalarized:

  __buf$c_81 = MEM[(struct S *)&__buf].c;
  __buf$s_59 = MEM[(struct S *)&__buf].s;
  __buf ={v} {CLOBBER(eol)};
  ts.c = __buf$c_81;
  ts.s = __buf$s_59;
[...]
  __buf$c_100 = MEM[(struct S *)&__buf].c;
  __buf$s_35 = MEM[(struct S *)&__buf].s;
  __buf ={v} {CLOBBER(eol)};
  es.c = __buf$c_100;
  es.s = __buf$s_35;
  _66 = MEM  [(char * {ref-all})];
  _101 = MEM  [(char * {ref-all})];
  if (_66 != _101)
goto ; [0.04%]
  else
goto ; [99.96%]

so the result of the 4-byte comparison is random.

-- 
Eric Botcazou




[ada, patch] fix libgnat build on x86_64-linux-gnux32 with glibc <= 2.31

2022-10-31 Thread Matthias Klose
This was introduced with the fix and backports of PR103530 on 
x86_64-linux-gnux32 with older glibc versions (checked with 2.31), where dladdr 
is still in the libdl.so library, and not included in libc.so as in newer glibc 
versions.

Linking of libgnat.so fails with

[...]
/usr/x86_64-linux-gnux32/bin/ld: s-trasym.o: in function `system__traceback__symbolic__module_name__getXnn':
me__getXnn':
collect2: error: ld returned 1 exit status
make[8]: *** [gcc-interface/Makefile:677: gnatlib-shared-default] Error 1

https://gcc.gnu.org/git/?p=gcc.git;a=patch;h=9d6c63ba490ec92245f04b5cbafc56abd28e8d22

-- a/gcc/ada/Makefile.rtl
+++ b/gcc/ada/Makefile.rtl
@@ -2650,13 +2650,18 @@ ifeq ($(strip $(filter-out %x32 linux%,$(target_cpu) 
$(target_os))),)

   s-tasinf.adb

The addition of s-tsmona.adb added a reference to dladdr, however without
setting MISCLIB to -ldl as on other architectures.


Proposed patch:


 PR ada/107475
 * Makefile.rtl: Set MISCLIB for x86_64-linux-gnux32.


--- a/gcc/ada/Makefile.rtl
+++ b/gcc/ada/Makefile.rtl
@@ -2584,6 +2584,7 @@ ifeq ($(strip $(filter-out %x32 linux%,$(target_cpu) 
$(target_os))),)

   EXTRA_GNATRTL_TASKING_OBJS=s-linux.o a-exetim.o
   EH_MECHANISM=-gcc
   THREADSLIB=-lpthread -lrt
+  MISCLIB = -ldl
   GNATLIB_SHARED=gnatlib-shared-dual
   GMEM_LIB = gmemlib
   LIBRARY_VERSION := $(LIB_VERSION)

Ok for the trunk and the branches?

Matthias


Re: Adding a new thread model to GCC

2022-10-31 Thread i.nixman--- via Gcc-patches

On 2022-10-31 09:18, Eric Botcazou wrote:

Hi Eric!

thank you very much for the job!
I will try to build our (MinGW-Builds project) builds using this patch 
and will report back.


@Jonathan

what are the next steps to be taken to accept this patch?



best!



I have attached a revised version of the original patch at:
  https://gcc.gnu.org/legacy-ml/gcc-patches/2019-06/msg01840.html

This reimplements the GNU threads library on native Windows (except for the
Objective-C specific subset) using direct Win32 API calls, in lieu of the
implementation based on semaphores.  This base implementation requires
Windows XP/Server 2003, which was the default minimal setting of MinGW-W64
until end of 2020.  This also adds the support required for the C++11
threads, using again direct Win32 API calls; this additional layer requires
Windows Vista/Server 2008 and is enabled only if _WIN32_WINNT >= 0x0600.

This also changes libstdc++ to pass -D_WIN32_WINNT=0x0600 but only when the
switch --enable-libstdcxx-threads is passed, which means that C++11 threads
are still disabled by default *unless* MinGW-W64 itself is configured for
Windows Vista/Server 2008 or later by default (this has been the case in
the development version since end of 2020; for earlier versions you can
configure it --with-default-win32-winnt=0x0600 to get the same effect).

I only manually tested it on i686-w64-mingw32 and x86_64-w64-mingw32 but
AdaCore has used it in their C/C++/Ada compilers for 3 years now and the
30_threads chapter of the libstdc++ testsuite was clean at the time.


2022-10-31  Eric Botcazou  

libgcc/
* config.host (i[34567]86-*-mingw*): Add thread fragment after EH one
as well as new i386/t-slibgcc-mingw fragment.
(x86_64-*-mingw*): Likewise.
* config/i386/gthr-win32.h: If _WIN32_WINNT is at least 0x0600, define
both __GTHREAD_HAS_COND and __GTHREADS_CXX0X to 1.
Error out if _GTHREAD_USE_MUTEX_TIMEDLOCK is 1.
Include stdlib.h instead of errno.h and do not include _mingw.h.
(CONST_CAST2): Add specific definition for C++.
(ATTRIBUTE_UNUSED): New macro.
(__UNUSED_PARAM): Delete.
Define WIN32_LEAN_AND_MEAN before including windows.h.
	(__gthread_objc_data_tls): Use TLS_OUT_OF_INDEXES instead of (DWORD)-1.
	(__gthread_objc_init_thread_system): Likewise.
(__gthread_objc_thread_get_data): Minor tweak.
(__gthread_objc_condition_allocate): Use ATTRIBUTE_UNUSED.
(__gthread_objc_condition_deallocate): Likewise.
(__gthread_objc_condition_wait): Likewise.
(__gthread_objc_condition_broadcast): Likewise.
(__gthread_objc_condition_signal): Likewise.
Include sys/time.h.
(__gthr_win32_DWORD): New typedef.
(__gthr_win32_HANDLE): Likewise.
(__gthr_win32_CRITICAL_SECTION): Likewise.
(__gthr_win32_CONDITION_VARIABLE): Likewise.
(__gthread_t): Adjust.
(__gthread_key_t): Likewise.
(__gthread_mutex_t): Likewise.
(__gthread_recursive_mutex_t): Likewise.
(__gthread_cond_t): New typedef.
(__gthread_time_t): Likewise.
(__GTHREAD_MUTEX_INIT_DEFAULT): Delete.
(__GTHREAD_RECURSIVE_MUTEX_INIT_DEFAULT): Likewise.
(__GTHREAD_COND_INIT_FUNCTION): Define.
(__GTHREAD_TIME_INIT): Likewise.
(__gthr_i486_lock_cmp_xchg): Delete.
(__gthr_win32_create): Declare.
(__gthr_win32_join): Likewise.
(__gthr_win32_self): Likewise.
(__gthr_win32_detach): Likewise.
(__gthr_win32_equal): Likewise.
(__gthr_win32_yield): Likewise.
(__gthr_win32_mutex_destroy): Likewise.
	(__gthr_win32_cond_init_function): Likewise if __GTHREADS_HAS_COND is 1.
	(__gthr_win32_cond_broadcast): Likewise.
(__gthr_win32_cond_signal): Likewise.
(__gthr_win32_cond_wait): Likewise.
(__gthr_win32_cond_timedwait): Likewise.
(__gthr_win32_recursive_mutex_init_function): Delete.
(__gthr_win32_recursive_mutex_lock): Likewise.
(__gthr_win32_recursive_mutex_unlock): Likewise.
(__gthr_win32_recursive_mutex_destroy): Likewise.
(__gthread_create): New inline function.
(__gthread_join): Likewise.
(__gthread_self): Likewise.
(__gthread_detach): Likewise.
(__gthread_equal): Likewise.
(__gthread_yield): Likewise.
(__gthread_cond_init_function): Likewise if __GTHREADS_HAS_COND is 1.
(__gthread_cond_broadcast): Likewise.
(__gthread_cond_signal): Likewise.
(__gthread_cond_wait): Likewise.
(__gthread_cond_timedwait): Likewise.
(__GTHREAD_WIN32_INLINE): New macro.
(__GTHREAD_WIN32_COND_INLINE): Likewise.
(__GTHREAD_WIN32_ACTIVE_P): Likewise.
Define WIN32_LEAN_AND_MEAN before including windows.h.
(__gthread_once): Minor tweaks.
(__gthread_key_create): Use ATTRIBUTE_UNUSED and TLS_OUT_OF_INDEXES.
  

[Patch] OpenMP/Fortran: 'target update' with strides + DT components

2022-10-31 Thread Tobias Burnus

I recently saw that gfortran does not support derived type components
with 'target update', an OpenMP 5.0 feature.

When adding it, I also found out that strides where not handled. There
is probably some room of improvement about what to copy and what not,
but copying too much should be fine.

Build + (reg)tested on x86_64-gnu-linux without offloading configured
+ libgomp tested on x86_64-gnu-linux with nvptx offloading.
OK for mainline?

 * * *

PS: Follow-up work items:
* Strides: OpenMP seemingly permits also 'a%b([1,6,19,12])' as
  long as the first index has the lowest address. – And also
  'a%b(:)%c' is permitted – both not handled in this patch
  (and rejected with a compile-time error)
* There seem to be some problems with 'alloc' with pointers
  and allocatables in components – but I have not rechecked.
* For allocatables, 'target update' needs to do a deep mapping;
  I need to check whether that's the case.
Note for the last two: allocatable components only works OG11/OG12
and I urgently need to cleanup + (re)submit that patch to mainline.
(It came too late for GCC 12.)

* There might be also some issue mapping/refcounting, which I have not
  investigated - affecting the 'target exit data' of target-11.f90.

PPS: I intend to file at least one/some PRs about those issues, unless
I can fix them quickly.
-
OpenMP/Fortran: 'target update' with strides + DT components

OpenMP 5.0 permits to use arrays with strides and derived
type components for the list items to the 'from'/'to' clauses
of the 'target update' directive.

gcc/fortran/ChangeLog:

	* openmp.cc (gfc_match_omp_clauses): Permit derived types.
	(resolve_omp_clauses): Accept noncontiguous arrays.
	* trans-openmp.cc (gfc_trans_omp_clauses): Fixes for
	derived-type changes; fix size for scalars.

libgomp/ChangeLog:

	* testsuite/libgomp.fortran/target-11.f90: New test.
	* testsuite/libgomp.fortran/target-13.f90: New test.

 gcc/fortran/openmp.cc   |  19 ++-
 gcc/fortran/trans-openmp.cc |   9 +-
 libgomp/testsuite/libgomp.fortran/target-11.f90 |  75 +++
 libgomp/testsuite/libgomp.fortran/target-13.f90 | 162 
 4 files changed, 256 insertions(+), 9 deletions(-)

diff --git a/gcc/fortran/openmp.cc b/gcc/fortran/openmp.cc
index 653c43f79ff..2daed74be72 100644
--- a/gcc/fortran/openmp.cc
+++ b/gcc/fortran/openmp.cc
@@ -2499,9 +2499,10 @@ gfc_match_omp_clauses (gfc_omp_clauses **cp, const omp_mask mask,
 	  true) == MATCH_YES)
 	continue;
 	  if ((mask & OMP_CLAUSE_FROM)
-	  && gfc_match_omp_variable_list ("from (",
+	  && (gfc_match_omp_variable_list ("from (",
 	  >lists[OMP_LIST_FROM], false,
-	  NULL, , true) == MATCH_YES)
+	  NULL, , true, true)
+		  == MATCH_YES))
 	continue;
 	  break;
 	case 'g':
@@ -3436,9 +3437,10 @@ gfc_match_omp_clauses (gfc_omp_clauses **cp, const omp_mask mask,
 		continue;
 	}
 	  else if ((mask & OMP_CLAUSE_TO)
-	  && gfc_match_omp_variable_list ("to (",
+	  && (gfc_match_omp_variable_list ("to (",
 	  >lists[OMP_LIST_TO], false,
-	  NULL, , true) == MATCH_YES)
+	  NULL, , true, true)
+		  == MATCH_YES))
 	continue;
 	  break;
 	case 'u':
@@ -7585,8 +7587,11 @@ resolve_omp_clauses (gfc_code *code, gfc_omp_clauses *omp_clauses,
 			   Only raise an error here if we're really sure the
 			   array isn't contiguous.  An expression such as
 			   arr(-n:n,-n:n) could be contiguous even if it looks
-			   like it may not be.  */
+			   like it may not be.
+			   And OpenMP's 'target update' permits strides for
+			   the to/from clause. */
 			if (code->op != EXEC_OACC_UPDATE
+			&& code->op != EXEC_OMP_TARGET_UPDATE
 			&& list != OMP_LIST_CACHE
 			&& list != OMP_LIST_DEPEND
 			&& !gfc_is_simply_contiguous (n->expr, false, true)
@@ -7630,7 +7635,9 @@ resolve_omp_clauses (gfc_code *code, gfc_omp_clauses *omp_clauses,
 			int i;
 			gfc_array_ref *ar = >u.ar;
 			for (i = 0; i < ar->dimen; i++)
-			  if (ar->stride[i] && code->op != EXEC_OACC_UPDATE)
+			  if (ar->stride[i]
+			  && code->op != EXEC_OACC_UPDATE
+			  && code->op != EXEC_OMP_TARGET_UPDATE)
 			{
 			  gfc_error ("Stride should not be specified for "
 	 "array section in %s clause at %L",
diff --git a/gcc/fortran/trans-openmp.cc b/gcc/fortran/trans-openmp.cc
index 9bd4e6c7e1b..4bfdf85cd9b 100644
--- a/gcc/fortran/trans-openmp.cc
+++ b/gcc/fortran/trans-openmp.cc
@@ -3626,7 +3626,10 @@ gfc_trans_omp_clauses (stmtblock_t *block, gfc_omp_clauses *clauses,
 		  gcc_unreachable ();
 		}
 	  tree node = build_omp_clause (input_location, clause_code);
-	  if (n->expr == NULL || n->expr->ref->u.ar.type 

Re: [PATCH Rust front-end v3 01/46] Use DW_ATE_UTF for the Rust 'char' type

2022-10-31 Thread Tom Tromey via Gcc-patches
> "Mark" == Mark Wielaard  writes:

Mark> DW_LANG_Rust_old was used by old rustc compilers <= 2016 before DWARF5
Mark> assigned an official number. It might be recognized by some
Mark> debuggers.

FWIW I wouldn't worry about it any more.
We could probably just remove the '_old' constant.

Tom


[Ping x2] Re: [PATCH, nvptx, 1/2] Reimplement libgomp barriers for nvptx

2022-10-31 Thread Chung-Lin Tang
Ping x2.

On 2022/10/17 10:29 PM, Chung-Lin Tang wrote:
> Ping.
> 
> On 2022/9/21 3:45 PM, Chung-Lin Tang via Gcc-patches wrote:
>> Hi Tom,
>> I had a patch submitted earlier, where I reported that the current way of 
>> implementing
>> barriers in libgomp on nvptx created a quite significant performance drop on 
>> some SPEChpc2021
>> benchmarks:
>> https://gcc.gnu.org/pipermail/gcc-patches/2022-September/600818.html
>>
>> That previous patch wasn't accepted well (admittedly, it was kind of a hack).
>> So in this patch, I tried to (mostly) re-implement team-barriers for NVPTX.
>>
>> Basically, instead of trying to have the GPU do CPU-with-OS-like things that 
>> it isn't suited for,
>> barriers are implemented simplistically with bar.* synchronization 
>> instructions.
>> Tasks are processed after threads have joined, and only if team->task_count 
>> != 0
>>
>> (arguably, there might be a little bit of performance forfeited where 
>> earlier arriving threads
>> could've been used to process tasks ahead of other threads. But that again 
>> falls into requiring
>> implementing complex futex-wait/wake like behavior. Really, that kind of 
>> tasking is not what target
>> offloading is usually used for)
>>
>> Implementation highlight notes:
>> 1. gomp_team_barrier_wake() is now an empty function (threads never "wake" 
>> in the usual manner)
>> 2. gomp_team_barrier_cancel() now uses the "exit" PTX instruction.
>> 3. gomp_barrier_wait_last() now is implemented using "bar.arrive"
>>
>> 4. gomp_team_barrier_wait_end()/gomp_team_barrier_wait_cancel_end():
>> The main synchronization is done using a 'bar.red' instruction. This 
>> reduces across all threads
>> the condition (team->task_count != 0), to enable the task processing 
>> down below if any thread
>> created a task. (this bar.red usage required the need of the second GCC 
>> patch in this series)
>>
>> This patch has been tested on x86_64/powerpc64le with nvptx offloading, 
>> using libgomp, ovo, omptests,
>> and sollve_vv testsuites, all without regressions. Also verified that the 
>> SPEChpc 2021 521.miniswp_t
>> and 534.hpgmgfv_t performance regressions that occurred in the GCC12 cycle 
>> has been restored to
>> devel/omp/gcc-11 (OG11) branch levels. Is this okay for trunk?
>>
>> (also suggest backporting to GCC12 branch, if performance regression can be 
>> considered a defect)
>>
>> Thanks,
>> Chung-Lin
>>
>> libgomp/ChangeLog:
>>
>> 2022-09-21  Chung-Lin Tang  
>>
>>  * config/nvptx/bar.c (generation_to_barrier): Remove.
>>  (futex_wait,futex_wake,do_spin,do_wait): Remove.
>>  (GOMP_WAIT_H): Remove.
>>  (#include "../linux/bar.c"): Remove.
>>  (gomp_barrier_wait_end): New function.
>>  (gomp_barrier_wait): Likewise.
>>  (gomp_barrier_wait_last): Likewise.
>>  (gomp_team_barrier_wait_end): Likewise.
>>  (gomp_team_barrier_wait): Likewise.
>>  (gomp_team_barrier_wait_final): Likewise.
>>  (gomp_team_barrier_wait_cancel_end): Likewise.
>>  (gomp_team_barrier_wait_cancel): Likewise.
>>  (gomp_team_barrier_cancel): Likewise.
>>  * config/nvptx/bar.h (gomp_team_barrier_wake): Remove
>>  prototype, add new static inline function.


Re: [PATCH] RISC-V: Change constexpr back to CONSTEXPR

2022-10-31 Thread Kito Cheng via Gcc-patches
Committed, thanks!

On Fri, Oct 28, 2022 at 6:47 AM Jeff Law via Gcc-patches
 wrote:
>
>
> On 10/27/22 08:41, juzhe.zh...@rivai.ai wrote:
> > From: Ju-Zhe Zhong 
> >
> > According to 
> > https://github.com/gcc-mirror/gcc/commit/f95d3d5de72a1c43e8d529bad3ef59afc3214705.
> > Since GCC 4.8.6 doesn't support constexpr, we should change it back to 
> > CONSTEXPR.
> > gcc/ChangeLog:
> >
> >   * config/riscv/riscv-vector-builtins-bases.cc: Change constexpr back 
> > to CONSTEXPR.
> >   * config/riscv/riscv-vector-builtins-shapes.cc (SHAPE): Ditto.
> >   * config/riscv/riscv-vector-builtins.cc (struct 
> > registered_function_hasher): Ditto.
> >   * config/riscv/riscv-vector-builtins.h (struct rvv_arg_type_info): 
> > Ditto.
>
> OK.   Please install.
>
>
> Maybe we can move past gcc-4.8 as a bootstrapping requirement one day ;-)
>
>
> Jeff
>
>


[committed] amdgcn: multi-size vector reductions

2022-10-31 Thread Andrew Stubbs
My recent patch to add additional vector lengths didn't address the 
vector reductions yet.


This patch adds the missing support. Shorter vectors use fewer reduction 
steps, and the means to extract the final value has been adjusted.


Still lacking is any useful cost information, so for loops the vect pass will
almost always continue to choose the most expensive 64-lane vectors, and
therefore 64-lane reductions, with masking for smaller sizes. This works, but
could probably be improved.


For SLP the compiler will usually use a more appropriately sized 
reduction, so this patch is useful for that case.
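
As a concrete illustration of the SLP case (not taken from the patch itself),
a small fixed-size reduction such as the one below can now be vectorized with
a 4-lane vector and a correspondingly short reduction sequence, instead of
always being widened to 64 lanes (assuming -ffast-math or an OpenACC/OpenMP
context that permits reassociation):

float
sum4 (const float *a)
{
  return a[0] + a[1] + a[2] + a[3];
}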


Andrew

amdgcn: multi-size vector reductions

Add support for vector reductions for any vector width by switching iterators
and generalising the code slightly.  There's no one-instruction way to move an
item from lane 31 to lane 0 (63, 15, 7, 3, and 1 are all fine though), and
vec_extract is probably fewer cycles anyway, so now we always reduce to an
SGPR.

gcc/ChangeLog:

* config/gcn/gcn-valu.md (V64_SI): Delete iterator.
(V64_DI): Likewise.
(V64_1REG): Likewise.
(V64_INT_1REG): Likewise.
(V64_2REG): Likewise.
(V64_ALL): Likewise.
(V64_FP): Likewise.
(reduc__scal_): Use V_ALL. Use gen_vec_extract.
(fold_left_plus_): Use V_FP.
(*_dpp_shr_): Use V_1REG.
(*_dpp_shr_): Use V_DI.
(*plus_carry_dpp_shr_): Use V_INT_1REG.
(*plus_carry_in_dpp_shr_): Use V_SI.
(*plus_carry_dpp_shr_): Use V_DI.
(mov_from_lane63_): Delete.
(mov_from_lane63_): Delete.
* config/gcn/gcn.cc (gcn_expand_reduc_scalar): Support partial vectors.
* config/gcn/gcn.md (unspec): Remove UNSPEC_MOV_FROM_LANE63.

diff --git a/gcc/config/gcn/gcn-valu.md b/gcc/config/gcn/gcn-valu.md
index 00c0e3be1ea..6274d2e9228 100644
--- a/gcc/config/gcn/gcn-valu.md
+++ b/gcc/config/gcn/gcn-valu.md
@@ -32,11 +32,6 @@ (define_mode_iterator V_DI
 (define_mode_iterator V_DF
  [V2DF V4DF V8DF V16DF V32DF V64DF])
 
-(define_mode_iterator V64_SI
- [V64SI])
-(define_mode_iterator V64_DI
- [V64DI])
-
 ; Vector modes for sub-dword modes
 (define_mode_iterator V_QIHI
  [V2QI V2HI
@@ -77,13 +72,6 @@ (define_mode_iterator V_FP_1REG
   V32HF V32SF
   V64HF V64SF])
 
-; V64_* modes are for where more general support is unimplemented
-; (e.g. reductions)
-(define_mode_iterator V64_1REG
- [V64QI V64HI V64SI V64HF V64SF])
-(define_mode_iterator V64_INT_1REG
- [V64QI V64HI V64SI])
-
 ; Vector modes for two vector registers
 (define_mode_iterator V_2REG
  [V2DI V2DF
@@ -93,9 +81,6 @@ (define_mode_iterator V_2REG
   V32DI V32DF
   V64DI V64DF])
 
-(define_mode_iterator V64_2REG
- [V64DI V64DF])
-
 ; Vector modes with native support
 (define_mode_iterator V_noQI
  [V2HI V2HF V2SI V2SF V2DI V2DF
@@ -158,11 +143,6 @@ (define_mode_iterator V_FP
   V32HF V32SF V32DF
   V64HF V64SF V64DF])
 
-(define_mode_iterator V64_ALL
- [V64QI V64HI V64HF V64SI V64SF V64DI V64DF])
-(define_mode_iterator V64_FP
- [V64HF V64SF V64DF])
-
 (define_mode_attr scalar_mode
   [(V2QI "qi") (V2HI "hi") (V2SI "si")
(V2HF "hf") (V2SF "sf") (V2DI "di") (V2DF "df")
@@ -3528,15 +3508,16 @@ (define_int_attr reduc_insn [(UNSPEC_SMIN_DPP_SHR 
"v_min%i0")
 (define_expand "reduc__scal_"
   [(set (match_operand: 0 "register_operand")
(unspec:
- [(match_operand:V64_ALL 1 "register_operand")]
+ [(match_operand:V_ALL 1 "register_operand")]
  REDUC_UNSPEC))]
   ""
   {
 rtx tmp = gcn_expand_reduc_scalar (mode, operands[1],
   );
 
-/* The result of the reduction is in lane 63 of tmp.  */
-emit_insn (gen_mov_from_lane63_ (operands[0], tmp));
+rtx last_lane = GEN_INT (GET_MODE_NUNITS (mode) - 1);
+emit_insn (gen_vec_extract (operands[0], tmp,
+  last_lane));
 
 DONE;
   })
@@ -3547,7 +3528,7 @@ (define_expand "reduc__scal_"
 (define_expand "fold_left_plus_"
  [(match_operand: 0 "register_operand")
   (match_operand: 1 "gcn_alu_operand")
-  (match_operand:V64_FP 2 "gcn_alu_operand")]
+  (match_operand:V_FP 2 "gcn_alu_operand")]
   "can_create_pseudo_p ()
&& (flag_openacc || flag_openmp
|| flag_associative_math)"
@@ -3563,11 +3544,11 @@ (define_expand "fold_left_plus_"
})
 
 (define_insn "*_dpp_shr_"
-  [(set (match_operand:V64_1REG 0 "register_operand"   "=v")
-   (unspec:V64_1REG
- [(match_operand:V64_1REG 1 "register_operand" "v")
-  (match_operand:V64_1REG 2 "register_operand" "v")
-  (match_operand:SI 3 "const_int_operand"  "n")]
+  [(set (match_operand:V_1REG 0 

[committed] amdgcn: add fmin/fmax patterns

2022-10-31 Thread Andrew Stubbs
This patch adds patterns for the fmin and fmax operators, for scalars, 
vectors, and vector reductions.


The compiler uses smin and smax for most floating-point optimizations, 
etc., but not where the user calls fmin/fmax explicitly.  On amdgcn the 
hardware min/max instructions are already IEEE compliant w.r.t. 
unordered values, so there's no need for separate implementations.
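
For illustration (not part of the patch), this is the kind of source that now
maps onto the new expanders: explicit fmin/fmax calls, which unlike
comparison-based min/max do not get folded to smin/smax:

#include <math.h>

float
clamp01 (float x)
{
  return fminf (fmaxf (x, 0.0f), 1.0f);   /* expands via the fmin/fmax optabs */
}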


Andrew

amdgcn: add fmin/fmax patterns

Add fmin/fmax for scalar, vector, and reductions.  The smin/smax patterns are
already using the IEEE compliant hardware instructions anyway, so we can just
expand to use those insns.

gcc/ChangeLog:

* config/gcn/gcn-valu.md (fminmaxop): New iterator.
(3): New define_expand.
(3): Likewise.
(reduc__scal_): Likewise.
* config/gcn/gcn.md (fexpander): New attribute.

diff --git a/gcc/config/gcn/gcn-valu.md b/gcc/config/gcn/gcn-valu.md
index 6274d2e9228..3b619512e13 100644
--- a/gcc/config/gcn/gcn-valu.md
+++ b/gcc/config/gcn/gcn-valu.md
@@ -2466,6 +2466,23 @@ (define_insn "3"
   [(set_attr "type" "vop2")
(set_attr "length" "8,8")])
 
+(define_code_iterator fminmaxop [smin smax])
+(define_expand "3"
+  [(set (match_operand:FP 0 "gcn_valu_dst_operand")
+   (fminmaxop:FP
+ (match_operand:FP 1 "gcn_valu_src0_operand")
+ (match_operand:FP 2 "gcn_valu_src1_operand")))]
+  ""
+  {})
+
+(define_expand "3"
+  [(set (match_operand:V_FP 0 "gcn_valu_dst_operand")
+   (fminmaxop:V_FP
+ (match_operand:V_FP 1 "gcn_valu_src0_operand")
+ (match_operand:V_FP 2 "gcn_valu_src1_operand")))]
+  ""
+  {})
+
 ;; }}}
 ;; {{{ FP unops
 
@@ -3522,6 +3539,17 @@ (define_expand "reduc__scal_"
 DONE;
   })
 
+(define_expand "reduc__scal_"
+  [(match_operand: 0 "register_operand")
+   (fminmaxop:V_FP
+ (match_operand:V_FP 1 "register_operand"))]
+  ""
+  {
+/* fmin/fmax are identical to smin/smax.  */
+emit_insn (gen_reduc__scal_ (operands[0], operands[1]));
+DONE;
+  })
+
 ;; Warning: This "-ffast-math" implementation converts in-order reductions
 ;;  into associative reductions. It's also used where OpenMP or
 ;;  OpenACC paralellization has already broken the in-order semantics.
diff --git a/gcc/config/gcn/gcn.md b/gcc/config/gcn/gcn.md
index 6c1a438f9d1..987b76396cc 100644
--- a/gcc/config/gcn/gcn.md
+++ b/gcc/config/gcn/gcn.md
@@ -372,6 +372,10 @@ (define_code_attr expander
(sign_extend "extend")
(zero_extend "zero_extend")])
 
+(define_code_attr fexpander
+  [(smin "fmin")
+   (smax "fmax")])
+
 ;; }}}
 ;; {{{ Miscellaneous instructions
 


[committed] amdgcn: Silence unused parameter warning

2022-10-31 Thread Andrew Stubbs
A function parameter was left over from a previous draft of my 
multiple-vector-length patch. This patch silences the harmless warning.


Andrew

amdgcn: Silence unused parameter warning

gcc/ChangeLog:

* config/gcn/gcn.cc (gcn_simd_clone_compute_vecsize_and_simdlen):
Set base_type as ARG_UNUSED.

diff --git a/gcc/config/gcn/gcn.cc b/gcc/config/gcn/gcn.cc
index a9ef5c3dc02..a561976d7f5 100644
--- a/gcc/config/gcn/gcn.cc
+++ b/gcc/config/gcn/gcn.cc
@@ -5026,7 +5026,7 @@ gcn_vectorization_cost (enum vect_cost_for_stmt 
ARG_UNUSED (type_of_cost),
 static int
 gcn_simd_clone_compute_vecsize_and_simdlen (struct cgraph_node *ARG_UNUSED 
(node),
struct cgraph_simd_clone *clonei,
-   tree base_type,
+   tree ARG_UNUSED (base_type),
int ARG_UNUSED (num))
 {
   if (known_eq (clonei->simdlen, 0U))


RE: [GCC][PATCH v2] arm: Add cde feature support for Cortex-M55 CPU.

2022-10-31 Thread Srinath Parvathaneni via Gcc-patches
Hi,

> -Original Message-
> From: Christophe Lyon 
> Sent: Monday, October 17, 2022 2:30 PM
> To: Srinath Parvathaneni ; gcc-
> patc...@gcc.gnu.org
> Cc: Richard Earnshaw 
> Subject: Re: [GCC][PATCH] arm: Add cde feature support for Cortex-M55
> CPU.
> 
> Hi Srinath,
> 
> 
> On 10/10/22 10:20, Srinath Parvathaneni via Gcc-patches wrote:
> > Hi,
> >
> > This patch adds cde feature (optional) support for Cortex-M55 CPU,
> > please refer [1] for more details. To use this feature we need to
> > specify +cdecpN (e.g. -mcpu=cortex-m55+cdecp), where N is the
> coprocessor number 0 to 7.
> >
> > Bootstrapped for arm-none-linux-gnueabihf target, regression tested on
> > arm-none-eabi target and found no regressions.
> >
> > [1] https://developer.arm.com/documentation/101051/0101/?lang=en
> (version: r1p1).
> >
> > Ok for master?
> >
> > Regards,
> > Srinath.
> >
> > gcc/ChangeLog:
> >
> > 2022-10-07  Srinath Parvathaneni  
> >
> >  * common/config/arm/arm-common.cc (arm_canon_arch_option_1):
> Ignore cde
> >  options for mlibarch.
> >  * config/arm/arm-cpus.in (begin cpu cortex-m55): Add cde options.
> >  * doc/invoke.texi (CDE): Document options for Cortex-M55 CPU.
> >
> > gcc/testsuite/ChangeLog:
> >
> > 2022-10-07  Srinath Parvathaneni  
> >
> >  * gcc.target/arm/multilib.exp: Add multilib tests for Cortex-M55 
> > CPU.
> >
> >
> > ### Attachment also inlined for ease of reply
> ###
> >
> >
> > diff --git a/gcc/common/config/arm/arm-common.cc
> > b/gcc/common/config/arm/arm-common.cc
> > index
> >
> c38812f1ea6a690cd19b0dc74d963c4f5ae155ca..b6f955b3c012475f398382e72
> c9a
> > 3966412991ec 100644
> > --- a/gcc/common/config/arm/arm-common.cc
> > +++ b/gcc/common/config/arm/arm-common.cc
> > @@ -753,6 +753,15 @@ arm_canon_arch_option_1 (int argc, const char
> **argv, bool arch_for_multilib)
> > arm_initialize_isa (target_isa, selected_cpu->common.isa_bits);
> > arm_parse_option_features (target_isa, _cpu->common,
> >  strchr (cpu, '+'));
> > +  if (arch_for_multilib)
> > +   {
> > + const enum isa_feature removable_bits[] =
> {ISA_IGNORE_FOR_MULTILIB,
> > +isa_nobit};
> > + sbitmap isa_bits = sbitmap_alloc (isa_num_bits);
> > + arm_initialize_isa (isa_bits, removable_bits);
> > + bitmap_and_compl (target_isa, target_isa, isa_bits);
> > +   }
> > +
> 
> I can see the piece of code you add here is exactly the same as the one a few
> lines above when handling "if (arch)". Can this be moved below and thus be
> common to the two cases, or does it have to be performed before
> bitmap_ior of fpu_isa?

Thanks for pointing out this, I have moved the common code below the arch and 
cpu
if blocks in the attached patch.
 
> Also, IIUC, CDE was already optional for other CPUs (M33, M35P, star-mc1),
> so the hunk above fixes a latent bug when handling multilibs for these CPUs
> too? If so, maybe worth splitting the patch into two parts since the above is
> not strictly related to M55?
>
Even though CDE is optional for the mentioned CPUs as per the specs, the code to
enable CDE as an optional feature is missing in the current compiler.
The current GCC compiler supports CDE as an optional feature only with -march
options, and this patch adds CDE as optional for M55, so this is not a bug fix.

> But I'm not a maintainer ;-)
> 
> Thanks,
> 
> Christophe
> 
> > if (fpu && strcmp (fpu, "auto") != 0)
> > {
> >   /* The easiest and safest way to remove the default fpu diff
> > --git a/gcc/config/arm/arm-cpus.in b/gcc/config/arm/arm-cpus.in index
> >
> 5a63bc548e54dbfdce5d1df425bd615d81895d80..aa02c04c4924662f3ddd58e
> 69673
> > 92ba3f4b4a87 100644
> > --- a/gcc/config/arm/arm-cpus.in
> > +++ b/gcc/config/arm/arm-cpus.in
> > @@ -1633,6 +1633,14 @@ begin cpu cortex-m55
> >option nomve remove mve mve_float
> >option nofp remove ALL_FP mve_float
> >option nodsp remove MVE mve_float
> > + option cdecp0 add cdecp0
> > + option cdecp1 add cdecp1
> > + option cdecp2 add cdecp2
> > + option cdecp3 add cdecp3
> > + option cdecp4 add cdecp4
> > + option cdecp5 add cdecp5
> > + option cdecp6 add cdecp6
> > + option cdecp7 add cdecp7
> >isa quirk_no_asmcpu quirk_vlldm
> >costs v7m
> >vendor 41
> > diff --git a/gcc/doc/invoke.texi b/gcc/doc/invoke.texi index
> >
> aa5655764a0360959f9c1061749d2cc9ebd23489..26857f7a90e42d925bc69086
> 86ac
> > 78138a53c4ad 100644
> > --- a/gcc/doc/invoke.texi
> > +++ b/gcc/doc/invoke.texi
> > @@ -21698,6 +21698,10 @@ floating-point instructions on @samp{cortex-
> m55}.
> >   Disable the M-Profile Vector Extension (MVE) single precision floating-
> point
> >   instructions on @samp{cortex-m55}.
> >
> > +@item +cdecp0, +cdecp1, ... , +cdecp7 Enable the Custom Datapath
> > +Extension (CDE) on selected coprocessors according to the numbers
> > +given in the options in the range 0 to 7 on @samp{cortex-m55}.
> > 

Re: [PATCH]AArch64 Extend umov and sbfx patterns.

2022-10-31 Thread Richard Sandiford via Gcc-patches
Tamar Christina  writes:
> Hi All,
>
> Our zero and sign extend and extract patterns are currently very limited and
> only work for the original register size of the instructions. i.e. limited by
> GPI patterns.  However these instructions extract bits and extend.  This means
> that any register size can be used as an input as long as the extraction makes
> logical sense.
>
> The majority of the attached testcases fail currently to optimize.
>
> Bootstrapped Regtested on aarch64-none-linux-gnu and no issues.
>
> Ok for master?
>
> Thanks,
> Tamar
>
> gcc/ChangeLog:
>
>   * config/aarch64/aarch64-simd.md (aarch64_get_lane): Drop reload
>   penalty.
>   * config/aarch64/aarch64.md
>   (*_ashl): Renamed to...
>   (*_ashl): ...this.
>   (*zero_extend_lshr): Renamed to...
>   (*zero_extend_): ...this.
>   (*extend_ashr): Rename to...
>   (*extend_): ...this.
>
> gcc/testsuite/ChangeLog:
>
>   * gcc.target/aarch64/bitmove_1.c: New test.
>   * gcc.target/aarch64/bitmove_2.c: New test.

Looks like a nice change, but some comments below.

>
> --- inline copy of patch -- 
> diff --git a/gcc/config/aarch64/aarch64-simd.md 
> b/gcc/config/aarch64/aarch64-simd.md
> index 
> 8bcc9e76b1cad4a2591fb176175db72d7a190d57..23909c62638b49722568da4555b33c71fd21337e
>  100644
> --- a/gcc/config/aarch64/aarch64-simd.md
> +++ b/gcc/config/aarch64/aarch64-simd.md
> @@ -4259,7 +4259,7 @@ (define_insn 
> "*aarch64_get_lane_zero_extend"
>  ;; Extracting lane zero is split into a simple move when it is between SIMD
>  ;; registers or a store.
>  (define_insn_and_split "aarch64_get_lane"
> -  [(set (match_operand: 0 "aarch64_simd_nonimmediate_operand" "=?r, w, 
> Utv")
> +  [(set (match_operand: 0 "aarch64_simd_nonimmediate_operand" "=r, w, 
> Utv")
>   (vec_select:
> (match_operand:VALL_F16_FULL 1 "register_operand" "w, w, w")
> (parallel [(match_operand:SI 2 "immediate_operand" "i, i, i")])))]

Which testcase does this help with?  It didn't look like the new tests
do any vector stuff.

> diff --git a/gcc/config/aarch64/aarch64.md b/gcc/config/aarch64/aarch64.md
> index 
> 85b400489cb382a01b0c469eff2b600a93805e31..3116feda4fe54e2a21dc3f990b6976d216874260
>  100644
> --- a/gcc/config/aarch64/aarch64.md
> +++ b/gcc/config/aarch64/aarch64.md
> @@ -5629,13 +5629,13 @@ (define_insn "*si3_insn2_uxtw"
>  )
>  
>  (define_insn "*3_insn"
> -  [(set (match_operand:SHORT 0 "register_operand" "=r")
> - (ASHIFT:SHORT (match_operand:SHORT 1 "register_operand" "r")
> +  [(set (match_operand:ALLI 0 "register_operand" "=r")
> + (ASHIFT:ALLI (match_operand:ALLI 1 "register_operand" "r")
> (match_operand 2 "const_int_operand" "n")))]
>"UINTVAL (operands[2]) < GET_MODE_BITSIZE (mode)"
>  {
>operands[3] = GEN_INT ( - UINTVAL (operands[2]));
> -  return "\t%w0, %w1, %2, %3";
> +  return "\t%0, %1, %2, %3";
>  }
>[(set_attr "type" "bfx")]
>  )

Similar question here I guess.  There's a separate pattern for SI and DI
shifts, so I wouldn't have expected this to be necessary.

> @@ -5710,40 +5710,40 @@ (define_insn "*extrsi5_insn_di"
>[(set_attr "type" "rotate_imm")]
>  )
>  
> -(define_insn "*_ashl"
> +(define_insn "*_ashl"
>[(set (match_operand:GPI 0 "register_operand" "=r")
>   (ANY_EXTEND:GPI
> -  (ashift:SHORT (match_operand:SHORT 1 "register_operand" "r")
> +  (ashift:ALLX (match_operand:ALLX 1 "register_operand" "r")
>  (match_operand 2 "const_int_operand" "n"]
> -  "UINTVAL (operands[2]) < GET_MODE_BITSIZE (mode)"
> +  "UINTVAL (operands[2]) < GET_MODE_BITSIZE (mode)"

It'd be better to avoid even defining si<-si or si<-di "extensions"
(even though nothing should try to match them), so how about adding:

   >  && 

or similar to the beginning of the condition?  The conditions for
the invalid combos will then be provably false at compile time and
the patterns will be compiled out.

Same comment for the others.

>  {
> -  operands[3] = GEN_INT ( - UINTVAL (operands[2]));
> +  operands[3] = GEN_INT ( - UINTVAL (operands[2]));
>return "bfiz\t%0, %1, %2, %3";
>  }
>[(set_attr "type" "bfx")]
>  )
>  
> -(define_insn "*zero_extend_lshr"
> +(define_insn "*zero_extend_"
>[(set (match_operand:GPI 0 "register_operand" "=r")
>   (zero_extend:GPI
> -  (lshiftrt:SHORT (match_operand:SHORT 1 "register_operand" "r")
> -  (match_operand 2 "const_int_operand" "n"]
> -  "UINTVAL (operands[2]) < GET_MODE_BITSIZE (mode)"
> +  (LSHIFTRT_ONLY:ALLX (match_operand:ALLX 1 "register_operand" "r")
> +  (match_operand 2 "const_int_operand" "n"]
> +  "UINTVAL (operands[2]) < GET_MODE_BITSIZE (mode)"
>  {
> -  operands[3] = GEN_INT ( - UINTVAL (operands[2]));
> +  operands[3] = GEN_INT ( - UINTVAL (operands[2]));
>return "ubfx\t%0, %1, %2, %3";
>  }
>[(set_attr "type" "bfx")]
>  )

I think it'd be better to stick to the hard-coded lshiftrt, since nothing
in 

[PATCH 8/8]AArch64: Have reload not choose to do add on the scalar side if both values exist on the SIMD side.

2022-10-31 Thread Tamar Christina via Gcc-patches
Hi All,

Currently we often times generate an r -> r add even if it means we need two
reloads to perform it, i.e. in the case that the values are on the SIMD side.

The pairwise operations expose these more now and so we get suboptimal codegen.

Normally I would have liked to use ^ or $ here, but while this works for the
simple examples, reload inexplicably falls apart on examples that should have
been trivial. It forces a move to r -> w to use the w ADD, which is counter to
what ^ and $ should do.

However ! seems to fix all the regressions and still maintains the good codegen.

I have tried looking into whether it's our costings that are off, but I can't
see anything logical here.  So I'd like to push this change instead along with
test that augment the other testcases that guard the r -> r variants.

Bootstrapped Regtested on aarch64-none-linux-gnu and no issues.

Ok for master?

Thanks,
Tamar

gcc/ChangeLog:

* config/aarch64/aarch64.md (*add3_aarch64): Add ! to the r -> r
alternative.

gcc/testsuite/ChangeLog:

* gcc.target/aarch64/simd/scalar_addp.c: New test.
* gcc.target/aarch64/simd/scalar_faddp.c: New test.
* gcc.target/aarch64/simd/scalar_faddp2.c: New test.
* gcc.target/aarch64/simd/scalar_fmaxp.c: New test.
* gcc.target/aarch64/simd/scalar_fminp.c: New test.
* gcc.target/aarch64/simd/scalar_maxp.c: New test.
* gcc.target/aarch64/simd/scalar_minp.c: New test.

--- inline copy of patch -- 
diff --git a/gcc/config/aarch64/aarch64.md b/gcc/config/aarch64/aarch64.md
index 
09ae1118371f82ca63146fceb953eb9e820d05a4..c333fb1f72725992bb304c560f1245a242d5192d
 100644
--- a/gcc/config/aarch64/aarch64.md
+++ b/gcc/config/aarch64/aarch64.md
@@ -2043,7 +2043,7 @@ (define_expand "add3"
 
 (define_insn "*add3_aarch64"
   [(set
-(match_operand:GPI 0 "register_operand" "=rk,rk,w,rk,r,r,rk")
+(match_operand:GPI 0 "register_operand" "=rk,!rk,w,rk,r,r,rk")
 (plus:GPI
  (match_operand:GPI 1 "register_operand" "%rk,rk,w,rk,rk,0,rk")
  (match_operand:GPI 2 "aarch64_pluslong_operand" "I,r,w,J,Uaa,Uai,Uav")))]
diff --git a/gcc/testsuite/gcc.target/aarch64/simd/scalar_addp.c 
b/gcc/testsuite/gcc.target/aarch64/simd/scalar_addp.c
new file mode 100644
index 
..5b8d40f19884fc7b4e7decd80758bc36fa76d058
--- /dev/null
+++ b/gcc/testsuite/gcc.target/aarch64/simd/scalar_addp.c
@@ -0,0 +1,70 @@
+/* { dg-do assemble } */
+/* { dg-additional-options "-save-temps -O1 -std=c99" } */
+/* { dg-final { check-function-bodies "**" "" "" { target { le } } } } */
+
+typedef long long v2di __attribute__((vector_size (16)));
+typedef unsigned long long v2udi __attribute__((vector_size (16)));
+typedef int v2si __attribute__((vector_size (16)));
+typedef unsigned int v2usi __attribute__((vector_size (16)));
+
+/*
+** foo:
+** addpd0, v0.2d
+** fmovx0, d0
+** ret
+*/
+long long
+foo (v2di x)
+{
+  return x[1] + x[0];
+}
+
+/*
+** foo1:
+** saddlp  v0.1d, v0.2s
+** fmovx0, d0
+** ret
+*/
+long long
+foo1 (v2si x)
+{
+  return x[1] + x[0];
+}
+
+/*
+** foo2:
+** uaddlp  v0.1d, v0.2s
+** fmovx0, d0
+** ret
+*/
+unsigned long long
+foo2 (v2usi x)
+{
+  return x[1] + x[0];
+}
+
+/*
+** foo3:
+** uaddlp  v0.1d, v0.2s
+** add d0, d0, d1
+** fmovx0, d0
+** ret
+*/
+unsigned long long
+foo3 (v2usi x, v2udi y)
+{
+  return (x[1] + x[0]) + y[0];
+}
+
+/*
+** foo4:
+** saddlp  v0.1d, v0.2s
+** add d0, d0, d1
+** fmovx0, d0
+** ret
+*/
+long long
+foo4 (v2si x, v2di y)
+{
+  return (x[1] + x[0]) + y[0];
+}
diff --git a/gcc/testsuite/gcc.target/aarch64/simd/scalar_faddp.c 
b/gcc/testsuite/gcc.target/aarch64/simd/scalar_faddp.c
new file mode 100644
index 
..ff455e060fc833b2f63e89c467b91a76fbe31aff
--- /dev/null
+++ b/gcc/testsuite/gcc.target/aarch64/simd/scalar_faddp.c
@@ -0,0 +1,66 @@
+/* { dg-do assemble } */
+/* { dg-require-effective-target arm_v8_2a_fp16_scalar_ok } */
+/* { dg-add-options arm_v8_2a_fp16_scalar } */
+/* { dg-additional-options "-save-temps -O1" } */
+/* { dg-final { check-function-bodies "**" "" "" { target { le } } } } */
+
+typedef double v2df __attribute__((vector_size (16)));
+typedef float v4sf __attribute__((vector_size (16)));
+typedef __fp16 v8hf __attribute__((vector_size (16)));
+
+/*
+** foo:
+** faddp   d0, v0.2d
+** ret
+*/
+double
+foo (v2df x)
+{
+  return x[1] + x[0];
+}
+
+/*
+** foo1:
+** faddp   s0, v0.2s
+** ret
+*/
+float
+foo1 (v4sf x)
+{
+  return x[0] + x[1];
+}
+
+/*
+** foo2:
+** faddp   h0, v0.2h
+** ret
+*/
+__fp16
+foo2 (v8hf x)
+{
+  return x[0] + x[1];
+}
+
+/*
+** foo3:
+** ext v0.16b, v0.16b, v0.16b, #4
+** faddp   s0, v0.2s
+** ret
+*/
+float
+foo3 (v4sf x)
+{
+  return x[1] + x[2];
+}
+
+/*
+** foo4:
+** dup s0, v0.s\[3\]
+** faddp   h0, v0.2h
+** ret
+*/
+__fp16
+foo4 (v8hf x)

[PATCH 7/8]AArch64: Consolidate zero and sign extension patterns and add missing ones.

2022-10-31 Thread Tamar Christina via Gcc-patches
Hi All,

The target has various zero and sign extension patterns.  These however live in
various locations around the MD file and almost all of them are split
differently.  Due to the various patterns we also ended up missing valid
extensions.  For instance smov is almost never generated.
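
As an illustration (not one of the new testcases), extracting a 32-bit lane and
sign-extending it to 64 bits is the kind of place where a single smov would be
expected after this change:

typedef int v4si __attribute__ ((vector_size (16)));

long long
lane_to_s64 (v4si v)
{
  return v[1];   /* ideally a single smov x0, v0.s[1] */
}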

This change tries to make this more manageable by consolidating the patterns as
much as possible and in doing so fix the missing alternatives.

There were also some duplicate patterns.  Note that the
zero_extend<*_ONLY:mode>2  patterns are nearly identical however
QImode lacks an alternative that the others don't have, so I have left them as
3 different patterns next to each other.

In a lot of cases the wrong iterator was used leaving out cases that should
exist.

I've also changed the masks used for zero extensions to hex instead of decimal,
as it's clearer what they do that way, and it aligns better with the output of
other compilers.

This leaves the bulk of the extensions in just 3 patterns.

Bootstrapped Regtested on aarch64-none-linux-gnu and no issues.

Ok for master?

Thanks,
Tamar

gcc/ChangeLog:

* config/aarch64/aarch64-simd.md
(*aarch64_get_lane_zero_extend): Changed to ...
(*aarch64_get_lane_zero_extend): ... This.
(*aarch64_get_lane_extenddi): New.
* config/aarch64/aarch64.md (sidi2, *extendsidi2_aarch64,
qihi2, *extendqihi2_aarch64, *zero_extendsidi2_aarch64): Remove
duplicate patterns.
(2,
*extend2_aarch64): Remove, consolidate
into ...
(extend2): ... This.
(*zero_extendqihi2_aarch64,
*zero_extend2_aarch64): Remove, consolidate into
...
(zero_extend2,
zero_extend2,
zero_extend2):
(*ands_compare0): Renamed to ...
(*ands_compare0): ... This.
* config/aarch64/iterators.md (HI_ONLY, QI_ONLY): New.
(short_mask): Use hex rather than dec and add SI.

gcc/testsuite/ChangeLog:

* gcc.target/aarch64/ands_3.c: Update codegen.
* gcc.target/aarch64/sve/slp_1.c: Likewise.
* gcc.target/aarch64/tst_5.c: Likewise.
* gcc.target/aarch64/tst_6.c: Likewise.

--- inline copy of patch -- 
diff --git a/gcc/config/aarch64/aarch64-simd.md 
b/gcc/config/aarch64/aarch64-simd.md
index 
8a84a8560e982b8155b18541f5504801b3330124..d0b37c4dd48aeafd3d87c90dc3270e71af5a72b9
 100644
--- a/gcc/config/aarch64/aarch64-simd.md
+++ b/gcc/config/aarch64/aarch64-simd.md
@@ -4237,19 +4237,34 @@ (define_insn 
"*aarch64_get_lane_extend"
   [(set_attr "type" "neon_to_gp")]
 )
 
-(define_insn "*aarch64_get_lane_zero_extend"
+(define_insn "*aarch64_get_lane_extenddi"
+  [(set (match_operand:DI 0 "register_operand" "=r")
+   (sign_extend:DI
+ (vec_select:
+   (match_operand:VS 1 "register_operand" "w")
+   (parallel [(match_operand:SI 2 "immediate_operand" "i")]]
+  "TARGET_SIMD"
+  {
+operands[2] = aarch64_endian_lane_rtx (mode,
+  INTVAL (operands[2]));
+return "smov\\t%x0, %1.[%2]";
+  }
+  [(set_attr "type" "neon_to_gp")]
+)
+
+(define_insn "*aarch64_get_lane_zero_extend"
   [(set (match_operand:GPI 0 "register_operand" "=r")
(zero_extend:GPI
- (vec_select:
-   (match_operand:VDQQH 1 "register_operand" "w")
+ (vec_select:
+   (match_operand:VDQV_L 1 "register_operand" "w")
(parallel [(match_operand:SI 2 "immediate_operand" "i")]]
   "TARGET_SIMD"
   {
-operands[2] = aarch64_endian_lane_rtx (mode,
+operands[2] = aarch64_endian_lane_rtx (mode,
   INTVAL (operands[2]));
-return "umov\\t%w0, %1.[%2]";
+return "umov\\t%w0, %1.[%2]";
   }
-  [(set_attr "type" "neon_to_gp")]
+  [(set_attr "type" "neon_to_gp")]
 )
 
 ;; Lane extraction of a value, neither sign nor zero extension
diff --git a/gcc/config/aarch64/aarch64.md b/gcc/config/aarch64/aarch64.md
index 
3ea16dbc2557c6a4f37104d44a49f77f768eb53d..09ae1118371f82ca63146fceb953eb9e820d05a4
 100644
--- a/gcc/config/aarch64/aarch64.md
+++ b/gcc/config/aarch64/aarch64.md
@@ -1911,22 +1911,6 @@ (define_insn "storewb_pair_"
 ;; Sign/Zero extension
 ;; ---
 
-(define_expand "sidi2"
-  [(set (match_operand:DI 0 "register_operand")
-   (ANY_EXTEND:DI (match_operand:SI 1 "nonimmediate_operand")))]
-  ""
-)
-
-(define_insn "*extendsidi2_aarch64"
-  [(set (match_operand:DI 0 "register_operand" "=r,r")
-(sign_extend:DI (match_operand:SI 1 "nonimmediate_operand" "r,m")))]
-  ""
-  "@
-   sxtw\t%0, %w1
-   ldrsw\t%0, %1"
-  [(set_attr "type" "extend,load_4")]
-)
-
 (define_insn "*load_pair_extendsidi2_aarch64"
   [(set (match_operand:DI 0 "register_operand" "=r")
(sign_extend:DI (match_operand:SI 1 "aarch64_mem_pair_operand" "Ump")))
@@ -1940,21 +1924,6 @@ (define_insn "*load_pair_extendsidi2_aarch64"
   [(set_attr "type" "load_8")]
 )
 
-(define_insn 

[PATCH 5/8]AArch64 aarch64: Make existing V2HF be usable.

2022-10-31 Thread Tamar Christina via Gcc-patches
Hi All,

The backend has an existing V2HFmode that is used by pairwise operations.
This mode was however never made fully functional.  Amongst other things it was
never declared as a vector type, which made it unusable from the mid-end.

It's also lacking an implementation for loads/stores, so reload ICEs if this
mode is ever used.  This patch finishes the implementation by providing the
above.

Note that I have created a new iterator VHSDF_P instead of extending VHSDF
because the previous iterator is used in far more things than just load/stores.

It's also used, for instance, in intrinsics, and extending it would force me to
provide support for mangling the type even though we never expose it through
intrinsics.
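
As a hedged sketch (assuming a user-level 2 x __fp16 vector maps onto
V2HFmode; this is not one of the patch's tests), the following now has
working moves and loads/stores to be expanded with:

typedef __fp16 v2hf __attribute__ ((vector_size (4)));

void
copy2 (v2hf *dst, const v2hf *src)
{
  /* Requires V2HF loads/stores; previously this mode would ICE in reload
     if it was ever chosen.  */
  *dst = *src;
}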

Bootstrapped Regtested on aarch64-none-linux-gnu and no issues.

Ok for master?

Thanks,
Tamar

gcc/ChangeLog:

* config/aarch64/aarch64-simd.md (*aarch64_simd_movv2hf): New.
(mov, movmisalign, aarch64_dup_lane,
aarch64_store_lane0, aarch64_simd_vec_set,
@aarch64_simd_vec_copy_lane, vec_set,
reduc__scal_, reduc__scal_,
aarch64_reduc__internal, aarch64_get_lane,
vec_init, vec_extract): Support V2HF.
* config/aarch64/aarch64.cc (aarch64_classify_vector_mode):
Add E_V2HFmode.
* config/aarch64/iterators.md (VHSDF_P): New.
(V2F, VALL_F16_FULL, nunits, Vtype, Vmtype, Vetype, stype, VEL,
Vel, q, vp): Add V2HF.
* config/arm/types.md (neon_fp_reduc_add_h): New.

gcc/testsuite/ChangeLog:

* gcc.target/aarch64/sve/slp_1.c: Update testcase.

--- inline copy of patch -- 
diff --git a/gcc/config/aarch64/aarch64-simd.md 
b/gcc/config/aarch64/aarch64-simd.md
index 
25aed74f8cf939562ed65a578fe32ca76605b58a..93a2888f567460ad10ec050ea7d4f701df4729d1
 100644
--- a/gcc/config/aarch64/aarch64-simd.md
+++ b/gcc/config/aarch64/aarch64-simd.md
@@ -19,10 +19,10 @@
 ;; .
 
 (define_expand "mov"
-  [(set (match_operand:VALL_F16 0 "nonimmediate_operand")
-   (match_operand:VALL_F16 1 "general_operand"))]
+  [(set (match_operand:VALL_F16_FULL 0 "nonimmediate_operand")
+   (match_operand:VALL_F16_FULL 1 "general_operand"))]
   "TARGET_SIMD"
-  "
+{
   /* Force the operand into a register if it is not an
  immediate whose use can be replaced with xzr.
  If the mode is 16 bytes wide, then we will be doing
@@ -46,12 +46,11 @@ (define_expand "mov"
   aarch64_expand_vector_init (operands[0], operands[1]);
   DONE;
 }
-  "
-)
+})
 
 (define_expand "movmisalign"
-  [(set (match_operand:VALL_F16 0 "nonimmediate_operand")
-(match_operand:VALL_F16 1 "general_operand"))]
+  [(set (match_operand:VALL_F16_FULL 0 "nonimmediate_operand")
+(match_operand:VALL_F16_FULL 1 "general_operand"))]
   "TARGET_SIMD && !STRICT_ALIGNMENT"
 {
   /* This pattern is not permitted to fail during expansion: if both arguments
@@ -85,10 +84,10 @@ (define_insn "aarch64_simd_dup"
 )
 
 (define_insn "aarch64_dup_lane"
-  [(set (match_operand:VALL_F16 0 "register_operand" "=w")
-   (vec_duplicate:VALL_F16
+  [(set (match_operand:VALL_F16_FULL 0 "register_operand" "=w")
+   (vec_duplicate:VALL_F16_FULL
  (vec_select:
-   (match_operand:VALL_F16 1 "register_operand" "w")
+   (match_operand:VALL_F16_FULL 1 "register_operand" "w")
(parallel [(match_operand:SI 2 "immediate_operand" "i")])
   )))]
   "TARGET_SIMD"
@@ -142,6 +141,29 @@ (define_insn "*aarch64_simd_mov"
 mov_reg, neon_move")]
 )
 
+(define_insn "*aarch64_simd_movv2hf"
+  [(set (match_operand:V2HF 0 "nonimmediate_operand"
+   "=w, m,  m,  w, ?r, ?w, ?r, w, w")
+   (match_operand:V2HF 1 "general_operand"
+   "m,  Dz, w,  w,  w,  r,  r, Dz, Dn"))]
+  "TARGET_SIMD_F16INST
+   && (register_operand (operands[0], V2HFmode)
+   || aarch64_simd_reg_or_zero (operands[1], V2HFmode))"
+   "@
+ldr\\t%s0, %1
+str\\twzr, %0
+str\\t%s1, %0
+mov\\t%0.2s[0], %1.2s[0]
+umov\\t%w0, %1.s[0]
+fmov\\t%s0, %1
+mov\\t%0, %1
+movi\\t%d0, 0
+* return aarch64_output_simd_mov_immediate (operands[1], 32);"
+  [(set_attr "type" "neon_load1_1reg, store_8, neon_store1_1reg,\
+neon_logic, neon_to_gp, f_mcr,\
+mov_reg, neon_move, neon_move")]
+)
+
 (define_insn "*aarch64_simd_mov"
   [(set (match_operand:VQMOV 0 "nonimmediate_operand"
"=w, Umn,  m,  w, ?r, ?w, ?r, w")
@@ -182,7 +204,7 @@ (define_insn "*aarch64_simd_mov"
 
 (define_insn "aarch64_store_lane0"
   [(set (match_operand: 0 "memory_operand" "=m")
-   (vec_select: (match_operand:VALL_F16 1 "register_operand" "w")
+   (vec_select: (match_operand:VALL_F16_FULL 1 "register_operand" "w")
(parallel [(match_operand 2 "const_int_operand" 
"n")])))]
   "TARGET_SIMD
&& ENDIAN_LANE_N (, INTVAL (operands[2])) == 0"
@@ -1035,11 +1057,11 @@ (define_insn "one_cmpl2"
 )
 
 (define_insn 

[PATCH 6/8]AArch64: Add peephole and scheduling logic for pairwise operations that appear late in RTL.

2022-10-31 Thread Tamar Christina via Gcc-patches
Hi All,

Does what it says on the tin: in case some operations form in RTL due to a
split, combine or any other RTL pass, still try to recognize them.

Bootstrapped Regtested on aarch64-none-linux-gnu and no issues.

Ok for master?

Thanks,
Tamar

gcc/ChangeLog:

* config/aarch64/aarch64-simd.md: Add new peepholes.
* config/aarch64/aarch64.cc (aarch_macro_fusion_pair_p): Schedule
sequential PLUS operations next to each other to increase the chance of
forming pairwise operations.

--- inline copy of patch -- 
diff --git a/gcc/config/aarch64/aarch64-simd.md 
b/gcc/config/aarch64/aarch64-simd.md
index 
93a2888f567460ad10ec050ea7d4f701df4729d1..20e9adbf7b9b484f9a19f0c62770930dc3941eb2
 100644
--- a/gcc/config/aarch64/aarch64-simd.md
+++ b/gcc/config/aarch64/aarch64-simd.md
@@ -3425,6 +3425,22 @@ (define_insn "aarch64_faddp"
   [(set_attr "type" "neon_fp_reduc_add_")]
 )
 
+(define_peephole2
+  [(set (match_operand: 0 "register_operand")
+   (vec_select:
+ (match_operand:VHSDF 1 "register_operand")
+ (parallel [(match_operand 2 "const_int_operand")])))
+   (set (match_operand: 3 "register_operand")
+   (plus:
+ (match_dup 0)
+ (match_operand: 5 "register_operand")))]
+  "TARGET_SIMD
+   && ENDIAN_LANE_N (, INTVAL (operands[2])) == 1
+   && REGNO (operands[5]) == REGNO (operands[1])
+   && peep2_reg_dead_p (2, operands[0])"
+  [(set (match_dup 3) (unspec: [(match_dup 1)] UNSPEC_FADDV))]
+)
+
 (define_insn "reduc_plus_scal_"
  [(set (match_operand: 0 "register_operand" "=w")
(unspec: [(match_operand:VDQV 1 "register_operand" "w")]
diff --git a/gcc/config/aarch64/aarch64.cc b/gcc/config/aarch64/aarch64.cc
index 
f3bd71c9f10868f9e6ab50d8e36ed3ee3d48ac22..4023b1729d92bf37f5a2fc8fc8cd3a5194532079
 100644
--- a/gcc/config/aarch64/aarch64.cc
+++ b/gcc/config/aarch64/aarch64.cc
@@ -25372,6 +25372,29 @@ aarch_macro_fusion_pair_p (rtx_insn *prev, rtx_insn 
*curr)
 }
 }
 
+  /* Try to schedule vec_select and add together so the peephole works.  */
+  if (simple_sets_p && REG_P (SET_DEST (prev_set)) && REG_P (SET_DEST 
(curr_set))
+  && GET_CODE (SET_SRC (prev_set)) == VEC_SELECT && GET_CODE (SET_SRC 
(curr_set)) == PLUS)
+  {
+/* We're trying to match:
+   prev (vec_select) == (set (reg r0)
+(vec_select (reg r1) n)
+   curr (plus) == (set (reg r2)
+  (plus (reg r0) (reg r1)))  */
+rtx prev_src = SET_SRC (prev_set);
+rtx curr_src = SET_SRC (curr_set);
+rtx parallel = XEXP (prev_src, 1);
+auto idx
+  = ENDIAN_LANE_N (GET_MODE_NUNITS (GET_MODE (XEXP (prev_src, 0))), 1);
+if (GET_CODE (parallel) == PARALLEL
+   && XVECLEN (parallel, 0) == 1
+   && known_eq (INTVAL (XVECEXP (parallel, 0, 0)), idx)
+   && GET_MODE (SET_DEST (prev_set)) == GET_MODE (curr_src)
+   && GET_MODE_INNER (GET_MODE (XEXP (prev_src, 0)))
+   == GET_MODE (XEXP (curr_src, 1)))
+  return true;
+  }
+
   /* Fuse compare (CMP/CMN/TST/BICS) and conditional branch.  */
   if (aarch64_fusion_enabled_p (AARCH64_FUSE_CMP_BRANCH)
   && prev_set && curr_set && any_condjump_p (curr)




[PATCH 4/8]AArch64 aarch64: Implement widening reduction patterns

2022-10-31 Thread Tamar Christina via Gcc-patches
Hi All,

This implements the new widening reduction optab in the backend.
Instead of introducing a duplicate definition for the same thing, I have
renamed the intrinsics definitions to use the same optab.
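
As a small, hedged usage example (assumed, not part of this patch), the
existing intrinsics keep working and now simply expand through the renamed
pattern:

#include <arm_neon.h>

int32_t
widen_sum (int16x4_t x)
{
  /* Expands via reduc_plus_widen_scal_v4hi, i.e. saddlv.  */
  return vaddlv_s16 (x);
}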

Bootstrapped Regtested on aarch64-none-linux-gnu and no issues.

Ok for master?

Thanks,
Tamar

gcc/ChangeLog:

* config/aarch64/aarch64-simd-builtins.def (saddlv, uaddlv): Rename to
reduc_splus_widen_scal_ and reduc_uplus_widen_scal_ respectively.
* config/aarch64/aarch64-simd.md (aarch64_addlv): Renamed to
...
(reduc_plus_widen_scal_): ... This.
* config/aarch64/arm_neon.h (vaddlv_s8, vaddlv_s16, vaddlv_u8,
vaddlv_u16, vaddlvq_s8, vaddlvq_s16, vaddlvq_s32, vaddlvq_u8,
vaddlvq_u16, vaddlvq_u32, vaddlv_s32, vaddlv_u32): Use it.

--- inline copy of patch -- 
diff --git a/gcc/config/aarch64/aarch64-simd-builtins.def 
b/gcc/config/aarch64/aarch64-simd-builtins.def
index 
cf46b31627b84476a25762ffc708fd84a4086e43..a4b21e1495c5699d8557a4bcb9e73ef98ae60b35
 100644
--- a/gcc/config/aarch64/aarch64-simd-builtins.def
+++ b/gcc/config/aarch64/aarch64-simd-builtins.def
@@ -190,9 +190,9 @@
   BUILTIN_VDQV_L (UNOP, saddlp, 0, NONE)
   BUILTIN_VDQV_L (UNOPU, uaddlp, 0, NONE)
 
-  /* Implemented by aarch64_addlv.  */
-  BUILTIN_VDQV_L (UNOP, saddlv, 0, NONE)
-  BUILTIN_VDQV_L (UNOPU, uaddlv, 0, NONE)
+  /* Implemented by reduc_plus_widen_scal_.  */
+  BUILTIN_VDQV_L (UNOP, reduc_splus_widen_scal_, 10, NONE)
+  BUILTIN_VDQV_L (UNOPU, reduc_uplus_widen_scal_, 10, NONE)
 
   /* Implemented by aarch64_abd.  */
   BUILTIN_VDQ_BHSI (BINOP, sabd, 0, NONE)
diff --git a/gcc/config/aarch64/aarch64-simd.md 
b/gcc/config/aarch64/aarch64-simd.md
index 
cf8c094bd4b76981cef2dd5dd7b8e6be0d56101f..25aed74f8cf939562ed65a578fe32ca76605b58a
 100644
--- a/gcc/config/aarch64/aarch64-simd.md
+++ b/gcc/config/aarch64/aarch64-simd.md
@@ -3455,7 +3455,7 @@ (define_expand "reduc_plus_scal_v4sf"
   DONE;
 })
 
-(define_insn "aarch64_addlv"
+(define_insn "reduc_plus_widen_scal_"
  [(set (match_operand: 0 "register_operand" "=w")
(unspec: [(match_operand:VDQV_L 1 "register_operand" "w")]
USADDLV))]
diff --git a/gcc/config/aarch64/arm_neon.h b/gcc/config/aarch64/arm_neon.h
index 
cf6af728ca99dae1cb6ab647466cfec32f7e913e..7b2c4c016191bcd6c3e075d27810faedb23854b7
 100644
--- a/gcc/config/aarch64/arm_neon.h
+++ b/gcc/config/aarch64/arm_neon.h
@@ -3664,70 +3664,70 @@ __extension__ extern __inline int16_t
 __attribute__ ((__always_inline__, __gnu_inline__, __artificial__))
 vaddlv_s8 (int8x8_t __a)
 {
-  return __builtin_aarch64_saddlvv8qi (__a);
+  return __builtin_aarch64_reduc_splus_widen_scal_v8qi (__a);
 }
 
 __extension__ extern __inline int32_t
 __attribute__ ((__always_inline__, __gnu_inline__, __artificial__))
 vaddlv_s16 (int16x4_t __a)
 {
-  return __builtin_aarch64_saddlvv4hi (__a);
+  return __builtin_aarch64_reduc_splus_widen_scal_v4hi (__a);
 }
 
 __extension__ extern __inline uint16_t
 __attribute__ ((__always_inline__, __gnu_inline__, __artificial__))
 vaddlv_u8 (uint8x8_t __a)
 {
-  return __builtin_aarch64_uaddlvv8qi_uu (__a);
+  return __builtin_aarch64_reduc_uplus_widen_scal_v8qi_uu (__a);
 }
 
 __extension__ extern __inline uint32_t
 __attribute__ ((__always_inline__, __gnu_inline__, __artificial__))
 vaddlv_u16 (uint16x4_t __a)
 {
-  return __builtin_aarch64_uaddlvv4hi_uu (__a);
+  return __builtin_aarch64_reduc_uplus_widen_scal_v4hi_uu (__a);
 }
 
 __extension__ extern __inline int16_t
 __attribute__ ((__always_inline__, __gnu_inline__, __artificial__))
 vaddlvq_s8 (int8x16_t __a)
 {
-  return __builtin_aarch64_saddlvv16qi (__a);
+  return __builtin_aarch64_reduc_splus_widen_scal_v16qi (__a);
 }
 
 __extension__ extern __inline int32_t
 __attribute__ ((__always_inline__, __gnu_inline__, __artificial__))
 vaddlvq_s16 (int16x8_t __a)
 {
-  return __builtin_aarch64_saddlvv8hi (__a);
+  return __builtin_aarch64_reduc_splus_widen_scal_v8hi (__a);
 }
 
 __extension__ extern __inline int64_t
 __attribute__ ((__always_inline__, __gnu_inline__, __artificial__))
 vaddlvq_s32 (int32x4_t __a)
 {
-  return __builtin_aarch64_saddlvv4si (__a);
+  return __builtin_aarch64_reduc_splus_widen_scal_v4si (__a);
 }
 
 __extension__ extern __inline uint16_t
 __attribute__ ((__always_inline__, __gnu_inline__, __artificial__))
 vaddlvq_u8 (uint8x16_t __a)
 {
-  return __builtin_aarch64_uaddlvv16qi_uu (__a);
+  return __builtin_aarch64_reduc_uplus_widen_scal_v16qi_uu (__a);
 }
 
 __extension__ extern __inline uint32_t
 __attribute__ ((__always_inline__, __gnu_inline__, __artificial__))
 vaddlvq_u16 (uint16x8_t __a)
 {
-  return __builtin_aarch64_uaddlvv8hi_uu (__a);
+  return __builtin_aarch64_reduc_uplus_widen_scal_v8hi_uu (__a);
 }
 
 __extension__ extern __inline uint64_t
 __attribute__ ((__always_inline__, __gnu_inline__, __artificial__))
 vaddlvq_u32 (uint32x4_t __a)
 {
-  return __builtin_aarch64_uaddlvv4si_uu (__a);
+  return 

[PATCH 1/8]middle-end: Recognize scalar reductions from bitfields and array_refs

2022-10-31 Thread Tamar Christina via Gcc-patches
Hi All,

This patch series is to add recognition of pairwise operations (reductions)
in match.pd such that we can benefit from them even at -O1 when the vectorizer
isn't enabled.

The use of these allows for a lot simpler codegen on AArch64 and lets us
avoid quite a lot of codegen warts.

As an example a simple:

typedef float v4sf __attribute__((vector_size (16)));

float
foo3 (v4sf x)
{
  return x[1] + x[2];
}

currently generates:

foo3:
dup s1, v0.s[1]
dup s0, v0.s[2]
fadd    s0, s1, s0
ret

while with this patch series now generates:

foo3:
ext v0.16b, v0.16b, v0.16b, #4
faddp   s0, v0.2s
ret

This patch will not perform the operation if the source is not a gimple
register, and leaves memory sources to the vectorizer, as it's able to deal
correctly with clobbers.

The use of these instructions makes a significant difference in codegen quality
for AArch64 and Arm.
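
As a further hedged example (not part of this series' testsuite), the
conditional form of the new match.pd rules should also catch pairwise
min/max over adjacent lanes, reusing the v4sf typedef above:

float
foo_min (v4sf x)
{
  /* Expected to be recognized as a pairwise IFN_REDUC_FMIN over a two-lane
     subvector instead of two lane moves and a scalar compare.  */
  return x[0] < x[1] ? x[0] : x[1];
}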

NOTE: The last entry in the series contains tests for all of the previous
patches as it's a bit of an all or nothing thing.

Bootstrapped Regtested on aarch64-none-linux-gnu, x86_64-pc-linux-gnu
and no issues.

Ok for master?

Thanks,
Tamar

gcc/ChangeLog:

* match.pd (adjacent_data_access_p): Import.
Add new pattern for bitwise plus, min, max, fmax, fmin.
* tree-cfg.cc (verify_gimple_call): Allow function arguments in IFNs.
* tree.cc (adjacent_data_access_p): New.
* tree.h (adjacent_data_access_p): New.

--- inline copy of patch -- 
diff --git a/gcc/match.pd b/gcc/match.pd
index 
2617d56091dfbd41ae49f980ee0af3757f5ec1cf..aecaa3520b36e770d11ea9a10eb18db23c0cd9f7
 100644
--- a/gcc/match.pd
+++ b/gcc/match.pd
@@ -39,7 +39,8 @@ along with GCC; see the file COPYING3.  If not see
HONOR_NANS
uniform_vector_p
expand_vec_cmp_expr_p
-   bitmask_inv_cst_vector_p)
+   bitmask_inv_cst_vector_p
+   adjacent_data_access_p)
 
 /* Operator lists.  */
 (define_operator_list tcc_comparison
@@ -7195,6 +7196,47 @@ DEFINE_INT_AND_FLOAT_ROUND_FN (RINT)
 
 /* Canonicalizations of BIT_FIELD_REFs.  */
 
+/* Canonicalize BIT_FIELD_REFS to pairwise operations. */
+(for op (plus min max FMIN_ALL FMAX_ALL)
+ ifn (IFN_REDUC_PLUS IFN_REDUC_MIN IFN_REDUC_MAX
+ IFN_REDUC_FMIN IFN_REDUC_FMAX)
+ (simplify
+  (op @0 @1)
+   (if (INTEGRAL_TYPE_P (type) || SCALAR_FLOAT_TYPE_P (type))
+(with { poly_uint64 nloc = 0;
+   tree src = adjacent_data_access_p (@0, @1, , true);
+   tree ntype = build_vector_type (type, 2);
+   tree size = TYPE_SIZE (ntype);
+   tree pos = build_int_cst (TREE_TYPE (size), nloc);
+   poly_uint64 _sz;
+   poly_uint64 _total; }
+ (if (src && is_gimple_reg (src) && ntype
+ && poly_int_tree_p (size, &_sz)
+ && poly_int_tree_p (TYPE_SIZE (TREE_TYPE (src)), &_total)
+ && known_ge (_total, _sz + nloc))
+  (ifn (BIT_FIELD_REF:ntype { src; } { size; } { pos; })))
+
+(for op (lt gt)
+ ifni (IFN_REDUC_MIN IFN_REDUC_MAX)
+ ifnf (IFN_REDUC_FMIN IFN_REDUC_FMAX)
+ (simplify
+  (cond (op @0 @1) @0 @1)
+   (if (INTEGRAL_TYPE_P (type) || SCALAR_FLOAT_TYPE_P (type))
+(with { poly_uint64 nloc = 0;
+   tree src = adjacent_data_access_p (@0, @1, , false);
+   tree ntype = build_vector_type (type, 2);
+   tree size = TYPE_SIZE (ntype);
+   tree pos = build_int_cst (TREE_TYPE (size), nloc);
+   poly_uint64 _sz;
+   poly_uint64 _total; }
+ (if (src && is_gimple_reg (src) && ntype
+ && poly_int_tree_p (size, &_sz)
+ && poly_int_tree_p (TYPE_SIZE (TREE_TYPE (src)), &_total)
+ && known_ge (_total, _sz + nloc))
+  (if (SCALAR_FLOAT_MODE_P (TYPE_MODE (type)))
+   (ifnf (BIT_FIELD_REF:ntype { src; } { size; } { pos; }))
+   (ifni (BIT_FIELD_REF:ntype { src; } { size; } { pos; }
+
 (simplify
  (BIT_FIELD_REF (BIT_FIELD_REF @0 @1 @2) @3 @4)
  (BIT_FIELD_REF @0 @3 { const_binop (PLUS_EXPR, bitsizetype, @2, @4); }))
diff --git a/gcc/tree-cfg.cc b/gcc/tree-cfg.cc
index 
91ec33c80a41e1e0cc6224e137dd42144724a168..b19710392940cf469de52d006603ae1e3deb6b76
 100644
--- a/gcc/tree-cfg.cc
+++ b/gcc/tree-cfg.cc
@@ -3492,6 +3492,7 @@ verify_gimple_call (gcall *stmt)
 {
   tree arg = gimple_call_arg (stmt, i);
   if ((is_gimple_reg_type (TREE_TYPE (arg))
+  && !is_gimple_variable (arg)
   && !is_gimple_val (arg))
  || (!is_gimple_reg_type (TREE_TYPE (arg))
  && !is_gimple_lvalue (arg)))
diff --git a/gcc/tree.h b/gcc/tree.h
index 
e6564aaccb7b69cd938ff60b6121aec41b7e8a59..8f8a9660c9e0605eb516de194640b8c1b531b798
 100644
--- a/gcc/tree.h
+++ b/gcc/tree.h
@@ -5006,6 +5006,11 @@ extern bool integer_pow2p (const_tree);
 
 extern tree bitmask_inv_cst_vector_p (tree);
 
+/* TRUE if the two operands represent adjacent access of data such that a
+   pairwise operation can be used.  */
+
+extern tree adjacent_data_access_p (tree, tree, 

[PATCH 3/8]middle-end: Support extractions of subvectors from arbitrary element position inside a vector

2022-10-31 Thread Tamar Christina via Gcc-patches
Hi All,

The current vector extract pattern can only extract from a vector when the
position to extract is a multiple of the vector bitsize as a whole.

That means extracting something like a V2SI from a V4SI vector at position 32
isn't possible, as 32 is not a multiple of 64.  Ideally this optab should have
worked on multiples of the element size, but too many targets rely on this
semantic now.

So instead add a new case which allows any extraction as long as the bit pos
is a multiple of the element size.  We use a VEC_PERM to shuffle the elements
into the bottom parts of the vector and then use a subreg to extract the values
out.  This now allows various vector operations that before were being
decomposed into very inefficient scalar operations.

NOTE: I added 3 testcases; I only fixed the 3rd one.

The 1st one is missed because we don't optimize VEC_PERM expressions into
bitfields.  The 2nd one is missed because extract_bit_field only works on
vector modes.  In this case the intermediate extract is DImode.

On targets where the scalar mode is tieable to vector modes the extract should
work fine.

However I ran out of time to fix the first two and so will do so in GCC 14.
For now this catches the case that my pattern now introduces more easily.

Bootstrapped Regtested on aarch64-none-linux-gnu, x86_64-pc-linux-gnu
and no issues.

Ok for master?

Thanks,
Tamar

gcc/ChangeLog:

* expmed.cc (extract_bit_field_1): Add support for vector element
extracts.

gcc/testsuite/ChangeLog:

* gcc.target/aarch64/ext_1.c: New.

--- inline copy of patch -- 
diff --git a/gcc/expmed.cc b/gcc/expmed.cc
index 
bab020c07222afa38305ef8d7333f271b1965b78..ffdf65210d17580a216477cfe4ac1598941ac9e4
 100644
--- a/gcc/expmed.cc
+++ b/gcc/expmed.cc
@@ -1718,6 +1718,45 @@ extract_bit_field_1 (rtx str_rtx, poly_uint64 bitsize, 
poly_uint64 bitnum,
  return target;
}
}
+  else if (!known_eq (bitnum, 0U)
+  && multiple_p (GET_MODE_UNIT_BITSIZE (tmode), bitnum, ))
+   {
+ /* The encoding has a single stepped pattern.  */
+ poly_uint64 nunits = GET_MODE_NUNITS (new_mode);
+ int nelts = nunits.to_constant ();
+ vec_perm_builder sel (nunits, nelts, 1);
+ int delta = -pos.to_constant ();
+ for (int i = 0; i < nelts; ++i)
+   sel.quick_push ((i - delta) % nelts);
+ vec_perm_indices indices (sel, 1, nunits);
+
+ if (can_vec_perm_const_p (new_mode, new_mode, indices, false))
+   {
+ class expand_operand ops[4];
+ machine_mode outermode = new_mode;
+ machine_mode innermode = tmode;
+ enum insn_code icode
+   = direct_optab_handler (vec_perm_optab, outermode);
+ target = gen_reg_rtx (outermode);
+ if (icode != CODE_FOR_nothing)
+   {
+ rtx sel = vec_perm_indices_to_rtx (outermode, indices);
+ create_output_operand ([0], target, outermode);
+ ops[0].target = 1;
+ create_input_operand ([1], op0, outermode);
+ create_input_operand ([2], op0, outermode);
+ create_input_operand ([3], sel, outermode);
+ if (maybe_expand_insn (icode, 4, ops))
+   return simplify_gen_subreg (innermode, target, outermode, 
0);
+   }
+ else if (targetm.vectorize.vec_perm_const != NULL)
+   {
+ if (targetm.vectorize.vec_perm_const (outermode, outermode,
+   target, op0, op0, 
indices))
+   return simplify_gen_subreg (innermode, target, outermode, 
0);
+   }
+   }
+   }
 }
 
   /* See if we can get a better vector mode before extracting.  */
diff --git a/gcc/testsuite/gcc.target/aarch64/ext_1.c 
b/gcc/testsuite/gcc.target/aarch64/ext_1.c
new file mode 100644
index 
..18a10a14f1161584267a8472e571b3bc2ddf887a
--- /dev/null
+++ b/gcc/testsuite/gcc.target/aarch64/ext_1.c
@@ -0,0 +1,54 @@
+/* { dg-do compile } */
+/* { dg-additional-options "-O" } */
+/* { dg-final { check-function-bodies "**" "" "" } } */
+
+#include 
+
+typedef unsigned int v4si __attribute__((vector_size (16)));
+typedef unsigned int v2si __attribute__((vector_size (8)));
+
+/*
+** extract: { xfail *-*-* }
+** ext v0.16b, v0.16b, v0.16b, #4
+** ret
+*/
+v2si extract (v4si x)
+{
+v2si res = {x[1], x[2]};
+return res;
+}
+
+/*
+** extract1: { xfail *-*-* }
+** ext v0.16b, v0.16b, v0.16b, #4
+** ret
+*/
+v2si extract1 (v4si x)
+{
+v2si res;
+memcpy (, ((int*))+1, sizeof(res));
+return res;
+}
+
+typedef struct cast {
+  int a;
+  v2si b __attribute__((packed));
+} cast_t;
+
+typedef union Data {
+   v4si x;
+   cast_t y;
+} data;  
+
+/*
+** extract2:
+** ext v0.16b, v0.16b, v0.16b, #4
+** ret
+*/
+v2si 

[PATCH 2/8]middle-end: Recognize scalar widening reductions

2022-10-31 Thread Tamar Christina via Gcc-patches
Hi All,

This adds a new optab and IFNs for REDUC_PLUS_WIDEN where the resulting
scalar reduction has twice the precision of the input elements.

At some point in a later patch I will also teach the vectorizer to recognize
this builtin once I figure out how the various bits of reductions work.

For now it's generated only by the match.pd pattern.
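
As a hedged illustration (mirroring foo2 from the addp tests earlier in the
series rather than anything added by this patch), an element-typed pairwise
sum that is widened on return is what the new rule folds:

typedef unsigned int v2usi __attribute__ ((vector_size (8)));

unsigned long long
widen_pair_sum (v2usi x)
{
  /* REDUC_PLUS on the two SI lanes, then converted to DImode, can now
     become IFN_REDUC_PLUS_WIDEN (uaddlp/uaddlv on AArch64).  */
  return x[0] + x[1];
}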

Bootstrapped Regtested on aarch64-none-linux-gnu, x86_64-pc-linux-gnu
and no issues.

Ok for master?

Thanks,
Tamar

gcc/ChangeLog:

* internal-fn.def (REDUC_PLUS_WIDEN): New.
* doc/md.texi: Document it.
* match.pd: Recognize widening plus.
* optabs.def (reduc_splus_widen_scal_optab,
reduc_uplus_widen_scal_optab): New.

--- inline copy of patch -- 
diff --git a/gcc/doc/md.texi b/gcc/doc/md.texi
index 
34825549ed4e315b07d36dc3d63bae0cc0a3932d..c08691ab4c9a4bfe55ae81e5e228a414d6242d78
 100644
--- a/gcc/doc/md.texi
+++ b/gcc/doc/md.texi
@@ -5284,6 +5284,20 @@ Compute the sum of the elements of a vector. The vector 
is operand 1, and
 operand 0 is the scalar result, with mode equal to the mode of the elements of
 the input vector.
 
+@cindex @code{reduc_uplus_widen_scal_@var{m}} instruction pattern
+@item @samp{reduc_uplus_widen_scal_@var{m}}
+Compute the sum of the elements of a vector and zero-extend @var{m} to a mode
+that has twice the precision of @var{m}.  The vector is operand 1, and
+operand 0 is the scalar result, with mode equal to twice the precision of the
+mode of the elements of the input vector.
+
+@cindex @code{reduc_splus_widen_scal_@var{m}} instruction pattern
+@item @samp{reduc_splus_widen_scal_@var{m}}
+Compute the sum of the elements of a vector and sign-extend @var{m} to a mode
+that has twice the precision of @var{m}.  The vector is operand 1, and
+operand 0 is the scalar result, with mode equal to twice the precision of the
+mode of the elements of the input vector.
+
 @cindex @code{reduc_and_scal_@var{m}} instruction pattern
 @item @samp{reduc_and_scal_@var{m}}
 @cindex @code{reduc_ior_scal_@var{m}} instruction pattern
diff --git a/gcc/internal-fn.def b/gcc/internal-fn.def
index 
5e672183f4def9d0cdc29cf12fe17e8cff928f9f..f64a8421b1087b6c0f3602dc556876b0fd15c7ad
 100644
--- a/gcc/internal-fn.def
+++ b/gcc/internal-fn.def
@@ -215,6 +215,9 @@ DEF_INTERNAL_OPTAB_FN (RSQRT, ECF_CONST, rsqrt, unary)
 
 DEF_INTERNAL_OPTAB_FN (REDUC_PLUS, ECF_CONST | ECF_NOTHROW,
   reduc_plus_scal, unary)
+DEF_INTERNAL_SIGNED_OPTAB_FN (REDUC_PLUS_WIDEN, ECF_CONST | ECF_NOTHROW,
+ first, reduc_splus_widen_scal,
+ reduc_uplus_widen_scal, unary)
 DEF_INTERNAL_SIGNED_OPTAB_FN (REDUC_MAX, ECF_CONST | ECF_NOTHROW, first,
  reduc_smax_scal, reduc_umax_scal, unary)
 DEF_INTERNAL_SIGNED_OPTAB_FN (REDUC_MIN, ECF_CONST | ECF_NOTHROW, first,
diff --git a/gcc/match.pd b/gcc/match.pd
index 
aecaa3520b36e770d11ea9a10eb18db23c0cd9f7..1d407414bee278c64c00d425d9f025c1c58d853d
 100644
--- a/gcc/match.pd
+++ b/gcc/match.pd
@@ -7237,6 +7237,14 @@ DEFINE_INT_AND_FLOAT_ROUND_FN (RINT)
(ifnf (BIT_FIELD_REF:ntype { src; } { size; } { pos; }))
(ifni (BIT_FIELD_REF:ntype { src; } { size; } { pos; }
 
+/* Widening reduction conversions. */
+(simplify
+ (convert (IFN_REDUC_PLUS @0))
+ (if (element_precision (TREE_TYPE (@0)) * 2 == element_precision (type)
+  && TYPE_UNSIGNED (type) == TYPE_UNSIGNED (TREE_TYPE (@0))
+  && ANY_INTEGRAL_TYPE_P (type) && ANY_INTEGRAL_TYPE_P (TREE_TYPE(@0)))
+  (IFN_REDUC_PLUS_WIDEN @0)))
+
 (simplify
  (BIT_FIELD_REF (BIT_FIELD_REF @0 @1 @2) @3 @4)
  (BIT_FIELD_REF @0 @3 { const_binop (PLUS_EXPR, bitsizetype, @2, @4); }))
diff --git a/gcc/optabs.def b/gcc/optabs.def
index 
a6db2342bed6baf13ecbd84112c8432c6972e6fe..9947aed67fb8a3b675cb0aab9aeb059f89644106
 100644
--- a/gcc/optabs.def
+++ b/gcc/optabs.def
@@ -346,6 +346,8 @@ OPTAB_D (reduc_fmin_scal_optab, "reduc_fmin_scal_$a")
 OPTAB_D (reduc_smax_scal_optab, "reduc_smax_scal_$a")
 OPTAB_D (reduc_smin_scal_optab, "reduc_smin_scal_$a")
 OPTAB_D (reduc_plus_scal_optab, "reduc_plus_scal_$a")
+OPTAB_D (reduc_splus_widen_scal_optab, "reduc_splus_widen_scal_$a")
+OPTAB_D (reduc_uplus_widen_scal_optab, "reduc_uplus_widen_scal_$a")
 OPTAB_D (reduc_umax_scal_optab, "reduc_umax_scal_$a")
 OPTAB_D (reduc_umin_scal_optab, "reduc_umin_scal_$a")
 OPTAB_D (reduc_and_scal_optab,  "reduc_and_scal_$a")




[PATCH]AArch64 Extend umov and sbfx patterns.

2022-10-31 Thread Tamar Christina via Gcc-patches
Hi All,

Our zero and sign extend and extract patterns are currently very limited and
only work for the original register size of the instructions, i.e. they are
limited by the GPI patterns.  However these instructions extract bits and
extend.  This means that any register size can be used as an input as long as
the extraction makes logical sense.

The majority of the attached testcases currently fail to optimize.
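
As a hedged sketch of the kind of input meant to be caught (assumed; the
actual coverage lives in the bitmove_1.c/bitmove_2.c tests below, which are
truncated here):

unsigned long long
zext_bits (unsigned int x)
{
  /* A ubfx-style extract whose result is widened to DImode.  */
  return x >> 11;
}

long long
sext_bits (int x)
{
  /* An sbfx-style extract whose result is widened to DImode.  */
  return x >> 11;
}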

Bootstrapped Regtested on aarch64-none-linux-gnu and no issues.

Ok for master?

Thanks,
Tamar

gcc/ChangeLog:

* config/aarch64/aarch64-simd.md (aarch64_get_lane): Drop reload
penalty.
* config/aarch64/aarch64.md
(*_ashl): Renamed to...
(*_ashl): ...this.
(*zero_extend_lshr): Renamed to...
(*zero_extend_): ...this.
(*extend_ashr): Rename to...
(*extend_): ...this.

gcc/testsuite/ChangeLog:

* gcc.target/aarch64/bitmove_1.c: New test.
* gcc.target/aarch64/bitmove_2.c: New test.

--- inline copy of patch -- 
diff --git a/gcc/config/aarch64/aarch64-simd.md 
b/gcc/config/aarch64/aarch64-simd.md
index 
8bcc9e76b1cad4a2591fb176175db72d7a190d57..23909c62638b49722568da4555b33c71fd21337e
 100644
--- a/gcc/config/aarch64/aarch64-simd.md
+++ b/gcc/config/aarch64/aarch64-simd.md
@@ -4259,7 +4259,7 @@ (define_insn 
"*aarch64_get_lane_zero_extend"
 ;; Extracting lane zero is split into a simple move when it is between SIMD
 ;; registers or a store.
 (define_insn_and_split "aarch64_get_lane"
-  [(set (match_operand: 0 "aarch64_simd_nonimmediate_operand" "=?r, w, 
Utv")
+  [(set (match_operand: 0 "aarch64_simd_nonimmediate_operand" "=r, w, 
Utv")
(vec_select:
  (match_operand:VALL_F16_FULL 1 "register_operand" "w, w, w")
  (parallel [(match_operand:SI 2 "immediate_operand" "i, i, i")])))]
diff --git a/gcc/config/aarch64/aarch64.md b/gcc/config/aarch64/aarch64.md
index 
85b400489cb382a01b0c469eff2b600a93805e31..3116feda4fe54e2a21dc3f990b6976d216874260
 100644
--- a/gcc/config/aarch64/aarch64.md
+++ b/gcc/config/aarch64/aarch64.md
@@ -5629,13 +5629,13 @@ (define_insn "*si3_insn2_uxtw"
 )
 
 (define_insn "*3_insn"
-  [(set (match_operand:SHORT 0 "register_operand" "=r")
-   (ASHIFT:SHORT (match_operand:SHORT 1 "register_operand" "r")
+  [(set (match_operand:ALLI 0 "register_operand" "=r")
+   (ASHIFT:ALLI (match_operand:ALLI 1 "register_operand" "r")
  (match_operand 2 "const_int_operand" "n")))]
   "UINTVAL (operands[2]) < GET_MODE_BITSIZE (mode)"
 {
   operands[3] = GEN_INT ( - UINTVAL (operands[2]));
-  return "\t%w0, %w1, %2, %3";
+  return "\t%0, %1, %2, %3";
 }
   [(set_attr "type" "bfx")]
 )
@@ -5710,40 +5710,40 @@ (define_insn "*extrsi5_insn_di"
   [(set_attr "type" "rotate_imm")]
 )
 
-(define_insn "*_ashl"
+(define_insn "*_ashl"
   [(set (match_operand:GPI 0 "register_operand" "=r")
(ANY_EXTEND:GPI
-(ashift:SHORT (match_operand:SHORT 1 "register_operand" "r")
+(ashift:ALLX (match_operand:ALLX 1 "register_operand" "r")
   (match_operand 2 "const_int_operand" "n"]
-  "UINTVAL (operands[2]) < GET_MODE_BITSIZE (mode)"
+  "UINTVAL (operands[2]) < GET_MODE_BITSIZE (mode)"
 {
-  operands[3] = GEN_INT ( - UINTVAL (operands[2]));
+  operands[3] = GEN_INT ( - UINTVAL (operands[2]));
   return "bfiz\t%0, %1, %2, %3";
 }
   [(set_attr "type" "bfx")]
 )
 
-(define_insn "*zero_extend_lshr"
+(define_insn "*zero_extend_"
   [(set (match_operand:GPI 0 "register_operand" "=r")
(zero_extend:GPI
-(lshiftrt:SHORT (match_operand:SHORT 1 "register_operand" "r")
-(match_operand 2 "const_int_operand" "n"]
-  "UINTVAL (operands[2]) < GET_MODE_BITSIZE (mode)"
+(LSHIFTRT_ONLY:ALLX (match_operand:ALLX 1 "register_operand" "r")
+(match_operand 2 "const_int_operand" "n"]
+  "UINTVAL (operands[2]) < GET_MODE_BITSIZE (mode)"
 {
-  operands[3] = GEN_INT ( - UINTVAL (operands[2]));
+  operands[3] = GEN_INT ( - UINTVAL (operands[2]));
   return "ubfx\t%0, %1, %2, %3";
 }
   [(set_attr "type" "bfx")]
 )
 
-(define_insn "*extend_ashr"
+(define_insn "*extend_"
   [(set (match_operand:GPI 0 "register_operand" "=r")
(sign_extend:GPI
-(ashiftrt:SHORT (match_operand:SHORT 1 "register_operand" "r")
-(match_operand 2 "const_int_operand" "n"]
-  "UINTVAL (operands[2]) < GET_MODE_BITSIZE (mode)"
+(ASHIFTRT_ONLY:ALLX (match_operand:ALLX 1 "register_operand" "r")
+(match_operand 2 "const_int_operand" "n"]
+  "UINTVAL (operands[2]) < GET_MODE_BITSIZE (mode)"
 {
-  operands[3] = GEN_INT ( - UINTVAL (operands[2]));
+  operands[3] = GEN_INT ( - UINTVAL (operands[2]));
   return "sbfx\\t%0, %1, %2, %3";
 }
   [(set_attr "type" "bfx")]
diff --git a/gcc/testsuite/gcc.target/aarch64/bitmove_1.c 
b/gcc/testsuite/gcc.target/aarch64/bitmove_1.c
new file mode 100644
index 

[PATCH 2/2]AArch64 Support new tbranch optab.

2022-10-31 Thread Tamar Christina via Gcc-patches
Hi All,

This implements the new tbranch optab for AArch64.

Instead of emitting the instruction directly, I've chosen to expand the pattern
using a zero extract and generate the existing comparison pattern, for two
reasons:

  1. It allows for CSE of the actual comparison.
  2. It looks like the code in expand marks the label as unused and removes it
 if it doesn't see a separate reference to it.

Because of this expansion, though, I disable the pattern at -O0 since we have no
combine in that case, so we'd end up with worse code.  I did try emitting the
pattern directly, but as mentioned in point 2 expand would then kill the label.
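
As a hedged example of the kind of source this targets (the real coverage is
in the new tbz_1.c test; the expected assembly here is an assumption):

extern void g (void);

void
f (int x)
{
  /* Expected to become a single tbnz w0, #0, ... at -O1 and above.  */
  if (x & 1)
    g ();
}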

While doing this I noticed that the version that checks the sign bit doesn't
work.  The reason for this looks like an incorrect pattern.  The [us]bfx
instructions are defined for index + size == register size.  They architecturally
alias to different instructions and binutils handles this correctly.

In GCC however we tried to prematurely optimize this and added a separate split
pattern.  But this pattern is also missing alternatives, handling only DImode.

This just removes this and relaxes the constraints on the normal bfx pattern.

Bootstrapped Regtested on aarch64-none-linux-gnu and no issues.

Ok for master?

Thanks,
Tamar

gcc/ChangeLog:

* config/aarch64/aarch64.md (*tb1): Rename to...
(*tb1): ... this.
(tbranch4): New.
(*): Rename to...
(*): ... this.

gcc/testsuite/ChangeLog:

* gcc.target/aarch64/tbz_1.c: New test.

--- inline copy of patch -- 
diff --git a/gcc/config/aarch64/aarch64.md b/gcc/config/aarch64/aarch64.md
index 
2bc2684b82c35a44e0a2cea6e3aaf32d939f8cdf..6a4494a9a370139313cc8e57447717aafa14da2d
 100644
--- a/gcc/config/aarch64/aarch64.md
+++ b/gcc/config/aarch64/aarch64.md
@@ -943,12 +943,28 @@ (define_insn "*cb1"
  (const_int 1)))]
 )
 
-(define_insn "*tb1"
+(define_expand "tbranch4"
   [(set (pc) (if_then_else
- (EQL (zero_extract:DI (match_operand:GPI 0 "register_operand" "r")
-   (const_int 1)
-   (match_operand 1
- "aarch64_simd_shift_imm_" "n"))
+   (match_operator 0 "aarch64_comparison_operator"
+[(match_operand:ALLI 1 "register_operand")
+ (match_operand:ALLI 2 "aarch64_simd_shift_imm_")])
+   (label_ref (match_operand 3 "" ""))
+   (pc)))]
+  "optimize > 0"
+{
+  rtx bitvalue = gen_reg_rtx (DImode);
+  emit_insn (gen_extzv (bitvalue, operands[1], const1_rtx, operands[2]));
+  operands[2] = const0_rtx;
+  operands[1] = aarch64_gen_compare_reg (GET_CODE (operands[0]), bitvalue,
+operands[2]);
+})
+
+(define_insn "*tb1"
+  [(set (pc) (if_then_else
+ (EQL (zero_extract:GPI (match_operand:ALLI 0 "register_operand" 
"r")
+(const_int 1)
+(match_operand 1
+  "aarch64_simd_shift_imm_" 
"n"))
   (const_int 0))
 (label_ref (match_operand 2 "" ""))
 (pc)))
@@ -959,15 +975,15 @@ (define_insn "*tb1"
   {
if (get_attr_far_branch (insn) == 1)
  return aarch64_gen_far_branch (operands, 2, "Ltb",
-"\\t%0, %1, ");
+"\\t%0, %1, ");
else
  {
operands[1] = GEN_INT (HOST_WIDE_INT_1U << UINTVAL (operands[1]));
-   return "tst\t%0, %1\;\t%l2";
+   return "tst\t%0, %1\;\t%l2";
  }
   }
 else
-  return "\t%0, %1, %l2";
+  return "\t%0, %1, %l2";
   }
   [(set_attr "type" "branch")
(set (attr "length")
@@ -5752,39 +5768,19 @@ (define_expand ""
 )
 
 
-(define_insn "*"
+(define_insn "*"
   [(set (match_operand:GPI 0 "register_operand" "=r")
-   (ANY_EXTRACT:GPI (match_operand:GPI 1 "register_operand" "r")
+   (ANY_EXTRACT:GPI (match_operand:ALLI 1 "register_operand" "r")
 (match_operand 2
-  "aarch64_simd_shift_imm_offset_" "n")
+  "aarch64_simd_shift_imm_offset_" "n")
 (match_operand 3
-  "aarch64_simd_shift_imm_" "n")))]
+  "aarch64_simd_shift_imm_" "n")))]
   "IN_RANGE (INTVAL (operands[2]) + INTVAL (operands[3]),
-1, GET_MODE_BITSIZE (mode) - 1)"
-  "bfx\\t%0, %1, %3, %2"
+1, GET_MODE_BITSIZE (mode))"
+  "bfx\\t%0, %1, %3, %2"
   [(set_attr "type" "bfx")]
 )
 
-;; When the bit position and width add up to 32 we can use a W-reg LSR
-;; instruction taking advantage of the implicit zero-extension of the X-reg.
-(define_split
-  [(set (match_operand:DI 0 "register_operand")
-   (zero_extract:DI (match_operand:DI 1 "register_operand")
-(match_operand 2
-   

[PATCH 1/2]middle-end: Add new tbranch optab to add support for bit-test-and-branch operations

2022-10-31 Thread Tamar Christina via Gcc-patches
Hi All,

This adds a new test-and-branch optab that can be used to do a conditional test
of a bit and branch.  This is similar to the cbranch optab, but it can
test an arbitrary bit inside the register.

This patch recognizes boolean comparisons and single bit mask tests.
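
As a hedged illustration (assumed, not taken from the patch), these are the
two source shapes the expander can now route through tbranch<mode>4 when the
target provides it:

extern void g (void);

void
bit_test (unsigned x)
{
  /* Single-bit mask test.  */
  if (x & (1u << 5))
    g ();
}

void
bool_test (_Bool b)
{
  /* Boolean test, i.e. a test of bit 0.  */
  if (b)
    g ();
}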

Bootstrapped Regtested on aarch64-none-linux-gnu and no issues.

Ok for master?

Thanks,
Tamar

gcc/ChangeLog:

* dojump.cc (do_jump): Pass along value.
(do_jump_by_parts_greater_rtx): Likewise.
(do_jump_by_parts_zero_rtx): Likewise.
(do_jump_by_parts_equality_rtx): Likewise.
(do_compare_rtx_and_jump): Likewise.
(do_compare_and_jump): Likewise.
* dojump.h (do_compare_rtx_and_jump): New.
* optabs.cc (emit_cmp_and_jump_insn_1): Refactor to take the optab to check.
(validate_test_and_branch): New.
(emit_cmp_and_jump_insns): Optionally take a value, and when a value is
supplied then check if it's suitable for tbranch.
* optabs.def (tbranch$a4): New.
* doc/md.texi (tbranch@var{mode}4): Document it.
* optabs.h (emit_cmp_and_jump_insns):
* tree.h (tree_zero_one_valued_p): New.

--- inline copy of patch -- 
diff --git a/gcc/doc/md.texi b/gcc/doc/md.texi
index 
c08691ab4c9a4bfe55ae81e5e228a414d6242d78..f8b32ec12f46d3fb3815f121a16b5a8a1819b66a
 100644
--- a/gcc/doc/md.texi
+++ b/gcc/doc/md.texi
@@ -6972,6 +6972,13 @@ case, you can and should make operand 1's predicate 
reject some operators
 in the @samp{cstore@var{mode}4} pattern, or remove the pattern altogether
 from the machine description.
 
+@cindex @code{tbranch@var{mode}4} instruction pattern
+@item @samp{tbranch@var{mode}4}
+Conditional branch instruction combined with a bit test-and-compare
+instruction. Operand 0 is a comparison operator.  Operand 1 is the
+operand of the comparison. Operand 2 is the bit position of Operand 1 to test.
+Operand 3 is the @code{code_label} to jump to.
+
 @cindex @code{cbranch@var{mode}4} instruction pattern
 @item @samp{cbranch@var{mode}4}
 Conditional branch instruction combined with a compare instruction.
diff --git a/gcc/dojump.h b/gcc/dojump.h
index 
e379cceb34bb1765cb575636e4c05b61501fc2cf..d1d79c490c420a805fe48d58740a79c1f25fb839
 100644
--- a/gcc/dojump.h
+++ b/gcc/dojump.h
@@ -71,6 +71,10 @@ extern void jumpifnot (tree exp, rtx_code_label *label,
 extern void jumpifnot_1 (enum tree_code, tree, tree, rtx_code_label *,
 profile_probability);
 
+extern void do_compare_rtx_and_jump (rtx, rtx, enum rtx_code, int, tree,
+machine_mode, rtx, rtx_code_label *,
+rtx_code_label *, profile_probability);
+
 extern void do_compare_rtx_and_jump (rtx, rtx, enum rtx_code, int,
 machine_mode, rtx, rtx_code_label *,
 rtx_code_label *, profile_probability);
diff --git a/gcc/dojump.cc b/gcc/dojump.cc
index 
2af0cd1aca3b6af13d5d8799094ee93f18022296..190324f36f1a31990f8c49bc8c0f45c23da5c31e
 100644
--- a/gcc/dojump.cc
+++ b/gcc/dojump.cc
@@ -619,7 +619,7 @@ do_jump (tree exp, rtx_code_label *if_false_label,
}
   do_compare_rtx_and_jump (temp, CONST0_RTX (GET_MODE (temp)),
   NE, TYPE_UNSIGNED (TREE_TYPE (exp)),
-  GET_MODE (temp), NULL_RTX,
+  exp, GET_MODE (temp), NULL_RTX,
   if_false_label, if_true_label, prob);
 }
 
@@ -687,7 +687,7 @@ do_jump_by_parts_greater_rtx (scalar_int_mode mode, int 
unsignedp, rtx op0,
 
   /* All but high-order word must be compared as unsigned.  */
   do_compare_rtx_and_jump (op0_word, op1_word, code, (unsignedp || i > 0),
-  word_mode, NULL_RTX, NULL, if_true_label,
+  NULL, word_mode, NULL_RTX, NULL, if_true_label,
   prob);
 
   /* Emit only one comparison for 0.  Do not emit the last cond jump.  */
@@ -695,8 +695,8 @@ do_jump_by_parts_greater_rtx (scalar_int_mode mode, int 
unsignedp, rtx op0,
break;
 
   /* Consider lower words only if these are equal.  */
-  do_compare_rtx_and_jump (op0_word, op1_word, NE, unsignedp, word_mode,
-  NULL_RTX, NULL, if_false_label,
+  do_compare_rtx_and_jump (op0_word, op1_word, NE, unsignedp, NULL,
+  word_mode, NULL_RTX, NULL, if_false_label,
   prob.invert ());
 }
 
@@ -755,7 +755,7 @@ do_jump_by_parts_zero_rtx (scalar_int_mode mode, rtx op0,
 
   if (part != 0)
 {
-  do_compare_rtx_and_jump (part, const0_rtx, EQ, 1, word_mode,
+  do_compare_rtx_and_jump (part, const0_rtx, EQ, 1, NULL, word_mode,
   NULL_RTX, if_false_label, if_true_label, prob);
   return;
 }
@@ -766,7 +766,7 @@ do_jump_by_parts_zero_rtx (scalar_int_mode mode, rtx op0,
 
   

RE: [PATCH 1/2]middle-end Fold BIT_FIELD_REF and Shifts into BIT_FIELD_REFs alone

2022-10-31 Thread Tamar Christina via Gcc-patches
Hi All,

Here's a respin addressing review comments.

Bootstrapped Regtested on aarch64-none-linux-gnu, x86_64-pc-linux-gnu
and no issues.

Ok for master?

Thanks,
Tamar

gcc/ChangeLog:

* match.pd: Add bitfield and shift folding.

gcc/testsuite/ChangeLog:

* gcc.dg/bitshift_1.c: New.
* gcc.dg/bitshift_2.c: New.

--- inline copy of patch ---

diff --git a/gcc/match.pd b/gcc/match.pd
index 
70e90cdbfa902830e6b58be84e114e86ff7b4dff..a4ad465b2b074b21835be74732dce295f8db03bc
 100644
--- a/gcc/match.pd
+++ b/gcc/match.pd
@@ -7245,6 +7245,45 @@ DEFINE_INT_AND_FLOAT_ROUND_FN (RINT)
   && ANY_INTEGRAL_TYPE_P (type) && ANY_INTEGRAL_TYPE_P (TREE_TYPE(@0)))
   (IFN_REDUC_PLUS_WIDEN @0)))
 
+/* Canonicalize BIT_FIELD_REFS and right shift to BIT_FIELD_REFS.  */
+(simplify
+ (rshift (BIT_FIELD_REF @0 @1 @2) INTEGER_CST@3)
+ (if (INTEGRAL_TYPE_P (type)
+  && tree_fits_uhwi_p (@1)
+  && tree_fits_uhwi_p (@3))
+  (with { /* Can't use wide-int here as the precision differs between
+@1 and @3.  */
+ unsigned HOST_WIDE_INT size = tree_to_uhwi (@1);
+ unsigned HOST_WIDE_INT shiftc = tree_to_uhwi (@3);
+ unsigned HOST_WIDE_INT newsize = size - shiftc;
+ tree nsize = wide_int_to_tree (bitsizetype, newsize);
+ tree ntype
+   = build_nonstandard_integer_type (newsize, TYPE_UNSIGNED (type)); }
+   (switch
+(if (INTEGRAL_TYPE_P (ntype) && !BYTES_BIG_ENDIAN)
+ (convert:type (BIT_FIELD_REF:ntype @0 { nsize; } (plus @2 @3
+(if (INTEGRAL_TYPE_P (ntype) && BYTES_BIG_ENDIAN)
+ (convert:type (BIT_FIELD_REF:ntype @0 { nsize; } (minus @2 @3
+
+/* Canonicalize BIT_FIELD_REFS and converts to BIT_FIELD_REFS.  */
+(simplify
+ (convert (BIT_FIELD_REF@3 @0 @1 @2))
+ (if (INTEGRAL_TYPE_P (type)
+  && INTEGRAL_TYPE_P (TREE_TYPE (@3)))
+  (with { unsigned int size_inner = element_precision (TREE_TYPE (@3));
+ unsigned int size_outer  = element_precision (type); }
+   (if (size_inner > size_outer)
+/* Truncating convert, we can shrink the bit field similar to the
+shift case.  */
+(with {
+   tree nsize = wide_int_to_tree (bitsizetype, size_outer);
+   auto sign = TYPE_UNSIGNED (type);
+   tree ntype
+ = build_nonstandard_integer_type (size_outer, sign);
+   gcc_assert (useless_type_conversion_p (type, ntype)); }
+ (if (INTEGRAL_TYPE_P (ntype))
+  (BIT_FIELD_REF:ntype @0 { nsize; } @2)))
+
 (simplify
  (BIT_FIELD_REF (BIT_FIELD_REF @0 @1 @2) @3 @4)
  (BIT_FIELD_REF @0 @3 { const_binop (PLUS_EXPR, bitsizetype, @2, @4); }))
diff --git a/gcc/testsuite/gcc.dg/bitshift_1.c 
b/gcc/testsuite/gcc.dg/bitshift_1.c
new file mode 100644
index 
..5995d0746d2301eb48304629cb4b779b079f1270
--- /dev/null
+++ b/gcc/testsuite/gcc.dg/bitshift_1.c
@@ -0,0 +1,50 @@
+/* { dg-do compile { target le } } */
+/* { dg-additional-options "-O2 -save-temps -fdump-tree-optimized" } */
+
+typedef int v4si __attribute__ ((vector_size (16)));
+typedef unsigned int v4usi __attribute__ ((vector_size (16)));
+typedef unsigned short v8uhi __attribute__ ((vector_size (16)));
+
+unsigned int foor (v4usi x)
+{
+return x[1] >> 16;
+}
+/* { dg-final { scan-tree-dump {BIT_FIELD_REF ;} "optimized" 
} } */
+
+unsigned int fool (v4usi x)
+{
+return x[1] << 16;
+}
+/* { dg-final { scan-tree-dump {BIT_FIELD_REF ;} "optimized" 
} } */
+
+unsigned short foor2 (v4usi x)
+{
+return x[3] >> 16;
+}
+/* { dg-final { scan-tree-dump {BIT_FIELD_REF ;} "optimized" 
} } */
+
+unsigned int fool2 (v4usi x)
+{
+return x[0] << 16;
+}
+/* { dg-final { scan-tree-dump {BIT_FIELD_REF ;} "optimized" } 
} */
+
+unsigned char foor3 (v8uhi x)
+{
+return x[3] >> 9;
+}
+/* { dg-final { scan-tree-dump {BIT_FIELD_REF ;} "optimized" } 
} */
+
+unsigned short fool3 (v8uhi x)
+{
+return x[0] << 9;
+}
+/* { dg-final { scan-tree-dump {BIT_FIELD_REF ;} "optimized" } 
} */
+
+unsigned short foo2 (v4si x)
+{
+  int y = x[0] + x[1];
+  return y >> 16;
+}
+/* { dg-final { scan-tree-dump {BIT_FIELD_REF ;} "optimized" } 
} */
+
diff --git a/gcc/testsuite/gcc.dg/bitshift_2.c 
b/gcc/testsuite/gcc.dg/bitshift_2.c
new file mode 100644
index 
..406b4def9d4aebbc83bd5bef92dab825b85f2aa4
--- /dev/null
+++ b/gcc/testsuite/gcc.dg/bitshift_2.c
@@ -0,0 +1,49 @@
+/* { dg-do compile { target be } } */
+/* { dg-additional-options "-O2 -save-temps -fdump-tree-optimized" } */
+
+typedef int v4si __attribute__ ((vector_size (16)));
+typedef unsigned int v4usi __attribute__ ((vector_size (16)));
+typedef unsigned short v8uhi __attribute__ ((vector_size (16)));
+
+unsigned int foor (v4usi x)
+{
+return x[1] >> 16;
+}
+/* { dg-final { scan-tree-dump {BIT_FIELD_REF ;} "optimized" 
} } */
+
+unsigned int fool (v4usi x)
+{
+return x[1] << 16;
+}
+/* { dg-final { scan-tree-dump {BIT_FIELD_REF ;} "optimized" 
} } */
+
+unsigned short foor2 (v4usi x)
