Re: [PATCH] longlong.h: Do not use asm input cast for clang

2023-01-10 Thread Adhemerval Zanella Netto via Gcc-patches



On 12/12/22 20:52, Segher Boessenkool wrote:
> On Mon, Dec 12, 2022 at 02:10:16PM -0300, Adhemerval Zanella Netto wrote:
>> On 30/11/22 20:24, Segher Boessenkool wrote:
>>> I understand that the casts should be no-ops on the asm side (maybe they
>>> change the sign) and they are present as type-checking.  Can we implement
>>> this type-checking in a different (portable) way?  I think the macro you use
>>> should be named like __asm_output_check_type (..) or so to indicate the
>>> intended purpose.
> 
> I didn't write that.  Please quote correctly.  Thanks!
> 
>> I do not think trying to leverage it on the clang side would yield much;
>> it seems that it really does not want to support this extension.  I am
>> not sure we can really make it portable; the best option I can think of
>> would be to add a mix of __builtin_classify_type and typeof prior to the
>> asm call (we do something similar in the powerpc64 syscall code on
>> glibc), although it would still require some gcc-specific builtins.
>>
>> I am open to ideas, since getting this header to be clang-compatible on
>> glibc requires getting it into gcc first.
> 
> How do you intend to modify all the existing copies of the header that
> haven't been updated for over a decade already?
> 
> If you think changing all user code that uses longlong.h is a good idea,
> please change it to not use inline asm, use builtins in some cases but
> mostly just rewrite things in plain C.  But GCC cannot rewrite user code
> (not preemptively anyway ;-) ) -- and longlong.h as encountered in the
> wild (not the one in our libgcc source code) is user code.
> 
> If you think changing the copy in libgcc is a good idea, please change
> the original in glibc first?

That was my original intention [1], but Joseph stated that GCC is the upstream
source of this file.  Joseph, would you be ok with a similar patch for glibc,
since gcc is reluctant to accept it?

[1] https://sourceware.org/pipermail/libc-alpha/2022-October/143050.html


Re: [PATCH] longlong.h: Do not use asm input cast for clang

2022-12-12 Thread Adhemerval Zanella Netto via Gcc-patches



On 30/11/22 20:24, Segher Boessenkool wrote:
> Hi!
> 
> On Wed, Nov 30, 2022 at 03:16:25PM -0300, Adhemerval Zanella via Gcc-patches 
> wrote:
>> clang by default rejects the input casts with:
>>
>>   error: invalid use of a cast in a inline asm context requiring an
>>   lvalue: remove the cast or build with -fheinous-gnu-extensions
>>
>> And even with -fheinous-gnu-extensions clang still throws a warning
>> and also states that this option might be removed in the future.
>> For gcc the casts are still somewhat useful [1], so just remove them
>> when clang is used.
> 
> This is one of the things in inline asm that is tightly tied to GCC
> internals.  You should emulate GCC's behaviour faithfully if you want
> to claim you implement the inline asm GNU C extension.

Agreed, that's why I just made it a no-op for clang, which indicates that
it does not see much use for this extension.
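For illustration, a rough sketch of that shape (__asm_cast is a
hypothetical name, not what the patch uses):

  /* Keep the type-checking cast on asm operands for gcc; clang rejects
     casts in asm operand contexts requiring an lvalue, so make the
     macro a no-op there.  */
  #if defined (__clang__)
  # define __asm_cast(type, expr) (expr)
  #else
  # define __asm_cast(type, expr) ((type) (expr))
  #endif

  /* An output operand would then be written as
     "=r" (__asm_cast (USItype, sh)) instead of "=r" ((USItype) (sh)).  */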

> I understand that the casts should be no-ops on the asm side (maybe they
> change the sign) and they are present as type-checking.  Can we implement
> this type-checking in a different (portable) way?  I think the macro you use
> should be named like __asm_output_check_type (..) or so to indicate the
> intended purpose.

I do not think trying to leverage it on the clang side would yield much;
it seems that it really does not want to support this extension.  I am
not sure we can really make it portable; the best option I can think of
would be to add a mix of __builtin_classify_type and typeof prior to the
asm call (we do something similar in the powerpc64 syscall code on
glibc), although it would still require some gcc-specific builtins.

I am open to ideas, since getting this header to be clang-compatible on
glibc requires getting it into gcc first.
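As a rough sketch of what such a check could look like (hypothetical
macro, using __builtin_types_compatible_p for an exact-type check rather
than __builtin_classify_type -- still gcc-specific builtins, as noted
above):

  /* Check the operand type before the asm statement instead of casting
     inside it, so the asm operand itself stays a plain lvalue.  */
  #define __asm_output_check_type(expr, type)                           \
    _Static_assert (__builtin_types_compatible_p (__typeof__ (expr),    \
                                                  type),                \
                    "unexpected asm operand type")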

> 
>> --- a/include/ChangeLog
>> +++ b/include/ChangeLog
> 
> That should not be part of the patch?  Changelog entries should be
> verbatim in the message you send.
> 
> The size of this patch already makes clear this is a bad idea, imo.
> This code is already hard enough to read.

Indeed, I forgot that CL entries were not part of the commit.


[PATCH] longlong.h: Do not use asm input cast for clang

2022-11-30 Thread Adhemerval Zanella via Gcc-patches
clang by default rejects the input casts with:

  error: invalid use of a cast in a inline asm context requiring an
  lvalue: remove the cast or build with -fheinous-gnu-extensions

And even with -fheinous-gnu-extensions clang still throws a warning
and also states that this option might be removed in the future.
For gcc the casts are still somewhat useful [1], so just remove them
when clang is used.

[1] https://gcc.gnu.org/pipermail/gcc-patches/2021-October/581722.html
---
 include/ChangeLog  |  60 ++
 include/longlong.h | 524 +++--
 2 files changed, 325 insertions(+), 259 deletions(-)

diff --git a/include/ChangeLog b/include/ChangeLog
index dda005335c0..747fc923ef5 100644
--- a/include/ChangeLog
+++ b/include/ChangeLog
@@ -1,3 +1,63 @@
+2022-11-30  Adhemerval Zanella  
+
+   * include/longlong.h: Modified.
+   [(__GNUC__) && ! NO_ASM][( (__i386__) ||  (__i486__)) && W_TYPE_SIZE == 
32](add_ssaaaa): Modified.
+   [(__GNUC__) && ! NO_ASM][( (__i386__) ||  (__i486__)) && W_TYPE_SIZE == 
32](sub_ddmmss): Modified.
+   [(__GNUC__) && ! NO_ASM][( (__i386__) ||  (__i486__)) && W_TYPE_SIZE == 
32](umul_ppmm): Modified.
+   [(__GNUC__) && ! NO_ASM][( (__i386__) ||  (__i486__)) && W_TYPE_SIZE == 
32](udiv_qrnnd): Modified.
+   [(__GNUC__) && ! NO_ASM][(( (__sparc__) &&  (__arch64__)) ||  
(__sparcv9))  && W_TYPE_SIZE == 64](add_ssaaaa): Modified.
+   [(__GNUC__) && ! NO_ASM][(( (__sparc__) &&  (__arch64__)) ||  
(__sparcv9))  && W_TYPE_SIZE == 64](sub_ddmmss): Modified.
+   [(__GNUC__) && ! NO_ASM][(( (__sparc__) &&  (__arch64__)) ||  
(__sparcv9))  && W_TYPE_SIZE == 64](umul_ppmm): Modified.
+   [(__GNUC__) && ! NO_ASM][(__M32R__) && W_TYPE_SIZE == 32](add_ssaaaa): 
Modified.
+   [(__GNUC__) && ! NO_ASM][(__M32R__) && W_TYPE_SIZE == 32](sub_ddmmss): 
Modified.
+   [(__GNUC__) && ! NO_ASM][(__arc__) && W_TYPE_SIZE == 32](add_ssaaaa): 
Modified.
+   [(__GNUC__) && ! NO_ASM][(__arc__) && W_TYPE_SIZE == 32](sub_ddmmss): 
Modified.
+   [(__GNUC__) && ! NO_ASM][(__arm__) && ( (__thumb2__) || ! __thumb__)  
&& W_TYPE_SIZE == 32][(__ARM_ARCH_2__) || (__ARM_ARCH_2A__)  || 
(__ARM_ARCH_3__)](umul_ppmm): Modified.
+   [(__GNUC__) && ! NO_ASM][(__arm__) && ( (__thumb2__) || ! __thumb__)  
&& W_TYPE_SIZE == 32](add_ssaaaa): Modified.
+   [(__GNUC__) && ! NO_ASM][(__arm__) && ( (__thumb2__) || ! __thumb__)  
&& W_TYPE_SIZE == 32](sub_ddmmss): Modified.
+   [(__GNUC__) && ! NO_ASM][(__hppa) && W_TYPE_SIZE == 32](add_ssaaaa): 
Modified.
+   [(__GNUC__) && ! NO_ASM][(__hppa) && W_TYPE_SIZE == 32](sub_ddmmss): 
Modified.
+   [(__GNUC__) && ! NO_ASM][(__i960__) && W_TYPE_SIZE == 32](umul_ppmm): 
Modified.
+   [(__GNUC__) && ! NO_ASM][(__i960__) && W_TYPE_SIZE == 32](__umulsidi3): 
Modified.
+   [(__GNUC__) && ! NO_ASM][(__ibm032__)  && W_TYPE_SIZE == 
32](add_ssaaaa): Modified.
+   [(__GNUC__) && ! NO_ASM][(__ibm032__)  && W_TYPE_SIZE == 
32](sub_ddmmss): Modified.
+   [(__GNUC__) && ! NO_ASM][(__ibm032__)  && W_TYPE_SIZE == 
32](umul_ppmm): Modified.
+   [(__GNUC__) && ! NO_ASM][(__ibm032__)  && W_TYPE_SIZE == 
32](count_leading_zeros): Modified.
+   [(__GNUC__) && ! NO_ASM][(__m88000__) && W_TYPE_SIZE == 
32][(__mc88110__)](umul_ppmm): Modified.
+   [(__GNUC__) && ! NO_ASM][(__m88000__) && W_TYPE_SIZE == 
32][(__mc88110__)](udiv_qrnnd): Modified.
+   [(__GNUC__) && ! NO_ASM][(__m88000__) && W_TYPE_SIZE == 
32](add_ssaaaa): Modified.
+   [(__GNUC__) && ! NO_ASM][(__m88000__) && W_TYPE_SIZE == 
32](sub_ddmmss): Modified.
+   [(__GNUC__) && ! NO_ASM][(__m88000__) && W_TYPE_SIZE == 
32](count_leading_zeros): Modified.
+   [(__GNUC__) && ! NO_ASM][(__mc68000__) && W_TYPE_SIZE == 
32][!((__mcoldfire__))](umul_ppmm): Modified.
+   [(__GNUC__) && ! NO_ASM][(__mc68000__) && W_TYPE_SIZE == 32][( 
(__mc68020__) && ! __mc68060__)](umul_ppmm): Modified.
+   [(__GNUC__) && ! NO_ASM][(__mc68000__) && W_TYPE_SIZE == 32][( 
(__mc68020__) && ! __mc68060__)](udiv_qrnnd): Modified.
+   [(__GNUC__) && ! NO_ASM][(__mc68000__) && W_TYPE_SIZE == 32][( 
(__mc68020__) && ! __mc68060__)](sdiv_qrnnd): Modified.
+   [(__GNUC__) && ! NO_ASM][(__mc68000__) && W_TYPE_SIZE == 
32][(__mcoldfire__)](umul_ppmm): Modified.
+   [(__GNUC__) && ! NO_ASM][(__mc68000__) && W_TYPE_SIZE == 
32](add_ss

Re: [PATCH 3/3] elf: Add _dl_find_eh_frame function

2021-11-23 Thread Adhemerval Zanella via Gcc-patches



On 17/11/2021 10:40, Florian Weimer wrote:
> * Adhemerval Zanella via Libc-alpha:
> 
>> However the code is somewhat complex and I would like to have some feedback
>> on whether gcc will be willing to accept this change (I assume it would
>> require this code to be merged into glibc beforehand).
> 
> There's a long review queue on the GCC side due to the stage1 close.
> It may still be considered for GCC 12.  Jakub has also requested that
> we hold off committing the glibc side until the GCC side is reviewed.
> 
> I'll flesh out the commit message and NEWS entry once we have agreed
> upon the interface.
> 
>>> new file mode 100644
>>> index 00..c7313c122d
>>> --- /dev/null
>>> +++ b/elf/dl-find_eh_frame.c
> 
>>> +/* Data for the main executable.  There is usually a large gap between
>>> +   the main executable and initially loaded shared objects.  Record
>>> +   the main executable separately, to increase the chance that the
>>> +   range for the non-closeable mappings below covers only the shared
>>> +   objects (and not also the gap between main executable and shared
>>> +   objects).  */
>>> +static uintptr_t _dl_eh_main_map_start attribute_relro;
>>> +static struct dl_eh_frame_info _dl_eh_main_info attribute_relro;
>>> +
>>> +/* Data for initally loaded shared objects that cannot be unlaoded.
>>
>> s/initally/initially and s/unlaoded/unloaded.
> 
> Fixed.
> 
>>
>>> +   The mapping base addresses are stored in address order in the
>>> +   _dl_eh_nodelete_mappings_bases array (containing
>>> +   _dl_eh_nodelete_mappings_size elements).  The EH data for a base
>>> +   address is stored in the parallel _dl_eh_nodelete_mappings_infos.
>>> +   These arrays are not modified after initialization.  */
>>> +static uintptr_t _dl_eh_nodelete_mappings_end attribute_relro;
>>> +static size_t _dl_eh_nodelete_mappings_size attribute_relro;
>>> +static uintptr_t *_dl_eh_nodelete_mappings_bases attribute_relro;
>>> +static struct dl_eh_frame_info *_dl_eh_nodelete_mappings_infos
>>> +  attribute_relro;
>>> +
>>> +/* Mappings created by dlopen can go away with dlclose, so a data
>>> +   dynamic data structure with some synchronization is needed.
>>
>> This sounds strange ("a data dynamic data").
> 
> I dropped the first data.
> 
>>
>>> +   Individual segments are similar to the _dl_eh_nodelete_mappings
>>
>> Maybe use _dl_eh_nodelete_mappings_*, because '_dl_eh_nodelete_mappings'
>> itself is not defined anywhere.
> 
> Right.
> 
>>> +   Adding new elements to this data structure is another source of
>>> +   quadratic behavior for dlopen.  If the other causes of quadratic
>>> +   behavior are eliminated, a more complicated data structure will be
>>> +   needed.  */
>>
>> This worries me, especially since we have reports that python and other
>> dynamic environments do use a lot of plugins and generate a lot of
>> dlopen() calls.  What kind of performance implication do you foresee here?
> 
> The additional overhead is not disproportionate to the other sources of
> quadratic behavior.  With 1,000 dlopen'ed objects, overall run-time
> seems to be comparable to the strcmp time required for soname matching,
> for example, and is quite difficult to measure.  So we could fix the
> performance regression if we used a hash table for that …
> 
> It's just an undesirable complexity class.  The implementation is not
> actually slow because it's a mostly-linear copy (although a backwards
> one).  Other parts of dlopen involve pointer chasing and are much
> slower.

Right, I agree this probably won't incur performance issues; I was
curious whether you had any numbers on it.

> 
>>> +/* Allocate an empty segment that is at least SIZE large.  PREVIOUS */
>>
>> What does this PREVIOUS refer to?
> 
> Oops, it's now:
> 
> /* Allocate an empty segment that is at least SIZE large.  PREVIOUS
>points to the chain of previously allocated segments and can be
>NULL.  */
> 
>>> +/* Update the version to reflect that an update is happening.  This
>>> +   does not change the bit that controls the active segment chain.
>>> +   Returns the index of the currently active segment chain.  */
>>> +static inline unsigned int
>>> +_dl_eh_mappings_begin_update (void)
>>> +{
>>> +  unsigned int v
>>> += __atomic_wide_counter_fetch_add_relaxed 
>>> (&_dl_eh_loaded_mappings_version
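For context, a rough sketch of the versioned-read pattern such a counter
supports (an editorial simplification in plain C11 with hypothetical
names; the actual patch combines the wide counter with dual segment
chains):

  #include <stdatomic.h>
  #include <stdint.h>

  static _Atomic uint64_t version;  /* odd while an update is in progress */
  static _Atomic int payload;       /* stands in for the mapping data */

  static int
  read_payload (void)
  {
    for (;;)
      {
        uint64_t v1 = atomic_load_explicit (&version, memory_order_acquire);
        int value = atomic_load_explicit (&payload, memory_order_relaxed);
        atomic_thread_fence (memory_order_acquire);
        uint64_t v2 = atomic_load_explicit (&version, memory_order_relaxed);
        if (v1 == v2 && (v1 & 1) == 0)  /* unchanged and even: consistent */
          return value;
      }
  }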

Re: [PATCH 3/3] elf: Add _dl_find_eh_frame function

2021-11-16 Thread Adhemerval Zanella via Gcc-patches



On 03/11/2021 13:28, Florian Weimer via Gcc-patches wrote:
> This function is similar to __gnu_Unwind_Find_exidx as used on arm.
> It can be used to speed up the libgcc unwinder.

Besides the terse patch description, the design seems ok to accomplish the
lock-free read and update.  There are some questions and remarks below,
and I still need to review the tests.

However the code is somewhat complex and I would like to have some feedback
on whether gcc will be willing to accept this change (I assume it would
require this code to be merged into glibc beforehand).

> ---
>  NEWS  |   4 +
>  bits/dlfcn_eh_frame.h |  33 +
>  dlfcn/Makefile|   2 +-
>  dlfcn/dlfcn.h |   2 +
>  elf/Makefile  |  31 +-
>  elf/Versions  |   3 +
>  elf/dl-close.c|   4 +
>  elf/dl-find_eh_frame.c| 864 ++
>  elf/dl-find_eh_frame.h|  90 ++
>  elf/dl-find_eh_frame_slow.h   |  55 ++
>  elf/dl-libc_freeres.c |   2 +
>  elf/dl-open.c |   5 +
>  elf/rtld.c|   7 +
>  elf/tst-dl_find_eh_frame-mod1.c   |  10 +
>  elf/tst-dl_find_eh_frame-mod2.c   |  10 +
>  elf/tst-dl_find_eh_frame-mod3.c   |  10 +
>  elf/tst-dl_find_eh_frame-mod4.c   |  10 +
>  elf/tst-dl_find_eh_frame-mod5.c   |  11 +
>  elf/tst-dl_find_eh_frame-mod6.c   |  11 +
>  elf/tst-dl_find_eh_frame-mod7.c   |  10 +
>  elf/tst-dl_find_eh_frame-mod8.c   |  10 +
>  elf/tst-dl_find_eh_frame-mod9.c   |  10 +
>  elf/tst-dl_find_eh_frame-threads.c| 237 +
>  elf/tst-dl_find_eh_frame.c| 179 
>  include/atomic_wide_counter.h |  14 +
>  include/bits/dlfcn_eh_frame.h |   1 +
>  include/link.h|   3 +
>  manual/Makefile   |   2 +-
>  manual/dynlink.texi   |  69 ++
>  manual/libdl.texi |  10 -
>  manual/probes.texi|   2 +-
>  manual/threads.texi   |   2 +-
>  sysdeps/i386/bits/dlfcn_eh_frame.h|  34 +
>  sysdeps/mach/hurd/i386/ld.abilist |   1 +
>  sysdeps/nios2/bits/dlfcn_eh_frame.h   |  34 +
>  sysdeps/unix/sysv/linux/aarch64/ld.abilist|   1 +
>  sysdeps/unix/sysv/linux/alpha/ld.abilist  |   1 +
>  sysdeps/unix/sysv/linux/arc/ld.abilist|   1 +
>  sysdeps/unix/sysv/linux/arm/be/ld.abilist |   1 +
>  sysdeps/unix/sysv/linux/arm/le/ld.abilist |   1 +
>  sysdeps/unix/sysv/linux/csky/ld.abilist   |   1 +
>  sysdeps/unix/sysv/linux/hppa/ld.abilist   |   1 +
>  sysdeps/unix/sysv/linux/i386/ld.abilist   |   1 +
>  sysdeps/unix/sysv/linux/ia64/ld.abilist   |   1 +
>  .../unix/sysv/linux/m68k/coldfire/ld.abilist  |   1 +
>  .../unix/sysv/linux/m68k/m680x0/ld.abilist|   1 +
>  sysdeps/unix/sysv/linux/microblaze/ld.abilist |   1 +
>  .../unix/sysv/linux/mips/mips32/ld.abilist|   1 +
>  .../sysv/linux/mips/mips64/n32/ld.abilist |   1 +
>  .../sysv/linux/mips/mips64/n64/ld.abilist |   1 +
>  sysdeps/unix/sysv/linux/nios2/ld.abilist  |   1 +
>  .../sysv/linux/powerpc/powerpc32/ld.abilist   |   1 +
>  .../linux/powerpc/powerpc64/be/ld.abilist |   1 +
>  .../linux/powerpc/powerpc64/le/ld.abilist |   1 +
>  sysdeps/unix/sysv/linux/riscv/rv32/ld.abilist |   1 +
>  sysdeps/unix/sysv/linux/riscv/rv64/ld.abilist |   1 +
>  .../unix/sysv/linux/s390/s390-32/ld.abilist   |   1 +
>  .../unix/sysv/linux/s390/s390-64/ld.abilist   |   1 +
>  sysdeps/unix/sysv/linux/sh/be/ld.abilist  |   1 +
>  sysdeps/unix/sysv/linux/sh/le/ld.abilist  |   1 +
>  .../unix/sysv/linux/sparc/sparc32/ld.abilist  |   1 +
>  .../unix/sysv/linux/sparc/sparc64/ld.abilist  |   1 +
>  sysdeps/unix/sysv/linux/x86_64/64/ld.abilist  |   1 +
>  sysdeps/unix/sysv/linux/x86_64/x32/ld.abilist |   1 +
>  64 files changed, 1795 insertions(+), 16 deletions(-)
>  create mode 100644 bits/dlfcn_eh_frame.h
>  create mode 100644 elf/dl-find_eh_frame.c
>  create mode 100644 elf/dl-find_eh_frame.h
>  create mode 100644 elf/dl-find_eh_frame_slow.h
>  create mode 100644 elf/tst-dl_find_eh_frame-mod1.c
>  create mode 100644 elf/tst-dl_find_eh_frame-mod2.c
>  create mode 100644 elf/tst-dl_find_eh_frame-mod3.c
>  create mode 100644 elf/tst-dl_find_eh_frame-mod4.c
>  create mode 100644 elf/tst-dl_find_eh_frame-mod5.c
>  create mode 100644 elf/tst-dl_find_eh_frame-mod6.c
>  create mode 100644 elf/tst-dl_find_eh_frame-mod7.c
>  create mode 100644 elf/tst-dl_find_eh_frame-mod8.c
>  create mode 100644 elf/tst-dl_find_eh_frame-mod9.c
>  create mode 100644 

Re: [PATCH 2/3] elf: Introduce GLRO (dl_libc_freeres), called from __libc_freeres

2021-11-15 Thread Adhemerval Zanella via Gcc-patches
Maybe add a comment explaining why this will be used.

Reviewed-by: Adhemerval Zanella  

On 03/11/2021 13:27, Florian Weimer via Gcc-patches wrote:
> ---
>  elf/Makefile   |  2 +-
>  elf/dl-libc_freeres.c  | 24 
>  elf/rtld.c |  1 +
>  malloc/set-freeres.c   |  5 +
>  sysdeps/generic/ldsodefs.h |  7 +++
>  5 files changed, 38 insertions(+), 1 deletion(-)
>  create mode 100644 elf/dl-libc_freeres.c
> 
> diff --git a/elf/Makefile b/elf/Makefile
> index cb9bcfb799..1c768bdf47 100644
> --- a/elf/Makefile
> +++ b/elf/Makefile
> @@ -68,7 +68,7 @@ elide-routines.os = $(all-dl-routines) dl-support 
> enbl-secure dl-origin \
>  rtld-routines= rtld $(all-dl-routines) dl-sysdep dl-environ 
> dl-minimal \
>dl-error-minimal dl-conflict dl-hwcaps dl-hwcaps_split dl-hwcaps-subdirs \
>dl-usage dl-diagnostics dl-diagnostics-kernel dl-diagnostics-cpu \
> -  dl-mutex
> +  dl-mutex dl-libc_freeres
>  all-rtld-routines = $(rtld-routines) $(sysdep-rtld-routines)
>  
>  CFLAGS-dl-runtime.c += -fexceptions -fasynchronous-unwind-tables

Ok.

> diff --git a/elf/dl-libc_freeres.c b/elf/dl-libc_freeres.c
> new file mode 100644
> index 00..68f305a6f9
> --- /dev/null
> +++ b/elf/dl-libc_freeres.c
> @@ -0,0 +1,24 @@
> +/* Deallocating malloc'ed memory from the dynamic loader.
> +   Copyright (C) 2021 Free Software Foundation, Inc.
> +   This file is part of the GNU C Library.
> +
> +   The GNU C Library is free software; you can redistribute it and/or
> +   modify it under the terms of the GNU Lesser General Public
> +   License as published by the Free Software Foundation; either
> +   version 2.1 of the License, or (at your option) any later version.
> +
> +   The GNU C Library is distributed in the hope that it will be useful,
> +   but WITHOUT ANY WARRANTY; without even the implied warranty of
> +   MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU
> +   Lesser General Public License for more details.
> +
> +   You should have received a copy of the GNU Lesser General Public
> +   License along with the GNU C Library; if not, see
> +   <https://www.gnu.org/licenses/>.  */
> +
> +#include 
> +
> +void
> +__rtld_libc_freeres (void)
> +{
> +}

Ok.

> diff --git a/elf/rtld.c b/elf/rtld.c
> index be2d5d8e74..847141e21d 100644
> --- a/elf/rtld.c
> +++ b/elf/rtld.c
> @@ -378,6 +378,7 @@ struct rtld_global_ro _rtld_global_ro attribute_relro =
>  ._dl_catch_error = _rtld_catch_error,
>  ._dl_error_free = _dl_error_free,
>  ._dl_tls_get_addr_soft = _dl_tls_get_addr_soft,
> +._dl_libc_freeres = __rtld_libc_freeres,
>  #ifdef HAVE_DL_DISCOVER_OSVERSION
>  ._dl_discover_osversion = _dl_discover_osversion
>  #endif

Ok.

> diff --git a/malloc/set-freeres.c b/malloc/set-freeres.c
> index 5c19a2725c..856ff7831f 100644
> --- a/malloc/set-freeres.c
> +++ b/malloc/set-freeres.c
> @@ -21,6 +21,7 @@
>  #include 
>  #include 
>  #include 
> +#include 
>  
>  #include "../nss/nsswitch.h"
>  #include "../libio/libioP.h"
> @@ -67,6 +68,10 @@ __libc_freeres (void)
>  
>call_function_static_weak (__libc_dlerror_result_free);
>  
> +#ifdef SHARED
> +  GLRO (dl_libc_freeres) ();
> +#endif
> +
>for (p = symbol_set_first_element (__libc_freeres_ptrs);
> !symbol_set_end_p (__libc_freeres_ptrs, p); ++p)
>  free (*p);

OK.

> diff --git a/sysdeps/generic/ldsodefs.h b/sysdeps/generic/ldsodefs.h
> index 1318c36dce..c26860430c 100644
> --- a/sysdeps/generic/ldsodefs.h
> +++ b/sysdeps/generic/ldsodefs.h
> @@ -712,6 +712,10 @@ struct rtld_global_ro
>   namespace.  */
>void (*_dl_error_free) (void *);
>void *(*_dl_tls_get_addr_soft) (struct link_map *);
> +
> +  /* Called from __libc_shared to deallocate malloc'ed memory.  */
> +  void (*_dl_libc_freeres) (void);
> +
>  #ifdef HAVE_DL_DISCOVER_OSVERSION
>int (*_dl_discover_osversion) (void);
>  #endif
> @@ -1416,6 +1420,9 @@ __rtld_mutex_init (void)
>  }
>  #endif /* !PTHREAD_IN_LIBC */
>  
> +/* Implementation of GL (dl_libc_freeres).  */
> +void __rtld_libc_freeres (void) attribute_hidden;
> +
>  void __thread_gscope_wait (void) attribute_hidden;
>  # define THREAD_GSCOPE_WAIT() __thread_gscope_wait ()
>  
> 

OK.


Re: [PATCH 1/3] nptl: Extract from pthread_cond_common.c

2021-11-15 Thread Adhemerval Zanella via Gcc-patches



On 03/11/2021 13:27, Florian Weimer via Libc-alpha wrote:
> And make it an installed header.  This addresses a few aliasing
> violations (which do not seem to result in miscompilation due to
> the use of atomics), and also enables use of wide counters in other
> parts of the library.
> 
> The debug output in nptl/tst-cond22 has been adjusted to print
> the 32-bit values instead because it avoids a big-endian/little-endian
> difference.

LGTM, thanks.

Reviewed-by: Adhemerval Zanella  

> ---
>  bits/atomic_wide_counter.h  |  35 
>  include/atomic_wide_counter.h   |  89 +++
>  include/bits/atomic_wide_counter.h  |   1 +
>  misc/Makefile   |   3 +-
>  misc/atomic_wide_counter.c  | 127 +++
>  nptl/Makefile   |  13 +-
>  nptl/pthread_cond_common.c  | 204 
>  nptl/tst-cond22.c   |  14 +-
>  sysdeps/nptl/bits/thread-shared-types.h |  22 +--
>  9 files changed, 310 insertions(+), 198 deletions(-)
>  create mode 100644 bits/atomic_wide_counter.h
>  create mode 100644 include/atomic_wide_counter.h
>  create mode 100644 include/bits/atomic_wide_counter.h
>  create mode 100644 misc/atomic_wide_counter.c
> 
> diff --git a/bits/atomic_wide_counter.h b/bits/atomic_wide_counter.h
> new file mode 100644
> index 00..0687eb554e
> --- /dev/null
> +++ b/bits/atomic_wide_counter.h
> @@ -0,0 +1,35 @@
> +/* Monotonically increasing wide counters (at least 62 bits).
> +   Copyright (C) 2016-2021 Free Software Foundation, Inc.
> +   This file is part of the GNU C Library.
> +
> +   The GNU C Library is free software; you can redistribute it and/or
> +   modify it under the terms of the GNU Lesser General Public
> +   License as published by the Free Software Foundation; either
> +   version 2.1 of the License, or (at your option) any later version.
> +
> +   The GNU C Library is distributed in the hope that it will be useful,
> +   but WITHOUT ANY WARRANTY; without even the implied warranty of
> +   MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU
> +   Lesser General Public License for more details.
> +
> +   You should have received a copy of the GNU Lesser General Public
> +   License along with the GNU C Library; if not, see
> +   <https://www.gnu.org/licenses/>.  */
> +
> +#ifndef _BITS_ATOMIC_WIDE_COUNTER_H
> +#define _BITS_ATOMIC_WIDE_COUNTER_H
> +
> +/* Counter that is monotonically increasing (by less than 2**31 per
> +   increment), with a single writer, and an arbitrary number of
> +   readers.  */
> +typedef union
> +{
> +  __extension__ unsigned long long int __value64;
> +  struct
> +  {
> +unsigned int __low;
> +unsigned int __high;
> +  } __value32;
> +} __atomic_wide_counter;
> +
> +#endif /* _BITS_ATOMIC_WIDE_COUNTER_H */

Ok, it would be included in multiple places, so we can't tie it to a
specific header.
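As a side note, a rough sketch of how the two 32-bit halves of the union
can be combined on targets without 64-bit atomics (an editorial
simplification in glibc-internal style; the real fallback in
misc/atomic_wide_counter.c is more involved):

  /* Single writer, monotonic counter: if the high half is unchanged
     around the low-half read, the combined value is consistent.  */
  static inline uint64_t
  wide_counter_load_relaxed_32 (__atomic_wide_counter *c)
  {
    unsigned int high, low;
    do
      {
        high = atomic_load_relaxed (&c->__value32.__high);
        low = atomic_load_relaxed (&c->__value32.__low);
      }
    while (high != atomic_load_relaxed (&c->__value32.__high));
    return ((uint64_t) high << 32) | low;
  }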

> diff --git a/include/atomic_wide_counter.h b/include/atomic_wide_counter.h
> new file mode 100644
> index 00..31f009d5e6
> --- /dev/null
> +++ b/include/atomic_wide_counter.h
> @@ -0,0 +1,89 @@
> +/* Monotonically increasing wide counters (at least 62 bits).
> +   Copyright (C) 2016-2021 Free Software Foundation, Inc.
> +   This file is part of the GNU C Library.
> +
> +   The GNU C Library is free software; you can redistribute it and/or
> +   modify it under the terms of the GNU Lesser General Public
> +   License as published by the Free Software Foundation; either
> +   version 2.1 of the License, or (at your option) any later version.
> +
> +   The GNU C Library is distributed in the hope that it will be useful,
> +   but WITHOUT ANY WARRANTY; without even the implied warranty of
> +   MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU
> +   Lesser General Public License for more details.
> +
> +   You should have received a copy of the GNU Lesser General Public
> +   License along with the GNU C Library; if not, see
> +   <https://www.gnu.org/licenses/>.  */
> +
> +#ifndef _ATOMIC_WIDE_COUNTER_H
> +#define _ATOMIC_WIDE_COUNTER_H
> +
> +#include 
> +#include 
> +
> +#if __HAVE_64B_ATOMICS
> +
> +static inline uint64_t
> +__atomic_wide_counter_load_relaxed (__atomic_wide_counter *c)
> +{
> +  return atomic_load_relaxed (&c->__value64);
> +}
> +
> +static inline uint64_t
> +__atomic_wide_counter_fetch_add_relaxed (__atomic_wide_counter *c,
> + unsigned int val)
> +{
> +  return atomic_fetch_add_relaxed (&c->__value64, val);
> +}
> +
> +static inline uint64_t
> +__atomic_wide_counter_fetch_a

Re: [PATCH v6] aarch64: Add split-stack support

2018-02-27 Thread Adhemerval Zanella
Ping (with Szabolcs remarks fixed).

On 07/02/2018 16:07, Adhemerval Zanella wrote:
> Changes from previous version:
> 
>   - Changed the way __morestack is called to use a branch with link
> instead of a simple branch.  This allows using a call instruction and
> avoid possible issues with later optimization passes which might
> see a branch outside the instruction block (as noticed in previous
> iterations while building a more complex workload as speccpu2006).
> 
>   - Change the return address to use the branch with link value and
> set x12 to save x30.  This simplifies the required instructions
> to setup/save the return address.
> 
> --
> 
> This patch adds the split-stack support on aarch64 (PR #67877).  As for
> other ports this patch should be used along with glibc and gold support.
> 
> The support is done similarly to other architectures: a split-stack field
> is allocated before the TCB by glibc, a target-specific __morestack
> implementation and helper functions are added in libgcc, and the compiler
> is adjusted accordingly (split-stack prologue, va_start for argument
> handling).  I also plan to send the gold support to adjust stack
> allocation across split-stack and default code calls.
> 
> The current approach is to set the final stack adjustment using at most
> 2 instructions (mov/movk), which limits stack allocation to an upper limit
> of 4GB.  The morestack call is non-standard, with x10 holding the requested
> stack pointer, x11 the argument pointer (if required), and x12 the return
> continuation address.  Unwinding is handled by a personality routine that
> knows how to find stack segments.
> 
> Split-stack prologue on function entry is as follow (this goes before the
> usual function prologue):
> 
> function:
>   mrs    x9, tpidr_el0
>   ldur   x9, [x9, -8]
>   mov    x10, 
>   movk   x10, #0x0, lsl #16
>   sub    x10, sp, x10
>   mov    x11, sp  # if function has stacked arguments
>   cmp    x9, x10
>   bcc    .LX
> main_fn_entry:
>   [function prologue]
> LX:
>   bl __morestack
>   b  main_fn_entry
> 
> Notes:
> 
> 1. Even if a function does not allocate a stack frame, a split-stack prologue
>is created.  This is to avoid issues with tail calls to external symbols
>which might require linker adjustment (libgo/runtime/go-varargs.c).
> 
> 2. Basic-block reordering (enabled with -O2) will move split-stack TCB ldur
>to after the required stack calculation.
> 
> 3. Similar to powerpc, when the linker detects a call from split-stack to
>non-split-stack code, it adds 16k (or more) to the value found in 
> "allocate"
>instructions (so non-split-stack code gets a larger stack).  The amount is
>tunable by a linker option.  This feature is only implemented in the GNU
>gold linker.
> 
> 4. AArch64 does not handle a >4G stack initially, and although it is possible
>to implement it, limiting to 4G allows materializing the allocation with
>only 2 instructions (mov + movk), thus simplifying the linker
>adjustments required.  Supporting multiple threads each requiring more
>than 4G of stack is probably not that important, and likely to OOM at
>run time.
> 
> 5. The TCB support on GLIBC is meant to be included in version 2.28.
> 
> 6. Besides the regression tests I also checked a SPECcpu2006 run with
>the additional -fsplit-stack option.  I saw no regression besides 416.gamess
>which fails on trunk as well (not sure if some misconfiguration in my
>environment).
> 
> libgcc/ChangeLog:
> 
>   * libgcc/config.host: Use t-stack and t-stack-aarch64 for
>   aarch64*-*-linux.
>   * libgcc/config/aarch64/morestack-c.c: New file.
>   * libgcc/config/aarch64/morestack.S: Likewise.
>   * libgcc/config/aarch64/t-stack-aarch64: Likewise.
>   * libgcc/generic-morestack.c (__splitstack_find): Add aarch64-specific
>   code.
> 
> gcc/ChangeLog:
> 
>   * common/config/aarch64/aarch64-common.c
>   (aarch64_supports_split_stack): New function.
>   (TARGET_SUPPORTS_SPLIT_STACK): New macro.
>   * gcc/config/aarch64/aarch64-linux.h (TARGET_ASM_FILE_END): Remove
>   macro.
>   * gcc/config/aarch64/aarch64-protos.h: Add
>   aarch64_expand_split_stack_prologue and
>   aarch64_split_stack_space_check.
>   * gcc/config/aarch64/aarch64.c (aarch64_expand_builtin_va_start): Use
>   internal argument pointer instead of virtual_incoming_args_rtx.
>   (morestack_ref): New symbol.
>   (aarch64_load_split_stack_value): New function.
>   (aarch64_expand_split_stack_prologue): Likewise.
>   (aarch64_inte

Re: [PATCH] Add sanitizer support for AArch64 ILP32

2018-02-23 Thread Adhemerval Zanella


On 23/02/2018 11:39, Jakub Jelinek wrote:
> On Fri, Feb 23, 2018 at 11:27:19AM -0300, Adhemerval Zanella wrote:
>> This patch adds asan support for aarch64 ilp32.  It is based on 'official'
>> glibc support branch [1] (which is in turn based on Linux branch [2]).
>>
>> The main change for libsanitizer is the kernel ABI support for internal
>> syscalls.  Different from x32, ILP32 tries to leverage 32-bit syscalls;
>> 64-bit arguments in the kernel ABI (ftruncate for instance)
>> are passed using the new Linux generic way for 32 bits (by splitting
>> high and low into two registers).
>>
>> So instead of adding more ad-hoc defines for ILP32, this patch extends
>> SANITIZER_USES_CANONICAL_LINUX_SYSCALLS to represent both 64-bit
>> argument syscalls (value of 1) and 32-bit ones (value of 2).
>>
>> The shadow offset used is similar to the arm one (29).
>>
>> I did a sanity check on aarch64-linux-gnu and x86_64-linux-gnu without
>> any regressions.
>>
>> PS: I sent this change to llvm-commit, but it got stalled because llvm
>> lacks aarch64 ilp32 support and no one could confirm no breakage due
>> to a missing buildbot.  Also the idea is not to sync it back to
>> libsanitizer.
> 
> It will be a nightmare for merges from upstream, but because the upstream is
> so uncooperative probably there is no other way.

What I meant is that the idea is to sync it back to libsanitizer, sorry
for the misunderstanding.  I can try to push it again to compiler-rt, but
it took a long time to actually get any reply.

> 
>> gcc/
>>
>>  * gcc/config/aarch64/aarch64.c (aarch64_asan_shadow_offset): Add
> 
> No gcc/ prefixes in gcc/ChangeLog.

Right, fixed locally.

> 
>>  TARGET_ILP32 support.
>>
>> libsanitizer/
> 
> Nor libsanitizer/ prefixes in libsanitizer/ChangeLog.
>> --- a/libsanitizer/sanitizer_common/sanitizer_internal_defs.h
>> +++ b/libsanitizer/sanitizer_common/sanitizer_internal_defs.h
>> @@ -117,9 +117,9 @@ typedef signed   long long sptr;  // NOLINT
>>  typedef unsigned long uptr;  // NOLINT
>>  typedef signed   long sptr;  // NOLINT
>>  #endif  // defined(_WIN64)
>> -#if defined(__x86_64__)
>> -// Since x32 uses ILP32 data model in 64-bit hardware mode, we must use
>> -// 64-bit pointer to unwind stack frame.
>> +#if defined(__x86_64__) || SANITIZER_AARCH64_ILP32
>> +// Since x32 adn AArch64 ILP32 use ILP32 data model in 64-bit hardware 
>> mode, 
> 
> s/adn/and/

Ack.

> 
>> @@ -445,11 +467,7 @@ uptr internal_execve(const char *filename, char *const 
>> argv[],
>>  // - sanitizer_common.h
>>  bool FileExists(const char *filename) {
>>struct stat st;
>> -#if SANITIZER_USES_CANONICAL_LINUX_SYSCALLS
>> -  if (internal_syscall(SYSCALL(newfstatat), AT_FDCWD, filename, , 0))
>> -#else
>>if (internal_stat(filename, ))
>> -#endif
>>  return false;
>>// Sanity check: filename is a regular file.
>>return S_ISREG(st.st_mode);
> 
> I'm uneasy about this hunk, you change behavior not just on aarch64 ilp32,
> but for many other targets too.  Please don't.

My understanding is it's a noop, since for SANITIZER_USES_CANONICAL_LINUX_SYSCALLS
internal_stat would be an internal_syscall(newfstatat, ...) (also
the internal_* names exist exactly to abstract this kind of construction).

> 
> Do you really want it for GCC8?  We are in stage4 and this isn't a
> regression...
> 
>   Jakub
> 

I don't have a strong opinion about that; it would be good to have in GCC 8,
but I think I can wait.


[PATCH] Add sanitizer support for AArch64 ILP32

2018-02-23 Thread Adhemerval Zanella
This patch adds asan support for aarch64 ilp32.  It is based on 'official'
glibc support branch [1] (which is in turn based on Linux branch [2]).

The main change for libsanitizer is the kernel ABI support for internal
syscalls.  Different from x32, ILP32 tries to leverage 32-bit syscalls;
64-bit arguments in the kernel ABI (ftruncate for instance)
are passed using the new Linux generic way for 32 bits (by splitting
high and low into two registers).

So instead of adding more ad-hoc defines for ILP32, this patch extends
SANITIZER_USES_CANONICAL_LINUX_SYSCALLS to represent both 64-bit
argument syscalls (value of 1) and 32-bit ones (value of 2).
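As a rough illustration of the 64-bit argument splitting described above
(an editorial sketch; the actual SYSCALL_LL64 macro added by the patch
may differ, e.g. in how it handles endianness):

  /* Expand a 64-bit value into the two 32-bit syscall arguments the
     kernel expects, low word first on a little-endian target.  */
  #define SYSCALL_LL64(v)                                       \
    (unsigned long) ((unsigned long long) (v) & 0xffffffffUL),  \
    (unsigned long) ((unsigned long long) (v) >> 32)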

The shadow offset used is similar to the arm one (29).

I did a sanity check on aarch64-linux-gnu and x86_64-linux-gnu without
any regressions.

PS: I sent this change to llvm-commit, but it got stalled because llvm
lacks aarch64 ilp32 support and no one could confirm no breakage due to
a missing buildbot.  Also the idea is not to sync it back to libsanitizer.

gcc/

* gcc/config/aarch64/aarch64.c (aarch64_asan_shadow_offset): Add
TARGET_ILP32 support.

libsanitizer/

* libsanitizer/sanitizer_common/sanitizer_internal_defs.h
(uhwptr, OFF_T, operator_new_size_type): Define for
SANITIZER_AARCH64_ILP32.
* libsanitizer/sanitizer_common/sanitizer_linux.cc (SYSCALL_LL64):
New define: wrap a 64-bit argument for syscall.
(internal_ftruncate, internal_stat, internal_lstat, internal_unlink,
internal_rename, internal_lseek): Add support for
SANITIZER_USES_CANONICAL_LINUX_SYSCALLS equal to 2.
(FileExists): Use internal_stat regardless.
* libsanitizer/sanitizer_common/sanitizer_platform.h
(SANITIZER_AARCH64_ILP32): Define.
(SANITIZER_USES_CANONICAL_LINUX_SYSCALLS): Define to 1 or 2 depending
of the expected architecture ABI.
* libsanitizer/sanitizer_common/sanitizer_platform_limits_posix.cc
(CHECK_SIZE_AND_OFFSET(ipc_perm,mode): Define for AArch64 ILP32.
* libsanitizer/sanitizer_common/sanitizer_platform_limits_posix.h
(struct_kernel_stat_sz, struct_kernel_stat64_sz,
__sanitizer___kernel_uid_t, __sanitizer___kernel_gid_t): Likewise.
[__sanitizer_dirent] (d_ino, d_off): Likewise.

[1] https://sourceware.org/git/?p=glibc.git;a=shortlog;h=refs/heads/arm/ilp32
[2] git://git.kernel.org/pub/scm/linux/kernel/git/arm64/linux.git
---
 gcc/config/aarch64/aarch64.c   |  5 ++-
 .../sanitizer_common/sanitizer_internal_defs.h | 12 +++
 libsanitizer/sanitizer_common/sanitizer_linux.cc   | 40 ++
 libsanitizer/sanitizer_common/sanitizer_platform.h | 15 ++--
 .../sanitizer_platform_limits_posix.cc |  3 +-
 .../sanitizer_platform_limits_posix.h  | 13 +--
 6 files changed, 69 insertions(+), 19 deletions(-)

diff --git a/gcc/config/aarch64/aarch64.c b/gcc/config/aarch64/aarch64.c
index 33c90ef..eebf85b 100644
--- a/gcc/config/aarch64/aarch64.c
+++ b/gcc/config/aarch64/aarch64.c
@@ -16171,7 +16171,10 @@ aarch64_split_dimode_const_store (rtx dst, rtx src)
 static unsigned HOST_WIDE_INT
 aarch64_asan_shadow_offset (void)
 {
-  return (HOST_WIDE_INT_1 << 36);
+  if (TARGET_ILP32)
+return (HOST_WIDE_INT_1 << 29);
+  else
+return (HOST_WIDE_INT_1 << 36);
 }
 
 static rtx
diff --git a/libsanitizer/sanitizer_common/sanitizer_internal_defs.h 
b/libsanitizer/sanitizer_common/sanitizer_internal_defs.h
index edd6a21..aef83c2 100644
--- a/libsanitizer/sanitizer_common/sanitizer_internal_defs.h
+++ b/libsanitizer/sanitizer_common/sanitizer_internal_defs.h
@@ -117,9 +117,9 @@ typedef signed   long long sptr;  // NOLINT
 typedef unsigned long uptr;  // NOLINT
 typedef signed   long sptr;  // NOLINT
 #endif  // defined(_WIN64)
-#if defined(__x86_64__)
-// Since x32 uses ILP32 data model in 64-bit hardware mode, we must use
-// 64-bit pointer to unwind stack frame.
+#if defined(__x86_64__) || SANITIZER_AARCH64_ILP32
+// Since x32 adn AArch64 ILP32 use ILP32 data model in 64-bit hardware mode, 
+// we must use 64-bit pointer to unwind stack frame.
 typedef unsigned long long uhwptr;  // NOLINT
 #else
 typedef uptr uhwptr;   // NOLINT
@@ -144,7 +144,7 @@ typedef int error_t;
 typedef int pid_t;
 
 #if SANITIZER_FREEBSD || SANITIZER_NETBSD || SANITIZER_MAC || \
-(SANITIZER_LINUX && defined(__x86_64__))
+(SANITIZER_LINUX && (defined(__x86_64__) || SANITIZER_AARCH64_ILP32))
 typedef u64 OFF_T;
 #else
 typedef uptr OFF_T;
@@ -154,8 +154,8 @@ typedef u64  OFF64_T;
 #if (SANITIZER_WORDSIZE == 64) || SANITIZER_MAC
 typedef uptr operator_new_size_type;
 #else
-# if defined(__s390__) && !defined(__s390x__)
-// Special case: 31-bit s390 has unsigned long as size_t.
+# if (defined(__s390__) && !defined(__s390x__)) || SANITIZER_AARCH64_ILP32
+// Special case: 31-bit s390 and AArch64 ILP32 have unsigned long as size_t.
 typedef unsigned long operator_new_size_type;
 # else
 typedef 

Re: [PATCH v6] aarch64: Add split-stack support

2018-02-15 Thread Adhemerval Zanella


On 13/02/2018 13:13, Szabolcs Nagy wrote:
> On 07/02/18 18:07, Adhemerval Zanella wrote:
>  5. The TCB support on GLIBC is meant to be included in version 2.28.
>>
> ...
>> +/* -fsplit-stack uses a TCB field available on glibc-2.27.  GLIBC also
>> +   exports the symbol __tcb_private_ss to signal it has the field available
>> +   on the TCB block.  This aims to prevent binaries linked against newer
>> +   GLIBC from running on non-supported ones.  */
> 
> 
> i suspect this needs to be updated since the glibc patch
> is not committed yet.
> 
> (i'll review the glibc patch, if it looks ok then it can
> be committed after the gcc side is accepted.)

I fixed the commit message locally, thanks for checking on this.

> 
>> +
>> +static bool
>> +aarch64_supports_split_stack (bool report ATTRIBUTE_UNUSED,
>> +  struct gcc_options *opts ATTRIBUTE_UNUSED)
>> +{
>> +#ifndef TARGET_GLIBC_MAJOR
>> +#define TARGET_GLIBC_MAJOR 0
>> +#endif
>> +#ifndef TARGET_GLIBC_MINOR
>> +#define TARGET_GLIBC_MINOR 0
>> +#endif
>> +  /* Note: Can't test DEFAULT_ABI here, it isn't set until later.  */
>> +  if (TARGET_GLIBC_MAJOR * 1000 + TARGET_GLIBC_MINOR >= 2026)
>> +    return true;
>> +
>> +  if (report)
>> +    error ("%<-fsplit-stack%> currently only supported on AArch64 GNU/Linux 
>> with glibc-2.27 or later");
>> +  return false;
>> +}
>> +
>> +#undef TARGET_SUPPORTS_SPLIT_STACK
>> +#define TARGET_SUPPORTS_SPLIT_STACK aarch64_supports_split_stack
>> +


[PATCH v6] aarch64: Add split-stack support

2018-02-07 Thread Adhemerval Zanella
Changes from previous version:

  - Changed the way __morestack is called to use a branch with link
instead of a simple branch.  This allows using a call instruction and
avoid possible issues with later optimization passes which might
see a branch outside the instruction block (as noticed in previous
iterations while building a more complex workload as speccpu2006).

  - Change the return address to use the branch with link value and
set x12 to save x30.  This simplifies the required instructions
to setup/save the return address.

--

This patch adds the split-stack support on aarch64 (PR #67877).  As for
other ports this patch should be used along with glibc and gold support.

The support is done similarly to other architectures: a split-stack field
is allocated before the TCB by glibc, a target-specific __morestack
implementation and helper functions are added in libgcc, and the compiler
is adjusted accordingly (split-stack prologue, va_start for argument
handling).  I also plan to send the gold support to adjust stack allocation
across split-stack and default code calls.

The current approach is to set the final stack adjustment using at most
2 instructions (mov/movk), which limits stack allocation to an upper limit
of 4GB.  The morestack call is non-standard, with x10 holding the requested
stack pointer, x11 the argument pointer (if required), and x12 the return
continuation address.  Unwinding is handled by a personality routine that
knows how to find stack segments.

Split-stack prologue on function entry is as follow (this goes before the
usual function prologue):

function:
	mrs    x9, tpidr_el0
	ldur   x9, [x9, -8]
	mov    x10, 
	movk   x10, #0x0, lsl #16
	sub    x10, sp, x10
	mov    x11, sp  # if function has stacked arguments
	cmp    x9, x10
	bcc    .LX
main_fn_entry:
[function prologue]
LX:
bl __morestack
b  main_fn_entry

Notes:

1. Even if a function does not allocate a stack frame, a split-stack prologue
   is created.  This is to avoid issues with tail calls to external symbols
   which might require linker adjustment (libgo/runtime/go-varargs.c).

2. Basic-block reordering (enabled with -O2) will move split-stack TCB ldur
   to after the required stack calculation.

3. Similar to powerpc, when the linker detects a call from split-stack to
   non-split-stack code, it adds 16k (or more) to the value found in "allocate"
   instructions (so non-split-stack code gets a larger stack).  The amount is
   tunable by a linker option.  This feature is only implemented in the GNU
   gold linker.

4. AArch64 does not handle a >4G stack initially, and although it is possible
   to implement it, limiting to 4G allows materializing the allocation with
   only 2 instructions (mov + movk) and thus simplifying the linker
   adjustments required.  Supporting multiple threads each requiring more
   than 4G of stack is probably not that important, and likely to OOM at
   run time.

5. The TCB support on GLIBC is meant to be included in version 2.28.

6. Besides the regression tests I also checked a SPECcpu2006 run with
   the additional -fsplit-stack option.  I saw no regression besides 416.gamess
   which fails on trunk as well (not sure if some misconfiguration in my
   environment).
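As an editorial illustration of what the feature enables (hypothetical
usage, assuming glibc, gold, and this patch are all in place):

  /* demo.c: deep recursion that would overflow a fixed-size thread
     stack, but succeeds when the stack grows in discontiguous
     segments.  Build with: gcc -fsplit-stack -fuse-ld=gold demo.c  */
  #include <stdio.h>

  static long
  sum (long n)
  {
    char pad[1024];            /* force per-frame stack usage */
    pad[0] = 0;
    return n == 0 ? pad[0] : n + sum (n - 1);
  }

  int
  main (void)
  {
    printf ("%ld\n", sum (1000000));   /* ~1 GiB of stack frames */
    return 0;
  }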

libgcc/ChangeLog:

* libgcc/config.host: Use t-stack and t-stack-aarch64 for
aarch64*-*-linux.
* libgcc/config/aarch64/morestack-c.c: New file.
* libgcc/config/aarch64/morestack.S: Likewise.
* libgcc/config/aarch64/t-stack-aarch64: Likewise.
* libgcc/generic-morestack.c (__splitstack_find): Add aarch64-specific
code.

gcc/ChangeLog:

* common/config/aarch64/aarch64-common.c
(aarch64_supports_split_stack): New function.
(TARGET_SUPPORTS_SPLIT_STACK): New macro.
* gcc/config/aarch64/aarch64-linux.h (TARGET_ASM_FILE_END): Remove
macro.
* gcc/config/aarch64/aarch64-protos.h: Add
aarch64_expand_split_stack_prologue and
aarch64_split_stack_space_check.
* gcc/config/aarch64/aarch64.c (aarch64_expand_builtin_va_start): Use
internal argument pointer instead of virtual_incoming_args_rtx.
(morestack_ref): New symbol.
(aarch64_load_split_stack_value): New function.
(aarch64_expand_split_stack_prologue): Likewise.
(aarch64_internal_arg_pointer): Likewise.
(aarch64_file_end): Emit the split-stack note sections.
(aarch64_split_stack_space_check): Likewise.
(TARGET_ASM_FILE_END): New macro.
(TARGET_INTERNAL_ARG_POINTER): Likewise.
* gcc/config/aarch64/aarch64.h (aarch64_frame): Add
split_stack_arg_pointer to setup the argument pointer when using
split-stack.
* gcc/config/aarch64/aarch64.md
(UNSPECV_STACK_CHECK): New define.
(split_stack_prologue): New expand.
(split_stack_space_check): Likewise.
---

Re: [PATCH] Enable ifunc attribute by default for ARM GNU/Linux

2017-10-10 Thread Adhemerval Zanella


On 09/10/2017 19:20, Joseph Myers wrote:
> On Mon, 9 Oct 2017, Adhemerval Zanella wrote:
> 
>>  *-*-linux*)
>>  case ${target} in
>> -aarch64*-* | i[34567]86-* | powerpc*-* | s390*-* | sparc*-* | x86_64-*)
>> +aarch64*-* | i[34567]86-* | powerpc*-* | s390*-* | sparc*-* | x86_64-* 
>> | arm*-*)
> 
> I think the cases here are meant to be in alphabetical order.
> 

Ack, I will change.  Is it ok to commit with the change?


[PATCH] Enable ifunc attribute by default for ARM GNU/Linux

2017-10-09 Thread Adhemerval Zanella
Similar to other architectures with IFUNC binutils/glibc support, this
patch enables the ifunc attribute for ARM GNU/Linux.  Although not
required to build master GLIBC, the intention is to allow refactoring
its assembly implementations into C [1].

Tested compilation of glibc (in conjunction with a glibc patch to
support using the attribute on ARM) with build-many-glibcs.py (with
a patch to add an armv7 variant which enables multiarch).  I have
not run the GCC tests for ARM.
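For reference, a small sketch of what the attribute enables (the generic
GCC ifunc extension; hypothetical function names, not code from this
patch):

  /* The resolver runs at load time and returns the implementation the
     symbol should bind to, e.g. based on hwcaps.  */
  static int
  add_generic (int a, int b)
  {
    return a + b;
  }

  static int (*resolve_add (void)) (int, int)
  {
    return add_generic;   /* select an optimized variant here */
  }

  int add (int a, int b) __attribute__ ((ifunc ("resolve_add")));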

Adhemerval Zanella  <adhemerval.zane...@linaro.org>

* config.gcc (default_gnu_indirect_function): Default to yes for
arm*-*-linux* with glibc.

[1] https://sourceware.org/ml/libc-alpha/2017-10/msg00334.html
---
 gcc/ChangeLog  | 5 +
 gcc/config.gcc | 2 +-
 2 files changed, 6 insertions(+), 1 deletion(-)

diff --git a/gcc/config.gcc b/gcc/config.gcc
index 91a55e8..26aa8f6 100644
--- a/gcc/config.gcc
+++ b/gcc/config.gcc
@@ -3104,7 +3104,7 @@ case ${target} in
 ;;
 *-*-linux*)
case ${target} in
-   aarch64*-* | i[34567]86-* | powerpc*-* | s390*-* | sparc*-* | x86_64-*)
+   aarch64*-* | i[34567]86-* | powerpc*-* | s390*-* | sparc*-* | x86_64-* 
| arm*-*)
default_gnu_indirect_function=yes
;;
esac
-- 
2.7.4



Re: [PATCH v5] aarch64: Add split-stack initial support

2017-08-31 Thread Adhemerval Zanella
Ping.

On 26/07/2017 16:08, Adhemerval Zanella wrote:
> This is an update patch based on my previous submission [1].  The changes
> from previous version are:
> 
>- Update aarch64_supports_split_stack to return true and let the loader
> actually emit an error if it does not provide the required TCB
> field.  The TCB field support is planned for GLIBC 2.27.
> 
>   - Some cleanup on morestack-c.c to avoid code duplication (ss_pointer
> function).
> 
> I am still seeing some sporadic failures, but they are due to missing
> linker support for split calls with not enough stack size (on pprof and
> other go tests that call backtrace on signal handling).  This could be
> mitigated by increasing the BACKOFF to a higher value, but to not deviate
> from other ports with split-stack support this patch is using the default
> values (initial stack size of 16k and backoff of 4k).  This should be
> correctly handled with proper gold support (as for other ports).
> 
> --
> 
> This patch adds the split-stack support on aarch64 (PR #67877).  As for
> other ports this patch should be used along with glibc and gold support.
> 
> The support is done similarly to other architectures: a split-stack field
> is allocated before the TCB by glibc, a target-specific __morestack
> implementation and helper functions are added in libgcc, and the compiler
> is adjusted accordingly (split-stack prologue, va_start for argument
> handling).  I also plan to send the gold support to adjust stack
> allocation across split-stack and default code calls.
> 
> The current approach is to set the final stack adjustment using at most
> 2 instructions (mov/movk), which limits stack allocation to an upper limit
> of 4GB.  The morestack call is non-standard, with x10 holding the requested
> stack pointer, x11 the argument pointer (if required), and x12 the return
> continuation address.  Unwinding is handled by a personality routine that
> knows how to find stack segments.
> 
> Split-stack prologue on function entry is as follow (this goes before the
> usual function prologue):
> 
> function:
>   mrs    x9, tpidr_el0
>   mov    x10, 
>   movk   x10, #0x0, lsl #16
>   sub    x10, sp, x10
>   mov    x11, sp  # if function has stacked arguments
>   adrp   x12, main_fn_entry
>   add    x12, x12, :lo12:.L2
>   cmp    x9, x10
>   b.lt   
>   b  __morestack
> main_fn_entry:
>   [function prologue]
> 
> Notes:
> 
> 1. Even if a function does not allocate a stack frame, a split-stack prologue
>is created.  This is to avoid issues with tail calls to external symbols
>which might require linker adjustment (libgo/runtime/go-varargs.c).
> 
> 2. Basic-block reordering (enabled with -O2) will move split-stack TCB ldr
>to after the required stack calculation.
> 
> 3. Similar to powerpc, when the linker detects a call from split-stack to
>non-split-stack code, it adds 16k (or more) to the value found in 
> "allocate"
>instructions (so non-split-stack code gets a larger stack).  The amount is
>tunable by a linker option.  The edit means aarch64 does not need to
>implement __morestack_non_split, necessary on x86 because insufficient
>space is available there to edit the stack comparison code.  This feature
>is only implemented in the GNU gold linker.
> 
> 4. AArch64 does not handle a >4G stack initially, and although it is possible
>to implement it, limiting to 4G allows materializing the allocation with
>only 2 instructions (mov + movk) and thus simplifying the linker
>adjustments required.  Supporting multiple threads each requiring more
>than 4G of stack is probably not that important, and likely to OOM at
>run time.
> 
> 5. The TCB support on GLIBC is meant to be included in version 2.26.
> 
> 6. The continuation address materialized on x12 is done using 'adrp'
>plus add and a static relocation.  Current code uses the
>aarch64_expand_mov_immediate function and since a better alternative
>would be 'adp', it could be a future optimization (not implemented
>in this patch).
> 
> libgcc/ChangeLog:
> 
>   * libgcc/config.host: Use t-stack and t-stack-aarch64 for
>   aarch64*-*-linux.
>   * libgcc/config/aarch64/morestack-c.c: New file.
>   * libgcc/config/aarch64/morestack.S: Likewise.
>   * libgcc/config/aarch64/t-stack-aarch64: Likewise.
>   * libgcc/generic-morestack.c (__splitstack_find): Add aarch64-specific
>   code.
> 
> gcc/ChangeLog:
> 
>   * common/config/aarch64/aarch64-common.c
>   (aarch64_supports_split_stack): New function.
>   (TARGET_SUPPORTS_SPLIT_STACK): New macro.
>   * gcc/config/aarch64/aar

[PATCH v5] aarch64: Add split-stack initial support

2017-07-26 Thread Adhemerval Zanella
This is an updated patch based on my previous submission [1].  The changes
from the previous version are:

  - Update aarch64_supports_split_stack to return true and let the loader
actually emit an error if it does not provide the required TCB
field.  The TCB field support is planned for GLIBC 2.27.

  - Some cleanup on morestack-c.c to avoid code duplication (ss_pointer
function).

I am still seeing some sporadic failures, but they are due to missing
linker support for split calls with not enough stack size (on pprof and
other go tests that call backtrace on signal handling).  This could be
mitigated by increasing the BACKOFF to a higher value, but to not deviate
from other ports with split-stack support this patch is using the default
values (initial stack size of 16k and backoff of 4k).  This should be
correctly handled with proper gold support (as for other ports).

--

This patch adds the split-stack support on aarch64 (PR #67877).  As for
other ports this patch should be used along with glibc and gold support.

The support is done similarly to other architectures: a split-stack field
is allocated before the TCB by glibc, a target-specific __morestack
implementation and helper functions are added in libgcc, and the compiler
is adjusted accordingly (split-stack prologue, va_start for argument
handling).  I also plan to send the gold support to adjust stack allocation
across split-stack and default code calls.

The current approach is to set the final stack adjustment using at most
2 instructions (mov/movk), which limits stack allocation to an upper limit
of 4GB.  The morestack call is non-standard, with x10 holding the requested
stack pointer, x11 the argument pointer (if required), and x12 the return
continuation address.  Unwinding is handled by a personality routine that
knows how to find stack segments.

Split-stack prologue on function entry is as follow (this goes before the
usual function prologue):

function:
	mrs    x9, tpidr_el0
	mov    x10, 
	movk   x10, #0x0, lsl #16
	sub    x10, sp, x10
	mov    x11, sp  # if function has stacked arguments
	adrp   x12, main_fn_entry
	add    x12, x12, :lo12:.L2
	cmp    x9, x10
	b.lt   
b  __morestack
main_fn_entry:
[function prologue]

Notes:

1. Even if a function does not allocate a stack frame, a split-stack prologue
   is created.  This is to avoid issues with tail calls to external symbols
   which might require linker adjustment (libgo/runtime/go-varargs.c).

2. Basic-block reordering (enabled with -O2) will move split-stack TCB ldr
   to after the required stack calculation.

3. Similar to powerpc, when the linker detects a call from split-stack to
   non-split-stack code, it adds 16k (or more) to the value found in "allocate"
   instructions (so non-split-stack code gets a larger stack).  The amount is
   tunable by a linker option.  The edit means aarch64 does not need to
   implement __morestack_non_split, necessary on x86 because insufficient
   space is available there to edit the stack comparison code.  This feature
   is only implemented in the GNU gold linker.

4. AArch64 does not handle a >4G stack initially, and although it is possible
   to implement it, limiting to 4G allows materializing the allocation with
   only 2 instructions (mov + movk) and thus simplifying the linker
   adjustments required.  Supporting multiple threads each requiring more
   than 4G of stack is probably not that important, and likely to OOM at
   run time.

5. The TCB support on GLIBC is meant to be included in version 2.26.

6. The continuation address materialized on x12 is done using 'adrp'
   plus add and a static relocation.  Current code uses the
   aarch64_expand_mov_immediate function and since a better alternative
   would be 'adp', it could be a future optimization (not implemented
   in this patch).

libgcc/ChangeLog:

* libgcc/config.host: Use t-stack and t-stack-aarch64 for
aarch64*-*-linux.
* libgcc/config/aarch64/morestack-c.c: New file.
* libgcc/config/aarch64/morestack.S: Likewise.
* libgcc/config/aarch64/t-stack-aarch64: Likewise.
* libgcc/generic-morestack.c (__splitstack_find): Add aarch64-specific
code.

gcc/ChangeLog:

* common/config/aarch64/aarch64-common.c
(aarch64_supports_split_stack): New function.
(TARGET_SUPPORTS_SPLIT_STACK): New macro.
* gcc/config/aarch64/aarch64-linux.h (TARGET_ASM_FILE_END): Remove
macro.
* gcc/config/aarch64/aarch64-protos.h: Add
aarch64_expand_split_stack_prologue and
aarch64_split_stack_space_check.
* gcc/config/aarch64/aarch64.c (aarch64_gen_far_branch): Add suport
to emit 'b' instruction to rtx different than LABEL_REF.
(aarch64_expand_builtin_va_start): Use internal argument pointer
instead of virtual_incoming_args_rtx.
(morestack_ref): New symbol.
(aarch64_load_split_stack_value): New 

[PATCH] sparc: Set noexecstack on mulsi3, divsi3, and modsi3

2017-05-10 Thread Adhemerval Zanella
A recent GLIBC fix for sparc [1] caused some configurations to show
an executable stack on ld.so (shown as an elf/check-execstack testcase
failure).

This is because, with a sparc toolchain generated by build-many-glibcs.py
(a GLIBC script to produce cross-compiling toolchains), the minimum
supported sparc32 version is pre-v9, which requires a software
implementation of '.udiv'.  Since we are now using the libgcc.a one instead,
it must have the '.note.GNU-stack' section so the linker can properly mark
the stack non-executable.

From a build using a toolchain from build-many-glibcs.py:

elf/librtld.os.map

[...]
.../sparc64-glibc-linux-gnu/6.2.1/32/libgcc.a(_divsi3.o)
  
.../sparc64-glibc-linux-gnu/6.2.1/32/libgcc.a(_udivdi3.o) (.udiv)
.../sparc64-glibc-linux-gnu/6.2.1/32/libgcc.a(_clz.o)
  
.../lib/gcc/sparc64-glibc-linux-gnu/6.2.1/32/libgcc.a(_udivdi3.o) (__clz_tab)
[...]

And dumping _udivdi3.o section headers:

  [Nr] Name  TypeAddr OffSize   ES Flg Lk Inf Al
  [ 0]   NULL 00 00 00  0   0  0
  [ 1] .text PROGBITS 34 0002b0 00  AX  0   0  4
  [ 2] .data PROGBITS 0002e4 00 00  WA  0   0  1
  [ 3] .bss  NOBITS   0002e4 00 00  WA  0   0  1
  [ 4] .debug_line   PROGBITS 0002e4 00010d 00  0   0  1
  [ 5] .rela.debug_line  RELA 0007c0 0c 0c   I 12   4  4
  [ 6] .debug_info   PROGBITS 0003f1 ab 00  0   0  1
  [ 7] .rela.debug_info  RELA 0007cc 30 0c   I 12   6  4
  [ 8] .debug_abbrev PROGBITS 00049c 14 00  0   0  1
  [ 9] .debug_arangesPROGBITS 0004b0 20 00  0   0  8
  [10] .rela.debug_arang RELA 0007fc 18 0c   I 12   9  4
  [11] .shstrtab STRTAB   000814 70 00  0   0  1
  [12] .symtab   SYMTAB   0004d0 000220 10 13  32  4
  [13] .strtab   STRTAB   0006f0 cf 00  0   0  1

I am not seeing this on a native gcc build which I configured with:

' --with-arch-directory=sparc64 --enable-multiarch --enable-targets=all
  --with-cpu-32=ultrasparc --with-long-double-128 --enable-multilib'

Both libgcc's __udivdi3 and __umoddi3 do not pull in .udiv, since for this
libgcc build both use hardware instructions:

elf/librtld.os.map

/home/azanella/gcc/install/lib/gcc/sparc64-linux-gnu/6.3.1/32/libgcc.a(_udivdi3.o)
  
/home/azanella/glibc/glibc-git-build-sparcv9/elf/dl-allobjs.os (__udivdi3)
/home/azanella/gcc/install/lib/gcc/sparc64-linux-gnu/6.3.1/32/libgcc.a(_umoddi3.o)
  
/home/azanella/glibc/glibc-git-build-sparcv9/elf/dl-allobjs.os (__umoddi3)

This patch adds the missing noexecstack marking to the sparc assembly
implementation.  I saw no regression in the gcc testsuite and it fixes the
regression on the GLIBC side.

libgcc/

* config/sparc/lb1spc.S [__ELF__ && __linux__]: Emit .note.GNU-stack
section for a non-executable stack.

[1] 
https://sourceware.org/git/?p=glibc.git;a=commit;h=bdc543e338281da051b3dc06eae96c330a485ce6
---
 libgcc/ChangeLog | 5 +
 libgcc/config/sparc/lb1spc.S | 6 ++
 2 files changed, 11 insertions(+)

diff --git a/libgcc/config/sparc/lb1spc.S b/libgcc/config/sparc/lb1spc.S
index b60bd57..e693864 100644
--- a/libgcc/config/sparc/lb1spc.S
+++ b/libgcc/config/sparc/lb1spc.S
@@ -5,6 +5,12 @@
slightly edited to match the desired calling convention, and also to
optimize them for our purposes.  */
 
+/* An executable stack is *not* required for these functions.  */
+#if defined(__ELF__) && defined(__linux__)
+.section .note.GNU-stack,"",%progbits
+.previous
+#endif
+
 #ifdef L_mulsi3
 .text
.align 4
-- 
2.7.4



Re: [PATCH v4] aarch64: Add split-stack initial support

2017-03-06 Thread Adhemerval Zanella
Ping.

On 15/02/2017 21:23, Adhemerval Zanella wrote:
> This is an update patch from my previous version (v3) based mainly on
> glibc comments on exported loader symbol [1].  The changes from previous
> version are:
> 
>- Remove the __tcb_private_ss call on libgcc and emit a data usage on
>  the glibc symbol when split-stack is used.  This removes the runtime
>  errors when running on older glibc and instead makes the loader
>  fail with a missing symbol.
> 
>- Add a glibc version check to the split-stack support check.
> 
>- Some comment fixes in morestack.S.
> 
>- Remove some compile warnings on morestack-c.c.
> 
> --
> 
> This patch adds the split-stack support on aarch64 (PR #67877).  As for
> other ports this patch should be used along with glibc and gold support.
> 
> The support is done similarly to other architectures: a split-stack field
> is allocated before the TCB by glibc, a target-specific __morestack
> implementation and helper functions are added in libgcc, and compiler
> support is adjusted (split-stack prologue, va_start for argument
> handling).  I also plan to send the gold support to adjust stack
> allocation across split-stack and default code calls.
> 
> The current approach is to set the final stack adjustment using at most
> 2 instructions (mov/movk), which limits stack allocation to an upper
> limit of 4GB.  The morestack call is non-standard, with x10 holding the
> requested stack pointer, x11 the argument pointer (if required), and x12
> the return continuation address.  Unwinding is handled by a personality
> routine that knows how to find stack segments.
> 
> Split-stack prologue on function entry is as follows (this goes before the
> usual function prologue):
> 
> function:
>   mrs    x9, tpidr_el0
>   mov    x10, 
>   movk   x10, #0x0, lsl #16
>   sub    x10, sp, x10
>   mov    x11, sp  # if function has stacked arguments
>   adrp   x12, main_fn_entry
>   add    x12, x12, :lo12:.L2
>   cmp    x9, x10
>   b.lt   main_fn_entry
>   b      __morestack
> main_fn_entry:
>   [function prologue]
> 
> Notes:
> 
> 1. Even if a function does not allocate a stack frame, a split-stack prologue
>    is created.  This avoids issues with tail calls to external symbols,
>    which might require linker adjustment (libgo/runtime/go-varargs.c).
> 
> 2. Basic-block reordering (enabled with -O2) will move split-stack TCB ldr
>to after the required stack calculation.
> 
> 3. Similar to powerpc, when the linker detects a call from split-stack to
>non-split-stack code, it adds 16k (or more) to the value found in 
> "allocate"
>instructions (so non-split-stack code gets a larger stack).  The amount is
>tunable by a linker option.  The edit means aarch64 does not need to
>implement __morestack_non_split, necessary on x86 because insufficient
>space is available there to edit the stack comparison code.  This feature
>is only implemented in the GNU gold linker.
> 
> 4. AArch64 does not handle >4G stacks initially and although it is possible
>    to implement that, limiting to 4G allows materializing the allocation with
>    only 2 instructions (mov + movk) and thus simplifies the linker
>    adjustments required.  Supporting multiple threads each requiring more
>    than 4G of stack is probably not that important, and likely to OOM at
>    run time.
> 
> 5. The TCB support on GLIBC is meant to be included in version 2.26.
> 
> 6. The continuation address materialized on x12 is done using 'adrp'
>plus add and a static relocation.  Current code uses the
>aarch64_expand_mov_immediate function and since a better alternative
>    would be 'adr', it could be a future optimization (not implemented
>in this patch).
> 
> [1] https://sourceware.org/ml/libc-alpha/2017-02/msg00272.html
> 
> libgcc/ChangeLog:
> 
>   * libgcc/config.host: Use t-stack and t-stack-aarch64 for
>   aarch64*-*-linux.
>   * libgcc/config/aarch64/morestack-c.c: New file.
>   * libgcc/config/aarch64/morestack.S: Likewise.
>   * libgcc/config/aarch64/t-stack-aarch64: Likewise.
>   * libgcc/generic-morestack.c (__splitstack_find): Add aarch64-specific
>   code.
> 
> gcc/ChangeLog:
> 
>   * common/config/aarch64/aarch64-common.c
>   (aarch64_supports_split_stack): New function.
>   (TARGET_SUPPORTS_SPLIT_STACK): New macro.
>   * gcc/config/aarch64/aarch64-linux.h (TARGET_ASM_FILE_END): Remove
>   macro.
>   * gcc/config/aarch64/aarch64-protos.h: Add
>   aarch64_expand_split_stack_prologue and
>   aarch64_split_stack_space_check.
>   * gcc/con

[PATCH v4] aarch64: Add split-stack initial support

2017-02-15 Thread Adhemerval Zanella
This is an update patch from my previous version (v3) based mainly on
glibc comments on exported loader symbol [1].  The changes from previous
version are:

   - Remove the __tcb_private_ss call on libgcc and emit a data usage on
 the glibc symbol when split-stack is used.  This removes the runtime
 errors when running on older glibc and instead makes the loader
 fail with a missing symbol.

   - Add a glibc version check to the split-stack support check.

   - Some comment fixes in morestack.S.

   - Remove some compile warnings on morestack-c.c.

--

This patch adds the split-stack support on aarch64 (PR #67877).  As for
other ports this patch should be used along with glibc and gold support.

The support is done similarly to other architectures: a split-stack field
is allocated before the TCB by glibc, a target-specific __morestack
implementation and helper functions are added in libgcc, and compiler
support is adjusted (split-stack prologue, va_start for argument
handling).  I also plan to send the gold support to adjust stack
allocation across split-stack and default code calls.

The current approach is to set the final stack adjustment using at most
2 instructions (mov/movk), which limits stack allocation to an upper limit
of 4GB.  The morestack call is non-standard, with x10 holding the requested
stack pointer, x11 the argument pointer (if required), and x12 the return
continuation address.  Unwinding is handled by a personality routine that
knows how to find stack segments.

Split-stack prologue on function entry is as follows (this goes before the
usual function prologue):

function:
	mrs    x9, tpidr_el0
	mov    x10, 
	movk   x10, #0x0, lsl #16
	sub    x10, sp, x10
	mov    x11, sp  # if function has stacked arguments
	adrp   x12, main_fn_entry
	add    x12, x12, :lo12:.L2
	cmp    x9, x10
	b.lt   main_fn_entry
	b      __morestack
main_fn_entry:
	[function prologue]

Notes:

1. Even if a function does not allocate a stack frame, a split-stack prologue
   is created.  This avoids issues with tail calls to external symbols,
   which might require linker adjustment (libgo/runtime/go-varargs.c).

2. Basic-block reordering (enabled with -O2) will move split-stack TCB ldr
   to after the required stack calculation.

3. Similar to powerpc, when the linker detects a call from split-stack to
   non-split-stack code, it adds 16k (or more) to the value found in "allocate"
   instructions (so non-split-stack code gets a larger stack).  The amount is
   tunable by a linker option.  The edit means aarch64 does not need to
   implement __morestack_non_split, necessary on x86 because insufficient
   space is available there to edit the stack comparison code.  This feature
   is only implemented in the GNU gold linker.

4. AArch64 does not handle >4G stacks initially and although it is possible
   to implement that, limiting to 4G allows materializing the allocation with
   only 2 instructions (mov + movk) and thus simplifies the linker
   adjustments required.  Supporting multiple threads each requiring more
   than 4G of stack is probably not that important, and likely to OOM at
   run time.

5. The TCB support on GLIBC is meant to be included in version 2.26.

6. The continuation address materialized in x12 is built using 'adrp'
   plus add and a static relocation.  The current code uses the
   aarch64_expand_mov_immediate function; since a better alternative
   would be 'adr', it could be a future optimization (not implemented
   in this patch).

[1] https://sourceware.org/ml/libc-alpha/2017-02/msg00272.html

libgcc/ChangeLog:

* libgcc/config.host: Use t-stack and t-stack-aarch64 for
aarch64*-*-linux.
* libgcc/config/aarch64/morestack-c.c: New file.
* libgcc/config/aarch64/morestack.S: Likewise.
* libgcc/config/aarch64/t-stack-aarch64: Likewise.
* libgcc/generic-morestack.c (__splitstack_find): Add aarch64-specific
code.

gcc/ChangeLog:

* common/config/aarch64/aarch64-common.c
(aarch64_supports_split_stack): New function.
(TARGET_SUPPORTS_SPLIT_STACK): New macro.
* gcc/config/aarch64/aarch64-linux.h (TARGET_ASM_FILE_END): Remove
macro.
* gcc/config/aarch64/aarch64-protos.h: Add
aarch64_expand_split_stack_prologue and
aarch64_split_stack_space_check.
* gcc/config/aarch64/aarch64.c (aarch64_gen_far_branch): Add support
to emit 'b' instruction to rtx different than LABEL_REF.
(aarch64_expand_builtin_va_start): Use internal argument pointer
instead of virtual_incoming_args_rtx.
(morestack_ref): New symbol.
(aarch64_load_split_stack_value): New function.
(aarch64_expand_split_stack_prologue): Likewise.
(aarch64_internal_arg_pointer): Likewise.
(aarch64_split_stack_space_check): Likewise.
(aarch64_file_end): Emit the split-stack note sections.

[PATCH v3] aarch64: Add split-stack support

2017-02-13 Thread Adhemerval Zanella
Changes from previous version:

  - The split-stack area is now allocated before the thread pointer
(instead of in tcbhead_t, which is positioned after it) as described
in the glibc patch [1].  This has the advantage of not requiring linker
changes to take the new tcbhead_t size into consideration, and it
also fixed the regression I was seeing with the previous version of
this patch in runtime/pprof: __splitstack_setcontext updates some
internal TLS variables, and with split-stack trying to update
a larger tcbhead_t field it clobbered the TLS variable when
using a default static linker (which expected a default tcbhead_t
structure with size 16).

  - Function aarch64_split_stack_space_check is changed back to the previous
implementation because a wrong comment in my previous version led
to wrong assumptions in [2].  The function will generate a jump
to the 'label' variable (which will call __morestack_allocate_stack_space),
not the arch-specific __morestack implementation.  I corrected the
comment as well.

  - Change the stack adjustments to always use 2 instructions. [3]

I am not seeing any more issues when using split-stack, nor check-go
regressions, with this version.

[1] https://sourceware.org/ml/libc-alpha/2017-02/msg00247.html
[2] https://gcc.gnu.org/ml/gcc-patches/2017-01/msg00093.html
[3] https://gcc.gnu.org/ml/gcc-patches/2017-02/msg00055.html

--

This patch adds the split-stack support on aarch64 (PR #67877).  As for
other ports this patch should be used along with glibc and gold support.

The support is done similarly to other architectures: a __private_ss field is
added to the TCB in glibc, a target-specific __morestack implementation and
helper functions are added in libgcc, and compiler support is adjusted
(split-stack prologue, va_start for argument handling).  I also plan to
send the gold support to adjust stack allocation across split-stack
and default code calls.

The current approach is similar to the powerpc one: at most 2 GB of stack
allocation is supported, so stack adjustments can be done with 2 instructions
(either just a movn plus nop or a movn followed by movk).  The morestack call
is non-standard, with x10 holding the requested stack pointer, x11 the
argument pointer, and x12 the return continuation address.  Unwinding is
handled by a personality routine that knows how to find stack segments.

Split-stack prologue on function entry is as follows (this goes before the
usual function prologue):

function:
	mrs    x9, tpidr_el0
	mov    x10, -
	movk   x10, #0x0
	add    x10, sp, x10
	mov    x11, sp  # if function has stacked arguments
	adrp   x12, main_fn_entry
	add    x12, x12, :lo12:.L2
	cmp    x9, x10
	b.lt   main_fn_entry
	b      __morestack
main_fn_entry:
	[function prologue]

Notes:

1. Even if a function does not allocate a stack frame, a split-stack prologue
   is created.  This avoids issues with tail calls to external symbols,
   which might require linker adjustment (libgo/runtime/go-varargs.c).

2. Basic-block reordering (enabled with -O2) will move split-stack TCB ldr
   to after the required stack calculation.

3. Similar to powerpc, when the linker detects a call from split-stack to
   non-split-stack code, it adds 16k (or more) to the value found in "allocate"
   instructions (so non-split-stack code gets a larger stack).  The amount is
   tunable by a linker option.  The edit means aarch64 does not need to
   implement __morestack_non_split, necessary on x86 because insufficient
   space is available there to edit the stack comparison code.  This feature
   is only implemented in the GNU gold linker.

4. AArch64 does not handle >4G stacks initially and although it is possible
   to implement that, limiting to 4G allows materializing the allocation with
   only 2 instructions (mov + movk) and thus simplifies the linker
   adjustments required.  Supporting multiple threads each requiring more
   than 4G of stack is probably not that important, and likely to OOM at
   run time.

5. The TCB support on GLIBC is meant to be included in version 2.25.

6. The continuation address materialized in x12 is built using 'adrp'
   plus add and a static relocation.  The current code uses the
   aarch64_expand_mov_immediate function; since a better alternative
   would be 'adr', it could be a future optimization (not implemented
   in this patch).

libgcc/ChangeLog:

* libgcc/config.host: Use t-stack and t-stack-aarch64 for
aarch64*-*-linux.
* libgcc/config/aarch64/morestack-c.c: New file.
* libgcc/config/aarch64/morestack.S: Likewise.
* libgcc/config/aarch64/t-stack-aarch64: Likewise.
* libgcc/generic-morestack.c (__splitstack_find): Add aarch64-specific
code.

gcc/ChangeLog:

PR 67877
* common/config/aarch64/aarch64-common.c
(aarch64_supports_split_stack): New function.
(TARGET_SUPPORTS_SPLIT_STACK): New macro.
  

Re: [PATCH v2] aarch64: Add split-stack initial support

2017-01-31 Thread Adhemerval Zanella


On 25/01/2017 10:10, Jiong Wang wrote:
> On 24/01/17 18:05, Adhemerval Zanella wrote:
>>
>> On 03/01/2017 13:13, Wilco Dijkstra wrote:
>>
>>> +  /* If function uses stacked arguments save the old stack value so 
>>> morestack
>>> + can return it.  */
>>> +  reg11 = gen_rtx_REG (Pmode, R11_REGNUM);
>>> +  if (cfun->machine->frame.saved_regs_size
>>> +  || cfun->machine->frame.saved_varargs_size)
>>> +emit_move_insn (reg11, stack_pointer_rtx);
>>>
>>> This doesn't look right - we could have many arguments even without varargs 
>>> or
>>> saved regs.  This would need to check varargs as well as ctrl->args.size (I 
>>> believe
>>> that is the size of the arguments on the stack). It's fine to omit this 
>>> optimization
>>> in the first version - we already emit 2-3 extra instructions for the check 
>>> anyway.
>> I will check for a better solution.
> 
> Hi Adhemerval
> 
> My only concern with this patch is the initialization of R11 (internal arg
> pointer).  The current implementation looks to me to be generating wrong
> code for a testcase that simply returns the sum of ten int parameters: I
> see the function body using R11 while there is no initialization of it in
> the split prologue, so if the execution flow is *not* through __morestack,
> then R11 is not initialized.
> As Wilco suggested, I feel using crtl->args.size instead of
> cfun->machine->frame.saved_regs_size might be the correct approach after
> checking assign_parms in function.c.
> 

Hi Jiong,

Indeed, the previous version, which used 'saved_regs_size', is wrong for
stacked parameters.  A simple 10-argument function call shows, when reg11
is evaluated:

cfun->machine->frame.saved_regs_size= 0
cfun->machine->frame.saved_varargs_size = 0
crtl->args.size = 16

So indeed 'crtl->args.size' seems the correct field to use in this case
(it will trigger the correct reg11 setup for the split-stack case).  In
this version I also removed some unused variables left in the previous
patch.
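
The guard then becomes roughly the following (a sketch along the lines
discussed here, not necessarily the final committed hunk):

      /* Save the old stack pointer in x11 only when the function really
         has arguments passed on the stack: crtl->args.size covers stacked
         parameters and saved_varargs_size the va_arg save area.  */
      rtx reg11 = gen_rtx_REG (Pmode, R11_REGNUM);
      if (crtl->args.size > 0
          || cfun->machine->frame.saved_varargs_size)
        emit_move_insn (reg11, stack_pointer_rtx);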

From 30cbedc303d364dd94f0d35abee1d237a3671cdb Mon Sep 17 00:00:00 2001
From: Adhemerval Zanella <adhemerval.zane...@linaro.org>
Date: Wed, 4 May 2016 21:13:39 +
Subject: [PATCH] aarch64: Add split-stack initial support

This patch adds the split-stack support on aarch64 (PR #67877).  As for
other ports this patch should be used along with glibc and gold support.

The support is done similarly to other architectures: a __private_ss field is
added to the TCB in glibc, a target-specific __morestack implementation and
helper functions are added in libgcc, and compiler support is adjusted
(split-stack prologue, va_start for argument handling).  I also plan to
send the gold support to adjust stack allocation across split-stack
and default code calls.

The current approach is similar to the powerpc one: at most 2 GB of stack
allocation is supported, so stack adjustments can be done with 2 instructions
(either just a movn plus nop or a movn followed by movk).  The morestack call
is non-standard, with x10 holding the requested stack pointer, x11 the
argument pointer, and x12 the return continuation address.  Unwinding is
handled by a personality routine that knows how to find stack segments.

Split-stack prologue on function entry is as follows (this goes before the
usual function prologue):

function:
	mrs    x9, tpidr_el0
	mov    x10, -
	movk   x10, #0x0
	add    x10, sp, x10
	mov    x11, sp  # if function has stacked arguments
	adrp   x12, main_fn_entry
	add    x12, x12, :lo12:.L2
	cmp    x9, x10
	b.lt   main_fn_entry
	b      __morestack
main_fn_entry:
	[function prologue]

Notes:

1. Even if a function does not allocate a stack frame, a split-stack prologue
   is created.  This avoids issues with tail calls to external symbols,
   which might require linker adjustment (libgo/runtime/go-varargs.c).

2. Basic-block reordering (enabled with -O2) will move split-stack TCB ldr
   to after the required stack calculation.

3. Similar to powerpc, when the linker detects a call from split-stack to
   non-split-stack code, it adds 16k (or more) to the value found in "allocate"
   instructions (so non-split-stack code gets a larger stack).  The amount is
   tunable by a linker option.  The edit means aarch64 does not need to
   implement __morestack_non_split, necessary on x86 because insufficient
   space is available there to edit the stack comparison code.  This feature
   is only implemented in the GNU gold linker.

4. AArch64 does not handle >4G stacks initially and although it is possible
   to implement that, limiting to 4G allows materializing the allocation with
   only 2 instructions (mov + movk), thus simplifying the linker
   adjustments required.  Supporting multiple threads each

Re: [PATCH v2] aarch64: Add split-stack initial support

2017-01-24 Thread Adhemerval Zanella


On 03/01/2017 13:13, Wilco Dijkstra wrote:
> Adhemerval Zanella wrote:
>   
> Sorry for the late reply - but I think it's getting there. A few more 
> comments:

No worries.

> 
> +  /* If function uses stacked arguments save the old stack value so morestack
> + can return it.  */
> +  reg11 = gen_rtx_REG (Pmode, R11_REGNUM);
> +  if (cfun->machine->frame.saved_regs_size
> +  || cfun->machine->frame.saved_varargs_size)
> +emit_move_insn (reg11, stack_pointer_rtx);
> 
> This doesn't look right - we could have many arguments even without varargs or
> saved regs.  This would need to check varargs as well as ctrl->args.size (I 
> believe
> that is the size of the arguments on the stack). It's fine to omit this 
> optimization
> in the first version - we already emit 2-3 extra instructions for the check 
> anyway.

I will check for a better solution.

> 
> 
> +void
> +aarch64_split_stack_space_check (rtx size, rtx label)
> {
> +  rtx mem, ssvalue, cc, cmp, jump, temp;
> +  rtx requested = gen_reg_rtx (Pmode);
> +  /* Offset from thread pointer to __private_ss.  */
> +  int psso = 0x10;
> +
> +  /* Load __private_ss from TCB.  */
> +  ssvalue = gen_rtx_REG (Pmode, R9_REGNUM);
> 
> ssvalue doesn't need to be a hardcoded register.

Indeed, and it seems that this was not being triggered. I have fixed it in
this version.

> 
> +  emit_insn (gen_aarch64_load_tp_hard (ssvalue));
> +  mem = gen_rtx_MEM (Pmode, plus_constant (Pmode, ssvalue, psso));
> +  emit_move_insn (ssvalue, mem);
> +
> +  temp = gen_rtx_REG (Pmode, R10_REGNUM);
> +
> +  /* And compare it with frame pointer plus required stack.  */
> +  size = force_reg (Pmode, size);
> +  emit_move_insn (requested, gen_rtx_MINUS (Pmode, stack_pointer_rtx, size));
> +
> +  /* Jump to __morestack call if current __private_ss is not suffice.  */
> +  cc = aarch64_gen_compare_reg (LT, temp, ssvalue);
> 
> This uses X10, but where is it set???

I fixed it on this version.

> 
> +  cmp = gen_rtx_fmt_ee (GEU, VOIDmode, cc, const0_rtx);
> +  jump = emit_jump_insn (gen_condjump (cmp, cc, label));
> +  JUMP_LABEL (jump) = label;
> +}
> 
> So neither X10 nor X12 are set before potentially calling __morestack, so I 
> don't
> think it will work. Could this be causing the crash you mentioned?

I do not think so; the issue is with the runtime/pprof libgo test, which
fails with:

[Switching to LWP 18926]
0x0050c358 in runtime.sigtrampgo (sig=sig@entry=27, 
info=info@entry=0x7fb63d5da0, ctx=ctx@entry=0x7fb63d5e20)
at ../../../gcc-git/libgo/go/runtime/signal_unix.go:221
221 setg(g.m.gsignal)

Where g.m is null.  Trying to obtain a stacktrace, I am seeing:

(gdb) bt
#0  0x0050c358 in runtime.sigtrampgo (sig=sig@entry=27, 
info=info@entry=0x7fb63d5da0, ctx=ctx@entry=0x7fb63d5e20)
at ../../../gcc-git/libgo/go/runtime/signal_unix.go:221
#1  0x0056acb4 in runtime.sigtramp (sig=27, info=0x7fb63d5da0, 
context=0x7fb63d5e20) at ../../../gcc-git/libgo/runtime/go-signal.c:131
#2  
#3  pprof_test.cpuHog1 () at pprof_test.go:52
#4  0x0040c814 in pprof_test.cpuHogger (f=f@entry=0x57c560 
<runtime_pprof_test.cpuHog1$descriptor>, dur=) at 
pprof_test.go:37
#5  0x0040c9f8 in pprof_test.$nested1 (dur=) at 
pprof_test.go:75
#6  0x0040d038 in pprof_test.testCPUProfile (t=t@entry=0x420804e680, 
need=..., f=f@entry=0x57c600 <runtime_pprof_test.$nested1$descriptor>)
at pprof_test.go:144
#7  0x0040c9a8 in runtime_pprof_test.TestCPUProfile (t=0x420804e680) at 
pprof_test.go:74
#8  0x00543bec in testing.tRunner (param=, fn=) at ../../../gcc-git/libgo/go/testing/testing.go:656
#9  0x00543c84 in testing.$thunk24 (__go_thunk_parameter=) at ../../../gcc-git/libgo/go/testing/testing.go:693
#10 0x0041a7dc in kickoff () at 
../../../gcc-git/libgo/runtime/proc.c:258
/build/buildd/gdb-7.9/gdb/dwarf2-frame.c:1732: internal-error: add_cie: 
Assertion `n < 1 || cie_table->entries[n - 1]->cie_pointer < cie->cie_pointer' 
failed.
A problem internal to GDB has been detected,
further debugging may prove unreliable.
Quit this debugging session? (y or n)

Which may mean that the morestack.S unwind info is not really correct.  It
could also be a gdb issue (I will check with a newer gdb).


> 
> Wilco
> 
From 09b51fb706a8b15ecaea4ec4b8a80d0a7903053d Mon Sep 17 00:00:00 2001
From: Adhemerval Zanella <adhemerval.zane...@linaro.org>
Date: Wed, 4 May 2016 21:13:39 +
Subject: [PATCH] aarch64: Add split-stack initial support

This patch adds the split-stack support on aarch64 (PR #67877).  As for
other ports this patch should be used along with glibc and gold support.

The support is done similarly to other architectures: a __private_ss field is
added to the TCB in glibc,

Re: [PATCH v2] aarch64: Add split-stack initial support

2016-11-25 Thread Adhemerval Zanella


On 15/11/2016 16:38, Wilco Dijkstra wrote:
> 
> On 07/11/2016 16:59, Adhemerval Zanella wrote:
>> On 14/10/2016 15:59, Wilco Dijkstra wrote:
> 
>> There is no limit afaik on gold's split-stack allocation handling,
>> and I think one could be added for each backend (in the method
>> override required to implement it).
>>
>> In fact it is not really required to tie the nop generation to the
>> instruction generated by 'aarch64_internal_mov_immediate'; it is
>> just a matter of simplifying the linker code.
> 
> If there is no easy limit and you'll still require a nop, I think it is best 
> then
> to emit mov N+movk #0. Then the scheduler won't be able to reorder
> them with the add/sub.

Good call, I have changed the patch to emit a mov N + movk #0 instead
of relying on emit_nop.

> 
>>> Is there any need to detect underflow of x10 or is there a guarantee that 
>>> stacks are
>>> never allocated in the low 2GB (given the maximum adjustment is 2GB)? It's 
>>> safe
>>> to do a signed comparison.
>>
I do not think so; at least none of the current backends that implement
split stack do so.
> 
> OK, well a signed comparison like in your new version works for underflow.
> 
> Now to the patch:
> 
> 
> @@ -3316,6 +3339,28 @@ aarch64_expand_prologue (void)
>aarch64_save_callee_saves (DFmode, callee_offset, V0_REGNUM, V31_REGNUM,
>callee_adjust != 0 || frame_pointer_needed);
>aarch64_sub_sp (IP1_REGNUM, final_adjust, !frame_pointer_needed);
> +
> +  if (split_stack_arg_pointer_used_p ())
> +{
> +  /* Setup the argument pointer (x10) for -fsplit-stack code.  If
> +  __morestack was called, it will left the arg pointer to the
> +  old stack in x28.  Otherwise, the argument pointer is the top
> +  of current frame.  */
> +  rtx x11 = gen_rtx_REG (Pmode, R11_REGNUM);
> +  rtx x28 = gen_rtx_REG (Pmode, R28_REGNUM);
> +  rtx cc_reg = gen_rtx_REG (CCmode, CC_REGNUM);
> +
> +  rtx not_more = gen_label_rtx ();
> +
> +  rtx cmp = gen_rtx_fmt_ee (LT, VOIDmode, cc_reg, const0_rtx);
> +  rtx jump = emit_jump_insn (gen_condjump (cmp, cc_reg, not_more));
> +  JUMP_LABEL (jump) = not_more;
> +  LABEL_NUSES (not_more) += 1;
> +
> +  emit_move_insn (x11, x28);
> +
> +  emit_label (not_more);
> +}
> 
> If you pass the old sp in x11 when called from __morestack you can remove
> the above thunk completely.

Indeed this snippet does not make sense anymore, I removed it.

> 
> +  /* It limits total maximum stack allocation on 2G so its value can be
> + materialized using two instructions at most (movn/movk).  It might be
> + used by the linker to add some extra space for split calling non split
> + stack functions.  */
> +  allocate = cfun->machine->frame.frame_size;
> +  if (allocate > ((HOST_WIDE_INT) 1 << 31))
> +{
> +  sorry ("Stack frame larger than 2G is not supported for 
> -fsplit-stack");
> +  return;
> +}
> 
> Note a 2-instruction mov/movk can generate any immediate up to 4GB and if
> we need even larger sizes, we could round up to a multiple of 64KB so that 2
> instructions are enough for a 48-bit stack size...

I think we can set a limit of 4GB (powerpc64 backend limits to 2GB and
it seems fine).  I corrected the comment.

> 
> +  int ninsn = aarch64_internal_mov_immediate (reg10, GEN_INT (-allocate),
> +   true, Pmode);
> +  gcc_assert (ninsn == 1 || ninsn == 2);
> +  if (ninsn == 1)
> +emit_insn (gen_nop ());
> 
> To avoid any issues with the nop being scheduled, it's best to emit an 
> explicit movk
> here (0x if allocate > 0, or 0 if zero) using gen_insv_immdi.

Right, I changed to that.
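
The resulting shape is roughly as follows (a sketch assuming the
gen_insv_immdi interface Wilco mentions, not the exact committed code):

      /* Materialize -allocate in x10.  If a single mov suffices, pad
         with an explicit movk of the high half (0xffff for a negative
         value) so the pair stays adjacent and the linker can always
         edit exactly two instructions.  */
      int ninsn = aarch64_internal_mov_immediate (reg10, GEN_INT (-allocate),
                                                  true, Pmode);
      gcc_assert (ninsn == 1 || ninsn == 2);
      if (ninsn == 1)
        emit_insn (gen_insv_immdi (reg10, GEN_INT (16),
                                   GEN_INT (allocate > 0 ? 0xffff : 0)));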

> 
> +void
> +aarch64_split_stack_space_check (rtx size, rtx label)
> 
> Isn't very similar code used in aarch64_expand_split_stack_prologue? Any 
> possibility
> to share/reuse?

I thought about it, but it would require splitting it into two subparts
(one to load __private_ss from the TCB and another to jump to __morestack)
and both are basically 4 lines.  In the end I think the current approach
is simpler.

> 
> +static void
> +aarch64_live_on_entry (bitmap regs)
> +{
> +  if (flag_split_stack)
> +bitmap_set_bit (regs, R11_REGNUM);
> +}
> 
> I'm wondering whether you need extra code in aarch64_can_eliminate to deal
> with the argument pointer? Also do we need to define a fixed register, or 
> will GCC
> automatically allocate it to a callee-save if necessary?

Now that you ask, I think we can get rid of this live marking.
I used the powerpc backend as the base for the initial patch

Re: [PATCH v2] aarch64: Add split-stack initial support

2016-11-14 Thread Adhemerval Zanella
Ping.

On 07/11/2016 16:59, Adhemerval Zanella wrote:
> 
> 
> On 14/10/2016 15:59, Wilco Dijkstra wrote:
>> Hi,
>>
> 
> Thanks for the thoughtful review and sorry for the late response.
> 
>>> Split-stack prologue on function entry is as follows (this goes before the
>>> usual function prologue):
>>
>>>  mrs    x9, tpidr_el0
>>>  mov    x10, -
>>
>> As Jiong already remarked, the nop won't work. Do we know the maximum 
>> adjustment
>> that the linker is allowed to make? If so, and we can limit the adjustment 
>> to 16MB in
>> most cases, emitting 2 subtracts is best. Larger offset need mov/movk/sub 
>> but that
>> should be extremely rare.
> 
> There is no limit afaik on gold split stack allocation handling,
> and I think one could be added for each backend (in the method
> override required to implement it).
> 
> In fact it is not really required to tie the nop generation with the
> instruction generated by 'aarch64_internal_mov_immediate'; it is
> just a matter of simplifying the linker code.
> 
> And although more than 16MB should be rare, the nilptr2.go test allocates
> 134217824 bytes, so it fails with this low stack limit.  I am not sure how
> heavy stack usage is in 'go', but I think we should at least support the
> current testcase scenario.  So for the current iteration I kept my
> approach, but I am open to suggestions.
> 
> 
>>
>>> nop/movk
>>
>>>  add    x10, sp, x10
>>>  ldr    x9, [x9, 16]
>>
>> Is there any need to detect underflow of x10 or is there a guarantee that 
>> stacks are
>> never allocated in the low 2GB (given the maximum adjustment is 2GB)? It's 
>> safe
>> to do a signed comparison.
> 
> I do not think so; at least none of the current backends that implement
> split stack do so.
> 
>>
>>>  cmp    x10, x9
>>>  b.cs   enough
>>
>> Why save/restore x30 and the call x30+8 trick when we could pass the
>> continuation address and use a tailcall? That also avoids emitting extra 
>> unwind info.
>>
>>>  stp    x30, [sp, -16]
>>>  bl     __morestack
>>>  ldp    x30, [sp], 16
>>>  ret
>>
>> This part doesn't make any sense - both x28 and carry flag as an input, and 
>> spread
>> across the prolog - why???
>>
>>> enough:
>>> mov x10, sp
>>  [prolog]
>>>  b.cs   continue
>>> mov x10, x28
>> continue:
>>  [rest of function]
>>
>> Why not do this?
>>
>> function:
>>  mrs    x9, tpidr_el0
>>  sub    x10, sp, N & 0xfff000
>>  sub    x10, x10, N & 0xfff
>>  ldr    x9, [x9, 16]
>>  adr    x12, main_fn_entry
>>  mov    x11, sp   [if function has stacked arguments]
>>  cmp    x10, x9
>>  b.ge   main_fn_entry
>>  b      __morestack
>> main_fn_entry: [x11 is argument pointer]
>>  [prolog]
>>  [rest of function]
>>
>> In __morestack you need to save x8 as well (another argument register!) and 
>> x12 (the 
>> continuation address). After returning from the call x8 doesn't need to be 
>> preserved.
> 
> Indeed this strategy is way better and I adjusted the code to follow it.
> The only change is that I am using:
> 
>   [...]
>   cmp x9, x10
>   b.lt    main_fn_entry
>   b       __morestack
>   [...]
> 
> So I can issue a 'cmp , 0' on __morestack to indicate
> the function was called.
> 
>>
>> There are several issues with unwinding in __morestack. x28 is not described 
>> as a callee-save
>> so will be corrupted if unwinding across a __morestack call. This won't 
>> unwind correctly after
>> the ldp as the unwinder will use the restored frame pointer to try to 
>> restore x29/x30:
>>
>> +ldp x29, x30, [x28, STACKFRAME_BASE]
>> +ldr x28, [x28, STACKFRAME_BASE + 80]
>> +
>> +.cfi_remember_state
>> +.cfi_restore 30
>> +.cfi_restore 29
>> +.cfi_def_cfa 31, 0
> 
> Indeed, it misses x28 save/restore. I think I have added the missing bits, 
> but I
> must confess that I am not well versed in CFI directives.  I will appreciate 
> if 
> you could help me on this new version.
> 
>>
>> This stores a random x30 value on the stack, what is the purpose of this? 
>> Nothing can unwind
>> to here:
>>
>> +# Start using new stack
>> +stp x29, x30, [x0, -16]!
>> +mov sp, x0
>>
>> Also we no longer need split_stack_arg_pointer_used_p () or any code that 
>> uses it (functions
>> that don't have any arguments passed on the stack could omit the mov x11, 
>> sp).
> 
> Right, with the new strategy you proposed to do a branch this is indeed not
> really required.  I removed it in this new patch.
> 
>>
>> Wilco
>>


Re: [PATCH v2] aarch64: Add split-stack initial support

2016-11-07 Thread Adhemerval Zanella


On 14/10/2016 15:59, Wilco Dijkstra wrote:
> Hi,
> 

Thanks for the thoughtful review and sorry for the late response.

>> Split-stack prologue on function entry is as follows (this goes before the
>> usual function prologue):
> 
>>  mrs    x9, tpidr_el0
>>  mov    x10, -
> 
> As Jiong already remarked, the nop won't work. Do we know the maximum 
> adjustment
> that the linker is allowed to make? If so, and we can limit the adjustment to 
> 16MB in
> most cases, emitting 2 subtracts is best. Larger offset need mov/movk/sub but 
> that
> should be extremely rare.

There is no limit afaik on gold's split-stack allocation handling,
and I think one could be added for each backend (in the method
override required to implement it).

In fact it is not really required to tie the nop generation to the
instruction generated by 'aarch64_internal_mov_immediate'; it is
just a matter of simplifying the linker code.

And although more than 16MB should be rare, the nilptr2.go test allocates
134217824 bytes, so it fails with this low stack limit.  I am not sure how
heavy stack usage is in 'go', but I think we should at least support the
current testcase scenario.  So for the current iteration I kept my
approach, but I am open to suggestions.


> 
>>  nop/movk
> 
>>  add    x10, sp, x10
>>  ldr    x9, [x9, 16]
> 
> Is there any need to detect underflow of x10 or is there a guarantee that 
> stacks are
> never allocated in the low 2GB (given the maximum adjustment is 2GB)? It's 
> safe
> to do a signed comparison.

I do not think so; at least none of the current backends that implement
split stack do so.

> 
>>  cmp    x10, x9
>>  b.cs   enough
> 
> Why save/restore x30 and the call x30+8 trick when we could pass the
> continuation address and use a tailcall? That also avoids emitting extra 
> unwind info.
> 
>>  stp    x30, [sp, -16]
>>  bl     __morestack
>>  ldp    x30, [sp], 16
>>  ret
> 
> This part doesn't make any sense - both x28 and carry flag as an input, and 
> spread
> across the prolog - why???
> 
>> enough:
>>  mov x10, sp
>   [prolog]
>>  b.cs   continue
>>  mov x10, x28
> continue:
>   [rest of function]
> 
> Why not do this?
> 
> function:
>   mrs    x9, tpidr_el0
>   sub    x10, sp, N & 0xfff000
>   sub    x10, x10, N & 0xfff
>   ldr    x9, [x9, 16]
>   adr    x12, main_fn_entry
>   mov    x11, sp   [if function has stacked arguments]
>   cmp    x10, x9
>   b.ge   main_fn_entry
>   b      __morestack
> main_fn_entry: [x11 is argument pointer]
>   [prolog]
>   [rest of function]
> 
> In __morestack you need to save x8 as well (another argument register!) and 
> x12 (the 
> continuation address). After returning from the call x8 doesn't need to be 
> preserved.

Indeed this strategy is way better and I adjusted the code to follow it.
The only change is that I am using:

[...]
cmp x9, x10
	b.lt    main_fn_entry
	b       __morestack
[...]

So I can issue a 'cmp , 0' on __morestack to indicate
the function was called.

> 
> There are several issues with unwinding in __morestack. x28 is not described 
> as a callee-save
> so will be corrupted if unwinding across a __morestack call. This won't 
> unwind correctly after
> the ldp as the unwinder will use the restored frame pointer to try to restore 
> x29/x30:
> 
> + ldp x29, x30, [x28, STACKFRAME_BASE]
> + ldr x28, [x28, STACKFRAME_BASE + 80]
> +
> + .cfi_remember_state
> + .cfi_restore 30
> + .cfi_restore 29
> + .cfi_def_cfa 31, 0

Indeed, it misses the x28 save/restore.  I think I have added the missing
bits, but I must confess that I am not well versed in CFI directives.  I
would appreciate it if you could help me with them in this new version.

> 
> This stores a random x30 value on the stack, what is the purpose of this? 
> Nothing can unwind
> to here:
> 
> + # Start using new stack
> + stp x29, x30, [x0, -16]!
> + mov sp, x0
> 
> Also we no longer need split_stack_arg_pointer_used_p () or any code that 
> uses it (functions
> that don't have any arguments passed on the stack could omit the mov x11, sp).

Right, with the new strategy you proposed to do a branch this is indeed not
really required.  I removed it in this new patch.

> 
> Wilco
> 
From dd2927aa5deb8d609c748014f3b566962fb852c5 Mon Sep 17 00:00:00 2001
From: Adhemerval Zanella <adhemerval.zane...@linaro.org>
Date: Wed, 4 May 2016 21:13:39 +
Subject: [PATCH 2/2] aarch64: Add split-stack initial support

This patch adds the split-stack support on aarch64 (PR #67877).  A

[PATCH v3 2/2] aarch64: Add split-stack initial support

2016-10-17 Thread Adhemerval Zanella
From: Adhemerval Zanella <adhemerval.zane...@linaro.org>

Changes from previous version:

 * Add missing new function comments;
 * Using gen_rtx_fmt_ee instead of gen_rtx_IF_THEN_ELSE to create
   conditional jumps;
 * Some typos and whitespace issues.

--

This patch adds the split-stack support on aarch64 (PR #67877).  As for
other ports this patch should be used along with glibc and gold support.

The support is done similarly to other architectures: a __private_ss field is
added to the TCB in glibc, a target-specific __morestack implementation and
helper functions are added in libgcc, and compiler support is adjusted
(split-stack prologue, va_start for argument handling).  I also plan to
send the gold support to adjust stack allocation across split-stack
and default code calls.

The current approach is similar to the powerpc one: at most 2 GB of stack
allocation is supported, so stack adjustments can be done with 2 instructions
(either just a movn plus nop or a movn followed by movk).  The morestack
call is non-standard, with x10 holding the requested stack pointer and x11
the required stack size to be copied.  Also, function arguments on the old
stack are accessed via a value relative to the stack pointer, so x10 is
used to hold the old stack value.  Unwinding is handled by a personality
routine that knows how to find stack segments.

Split-stack prologue on function entry is as follows (this goes before the
usual function prologue):

	mrs    x9, tpidr_el0
	mov    x10, -
	nop/movk
	add    x10, sp, x10
	ldr    x9, [x9, 16]
	cmp    x10, x9
	b.cs   enough
	stp    x30, [sp, -16]
	mov    x11, 
	bl     __morestack
	ldp    x30, [sp], 16
	ret
enough:
	mov    x10, sp
	[...]
	b.cs   function
	mov    x10, x28
function:

Notes:

1. Even if a function does not allocate a stack frame, a split-stack prologue
   is created.  This avoids issues with tail calls to external symbols,
   which might require linker adjustment (libgo/runtime/go-varargs.c).

2. Basic-block reordering (enabled with -O2) will move split-stack TCB ldr
   to after the required stack calculation.

3. Similar to powerpc, when the linker detects a call from split-stack to
   non-split-stack code, it adds 16k (or more) to the value found in "allocate"
   instructions (so non-split-stack code gets a larger stack).  The amount is
   tunable by a linker option.  The edit means aarch64 does not need to
   implement __morestack_non_split, necessary on x86 because insufficient
   space is available there to edit the stack comparison code.  This feature
   is only implemented in the GNU gold linker.

4. AArch64 does not handle >2G stack initially and although it is possible
   to implement it, limiting to 2G allows materializing the allocation with
   only 2 instructions (mov + movk) and thus simplifying the linker
   adjustments required.  Supporting multiple threads each requiring more
   than 2G of stack is probably not that important, and likely to OOM at
   run time.

5. The TCB support on GLIBC is meant to be included in version 2.25.

libgcc/ChangeLog:

* libgcc/config.host: Use t-stack and t-stack-aarch64 for
aarch64*-*-linux.
* libgcc/config/aarch64/morestack-c.c: New file.
* libgcc/config/aarch64/morestack.S: Likewise.
* libgcc/config/aarch64/t-stack-aarch64: Likewise.
* libgcc/generic-morestack.c (__splitstack_find): Add aarch64-specific
code.

gcc/ChangeLog:

* common/config/aarch64/aarch64-common.c
(aarch64_supports_split_stack): New function.
(TARGET_SUPPORTS_SPLIT_STACK): New macro.
* gcc/config/aarch64/aarch64-linux.h (TARGET_ASM_FILE_END): Remove
macro.
* gcc/config/aarch64/aarch64-protos.h: Add
aarch64_expand_split_stack_prologue and
aarch64_split_stack_space_check.
* gcc/config/aarch64/aarch64.c (split_stack_arg_pointer_used_p): New
function.
(aarch64_expand_prologue): Setup the argument pointer (x10) for
split-stack.
(aarch64_expand_builtin_va_start): Use internal argument pointer
instead of virtual_incoming_args_rtx.
(morestack_ref): New symbol.
(aarch64_expand_split_stack_prologue): New function.
(aarch64_file_end): Emit the split-stack note sections.
(aarch64_internal_arg_pointer): Likewise.
(aarch64_live_on_entry): Set the argument pointer for split-stack.
(aarch64_split_stack_space_check): Likewise.
(TARGET_ASM_FILE_END): New macro.
(TARGET_EXTRA_LIVE_ON_ENTRY): Likewise.
(TARGET_INTERNAL_ARG_POINTER): Likewise.
* gcc/config/aarch64/aarch64.h (aarch64_frame): Add
split_stack_arg_pointer to setup the argument pointer when using
split-stack.
* gcc/config/aarch64/aarch64.md (UNSPEC_STACK_CHECK): New unspec.
(UNSPECV_SPLIT_STAC

[PATCH 1/2] Fix GCC split-stack variadic argument tests

2016-10-17 Thread Adhemerval Zanella
This patch adds the expected va_end after va_start in the split-stack tests.

gcc/ChangeLog:

* gcc/testsuite/gcc.dg/split-3.c (down): Call va_end after va_start.
* gcc/testsuite/gcc.dg/split-6.c (down): Likewise.
---
 gcc/ChangeLog  | 5 +
 gcc/testsuite/gcc.dg/split-3.c | 1 +
 gcc/testsuite/gcc.dg/split-6.c | 1 +
 3 files changed, 7 insertions(+)

diff --git a/gcc/testsuite/gcc.dg/split-3.c b/gcc/testsuite/gcc.dg/split-3.c
index 64bbb8c..5ba7616 100644
--- a/gcc/testsuite/gcc.dg/split-3.c
+++ b/gcc/testsuite/gcc.dg/split-3.c
@@ -40,6 +40,7 @@ down (int i, ...)
   || va_arg (ap, int) != 9
   || va_arg (ap, int) != 10)
 abort ();
+  va_end (ap);
 
   if (i > 0)
 {
diff --git a/gcc/testsuite/gcc.dg/split-6.c b/gcc/testsuite/gcc.dg/split-6.c
index b32cf8d..b3016ba 100644
--- a/gcc/testsuite/gcc.dg/split-6.c
+++ b/gcc/testsuite/gcc.dg/split-6.c
@@ -37,6 +37,7 @@ down (int i, ...)
   || va_arg (ap, int) != 9
   || va_arg (ap, int) != 10)
 abort ();
+  va_end (ap);
 
   if (i > 0)
 {
-- 
2.7.4
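
For reference, the pairing the fix restores is the standard one (a
minimal example, not taken from the testsuite):

    #include <stdarg.h>

    /* Every va_start must be matched by a va_end before the function
       returns; the split-stack tests were missing the va_end.  */
    static int
    sum (int n, ...)
    {
      va_list ap;
      int total = 0;
      va_start (ap, n);
      while (n-- > 0)
        total += va_arg (ap, int);
      va_end (ap);
      return total;
    }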



Re: [PATCH v2] aarch64: Add split-stack initial support

2016-10-11 Thread Adhemerval Zanella


On 07/10/2016 05:28, Kyrill Tkachov wrote:
> Hi Adhemerval,
> 
> CC'ing the aarch64 maintainers/reviewers.
> I have some comments inline, mostly about the GCC coding conventions.

Thanks for the review.

>> diff --git a/gcc/config/aarch64/aarch64.c b/gcc/config/aarch64/aarch64.c
>> index df6514d..c689440 100644
>> --- a/gcc/config/aarch64/aarch64.c
>> +++ b/gcc/config/aarch64/aarch64.c
>> @@ -3149,6 +3149,49 @@ aarch64_restore_callee_saves (machine_mode mode,
>>   }
>>   }
>>   +static bool
>> +split_stack_arg_pointer_used_p (void)
>> +{
> 
> New function needs a function comment.
> 

Ack, I will add one.

>> +  bool using_split_stack = (flag_split_stack
>> +&& (lookup_attribute ("no_split_stack",
>> +  DECL_ATTRIBUTES (cfun->decl))
>> +   == NULL));
>> +  if (using_split_stack == false)
>> +return false;
>> +
>> +  /* If the pseudo holding the arg pointer is no longer a pseudo,
>> + then the arg pointer is used.  */
>> +  if (cfun->machine->frame.split_stack_arg_pointer != NULL_RTX
>> +  && (!REG_P (cfun->machine->frame.split_stack_arg_pointer)
>> +  || (REGNO (cfun->machine->frame.split_stack_arg_pointer)
>> +  < FIRST_PSEUDO_REGISTER)))
>> +return true;
>> +
>> +  /* Unfortunately we also need to do some code scanning, since
>> + x10 may have been substituted for the pseudo.  */
>> +  rtx_insn *insn;
>> +  basic_block bb = ENTRY_BLOCK_PTR_FOR_FN (cfun)->next_bb;
>> +  FOR_BB_INSNS (bb, insn)
>> +if (NONDEBUG_INSN_P (insn))
>> +  {
>> +df_ref use;
>> +FOR_EACH_INSN_USE (use, insn)
>> +  {
>> +rtx x = DF_REF_REG (use);
>> +if (REG_P (x) && REGNO (x) == R10_REGNUM)
>> +  return true;
>> +  }
>> +df_ref def;
>> +FOR_EACH_INSN_DEF (def, insn)
>> +  {
>> +rtx x = DF_REF_REG (def);
>> +if (REG_P (x) && REGNO (x) == R10_REGNUM)
>> +  return false;
>> +  }
>> +  }
>> +  return bitmap_bit_p (DF_LR_OUT (bb), R10_REGNUM);
>> +}
>> +
>>   /* AArch64 stack frames generated by this compiler look like:
>> +---+
>> @@ -3204,6 +3247,7 @@ aarch64_expand_prologue (void)
>> unsigned reg1 = cfun->machine->frame.wb_candidate1;
>> unsigned reg2 = cfun->machine->frame.wb_candidate2;
>> rtx_insn *insn;
>> +  bool split_stack_arg_pointer_used = split_stack_arg_pointer_used_p ();
>>   if (flag_stack_usage_info)
>>   current_function_static_stack_size = frame_size;
>> @@ -3220,6 +3264,10 @@ aarch64_expand_prologue (void)
>>   aarch64_emit_probe_stack_range (STACK_CHECK_PROTECT, frame_size);
>>   }
>>   +  /* Save split-stack argument pointer before stack adjustment.  */
>> +  if (split_stack_arg_pointer_used)
>> +emit_move_insn (gen_rtx_REG (Pmode, R10_REGNUM), stack_pointer_rtx);
>> +
>> aarch64_add_constant (Pmode, SP_REGNUM, IP0_REGNUM, -initial_adjust, 
>> true);
>>   if (callee_adjust != 0)
>> @@ -3243,6 +3291,30 @@ aarch64_expand_prologue (void)
>>callee_adjust != 0 || frame_pointer_needed);
>> aarch64_add_constant (Pmode, SP_REGNUM, IP1_REGNUM, -final_adjust,
>>   !frame_pointer_needed);
>> +
>> +  if (split_stack_arg_pointer_used_p ())
>> +{
>> +  /* Setup the argument pointer (x10) for -fsplit-stack code.  If
>> + __morestack was called, it will leave the arg pointer to the
>> + old stack in x28.  Otherwise, the argument pointer is the top
>> + of current frame.  */
>> +  rtx x10 = gen_rtx_REG (Pmode, R10_REGNUM);
>> +  rtx x28 = gen_rtx_REG (Pmode, R28_REGNUM);
>> +  rtx not_more = gen_label_rtx ();
>> +  rtx cc_reg = gen_rtx_REG (CCmode, CC_REGNUM);
>> +  rtx jump;
>> +
>> +  jump = gen_rtx_IF_THEN_ELSE (VOIDmode,
>> +   gen_rtx_GEU (VOIDmode, cc_reg,
>> +const0_rtx),
>> +   gen_rtx_LABEL_REF (VOIDmode, not_more),
>> +   pc_rtx);
>> +  jump = emit_jump_insn (gen_rtx_SET (pc_rtx, jump));
> 
> Are you trying to emit a conditional jump here?
> If so it would be cleaner to use the pattern name you have in mind
> directly like so (barring typos ;):
> rtx cmp = gen_rtx_fmt_ee (GEU, VOIDmode, cc_reg, const0_rtx);
> jump = emit_jump_insn (gen_condjump (cmp, cc_reg, not_more));
> This will avoid constructing the SET and IF_THEN_ELSE rtxes manually.

Yes, the idea is to emit a conditional jump.  I have no preference here;
I used the IF_THEN_ELSE construction mainly because the other arches use
it, but I think your suggestion is simpler.
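
Applied to the hunk above, the suggested shape is roughly (a sketch based
on Kyrill's snippet, not the committed code):

      rtx x10 = gen_rtx_REG (Pmode, R10_REGNUM);
      rtx x28 = gen_rtx_REG (Pmode, R28_REGNUM);
      rtx cc_reg = gen_rtx_REG (CCmode, CC_REGNUM);
      rtx not_more = gen_label_rtx ();

      /* Conditional jump over the x10 = x28 copy when __morestack was
         not called, without building the SET/IF_THEN_ELSE rtxes by hand.  */
      rtx cmp = gen_rtx_fmt_ee (GEU, VOIDmode, cc_reg, const0_rtx);
      rtx jump = emit_jump_insn (gen_condjump (cmp, cc_reg, not_more));
      JUMP_LABEL (jump) = not_more;
      LABEL_NUSES (not_more) += 1;
      emit_move_insn (x10, x28);
      emit_label (not_more);
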

> 
>> +  JUMP_LABEL (jump) = not_more;
>> +  LABEL_NUSES (not_more) += 1;
>> +  emit_move_insn (x10, x28);
>> +  emit_label (not_more);
>> +}
>>   }
>> /* Return TRUE if we can use a simple_return insn.
>> @@ -9677,7 +9749,7 @@ aarch64_expand_builtin_va_start (tree valist, rtx 

[PATCH v2] aarch64: Add split-stack initial support

2016-10-06 Thread Adhemerval Zanella
From: Adhemerval Zanella <adhemerval.zane...@linaro.org>

Changes from previous version:

 - Rewrite how the variadic argument is set up: instead of using the
   hard_fp_offset value to set up x10, save an fp value before the stack
   allocation.  This allows linker/gold to not require scanning and
   changing code in case of a split to non-split call (only the split
   prologue will require adjustments).

--

This patch adds the split-stack support on aarch64 (PR #67877).  As for
other ports this patch should be used along with glibc and gold support.

The support is done similarly to other architectures: a __private_ss field is
added to the TCB in glibc, a target-specific __morestack implementation and
helper functions are added in libgcc, and compiler support is adjusted
(split-stack prologue, va_start for argument handling).  I also plan to
send the gold support to adjust stack allocation across split-stack
and default code calls.

The current approach is similar to the powerpc one: at most 2 GB of stack
allocation is supported, so stack adjustments can be done with 2 instructions
(either just a movn plus nop or a movn followed by movk).  The morestack
call is non-standard, with x10 holding the requested stack pointer and x11
the required stack size to be copied.  Also, function arguments on the old
stack are accessed via a value relative to the stack pointer, so x10 is
used to hold the old stack value.  Unwinding is handled by a personality
routine that knows how to find stack segments.

Split-stack prologue on function entry is as follows (this goes before the
usual function prologue):

	mrs    x9, tpidr_el0
	mov    x10, -
	nop/movk
	add    x10, sp, x10
	ldr    x9, [x9, 16]
	cmp    x10, x9
	b.cs   enough
	stp    x30, [sp, -16]
	bl     __morestack
	ldp    x30, [sp], 16
	ret
enough:
	mov    x10, sp
	[...]
	b.cs   function
	mov    x10, x28
function:


I checked with a bootstrap build on aarch64 and saw no regression.  The go
tests pass with the exception of cplx2.go and runtime/pprof/check (both
unrelated to the patch afaiu).

Notes:

1. Even if a function does not allocate a stack frame, a split-stack prologue
   is created.  This avoids issues with tail calls to external symbols,
   which might require linker adjustment (libgo/runtime/go-varargs.c).

2. Basic-block reordering (enabled with -O2) will move split-stack TCB ldr
   to after the required stack calculation.

3. Similar to powerpc, when the linker detects a call from split-stack to
   non-split-stack code, it adds 16k (or more) to the value found in "allocate"
   instructions (so non-split-stack code gets a larger stack).  The amount is
   tunable by a linker option.  The edit means aarch64 does not need to
   implement __morestack_non_split, necessary on x86 because insufficient
   space is available there to edit the stack comparison code.  This feature
   is only implemented in the GNU gold linker.

4. AArch64 does not handle >2G stack initially and although it is possible
   to implement it, limiting to 2G allows materializing the allocation with
   only 2 instructions (mov + movk) and thus simplifying the linker
   adjustments required.  Supporting multiple threads each requiring more
   than 2G of stack is probably not that important, and likely to OOM at
   run time.

5. The TCB support on GLIBC is meant to be included in version 2.25.

libgcc/ChangeLog:

* libgcc/config.host: Use t-stack and t-stack-aarch64 for
aarch64*-*-linux.
* libgcc/config/aarch64/morestack-c.c: New file.
* libgcc/config/aarch64/morestack.S: Likewise.
* libgcc/config/aarch64/t-stack-aarch64: Likewise.
* libgcc/generic-morestack.c (__splitstack_find): Add aarch64-specific
code.

gcc/ChangeLog:

* common/config/aarch64/aarch64-common.c
(aarch64_supports_split_stack): New function.
(TARGET_SUPPORTS_SPLIT_STACK): New macro.
* gcc/config/aarch64/aarch64-linux.h (TARGET_ASM_FILE_END): Remove
macro.
* gcc/config/aarch64/aarch64-protos.h: Add
aarch64_expand_split_stack_prologue and
aarch64_split_stack_space_check.
* gcc/config/aarch64/aarch64.c (split_stack_arg_pointer_used_p): New
function.
(aarch64_expand_prologue): Setup the argument pointer (x10) for
split-stack.
(aarch64_expand_builtin_va_start): Use internal argument pointer
instead of virtual_incoming_args_rtx.
(morestack_ref): New symbol.
(aarch64_expand_split_stack_prologue): New function.
(aarch64_file_end): Emit the split-stack note sections.
(aarch64_internal_arg_pointer): Likewise.
(aarch64_live_on_entry): Set the argument pointer for split-stack.
(aarch64_split_stack_space_check): Likewise.
(TARGET_ASM_FILE_END): New macro.
(TARGET_EXTRA_LIVE_

Re: [PATCH] aarch64: Add split-stack initial support

2016-08-18 Thread Adhemerval Zanella


On 08/08/2016 07:58, Jiong Wang wrote:
> 
> Adhemerval Zanella writes:
> 
>>> diff --git a/gcc/config/aarch64/aarch64.c b/gcc/config/aarch64/aarch64.c
>>> index e56398a..2cf239f 100644
>>> --- a/gcc/config/aarch64/aarch64.c
>>> +++ b/gcc/config/aarch64/aarch64.c
>>> @@ -3227,6 +3227,34 @@ aarch64_expand_prologue (void)
>>> +  emit_move_insn (x11, GEN_INT (hard_fp_offset));
>>> +  emit_insn (gen_add3_insn (x10, x29, x11));
>>> +  jump = gen_rtx_IF_THEN_ELSE (VOIDmode,
>>> +  gen_rtx_GEU (VOIDmode, cc_reg,
>>> +   const0_rtx),
>>> +  gen_rtx_LABEL_REF (VOIDmode, not_more),
>>> +  pc_rtx);
>>> +  jump = emit_jump_insn (gen_rtx_SET (pc_rtx, jump));
>>> +  JUMP_LABEL (jump) = not_more;
>>> +  LABEL_NUSES (not_more) += 1;
>>> +  emit_move_insn (x10, x28);
>>> +  emit_label (not_more);
>>> +}
>>>  }
> 
> This part needs rebase, there are major changes in AArch64 prologue code
> recently.
> 

Right, I see that 'hard_fp_offset' is not defined locally anymore.

>>>  
>>>  /* Return TRUE if we can use a simple_return insn.
>>> @@ -3303,6 +3331,7 @@ aarch64_expand_epilogue (bool for_sibcall)
>>>offset = offset - fp_offset;
>>>  }
>>>  
>>> +
> 
> Unnecessary new line.
>

Ack.

 
>>> +
>>> +  /* Load __private_ss from TCB.  */
>>> +  ssvalue = gen_rtx_REG (Pmode, R9_REGNUM);
>>> +  emit_insn (gen_aarch64_load_tp_hard (ssvalue));
>>> +  mem = gen_rtx_MEM (Pmode, plus_constant (Pmode, ssvalue, psso));
>>> +  emit_move_insn (ssvalue, mem);
>>> +
>>> +  temp = gen_rtx_REG (Pmode, R10_REGNUM);
>>> +
>>> +  /* Always emit two insns to calculate the requested stack, so the linker
>>> + can edit them when adjusting size for calling non-split-stack code.  
>>> */
>>> +  ninsn = aarch64_internal_mov_immediate (temp, GEN_INT (-frame_size), 
>>> true,
>>> + Pmode);
>>> +  gcc_assert (ninsn == 1 || ninsn == 2);
>>> +  if (ninsn == 1)
>>> +emit_insn (gen_nop ());
> 
> will there be trouble for the linker if the following add is scheduled
> before the nop?

In theory yes, although I haven't seen gcc splitting them.  What would be
the correct way to tie the nop generation so it is emitted after the mov
immediate?
 
>>> +
>>> +   # Set up for a call to the target function.
>>> +   #ldp   x29, x30, [x28, STACKFRAME_BASE]
>>> +   ldr x30, [x28, STACKFRAME_BASE + 8]
>>> +   ldp x0, x1, [x28, STACKFRAME_BASE + 16]
>>> +   ldp x2, x3, [x28, STACKFRAME_BASE + 32]
>>> +   ldp x4, x5, [x28, STACKFRAME_BASE + 48]
>>> +   ldp x6, x7, [x28, STACKFRAME_BASE + 64]
>>> +   add x9, x30, 8
>>> +   cmp x30, x9
> 
> Can you explain why do we need this "cmp" before jumping to target
> function?

This is due to the function prologue addition for var args handling:

  mov   x11, <...>
  sub   sp, sp, <...>
  add   x10, x29, x11
  b.cs  function
  mov   x10, x28
function:

If __morestack is called it will use the 'b.cs' to set up the correct
var arg pointer.
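
In rough C terms (illustrative names only, not the generated symbols),
the sequence above behaves like:

  /* Carry is still set from the earlier split-stack cmp when the current
     stack segment was large enough.  */
  if (enough_stack)
    arg_pointer = frame_pointer + arg_offset;  /* x10 = x29 + x11 */
  else
    arg_pointer = old_stack;                   /* x10 = x28, set up by
                                                  __morestack */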


Below is the latest iteration of the patch; however, I am now seeing an issue
similar to one s390 hit when building libgo:

../../../gcc-git/libgo/go/syscall/socket_linux.go:90:1: error: flow control 
insn inside a basic block
(jump_insn 90 89 91 14 (set (pc)
(if_then_else (geu (reg:CC 66 cc)
(const_int 0 [0]))
(label_ref 92)
(pc))) ../../../gcc-git/libgo/go/syscall/socket_linux.go:90 -1
 (nil)
 -> 92)
../../../gcc-git/libgo/go/syscall/socket_linux.go:90:1: internal compiler 
error: in rtl_verify_bb_insns, at cfgrtl.c:2658
0xac35af _fatal_insn(char const*, rtx_def const*, char const*, int, char const*)

It shows up only with -O2, which I think is due to how the blocks are
reorganized internally with respect to the pseudo-return instruction inserted
by split-stack.  I am still debugging the issue and how to properly fix it,
so if you have any advice I am open to suggestions.




diff --git a/gcc/common/config/aarch64/aarch64-common.c 
b/gcc/common/config/aarch64/aarch64-common.c
index 08e7959..01c3239 100644
--- a/gcc/common/config/aarch64/aarch64-common.c
+++ b/gcc/common/config/aarch64/aarch64-common.c
@@ -106,6 +106,21 @@ aarch64_handle_option (struct gcc_options *opts,
 }
 }
 
+/* -fsplit-stack uses a TCB field available on glibc-2.25.  GLIBC also
+   exports a symbol, __tcb_private_ss, to signal

Re: [PATCH] aarch64: Add split-stack initial support

2016-08-05 Thread Adhemerval Zanella
Ping.

On 28/07/2016 17:36, Adhemerval Zanella wrote:
> From: Adhemerval Zanella <adhemerval.zane...@linaro.org>
> 
> This patch adds the split-stack support on aarch64 (PR #67877).  As for
> other ports this patch should be used along with glibc and gold support.
> 
> The support is done similarly to other architectures: a __private_ss field is
> added on the TCB in glibc, a target-specific __morestack implementation and
> helper functions are added in libgcc, and compiler-side adjustments are made
> (split-stack prologue, va_start for argument handling).  I also plan to
> send the gold support to adjust stack allocation across split-stack
> and default code calls.
> 
> The current approach is similar to the powerpc one: at most 2 GB of stack
> allocation is supported, so stack adjustments can be done with 2 instructions
> (either just a movn plus nop or a movn followed by movk).  The morestack call
> is non-standard, with x10 holding the requested stack pointer and x11 the
> required stack size to be copied.  Also, function arguments on the old stack
> are accessed based on a value relative to the stack pointer, so x10 is used
> to hold the old stack value.  Unwinding is handled by a personality routine
> that knows how to find stack segments.
> 
> Split-stack prologue on function entry is as follows (this goes before the
> usual function prologue):
> 
>   mrs   x9, tpidr_el0
>   mov   x10, -<...>
>   nop/movk
>   add   x10, sp, x10
>   ldr   x9, [x9, 16]
>   cmp   x10, x9
>   b.cs  enough
>   stp   x30, [sp, -16]
>   mov   x11, <...>
>   bl    __morestack
>   ldp   x30, [sp], 16
>   ret
> enough:
>   # usual function prologue, modified a little at the end to set up the
>   # arg_pointer in x10, starts here.  The arg_pointer is initialized,
>   # if it is used, with
>   mov   x11, <...>
>   add   x10, x29, x11
>   b.cs  function
>   mov   x10, x28
> function:
> 
> Notes:
>  1. Even if a function does not allocate a stack frame, a split-stack prologue
> is created.  It is to avoid issues with tail calls to external symbols,
> which might require linker adjustment (libgo/runtime/go-varargs.c).
> 
>  2. Basic-block reordering (enabled with -O2) will move split-stack TCB ldr
> to after the required stack calculation.
> 
>  3. Similar to powerpc, when the linker detects a call from split-stack to
> non-split-stack code, it adds 16k (or more) to the value found in 
> "allocate"
> instructions (so non-split-stack code gets a larger stack).  The amount is
> tunable by a linker option.  The edit means aarch64 does not need to
> implement __morestack_non_split, necessary on x86 because insufficient
> space is available there to edit the stack comparison code.  This feature
> is only implemented in the GNU gold linker.
> 
>  4. AArch64 does not handle >2G stacks initially, and although it is possible
> to implement that, limiting to 2G allows materializing the allocation with
> only 2 instructions (mov + movk), thus simplifying the required linker
> adjustments.  Supporting multiple threads each requiring more
> than 2G of stack is probably not that important, and likely to OOM at
> run time.
> 
>  5. The TCB support on GLIBC is meant to be included in version 2.25 [1].
> 
> I tested bootstrapping on aarch64-linux-gnu and, although I am still digesting
> the results, I saw no regressions.  All cgo tests are passing, although based
> on previous reports from other archs, gold support should be present to avoid
> issues with split code calling non-split code.
> 
> libgcc/ChangeLog:
> 
>   * libgcc/config.host: Use t-stack and t-stack-aarch64 for
>   aarch64*-*-linux.
>   * libgcc/config/aarch64/morestack-c.c: New file.
>   * libgcc/config/aarch64/morestack.S: Likewise.
>   * libgcc/config/aarch64/t-stack-aarch64: Likewise.
>   * libgcc/generic-morestack.c (__splitstack_find): Add aarch64-specific
>   code.
> 
> gcc/ChangeLog:
> 
>   * common/config/aarch64/aarch64-common.c
>   (aarch64_supports_split_stack): New function.
>   (TARGET_SUPPORTS_SPLIT_STACK): New macro.
>   * gcc/config/aarch64/aarch64-linux.h (TARGET_ASM_FILE_END): Remove
>   macro.
>   * gcc/config/aarch64/aarch64-protos.h: Add
>   aarch64_expand_split_stack_prologue and
>   aarch64_split_stack_space_check.
>   * gcc/config/aarch64/aarch64.c (aarch64_expand_prologue): Setup the
>   argument pointer (x10) for split-stack.
>   (aarch64_expand_builtin_va_start): Use internal argument pointer
>   instead of virtual_incoming_args_rtx.
>   (aarch64_e

[PATCH] aarch64: Add split-stack initial support

2016-07-28 Thread Adhemerval Zanella
From: Adhemerval Zanella <adhemerval.zane...@linaro.org>

This patch adds the split-stack support on aarch64 (PR #67877).  As for
other ports this patch should be used along with glibc and gold support.

The support is done similarly to other architectures: a __private_ss field is
added on the TCB in glibc, a target-specific __morestack implementation and
helper functions are added in libgcc, and compiler-side adjustments are made
(split-stack prologue, va_start for argument handling).  I also plan to
send the gold support to adjust stack allocation across split-stack
and default code calls.

The current approach is similar to the powerpc one: at most 2 GB of stack
allocation is supported, so stack adjustments can be done with 2 instructions
(either just a movn plus nop or a movn followed by movk).  The morestack call
is non-standard, with x10 holding the requested stack pointer and x11 the
required stack size to be copied.  Also, function arguments on the old stack
are accessed based on a value relative to the stack pointer, so x10 is used
to hold the old stack value.  Unwinding is handled by a personality routine
that knows how to find stack segments.
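
In rough C terms the check this prologue performs is the following sketch
(illustrative names; the TCB layout is simplified):

  /* x9 = __private_ss loaded from the TCB (tpidr_el0 + 16),
     x10 = sp - <required stack allocation>.  */
  if (sp - frame_size >= tcb->__private_ss)   /* cmp x10, x9; b.cs */
    goto enough;
  /* Otherwise call __morestack with x10 = requested stack pointer and
     x11 = the stack size to be copied.  */
  __morestack ();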

Split-stack prologue on function entry is as follows (this goes before the
usual function prologue):

	mrs	x9, tpidr_el0
	mov	x10, -<...>
	nop/movk
	add	x10, sp, x10
	ldr	x9, [x9, 16]
	cmp	x10, x9
	b.cs	enough
	stp	x30, [sp, -16]
	mov	x11, <...>
	bl	__morestack
	ldp	x30, [sp], 16
	ret
enough:
	# usual function prologue, modified a little at the end to set up the
	# arg_pointer in x10, starts here.  The arg_pointer is initialized,
	# if it is used, with
	mov	x11, <...>
	add	x10, x29, x11
	b.cs	function
	mov	x10, x28
function:

Notes:
 1. Even if a function does not allocate a stack frame, a split-stack prologue
is created.  It is to avoid issues with tail calls to external symbols,
which might require linker adjustment (libgo/runtime/go-varargs.c).

 2. Basic-block reordering (enabled with -O2) will move split-stack TCB ldr
to after the required stack calculation.

 3. Similar to powerpc, when the linker detects a call from split-stack to
non-split-stack code, it adds 16k (or more) to the value found in "allocate"
instructions (so non-split-stack code gets a larger stack).  The amount is
tunable by a linker option.  The edit means aarch64 does not need to
implement __morestack_non_split, necessary on x86 because insufficient
space is available there to edit the stack comparison code.  This feature
is only implemented in the GNU gold linker.

 4. AArch64 does not handle >2G stacks initially, and although it is possible
to implement that, limiting to 2G allows materializing the allocation with
only 2 instructions (mov + movk), thus simplifying the required linker
adjustments.  Supporting multiple threads each requiring more
than 2G of stack is probably not that important, and likely to OOM at
run time.

 5. The TCB support on GLIBC is meant to be included in version 2.25 [1].

I tested bootstrapping on aarch64-linux-gnu and, although I am still digesting
the results, I saw no regressions.  All cgo tests are passing, although based
on previous reports from other archs, gold support should be present to avoid
issues with split code calling non-split code.

libgcc/ChangeLog:

* libgcc/config.host: Use t-stack and t-stack-aarch64 for
aarch64*-*-linux.
* libgcc/config/aarch64/morestack-c.c: New file.
* libgcc/config/aarch64/morestack.S: Likewise.
* libgcc/config/aarch64/t-stack-aarch64: Likewise.
* libgcc/generic-morestack.c (__splitstack_find): Add aarch64-specific
code.

gcc/ChangeLog:

* common/config/aarch64/aarch64-common.c
(aarch64_supports_split_stack): New function.
(TARGET_SUPPORTS_SPLIT_STACK): New macro.
* gcc/config/aarch64/aarch64-linux.h (TARGET_ASM_FILE_END): Remove
macro.
* gcc/config/aarch64/aarch64-protos.h: Add
aarch64_expand_split_stack_prologue and
aarch64_split_stack_space_check.
* gcc/config/aarch64/aarch64.c (aarch64_expand_prologue): Setup the
argument pointer (x10) for split-stack.
(aarch64_expand_builtin_va_start): Use internal argument pointer
instead of virtual_incoming_args_rtx.
(aarch64_expand_split_stack_prologue): New function.
(aarch64_file_end): Emit the split-stack note sections.
(aarch64_internal_arg_pointer): Likewise.
(aarch64_live_on_entry): Set the argument pointer for split-stack.
(aarch64_split_stack_space_check): Likewise.
(TARGET_ASM_FILE_END): New macro.
(TARGET_INTERNAL_ARG_POINTER): Likewise.
* gcc/config/aarch64/aarch64.h (aarch64_frame): Add
split_stack_arg_pointer to setup the argument

Re: [PATCH 1/7] Libsanitizer merge from upstream r249633.

2015-10-14 Thread Adhemerval Zanella


On 14-10-2015 04:54, Jakub Jelinek wrote:
> On Tue, Oct 13, 2015 at 07:54:33PM +0300, Maxim Ostapenko wrote:
>> On 13/10/15 14:15, Maxim Ostapenko wrote:
>>> This is the raw merge itself. I'm bumping SONAME to libasan.so.3.
>>>
>>> -Maxim
>>
>> I have just noticed that I've misused autoconf stuff (used wrong version).
>> Here a fixed version of the same patch. Sorry for inconvenience.
> 
> Is libubsan, libtsan backwards compatible, or do we want to change SONAME
> there too?
> 
> The aarch64 changes are terrible, not just because it doesn't yet have
> runtime decision on what VA to use or that it doesn't support 48-bit VA,
> but also that for the 42-bit VA it uses a different shadow offset from
> 39-bit VA.  But on the compiler side we have just one...
> Though, only the 39-bit VA is enabled right now by default, so out of the
> box the state is as bad as we had in 5.x - users wanting 42-bit VA or 48-bit
> VA have to patch libsanitizer.

Yes, we are aware of the current deficiencies for aarch64 with a 39-bit and
42-bit VMA and the lack of support for a 48-bit VMA.  On the LLVM side the
current approach is to build compiler support for either 39 or 42 bit (and
again we are aware this is not the ideal approach).  This approach was used
mainly for validation and buildbot enablement.

I am currently working on a way to make the sanitizers (asan and msan)
VMA-independent on aarch64 (aiming at the currently supported 39/42-bit and
later the 48-bit VMA).  The approach we decided to use is to change the
instrumentation to use a parametrized value to compute the shadow memory
regions (based on the VMA address), which will be initialized externally
by the libsanitizer.  TSAN is somewhat easier since its instrumentation
does not take the VMA into consideration (only the libsanitizer itself does).

The idea is to avoid compiler switches and make it transparent to run
the binary regardless of the VMA on the system.  The downside is that the
instrumentation will require more steps (loading the parametrized value to
compute the shadow memory address) and thus be slower.
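
As a rough sketch, the instrumented shadow computation would change from a
hard-coded constant to something like the following (the global name here is
hypothetical, not the actual symbol used):

  /* Set once at startup by libsanitizer according to the detected VMA.  */
  extern unsigned long __sanitizer_shadow_offset;

  static inline unsigned long
  mem_to_shadow (unsigned long addr)
  {
    /* Scale 3: one shadow byte covers 8 application bytes.  */
    return (addr >> 3) + __sanitizer_shadow_offset;
  }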



> 
> Have you verified libbacktrace sanitization still works properly (that is
> something upstream does not test)?
> 
> Do you plan to update the asan tests we have to reflect the changes in
> upstream?
> 
>   Jakub
> 


[PATCH] rs6000: Fix HTM tcheck assembly encoding

2015-02-20 Thread Adhemerval Zanella
gcc/ChangeLog:

* config/rs6000/htm.md (tcheck): Fix assembly encoding.

gcc/testsuite/ChangeLog

* gcc.target/powerpc/htm-builtin-1.c: Fix tcheck expected value.

---

diff --git a/gcc/config/rs6000/htm.md b/gcc/config/rs6000/htm.md
index 2c4689f..79fb740 100644
--- a/gcc/config/rs6000/htm.md
+++ b/gcc/config/rs6000/htm.md
@@ -252,7 +252,7 @@
	(unspec_volatile:CC [(match_operand 0 "u3bit_cint_operand" "n")]
			    UNSPECV_HTM_TCHECK))]
   "TARGET_HTM"
-  "tcheck. %0"
+  "tcheck %0"
   [(set_attr "type" "htm")
    (set_attr "length" "4")])
 
diff --git a/gcc/testsuite/gcc.target/powerpc/htm-builtin-1.c 
b/gcc/testsuite/gcc.target/powerpc/htm-builtin-1.c
index e58816a..62d64e6 100644
--- a/gcc/testsuite/gcc.target/powerpc/htm-builtin-1.c
+++ b/gcc/testsuite/gcc.target/powerpc/htm-builtin-1.c
@@ -10,7 +10,7 @@
 /* { dg-final { scan-assembler-times "tabortdci\\." 1 } } */
 /* { dg-final { scan-assembler-times "tabortwc\\." 1 } } */
 /* { dg-final { scan-assembler-times "tabortwci\\." 2 } } */
-/* { dg-final { scan-assembler-times "tcheck\\." 1 } } */
+/* { dg-final { scan-assembler-times "tcheck" 1 } } */
 /* { dg-final { scan-assembler-times "trechkpt\\." 1 } } */
 /* { dg-final { scan-assembler-times "treclaim\\." 1 } } */
 /* { dg-final { scan-assembler-times "tsr\\." 3 } } */



Re: [PATCH] PowerPC: Implement TARGET_ATOMIC_ASSIGN_EXPAND_FENV

2014-09-04 Thread Adhemerval Zanella
On 03-09-2014 11:01, Maciej W. Rozycki wrote:
 On Tue, 2 Sep 2014, Adhemerval Zanella wrote:

 Ping.

 On 19-08-2014 13:54, Adhemerval Zanella wrote:
 Ping.

 On 06-08-2014 17:21, Adhemerval Zanella wrote:
 On 01-08-2014 12:31, Joseph S. Myers wrote:
 On Thu, 31 Jul 2014, David Edelsohn wrote:

 Thanks for implementing the FENV support.  The patch generally looks 
 good to me.

 My one concern is a detail in the implementation of update. I do not
 have enough experience with GENERIC to verify the details and it seems
 like it is missing building an outer COMPOUND_EXPR containing
 update_mffs and the CALL_EXPR for update mtfsf.
 I suppose what's actually odd there is that you have

 +  tree update_mffs = build2 (MODIFY_EXPR, void_type_node, old_fenv, 
 call_mffs);
 +
 +  tree old_llu = build1 (VIEW_CONVERT_EXPR, uint64_type_node, 
 update_mffs);

 so you build a MODIFY_EXPR in void_type_node but then convert it with a 
 VIEW_CONVERT_EXPR.  If you'd built the MODIFY_EXPR in double_type_node 
 then the VIEW_CONVERT_EXPR would be meaningful (the value of an 
 assignment 
 a = b being the new value of a), but reinterpreting a void value doesn't 
 make sense.  Or you could probably just use call_mffs directly in the 
 VIEW_CONVERT_EXPR without explicitly creating the old_fenv variable.

 Thanks for the review, Joseph.  I have changed to avoid the void
 reinterpretation and use call_mffs directly.  I have also removed the
 mask generation in 'clear' from your previous message; it is now reusing
 the mask used in feholdexcept.  The testcase patch is the same as before.

 Checked on both linux-powerpc64/powerpc64le and no regressions found.

 --

 2014-08-06  Adhemerval Zanella  azane...@linux.vnet.ibm.com

 gcc:
* config/rs6000/rs6000.c (rs6000_atomic_assign_expand_fenv): New
function.

 gcc/testsuite:
* gcc.dg/atomic/c11-atomic-exec-5.c
(test_main_long_double_add_overflow): Define and run only for
LDBL_MANT_DIG != 106.
(test_main_complex_long_double_add_overflow): Likewise.
(test_main_long_double_sub_overflow): Likewise.
(test_main_complex_long_double_sub_overflow): Likewise.
  FWIW I pushed it through regression testing across my usual set of 
 powerpc-linux-gnu multilibs with the results (for c11-atomic-exec-5.c) as 
 follows:

 -mcpu=603e                                            PASS
 -mcpu=603e -msoft-float                               UNSUPPORTED
 -mcpu=8540 -mfloat-gprs=single -mspe=yes -mabi=spe    UNSUPPORTED
 -mcpu=8548 -mfloat-gprs=double -mspe=yes -mabi=spe    UNSUPPORTED
 -mcpu=7400 -maltivec -mabi=altivec                    PASS
 -mcpu=e6500 -maltivec -mabi=altivec                   PASS
 -mcpu=e5500 -m64                                      PASS
 -mcpu=e6500 -m64 -maltivec -mabi=altivec              PASS

Thanks for testing it; I'll add these permutations to my own testbench.


 (floating-point environment is of course unsupported for soft-float 
 targets and for the SPE FPU another change is required to implement 
 floating-point environment handling to complement one proposed here).  
 No regressions otherwise.

  While at it, may I propose another change on top of this?

  I've noticed the test case is rather slow, it certainly takes much more 
 time than the average one, I've seen elapsed times of well over a minute 
 on reasonably fast hardware and occasionally a timeout midway through even 
 though the test case was otherwise progressing just fine.  I think lock 
 contention or unrelated system activity such as hardware interrupts (think 
 a busy network!) may contribute to it for systems using LL/SC loops for 
 atomicity.

  So I think the default timeout that's used for really quick tests should 
 be extended a bit.  I propose a factor of 2, just not to make it too 
 excessive, at least for the beginning (maybe it'll have to be higher 
 eventually).

Do you mind if I incorporate this change into my patchset?


  OK?

 2014-09-03  Maciej W. Rozycki  ma...@codesourcery.com

   gcc/testsuite/
   * gcc.dg/atomic/c11-atomic-exec-5.c (dg-timeout-factor): New 
   setting.

   Maciej

 gcc-test-c11-atomic-exec-5-timeout-factor.diff
 Index: gcc-fsf-trunk-quilt/gcc/testsuite/gcc.dg/atomic/c11-atomic-exec-5.c
 ===
 --- gcc-fsf-trunk-quilt.orig/gcc/testsuite/gcc.dg/atomic/c11-atomic-exec-5.c  
 2014-09-02 17:34:06.718927043 +0100
 +++ gcc-fsf-trunk-quilt/gcc/testsuite/gcc.dg/atomic/c11-atomic-exec-5.c   
 2014-09-03 14:51:12.958927233 +0100
 @@ -9,6 +9,7 @@
 /* { dg-additional-options "-D_XOPEN_SOURCE=600" { target
 *-*-solaris2.1[0-9]* } } */
  /* { dg-require-effective-target fenv_exceptions } */
  /* { dg-require-effective-target pthread } */
 +/* { dg-timeout-factor 2 } */

 #include <fenv.h>
 #include <float.h>




Re: [PATCH] PowerPC: Implement TARGET_ATOMIC_ASSIGN_EXPAND_FENV

2014-09-02 Thread Adhemerval Zanella
Ping.

On 19-08-2014 13:54, Adhemerval Zanella wrote:
 Ping.

 On 06-08-2014 17:21, Adhemerval Zanella wrote:
 On 01-08-2014 12:31, Joseph S. Myers wrote:
 On Thu, 31 Jul 2014, David Edelsohn wrote:

 Thanks for implementing the FENV support.  The patch generally looks 
 good to me.

 My one concern is a detail in the implementation of update. I do not
 have enough experience with GENERIC to verify the details and it seems
 like it is missing building an outer COMPOUND_EXPR containing
 update_mffs and the CALL_EXPR for update mtfsf.
 I suppose what's actually odd there is that you have

 +  tree update_mffs = build2 (MODIFY_EXPR, void_type_node, old_fenv, 
 call_mffs);
 +
 +  tree old_llu = build1 (VIEW_CONVERT_EXPR, uint64_type_node, update_mffs);

 so you build a MODIFY_EXPR in void_type_node but then convert it with a 
 VIEW_CONVERT_EXPR.  If you'd built the MODIFY_EXPR in double_type_node 
 then the VIEW_CONVERT_EXPR would be meaningful (the value of an assignment 
 a = b being the new value of a), but reinterpreting a void value doesn't 
 make sense.  Or you could probably just use call_mffs directly in the 
 VIEW_CONVERT_EXPR without explicitly creating the old_fenv variable.

 Thanks for the review, Joseph.  I have changed to avoid the void
 reinterpretation and use call_mffs directly.  I have also removed the
 mask generation in 'clear' from your previous message; it is now reusing
 the mask used in feholdexcept.  The testcase patch is the same as before.

 Checked on both linux-powerpc64/powerpc64le and no regressions found.

 --

 2014-08-06  Adhemerval Zanella  azane...@linux.vnet.ibm.com

 gcc:
  * config/rs6000/rs6000.c (rs6000_atomic_assign_expand_fenv): New
  function.

 gcc/testsuite:
  * gcc.dg/atomic/c11-atomic-exec-5.c
  (test_main_long_double_add_overflow): Define and run only for
  LDBL_MANT_DIG != 106.
  (test_main_complex_long_double_add_overflow): Likewise.
  (test_main_long_double_sub_overflow): Likewise.
  (test_main_complex_long_double_sub_overflow): Likewise.

 ---

 diff --git a/gcc/config/rs6000/rs6000.c b/gcc/config/rs6000/rs6000.c
 index d088ff6..7d66eb1 100644
 --- a/gcc/config/rs6000/rs6000.c
 +++ b/gcc/config/rs6000/rs6000.c
 @@ -1631,6 +1631,9 @@ static const struct attribute_spec 
 rs6000_attribute_table[] =

  #undef TARGET_CAN_USE_DOLOOP_P
  #define TARGET_CAN_USE_DOLOOP_P can_use_doloop_if_innermost
 +
 +#undef TARGET_ATOMIC_ASSIGN_EXPAND_FENV
 +#define TARGET_ATOMIC_ASSIGN_EXPAND_FENV rs6000_atomic_assign_expand_fenv
  

  /* Processor table.  */
 @@ -33337,6 +33340,80 @@ emit_fusion_gpr_load (rtx *operands)
    return "";
  }

 +/* Implement TARGET_ATOMIC_ASSIGN_EXPAND_FENV hook.  */
 +
 +static void
 +rs6000_atomic_assign_expand_fenv (tree *hold, tree *clear, tree *update)
 +{
 +  if (!TARGET_HARD_FLOAT || !TARGET_FPRS)
 +return;
 +
 +  tree mffs = rs6000_builtin_decls[RS6000_BUILTIN_MFFS];
 +  tree mtfsf = rs6000_builtin_decls[RS6000_BUILTIN_MTFSF];
 +  tree call_mffs = build_call_expr (mffs, 0);
 +
 +  /* Generates the equivalent of feholdexcept (&fenv_var)
 +
 + *fenv_var = __builtin_mffs ();
 + double fenv_hold;
 + *(uint64_t*)&fenv_hold = *(uint64_t*)fenv_var & 0xffffffff00000007LL;
 + __builtin_mtfsf (0xff, fenv_hold);  */
 +
 +  /* Mask to clear everything except for the rounding modes and non-IEEE
 + arithmetic flag.  */
 +  const unsigned HOST_WIDE_INT hold_exception_mask =
 +HOST_WIDE_INT_UC (0xffffffff00000007);
 +
 +  tree fenv_var = create_tmp_var (double_type_node, NULL);
 +
 +  tree hold_mffs = build2 (MODIFY_EXPR, void_type_node, fenv_var, 
 call_mffs);
 +
 +  tree fenv_llu = build1 (VIEW_CONVERT_EXPR, uint64_type_node, fenv_var);
 +  tree fenv_llu_and = build2 (BIT_AND_EXPR, uint64_type_node, fenv_llu,
 +build_int_cst (uint64_type_node, hold_exception_mask));
 +
 +  tree fenv_mtfsf = build1 (VIEW_CONVERT_EXPR, double_type_node, 
 fenv_llu_and);
 +
 +  tree hold_mtfsf = build_call_expr (mtfsf, 2,
 +build_int_cst (unsigned_type_node, 0xff), fenv_mtfsf);
 +
 +  *hold = build2 (COMPOUND_EXPR, void_type_node, hold_mffs, hold_mtfsf);
 +
 +  /* Reload the value of fenv_hold to clear the exceptions.  */
 +
 +  *clear = build_call_expr (mtfsf, 2,
 +build_int_cst (unsigned_type_node, 0xff), fenv_mtfsf);
 +
 +  /* Generates the equivalent of feupdateenv (&fenv_var)
 +
 + double old_fenv = __builtin_mffs ();
 + double fenv_update;
 + *(uint64_t*)&fenv_update = (*(uint64_t*)&old & 0xffffffff1fffff00LL) |
 +   (*(uint64_t*)fenv_var & 0x1ff80fff);
 + __builtin_mtfsf (0xff, fenv_update);  */
 +
 +  const unsigned HOST_WIDE_INT update_exception_mask =
 +HOST_WIDE_INT_UC (0xffffffff1fffff00);
 +  const unsigned HOST_WIDE_INT new_exception_mask =
 +HOST_WIDE_INT_UC (0x1ff80fff);
 +
 +  tree old_llu = build1 (VIEW_CONVERT_EXPR, uint64_type_node, call_mffs);
 +  tree old_llu_and = build2 (BIT_AND_EXPR, uint64_type_node,
 +old_llu

Re: [PATCH] PowerPC: Implement TARGET_ATOMIC_ASSIGN_EXPAND_FENV

2014-08-19 Thread Adhemerval Zanella
Ping.

On 06-08-2014 17:21, Adhemerval Zanella wrote:
 On 01-08-2014 12:31, Joseph S. Myers wrote:
 On Thu, 31 Jul 2014, David Edelsohn wrote:

 Thanks for implementing the FENV support.  The patch generally looks 
 good to me.

 My one concern is a detail in the implementation of update. I do not
 have enough experience with GENERIC to verify the details and it seems
 like it is missing building an outer COMPOUND_EXPR containing
 update_mffs and the CALL_EXPR for update mtfsf.
 I suppose what's actually odd there is that you have

 +  tree update_mffs = build2 (MODIFY_EXPR, void_type_node, old_fenv, 
 call_mffs);
 +
 +  tree old_llu = build1 (VIEW_CONVERT_EXPR, uint64_type_node, update_mffs);

 so you build a MODIFY_EXPR in void_type_node but then convert it with a 
 VIEW_CONVERT_EXPR.  If you'd built the MODIFY_EXPR in double_type_node 
 then the VIEW_CONVERT_EXPR would be meaningful (the value of an assignment 
 a = b being the new value of a), but reinterpreting a void value doesn't 
 make sense.  Or you could probably just use call_mffs directly in the 
 VIEW_CONVERT_EXPR without explicitly creating the old_fenv variable.

 Thanks for the review, Joseph.  I have changed to avoid the void
 reinterpretation and use call_mffs directly.  I have also removed the
 mask generation in 'clear' from your previous message; it is now reusing
 the mask used in feholdexcept.  The testcase patch is the same as before.

 Checked on both linux-powerpc64/powerpc64le and no regressions found.

 --

 2014-08-06  Adhemerval Zanella  azane...@linux.vnet.ibm.com

 gcc:
   * config/rs6000/rs6000.c (rs6000_atomic_assign_expand_fenv): New
   function.

 gcc/testsuite:
   * gcc.dg/atomic/c11-atomic-exec-5.c
   (test_main_long_double_add_overflow): Define and run only for
   LDBL_MANT_DIG != 106.
   (test_main_complex_long_double_add_overflow): Likewise.
   (test_main_long_double_sub_overflow): Likewise.
   (test_main_complex_long_double_sub_overflow): Likewise.

 ---

 diff --git a/gcc/config/rs6000/rs6000.c b/gcc/config/rs6000/rs6000.c
 index d088ff6..7d66eb1 100644
 --- a/gcc/config/rs6000/rs6000.c
 +++ b/gcc/config/rs6000/rs6000.c
 @@ -1631,6 +1631,9 @@ static const struct attribute_spec 
 rs6000_attribute_table[] =

  #undef TARGET_CAN_USE_DOLOOP_P
  #define TARGET_CAN_USE_DOLOOP_P can_use_doloop_if_innermost
 +
 +#undef TARGET_ATOMIC_ASSIGN_EXPAND_FENV
 +#define TARGET_ATOMIC_ASSIGN_EXPAND_FENV rs6000_atomic_assign_expand_fenv
  

  /* Processor table.  */
 @@ -33337,6 +33340,80 @@ emit_fusion_gpr_load (rtx *operands)
    return "";
  }

 +/* Implement TARGET_ATOMIC_ASSIGN_EXPAND_FENV hook.  */
 +
 +static void
 +rs6000_atomic_assign_expand_fenv (tree *hold, tree *clear, tree *update)
 +{
 +  if (!TARGET_HARD_FLOAT || !TARGET_FPRS)
 +return;
 +
 +  tree mffs = rs6000_builtin_decls[RS6000_BUILTIN_MFFS];
 +  tree mtfsf = rs6000_builtin_decls[RS6000_BUILTIN_MTFSF];
 +  tree call_mffs = build_call_expr (mffs, 0);
 +
 +  /* Generates the equivalent of feholdexcept (&fenv_var)
 +
 + *fenv_var = __builtin_mffs ();
 + double fenv_hold;
 + *(uint64_t*)&fenv_hold = *(uint64_t*)fenv_var & 0xffffffff00000007LL;
 + __builtin_mtfsf (0xff, fenv_hold);  */
 +
 +  /* Mask to clear everything except for the rounding modes and non-IEEE
 + arithmetic flag.  */
 +  const unsigned HOST_WIDE_INT hold_exception_mask =
 +HOST_WIDE_INT_UC (0xffffffff00000007);
 +
 +  tree fenv_var = create_tmp_var (double_type_node, NULL);
 +
 +  tree hold_mffs = build2 (MODIFY_EXPR, void_type_node, fenv_var, call_mffs);
 +
 +  tree fenv_llu = build1 (VIEW_CONVERT_EXPR, uint64_type_node, fenv_var);
 +  tree fenv_llu_and = build2 (BIT_AND_EXPR, uint64_type_node, fenv_llu,
 +build_int_cst (uint64_type_node, hold_exception_mask));
 +
 +  tree fenv_mtfsf = build1 (VIEW_CONVERT_EXPR, double_type_node, 
 fenv_llu_and);
 +
 +  tree hold_mtfsf = build_call_expr (mtfsf, 2,
 +build_int_cst (unsigned_type_node, 0xff), fenv_mtfsf);
 +
 +  *hold = build2 (COMPOUND_EXPR, void_type_node, hold_mffs, hold_mtfsf);
 +
 +  /* Reload the value of fenv_hold to clear the exceptions.  */
 +
 +  *clear = build_call_expr (mtfsf, 2,
 +build_int_cst (unsigned_type_node, 0xff), fenv_mtfsf);
 +
 +  /* Generates the equivalent of feupdateenv (&fenv_var)
 +
 + double old_fenv = __builtin_mffs ();
 + double fenv_update;
 + *(uint64_t*)&fenv_update = (*(uint64_t*)&old & 0xffffffff1fffff00LL) |
 +   (*(uint64_t*)fenv_var & 0x1ff80fff);
 + __builtin_mtfsf (0xff, fenv_update);  */
 +
 +  const unsigned HOST_WIDE_INT update_exception_mask =
 +HOST_WIDE_INT_UC (0xffffffff1fffff00);
 +  const unsigned HOST_WIDE_INT new_exception_mask =
 +HOST_WIDE_INT_UC (0x1ff80fff);
 +
 +  tree old_llu = build1 (VIEW_CONVERT_EXPR, uint64_type_node, call_mffs);
 +  tree old_llu_and = build2 (BIT_AND_EXPR, uint64_type_node,
 +old_llu, build_int_cst (uint64_type_node, update_exception_mask

Re: [PATCH] PowerPC: Implement TARGET_ATOMIC_ASSIGN_EXPAND_FENV

2014-08-06 Thread Adhemerval Zanella
On 01-08-2014 12:31, Joseph S. Myers wrote:
 On Thu, 31 Jul 2014, David Edelsohn wrote:

 Thanks for implementing the FENV support.  The patch generally looks 
 good to me.

 My one concern is a detail in the implementation of update. I do not
 have enough experience with GENERIC to verify the details and it seems
 like it is missing building an outer COMPOUND_EXPR containing
 update_mffs and the CALL_EXPR for update mtfsf.
 I suppose what's actually odd there is that you have

 +  tree update_mffs = build2 (MODIFY_EXPR, void_type_node, old_fenv, 
 call_mffs);
 +
 +  tree old_llu = build1 (VIEW_CONVERT_EXPR, uint64_type_node, update_mffs);

 so you build a MODIFY_EXPR in void_type_node but then convert it with a 
 VIEW_CONVERT_EXPR.  If you'd built the MODIFY_EXPR in double_type_node 
 then the VIEW_CONVERT_EXPR would be meaningful (the value of an assignment 
 a = b being the new value of a), but reinterpreting a void value doesn't 
 make sense.  Or you could probably just use call_mffs directly in the 
 VIEW_CONVERT_EXPR without explicitly creating the old_fenv variable.

Thanks for the review, Joseph.  I have changed to avoid the void
reinterpretation and use call_mffs directly.  I have also removed the
mask generation in 'clear' from your previous message; it is now reusing
the mask used in feholdexcept.  The testcase patch is the same as before.

Checked on both linux-powerpc64/powerpc64le and no regressions found.

--

2014-08-06  Adhemerval Zanella  azane...@linux.vnet.ibm.com

gcc:
* config/rs6000/rs6000.c (rs6000_atomic_assign_expand_fenv): New
function.

gcc/testsuite:
* gcc.dg/atomic/c11-atomic-exec-5.c
(test_main_long_double_add_overflow): Define and run only for
LDBL_MANT_DIG != 106.
(test_main_complex_long_double_add_overflow): Likewise.
(test_main_long_double_sub_overflow): Likewise.
(test_main_complex_long_double_sub_overflow): Likewise.

---

diff --git a/gcc/config/rs6000/rs6000.c b/gcc/config/rs6000/rs6000.c
index d088ff6..7d66eb1 100644
--- a/gcc/config/rs6000/rs6000.c
+++ b/gcc/config/rs6000/rs6000.c
@@ -1631,6 +1631,9 @@ static const struct attribute_spec 
rs6000_attribute_table[] =
 
 #undef TARGET_CAN_USE_DOLOOP_P
 #define TARGET_CAN_USE_DOLOOP_P can_use_doloop_if_innermost
+
+#undef TARGET_ATOMIC_ASSIGN_EXPAND_FENV
+#define TARGET_ATOMIC_ASSIGN_EXPAND_FENV rs6000_atomic_assign_expand_fenv
 
 
 /* Processor table.  */
@@ -33337,6 +33340,80 @@ emit_fusion_gpr_load (rtx *operands)
   return "";
 }
 
+/* Implement TARGET_ATOMIC_ASSIGN_EXPAND_FENV hook.  */
+
+static void
+rs6000_atomic_assign_expand_fenv (tree *hold, tree *clear, tree *update)
+{
+  if (!TARGET_HARD_FLOAT || !TARGET_FPRS)
+return;
+
+  tree mffs = rs6000_builtin_decls[RS6000_BUILTIN_MFFS];
+  tree mtfsf = rs6000_builtin_decls[RS6000_BUILTIN_MTFSF];
+  tree call_mffs = build_call_expr (mffs, 0);
+
+  /* Generates the equivalent of feholdexcept (&fenv_var)
+
+ *fenv_var = __builtin_mffs ();
+ double fenv_hold;
+ *(uint64_t*)&fenv_hold = *(uint64_t*)fenv_var & 0xffffffff00000007LL;
+ __builtin_mtfsf (0xff, fenv_hold);  */
+
+  /* Mask to clear everything except for the rounding modes and non-IEEE
+ arithmetic flag.  */
+  const unsigned HOST_WIDE_INT hold_exception_mask =
+HOST_WIDE_INT_UC (0xffffffff00000007);
+
+  tree fenv_var = create_tmp_var (double_type_node, NULL);
+
+  tree hold_mffs = build2 (MODIFY_EXPR, void_type_node, fenv_var, call_mffs);
+
+  tree fenv_llu = build1 (VIEW_CONVERT_EXPR, uint64_type_node, fenv_var);
+  tree fenv_llu_and = build2 (BIT_AND_EXPR, uint64_type_node, fenv_llu,
+build_int_cst (uint64_type_node, hold_exception_mask));
+
+  tree fenv_mtfsf = build1 (VIEW_CONVERT_EXPR, double_type_node, fenv_llu_and);
+
+  tree hold_mtfsf = build_call_expr (mtfsf, 2,
+build_int_cst (unsigned_type_node, 0xff), fenv_mtfsf);
+
+  *hold = build2 (COMPOUND_EXPR, void_type_node, hold_mffs, hold_mtfsf);
+
+  /* Reload the value of fenv_hold to clear the exceptions.  */
+
+  *clear = build_call_expr (mtfsf, 2,
+build_int_cst (unsigned_type_node, 0xff), fenv_mtfsf);
+
+  /* Generates the equivalent of feupdateenv (&fenv_var)
+
+ double old_fenv = __builtin_mffs ();
+ double fenv_update;
+ *(uint64_t*)&fenv_update = (*(uint64_t*)&old & 0xffffffff1fffff00LL) |
+   (*(uint64_t*)fenv_var & 0x1ff80fff);
+ __builtin_mtfsf (0xff, fenv_update);  */
+
+  const unsigned HOST_WIDE_INT update_exception_mask =
+HOST_WIDE_INT_UC (0xffffffff1fffff00);
+  const unsigned HOST_WIDE_INT new_exception_mask =
+HOST_WIDE_INT_UC (0x1ff80fff);
+
+  tree old_llu = build1 (VIEW_CONVERT_EXPR, uint64_type_node, call_mffs);
+  tree old_llu_and = build2 (BIT_AND_EXPR, uint64_type_node,
+old_llu, build_int_cst (uint64_type_node, update_exception_mask));
+
+  tree new_llu_and = build2 (BIT_AND_EXPR, uint64_type_node, fenv_llu,
+build_int_cst (uint64_type_node, new_exception_mask

[COMMITTED] Add myself to MAINTAINERS file

2014-07-16 Thread Adhemerval Zanella
Hi,

This patch adds myself to the MAINTAINERS file.  Committed as 212652.

--

2014-07-16  Adhemerval Zanella  azane...@linux.vnet.ibm.com

* MAINTAINERS (Write After Approval): Add myself.

---

--- trunk/MAINTAINERS   2014/07/16 14:21:34 212651
+++ trunk/MAINTAINERS   2014/07/16 14:23:03 212652
@@ -583,6 +583,7 @@
 David Yuste            david.yu...@gmail.com
 Kirill Yukhin          kirill.yuk...@gmail.com
 Kenneth Zadeck         zad...@naturalbridge.com
+Adhemerval Zanella     azane...@linux.vnet.ibm.com
 Yufeng Zhang           yufeng.zh...@arm.com
 Shujing Zhao           pearly.z...@oracle.com
 Jon Ziegler            j...@apple.com





Re: [PATCH] PowerPC: Implement TARGET_ATOMIC_ASSIGN_EXPAND_FENV

2014-07-16 Thread Adhemerval Zanella
Ping.

On 03-07-2014 18:08, Adhemerval Zanella wrote:
 This patch implements the TARGET_ATOMIC_ASSIGN_EXPAND_FENV hook for
 powerpc-fpu.  I had to adjust the current c11-atomic-exec-5 testcase
 because, for IBM long double, 0 += LDBL_MAX might generate
 overflow/underflow in the internal __gcc_qadd calculations.

 The c11-atomic-exec-5 now passes for linux/powerpc, checked on
 powerpc32-linux-fpu, powerpc64-linux, and powerpc64le-linux.

 --

 2014-07-03  Adhemerval Zanella  azane...@linux.vnet.ibm.com

 gcc:
   * config/rs6000/rs6000.c (rs6000_atomic_assign_expand_fenv): New
   function.

 gcc/testsuite:
   * gcc.dg/atomic/c11-atomic-exec-5.c
   (test_main_long_double_add_overflow): Define and run only for
   LDBL_MANT_DIG != 106.
   (test_main_complex_long_double_add_overflow): Likewise.
   (test_main_long_double_sub_overflow): Likewise.
   (test_main_complex_long_double_sub_overflow): Likewise.

 ---

 diff --git a/gcc/config/rs6000/rs6000.c b/gcc/config/rs6000/rs6000.c
 index bf67e72..75a2a45 100644
 --- a/gcc/config/rs6000/rs6000.c
 +++ b/gcc/config/rs6000/rs6000.c
 @@ -1621,6 +1621,9 @@ static const struct attribute_spec 
 rs6000_attribute_table[] =

  #undef TARGET_CAN_USE_DOLOOP_P
  #define TARGET_CAN_USE_DOLOOP_P can_use_doloop_if_innermost
 +
 +#undef TARGET_ATOMIC_ASSIGN_EXPAND_FENV
 +#define TARGET_ATOMIC_ASSIGN_EXPAND_FENV rs6000_atomic_assign_expand_fenv
  

  /* Processor table.  */
 @@ -32991,6 +32994,105 @@ emit_fusion_gpr_load (rtx *operands)
   return "";
  }

 +/* Implement TARGET_ATOMIC_ASSIGN_EXPAND_FENV hook.  */
 +
 +static void
 +rs6000_atomic_assign_expand_fenv (tree *hold, tree *clear, tree *update)
 +{
 +  if (!TARGET_HARD_FLOAT || !TARGET_FPRS)
 +return;
 +
 +  tree mffs = rs6000_builtin_decls[RS6000_BUILTIN_MFFS];
 +  tree mtfsf = rs6000_builtin_decls[RS6000_BUILTIN_MTFSF];
 +  tree call_mffs = build_call_expr (mffs, 0);
 +
 +  /* Generates the equivalent of feholdexcept (&fenv_var)
 +
 + *fenv_var = __builtin_mffs ();
 + double fenv_hold;
 + *(uint64_t*)&fenv_hold = *(uint64_t*)fenv_var & 0xffffffff00000007LL;
 + __builtin_mtfsf (0xff, fenv_hold);  */
 +
 +  /* Mask to clear everything except for the rounding modes and non-IEEE
 + arithmetic flag.  */
 +  const unsigned HOST_WIDE_INT hold_exception_mask =
 +HOST_WIDE_INT_C (0xffffffff00000007);
 +
 +  tree fenv_var = create_tmp_var (double_type_node, NULL);
 +
 +  tree hold_mffs = build2 (MODIFY_EXPR, void_type_node, fenv_var, call_mffs);
 +
 +  tree fenv_llu = build1 (VIEW_CONVERT_EXPR, uint64_type_node, fenv_var);
 +  tree fenv_llu_and = build2 (BIT_AND_EXPR, uint64_type_node, fenv_llu,
 +build_int_cst (uint64_type_node, hold_exception_mask));
 +
 +  tree fenv_mtfsf = build1 (VIEW_CONVERT_EXPR, double_type_node, 
 fenv_llu_and);
 +
 +  tree hold_mtfsf = build_call_expr (mtfsf, 2,
 +build_int_cst (unsigned_type_node, 0xff), fenv_mtfsf);
 +
 +  *hold = build2 (COMPOUND_EXPR, void_type_node, hold_mffs, hold_mtfsf);
 +
 +  /* Generates the equivalent of feclearexcept (FE_ALL_EXCEPT):
 +
 + double fenv_clear = __builtin_mffs ();
 + *(uint64_t)&fenv_clear &= 0xffffffff00000000LL;
 + __builtin_mtfsf (0xff, fenv_clear);  */
 +
 +  /* Mask to clear everything except for the rounding modes and non-IEEE
 + arithmetic flag.  */
 +  const unsigned HOST_WIDE_INT clear_exception_mask =
 +HOST_WIDE_INT_UC (0xffffffff00000000);
 +
 +  tree fenv_clear = create_tmp_var (double_type_node, NULL);
 +
 +  tree clear_mffs = build2 (MODIFY_EXPR, void_type_node, fenv_clear, 
 call_mffs);
 +
 +  tree fenv_clean_llu = build1 (VIEW_CONVERT_EXPR, uint64_type_node, 
 fenv_var);
 +  tree fenv_clear_llu_and = build2 (BIT_AND_EXPR, uint64_type_node,
 +fenv_clean_llu, build_int_cst (uint64_type_node, clear_exception_mask));
 +
 +  tree fenv_clear_mtfsf = build1 (VIEW_CONVERT_EXPR, double_type_node,
 +fenv_clear_llu_and);
 +
 +  tree clear_mtfsf = build_call_expr (mtfsf, 2,
 +build_int_cst (unsigned_type_node, 0xff), fenv_clear_mtfsf);
 +
 +  *clear = build2 (COMPOUND_EXPR, void_type_node, clear_mffs, clear_mtfsf);
 +
 +  /* Generates the equivalent of feupdateenv (&fenv_var)
 +
 + double old_fenv = __builtin_mffs ();
 + double fenv_update;
 + *(uint64_t*)&fenv_update = (*(uint64_t*)&old & 0xffffffff1fffff00LL) |
 +   (*(uint64_t*)fenv_var & 0x1ff80fff);
 + __builtin_mtfsf (0xff, fenv_update);  */
 +
 +  const unsigned HOST_WIDE_INT update_exception_mask =
 +HOST_WIDE_INT_UC (0xffffffff1fffff00);
 +  const unsigned HOST_WIDE_INT new_exception_mask =
 +HOST_WIDE_INT_UC (0x1ff80fff);
 +
 +  tree old_fenv = create_tmp_var (double_type_node, NULL);
 +  tree update_mffs = build2 (MODIFY_EXPR, void_type_node, old_fenv, 
 call_mffs);
 +
 +  tree old_llu = build1 (VIEW_CONVERT_EXPR, uint64_type_node, update_mffs);
 +  tree old_llu_and = build2 (BIT_AND_EXPR, uint64_type_node,
 +old_llu, build_int_cst (uint64_type_node

[PATCH] PowerPC: Implement TARGET_ATOMIC_ASSIGN_EXPAND_FENV

2014-07-03 Thread Adhemerval Zanella
This patch implements the TARGET_ATOMIC_ASSIGN_EXPAND_FENV hook for
powerpc-fpu.  I had to adjust the current c11-atomic-exec-5 testcase
because, for IBM long double, 0 += LDBL_MAX might generate
overflow/underflow in the internal __gcc_qadd calculations.

The c11-atomic-exec-5 now passes for linux/powerpc, checked on
powerpc32-linux-fpu, powerpc64-linux, and powerpc64le-linux.
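
For reference, the hold/clear/update sequences plug into the generic
expansion of an atomic compound assignment roughly as follows (simplified
pseudo-C of the compiler-generated code, not actual library calls):

  /* _Atomic double a;  a += b;  */
  hold;                                  /* save FP env, disable traps */
  for (;;)
    {
      oldval = atomic_load (&a);
      newval = oldval + b;
      if (atomic_compare_exchange (&a, &oldval, newval))
        break;
      clear;                             /* discard exceptions raised by
                                            the failed iteration */
    }
  update;                                /* restore env and raise the
                                            newly-set exceptions */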

--

2014-07-03  Adhemerval Zanella  azane...@linux.vnet.ibm.com

gcc:
* config/rs6000/rs6000.c (rs6000_atomic_assign_expand_fenv): New
function.

gcc/testsuite:
* gcc.dg/atomic/c11-atomic-exec-5.c
(test_main_long_double_add_overflow): Define and run only for
LDBL_MANT_DIG != 106.
(test_main_complex_long_double_add_overflow): Likewise.
(test_main_long_double_sub_overflow): Likewise.
(test_main_complex_long_double_sub_overflow): Likewise.

---

diff --git a/gcc/config/rs6000/rs6000.c b/gcc/config/rs6000/rs6000.c
index bf67e72..75a2a45 100644
--- a/gcc/config/rs6000/rs6000.c
+++ b/gcc/config/rs6000/rs6000.c
@@ -1621,6 +1621,9 @@ static const struct attribute_spec 
rs6000_attribute_table[] =
 
 #undef TARGET_CAN_USE_DOLOOP_P
 #define TARGET_CAN_USE_DOLOOP_P can_use_doloop_if_innermost
+
+#undef TARGET_ATOMIC_ASSIGN_EXPAND_FENV
+#define TARGET_ATOMIC_ASSIGN_EXPAND_FENV rs6000_atomic_assign_expand_fenv
 
 
 /* Processor table.  */
@@ -32991,6 +32994,105 @@ emit_fusion_gpr_load (rtx *operands)
   return "";
 }
 
+/* Implement TARGET_ATOMIC_ASSIGN_EXPAND_FENV hook.  */
+
+static void
+rs6000_atomic_assign_expand_fenv (tree *hold, tree *clear, tree *update)
+{
+  if (!TARGET_HARD_FLOAT || !TARGET_FPRS)
+return;
+
+  tree mffs = rs6000_builtin_decls[RS6000_BUILTIN_MFFS];
+  tree mtfsf = rs6000_builtin_decls[RS6000_BUILTIN_MTFSF];
+  tree call_mffs = build_call_expr (mffs, 0);
+
+  /* Generates the equivalent of feholdexcept (&fenv_var)
+
+ *fenv_var = __builtin_mffs ();
+ double fenv_hold;
+ *(uint64_t*)&fenv_hold = *(uint64_t*)fenv_var & 0xffffffff00000007LL;
+ __builtin_mtfsf (0xff, fenv_hold);  */
+
+  /* Mask to clear everything except for the rounding modes and non-IEEE
+ arithmetic flag.  */
+  const unsigned HOST_WIDE_INT hold_exception_mask =
+HOST_WIDE_INT_C (0xffffffff00000007);
+
+  tree fenv_var = create_tmp_var (double_type_node, NULL);
+
+  tree hold_mffs = build2 (MODIFY_EXPR, void_type_node, fenv_var, call_mffs);
+
+  tree fenv_llu = build1 (VIEW_CONVERT_EXPR, uint64_type_node, fenv_var);
+  tree fenv_llu_and = build2 (BIT_AND_EXPR, uint64_type_node, fenv_llu,
+build_int_cst (uint64_type_node, hold_exception_mask));
+
+  tree fenv_mtfsf = build1 (VIEW_CONVERT_EXPR, double_type_node, fenv_llu_and);
+
+  tree hold_mtfsf = build_call_expr (mtfsf, 2,
+build_int_cst (unsigned_type_node, 0xff), fenv_mtfsf);
+
+  *hold = build2 (COMPOUND_EXPR, void_type_node, hold_mffs, hold_mtfsf);
+
+  /* Generates the equivalent of feclearexcept (FE_ALL_EXCEPT):
+
+ double fenv_clear = __builtin_mffs ();
+ *(uint64_t)&fenv_clear &= 0xffffffff00000000LL;
+ __builtin_mtfsf (0xff, fenv_clear);  */
+
+  /* Mask to clear everything except for the rounding modes and non-IEEE
+ arithmetic flag.  */
+  const unsigned HOST_WIDE_INT clear_exception_mask =
+HOST_WIDE_INT_UC (0xffffffff00000000);
+
+  tree fenv_clear = create_tmp_var (double_type_node, NULL);
+
+  tree clear_mffs = build2 (MODIFY_EXPR, void_type_node, fenv_clear, 
call_mffs);
+
+  tree fenv_clean_llu = build1 (VIEW_CONVERT_EXPR, uint64_type_node, fenv_var);
+  tree fenv_clear_llu_and = build2 (BIT_AND_EXPR, uint64_type_node,
+fenv_clean_llu, build_int_cst (uint64_type_node, clear_exception_mask));
+
+  tree fenv_clear_mtfsf = build1 (VIEW_CONVERT_EXPR, double_type_node,
+fenv_clear_llu_and);
+
+  tree clear_mtfsf = build_call_expr (mtfsf, 2,
+build_int_cst (unsigned_type_node, 0xff), fenv_clear_mtfsf);
+
+  *clear = build2 (COMPOUND_EXPR, void_type_node, clear_mffs, clear_mtfsf);
+
+  /* Generates the equivalent of feupdateenv (&fenv_var)
+
+ double old_fenv = __builtin_mffs ();
+ double fenv_update;
+ *(uint64_t*)&fenv_update = (*(uint64_t*)&old & 0xffffffff1fffff00LL) |
+   (*(uint64_t*)fenv_var & 0x1ff80fff);
+ __builtin_mtfsf (0xff, fenv_update);  */
+
+  const unsigned HOST_WIDE_INT update_exception_mask =
+HOST_WIDE_INT_UC (0xffffffff1fffff00);
+  const unsigned HOST_WIDE_INT new_exception_mask =
+HOST_WIDE_INT_UC (0x1ff80fff);
+
+  tree old_fenv = create_tmp_var (double_type_node, NULL);
+  tree update_mffs = build2 (MODIFY_EXPR, void_type_node, old_fenv, call_mffs);
+
+  tree old_llu = build1 (VIEW_CONVERT_EXPR, uint64_type_node, update_mffs);
+  tree old_llu_and = build2 (BIT_AND_EXPR, uint64_type_node,
+old_llu, build_int_cst (uint64_type_node, update_exception_mask));
+
+  tree new_llu_and = build2 (BIT_AND_EXPR, uint64_type_node, fenv_llu,
+build_int_cst (uint64_type_node, new_exception_mask));
+
+  tree

Re: [PATCH, PR 57363] IBM long double: adding qNaN and number raises inexact exception

2013-11-20 Thread Adhemerval Zanella
Hi,

This is a reformatted patch after Ulrich's review:

---

libgcc/ChangeLog:

2013-11-20  Adhemerval Zanella  azane...@linux.vnet.ibm.com

* config/rs6000/ibm-ldouble.c (__gcc_qadd): Fix add
of normal number and qNaN to not raise an inexact exception.

gcc/testsuite/ChangeLog:

2013-11-20  Adhemerval Zanella  azane...@linux.vnet.ibm.com

* gcc.target/powerpc/pr57363.c: New test.

--

diff --git a/gcc/testsuite/gcc.target/powerpc/pr57363.c 
b/gcc/testsuite/gcc.target/powerpc/pr57363.c
new file mode 100644
index 000..45ea3f3
--- /dev/null
+++ b/gcc/testsuite/gcc.target/powerpc/pr57363.c
@@ -0,0 +1,19 @@
+/* { dg-do run { target { powerpc*-*-linux* } } } */
+/* { dg-options "-mlong-double-128" } */
+
+/* Check if adding a qNaN and a normal long double does not generate an
+   inexact exception.  */
+
+#define _GNU_SOURCE
+#include <fenv.h>
+
+int main(void)
+{
+  double x = __builtin_nan ("");
+  long double y = 1.1L;
+
+  feenableexcept (FE_INEXACT);
+  feclearexcept (FE_ALL_EXCEPT);
+  x = x + y;
+  return fetestexcept(FE_INEXACT);
+}
diff --git a/libgcc/config/rs6000/ibm-ldouble.c 
b/libgcc/config/rs6000/ibm-ldouble.c
index 28e02e9..7ca900c 100644
--- a/libgcc/config/rs6000/ibm-ldouble.c
+++ b/libgcc/config/rs6000/ibm-ldouble.c
@@ -104,6 +104,8 @@ __gcc_qadd (double a, double aa, double c, double cc)
 
   if (nonfinite (z))
 {
+  if (fabs (z) != inf())
+   return z;
   z = cc + aa + c + a;
   if (nonfinite (z))
return z;



Re: [PATCH, PR 57363] IBM long double: adding qNaN and number raises inexact exception

2013-11-20 Thread Adhemerval Zanella
On 20-11-2013 14:23, David Edelsohn wrote:
 On Wed, Nov 20, 2013 at 9:11 AM, Adhemerval Zanella
 azane...@linux.vnet.ibm.com wrote:

 libgcc/ChangeLog:

 2013-11-20  Adhemerval Zanella  azane...@linux.vnet.ibm.com

 * config/rs6000/ibm-ldouble.c (__gcc_qadd): Fix add
 of normal number and qNaN to not raise an inexact exception.

 gcc/testsuite/ChangeLog:

 2013-11-20  Adhemerval Zanella  azane...@linux.vnet.ibm.com

 * gcc.target/powerpc/pr57363.c: New test.
 This is okay.  And the patch is small enough that it does not need a
 copyright assignment for GCC.

 Do you need someone to commit the patch for you?

 Thanks, David

Yes I do need, thanks.



Re: [PATCH, PR 57363] IBM long double: adding qNaN and number raises inexact exception

2013-11-18 Thread Adhemerval Zanella
On 15-11-2013 19:54, Ulrich Weigand wrote:
 Should this be qNaN instead of sNaN here?

Yes, indeed.


 Also, since you already have a test case, I think it would be good to add it
 to the GCC test suite ...

 Otherwise, this looks reasonable to me (but I cannot approve the patch):

 Bye,
 Ulrich

Here it is with the updated ChangeLog.  Tested on PPC32 and PPC64.  The
testcase fails at runtime without the fix with a floating-point exception.

---

2013-11-15  Adhemerval Zanella  azane...@linux.vnet.ibm.com

* libgcc/config/rs6000/ibm-ldouble.c (__gcc_qadd): Fix add
of normal number and qNaN to not raise an inexact exception.
* gcc/testsuite/gcc.target/powerpc/pr57363.c: New file: check for
PR#57363.

--

diff --git a/gcc/testsuite/gcc.target/powerpc/pr57363.c 
b/gcc/testsuite/gcc.target/powerpc/pr57363.c
new file mode 100644
index 000..cc9b1d7
--- /dev/null
+++ b/gcc/testsuite/gcc.target/powerpc/pr57363.c
@@ -0,0 +1,19 @@
+/* { dg-do run { target { powerpc*-*-linux* } } } */
+/* { dg-options "-mlong-double-128" } */
+
+/* Check if adding a sNAN and a normal long double does not generate an
+   inexact exception.  */
+
+#define _GNU_SOURCE
+#include <fenv.h>
+
+int main(void)
+{
+  double x = __builtin_nan ("");
+  long double y = 1.1L;
+
+  feenableexcept (FE_INEXACT);
+  feclearexcept (FE_ALL_EXCEPT);
+  x = x + y;
+  return fetestexcept(FE_INEXACT);
+}
diff --git a/libgcc/config/rs6000/ibm-ldouble.c 
b/libgcc/config/rs6000/ibm-ldouble.c
index 28e02e9..7ca900c 100644
--- a/libgcc/config/rs6000/ibm-ldouble.c
+++ b/libgcc/config/rs6000/ibm-ldouble.c
@@ -104,6 +104,8 @@ __gcc_qadd (double a, double aa, double c, double cc)
 
   if (nonfinite (z))
 {
+  if (fabs (z) != inf())
+   return z;
   z = cc + aa + c + a;
   if (nonfinite (z))
return z;




Re: [PATCH, PR 57363] IBM long double: adding qNaN and number raises inexact exception

2013-11-18 Thread Adhemerval Zanella
On 18-11-2013 18:10, Ulrich Weigand wrote:
 Adhemerval Zanella wrote:

 2013-11-15  Adhemerval Zanella  azane...@linux.vnet.ibm.com
 
 	* libgcc/config/rs6000/ibm-ldouble.c (__gcc_qadd): Fix add of normal
 	number and qNaN to not raise an inexact exception.
 	* gcc/testsuite/gcc.target/powerpc/pr57363.c: New file: check for
 	PR#57363.

 This should go into two separate ChangeLog files, one in libgcc/ChangeLog:

   * config/rs6000/ibm-ldouble.c (__gcc_qadd): Fix add of normal number
   and qNaN to not raise an inexact exception.

 and one in gcc/testsuite/ChangeLog:

   * gcc.target/powerpc/pr57363.c: New test.

Ok, thanks for spotting it.


 +/* Check if adding a sNAN and a normal long double does not generate an
 +   inexact exception.  */
 qNaN again :-)

Yeah... :(


 Thanks,
 Ulrich




[PATCH, PR 57363] IBM long double: adding qNaN and number raises inexact exception

2013-11-15 Thread Adhemerval Zanella
Hi all,

For IBM long double, adding a normal number to a qNaN raises an inexact
exception.  Adding any number to a qNaN should not raise any exception.  The
following testcase triggers the issue (the testcase is meant to run on a
GNU-compatible libc):

--
$ cat gcc_testcase.c
#include <math.h>
#include <fenv.h>
#include <stdio.h>

double
sum (double x, long double y)
{
  return x + y;
}

int main ()
{
  feenableexcept (FE_INEXACT);

  double x = __builtin_nan ("");
  long double y = 1.1L;

  printf ("%e\n", sum (x, y));
  
  return 0;
}
$ gcc -O3 -m64 -fno-inline gcc_testcase.c -o gcc_testcase -lm
$ ./gcc_testcase 
Floating point exception (core dumped)
--

The issue is in the __gcc_qadd implementation at
libgcc/config/rs6000/ibm-ldouble.c:
if the number is non-finite, there is no check whether it is a NaN before
actually summing all the components.  A possible solution is to add an extra
test and return the first sum if the number is not infinity.
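
Concretely, for x = qNaN and y = 1.1L the call is effectively
__gcc_qadd (NaN, 0.0, hi, lo), where hi/lo are the two doubles that make up
1.1L; a rough trace of the current code path:

  z = a + c;            /* NaN + hi -> qNaN, no exception raised */
  /* nonfinite (z) is true, so the code falls through to: */
  z = cc + aa + c + a;  /* lo + 0.0 + hi is already inexact, so FE_INEXACT
                           is raised before the qNaN is even added in */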

This also fixes two GLIBC math issues I'm seeing:

TEST_ff_f (nexttoward, qnan_value, 1.1L, qnan_value, NO_INEXACT_EXCEPTION),
TEST_ff_f (nexttoward, 1.1L, qnan_value, qnan_value, NO_INEXACT_EXCEPTION),

Fix tested on PPC32 and PPC64.

---

2013-11-15  Adhemerval Zanella  azane...@linux.vnet.ibm.com

* libgcc/config/rs6000/ibm-ldouble.c (__gcc_qadd): Fix add
of normal number and sNaN to not raise an inexact exception.

---

diff --git a/libgcc/config/rs6000/ibm-ldouble.c 
b/libgcc/config/rs6000/ibm-ldouble.c
index 28e02e9..7ca900c 100644
--- a/libgcc/config/rs6000/ibm-ldouble.c
+++ b/libgcc/config/rs6000/ibm-ldouble.c
@@ -104,6 +104,8 @@ __gcc_qadd (double a, double aa, double c, double cc)
 
   if (nonfinite (z))
 {
+  if (fabs (z) != inf())
+   return z;
   z = cc + aa + c + a;
   if (nonfinite (z))
return z;