[Bug target/94154] New: AArch64: Add parameters to tune the precision of reciprocal div

2020-03-12 Thread bule1 at huawei dot com
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=94154

Bug ID: 94154
   Summary: AArch64: Add parameters to tune the precision of
reciprocal div
   Product: gcc
   Version: 10.0
Status: UNCONFIRMED
  Keywords: patch
  Severity: normal
  Priority: P3
 Component: target
  Assignee: unassigned at gcc dot gnu.org
  Reporter: bule1 at huawei dot com
CC: richard.sandiford at arm dot com
  Target Milestone: ---
Target: AARCH64

This report suggests using parameters to control the number of Newton
iterations performed by the reciprocal-division approximation on the aarch64
platform, which is currently hard-coded in aarch64.c.

This can benefit some SPEC2017 fpspeed test cases in peak mode that do not
have a high demand on precision, and it also addresses the downside that users
who enable the reciprocal approximation are forced to accept a fixed, low
precision.
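
For illustration, here is a minimal standalone sketch (plain C, not part of
the patch) of the Newton-Raphson refinement whose iteration count the
hard-coded value controls; each extra iteration roughly doubles the number of
correct bits of the reciprocal estimate:

/* Newton-Raphson refinement of a reciprocal estimate: x' = x * (2 - d * x).
   Each iteration roughly doubles the number of correct bits, so the
   iteration count is a direct precision/latency trade-off. */
#include <stdio.h>

static double recip_newton (double d, double x0, int iterations)
{
  double x = x0;                       /* crude initial estimate */
  for (int i = 0; i < iterations; i++)
    x = x * (2.0 - d * x);             /* one refinement step */
  return x;
}

int main (void)
{
  double d = 3.0, x0 = 0.25;           /* rough guess for 1/3 */
  for (int n = 0; n <= 3; n++)
    printf ("%d iterations: error = %.2e\n", n,
            recip_newton (d, x0, n) - 1.0 / d);
  return 0;
}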

A proposed patch is attached.

[Bug target/94154] AArch64: Add parameters to tune the precision of reciprocal div

2020-03-13 Thread bule1 at huawei dot com
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=94154

Bu Le  changed:

   What|Removed |Added

 Status|UNCONFIRMED |RESOLVED
 Resolution|--- |FIXED

--- Comment #1 from Bu Le  ---
The patch has been reviewed and merged to master by Richard. Fixed and closed.

[Bug tree-optimization/94434] New: [AArch64][SVE] ICE caused by incompatibility of SRA and svst3 builtin-function

2020-04-01 Thread bule1 at huawei dot com
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=94434

Bug ID: 94434
   Summary: [AArch64][SVE] ICE caused by incompatibility of SRA
and svst3 builtin-function
   Product: gcc
   Version: 10.0
Status: UNCONFIRMED
  Severity: normal
  Priority: P3
 Component: tree-optimization
  Assignee: unassigned at gcc dot gnu.org
  Reporter: bule1 at huawei dot com
CC: mjambor at suse dot cz
  Target Milestone: ---
Target: aarch64

Created attachment 48154
  --> https://gcc.gnu.org/bugzilla/attachment.cgi?id=48154&action=edit
patch for the problem

test case: gcc/testsuite/gcc.target/aarch64/sve/acle/asm/st2_bf16.c

Command line: gcc st2_bf16.c -march=armv8.2-a+sve -msve-vector-bits=256 -O2
-fno-diagnostics-show-caret -fno-diagnostics-show-line-numbers
-fdiagnostics-color=never -fdiagnostics-urls=never  -DTEST_OVERLOADS
-fno-ipa-icf -c -o st2_bf16.o
during IPA pass: sra
st2_bf16.c: In function ‘st2_vnum_bf16_x1’:
st2_bf16.c:198:1: internal compiler error: Segmentation fault
0xc995b3 crash_signal
../.././gcc/toplev.c:328
0xa34f68 hash_map, isra_call_summary*,
simple_hashmap_traits >,
isra_call_summary*> >::get_or_insert(int const&, bool*)
../.././gcc/hash-map.h:194
0xa34f68 call_summary::get_create(cgraph_edge*)
../.././gcc/symbol-summary.h:642
0xa34f68 record_nonregister_call_use
../.././gcc/ipa-sra.c:1613
0xa34f68 scan_expr_access
../.././gcc/ipa-sra.c:1781
0xa37627 scan_function
../.././gcc/ipa-sra.c:1880
0xa37627 ipa_sra_summarize_function
../.././gcc/ipa-sra.c:2505
0xa38437 ipa_sra_generate_summary
../.././gcc/ipa-sra.c:2555
0xbb58bb execute_ipa_summary_passes(ipa_opt_pass_d*)
../.././gcc/passes.c:2191
0x7f672f ipa_passes
../.././gcc/cgraphunit.c:2627
0x7f672f symbol_table::compile()
../.././gcc/cgraphunit.c:2737
0x7f89ab symbol_table::compile()
../.././gcc/cgraphunit.c:2717
0x7f89ab symbol_table::finalize_compilation_unit()
../.././gcc/cgraphunit.c:2984
Please submit a full bug report,
with preprocessed source if appropriate.
Please include the complete backtrace with any bug report.
See <https://gcc.gnu.org/bugs/> for instructions.

Similar problems can be found in svst2, svst4, and other functions of this kind.

This problem is caused by the "record_nonregister_call_use" function trying
to access the call-graph edge of an internal call, .MASK_STORE_LANE, which is
a NULL pointer.

The reason we reach "record_nonregister_call_use" is that the caller,
"scan_expr_access", considered the "svbfloat16x3_t z1" argument a valid
candidate for further optimization.

A simple solution is to disqualify the candidate at the "scan_expr_access"
level when the call-graph edge is null, which indicates the call is either an
internal call or a call with no references. In both cases the optimization
should stop before it dereferences a NULL pointer.
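
As a standalone illustration (plain C with made-up types, not the GCC code or
the attached patch), the guard amounts to the following:

/* A call with no call-graph edge (an internal call, or a call with no
   references) must disqualify the argument instead of dereferencing the
   NULL edge, which is what caused the segfault. */
#include <stdio.h>
#include <stddef.h>

struct call_edge { int callee_id; };

/* Stand-in for looking up the call-graph edge of a call statement;
   internal calls such as .MASK_STORE_LANE have no edge.  */
static struct call_edge *edge_for_call (int is_internal_call)
{
  static struct call_edge e = { 1 };
  return is_internal_call ? NULL : &e;
}

/* Stand-in for the candidate scan: returns 1 if the argument remains a
   candidate, 0 if it is disqualified.  */
static int scan_argument (int is_internal_call)
{
  struct call_edge *cs = edge_for_call (is_internal_call);
  if (!cs)
    return 0;                 /* disqualify instead of crashing */
  return cs->callee_id > 0;   /* the real code would record the use here */
}

int main (void)
{
  printf ("regular call  -> candidate kept: %d\n", scan_argument (0));
  printf ("internal call -> candidate kept: %d\n", scan_argument (1));
  return 0;
}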

A proposed patch is attached.

[Bug target/95285] New: AArch64:aarch64 medium code model proposal

2020-05-23 Thread bule1 at huawei dot com
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=95285

Bug ID: 95285
   Summary: AArch64:aarch64 medium code model proposal
   Product: gcc
   Version: 11.0
Status: UNCONFIRMED
  Severity: normal
  Priority: P3
 Component: target
  Assignee: unassigned at gcc dot gnu.org
  Reporter: bule1 at huawei dot com
  Target Milestone: ---

Created attachment 48584
  --> https://gcc.gnu.org/bugzilla/attachment.cgi?id=48584&action=edit
proposed patch

I would like to propose an implementation of the medium code model on aarch64.
A prototype is attached; it passes bootstrap and the regression tests.

-mcmodel=medium is a code model that is missing on the aarch64 architecture
but supported on x86. In this code model, small data is relocated as in the
small code model while large data is relocated as in the large code model. The
medium code model is officially described on page 34 of the x86-64 ABI:
https://refspecs.linuxbase.org/elf/x86_64-abi-0.99.pdf

The key difference between x86 and aarch64 is that x86 can use lea+movabs
instructions to implement a dynamically relocatable large code model.
Currently, the large code model on AArch64 relocates symbols with ldr
instructions, which only supports static linking, whereas the small code model
uses adrp + ldr instructions, which support dynamic linking. Therefore the
medium code model cannot be implemented simply by setting a size threshold; a
dynamically relocatable large code model is needed first in order to build a
functional medium code model.

I met this problem when compiling CESM, a climate-forecast application widely
used in the HPC field. In some configurations, when manipulating large arrays,
a large code model with dynamic relocation is needed. The following case is
abstracted from CESM for this scenario.

program main
 common/baz/a,b,c
 real a,b,c
 b = 1.0
 call foo()
 print*, b
 end

 subroutine foo()
 common/baz/a,b,c
 real a,b,c

 integer, parameter :: nx = 1024
 integer, parameter :: ny = 1024
 integer, parameter :: nz = 1024
 integer, parameter :: nf = 1
 real :: bar(nf,nx*ny*nz)
 real :: bar1(nf,nx*ny*nz)
 bar = 0.0
 bar1 =0.0
 b = bar(1,1024*1024*100)
 b = bar1(1,1)

 return
 end

Compiling with -mcmodel=small -fPIC gives the following errors due to the
access to the bar1 array:
test.f90:(.text+0x28): relocation truncated to fit:
R_AARCH64_ADR_PREL_PG_HI21 against `.bss'
test.f90:(.text+0x6c): relocation truncated to fit:
R_AARCH64_ADR_PREL_PG_HI21 against `.bss'

Compiling with -mcmodel=large -fPIC gives an "unimplemented" error:
f951: sorry, unimplemented: code model ‘large’ with ‘-fPIC’

As discussed at the beginning, to tackle this problem we first have to make
the large code model position-independent. My solution is to use the
R_AARCH64_MOVW_PREL_Gx group relocations together with instructions that
compute the current PC value.

Before the change (mcmodel=small):
    adrp    x0, bar1.2782
    add     x0, x0, :lo12:bar1.2782

After the change (mcmodel=medium, proposed):
    movz    x0, :prel_g3:bar1.2782
    movk    x0, :prel_g2_nc:bar1.2782
    movk    x0, :prel_g1_nc:bar1.2782
    movk    x0, :prel_g0_nc:bar1.2782
    adr     x1, .
    sub     x1, x1, 0x4
    add     x0, x0, x1

The movz and the three movk instructions compute the 64-bit offset between
bar1 and the last movk instruction, which fulfils the 64-bit relocation
requirement of the large code model. The adr + sub pair computes the PC of
that last movk instruction. Adding the offset to this PC locates bar1
dynamically.

Because this relocation sequence is time-consuming, a threshold is used to
classify the size of the data to be relocated, as on x86. The default
threshold is 65536, the maximum relocation capability of the small code model.
This implementation also requires amending the linker in binutils so that the
four mov instructions compute their offsets relative to the PC of the last
movk instruction.

The advantage of this implementation is that it can use existing relocation
types to prototype a medium code model.

The drawbacks of this implementation also exist.
First, the four mov instructions and the adr instruction must stay together in
this exact order; no other instruction may be scheduled into the sequence,
otherwise the computed symbol address is wrong. This may impede instruction
scheduling optimizations.
Secondly, the linker needs corresponding changes so that every mov instruction
computes its value relative to the same PC. For example, in my implementation
the first movz instruction needs 12 added to the result of
":prel_g3:bar1.2782" to make up the PC offset.

I haven't figured out a suitable solution for these problems yet. You are most
welcome to leave suggestions regarding these issues.

[Bug target/95285] AArch64:aarch64 medium code model proposal

2020-05-23 Thread bule1 at huawei dot com
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=95285

--- Comment #1 from Bu Le  ---
Created attachment 48585
  --> https://gcc.gnu.org/bugzilla/attachment.cgi?id=48585&action=edit
patch for binutils

[Bug target/95285] AArch64:aarch64 medium code model proposal

2020-05-27 Thread bule1 at huawei dot com
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=95285

--- Comment #14 from Bu Le  ---
> > Anyway, my point is that the size of a single data object doesn't affect
> > the fact that the medium code model is missing on aarch64 and that aarch64
> > lacks a PIC large code model.
> 
> What is missing is efficient support for >4GB of data, right? How that is
> implemented is a different question - my point is that it does not require a
> new code model. It would be much better if it just worked without users even
> needing to think about code models.
> 
> Also, what is the purpose of a large fpic model? Are there any applications
> that use shared libraries larger than 4GB?

Yes, I understand, and I am grateful for your suggestion. I have to say it is
not a critical problem; after all, most applications work fine with the
current code models.

But there are some cases, like CESM with certain configurations, or my test
case, that cannot be compiled with the current GCC on aarch64. Unfortunately,
applications larger than 4GB are quite common in the HPC field. Meanwhile, x86
and LLVM on aarch64 can compile them, with the medium or large-PIC code model.
That is why I am proposing this. By adding this feature we take a step forward
for the aarch64 GCC compiler, making it more capable and robust.

Does that address your concern?

As for the implementation you suggested, I believe it is a promising plan. I
would like to try to implement it first; it might take weeks of development. I
will see what I can get and will post updates as I make progress.

Thanks for the suggestion again.

[Bug target/95285] AArch64:aarch64 medium code model proposal

2020-05-27 Thread bule1 at huawei dot com
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=95285

--- Comment #10 from Bu Le  ---

> Fortran already has -fstack-arrays to decide between allocating arrays on
> the heap or on the stack.

I tried the flag with my example. -fstack-arrays does not seem to move the
array from .bss to the heap; the problem is still there.

Anyway, my point is that the size of a single data object doesn't affect the
fact that the medium code model is missing on aarch64 and that aarch64 lacks a
PIC large code model.

[Bug target/95285] AArch64:aarch64 medium code model proposal

2020-05-27 Thread bule1 at huawei dot com
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=95285

--- Comment #11 from Bu Le  ---

> You're right, we need an extra add, so it's like this:
> 
> adrp    x0, bar1.2782
> movk  x1, :high32_47:bar1.2782
> add x0, x0, x1
> add x0, x0, :lo12:bar1.2782
> 
> > (By the way, the high32_47 relocation you suggested is the prel_g2 in the
> > official aarch64 ABI release)
> 
> It needs a new relocation because of the ADRP. ADR could be used so the
> existing R_AARCH64_MOVW_PREL_G0-3 work, but then you need 5 instructions.

So you suggest a new relocation type, "high32_47", to calculate the offset
between the ADRP and bar1. Am I right?

> > And in terms of engineering, your idea saves the trouble of modifying the
> > linker to adjust the offsets of the 3 movks, but we would still need a new
> > relocation type for ADRP, because the existing one checks for address
> > overflow and gives the "relocation truncated to fit" error. Therefore both
> > ideas need work in binutils, which makes them roughly equivalent.
> 
> There is relocation 276 (R_AARCH64_ADR_PREL_PG_HI21_NC).

Yes, though we would still need to change the compiler so that, for the medium
code model, ADRP uses the R_AARCH64_ADR_PREL_PG_HI21_NC relocation.

[Bug target/95285] AArch64:aarch64 medium code model proposal

2020-05-26 Thread bule1 at huawei dot com
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=95285

--- Comment #3 from Bu Le  ---
(In reply to Wilco from comment #2)

> Is the main usage scenario huge arrays? If so, these could easily be
> allocated via malloc at startup rather than using bss. It means an extra
> indirection in some cases (to load the pointer), but it should be much more
> efficient than using a large code model with all the overheads.

Thanks for the reply. 

The large array is just used to construct the test case; it is not a necessary
condition for this scenario. The common scenario is that a symbol is too far
away for the small code model to reach, which could also result from a large
number of small arrays, structures, etc. Meanwhile, the large code model can
reach the symbol but cannot be position-independent, which causes the problem.

Besides, the code in CESM is quite complicated to restructure around malloc,
which is also not an acceptable option for my customer.

Does that address your concern?

[Bug target/95285] AArch64:aarch64 medium code model proposal

2020-05-27 Thread bule1 at huawei dot com
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=95285

--- Comment #6 from Bu Le  ---
(In reply to Wilco from comment #4)
> (In reply to Bu Le from comment #3)
> > (In reply to Wilco from comment #2)

> Well the question is whether we're talking about more than 4GB of code or
> more than 4GB of data. With >4GB code you're indeed stuck with the large
> model. With data it is feasible to automatically use malloc for arrays when
> larger than a certain size, so there is no need to change the application at
> all. Something like that could be the default in the small model so that you
> don't have any extra overhead unless you have huge arrays. Making the
> threshold configurable means you can tune it for a specific application.


Is this automatic malloc already available on some target? I haven't found an
example that works that way. Would you mind providing an example?
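
For reference, my reading of the suggestion is a transformation along these
lines (a hand-written C sketch only; as far as I know no target performs it
automatically today):

/* A huge zero-initialized array is replaced by a pointer that is allocated
   at startup, at the cost of one extra indirection per access.  In the real
   scenario the data would be larger than 4GB. */
#include <stdlib.h>

#define N (256L * 1024 * 1024)          /* element count, example size */

/* Before: static float bar[N];  -- .bss must be reachable by the code model */
static float *bar;                      /* after: a small, always-reachable pointer */

__attribute__((constructor))
static void alloc_bar (void)
{
  bar = calloc (N, sizeof *bar);        /* calloc zeroes it, like .bss */
}

float get (long i)
{
  return bar[i];                        /* one extra load for the pointer */
}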

[Bug target/95285] AArch64:aarch64 medium code model proposal

2020-05-27 Thread bule1 at huawei dot com
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=95285

--- Comment #7 from Bu Le  ---
(In reply to Wilco from comment #5)
> (In reply to Bu Le from comment #0)
> 
> Also it would be much more efficient to have a relocation like this if you
> wanted a 48-bit PC-relative offset:
> 
> adrp    x0, bar1.2782
> add     x0, x0, :lo12:bar1.2782
> movk    x0, :high32_47:bar1.2782

I am afraid that putting the PC-relative offset into x0 is not correct,
because x0 is supposed to hold the final address of bar1 rather than a PC
offset. Therefore an extra register is needed to hold the offset temporarily.
Later, we need to add the PC of the movk to the offset to compute bits 32:47
of the final address of bar1, and finally add that to x0 to form the complete
48-bit address. So the code should be the following sequence:

    adrp    x0, bar1.2782
    add     x0, x0, :lo12:bar1.2782  // x0 now holds bits 0:31 of the final address
    movk    x4, :prel_g2:bar1.2782
    adr     x1, .
    sub     x1, x1, 0x4
    add     x4, x4, x1               // x4 now holds bits 32:47 of the final address
    add     x0, x4, x0

(By the way, the high32_47 relocation you suggested is prel_g2 in the official
aarch64 ABI release.)

So actually, if we just want a 48-bit PC-relative relocation, your idea and
mine both need 6-7 instructions to reach the symbol, so in terms of efficiency
they would be similar.

In terms of engineering, your idea saves the trouble of modifying the linker
to adjust the offsets of the 3 movks, but we would still need a new relocation
type for ADRP, because the existing one checks for address overflow and gives
the "relocation truncated to fit" error. Therefore both ideas need work in
binutils, which makes them roughly equivalent.

[Bug target/96366] [AArch64] ICE due to lack of support for VNx2SI sub instruction

2020-08-03 Thread bule1 at huawei dot com
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=96366

--- Comment #5 from Bu Le  ---
(In reply to rsand...@gcc.gnu.org from comment #3)
> (In reply to Bu Le from comment #2)
> > (In reply to rsand...@gcc.gnu.org from comment #1)
> > > (In reply to Bu Le from comment #0)

> Generating a subtraction out of an addition seemed odd since
> canonicalisations usually go the other way.  But if the target
> says it supports negation too then that changes things.  It doesn't
> make much sense to support addition and negation but not subtraction.

If some mode or target does not have a subtraction pattern, should we let the
compiler try addition plus negation before it runs into an ICE? If so, the
change to optabs seems reasonable as well.


(In reply to rsand...@gcc.gnu.org from comment #4)
> Fixed by g:9623f61b142174b87760c81f78928dd14af7cbc6.
> 
> As far as I know, only GCC 11 needs the fix, but we can backport
> to GCC 10 as well if we find a testcase that needs it.

Sure, I will have a try to see whether the problem also exists in GCC 10.

[Bug fortran/96030] New: AArch64: Add an option to control 64bits simdclone of math functions for fortran

2020-07-02 Thread bule1 at huawei dot com
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=96030

Bug ID: 96030
   Summary: AArch64: Add an option to control 64bits simdclone of
math functions for fortran
   Product: gcc
   Version: 11.0
Status: UNCONFIRMED
  Severity: normal
  Priority: P3
 Component: fortran
  Assignee: unassigned at gcc dot gnu.org
  Reporter: bule1 at huawei dot com
  Target Milestone: ---
Target: AARCH64

Created attachment 48824
  --> https://gcc.gnu.org/bugzilla/attachment.cgi?id=48824&action=edit
patch for the problem

Hi,

I found that Fortran can currently enable SIMD clones of math functions only
through the "!GCC$ builtin (exp) attributes simd" directive. This kind of
declaration does not seem to be able to indicate whether the simdlen is 64
bits or 128 bits; from reading the source code and from some tests, I believe
SIMD clones for both modes are generated from a single directive. At present
the vector math libraries for aarch64 (mathlib and sleef) do not provide the
64-bit-mode functions. So when I enable SIMD math on an application and there
is an opportunity for a 64-bit-mode SIMD clone, there is no matching
math-library call, which leads to a link-time error.
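
For comparison, the C-level equivalent of that directive looks roughly like
the following sketch (assuming -fopenmp-simd; the exact vector-function names
come from the AArch64 vector function ABI):

/* Declaring SIMD variants of cosf: the vectorizer may then call either the
   128-bit variant (_ZGVnN4v_cosf) or the 64-bit variant (_ZGVnN2v_cosf).
   If the vector math library only ships the former, any use of the latter
   gives the link-time error described above. */
#include <math.h>

#pragma omp declare simd simdlen(4) notinbranch
#pragma omp declare simd simdlen(2) notinbranch
extern float cosf (float);

void
vec_cos (const float *restrict a, float *restrict b, int n)
{
  for (int i = 0; i < n; i++)
    b[i] = cosf (a[i]);   /* may be vectorized to _ZGVnN{2,4}v_cosf */
}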

For now, to solve this problem, I added a new backend option, -msimdmath-64,
to control the generation of the 64-bit-mode SIMD clones; it is disabled by
default. The patch is attached.

I think it is reasonable to have a switch to control the generation of the
64-bit-mode SIMD clones. Do we have something like this already?

If not, is this the right way to go?

Thanks.

[Bug fortran/96030] AArch64: Add an option to control 64bits simdclone of math functions for fortran

2020-07-07 Thread bule1 at huawei dot com
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=96030

--- Comment #3 from Bu Le  ---
(In reply to Jakub Jelinek from comment #1)
> The directive should be doing what
> #pragma omp declare simd
> does on the target and it is an ABI decision what exactly it does.

Hi, I am still confused about your comment. Would you mind explaining more?

[Bug fortran/96030] AArch64: Add an option to control 64bits simdclone of math functions for fortran

2020-07-02 Thread bule1 at huawei dot com
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=96030

--- Comment #2 from Bu Le  ---
(In reply to Jakub Jelinek from comment #1)
> The directive should be doing what
> #pragma omp declare simd
> does on the target and it is an ABI decision what exactly it does.

I tried this test case, but I haven't found a way to prevent the V2SF 64-bit
SIMD clone (e.g. _ZGVnN2v_cosf) from being generated.

!GCC$ builtin (cosf) attributes simd (notinbranch)

subroutine test_cos(a_cos, b_cos, is_cos, ie_cos)
  integer, intent(in) :: is_cos, ie_cos
  REAL(4), dimension(is_cos:ie_cos), intent(inout) :: a_cos, b_cos
  do i = 1, 3
  b_cos(i) = cos(a_cos(i))
  enddo

end subroutine test_cos


Are you suggesting that we already have a way to prevent the generation of
_ZGVnN2v_cosf with the !GCC$ builtin directive? Or that we haven't solved the
problem yet, but the correct way to solve it is to implement more features for
the !GCC$ builtin directive?

[Bug libquadmath/96016] New: AArch64: enable libquadmath

2020-07-01 Thread bule1 at huawei dot com
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=96016

Bug ID: 96016
   Summary: AArch64: enable libquadmath
   Product: gcc
   Version: 11.0
Status: UNCONFIRMED
  Severity: normal
  Priority: P3
 Component: libquadmath
  Assignee: unassigned at gcc dot gnu.org
  Reporter: bule1 at huawei dot com
  Target Milestone: ---
Target: AARCH64

Created attachment 48815
  --> https://gcc.gnu.org/bugzilla/attachment.cgi?id=48815&action=edit
patch for enabling libquadmath on aarch64

Hi

I would like to propose a way to enable libquadmath on aarch64.

Currently aarch64 supports quad-precision floating point with the type "long
double". However, libquadmath will not build even if we specify
--enable-quadmath at configure time, because the build checks whether the
target supports the type __float128 and exits if the answer is no.

According to the Arm ABI (https://c9x.me/compile/bib/abi-arm64.pdf) and some
test cases I tried, long double on aarch64 is equivalent to __float128 on x86.
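
A quick standalone check of this on any given target (not part of the patch):

/* long double is IEEE binary128 if it has a 113-bit significand and a
   15-bit exponent; on aarch64 this should print 16 / 113 / 16384. */
#include <stdio.h>
#include <float.h>

int main (void)
{
  printf ("sizeof (long double) = %zu\n", sizeof (long double));
  printf ("LDBL_MANT_DIG = %d (113 for binary128)\n", LDBL_MANT_DIG);
  printf ("LDBL_MAX_EXP  = %d (16384 for binary128)\n", LDBL_MAX_EXP);
  return 0;
}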

I happened to need a quad-precision math library, so I removed the hard
requirement on detecting the __float128 type. After the change, when the
target is aarch64, a macro redefines long double as __float128, and
libquadmath builds and works successfully on aarch64. Tests have been
conducted with random inputs on aarch64 and x86; the output on aarch64 agrees
with the output on x86.

One minor question about my solution: aarch64 does not have the
__builtin_huge_valq built-in function used to define HUGE_VALQ. I used the
value from the original comment, which clearly states that it may cause a
warning, and it did during the build. I haven't found where and how aarch64
defines these built-in constant values. Any comment on this issue?

The patch is attached. You are welcome to check it and comment. Is it OK for
trunk?

[Bug target/96366] New: [AArch64] ICE due to lack of support for VNx2SI sub instruction

2020-07-29 Thread bule1 at huawei dot com
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=96366

Bug ID: 96366
   Summary: [AArch64] ICE due to lack of support for VNx2SI sub
instruction
   Product: gcc
   Version: 11.0
Status: UNCONFIRMED
  Severity: normal
  Priority: P3
 Component: target
  Assignee: unassigned at gcc dot gnu.org
  Reporter: bule1 at huawei dot com
CC: richard.sandiford at arm dot com
  Target Milestone: ---

Created attachment 48950
  --> https://gcc.gnu.org/bugzilla/attachment.cgi?id=48950&action=edit
preprocessed source code for reproducing the problem

Hi, 

The test case bb-slp-20.c in the GCC testsuite causes an ICE in the expand
pass because GCC lacks a pattern for subtraction in VNx2SI mode.

The preprocessed file is attached; the problem is triggered when it is
compiled with the -march=armv8.5-a+sve -msve-vector-bits=256 -O3
-fno-tree-forwprop options.

From the debug information, the error occurs because a vectorized subtraction
gimple statement with VNx2SI mode cannot find a matching pattern during the
expand pass.

I tried extending the mode iterator of this pattern from SVE_FULL_I to SVE_I
as follows, which solves the problem.

diff -Nurp a/gcc/config/aarch64/aarch64-sve.md
b/gcc/config/aarch64/aarch64-sve.md
--- a/gcc/config/aarch64/aarch64-sve.md 2020-07-29 15:54:39.36000 +0800
+++ b/gcc/config/aarch64/aarch64-sve.md 2020-07-29 14:37:21.93200 +0800
@@ -3644,10 +3644,10 @@
 ;; -

 (define_insn "sub<mode>3"
-  [(set (match_operand:SVE_FULL_I 0 "register_operand" "=w, w, ?&w")
-   (minus:SVE_FULL_I
- (match_operand:SVE_FULL_I 1 "aarch64_sve_arith_operand" "w, vsa, vsa")
- (match_operand:SVE_FULL_I 2 "register_operand" "w, 0, w")))]
+  [(set (match_operand:SVE_I 0 "register_operand" "=w, w, ?&w")
+   (minus:SVE_I
+ (match_operand:SVE_I 1 "aarch64_sve_arith_operand" "w, vsa, vsa")
+ (match_operand:SVE_I 2 "register_operand" "w, 0, w")))]
   "TARGET_SVE"
   "@
 sub\t%0.<Vetype>, %1.<Vetype>, %2.<Vetype>

I noticed that this mode iterator was changed from SVE_I to SVE_FULL_I in
November 2019 by Richard to support partial SVE vectors. However, in the
follow-up patches the addition pattern was extended from SVE_FULL_I back to
SVE_I while the subtraction pattern was not. Is there any specific reason why
this pattern is not supported?

Thanks.

[Bug target/96366] [AArch64] ICE due to lack of support for VNx2SI sub instruction

2020-07-30 Thread bule1 at huawei dot com
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=96366

--- Comment #2 from Bu Le  ---
(In reply to rsand...@gcc.gnu.org from comment #1)
> (In reply to Bu Le from comment #0)
> Hmm.  In general, the lack of a vector pattern shouldn't cause ICEs,
> but I suppose the add/sub pairing is somewhat special because of
> the canonicalisation rules.  It would be worth looking at exactly
> why we generate the subtract though, just to confirm that this is
> an “expected” ICE rather than a symptom of a deeper problem.

Sure. The subtraction is expanded at expr.c:8989, before which I believe
everything still works fine. The gimple statement to be expanded is

vect__5.16_77 = { 4294967273, 4294967154, 4294967294, 4294967265 } -
vect__1.14_73

As the expansion proceeds, it enters the binop path at expr.c:9948 because op1
(vect__1.14_73) is not a constant operand, and so it misses the opportunity to
be turned into an add of a negated value. The routine then calls expand_binop
to finish the subtraction. expand_binop also has an opportunity to turn the
subtraction into an addition of a negated operand, but misses it for the same
reason: op1 is not a constant.

It occurs to me that we could add a check for the availability of this pattern
to the condition that decides whether to turn the subtraction into an addition
of the negated operand. This would act as a safety net for similar cases where
the pattern is missing, preventing the ICE. I tried the following change,
which also solves the problem by turning the subtraction into an addition as
expected.

diff -Nurp gcc-20200728-org/gcc/optabs.c gcc-20200728/gcc/optabs.c
--- gcc-20200728-org/gcc/optabs.c   2020-07-29 15:53:52.76000 +0800
+++ gcc-20200728/gcc/optabs.c   2020-07-30 11:00:00.96400 +0800
@@ -1171,10 +1171,12 @@ expand_binop (machine_mode mode, optab b

   mclass = GET_MODE_CLASS (mode);

-  /* If subtracting an integer constant, convert this into an addition of
- the negated constant.  */
+  /* If subtracting an integer constant, or if no subtraction pattern available
+ for this mode, convert this into an addition of the negated constant.  */

-  if (binoptab == sub_optab && CONST_INT_P (op1))
+  if (binoptab == sub_optab
+  && (CONST_INT_P (op1)
+  || optab_handler (binoptab, mode) == CODE_FOR_nothing))
 {
   op1 = negate_rtx (mode, op1);
   binoptab = add_optab;


> The idea was for that patch to add the bare minimum needed
> to support the “unpacked vector” infrastructure.  Then, once the
> infrastructure was in place, we could add support for other
> unpacked vector operations too.
> 
> However, the infrastructure went in late during the GCC 10
> cycle, so the idea was to postpone any new unpacked vector
> support to GCC 11.  So far the only additional operations
> has been Joe Ramsay's patches for logical operations
> (g:bb3ab62a8b4a108f01ea2eddfe31e9f733bd9cb6 and
> g:6802b5ba8234427598abfd9f0163eb5e7c0d6aa8).
> 
> The reason for not changing many operations at once is that,
> in each case, a decision needs to be made whether the
> operation should use the container mode (as for INDEX),
> the element mode (as for right shifts, once they're
> implemented) or whether it doesn't matter (as for addition).
> Each operation also needs tests.  So from that point of view,
> it's more convenient to have a separate patch for each
> operation (or at least closely-related groups of operations).

Oh, I see. From a performance point of view, I believe adding the subtraction
pattern is necessary eventually. I compiled and ran the attached test case
with the subtraction-pattern solution, and it works fine. Logically,
subtraction should behave the same as addition, which is not sensitive to the
operation mode.

My idea for solving this problem is to upstream the patch for the mode
extension independently, and afterwards upstream the sub-to-add patch as
insurance for other cases that might step into the same routine.

Any suggestions?

[Bug libquadmath/96016] AArch64: enable libquadmath

2020-07-01 Thread bule1 at huawei dot com
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=96016

--- Comment #2 from Bu Le  ---
(In reply to Andrew Pinski from comment #1)
> If long double is 128bit fp already, then glibc has full support of it.  So
> you dont need libquadmath at all.  It is only there if long double is not
> 128bit long double and glibc does not have support for the __float128 type.

Can you elaborate more?

I tried this test case with libm, which gives me an incorrect answer without
enough precision: -f000--3ffd. What library does glibc provide
for quad math? Or maybe I configured libm incorrectly?

#include <stdio.h>
#include <math.h>
int main(void)
{
long double ld = 0;
long double res;
long double pi = acos(-1);
int* i = (int*) &ld;
i[0] = i[1] = i[2] = i[3] = 0xdeadbeef;

ld = pi/6;
res = sin(ld);
printf("sinq-1: %08x-%08x-%08x-%08x\n", i[0], i[1], i[2], i[3]);
}
/* { dg-output "sinq-1: af2139b8-fae7b900--3ffd\n" } */

[Bug libquadmath/96016] AArch64: enable libquadmath

2020-07-01 Thread bule1 at huawei dot com
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=96016

--- Comment #4 from Bu Le  ---
(In reply to Andreas Schwab from comment #3)
> You are computing the sine of (double)ld.  If you want the sine of a long
> double value, you need to use the sinl function, also use acosl(-1) to
> compute pi in long double precision.

Oh, I see. Thanks for the help.
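
For the record, here is the earlier test rewritten as suggested, using the
long double variants acosl/sinl so that the computation really happens in IEEE
binary128 on aarch64 (a corrected sketch, not a dg test):

/* Compute sin(pi/6) entirely in long double (binary128 on aarch64). */
#include <stdio.h>
#include <math.h>

int main (void)
{
  long double pi = acosl (-1.0L);
  long double ld = pi / 6.0L;
  long double res = sinl (ld);
  printf ("sinl(pi/6) = %.36Lf\n", res);   /* should be 0.5 to ~33 digits */
  return 0;
}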