RE: [PATCH PR95254] aarch64: gcc generate inefficient code with fixed sve vector length

2020-06-08 Thread Tamar Christina
Sending to the list as well... not sure why it was stripped from the reply-all.

> -Original Message-
> From: Tamar Christina
> Sent: Monday, June 8, 2020 3:25 PM
> To: Jakub Jelinek ; Tobias Burnus
> 
> Cc: Uros Bizjak ; nd 
> Subject: RE: [PATCH PR95254] aarch64: gcc generate inefficient code with
> fixed sve vector length
> 
> This is a bug with where the C++11 feature test is located as discussed in
> https://gcc.gnu.org/pipermail/gcc-patches/2020-June/547395.html
> A patch was approved but not yet applied to trunk
> https://gcc.gnu.org/pipermail/gcc-patches/2020-June/547463.html
> 
> > -Original Message-
> > From: Gcc-patches  On Behalf Of
> Jakub
> > Jelinek via Gcc-patches
> > Sent: Monday, June 8, 2020 2:57 PM
> > To: Tobias Burnus 
> > Cc: Uros Bizjak ; gcc-patches@gcc.gnu.org
> > Subject: Re: [PATCH PR95254] aarch64: gcc generate inefficient code
> > with fixed sve vector length
> >
> > On Mon, Jun 08, 2020 at 03:49:17PM +0200, Tobias Burnus wrote:
> > > Hi,
> > >
> > > I just observed that this patch causes our *non-bootstrap* builds to
> > > fail; I have not yet looked into this nor tried a bootstrap build.
> > >
> > > gcc-mainline/gcc/expr.c: In function 'rtx_insn* emit_move_insn(rtx, rtx)':
> > > gcc-mainline/gcc/expr.c:3830:3: warning: 'auto' changes meaning in C++11;
> > please remove it [-Wc++0x-compat]
> > >auto candidate_subreg_p = [&](rtx subreg) {
> > >^
> >
> > That means you must be compiling with -std=gnu++98 rather than
> > -std=gnu++11.
> > Current trunk requires C++11.
> > Make sure this is done in a clean tree, or ./config.status --recheck;
> > ./config.status has been done?
> >
> > Jakub



Re: [PATCH PR95254] aarch64: gcc generate inefficient code with fixed sve vector length

2020-06-08 Thread Jakub Jelinek via Gcc-patches
On Mon, Jun 08, 2020 at 03:49:17PM +0200, Tobias Burnus wrote:
> Hi,
> 
> I just observed that this patch causes our *non-bootstrap* builds
> to fail; I have not yet looked into this nor tried a bootstrap
> build.
> 
> gcc-mainline/gcc/expr.c: In function 'rtx_insn* emit_move_insn(rtx, rtx)':
> gcc-mainline/gcc/expr.c:3830:3: warning: 'auto' changes meaning in C++11; 
> please remove it [-Wc++0x-compat]
>auto candidate_subreg_p = [&](rtx subreg) {
>^

That means you must be compiling with -std=gnu++98 rather than -std=gnu++11.
Current trunk requires C++11.
Make sure this is done in a clean tree, or ./config.status --recheck;
./config.status has been done?

Jakub
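For readers puzzled by the diagnostic: in C++98, 'auto' is only a
storage-class specifier, so a declaration whose type is just 'auto' does not
parse.  A minimal illustration, separate from the patch itself:

// auto-demo.cc: accepted with -std=gnu++11, rejected with -std=gnu++98.
int main ()
{
  // C++11: the type of 'x' is deduced as int.
  // C++98: 'auto' is a (redundant) storage-class specifier, so this
  // declaration has no type and is rejected; GCC additionally warns via
  // -Wc++0x-compat that 'auto' changes meaning in C++11.
  auto x = 42;
  return x - 42;
}

The expr.c lambdas quoted above fail in exactly this way when the tree is
configured for -std=gnu++98.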



Re: [PATCH PR95254] aarch64: gcc generate inefficient code with fixed sve vector length

2020-06-08 Thread Tobias Burnus

Forgot to mention that this was an accelerator build:

-DHOST_MACHINE=\"x86_64-pc-linux-gnu\" -DTARGET_MACHINE=\"nvptx-unknown-none\"

which failed.

Tobias

On 6/8/20 3:49 PM, Tobias Burnus wrote:


Hi,

I just observed that this patch causes our *non-bootstrap* builds
to fail; I have not yet looked into this nor tried a bootstrap
build.

gcc-mainline/gcc/expr.c: In function 'rtx_insn* emit_move_insn(rtx, rtx)':
gcc-mainline/gcc/expr.c:3830:3: warning: 'auto' changes meaning in C++11; please remove it [-Wc++0x-compat]
   auto candidate_subreg_p = [&](rtx subreg) {
   ^
gcc-mainline/gcc/expr.c:3830:8: error: 'candidate_subreg_p' does not name a type
   auto candidate_subreg_p = [&](rtx subreg) {
^
gcc-mainline/gcc/expr.c:3838:3: warning: 'auto' changes meaning in C++11; please remove it [-Wc++0x-compat]
   auto candidate_mem_p = [&](machine_mode innermode, rtx mem) {
   ^
gcc-mainline/gcc/expr.c:3838:8: error: 'candidate_mem_p' does not name a type
   auto candidate_mem_p = [&](machine_mode innermode, rtx mem) {
^
gcc-mainline/gcc/expr.c:3849:44: error: 'candidate_subreg_p' was not declared in this scope
   if (SUBREG_P (x) && candidate_subreg_p (x))
^
gcc-mainline/gcc/expr.c:3852:44: error: 'candidate_subreg_p' was not declared in this scope
   if (SUBREG_P (y) && candidate_subreg_p (y))
^
gcc-mainline/gcc/expr.c:3866:46: error: 'candidate_mem_p' was not declared in this scope
 && candidate_mem_p (GET_MODE (x_inner), y))

Tobias

On 6/2/20 4:44 AM, Yangfei (Felix) wrote:


Hi,


-Original Message-
From: Richard Sandiford [mailto:richard.sandif...@arm.com]
Sent: Monday, June 1, 2020 4:47 PM
To: Yangfei (Felix) 
Cc: gcc-patches@gcc.gnu.org; Uros Bizjak ; Jakub
Jelinek ; Hongtao Liu ; H.J. Lu

Subject: Re: [PATCH PR95254] aarch64: gcc generate inefficient code
with
fixed sve vector length

Snip...

Sounds good.  Maybe at this point the x_inner and y_inner code is
getting
complicated enough to put into a lambda too:

   x_inner = ... (x);
   y_inner = ... (y);

Just a suggestion though.

Yes, that's a good suggestion.  I see the code becomes cleaner
with another lambda.

Yeah, looks good.

Formatting nit though: multi-line conditions should be wrapped in
(...),
i.e.:

 return (...
 && ...
 && ...);


Done.  v6 patch is based on trunk 20200601.
Bootstrapped and tested on aarch64-linux-gnu.
Also bootstrapped on x86-64-linux-gnu with --enable-multilib (for
building -m32 x86 libgcc).
Regression test on x86-64-linux-gnu looks good except for the
following failures, which have been confirmed by x86 devs:


FAIL: gcc.target/i386/avx512f-vcvtps2ph-2.c (test for excess errors)
UNRESOLVED: gcc.target/i386/avx512f-vcvtps2ph-2.c compilation failed
to produce executable

154803c154803

Thanks,
Felix




Re: [PATCH PR95254] aarch64: gcc generate inefficient code with fixed sve vector length

2020-06-08 Thread Tobias Burnus

Hi,

I just observed that this patch causes our *non-bootstrap* builds
to fail; I have not yet looked into this nor tried a bootstrap
build.

gcc-mainline/gcc/expr.c: In function 'rtx_insn* emit_move_insn(rtx, rtx)':
gcc-mainline/gcc/expr.c:3830:3: warning: 'auto' changes meaning in C++11; please remove it [-Wc++0x-compat]
   auto candidate_subreg_p = [&](rtx subreg) {
   ^
gcc-mainline/gcc/expr.c:3830:8: error: 'candidate_subreg_p' does not name a type
   auto candidate_subreg_p = [&](rtx subreg) {
^
gcc-mainline/gcc/expr.c:3838:3: warning: 'auto' changes meaning in C++11; please remove it [-Wc++0x-compat]
   auto candidate_mem_p = [&](machine_mode innermode, rtx mem) {
   ^
gcc-mainline/gcc/expr.c:3838:8: error: 'candidate_mem_p' does not name a type
   auto candidate_mem_p = [&](machine_mode innermode, rtx mem) {
^
gcc-mainline/gcc/expr.c:3849:44: error: 'candidate_subreg_p' was not declared in this scope
   if (SUBREG_P (x) && candidate_subreg_p (x))
^
gcc-mainline/gcc/expr.c:3852:44: error: 'candidate_subreg_p' was not declared in this scope
   if (SUBREG_P (y) && candidate_subreg_p (y))
^
gcc-mainline/gcc/expr.c:3866:46: error: 'candidate_mem_p' was not declared in this scope
 && candidate_mem_p (GET_MODE (x_inner), y))

Tobias

On 6/2/20 4:44 AM, Yangfei (Felix) wrote:


Hi,


-Original Message-
From: Richard Sandiford [mailto:richard.sandif...@arm.com]
Sent: Monday, June 1, 2020 4:47 PM
To: Yangfei (Felix) 
Cc: gcc-patches@gcc.gnu.org; Uros Bizjak ; Jakub
Jelinek ; Hongtao Liu ; H.J. Lu

Subject: Re: [PATCH PR95254] aarch64: gcc generate inefficient code with
fixed sve vector length

Snip...


Sounds good.  Maybe at this point the x_inner and y_inner code is getting
complicated enough to put into a lambda too:

   x_inner = ... (x);
   y_inner = ... (y);

Just a suggestion though.

Yes, that's a good suggestion.  I see the code becomes cleaner with 
another lambda.


Yeah, looks good.

Formatting nit though: multi-line conditions should be wrapped in (...),
i.e.:

 return (...
 && ...
 && ...);


Done.  v6 patch is based on trunk 20200601.
Bootstrapped and tested on aarch64-linux-gnu.
Also bootstrapped on x86-64-linux-gnu with --enable-multilib (for building -m32 
x86 libgcc).
Regression test on x86-64-linux-gnu looks good except for the following
failures, which have been confirmed by x86 devs:


FAIL: gcc.target/i386/avx512f-vcvtps2ph-2.c (test for excess errors)
UNRESOLVED: gcc.target/i386/avx512f-vcvtps2ph-2.c compilation failed to produce 
executable

154803c154803

Thanks,
Felix




Re: [PATCH PR95254] aarch64: gcc generate inefficient code with fixed sve vector length

2020-06-05 Thread Richard Sandiford
Richard Sandiford  writes:
> "Yangfei (Felix)"  writes:
>> Hi,
>>
>>> -Original Message-
>>> From: Richard Sandiford [mailto:richard.sandif...@arm.com]
>>> Sent: Monday, June 1, 2020 4:47 PM
>>> To: Yangfei (Felix) 
>>> Cc: gcc-patches@gcc.gnu.org; Uros Bizjak ; Jakub
>>> Jelinek ; Hongtao Liu ; H.J. Lu
>>> 
>>> Subject: Re: [PATCH PR95254] aarch64: gcc generate inefficient code with
>>> fixed sve vector length
>>
>> Snip...
>>  
>>> Sounds good.  Maybe at this point the x_inner and y_inner code is getting
>>> complicated enough to put into a lambda too:
>>> 
>>>   x_inner = ... (x);
>>>   y_inner = ... (y);
>>> 
>>> Just a suggestion though.
>>
>> Yes, that's a good suggestion.  I see the code becomes cleaner with 
>> another lambda.
>>  
>>> Yeah, looks good.
>>> 
>>> Formatting nit though: multi-line conditions should be wrapped in (...),
>>> i.e.:
>>> 
>>> return (...
>>> && ...
>>> && ...);
>>> 
>>
>> Done.  v6 patch is based on trunk 20200601.
>> Bootstrapped and tested on aarch64-linux-gnu. 
>> Also bootstrapped on x86-64-linux-gnu with --enable-multilib (for building 
>> -m32 x86 libgcc).
>> Regression test on x86-64-linux-gnu looks good except for the following 
>> failures, which have been confirmed by x86 devs: 
>>
>>> FAIL: gcc.target/i386/avx512f-vcvtps2ph-2.c (test for excess errors)
>>> UNRESOLVED: gcc.target/i386/avx512f-vcvtps2ph-2.c compilation failed to 
>>> produce executable
>> 154803c154803
>
> Looks good.  (I know I said that last time too :-))  I've also tested
> it on arm-linux-gnueabihf and powerpc64le-linux-gnu without problems.
>
> As before, I'll hold off applying until the AVX512 problem is fixed.

Now pushed.  In summary, the patch has now been tested on:

  aarch64-linux-gnu (with and without SVE)
  arm-linux-gnueabihf
  armeb-eabi
  powerpc64le-linux-gnu
  x86_64-linux-gnu (including -m32)

Hopefully Jeff and Christophe's testers will pick up any other
lurking problems.

Thanks,
Richard
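For readers following along, the subreg check that the committed patch adds
to emit_move_insn looks roughly like this (reconstructed from the fragments
quoted in this thread -- the names candidate_subreg_p, x_inner and y_inner
come from the build-error output above; see the trunk commit for the exact
form):

  auto candidate_subreg_p = [&](rtx subreg) {
    return (REG_P (SUBREG_REG (subreg))
            /* Only consider subregs whose inner and outer modes have
               the same size ...  */
            && known_eq (GET_MODE_SIZE (GET_MODE (SUBREG_REG (subreg))),
                         GET_MODE_SIZE (GET_MODE (subreg)))
            /* ... and whose inner mode has a move pattern, so that the
               subregs are not simply added back by emit_move_insn_1.  */
            && (optab_handler (mov_optab, GET_MODE (SUBREG_REG (subreg)))
                != CODE_FOR_nothing));
  };

  if (SUBREG_P (x) && candidate_subreg_p (x))
    x_inner = SUBREG_REG (x);
  if (SUBREG_P (y) && candidate_subreg_p (y))
    y_inner = SUBREG_REG (y);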


RE: [PATCH PR95254] aarch64: gcc generate inefficient code with fixed sve vector length

2020-06-04 Thread Yangfei (Felix)
Hi,

> -Original Message-
> From: Richard Sandiford [mailto:richard.sandif...@arm.com]
> Sent: Tuesday, June 2, 2020 7:17 PM
> To: Yangfei (Felix) 
> Cc: gcc-patches@gcc.gnu.org; Uros Bizjak ; Jakub
> Jelinek ; Hongtao Liu ; H.J. Lu
> 
> Subject: Re: [PATCH PR95254] aarch64: gcc generate inefficient code with
> fixed sve vector length
>

Snip...
 
> >
> >> FAIL: gcc.target/i386/avx512f-vcvtps2ph-2.c (test for excess errors)
> >> UNRESOLVED: gcc.target/i386/avx512f-vcvtps2ph-2.c compilation failed
> >> to produce executable
> > 154803c154803
> 
> Looks good.  (I know I said that last time too :-))  I've also tested it on
> arm-linux-gnueabihf and powerpc64le-linux-gnu without problems.

Thanks for reviewing and testing the patch  :-)

> As before, I'll hold off applying until the AVX512 problem is fixed.

Looks like the AVX512 problem is fixed with:

https://gcc.gnu.org/git/gitweb.cgi?p=gcc.git;h=43088bb4dadd3d14b6b594c5f9363fe879f3d7f7
 

I'm using: $ runtest --tool gcc i386.exp=avx512f-vcvtps2ph-2.c

Thanks,
Felix


Re: [PATCH PR95254] aarch64: gcc generate inefficient code with fixed sve vector length

2020-06-02 Thread Richard Sandiford
"Yangfei (Felix)"  writes:
> Hi,
>
>> -Original Message-
>> From: Richard Sandiford [mailto:richard.sandif...@arm.com]
>> Sent: Monday, June 1, 2020 4:47 PM
>> To: Yangfei (Felix) 
>> Cc: gcc-patches@gcc.gnu.org; Uros Bizjak ; Jakub
>> Jelinek ; Hongtao Liu ; H.J. Lu
>> 
>> Subject: Re: [PATCH PR95254] aarch64: gcc generate inefficient code with
>> fixed sve vector length
>
> Snip...
>  
>> Sounds good.  Maybe at this point the x_inner and y_inner code is getting
>> complicated enough to put into a lambda too:
>> 
>>   x_inner = ... (x);
>>   y_inner = ... (y);
>> 
>> Just a suggestion though.
>
> Yes, that's a good suggestion.  I see the code becomes cleaner with 
> another lambda.
>  
>> Yeah, looks good.
>> 
>> Formatting nit though: multi-line conditions should be wrapped in (...),
>> i.e.:
>> 
>> return (...
>> && ...
>> && ...);
>> 
>
> Done.  v6 patch is based on trunk 20200601.
> Bootstrapped and tested on aarch64-linux-gnu. 
> Also bootstrapped on x86-64-linux-gnu with --enable-multilib (for building 
> -m32 x86 libgcc).
> Regression test on x86-64-linux-gnu looks good except for the following 
> failures, which have been confirmed by x86 devs: 
>
>> FAIL: gcc.target/i386/avx512f-vcvtps2ph-2.c (test for excess errors)
>> UNRESOLVED: gcc.target/i386/avx512f-vcvtps2ph-2.c compilation failed to 
>> produce executable
> 154803c154803

Looks good.  (I know I said that last time too :-))  I've also tested
it on arm-linux-gnueabihf and powerpc64le-linux-gnu without problems.

As before, I'll hold off applying until the AVX512 problem is fixed.

Thanks,
Richard


RE: [PATCH PR95254] aarch64: gcc generate inefficient code with fixed sve vector length

2020-06-01 Thread Yangfei (Felix)
Hi,

> -Original Message-
> From: Richard Sandiford [mailto:richard.sandif...@arm.com]
> Sent: Monday, June 1, 2020 4:47 PM
> To: Yangfei (Felix) 
> Cc: gcc-patches@gcc.gnu.org; Uros Bizjak ; Jakub
> Jelinek ; Hongtao Liu ; H.J. Lu
> 
> Subject: Re: [PATCH PR95254] aarch64: gcc generate inefficient code with
> fixed sve vector length

Snip...
 
> Sounds good.  Maybe at this point the x_inner and y_inner code is getting
> complicated enough to put into a lambda too:
> 
>   x_inner = ... (x);
>   y_inner = ... (y);
> 
> Just a suggestion though.

Yes, that's a good suggestion.  I see the code becomes cleaner with 
another lambda.
 
> Yeah, looks good.
> 
> Formatting nit though: multi-line conditions should be wrapped in (...),
> i.e.:
> 
> return (...
> && ...
> && ...);
> 

Done.  v6 patch is based on trunk 20200601.
Bootstrapped and tested on aarch64-linux-gnu. 
Also bootstrapped on x86-64-linux-gnu with --enable-multilib (for building -m32 
x86 libgcc).
Regression test on x86-64-linux-gnu looks good except for the following 
failures, which have been confirmed by x86 devs: 

> FAIL: gcc.target/i386/avx512f-vcvtps2ph-2.c (test for excess errors)
> UNRESOLVED: gcc.target/i386/avx512f-vcvtps2ph-2.c compilation failed to 
> produce executable
154803c154803

Thanks,
Felix



pr95254-v6.diff
Description: pr95254-v6.diff


Re: [PATCH PR95254] aarch64: gcc generate inefficient code with fixed sve vector length

2020-06-01 Thread Richard Sandiford
"Yangfei (Felix)"  writes:
> Hi,
>
>> -Original Message-
>> From: Richard Sandiford [mailto:richard.sandif...@arm.com]
>> Sent: Sunday, May 31, 2020 12:01 AM
>> To: Yangfei (Felix) 
>> Cc: gcc-patches@gcc.gnu.org; Uros Bizjak ; Jakub
>> Jelinek ; Hongtao Liu ; H.J. Lu
>> 
>> Subject: Re: [PATCH PR95254] aarch64: gcc generate inefficient code with
>> fixed sve vector length
>> 
>
> Snip...
>
>> >
>> > The v5 patch attached addressed this issue.
>> >
>> > There are two added changes compared with the v4 patch:
>> > 1. In candidate_mem_p, mov_optab for innermode should be available.
>> >  In this case, mov_optab for SDmode is not there and subregs are added
>> back by emit_move_insn_1.  So we won't get the benefit with the patch.
>> 
>> I agree we should have this check.  I think the rule applies to all of the
>> transforms though, not just the mem one, so we should add the check to the
>> register and constant cases too.
>
> OK.  I changed to make this an extra condition for calculating x_inner &
> y_inner.

Sounds good.  Maybe at this point the x_inner and y_inner code is
getting complicated enough to put into a lambda too:

  x_inner = ... (x);
  y_inner = ... (y);

Just a suggestion though.

>> > 2. Instead of using adjust_address, I changed to use adjust_address_nv to
>> avoid emitting the invalid insn 13.
>> > The latter call to validize_mem() in emit_move_insn will take care of the
>> address for us.
>> 
>> The validation performed by validize_mem is the same as that performed by
>> adjust_address, so the only case this should make a difference is for
>> push_operands:
>
> True.
>
>>   /* If X or Y are memory references, verify that their addresses are valid
>>  for the machine.  */
>>   if (MEM_P (x)
>>   && (! memory_address_addr_space_p (GET_MODE (x), XEXP (x, 0),
>>   MEM_ADDR_SPACE (x))
>>&& ! push_operand (x, GET_MODE (x
>> x = validize_mem (x);
>> 
>>   if (MEM_P (y)
>>   && ! memory_address_addr_space_p (GET_MODE (y), XEXP (y, 0),
>>  MEM_ADDR_SPACE (y)))
>> y = validize_mem (y);
>> 
>> So I think the fix is to punt on push_operands instead (and continue to use
>> adjust_address rather than adjust_address_nv).
>
> Not sure if I understand it correctly.
> Do you mean excluding push_operand in candidate_mem_p? Like:
>
>  3830   auto candidate_mem_p = [&](machine_mode innermode, rtx mem) {
>  3831 return !targetm.can_change_mode_class (innermode, GET_MODE (mem), 
> ALL_REGS)
>  3832&& !push_operand (mem, GET_MODE (mem))
>  3833/* Not a candidate if innermode requires too much alignment.  */
>  3834&& (MEM_ALIGN (mem) >= GET_MODE_ALIGNMENT (innermode)
>  3835|| targetm.slow_unaligned_access (GET_MODE (mem),
>  3836  MEM_ALIGN (mem))
>  3837|| !targetm.slow_unaligned_access (innermode, MEM_ALIGN 
> (mem)));
>  3838   };

Yeah, looks good.

Formatting nit though: multi-line conditions should be wrapped in (...),
i.e.:

return (...
&& ...
&& ...);

Thanks,
Richard
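Applying that formatting nit to the lambda quoted above gives roughly the
following (a sketch only -- the committed patch is authoritative):

  auto candidate_mem_p = [&](machine_mode innermode, rtx mem) {
    /* Reject mode changes that round-trip through memory, pushes, and
       MEMs whose alignment the inner mode could not tolerate.  */
    return (!targetm.can_change_mode_class (innermode, GET_MODE (mem),
                                            ALL_REGS)
            && !push_operand (mem, GET_MODE (mem))
            /* Not a candidate if innermode requires too much alignment.  */
            && (MEM_ALIGN (mem) >= GET_MODE_ALIGNMENT (innermode)
                || targetm.slow_unaligned_access (GET_MODE (mem),
                                                  MEM_ALIGN (mem))
                || !targetm.slow_unaligned_access (innermode,
                                                   MEM_ALIGN (mem))));
  };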


RE: [PATCH PR95254] aarch64: gcc generate inefficient code with fixed sve vector length

2020-05-31 Thread Yangfei (Felix)
Hi,

> -Original Message-
> From: Richard Sandiford [mailto:richard.sandif...@arm.com]
> Sent: Sunday, May 31, 2020 12:01 AM
> To: Yangfei (Felix) 
> Cc: gcc-patches@gcc.gnu.org; Uros Bizjak ; Jakub
> Jelinek ; Hongtao Liu ; H.J. Lu
> 
> Subject: Re: [PATCH PR95254] aarch64: gcc generate inefficient code with
> fixed sve vector length
> 

Snip...

> >
> > The v5 patch attached addressed this issue.
> >
> > There are two added changes compared with the v4 patch:
> > 1. In candidate_mem_p, mov_optab for innermode should be available.
> >  In this case, mov_optab for SDmode is not there and subregs are added
> back by emit_move_insn_1.  So we won't get the benefit with the patch.
> 
> I agree we should have this check.  I think the rule applies to all of the
> transforms though, not just the mem one, so we should add the check to the
> register and constant cases too.

OK.  I changed to make this an extra condition for calculating x_inner &
y_inner.

> > 2. Instead of using adjust_address, I changed to use adjust_address_nv to
> avoid emitting the invalid insn 13.
> > The latter call to validize_mem() in emit_move_insn will take care of the
> address for us.
> 
> The validation performed by validize_mem is the same as that performed by
> adjust_address, so the only case this should make a difference is for
> push_operands:

True.

>   /* If X or Y are memory references, verify that their addresses are valid
>  for the machine.  */
>   if (MEM_P (x)
>   && (! memory_address_addr_space_p (GET_MODE (x), XEXP (x, 0),
>MEM_ADDR_SPACE (x))
> && ! push_operand (x, GET_MODE (x
> x = validize_mem (x);
> 
>   if (MEM_P (y)
>   && ! memory_address_addr_space_p (GET_MODE (y), XEXP (y, 0),
>   MEM_ADDR_SPACE (y)))
> y = validize_mem (y);
> 
> So I think the fix is to punt on push_operands instead (and continue to use
> adjust_address rather than adjust_address_nv).

Not sure if I understand it correctly.
Do you mean excluding push_operand in candidate_mem_p? Like:

 3830   auto candidate_mem_p = [&](machine_mode innermode, rtx mem) {
 3831 return !targetm.can_change_mode_class (innermode, GET_MODE (mem), 
ALL_REGS)
 3832&& !push_operand (mem, GET_MODE (mem))
 3833/* Not a candidate if innermode requires too much alignment.  */
 3834&& (MEM_ALIGN (mem) >= GET_MODE_ALIGNMENT (innermode)
 3835|| targetm.slow_unaligned_access (GET_MODE (mem),
 3836  MEM_ALIGN (mem))
 3837|| !targetm.slow_unaligned_access (innermode, MEM_ALIGN 
(mem)));
 3838   };

Thanks,
Felix


Re: [PATCH PR95254] aarch64: gcc generate inefficient code with fixed sve vector length

2020-05-30 Thread Richard Sandiford
"Yangfei (Felix)"  writes:
> Turns out that this ICE triggers under x86 -m32.
>
> Command to reproduce:
> ~/build-gcc/x86_64-pc-linux-gnu/32/libgcc$ gcc  -g -O2 -m32 -O2  -g -O2 
> -DIN_GCC -W -Wall -Wno-narrowing -Wwrite-strings -Wcast-qual 
> -Wstrict-prototypes -Wmissing-prototypes -Wold-style-definition  -isystem 
> ./include   -fpic -mlong-double-80 -DUSE_ELF_SYMVER -fcf-protection -mshstk 
> -g -DIN_LIBGCC2 -fbuilding-libgcc -fno-stack-protector   -fpic 
> -mlong-double-80 -DUSE_ELF_SYMVER -fcf-protection -mshstk -I. -I. 
> -I../../.././gcc -I../../../../gcc-git/libgcc -I../../../../gcc-git/libgcc/. 
> -I../../../../gcc-git/libgcc/../gcc -I../../../../gcc-git/libgcc/../include 
> -I../../../../gcc-git/libgcc/config/libbid -DENABLE_DECIMAL_BID_FORMAT 
> -DHAVE_CC_TLS  -DUSE_TLS -o _isinfd32.o -MT _isinfd32.o -MD -MP -MF 
> _isinfd32.dep -c ../../../../gcc-git/libgcc/config/libbid/_isinfd32.c
>
> Source:
>  28 int
>  29 isinfd32 (_Decimal32 x) {
>  30   int res;
>  31   UINT64 x64;
>  32   union decimal32 ux;
>  33
>  34   ux.d = x;
>  35   x64 = __bid32_to_bid64 (ux.i);
>  36   res = __bid64_isInf (x64);
>  37   return (res);
>  38 }
>
> On the crash site (emit_single_push_insn), we have three insns:
> (gdb) p debug_rtx (prev)
> (insn 12 0 13 (parallel [
> (set (reg/f:SI 7 sp)
> (plus:SI (reg/f:SI 7 sp)
> (const_int -12 [0xfff4])))
> (clobber (reg:CC 17 flags))
> ]) "../../../../gcc-git/libgcc/config/libbid/_isinfd32.c":35:9 -1
>  (expr_list:REG_ARGS_SIZE (const_int 12 [0xc])
> (nil)))
> $1 = void
> (gdb) p debug_rtx (last)
> (insn 14 13 0 (set (mem:SI (reg/f:SI 87) [0  S4 A32])
> (subreg:SI (reg/v:SD 86 [ x ]) 0)) 
> "../../../../gcc-git/libgcc/config/libbid/_isinfd32.c":35:9 -1
>  (nil))
> $2 = void
> (gdb) p debug_rtx (PREV_INSN (last))
> (insn 13 12 14 (set (reg/f:SI 87)
> (pre_dec:SI (reg/f:SI 7 sp))) 
> "../../../../gcc-git/libgcc/config/libbid/_isinfd32.c":35:9 -1
>  (nil))
> $3 = void
>
> Here, insn 13 is invalid. It is emitted by: x = adjust_address (x, GET_MODE 
> (y_inner), 0);
>
> The v5 patch attached addressed this issue.
>
> There are two added changes compared with the v4 patch:
> 1. In candidate_mem_p, mov_optab for innermode should be available. 
>  In this case, mov_optab for SDmode is not there and subregs are added 
> back by emit_move_insn_1.  So we won't get the benefit with the patch. 

I agree we should have this check.  I think the rule applies to all
of the transforms though, not just the mem one, so we should add
the check to the register and constant cases too.

> 2. Instead of using adjust_address, I changed to use adjust_address_nv to 
> avoid emitting the invalid insn 13. 
> The latter call to validize_mem() in emit_move_insn will take care of the 
> address for us. 

The validation performed by validize_mem is the same as that performed
by adjust_address, so the only case this should make a difference is
for push_operands:

  /* If X or Y are memory references, verify that their addresses are valid
 for the machine.  */
  if (MEM_P (x)
  && (! memory_address_addr_space_p (GET_MODE (x), XEXP (x, 0),
 MEM_ADDR_SPACE (x))
  && ! push_operand (x, GET_MODE (x
x = validize_mem (x);

  if (MEM_P (y)
  && ! memory_address_addr_space_p (GET_MODE (y), XEXP (y, 0),
MEM_ADDR_SPACE (y)))
y = validize_mem (y);

So I think the fix is to punt on push_operands instead (and continue
to use adjust_address rather than adjust_address_nv).

Thanks,
Richard


RE: [PATCH PR95254] aarch64: gcc generate inefficient code with fixed sve vector length

2020-05-30 Thread Yangfei (Felix)
Hi,

> -Original Message-
> From: Yangfei (Felix)
> Sent: Friday, May 29, 2020 2:56 PM
> To: 'Hongtao Liu' ; H.J. Lu 
> Cc: gcc-patches@gcc.gnu.org; Uros Bizjak ; Jakub
> Jelinek ; Richard Sandiford
> 
> Subject: RE: [PATCH PR95254] aarch64: gcc generate inefficient code with
> fixed sve vector length

Snip...

> 
> Yes, I tried your configure and reproduced the error.  Thanks for pointing
> this out.
> The patch can pass bootstrap on x86_64 with the following configure options.
> Surprised to see that it failed to build with your configure.

Turns out that this ICE triggers under x86 -m32.

Command to reproduce:
~/build-gcc/x86_64-pc-linux-gnu/32/libgcc$ gcc  -g -O2 -m32 -O2  -g -O2 
-DIN_GCC -W -Wall -Wno-narrowing -Wwrite-strings -Wcast-qual 
-Wstrict-prototypes -Wmissing-prototypes -Wold-style-definition  -isystem 
./include   -fpic -mlong-double-80 -DUSE_ELF_SYMVER -fcf-protection -mshstk -g 
-DIN_LIBGCC2 -fbuilding-libgcc -fno-stack-protector   -fpic -mlong-double-80 
-DUSE_ELF_SYMVER -fcf-protection -mshstk -I. -I. -I../../.././gcc 
-I../../../../gcc-git/libgcc -I../../../../gcc-git/libgcc/. 
-I../../../../gcc-git/libgcc/../gcc -I../../../../gcc-git/libgcc/../include 
-I../../../../gcc-git/libgcc/config/libbid -DENABLE_DECIMAL_BID_FORMAT 
-DHAVE_CC_TLS  -DUSE_TLS -o _isinfd32.o -MT _isinfd32.o -MD -MP -MF 
_isinfd32.dep -c ../../../../gcc-git/libgcc/config/libbid/_isinfd32.c

Source:
 28 int
 29 isinfd32 (_Decimal32 x) {
 30   int res;
 31   UINT64 x64;
 32   union decimal32 ux;
 33
 34   ux.d = x;
 35   x64 = __bid32_to_bid64 (ux.i);
 36   res = __bid64_isInf (x64);
 37   return (res);
 38 }

On the crash site (emit_single_push_insn), we have three insns:
(gdb) p debug_rtx (prev)
(insn 12 0 13 (parallel [
(set (reg/f:SI 7 sp)
(plus:SI (reg/f:SI 7 sp)
(const_int -12 [0xfff4])))
(clobber (reg:CC 17 flags))
]) "../../../../gcc-git/libgcc/config/libbid/_isinfd32.c":35:9 -1
 (expr_list:REG_ARGS_SIZE (const_int 12 [0xc])
(nil)))
$1 = void
(gdb) p debug_rtx (last)
(insn 14 13 0 (set (mem:SI (reg/f:SI 87) [0  S4 A32])
(subreg:SI (reg/v:SD 86 [ x ]) 0)) 
"../../../../gcc-git/libgcc/config/libbid/_isinfd32.c":35:9 -1
 (nil))
$2 = void
(gdb) p debug_rtx (PREV_INSN (last))
(insn 13 12 14 (set (reg/f:SI 87)
(pre_dec:SI (reg/f:SI 7 sp))) 
"../../../../gcc-git/libgcc/config/libbid/_isinfd32.c":35:9 -1
 (nil))
$3 = void

Here, insn 13 is invalid. It is emitted by: x = adjust_address (x, GET_MODE 
(y_inner), 0);

The v5 patch attached addressed this issue.

There are two added changes compared with the v4 patch:
1. In candidate_mem_p, mov_optab for innermode should be available. 
 In this case, mov_optab for SDmode is not there and subregs are added back 
by emit_move_insn_1.  So we won't get the benefit with the patch. 

2. Instead of using adjust_address, I changed to use adjust_address_nv to avoid 
emitting the invalid insn 13. 
The latter call to validize_mem() in emit_move_insn will take care of the 
address for us. 

Bootstrapped and tested on aarch64-linux-gnu.  I am running another test on 
x86.  
Richard, could you please take a further look? 

Thanks,
Felix


pr95254-v5.diff
Description: pr95254-v5.diff


RE: [PATCH PR95254] aarch64: gcc generate inefficient code with fixed sve vector length

2020-05-29 Thread Yangfei (Felix)
Hi,

> -Original Message-
> From: Hongtao Liu [mailto:crazy...@gmail.com]
> Sent: Friday, May 29, 2020 11:24 AM
> To: H.J. Lu 
> Cc: Yangfei (Felix) ; gcc-patches@gcc.gnu.org;
> Uros Bizjak ; Jakub Jelinek ;
> Richard Sandiford 
> Subject: Re: [PATCH PR95254] aarch64: gcc generate inefficient code with
> fixed sve vector length
> 

Snip...

> > >
> > > This is due to define_subst magic.  The generators automatically
> > > create a vec_merge form of the instruction based on the information
> > > in the  attributes.
> > >
> > > AFAICT the rtl above is for the line-125 instruction, which looks ok.
> > > The problem is the line-126 instruction, since vcvtps2ph doesn't
> > > AIUI allow zero masking.
> > >
> 
> zero masking is not allowed for mem_operand here, but available for
> register_operand.
> there's something wrong in the pattern; we need to fix it.
> (define_insn "avx512f_vcvtps2ph512<mask_name>"

Thanks for confirming that :-)

> 
> > > The "mask" define_subst allows both zeroing and merging, so I guess
> > > this means that the pattern should either be using a different
> > > define_subst, or should be enforcing merging in some other way.
> > > Please could one of the x86 devs take a look?
> > >
> >
> > Hongtao, can you take a look?
> >
> > Thanks.
> >
> >
> > --
> > H.J.
> 
> BTW, I failed to build gcc when applying pr95254-v4.txt.
> 
> gcc configure:
> 
> Using built-in specs.
> COLLECT_GCC=./gcc/xgcc
> Target: x86_64-pc-linux-gnu
> Configured with: ../../gcc/gnu-toolchain/gcc/configure
> --enable-languages=c,c++,fortran --disable-bootstrap
> Thread model: posix
> Supported LTO compression algorithms: zlib
> gcc version 11.0.0 20200528 (experimental) (GCC)
> 
> host on x86_64 rel8.

Yes, I tried your configure and reproduced the error.  Thanks for pointing this 
out.
The patch can pass bootstrap on x86_64 with the following configure options.
Surprised to see that it failed to build with your configure.

Using built-in specs.
COLLECT_GCC=gcc
COLLECT_LTO_WRAPPER=/home/yangfei/gcc-hacking/install-gcc/libexec/gcc/x86_64-pc-linux-gnu/11/lto-wrapper
Target: x86_64-pc-linux-gnu
Configured with: ../gcc-git/configure 
--prefix=/home/yangfei/gcc-hacking/install-gcc 
--enable-languages=c,c++,objc,obj-c++,fortran,lto --enable-shared 
--enable-threads=posix --enable-checking=yes --disable-multilib 
--with-system-zlib --enable-__cxa_atexit --disable-libunwind-exceptions 
--enable-gnu-unique-object --enable-linker-build-id 
--with-gcc-major-version-only --enable-plugin --enable-initfini-array 
--without-isl --disable-libmpx --enable-gnu-indirect-function
Thread model: posix
Supported LTO compression algorithms: zlib
gcc version 11.0.0 20200526 (experimental) (GCC)

Felix


Re: [PATCH PR95254] aarch64: gcc generate inefficient code with fixed sve vector length

2020-05-28 Thread Hongtao Liu via Gcc-patches
On Thu, May 28, 2020 at 11:37 PM H.J. Lu  wrote:
>
> On Thu, May 28, 2020 at 8:00 AM Richard Sandiford
>  wrote:
> >
> > "Yangfei (Felix)"  writes:
> > > Thanks for reviewing this.
> > > Attached please find the v5 patch.
> > > Note: we also need to modify local variable "mode" once we catch one 
> > > case.  I see test failure without this change.
> >
> > Looks good.  Patch is OK assuming the x86 folks don't want to rewrite
> > gcc.target/i386/pr67609.c to avoid the new optimisation.  I'll hold off
> > applying until the AVX512 thing is sorted.
> >
> > > Bootstrapped and tested on aarch64-linux-gnu.
> > > Also bootstrapped on x86_64-linux-gnu.  Regression test show two failed 
> > > tests on this platform:
> >
> > Thanks for the extra testing.
> >
> > > 1> FAIL: gcc.target/i386/avx512f-vcvtps2ph-2.c (test for excess errors)
> > > 2> FAIL: gcc.target/i386/pr67609.c scan-assembler movdqa
> > >
> > > I have adjusted the second one in the v4 patch.
> >
> > So this is:
> >
> > movdqa  reg(%rip), %xmm1
> > movaps  %xmm1, -24(%rsp)
> > movsd   %xmm0, -24(%rsp)
> > movapd  -24(%rsp), %xmm2
> > movaps  %xmm2, reg(%rip)
> > ret
> >
> > to:
> >
> > movq%xmm0, reg(%rip)
> > ret
> >
> > Nice.  I think it's safe to say that's an improvement :-)
> >
> > I don't know whether this means we're no longer testing what the test
> > was intended to test.  Maybe one of the x86 folks has an opinion about
> > whether we should instead rewrite the test somehow.
> >
> > > But the first one looks strange to me.
> > > I see gcc emits invalid x86 vcvtps2ph instructions, which look like:
> > >
> > > 125 vcvtps2ph   $0, %zmm0, -112(%rbp){%k1}
> > > 126 vcvtps2ph   $0, %zmm0, -80(%rbp){%k1}{z}
> > >
> > > This happens in the combine phase, where a combined insn looks like:
> > >
> > > 1989 Trying 31 -> 33:
> > > 199031: r106:V16HI=vec_merge(unspec[r103:V16SF,0] 
> > > 133,[frame:DI-0x60],r109:HI)
> > > 199133: [frame:DI-0x60]=r106:V16HI
> > > 1992   REG_DEAD r106:V16HI
> > > 1993 Successfully matched this instruction:
> > > 1994 (set (mem/j/c:V16HI (plus:DI (reg/f:DI 19 frame)
> > > 1995 (const_int -96 [0xffa0])) [2 res2.x+0 S32 
> > > A256])
> > > 1996 (vec_merge:V16HI (unspec:V16HI [
> > > 1997 (reg:V16SF 103)
> > > 1998 (const_int 0 [0])
> > > 1999 ] UNSPEC_VCVTPS2PH)
> > > 2000 (mem/j/c:V16HI (plus:DI (reg/f:DI 19 frame)
> > > 2001 (const_int -96 [0xffa0])) [2 res2.x+0 
> > > S32 A256])
> > > 2002 (reg:HI 109)))
> > > 2003 allowing combination of insns 31 and 33
> > > 2004 original costs 16 + 4 = 20
> > > 2005 replacement cost 16
> > > 2006 deferring deletion of insn with uid = 31.
> > > 2007 modifying insn i333: 
> > > [frame:DI-0x60]=vec_merge(unspec[r103:V16SF,0] 
> > > 133,[frame:DI-0x60],r109:HI)
> > > 2008 deferring rescan insn with uid = 33.
> > >
> > > And this can be matched with pattern: avx512f_vcvtps2ph512_mask
> > > 2282 (insn 33 31 37 4 (set (mem/j/c:V16HI (plus:DI (reg/f:DI 19 frame)
> > > 2283 (const_int -96 [0xffa0])) [2 res2.x+0 
> > > S32 A256])
> > > 2284 (vec_merge:V16HI (unspec:V16HI [
> > > 2285 (reg:V16SF 103)
> > > 2286 (const_int 0 [0])
> > > 2287 ] UNSPEC_VCVTPS2PH)
> > > 2288 (mem/j/c:V16HI (plus:DI (reg/f:DI 19 frame)
> > > 2289 (const_int -96 [0xffa0])) [2 
> > > res2.x+0 S32 A256])
> > > 2290 (reg:HI 109))) "avx512f-vcvtps2ph-2.c":80:10 5324 
> > > {avx512f_vcvtps2ph512_mask}
> > > 2291  (nil))
> > >
> > > gcc/config/i386/sse.md:
> > > 21663 (define_insn "avx512f_vcvtps2ph512<mask_name>"
> > > 21664   [(set (match_operand:V16HI 0 "nonimmediate_operand" "=vm")
> > > 21665 (unspec:V16HI
> > > 21666   [(match_operand:V16SF 1 "register_operand" "v")
> > > 21667(match_operand:SI 2 "const_0_to_255_operand" "N")]
> > > 21668   UNSPEC_VCVTPS2PH))]
> > > 21669   "TARGET_AVX512F"
> > > 21670   "vcvtps2ph\t{%2, %1, %0|%0, %1, %2}"
> > > 21671   [(set_attr "type" "ssecvt")
> > > 21672(set_attr "prefix" "evex")
> > > 21673(set_attr "mode" "V16SF")])
> > >
> > > How can that happen?
> >
> > This is due to define_subst magic.  The generators automatically
> > create a vec_merge form of the instruction based on the information
> > in the <mask_name> attributes.
> >
> > AFAICT the rtl above is for the line-125 instruction, which looks ok.
> > The problem is the line-126 instruction, since vcvtps2ph doesn't
> > AIUI allow zero masking.
> >

zero masking is not allowed for mem_operand here, but available for
register_operand.
there's something wrong in the pattern; we need to fix it.
(define_insn "avx512f_vcvtps2ph512<mask_name>"


> > The "mask" define_subst allows both zeroing and merging,
> > so I guess this means that the pattern should 

Re: [PATCH PR95254] aarch64: gcc generate inefficient code with fixed sve vector length

2020-05-28 Thread H.J. Lu via Gcc-patches
On Thu, May 28, 2020 at 8:00 AM Richard Sandiford
 wrote:
>
> "Yangfei (Felix)"  writes:
> > Thanks for reviewing this.
> > Attached please find the v5 patch.
> > Note: we also need to modify local variable "mode" once we catch one case.  
> > I see test failure without this change.
>
> Looks good.  Patch is OK assuming the x86 folks don't want to rewrite
> gcc.target/i386/pr67609.c to avoid the new optimisation.  I'll hold off
> applying until the AVX512 thing is sorted.
>
> > Bootstrapped and tested on aarch64-linux-gnu.
> > Also bootstrapped on x86_64-linux-gnu.  Regression test show two failed 
> > tests on this platform:
>
> Thanks for the extra testing.
>
> > 1> FAIL: gcc.target/i386/avx512f-vcvtps2ph-2.c (test for excess errors)
> > 2> FAIL: gcc.target/i386/pr67609.c scan-assembler movdqa
> >
> > I have adjusted the second one in the v4 patch.
>
> So this is:
>
> movdqa  reg(%rip), %xmm1
> movaps  %xmm1, -24(%rsp)
> movsd   %xmm0, -24(%rsp)
> movapd  -24(%rsp), %xmm2
> movaps  %xmm2, reg(%rip)
> ret
>
> to:
>
> movq%xmm0, reg(%rip)
> ret
>
> Nice.  I think it's safe to say that's an improvement :-)
>
> I don't know whether this means we're no longer testing what the test
> was intended to test.  Maybe one of the x86 folks has an opinion about
> whether we should instead rewrite the test somehow.
>
> > But the first one looks strange to me.
> > I see gcc emits invalid x86 vcvtps2ph instructions, which look like:
> >
> > 125 vcvtps2ph   $0, %zmm0, -112(%rbp){%k1}
> > 126 vcvtps2ph   $0, %zmm0, -80(%rbp){%k1}{z}
> >
> > This happens in the combine phase, where a combined insn looks like:
> >
> > 1989 Trying 31 -> 33:
> > 199031: r106:V16HI=vec_merge(unspec[r103:V16SF,0] 
> > 133,[frame:DI-0x60],r109:HI)
> > 199133: [frame:DI-0x60]=r106:V16HI
> > 1992   REG_DEAD r106:V16HI
> > 1993 Successfully matched this instruction:
> > 1994 (set (mem/j/c:V16HI (plus:DI (reg/f:DI 19 frame)
> > 1995 (const_int -96 [0xffa0])) [2 res2.x+0 S32 
> > A256])
> > 1996 (vec_merge:V16HI (unspec:V16HI [
> > 1997 (reg:V16SF 103)
> > 1998 (const_int 0 [0])
> > 1999 ] UNSPEC_VCVTPS2PH)
> > 2000 (mem/j/c:V16HI (plus:DI (reg/f:DI 19 frame)
> > 2001 (const_int -96 [0xffa0])) [2 res2.x+0 S32 
> > A256])
> > 2002 (reg:HI 109)))
> > 2003 allowing combination of insns 31 and 33
> > 2004 original costs 16 + 4 = 20
> > 2005 replacement cost 16
> > 2006 deferring deletion of insn with uid = 31.
> > 2007 modifying insn i333: 
> > [frame:DI-0x60]=vec_merge(unspec[r103:V16SF,0] 133,[frame:DI-0x60],r109:HI)
> > 2008 deferring rescan insn with uid = 33.
> >
> > And this can be matched with pattern: avx512f_vcvtps2ph512_mask
> > 2282 (insn 33 31 37 4 (set (mem/j/c:V16HI (plus:DI (reg/f:DI 19 frame)
> > 2283 (const_int -96 [0xffa0])) [2 res2.x+0 S32 
> > A256])
> > 2284 (vec_merge:V16HI (unspec:V16HI [
> > 2285 (reg:V16SF 103)
> > 2286 (const_int 0 [0])
> > 2287 ] UNSPEC_VCVTPS2PH)
> > 2288 (mem/j/c:V16HI (plus:DI (reg/f:DI 19 frame)
> > 2289 (const_int -96 [0xffa0])) [2 res2.x+0 
> > S32 A256])
> > 2290 (reg:HI 109))) "avx512f-vcvtps2ph-2.c":80:10 5324 
> > {avx512f_vcvtps2ph512_mask}
> > 2291  (nil))
> >
> > gcc/config/i386/sse.md:
> > 21663 (define_insn "avx512f_vcvtps2ph512<mask_name>"
> > 21664   [(set (match_operand:V16HI 0 "nonimmediate_operand" "=vm")
> > 21665 (unspec:V16HI
> > 21666   [(match_operand:V16SF 1 "register_operand" "v")
> > 21667(match_operand:SI 2 "const_0_to_255_operand" "N")]
> > 21668   UNSPEC_VCVTPS2PH))]
> > 21669   "TARGET_AVX512F"
> > 21670   "vcvtps2ph\t{%2, %1, %0|%0, %1, %2}"
> > 21671   [(set_attr "type" "ssecvt")
> > 21672(set_attr "prefix" "evex")
> > 21673(set_attr "mode" "V16SF")])
> >
> > How can that happen?
>
> This is due to define_subst magic.  The generators automatically
> create a vec_merge form of the instruction based on the information
> in the <mask_name> attributes.
>
> AFAICT the rtl above is for the line-125 instruction, which looks ok.
> The problem is the line-126 instruction, since vcvtps2ph doesn't
> AIUI allow zero masking.
>
> The "mask" define_subst allows both zeroing and merging,
> so I guess this means that the pattern should either be using
> a different define_subst, or should be enforcing merging in
> some other way.  Please could one of the x86 devs take a look?
>

Hongtao, can you take a look?

Thanks.


-- 
H.J.


Re: [PATCH PR95254] aarch64: gcc generate inefficient code with fixed sve vector length

2020-05-28 Thread Richard Sandiford
"Yangfei (Felix)"  writes:
> Thanks for reviewing this.
> Attached please find the v5 patch.
> Note: we also need to modify local variable "mode" once we catch one case.  I 
> see test failure without this change.

Looks good.  Patch is OK assuming the x86 folks don't want to rewrite
gcc.target/i386/pr67609.c to avoid the new optimisation.  I'll hold off
applying until the AVX512 thing is sorted.

> Bootstrapped and tested on aarch64-linux-gnu.
> Also bootstrapped on x86_64-linux-gnu.  Regression test show two failed tests 
> on this platform:

Thanks for the extra testing.

> 1> FAIL: gcc.target/i386/avx512f-vcvtps2ph-2.c (test for excess errors)
> 2> FAIL: gcc.target/i386/pr67609.c scan-assembler movdqa
>
> I have adjusted the second one in the v4 patch.

So this is:

movdqa  reg(%rip), %xmm1
movaps  %xmm1, -24(%rsp)
movsd   %xmm0, -24(%rsp)
movapd  -24(%rsp), %xmm2
movaps  %xmm2, reg(%rip)
ret

to:

movq%xmm0, reg(%rip)
ret

Nice.  I think it's safe to say that's an improvement :-)

I don't know whether this means we're no longer testing what the test
was intended to test.  Maybe one of the x86 folks has an opinion about
whether we should instead rewrite the test somehow.

> But the first one looks strange to me.
> I see gcc emits invalid x86 vcvtps2ph instructions, which look like:
>
> 125 vcvtps2ph   $0, %zmm0, -112(%rbp){%k1}
> 126 vcvtps2ph   $0, %zmm0, -80(%rbp){%k1}{z}
>
> This happens in the combine phase, where a combined insn looks like:
>
> 1989 Trying 31 -> 33:
> 199031: r106:V16HI=vec_merge(unspec[r103:V16SF,0] 
> 133,[frame:DI-0x60],r109:HI)
> 199133: [frame:DI-0x60]=r106:V16HI
> 1992   REG_DEAD r106:V16HI
> 1993 Successfully matched this instruction:
> 1994 (set (mem/j/c:V16HI (plus:DI (reg/f:DI 19 frame)
> 1995 (const_int -96 [0xffa0])) [2 res2.x+0 S32 A256])
> 1996 (vec_merge:V16HI (unspec:V16HI [
> 1997 (reg:V16SF 103)
> 1998 (const_int 0 [0])
> 1999 ] UNSPEC_VCVTPS2PH)
> 2000 (mem/j/c:V16HI (plus:DI (reg/f:DI 19 frame)
> 2001 (const_int -96 [0xffa0])) [2 res2.x+0 S32 
> A256])
> 2002 (reg:HI 109)))
> 2003 allowing combination of insns 31 and 33
> 2004 original costs 16 + 4 = 20
> 2005 replacement cost 16
> 2006 deferring deletion of insn with uid = 31.
> 2007 modifying insn i333: [frame:DI-0x60]=vec_merge(unspec[r103:V16SF,0] 
> 133,[frame:DI-0x60],r109:HI)
> 2008 deferring rescan insn with uid = 33.
>
> And this can be matched with pattern: avx512f_vcvtps2ph512_mask
> 2282 (insn 33 31 37 4 (set (mem/j/c:V16HI (plus:DI (reg/f:DI 19 frame)
> 2283 (const_int -96 [0xffa0])) [2 res2.x+0 S32 
> A256])
> 2284 (vec_merge:V16HI (unspec:V16HI [
> 2285 (reg:V16SF 103)
> 2286 (const_int 0 [0])
> 2287 ] UNSPEC_VCVTPS2PH)
> 2288 (mem/j/c:V16HI (plus:DI (reg/f:DI 19 frame)
> 2289 (const_int -96 [0xffa0])) [2 res2.x+0 
> S32 A256])
> 2290 (reg:HI 109))) "avx512f-vcvtps2ph-2.c":80:10 5324 
> {avx512f_vcvtps2ph512_mask}
> 2291  (nil))
>
> gcc/config/i386/sse.md:
> 21663 (define_insn "avx512f_vcvtps2ph512<mask_name>"
> 21664   [(set (match_operand:V16HI 0 "nonimmediate_operand" "=vm")
> 21665 (unspec:V16HI
> 21666   [(match_operand:V16SF 1 "register_operand" "v")
> 21667(match_operand:SI 2 "const_0_to_255_operand" "N")]
> 21668   UNSPEC_VCVTPS2PH))]
> 21669   "TARGET_AVX512F"
> 21670   "vcvtps2ph\t{%2, %1, %0|%0, %1, %2}"
> 21671   [(set_attr "type" "ssecvt")
> 21672(set_attr "prefix" "evex")
> 21673(set_attr "mode" "V16SF")])
>
> How can that happen? 

This is due to define_subst magic.  The generators automatically
create a vec_merge form of the instruction based on the information
in the <mask_name> attributes.

AFAICT the rtl above is for the line-125 instruction, which looks ok.
The problem is the line-126 instruction, since vcvtps2ph doesn't
AIUI allow zero masking.

The "mask" define_subst allows both zeroing and merging,
so I guess this means that the pattern should either be using
a different define_subst, or should be enforcing merging in
some other way.  Please could one of the x86 devs take a look?

Thanks,
Richard
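For context, the i386 "mask" define_subst that Richard refers to lives in
gcc/config/i386/subst.md and looks roughly like this (simplified and
approximate -- consult the real file for the exact definition):

(define_subst "mask"
  [(set (match_operand:SUBST_V 0)
        (match_operand:SUBST_V 1))]
  "TARGET_AVX512F"
  [(set (match_dup 0)
        (vec_merge:SUBST_V
          (match_dup 1)
          (match_operand:SUBST_V 2 "nonimm_or_0_operand" "0C")
          (match_operand:<avx512fmaskmode> 3 "register_operand" "Yk")))])

The "0C" constraint on operand 2 is what permits both merge masking (tying
the merge input to the destination, "0") and zero masking (a const0 vector,
"C"), which is why the generated avx512f_vcvtps2ph512_mask pattern accepts
the {z} form even where the instruction does not actually support it.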


RE: [PATCH PR95254] aarch64: gcc generate inefficient code with fixed sve vector length

2020-05-28 Thread Yangfei (Felix)
Hi,

> -Original Message-
> From: Richard Sandiford [mailto:richard.sandif...@arm.com]
> Sent: Thursday, May 28, 2020 12:07 AM
> To: Yangfei (Felix) 
> Cc: gcc-patches@gcc.gnu.org
> Subject: Re: [PATCH PR95254] aarch64: gcc generate inefficient code with
> fixed sve vector length
> 

Snip...

> 
> Ah, OK.  But in that case, shouldn't we allow the change if the original
> unaligned MEM was also “slow”?
> 
> I guess there might be cases in which both modes are slow enough for the
> hook to return true for them, but one is worse than the other.
> But I don't think there's much we can do about that as things stand:
> changing the mode might move from a slow mode to a slower mode, but it
> might move in the other direction too.

Good point.

> > +2020-05-27  Felix Yang  
> > +   Richard Sandiford  
> 
> I appreciate the gesture, but I don't think it's appropriate to list me as an
> author.  I haven't written any of the code, I've just reviewed it. :-)

OK.

> > diff --git a/gcc/expr.c b/gcc/expr.c
> > index dfbeae71518..3035791c764 100644
> > --- a/gcc/expr.c
> > +++ b/gcc/expr.c
> > @@ -3814,6 +3814,69 @@ emit_move_insn (rtx x, rtx y)
> >gcc_assert (mode != BLKmode
> >   && (GET_MODE (y) == mode || GET_MODE (y) == VOIDmode));
> >
> > +  /* If we have a copy which looks like one of the following patterns:
> 
> s/which/that/ (I think)

OK.

> > +   (set (subreg:M1 (reg:M2 ...)) (subreg:M1 (reg:M2 ...)))
> > +   (set (subreg:M1 (reg:M2 ...)) (mem:M1 ADDR))
> > +   (set (mem:M1 ADDR) (subreg:M1 (reg:M2 ...)))
> > +   (set (subreg:M1 (reg:M2 ...)) (constant C))
> > + where mode M1 is equal in size to M2 and target hook
> can_change_mode_class
> > + (M1, M2, ALL_REGS) returns false, try to remove the subreg.  This
> avoids
> > + an implicit round trip through memory.  */
> 
> How about:
> 
>  where mode M1 is equal in size to M2, try to detect whether the
>  mode change involves an implicit round trip through memory.
>  If so, see if we can avoid that by removing the subregs and
>  doing the move in mode M2 instead.  */
> 
> > +  else if (x_inner != NULL_RTX
> > +  && MEM_P (y)
> > +  && ! targetm.can_change_mode_class (GET_MODE (x_inner),
> > +  mode, ALL_REGS)
> > +  /* Stop if the inner mode requires too much alignment.  */
> > +  && (! targetm.slow_unaligned_access (GET_MODE (x_inner),
> > +   MEM_ALIGN (y))
> > +  || MEM_ALIGN (y) >= GET_MODE_ALIGNMENT (GET_MODE
> (x_inner
> 
> It's better to check the alignment first, since it's cheaper.
> So taking the comment above into account, I think this ends up as:
> 
>  && (MEM_ALIGN (y) >= GET_MODE_ALIGNMENT (GET_MODE (x_inner))
>  || targetm.slow_unaligned_access (mode, MEM_ALIGN (y))
>  || !targetm.slow_unaligned_access (GET_MODE (x_inner),
> MEM_ALIGN (y)))
> 
> (Note: no space after "!", although the sources aren't as consistent about
> that as they could be.)

OK.

> TBH I think it would be good to avoid duplicating such a complicated condition
> in both directions, so at the risk of getting flamed, how about using a 
> lambda?
> 
>   auto candidate_mem_p = [&](machine_mode inner_mode, rtx mem) {
> return ...;
>   };
> 
> with ... containing everything after the MEM_P check?

Yes, this avoids duplicating code.

> Looks good otherwise, thanks,

Thanks for reviewing this.
Attached please find the v5 patch.
Note: we also need to modify local variable "mode" once we catch one case.  I 
see test failure without this change.

Bootstrapped and tested on aarch64-linux-gnu.
Also bootstrapped on x86_64-linux-gnu.  Regression test show two failed tests 
on this platform:

1> FAIL: gcc.target/i386/avx512f-vcvtps2ph-2.c (test for excess errors)
2> FAIL: gcc.target/i386/pr67609.c scan-assembler movdqa

I have adjusted the second one in the v4 patch.  But the first one looks strange 
to me.
I see gcc emits invalid x86 vcvtps2ph instructions, which look like:

125 vcvtps2ph   $0, %zmm0, -112(%rbp){%k1}
126 vcvtps2ph   $0, %zmm0, -80(%rbp){%k1}{z}

This happens in the combine phase, where a combined insn looks like:

1989 Trying 31 -> 33:
199031: r106:V16HI=vec_merge(unspec[r103:V16SF,0] 
133,[frame:DI-0x60],r109:HI)
199133: [frame:DI-0x60]=r106:V16HI
1992   REG_DEAD r106:V16HI
1993 Successfully matched this instruction:
1994 (set (mem/j/c:V16HI (plus:DI (reg/f:DI 19 frame)
1995  

Re: [PATCH PR95254] aarch64: gcc generate inefficient code with fixed sve vector length

2020-05-27 Thread Richard Sandiford
"Yangfei (Felix)"  writes:
>> > +
>> > +{
>> > +  x = x_inner;
>> > +}
>> > +  else if (x_inner != NULL_RTX && MEM_P (y)
>> > + && known_eq (GET_MODE_SIZE (x_inner_mode),
>> GET_MODE_SIZE (mode))
>> > + && ! targetm.can_change_mode_class (x_inner_mode, mode,
>> ALL_REGS)
>> > + && (! targetm.slow_unaligned_access (x_inner_mode,
>> MEM_ALIGN (y))
>> > + || MEM_ALIGN (y) >= GET_MODE_ALIGNMENT
>> (x_inner_mode)))
>> 
>> What is the last condition protecting against?  Seems worth a comment.
>
> Comment added.  Here I intend to avoid generating a slow unaligned 
> memory access.
> Machine modes like VNx2HImode may have a smaller alignment than modes like 
> V4HI.
> For the given test case, SLP forces the alignment of memory access of mode 
> VNx2HImode to be 32 bytes.
> In theory, we may have other cases where alignment of innermode is bigger 
> than that of the outermode.

Ah, OK.  But in that case, shouldn't we allow the change if the
original unaligned MEM was also “slow”?

I guess there might be cases in which both modes are slow enough
for the hook to return true for them, but one is worse than the other.
But I don't think there's much we can do about that as things stand:
changing the mode might move from a slow mode to a slower mode,
but it might move in the other direction too.

> +2020-05-27  Felix Yang  
> +   Richard Sandiford  

I appreciate the gesture, but I don't think it's appropriate
to list me as an author.  I haven't written any of the code,
I've just reviewed it. :-)

> diff --git a/gcc/expr.c b/gcc/expr.c
> index dfbeae71518..3035791c764 100644
> --- a/gcc/expr.c
> +++ b/gcc/expr.c
> @@ -3814,6 +3814,69 @@ emit_move_insn (rtx x, rtx y)
>gcc_assert (mode != BLKmode
> && (GET_MODE (y) == mode || GET_MODE (y) == VOIDmode));
>  
> +  /* If we have a copy which looks like one of the following patterns:

s/which/that/ (I think)

> +   (set (subreg:M1 (reg:M2 ...)) (subreg:M1 (reg:M2 ...)))
> +   (set (subreg:M1 (reg:M2 ...)) (mem:M1 ADDR))
> +   (set (mem:M1 ADDR) (subreg:M1 (reg:M2 ...)))
> +   (set (subreg:M1 (reg:M2 ...)) (constant C))
> + where mode M1 is equal in size to M2 and target hook 
> can_change_mode_class
> + (M1, M2, ALL_REGS) returns false, try to remove the subreg.  This avoids
> + an implicit round trip through memory.  */

How about:

 where mode M1 is equal in size to M2, try to detect whether the
 mode change involves an implicit round trip through memory.
 If so, see if we can avoid that by removing the subregs and
 doing the move in mode M2 instead.  */

> +  else if (x_inner != NULL_RTX
> +&& MEM_P (y)
> +&& ! targetm.can_change_mode_class (GET_MODE (x_inner),
> +mode, ALL_REGS)
> +/* Stop if the inner mode requires too much alignment.  */
> +&& (! targetm.slow_unaligned_access (GET_MODE (x_inner),
> + MEM_ALIGN (y))
> +|| MEM_ALIGN (y) >= GET_MODE_ALIGNMENT (GET_MODE (x_inner

It's better to check the alignment first, since it's cheaper.
So taking the comment above into account, I think this ends up as:

   && (MEM_ALIGN (y) >= GET_MODE_ALIGNMENT (GET_MODE (x_inner))
   || targetm.slow_unaligned_access (mode, MEM_ALIGN (y))
   || !targetm.slow_unaligned_access (GET_MODE (x_inner),
  MEM_ALIGN (y)))

(Note: no space after "!", although the sources aren't as consistent
about that as they could be.)

TBH I think it would be good to avoid duplicating such a complicated
condition in both directions, so at the risk of getting flamed, how
about using a lambda?

  auto candidate_mem_p = [&](machine_mode inner_mode, rtx mem) {
return ...;
  };

with ... containing everything after the MEM_P check?

Looks good otherwise, thanks,

Richard


RE: [PATCH PR95254] aarch64: gcc generate inefficient code with fixed sve vector length

2020-05-27 Thread Yangfei (Felix)
Hi,

> -Original Message-
> From: Richard Sandiford [mailto:richard.sandif...@arm.com]
> Sent: Tuesday, May 26, 2020 11:58 PM
> To: Yangfei (Felix) 
> Cc: gcc-patches@gcc.gnu.org
> Subject: Re: [PATCH PR95254] aarch64: gcc generate inefficient code with
> fixed sve vector length
> 
> Sorry for the slow reply, was off for a few days.
> 
> I think the new code ought to happen earlier in emit_move_insn, before:
> 
>   if (CONSTANT_P (y))
> {
> 
> That way, all the canonicalisation happens on the mode we actually want the
> move to have.

OK. That’s a good point.

> "Yangfei (Felix)"  writes:
> > diff --git a/gcc/expr.c b/gcc/expr.c
> > index dfbeae71518..4442fb83367 100644
> > --- a/gcc/expr.c
> > +++ b/gcc/expr.c
> > @@ -3852,6 +3852,62 @@ emit_move_insn (rtx x, rtx y)
> >
> >gcc_assert (mode != BLKmode);
> >
> > +  rtx x_inner = NULL_RTX;
> > +  rtx y_inner = NULL_RTX;
> > +  machine_mode x_inner_mode, y_inner_mode;
> > +
> > +  if (SUBREG_P (x)
> > +  && REG_P (SUBREG_REG (x))
> > +  && known_eq (SUBREG_BYTE (x), 0))
> > +{
> > +  x_inner = SUBREG_REG (x);
> > +  x_inner_mode = GET_MODE (x_inner);
> > +}
> > +  if (SUBREG_P (y)
> > +  && REG_P (SUBREG_REG (y))
> > +  && known_eq (SUBREG_BYTE (y), 0))
> > +{
> > +  y_inner = SUBREG_REG (y);
> > +  y_inner_mode = GET_MODE (y_inner);
> > +}
> 
> The later code is only interested in SUBREG_REGs that are the same size as
> "mode", so I think it would make sense to check that in the "if"s above
> instead of checking SUBREG_BYTE.  (SUBREG_BYTE is always zero when the
> modes are the same size, but the reverse is not true.)
> 
> It might also be better to avoid [xy]_inner_mode and just use GET_MODE
> where necessary.
> 
> It would be good to have a block comment above the code to explain what
> we're doing.

Good suggestion. Done.

> > +  if (x_inner != NULL_RTX
> > +  && y_inner != NULL_RTX
> > +  && x_inner_mode == y_inner_mode
> > +  && known_eq (GET_MODE_SIZE (x_inner_mode), GET_MODE_SIZE
> (mode))
> > +  && ! targetm.can_change_mode_class (x_inner_mode, mode,
> ALL_REGS))
> > +{
> > +  x = x_inner;
> > +  y = y_inner;
> > +}
> > +  else if (x_inner != NULL_RTX && CONSTANT_P (y)
> 
> Formatting nit: one subcondition per line when the condition spans multiple
> lines.

OK.

> > +  && known_eq (GET_MODE_SIZE (x_inner_mode),
> GET_MODE_SIZE (mode))
> > +  && ! targetm.can_change_mode_class (x_inner_mode, mode,
> ALL_REGS)
> > +  && targetm.legitimate_constant_p (x_inner_mode, y))
> 
> This call isn't valid, since the mode has to match the rtx.  ("y" still has 
> mode
> "mode" at this point.)  I think instead we should just do:
> 
>  && (y_inner = simplify_subreg (GET_MODE (x_inner), y, mode, 0))
> 
> to convert the constant, and use it if the result is nonnull.
> The existing CONSTANT_P emit_move_insn code will handle cases in which
> the new constant isn't legitimate.

Good catch. Done.

> > +
> > +{
> > +  x = x_inner;
> > +}
> > +  else if (x_inner != NULL_RTX && MEM_P (y)
> > +  && known_eq (GET_MODE_SIZE (x_inner_mode),
> GET_MODE_SIZE (mode))
> > +  && ! targetm.can_change_mode_class (x_inner_mode, mode,
> ALL_REGS)
> > +  && (! targetm.slow_unaligned_access (x_inner_mode,
> MEM_ALIGN (y))
> > +  || MEM_ALIGN (y) >= GET_MODE_ALIGNMENT
> (x_inner_mode)))
> 
> What is the last condition protecting against?  Seems worth a comment.

Comment added.  Here I intend to avoid generating a slow unaligned memory 
access.
Machine modes like VNx2HImode may have a smaller alignment than modes like V4HI.
For the given test case, SLP forces the alignment of memory access of mode 
VNx2HImode to be 32 bytes.
In theory, we may have other cases where alignment of innermode is bigger than 
that of the outermode.

Attached please find the v3 patch.  Bootstrapped and tested on 
aarch64-linux-gnu.
Does it look better?

gcc/ChangeLog:
+2020-05-27  Felix Yang  
+   Richard Sandiford  
+
+   PR target/95254
+   * expr.c (emit_move_insn): If we have a copy of which src and/or dest
+   is a subreg, try to remove the subreg when innermode and outermode are
+   equal in size and targetm.can_change_mode_class (outermode, innermode,
+   ALL_REGS) returns false.

testsuite/ChangeLog:
+2020-05-27  Felix Yang  
+   Richard Sandiford  
+
+   PR target/95254
+   * gcc.target/aarch64/pr95254.c: New test.

Thanks,
Felix




pr95254-v3.diff
Description: pr95254-v3.diff


Re: [PATCH PR95254] aarch64: gcc generate inefficient code with fixed sve vector length

2020-05-26 Thread Richard Sandiford
Sorry for the slow reply, was off for a few days.

I think the new code ought to happen earlier in emit_move_insn, before:

  if (CONSTANT_P (y))
{

That way, all the canonicalisation happens on the mode we actually
want the move to have.

"Yangfei (Felix)"  writes:
> diff --git a/gcc/expr.c b/gcc/expr.c
> index dfbeae71518..4442fb83367 100644
> --- a/gcc/expr.c
> +++ b/gcc/expr.c
> @@ -3852,6 +3852,62 @@ emit_move_insn (rtx x, rtx y)
>  
>gcc_assert (mode != BLKmode);
>  
> +  rtx x_inner = NULL_RTX;
> +  rtx y_inner = NULL_RTX;
> +  machine_mode x_inner_mode, y_inner_mode;
> +
> +  if (SUBREG_P (x)
> +  && REG_P (SUBREG_REG (x))
> +  && known_eq (SUBREG_BYTE (x), 0))
> +{
> +  x_inner = SUBREG_REG (x);
> +  x_inner_mode = GET_MODE (x_inner);
> +}
> +  if (SUBREG_P (y)
> +  && REG_P (SUBREG_REG (y))
> +  && known_eq (SUBREG_BYTE (y), 0))
> +{
> +  y_inner = SUBREG_REG (y);
> +  y_inner_mode = GET_MODE (y_inner);
> +}

The later code is only interested in SUBREG_REGs that are the same size
as "mode", so I think it would make sense to check that in the "if"s
above instead of checking SUBREG_BYTE.  (SUBREG_BYTE is always zero
when the modes are the same size, but the reverse is not true.)
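
I.e., something like this instead (sketch only):

  if (SUBREG_P (x)
      && REG_P (SUBREG_REG (x))
      && known_eq (GET_MODE_SIZE (GET_MODE (SUBREG_REG (x))),
                   GET_MODE_SIZE (mode)))
    x_inner = SUBREG_REG (x);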

It might also be better to avoid [xy]_inner_mode and just use GET_MODE
where necessary.

It would be good to have a block comment above the code to explain
what we're doing.
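
Something along these lines, say (just a sketch of the sort of comment
meant):

  /* If the source and/or destination is a subreg of a vector register,
     the inner and outer modes are equal in size, and changing between
     them would need a round trip through memory
     (!targetm.can_change_mode_class), try to do the move in the inner
     mode instead.  */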

> +  if (x_inner != NULL_RTX
> +  && y_inner != NULL_RTX
> +  && x_inner_mode == y_inner_mode
> +  && known_eq (GET_MODE_SIZE (x_inner_mode), GET_MODE_SIZE (mode))
> +  && ! targetm.can_change_mode_class (x_inner_mode, mode, ALL_REGS))
> +{
> +  x = x_inner;
> +  y = y_inner;
> +}
> +  else if (x_inner != NULL_RTX && CONSTANT_P (y)

Formatting nit: one subcondition per line when the condition spans
multiple lines.

> +&& known_eq (GET_MODE_SIZE (x_inner_mode), GET_MODE_SIZE (mode))
> +&& ! targetm.can_change_mode_class (x_inner_mode, mode, ALL_REGS)
> +&& targetm.legitimate_constant_p (x_inner_mode, y))

This call isn't valid, since the mode has to match the rtx.  ("y" still
has mode "mode" at this point.)  I think instead we should just do:

   && (y_inner = simplify_subreg (GET_MODE (x_inner), y, mode, 0))

to convert the constant, and use it if the result is nonnull.
The existing CONSTANT_P emit_move_insn code will handle cases
in which the new constant isn't legitimate.
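
So the constant branch might end up looking like this (sketch):

  else if (x_inner != NULL_RTX
           && CONSTANT_P (y)
           && !targetm.can_change_mode_class (GET_MODE (x_inner),
                                              mode, ALL_REGS)
           && (y_inner = simplify_subreg (GET_MODE (x_inner), y, mode, 0)))
    {
      x = x_inner;
      y = y_inner;
    }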

> +
> +{
> +  x = x_inner;
> +}
> +  else if (x_inner != NULL_RTX && MEM_P (y)
> +&& known_eq (GET_MODE_SIZE (x_inner_mode), GET_MODE_SIZE (mode))
> +&& ! targetm.can_change_mode_class (x_inner_mode, mode, ALL_REGS)
> +&& (! targetm.slow_unaligned_access (x_inner_mode, MEM_ALIGN (y))
> +|| MEM_ALIGN (y) >= GET_MODE_ALIGNMENT (x_inner_mode)))

What is the last condition protecting against?  Seems worth a comment.

Thanks,
Richard


RE: [PATCH PR95254] aarch64: gcc generate inefficient code with fixed sve vector length

2020-05-22 Thread Yangfei (Felix)
Hi Richard,

Thanks for the suggestions.

> -Original Message-
> From: Richard Sandiford [mailto:richard.sandif...@arm.com]
> Sent: Thursday, May 21, 2020 5:22 PM
> To: Yangfei (Felix) 
> Cc: gcc-patches@gcc.gnu.org
> Subject: Re: [PATCH PR95254] aarch64: gcc generate inefficient code with
> fixed sve vector length
> 
> "Yangfei (Felix)"  writes:
> > Hi,
> >
> >   I noticed a tiny SVE-related performance issue:
> > https://gcc.gnu.org/bugzilla/show_bug.cgi?id=95254
> >
> >   For the given test case, SLP succeeds with VNx8HImode with or without
> > the option -msve-vector-bits=256.
> >   The root cause for the difference is that we choose a different mode in
> > aarch64_vectorize_related_mode under -msve-vector-bits=256: VNx2HImode
> > instead of V4HImode.
> >   Then in the final tree ssa forwprop pass, we need to do a VIEW_CONVERT
> > from V4HImode to VNx2HImode.
> >
> >   The patch catches and simplifies the pattern in
> > aarch64_expand_sve_mem_move, emitting a mov pattern of V4HImode instead.
> >   I am assuming endianness does not make a difference here, considering
> > this simplification.
> >   Bootstrapped and tested on aarch64-linux-gnu.  Comments?
> 
> I think we should try to catch this at the gimple level, possibly during SLP 
> itself.

Agreed.  It's better if this can be handled in SLP.
For the given test case, the difference only shows up after the final ssa
forwprop pass.
So I guess it might be hard for SLP to catch this and evaluate it in the
vect cost model.

> Although the patch handles the case in which the V4HI is stored directly to
> memory, I assume it won't help if the code instead does:
> 
> for (i = 0; i < 4; i++)
>   b[i] = u.a[i] + 1;

SLP fails if the loop is modified like that.

> targetm.can_change_mode_class (..., ALL_REGS) would be a good indicator
> of whether the needed VIEW_CONVERT_EXPR is cheap or expensive.
> 
> I agree it might still be worth handling this in the move patterns too.
> It feels like a target-independent optimisation though, and for example
> should also apply to V4HI moves involving subregs of VNx2HIs.
> 
> So I think it would be worth trying to do this in emit_move_insn.
> In principle it would be useful for:
> 
>   // M1 and M2 equal size, !targetm.can_change_mode_class (M1, M2, ALL_REGS)
> 
>   (set (subreg:M1 (reg:M2 ...)) (subreg:M1 (reg:M2 ...)))
>   (set (subreg:M1 (reg:M2 ...)) (mem:M1 ADDR))
>   (set (mem:M1 ADDR) (subreg:M1 (reg:M2 ...)))
>   (set (subreg:M1 (reg:M2 ...)) (constant C))
> 
> It would be nice if we could do this even without the
> can_change_mode_class condition, provided that it doesn't turn valid M1
> constants or MEMs into invalid M2 ones (or at least, M2 ones that need to be
> validated).
> Unfortunately, going that far is likely to interfere with target-specific 
> choices,
> so it's probably too dangerous.
> 
> With the can_change_mode_class condition it should be fine though, since
> it's avoiding an implicit round trip through memory.  The change should be a
> win even if the MEMs or constants need legitimising for M2.

Good suggestion.  I worked out a v2 patch with the logic moved into
emit_move_insn.
For the (set (subreg:M1 (reg:M2 ...)) (constant C)) case, I simply require
that C also be a legitimate constant for M2.
It works for the test case.  I will do a full test of the v2 patch if you
are happy with this approach.

> > [...]
> > +  if (inner != NULL_RTX
> > +  && aarch64_classify_vector_mode (inner_mode) == VEC_ADVSIMD
> > +  && GET_MODE_INNER (mode) == GET_MODE_INNER (inner_mode)
> > +  && known_eq (GET_MODE_SIZE (mode), GET_MODE_SIZE (inner_mode))
> > +  && GET_MODE_BITSIZE (inner_mode).to_constant () <= alignment)
> > +{
> > +  rtx addr, mem;
> > +  if (MEM_P (src))
> > +   {
> > + addr = XEXP (src, 0);
> > + mem = gen_rtx_MEM (inner_mode, addr);
> > + emit_move_insn (inner, mem);
> > +   }
> > +  else
> > +   {
> > + addr = XEXP (dest, 0);
> > + mem = gen_rtx_MEM (inner_mode, addr);
> > + emit_move_insn (mem, inner);
> 
> gen_rtx_MEM shouldn't be used to create new versions of existing MEMs,
> since it drops all the attributes.  It's better to use something like
> adjust_address instead.  That will also take care of making sure that the
> address is valid for the new mode.

It was my fault.  I should have realized that difference.  :-)

Felix


pr95254-v2.diff
Description: pr95254-v2.diff


Re: [PATCH PR95254] aarch64: gcc generate inefficient code with fixed sve vector length

2020-05-21 Thread Richard Sandiford
"Yangfei (Felix)"  writes:
> Hi,
>
>   I noticed a tiny SVE-related performance issue:
> https://gcc.gnu.org/bugzilla/show_bug.cgi?id=95254
>
>   For the given test case, SLP succeeds with VNx8HImode with or without
> the option -msve-vector-bits=256.
>   The root cause for the difference is that we choose a different mode in
> aarch64_vectorize_related_mode under -msve-vector-bits=256: VNx2HImode
> instead of V4HImode.
>   Then in the final tree ssa forwprop pass, we need to do a VIEW_CONVERT
> from V4HImode to VNx2HImode.
>
>   The patch catches and simplifies the pattern in
> aarch64_expand_sve_mem_move, emitting a mov pattern of V4HImode instead.
>   I am assuming endianness does not make a difference here, considering
> this simplification.
>   Bootstrapped and tested on aarch64-linux-gnu.  Comments?

I think we should try to catch this at the gimple level, possibly
during SLP itself.  Although the patch handles the case in which
the V4HI is stored directly to memory, I assume it won't help
if the code instead does:

for (i = 0; i < 4; i++)
  b[i] = u.a[i] + 1;

targetm.can_change_mode_class (..., ALL_REGS) would be a good
indicator of whether the needed VIEW_CONVERT_EXPR is cheap or expensive.
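
I.e. (sketch; FROM_MODE and TO_MODE are placeholders):

  /* If the target allows the mode change within a register class, the
     VIEW_CONVERT_EXPR is just a cheap change of view; otherwise it is
     likely to be implemented as a store/reload.  */
  if (targetm.can_change_mode_class (FROM_MODE, TO_MODE, ALL_REGS))
    /* cheap */;
  else
    /* expensive */;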

I agree it might still be worth handling this in the move patterns too.
It feels like a target-independent optimisation though, and for example
should also apply to V4HI moves involving subregs of VNx2HIs.

So I think it would be worth trying to do this in emit_move_insn.
In principle it would be useful for:

  // M1 and M2 equal size, !targetm.can_change_mode_class (M1, M2, ALL_REGS)

  (set (subreg:M1 (reg:M2 ...)) (subreg:M1 (reg:M2 ...)))
  (set (subreg:M1 (reg:M2 ...)) (mem:M1 ADDR))
  (set (mem:M1 ADDR) (subreg:M1 (reg:M2 ...)))
  (set (subreg:M1 (reg:M2 ...)) (constant C))
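
For example, on the PR's test case the third form would let a store like
(illustrative RTL only; register numbers made up):

  (set (mem:V4HI (reg:DI 95)) (subreg:V4HI (reg:VNx2HI 93) 0))

be rewritten as a move in the inner mode:

  (set (mem:VNx2HI (reg:DI 95)) (reg:VNx2HI 93))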

It would be nice if we could do this even without the can_change_mode_class
condition, provided that it doesn't turn valid M1 constants or MEMs into
invalid M2 ones (or at least, M2 ones that need to be validated).
Unfortunately, going that far is likely to interfere with target-specific
choices, so it's probably too dangerous.

With the can_change_mode_class condition it should be fine though,
since it's avoiding an implicit round trip through memory.  The change
should be a win even if the MEMs or constants need legitimising for M2.

> [...]
> +  if (inner != NULL_RTX
> +  && aarch64_classify_vector_mode (inner_mode) == VEC_ADVSIMD
> +  && GET_MODE_INNER (mode) == GET_MODE_INNER (inner_mode)
> +  && known_eq (GET_MODE_SIZE (mode), GET_MODE_SIZE (inner_mode))
> +  && GET_MODE_BITSIZE (inner_mode).to_constant () <= alignment)
> +{
> +  rtx addr, mem;
> +  if (MEM_P (src))
> + {
> +   addr = XEXP (src, 0);
> +   mem = gen_rtx_MEM (inner_mode, addr);
> +   emit_move_insn (inner, mem);
> + }
> +  else
> + {
> +   addr = XEXP (dest, 0);
> +   mem = gen_rtx_MEM (inner_mode, addr);
> +   emit_move_insn (mem, inner);

gen_rtx_MEM shouldn't be used to create new versions of existing MEMs,
since it drops all the attributes.  It's better to use something like
adjust_address instead.  That will also take care of making sure that
the address is valid for the new mode.
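
For instance, the load half might instead read (sketch):

  {
    /* Reuse SRC's memory attributes and revalidate the address for
       INNER_MODE, rather than building a bare MEM.  */
    rtx mem = adjust_address (src, inner_mode, 0);
    emit_move_insn (inner, mem);
  }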

Thanks,
Richard


[PATCH PR95254] aarch64: gcc generate inefficient code with fixed sve vector length

2020-05-21 Thread Yangfei (Felix)
Hi,

  I noticed a tiny SVE-related performance issue:
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=95254

  For the given test case, SLP succeeds with VNx8HImode with or without the
option -msve-vector-bits=256.
  The root cause for the difference is that we choose a different mode in
aarch64_vectorize_related_mode under -msve-vector-bits=256: VNx2HImode instead
of V4HImode.
  Then in the final tree ssa forwprop pass, we need to do a VIEW_CONVERT from
V4HImode to VNx2HImode.

  The patch catches and simplifies the pattern in aarch64_expand_sve_mem_move,
emitting a mov pattern of V4HImode instead.
  I am assuming endianness does not make a difference here, considering this
simplification.
  Bootstrapped and tested on aarch64-linux-gnu.  Comments?
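
For reference, the kind of source involved looks roughly like this (a
hypothetical reconstruction from the discussion in this thread, not the
actual test in the PR; compile with e.g. -O3 -march=armv8.2-a+sve
-msve-vector-bits=256):

  typedef short v4hi __attribute__ ((vector_size (8)));

  union pack { v4hi v; short a[4]; } u;
  short b[4];

  void
  foo (void)
  {
    /* With -msve-vector-bits=256, SLP vectorizes this store in
       VNx2HImode, so storing through the V4HI union member needs a
       VIEW_CONVERT from V4HImode to VNx2HImode.  */
    for (int i = 0; i < 4; i++)
      u.a[i] = b[i] + 1;
  }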

Thanks,
Felix




pr95254-v1.diff
Description: pr95254-v1.diff