Re: [PATCH v2] dse: Remove partial load after full store for high part access[PR71309]

2020-07-22 Thread luoxhu via Gcc-patches
Hi,

On 2020/7/21 23:30, Richard Sandiford wrote:
> Xiong Hu Luo  writes:
>> @@ -1872,9 +1872,27 @@ get_stored_val (store_info *store_info, machine_mode read_mode,
>>   {
>> poly_int64 shift = gap * BITS_PER_UNIT;
>> poly_int64 access_size = GET_MODE_SIZE (read_mode) + gap;
>> -  read_reg = find_shift_sequence (access_size, store_info, read_mode,
>> -  shift, optimize_bb_for_speed_p (bb),
>> -  require_cst);
>> +  rtx rhs_subreg = NULL;
>> +
>> +  if (known_eq (GET_MODE_BITSIZE (store_mode), shift * 2))
>> +{
>> +  scalar_int_mode inner_mode = smallest_int_mode_for_size (shift);
>> +  poly_uint64 sub_off
>> += ((!BYTES_BIG_ENDIAN)
>> + ? GET_MODE_SIZE (store_mode) - GET_MODE_SIZE (inner_mode)
>> + : 0);
>> +
>> +  rhs_subreg = simplify_gen_subreg (inner_mode, store_info->rhs,
>> +store_mode, sub_off);
>> +  if (rhs_subreg)
>> +read_reg
>> +  = extract_low_bits (read_mode, inner_mode, copy_rtx (rhs_subreg));
>> +}
>> +
>> +  if (read_reg == NULL)
>> +read_reg
>> +  = find_shift_sequence (access_size, store_info, read_mode, shift,
>> + optimize_bb_for_speed_p (bb), require_cst);
> 
> Did you consider doing this in find_shift_sequence instead?
> ISTM that this is really using subregs to optimise:
> 
>    /* In theory we could also check for an ashr.  Ian Taylor knows
>       of one dsp where the cost of these two was not the same.  But
>       this really is a rare case anyway.  */
>    target = expand_binop (new_mode, lshr_optab, new_reg,
>                           gen_int_shift_amount (new_mode, shift),
>                           new_reg, 1, OPTAB_DIRECT);
> 
> I think everything up to:
> 
>    /* Also try a wider mode if the necessary punning is either not
>       desirable or not possible.  */
>    if (!CONSTANT_P (store_info->rhs)
>        && !targetm.modes_tieable_p (new_mode, store_mode))
>      continue;
> 
> is either neutral or helpful for the subreg case too, so maybe
> we could just add the optimisation after that.  (It probably isn't
> worth reusing any of the later loop body code though, since the
> subreg case is much simpler.)
> 
> I don't think we need to restrict this case to modes of size
> shift * 2.  We can just check whether the shift is a multiple of
> the new_mode calculated by find_shift_sequence (using multiple_p).
> 
> An easier way of converting the shift to a subreg byte offset
> is to use subreg_offset_from_lsb, which also handles
> BYTES_BIG_ENDIAN != WORDS_BIG_ENDIAN.
> 

Thanks, I've updated the patch by moving it into find_shift_sequence.
Not sure whether it meets your comments precisely, though it still works :)
There is a comment in find_shift_sequence mentioning that:

  /* Some machines like the x86 have shift insns for each size of
 operand.  Other machines like the ppc or the ia-64 may only have
 shift insns that shift values within 32 or 64 bit registers.
 This loop tries to find the smallest shift insn that will right
 justify the value we want to read but is available in one insn on
 the machine.  */

So the loop will break out early unless some additional check is added,
because the new_mode here is TImode:

  if (GET_MODE_BITSIZE (new_mode) > BITS_PER_WORD)
    break;
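
Just to make the idea concrete, below is a minimal sketch (my reading of the
suggestion, not a final patch) of how the subreg attempt could sit in that loop
right after the modes_tieable_p check, leaving aside the question of where the
new_mode iteration has to start.  For the motivating TImode store with
shift = 64, subreg_offset_from_lsb (DImode, TImode, 64) gives byte offset 8 on
little-endian targets and 0 on big-endian, matching the manual computation in
the earlier hunk:

  /* Sketch only: if SHIFT selects a whole number of NEW_MODE-sized words,
     read that part of the stored value directly through a subreg instead
     of emitting a shift sequence.  */
  if (multiple_p (shift, GET_MODE_BITSIZE (new_mode)))
    {
      poly_uint64 offset
        = subreg_offset_from_lsb (new_mode, store_mode, shift);
      rtx rhs_subreg = simplify_gen_subreg (new_mode, store_info->rhs,
                                            store_mode, offset);
      if (rhs_subreg)
        {
          read_reg = extract_low_bits (read_mode, new_mode,
                                       copy_rtx (rhs_subreg));
          break;
        }
    }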



[PATCH v2] dse: Remove partial load after full store for high part access[PR71309]


This patch optimizes the following sequence (it works for char/short/int/void*):

6: r119:TI=[r118:DI+0x10]
7: [r118:DI]=r119:TI
8: r121:DI=[r118:DI+0x8]

=>

6: r119:TI=[r118:DI+0x10]
18: r122:TI=r119:TI
16: r123:TI#0=r122:TI#8 0>>0
17: r123:TI#8=0
19: r124:DI=r123:TI#0
7: [r118:DI]=r119:TI
8: r121:DI=r124:DI

The final asm is as below, with no partial load after the full store (stxv+ld):
  mr 9,3
  ld 3,24(3)
  ld 10,16(3)
  std 3,8(9)
  std 10,0(9)
  blr
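
For reference, the kind of C source that produces this sequence is a full
struct copy followed by a read of one field of the copy.  A minimal sketch
(struct and field names are made up for illustration; this is not necessarily
the exact new pr71309.c test):

struct path { void *mnt; void *dentry; };
struct nameidata { struct path path; struct path root; };

void *
foo (struct nameidata *nd)
{
  nd->path = nd->root;     /* One 16-byte store of the whole struct.  */
  return nd->path.dentry;  /* 8-byte read of the high half of that store.  */
}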

It can achieve a ~25% performance improvement for typical cases on
Power9.  Bootstrapped and regression tested on Power9-LE.

For AArch64, one ldr is replaced by a mov:

ldp x2, x3, [x0, 16]
stp x2, x3, [x0]
ldr x0, [x0, 8]

=>

mov x1, x0
ldp x2, x0, [x0, 16]
stp x2, x0, [x1]

gcc/ChangeLog:

2020-07-22  Xionghu Luo  

PR rtl-optimization/71309
* dse.c (find_shift_sequence): Use a subreg of the high part of the
stored register to avoid loading it back from the address.

gcc/testsuite/ChangeLog:

2020-07-22  Xionghu Luo  

PR rtl-optimization/71309
* gcc.target/powerpc/pr71309.c: New test.
---
 gcc/dse.c  | 15 +-
 gcc/testsuite/gcc.target/powerpc/pr71309.c |