Re: divide 64-bit by constant for 32-bit target machines

2012-06-15 Thread Richard Earnshaw
On 14/06/12 19:46, Dinar Temirbulatov wrote:
 Hi,
 OK for trunk?
 thanks, Dinar.
 

I'm still not comfortable about the code bloat that this is likely to
incurr at -O2.

R.

 On Tue, Jun 12, 2012 at 11:00 AM, Paolo Bonzini bonz...@gnu.org wrote:
 Il 12/06/2012 08:52, Dinar Temirbulatov ha scritto:
 is safe?  That is, that the underflows cannot produce a wrong result?

 [snip]

 Thanks very much!

 Paolo=

 ChangeLog.txt


 2012-06-14  Dinar Temirbulatov  dtemirbula...@gmail.com
  Alexey Kravets  mr.kayr...@gmail.com
  Paolo Bonzini  bonz...@gnu.org

  * config/arm/arm.c (arm_rtx_costs_1): Add cost estimate for the integer
  double-word division operation.
  * config/mips/mips.c (mips_rtx_costs): Extend cost estimate for the 
 integer
  double-word division operation for 32-bit targets.
  * gcc/expmed.c (expand_mult_highpart_optab): Allow to generate the 
 higher multipilcation
  product for unsigned double-word integers using 32-bit wide registers.

 30.patch


 N¬n‡r¥ªíÂ)emçhÂyhiם¢w^™©Ý




Re: divide 64-bit by constant for 32-bit target machines

2012-06-15 Thread Dinar Temirbulatov
Hi, Richard,
How about if I add and utilize umul_highpart_di to the libgcc
instead of expanding
multiplication for the high part directly, or add my own function with
with pre-shift, post-shift, and
64-bit constant and 64-bit operand as function parameters for division
for less than -O3?
 thanks, Dinar.

On Fri, Jun 15, 2012 at 12:12 PM, Richard Earnshaw rearn...@arm.com wrote:
 On 14/06/12 19:46, Dinar Temirbulatov wrote:
 Hi,
 OK for trunk?
                 thanks, Dinar.


 I'm still not comfortable about the code bloat that this is likely to
 incurr at -O2.

 R.

 On Tue, Jun 12, 2012 at 11:00 AM, Paolo Bonzini bonz...@gnu.org wrote:
 Il 12/06/2012 08:52, Dinar Temirbulatov ha scritto:
 is safe?  That is, that the underflows cannot produce a wrong result?

 [snip]

 Thanks very much!

 Paolo=

 ChangeLog.txt


 2012-06-14  Dinar Temirbulatov  dtemirbula...@gmail.com
          Alexey Kravets  mr.kayr...@gmail.com
          Paolo Bonzini  bonz...@gnu.org

      * config/arm/arm.c (arm_rtx_costs_1): Add cost estimate for the integer
      double-word division operation.
      * config/mips/mips.c (mips_rtx_costs): Extend cost estimate for the 
 integer
      double-word division operation for 32-bit targets.
      * gcc/expmed.c (expand_mult_highpart_optab): Allow to generate the 
 higher multipilcation
      product for unsigned double-word integers using 32-bit wide registers.

 30.patch


 N ¬n‡r¥ªíÂ)emçhÂyhi× ¢w^™©Ý




Re: divide 64-bit by constant for 32-bit target machines

2012-06-14 Thread Dinar Temirbulatov
Hi,
OK for trunk?
thanks, Dinar.

On Tue, Jun 12, 2012 at 11:00 AM, Paolo Bonzini bonz...@gnu.org wrote:
 Il 12/06/2012 08:52, Dinar Temirbulatov ha scritto:
 is safe?  That is, that the underflows cannot produce a wrong result?

 [snip]

 Thanks very much!

 Paolo
2012-06-14  Dinar Temirbulatov  dtemirbula...@gmail.com
Alexey Kravets  mr.kayr...@gmail.com
Paolo Bonzini  bonz...@gnu.org
  
* config/arm/arm.c (arm_rtx_costs_1): Add cost estimate for the integer
double-word division operation.
* config/mips/mips.c (mips_rtx_costs): Extend cost estimate for the 
integer
double-word division operation for 32-bit targets.
* gcc/expmed.c (expand_mult_highpart_optab): Allow to generate the 
higher multipilcation
product for unsigned double-word integers using 32-bit wide registers.


30.patch
Description: Binary data


Re: divide 64-bit by constant for 32-bit target machines

2012-06-08 Thread Dinar Temirbulatov
Hi, Paolo.
Here is the new version of patch. I have tested this version with gcc
testsuite only on i686 without new regressions, for now. Mips and arm
tests are in progress.

One strange thing I noticed:

 No need for this gen_reg_rtx, either, by passing a NULL_RTX target below.

 +      carry_result = expand_shift (LSHIFT_EXPR, mode, carry, BITS_PER_WORD, 
 carry_result, 1);
 +
 +      /* Adding 0x1 as carry here if required.  */

 Oops, a remnant of 32-bit specific code.


that I have to add convert_to_mode () to DImode after
emit_store_flag_force (), since emit_store_flag_force () returns
carry in SImode and without convert_to_mode () call compiler fails
with this error:

Breakpoint 2, simplify_subreg (outermode=SImode, op=0x756cdf20,
innermode=DImode, byte=0) at
../../gcc-20120418-1/gcc/simplify-rtx.c:5423
5423  gcc_assert (GET_MODE (op) == innermode
(gdb) bt
#0  simplify_subreg (outermode=SImode, op=0x756cdf20,
innermode=DImode, byte=0) at
../../gcc-20120418-1/gcc/simplify-rtx.c:5423
#1  0x00aea223 in simplify_gen_subreg (outermode=SImode,
op=0x756cdf20, innermode=DImode, byte=0) at
../../gcc-20120418-1/gcc/simplify-rtx.c:5763
#2  0x00733c99 in operand_subword (op=0x756cdf20,
offset=0, validate_address=1, mode=DImode) at
../../gcc-20120418-1/gcc/emit-rtl.c:1427
#3  0x00733cc6 in operand_subword_force (op=0x756cdf20,
offset=0, mode=DImode) at ../../gcc-20120418-1/gcc/emit-rtl.c:1440
#4  0x00a016b3 in expand_binop (mode=DImode,
binoptab=0x195f580, op0=0x756cdf20, op1=0x7583d670,
target=0x756cdfa0, unsignedp=1, methods=OPTAB_DIRECT)
at ../../gcc-20120418-1/gcc/optabs.c:1779
#5  0x007525af in expand_shift_1 (code=LSHIFT_EXPR,
mode=DImode, shifted=0x756cdf20, amount=0x7583d670,
target=0x0, unsignedp=1)
at ../../gcc-20120418-1/gcc/expmed.c:2273
#6  0x007526b6 in expand_shift (code=LSHIFT_EXPR, mode=DImode,
shifted=0x756cdf20, amount=32, target=0x0, unsignedp=1) at
../../gcc-20120418-1/gcc/expmed.c:2318
#7  0x007563e6 in expand_mult_highpart_optab (mode=DImode,
op0=0x756cdcc0, op1=0x756b1e00, target=0x0, unsignedp=1,
max_cost=188)
at ../../gcc-20120418-1/gcc/expmed.c:3581
#8  0x00756747 in expand_mult_highpart (mode=DImode,
op0=0x756cdcc0, op1=0x756b1e00, target=0x0, unsignedp=1,
max_cost=188)
at ../../gcc-20120418-1/gcc/expmed.c:3654

  thanks, Dinar.


30.patch
Description: Binary data


Re: divide 64-bit by constant for 32-bit target machines

2012-06-07 Thread Dinar Temirbulatov
Hi,
Here is new version of patch based up on Paolo review, again tested on
arm-7l, mips-32r2 (74k), i686 without new regressions.
  thanks, Dinar.

On Sat, May 26, 2012 at 4:45 PM, Paolo Bonzini bonz...@gnu.org wrote:
 Il 26/05/2012 14:35, Paolo Bonzini ha scritto:
        /* We have to return

             z2 + ((u0 + u1)  GET_MODE_BITSIZE (word_mode)).

           u0 + u1 are the upper two words of the three-word
           intermediate result and they could have up to
           2 * GET_MODE_BITSIZE (word_mode) + 1 bits of precision.
           We compute the extra bit by checking for carry, and add
           1  GET_MODE_BITSIZE (word_mode) to z2 if there is carry.  */

 Oops, GET_MODE_BITSIZE (word_mode) is more concisely BITS_PER_WORD.

  +      tmp = expand_binop (mode, add_optab, u0, u1, tmp, 1, 
  OPTAB_LIB_WIDEN);
  +      if (!tmp)
  +             return 0;
          /* We have to return z2 + (tmp  32).  We need
  +      /* Checking for overflow.  */
 This is not overflow, it's carry (see above).

  +      c = gen_reg_rtx (mode);
  +      c1 = gen_reg_rtx (mode);
  +      cres = gen_reg_rtx (mode);
  +
  +      emit_store_flag_force (c, GT, u0, tmp, mode, 1, 1);
  +      emit_store_flag_force (c1, GT, u1, tmp, mode, 1, 1);
  +      result = expand_binop (mode, ior_optab, c, c1, cres, 1, 
  OPTAB_LIB_WIDEN);
  +      if (!result)
  +           return 0;
  +
  +      ccst = gen_reg_rtx (mode);
  +      ccst = expand_shift (LSHIFT_EXPR, mode, cres, 32, ccst, 1);
 This 32 should be GET_MODE_BITSIZE (word_mode).

 Here, too.

 Paolo

2012-06-07  Dinar Temirbulatov  dtemirbula...@gmail.com
Alexey Kravets  mr.kayr...@gmail.com
  
* config/arm/arm.c (arm_rtx_costs_1): Add cost estimate for the integer
double-word division operation.
* config/mips/mips.c (mips_rtx_costs): Extend cost estimate for the 
integer
double-word division operation for 32-bit targets.
* gcc/expmed.c (expand_mult_highpart_optab): Allow to generate the 
higher multipilcation
product for unsigned double-word integers using 32-bit wide registers.


28.patch
Description: Binary data


Re: divide 64-bit by constant for 32-bit target machines

2012-06-07 Thread Paolo Bonzini
Il 07/06/2012 12:21, Dinar Temirbulatov ha scritto:
 oh, I found typo in comment in the end of patch. fixed.

Great improvement, thanks!

Unfortunately we're not there yet, but much closer!  I could understand
the new code much better so I suggest some more improvements below, both
to the comments and to the code generation.

 diff --git a/gcc/config/arm/arm.c b/gcc/config/arm/arm.c
 index 8a86227..0f8120f 100644
 --- a/gcc/config/arm/arm.c
 +++ b/gcc/config/arm/arm.c
 @@ -7130,6 +7130,8 @@ arm_rtx_costs_1 (rtx x, enum rtx_code outer, int* 
 total, bool speed)
   *total = COSTS_N_INSNS (2);
else if (TARGET_HARD_FLOAT  mode == DFmode  !TARGET_VFP_SINGLE)
   *total = COSTS_N_INSNS (4);
 +  else if (mode == DImode) 
 +*total = COSTS_N_INSNS (50);
else
   *total = COSTS_N_INSNS (20);
return false;
 diff --git a/gcc/config/mips/mips.c b/gcc/config/mips/mips.c
 index 5bcb7a8..57bb4cc 100644
 --- a/gcc/config/mips/mips.c
 +++ b/gcc/config/mips/mips.c
 @@ -3879,8 +3879,13 @@ mips_rtx_costs (rtx x, int code, int outer_code, int 
 opno ATTRIBUTE_UNUSED,
   }
 *total = COSTS_N_INSNS (mips_idiv_insns ());
   }
 -  else if (mode == DImode)
 +  else if (mode == DImode) {
 + if (!TARGET_64BIT)
 +/* divide double integer libcall is expensive.  */
 +*total = COSTS_N_INSNS (200);
 +   else
  *total = mips_cost-int_div_di;
 + }
else
   *total = mips_cost-int_div_si;
return false;
 diff --git a/gcc/expmed.c b/gcc/expmed.c
 index 98f7c09..bb4d7cd 100644
 --- a/gcc/expmed.c
 +++ b/gcc/expmed.c
 @@ -3539,6 +3539,84 @@ expand_mult_highpart_optab (enum machine_mode mode, 
 rtx op0, rtx op1,
   }
  }
  
 +  if (unsignedp
 +   size - 1  BITS_PER_WORD 
 +   (!optimize_size  (optimize1))

Coding style: (!optimize_size  optimize  1).

 +   (4 * mul_cost[speed][mode] + 4 * add_cost[speed][mode]
 +  + shift_cost[speed][mode][31]  max_cost))

Thanks, this is now much cleaner and I could see other improvements.

This should be

  3 * mul_widen_cost[speed][mode] + mul_highpart_cost[speed][mode] +
  4 * add_cost[speed][mode] + add_cost[speed][word_mode]

That is because there is no shift really: a shift by 32 is simply moving
the operand to the higher word, and an add of that value will ignore the
lower word.  Hence, summing carry_result is cheaper: that is
add_cost[speed][word_mode].

On the other hand you also have to consider the comparison emitted by
emit_store_flag_force, which will usually cost the same as an addition.
 That is the fourth add_cost[speed][mode].

 +{
 +  rtx x1, x0, y1, y0, z2, z0, tmp, u0, u0tmp, u1, carry, carry_result, 
 result;
 +  /* Extracting the higher part of the 64-bit multiplier.  */
 +  x1 = gen_highpart (word_mode, op0);
 +  x1 = force_reg (word_mode, x1);
 +
 +  /* Extracting the lower part of the 64-bit multiplier.  */
 +  x0 = gen_lowpart (word_mode, op0);
 +  x0 = force_reg (word_mode, x0);
 +
 +  /* Splitting the 64-bit constant for the higher and the lower parts.  
 */
 +  y0 = gen_lowpart (word_mode, op1);
 +  y0 = force_reg (word_mode, y0);
 +  y1 = gen_highpart_mode (word_mode, mode, op1);
 +
 +  z2 = gen_reg_rtx (mode);
 +  u0 = gen_reg_rtx (mode);

You do not need the gen_reg_rtx; just pass a NULL_RTX target to
expand_widening_mult.

 +  z2 = expand_widening_mult (mode, x1, y1, z2, 1, umul_widen_optab);
 +

Remove the empty line.  Also, let's rename the values to make clear
where is the multiplication of what:

z2 - u11
u0 - u01
z0 - u00
u1 - u10


 +  u0 = expand_widening_mult (mode, x0, y1, u0, 1, umul_widen_optab);
 +
 +  z0 = gen_reg_rtx (mode);
 +  u1 = gen_reg_rtx (mode);

gen_reg_rtx is not needed here too.

 +  z0 = expand_widening_mult (mode, x0, y0, z0, 1, umul_widen_optab);
 +

And neither is this blank line.

 +  u1 = expand_widening_mult (mode, x1, y0, u1, 1, umul_widen_optab);
 +  /* Compute the middle word of the three-word intermediate result.  */
^^

Oops, this is the low word, not the middle.  But let's improve the
comment to explain the algorithm.

   /* u00, u01, u10, u11 form a three-word value with the result
  in the top word, so we want to return this:

((u11  2*BITS_PER_WORD) +
 (u01 + u10  BITS_PER_WORD) +
u00)  3 * BITS_PER_WORD

  We then rewrite it this way:

(u11 + ((u01 + u10 + (u00  BITS_PER_WORD))
BITS_PER_WORD)  BITS_PER_WORD

  where the shifts are realized with gen_highpart and a
  conversion back to the wider mode.  */

 +  u0tmp = gen_highpart (word_mode, z0);
 +  u0tmp = force_reg (word_mode, u0tmp);
 +  u0tmp = convert_to_mode (mode, u0tmp, 1);

u0tmp - u00h

Put the expand_inc (u01, u00h); before the comment.  The formula is
now above so we can say more 

Re: divide 64-bit by constant for 32-bit target machines

2012-05-26 Thread Paolo Bonzini

 diff --git a/gcc/config/arm/arm.c b/gcc/config/arm/arm.c
 index 2cecf45..9d6983b 100644
 --- a/gcc/config/arm/arm.c
 +++ b/gcc/config/arm/arm.c
 @@ -7131,6 +7131,8 @@ arm_rtx_costs_1 (rtx x, enum rtx_code outer, int* 
 total, bool speed)
   *total = COSTS_N_INSNS (2);
else if (TARGET_HARD_FLOAT  mode == DFmode  !TARGET_VFP_SINGLE)
   *total = COSTS_N_INSNS (4);
 +  else if (mode == DImode)
 +*total = COSTS_N_INSNS (50);
else
   *total = COSTS_N_INSNS (20);
return false;
 diff --git a/gcc/config/mips/mips.c b/gcc/config/mips/mips.c
 index d48a465..b5627c2 100644
 --- a/gcc/config/mips/mips.c
 +++ b/gcc/config/mips/mips.c
 @@ -3846,7 +3846,13 @@ mips_rtx_costs (rtx x, int code, int outer_code, int 
 opno ATTRIBUTE_UNUSED,
 *total = COSTS_N_INSNS (mips_idiv_insns ());
   }
else if (mode == DImode)
 -*total = mips_cost-int_div_di;
 +{
 +  if (!TARGET_64BIT)
 +/* Divide double integer library call is expensive.  */
 +*total = COSTS_N_INSNS (200);
 +  else
 +*total = mips_cost-int_div_di;
 +}
else
   *total = mips_cost-int_div_si;
return false;
 diff --git a/gcc/expmed.c b/gcc/expmed.c
 index aa24fbf..5f4c921 100644
 --- a/gcc/expmed.c
 +++ b/gcc/expmed.c
 @@ -3523,6 +3523,105 @@ expand_mult_highpart_optab (enum machine_mode mode, 
 rtx op0, rtx op1,
   }
  }
  
 +  if (unsignedp  (!optimize_size  (optimize1))
 +   (size - 1  BITS_PER_WORD
 +   BITS_PER_WORD == 32  GET_MODE_BITSIZE (mode) == 
 2*BITS_PER_WORD)

These references to 32-bits are still wrong (and unnecessary, just
remove them).

 +   (4 * mul_cost[speed][mode] + 4 * add_cost[speed][mode]
 +  + shift_cost[speed][mode][31]  max_cost))
 +{
 +  unsigned HOST_WIDE_INT d;
 +  rtx x1, x0, y1, y0, z2, z0, tmp, u0, u0tmp, u1, c, c1, ccst, cres, 
 result;
 +
 +  d = (INTVAL (op1)  GET_MODE_MASK (mode));

This could be a CONST_DOUBLE.  But you don't need d, because you can...

 +  /* Extracting the higher part of the 64-bit multiplier.  */
 +  x1 = gen_highpart (word_mode, op0);
 +  x1 = force_reg (word_mode, x1);
 +
 +  /* Extracting the lower part of the 64-bit multiplier.  */
 +  x0 = gen_lowpart (word_mode, op0);
 +  x0 = force_reg (word_mode, x0);
 +
 +  /* Splitting the 64-bit constant for the higher and the lower parts.  
 */
 +  y0 = gen_int_mode(d  UINT32_MAX, word_mode);
 +  y1 = gen_int_mode(d  32, word_mode);

... use gen_lowpart and gen_highpart directly on op1.

 +
 +  z2 = gen_reg_rtx (mode);
 +  u0 = gen_reg_rtx (mode);
 +
 +  /* Unsigned multiplication of the higher multiplier part
 +  and the higher constant part.  */
 +  z2 = expand_widening_mult (mode, x1, y1, z2, 1, umul_widen_optab);
 +  /* Unsigned multiplication of the lower multiplier part
 +  and the higher constant part.  */
 +  u0 = expand_widening_mult (mode, x0, y1, u0, 1, umul_widen_optab);
 +
 +  z0 = gen_reg_rtx (mode);
 +  u1 = gen_reg_rtx (mode);
 +
 +  /* Unsigned multiplication of the lower multiplier part
 +  and the lower constant part.  */
 +  z0 = expand_widening_mult (mode, x0, y0, z0, 1, umul_widen_optab);
 +
 +  /* Unsigned multiplication of the higher multiplier part
 +  the lower constant part.  */
 +  u1 = expand_widening_mult (mode, x1, y0, u1, 1, umul_widen_optab);

Up to here the comments are not necessary.

 +  /* Getting the higher part of multiplication between the lower 
 multiplier
 +  part and the lower constant part, the lower part is not interesting
 +  for the final result.  */
 +  u0tmp = gen_highpart (word_mode, z0);
 +  u0tmp = force_reg (word_mode, u0tmp);
 +  u0tmp = convert_to_mode (mode, u0tmp, 1);
 +
 +  /* Adding the higher part of multiplication between the lower 
 multiplier
 +  part and the lower constant part to the result of multiplication 
 between
 +  the lower multiplier part and the higher constant part. Please note,
 +  that we couldn't get overflow here since in the worst case
 +  (0x*0x)+0x we get 0xL.  */

The command can simply be compute the middle word of the three-word
intermediate result.  Also it's not overflow, it's carry.

 +  expand_inc (u0, u0tmp);
 +  tmp = gen_reg_rtx (mode);
 +
 +  /* Adding multiplication between the lower multiplier part and the 
 higher
 +  constant part with the higher part of multiplication between the lower
 +  multiplier part and the lower constant part to the result of 
 multiplication
 +  between the higher multiplier part and the lower constant part.  */

Here you have to explain:

   /* We have to return

z2 + ((u0 + u1)  GET_MODE_BITSIZE (word_mode)).

  u0 + u1 are the upper two words of the three-word
  intermediate result and they could 

Re: divide 64-bit by constant for 32-bit target machines

2012-05-26 Thread Paolo Bonzini
Il 25/05/2012 12:20, Dinar Temirbulatov ha scritto:
 +  emit_store_flag_force (c, GT, u0, tmp, mode, 1, 1);
 +  emit_store_flag_force (c1, GT, u1, tmp, mode, 1, 1);
 +  result = expand_binop (mode, ior_optab, c, c1, cres, 1, 
 OPTAB_LIB_WIDEN);
 +  if (!result)
 +   return 0;

Ah, you don't need the or.  u0  tmp is already giving the overflow.

Paolo


Re: divide 64-bit by constant for 32-bit target machines

2012-05-26 Thread Paolo Bonzini
Il 26/05/2012 14:35, Paolo Bonzini ha scritto:
/* We have to return
 
 z2 + ((u0 + u1)  GET_MODE_BITSIZE (word_mode)).
 
   u0 + u1 are the upper two words of the three-word
   intermediate result and they could have up to
   2 * GET_MODE_BITSIZE (word_mode) + 1 bits of precision.
   We compute the extra bit by checking for carry, and add
   1  GET_MODE_BITSIZE (word_mode) to z2 if there is carry.  */

Oops, GET_MODE_BITSIZE (word_mode) is more concisely BITS_PER_WORD.

  +  tmp = expand_binop (mode, add_optab, u0, u1, tmp, 1, 
  OPTAB_LIB_WIDEN);
  +  if (!tmp)
  + return 0;
  /* We have to return z2 + (tmp  32).  We need
  +  /* Checking for overflow.  */
 This is not overflow, it's carry (see above).
 
  +  c = gen_reg_rtx (mode);
  +  c1 = gen_reg_rtx (mode);
  +  cres = gen_reg_rtx (mode);
  +
  +  emit_store_flag_force (c, GT, u0, tmp, mode, 1, 1);
  +  emit_store_flag_force (c1, GT, u1, tmp, mode, 1, 1);
  +  result = expand_binop (mode, ior_optab, c, c1, cres, 1, 
  OPTAB_LIB_WIDEN);
  +  if (!result)
  +   return 0;
  +
  +  ccst = gen_reg_rtx (mode);
  +  ccst = expand_shift (LSHIFT_EXPR, mode, cres, 32, ccst, 1);
 This 32 should be GET_MODE_BITSIZE (word_mode).

Here, too.

Paolo



Re: divide 64-bit by constant for 32-bit target machines

2012-05-25 Thread Dinar Temirbulatov
Hi,
I have replaced expand_mult to expand_widening_mult and removed
all direct references to DImode, SImode modes in the
expand_mult_highpart_optab funtion. The attached patch was tested on
arm-7l, mips-32r2 (74k), i686 without new regressions. Richard, do you
think it is ready now?
thanks, Dinar.

On Tue, May 22, 2012 at 7:45 PM, Richard Henderson r...@redhat.com wrote:
 On 05/22/12 07:05, Dinar Temirbulatov wrote:
 +  if ((size - 1  BITS_PER_WORD
 +        BITS_PER_WORD == 32  mode == DImode)

 Do note that this patch will not go in with hard-coded
 SImode and DImode references.

 Which, really, you do not even need.

   GET_MODE_BITSIZE (mode) == 2*BITS_PER_WORD

 is what you wanted for testing for double-word-ness,
 and word_mode is a good substitute for SImode here.

 +      /* Splitting the 64-bit constant for the higher and the lower parts.  
 */
 +      y0 = gen_rtx_CONST_INT (DImode, dUINT32_MAX);
 +      y1 = gen_rtx_CONST_INT (DImode, d32);

 Use gen_int_mode.

 +      x1 = convert_to_mode (DImode, x1, 1);
 +      x0 = convert_to_mode (DImode, x0, 1);
 +
 +      /* Splitting the 64-bit constant for the higher and the lower parts.  
 */
 +      y0 = gen_rtx_CONST_INT (DImode, dUINT32_MAX);
 +      y1 = gen_rtx_CONST_INT (DImode, d32);
 +
 +      z2 = gen_reg_rtx (DImode);
 +      u0 = gen_reg_rtx (DImode);
 +
 +      /* Unsigned multiplication of the higher multiplier part
 +      and the higher constant part.  */
 +      z2 = expand_mult(DImode, x1, y1, z2, 1);
 +      /* Unsigned multiplication of the lower multiplier part
 +         and the higher constant part.  */
 +      u0 = expand_mult(DImode, x0, y1, u0, 1);

 I'm fairly sure you really want to be using expand_widening_mult
 here, rather than using convert_to_mode first.  While combine may
 be able to re-generate a widening multiply out of this sequence,
 there's no sense making it work too hard.



 r~
2012-05-25  Dinar Temirbulatov  dtemirbula...@gmail.com
Alexey Kravets  mr.kayr...@gmail.com
  
* config/arm/arm.c (arm_rtx_costs_1): Add cost estimate for the integer
double-word division operation.
* config/mips/mips.c (mips_rtx_costs): Extend cost estimate for the 
integer
double-word division operation for 32-bit targets.
* gcc/expmed.c (expand_mult_highpart_optab): Allow to generate the 
higher multipilcation
product for unsigned double-word integers using 32-bit wide registers.


22.patch
Description: Binary data


Re: divide 64-bit by constant for 32-bit target machines

2012-05-22 Thread Dinar Temirbulatov
Hi,
Here is the new version of the patch
I have fixed two errors in the previous version,
   on mips32 the compiler could not expand division and terminated
with ICE, this change fixed the issue:
  /* Extrating the higher part of the sum */
  tmp = gen_highpart (SImode, tmp);
  tmp = force_reg (SImode, tmp);
+  tmp = force_reg (SImode, tmp);
+  tmp = convert_to_mode (DImode, tmp, 1);

   and another error on i686 and mips32r2: I found that overflow
check in multiplication was not working. For example
0xfe34b4190a392b23/257 produced wrong result. Following change
resolved the issue:
   -emit_store_flag_force (c, GT, u0, tmp, DImode, 1, -1);
   +emit_store_flag_force (c, GT, u0, tmp, DImode, 1, 1);
Tested this new version of the patch on -none-linux-gnu with arm-7l,
mips-32r2 (74k), i686 without new regressions.

On Thu, May 3, 2012 at 5:40 PM, Richard Earnshaw rearn...@arm.com wrote:
 On 03/05/12 11:27, Dinar Temirbulatov wrote:


 This clearly needs more work.

 Comments:  Need to end with a period and two spaces.
 Binary Operators: Need to be surrounded with white space.
sorry for this, hope I resolved such issues with the new version.

 As Andrew Haley has already said, some documentation of the algorithm is
 needed.
General documentation for the issue could be found here
gmplib.org/~tege/divcnst-pldi94.pdf.
Multiplication to get higher 128-bit was developed by me and Alexey
Kravets, I attached C version of the algorithm.

 Why is this restricted to BITS_PER_WORD == 32?
I am checking here that we are generating code for 32-bit target with
32-bit wide general propose registers, and with 64-bit wide registers
usually there is an instruction to get the higher 64-bit of 64x64-bit
multiplication cheaply.


 Costs: This code clearly needs to understand the relative cost of doing
 division this way; there is a limit to the amount of inline expansion
 that we should tolerate, particularly at -O2 and it's not clear that
 this will be much better than a library call if we don't have a widening
 multiply operation (as is true for older ARM chips, and presumably some
 other CPUs).  In essence, you need to work out the cost of a divide
 instruction (just as rtx costs for this) and the approximate cost of the
 expanded algorithm.
Added cost calculation.

 Another issue that worries me is the number of live DImode values; on
 machines with few registers this could cause excessive spilling to start
 happening.
The cost calculation suppose to take this into account.

 I also wonder whether it would be beneficial to generate custom
 functions for division by specific constants (using this algorithm) and
 then call those functions rather than the standard lib-call.  On ELF
 systems the functions can marked as hidden and put into common sections
 so that we only end up with one instance of each function in a program.
yes, I think it is a good approach, I could redo my patch with such
intrinsic function implementation with pre-shift, post-shift, and
64-bit constant as function parameters.

 Finally, do you have a copyright assignment with the FSF?  We can't use
 this code without one.
yes, I do have a copyright assignment with the FSF.
Also I am in process of implementing signed integer 64-bit division by constant.

thanks, Dinar.


18.patch
Description: Binary data
unsigned long long mul(unsigned long long a, unsigned long long b)
{
unsigned long long x1=a32;
unsigned long long x0=a0x;
unsigned long long y1=b32;
unsigned long long y0=b0x;
unsigned long long z2, z0, tmp, u0, u1;
unsigned char c=0;
z2=x1*y1;
z0=x0*y0;
u0=x0*y1+(z032);
u1=x1*y0;
tmp = (u0+u1);
c = (tmp  u0) || (tmp  u1);
return z2+(tmp32)+(((unsigned long long)c)32);
}



Re: divide 64-bit by constant for 32-bit target machines

2012-05-22 Thread Richard Henderson
On 05/22/12 07:05, Dinar Temirbulatov wrote:
 +  if ((size - 1  BITS_PER_WORD
 +BITS_PER_WORD == 32  mode == DImode)

Do note that this patch will not go in with hard-coded
SImode and DImode references.

Which, really, you do not even need.

   GET_MODE_BITSIZE (mode) == 2*BITS_PER_WORD

is what you wanted for testing for double-word-ness,
and word_mode is a good substitute for SImode here.

+  /* Splitting the 64-bit constant for the higher and the lower parts.  */
+  y0 = gen_rtx_CONST_INT (DImode, dUINT32_MAX);
+  y1 = gen_rtx_CONST_INT (DImode, d32);

Use gen_int_mode.

 +  x1 = convert_to_mode (DImode, x1, 1);
 +  x0 = convert_to_mode (DImode, x0, 1);
 +
 +  /* Splitting the 64-bit constant for the higher and the lower parts.  
 */
 +  y0 = gen_rtx_CONST_INT (DImode, dUINT32_MAX);
 +  y1 = gen_rtx_CONST_INT (DImode, d32);
 +
 +  z2 = gen_reg_rtx (DImode);
 +  u0 = gen_reg_rtx (DImode);
 +
 +  /* Unsigned multiplication of the higher multiplier part
 +  and the higher constant part.  */
 +  z2 = expand_mult(DImode, x1, y1, z2, 1);
 +  /* Unsigned multiplication of the lower multiplier part
 + and the higher constant part.  */
 +  u0 = expand_mult(DImode, x0, y1, u0, 1);

I'm fairly sure you really want to be using expand_widening_mult
here, rather than using convert_to_mode first.  While combine may
be able to re-generate a widening multiply out of this sequence,
there's no sense making it work too hard.



r~


Re: divide 64-bit by constant for 32-bit target machines

2012-05-03 Thread Dinar Temirbulatov
Hi,
Here is updated version of patch. I added comments describing the algorithm.

 Hi Dinar.  I'm afraid it gives the wrong results for some dividends
  * 82625484914982912 / 2023346444509052928: gives 4096, should be zero
  * 18317463604061229328 / 2023346444509052928: gives 4109, should be 9
  * 12097415865295708879 / 4545815675034402816: gives 130, should be 2
  * 18195490362097456014 / 6999635335417036800: gives 10, should be 2
Oh, I have used signed multiplication instead of unsigned and that was
the reason of errors above, fixed that typo.
Tested on arm-7l with no new regressions.
Ok for trunk?

thanks, Dinar.


14.patch
Description: Binary data


Re: divide 64-bit by constant for 32-bit target machines

2012-05-03 Thread Richard Earnshaw
On 03/05/12 11:27, Dinar Temirbulatov wrote:
 Hi,
 Here is updated version of patch. I added comments describing the algorithm.
 
 Hi Dinar.  I'm afraid it gives the wrong results for some dividends
  * 82625484914982912 / 2023346444509052928: gives 4096, should be zero
  * 18317463604061229328 / 2023346444509052928: gives 4109, should be 9
  * 12097415865295708879 / 4545815675034402816: gives 130, should be 2
  * 18195490362097456014 / 6999635335417036800: gives 10, should be 2
 Oh, I have used signed multiplication instead of unsigned and that was
 the reason of errors above, fixed that typo.
 Tested on arm-7l with no new regressions.
 Ok for trunk?
 


This clearly needs more work.

Comments:  Need to end with a period and two spaces.
Binary Operators: Need to be surrounded with white space.

As Andrew Haley has already said, some documentation of the algorithm is
needed.

Why is this restricted to BITS_PER_WORD == 32?

Costs: This code clearly needs to understand the relative cost of doing
division this way; there is a limit to the amount of inline expansion
that we should tolerate, particularly at -O2 and it's not clear that
this will be much better than a library call if we don't have a widening
multiply operation (as is true for older ARM chips, and presumably some
other CPUs).  In essence, you need to work out the cost of a divide
instruction (just as rtx costs for this) and the approximate cost of the
expanded algorithm.

Another issue that worries me is the number of live DImode values; on
machines with few registers this could cause excessive spilling to start
happening.

I also wonder whether it would be beneficial to generate custom
functions for division by specific constants (using this algorithm) and
then call those functions rather than the standard lib-call.  On ELF
systems the functions can marked as hidden and put into common sections
so that we only end up with one instance of each function in a program.

+  /* Checking for owerflow, please not that we couldn't user carry-flag
+ here before the reload pass .
+cres = (tmp  u0) || (tmp  u1); */

Generic code can't assume the presence of a carry flag either before or
after reload, so the latter part of the comment is somewhat meaningless.
 Also spelling error in comment.

Finally, do you have a copyright assignment with the FSF?  We can't use
this code without one.

R.



Re: divide 64-bit by constant for 32-bit target machines

2012-04-23 Thread Andrew Haley
On 04/20/2012 01:57 PM, Dinar Temirbulatov wrote:
 Here is the patch that adds support for divide 64-bit by constant for
 32-bit target machines, this patch was tested on arm-7a with no new
 regressions, also I am not sure on how to avoid for example i686
 targets since div operation there is fast compared to over targets and
 it showed better performance with  libc/sysdeps/wordsize-32/divdi3.c
 __udivdi3 vs my implementation on the compiler side, it is not clear
 for me by looking at the udiv_cost[speed][SImode] value.

I can't approve this patch.  However, I do think that a comment
which explains the algorithm is needed.

Andrew.



Re: divide 64-bit by constant for 32-bit target machines

2012-04-23 Thread Michael Hope
On 21 April 2012 00:57, Dinar Temirbulatov dtemirbula...@gmail.com wrote:
 Hi,
 Here is the patch that adds support for divide 64-bit by constant for
 32-bit target machines, this patch was tested on arm-7a with no new
 regressions, also I am not sure on how to avoid for example i686
 targets since div operation there is fast compared to over targets and
 it showed better performance with  libc/sysdeps/wordsize-32/divdi3.c
 __udivdi3 vs my implementation on the compiler side, it is not clear
 for me by looking at the udiv_cost[speed][SImode] value.

Hi Dinar.  I'm afraid it gives the wrong results for some dividends:
 * 82625484914982912 / 2023346444509052928: gives 4096, should be zero
 * 18317463604061229328 / 2023346444509052928: gives 4109, should be 9
 * 12097415865295708879 / 4545815675034402816: gives 130, should be 2
 * 18195490362097456014 / 6999635335417036800: gives 10, should be 2

The expanded version is very large.  Perhaps it should only turn on at
-O2 and always turn off at -Os?

The speed increase is quite impressive - I'm seeing between 2.7 and
20x faster on a Cortex-A9 for things like x / 3.

-- Michael


divide 64-bit by constant for 32-bit target machines

2012-04-20 Thread Dinar Temirbulatov
Hi,
Here is the patch that adds support for divide 64-bit by constant for
32-bit target machines, this patch was tested on arm-7a with no new
regressions, also I am not sure on how to avoid for example i686
targets since div operation there is fast compared to over targets and
it showed better performance with  libc/sysdeps/wordsize-32/divdi3.c
__udivdi3 vs my implementation on the compiler side, it is not clear
for me by looking at the udiv_cost[speed][SImode] value.
  thanks, Dinar.


11.patch
Description: Binary data