Re: divide 64-bit by constant for 32-bit target machines
On 14/06/12 19:46, Dinar Temirbulatov wrote: Hi, OK for trunk? thanks, Dinar. I'm still not comfortable about the code bloat that this is likely to incurr at -O2. R. On Tue, Jun 12, 2012 at 11:00 AM, Paolo Bonzini bonz...@gnu.org wrote: Il 12/06/2012 08:52, Dinar Temirbulatov ha scritto: is safe? That is, that the underflows cannot produce a wrong result? [snip] Thanks very much! Paolo= ChangeLog.txt 2012-06-14 Dinar Temirbulatov dtemirbula...@gmail.com Alexey Kravets mr.kayr...@gmail.com Paolo Bonzini bonz...@gnu.org * config/arm/arm.c (arm_rtx_costs_1): Add cost estimate for the integer double-word division operation. * config/mips/mips.c (mips_rtx_costs): Extend cost estimate for the integer double-word division operation for 32-bit targets. * gcc/expmed.c (expand_mult_highpart_optab): Allow to generate the higher multipilcation product for unsigned double-word integers using 32-bit wide registers. 30.patch N¬n‡r¥ªíÂ)emçhÂyhi×¢w^™©Ý
Re: divide 64-bit by constant for 32-bit target machines
Hi, Richard, How about if I add and utilize umul_highpart_di to the libgcc instead of expanding multiplication for the high part directly, or add my own function with with pre-shift, post-shift, and 64-bit constant and 64-bit operand as function parameters for division for less than -O3? thanks, Dinar. On Fri, Jun 15, 2012 at 12:12 PM, Richard Earnshaw rearn...@arm.com wrote: On 14/06/12 19:46, Dinar Temirbulatov wrote: Hi, OK for trunk? thanks, Dinar. I'm still not comfortable about the code bloat that this is likely to incurr at -O2. R. On Tue, Jun 12, 2012 at 11:00 AM, Paolo Bonzini bonz...@gnu.org wrote: Il 12/06/2012 08:52, Dinar Temirbulatov ha scritto: is safe? That is, that the underflows cannot produce a wrong result? [snip] Thanks very much! Paolo= ChangeLog.txt 2012-06-14 Dinar Temirbulatov dtemirbula...@gmail.com Alexey Kravets mr.kayr...@gmail.com Paolo Bonzini bonz...@gnu.org * config/arm/arm.c (arm_rtx_costs_1): Add cost estimate for the integer double-word division operation. * config/mips/mips.c (mips_rtx_costs): Extend cost estimate for the integer double-word division operation for 32-bit targets. * gcc/expmed.c (expand_mult_highpart_optab): Allow to generate the higher multipilcation product for unsigned double-word integers using 32-bit wide registers. 30.patch N ¬n‡r¥ªíÂ)emçhÂyhi× ¢w^™©Ý
Re: divide 64-bit by constant for 32-bit target machines
Hi, OK for trunk? thanks, Dinar. On Tue, Jun 12, 2012 at 11:00 AM, Paolo Bonzini bonz...@gnu.org wrote: Il 12/06/2012 08:52, Dinar Temirbulatov ha scritto: is safe? That is, that the underflows cannot produce a wrong result? [snip] Thanks very much! Paolo 2012-06-14 Dinar Temirbulatov dtemirbula...@gmail.com Alexey Kravets mr.kayr...@gmail.com Paolo Bonzini bonz...@gnu.org * config/arm/arm.c (arm_rtx_costs_1): Add cost estimate for the integer double-word division operation. * config/mips/mips.c (mips_rtx_costs): Extend cost estimate for the integer double-word division operation for 32-bit targets. * gcc/expmed.c (expand_mult_highpart_optab): Allow to generate the higher multipilcation product for unsigned double-word integers using 32-bit wide registers. 30.patch Description: Binary data
Re: divide 64-bit by constant for 32-bit target machines
Hi, Paolo. Here is the new version of patch. I have tested this version with gcc testsuite only on i686 without new regressions, for now. Mips and arm tests are in progress. One strange thing I noticed: No need for this gen_reg_rtx, either, by passing a NULL_RTX target below. + carry_result = expand_shift (LSHIFT_EXPR, mode, carry, BITS_PER_WORD, carry_result, 1); + + /* Adding 0x1 as carry here if required. */ Oops, a remnant of 32-bit specific code. that I have to add convert_to_mode () to DImode after emit_store_flag_force (), since emit_store_flag_force () returns carry in SImode and without convert_to_mode () call compiler fails with this error: Breakpoint 2, simplify_subreg (outermode=SImode, op=0x756cdf20, innermode=DImode, byte=0) at ../../gcc-20120418-1/gcc/simplify-rtx.c:5423 5423 gcc_assert (GET_MODE (op) == innermode (gdb) bt #0 simplify_subreg (outermode=SImode, op=0x756cdf20, innermode=DImode, byte=0) at ../../gcc-20120418-1/gcc/simplify-rtx.c:5423 #1 0x00aea223 in simplify_gen_subreg (outermode=SImode, op=0x756cdf20, innermode=DImode, byte=0) at ../../gcc-20120418-1/gcc/simplify-rtx.c:5763 #2 0x00733c99 in operand_subword (op=0x756cdf20, offset=0, validate_address=1, mode=DImode) at ../../gcc-20120418-1/gcc/emit-rtl.c:1427 #3 0x00733cc6 in operand_subword_force (op=0x756cdf20, offset=0, mode=DImode) at ../../gcc-20120418-1/gcc/emit-rtl.c:1440 #4 0x00a016b3 in expand_binop (mode=DImode, binoptab=0x195f580, op0=0x756cdf20, op1=0x7583d670, target=0x756cdfa0, unsignedp=1, methods=OPTAB_DIRECT) at ../../gcc-20120418-1/gcc/optabs.c:1779 #5 0x007525af in expand_shift_1 (code=LSHIFT_EXPR, mode=DImode, shifted=0x756cdf20, amount=0x7583d670, target=0x0, unsignedp=1) at ../../gcc-20120418-1/gcc/expmed.c:2273 #6 0x007526b6 in expand_shift (code=LSHIFT_EXPR, mode=DImode, shifted=0x756cdf20, amount=32, target=0x0, unsignedp=1) at ../../gcc-20120418-1/gcc/expmed.c:2318 #7 0x007563e6 in expand_mult_highpart_optab (mode=DImode, op0=0x756cdcc0, op1=0x756b1e00, target=0x0, unsignedp=1, max_cost=188) at ../../gcc-20120418-1/gcc/expmed.c:3581 #8 0x00756747 in expand_mult_highpart (mode=DImode, op0=0x756cdcc0, op1=0x756b1e00, target=0x0, unsignedp=1, max_cost=188) at ../../gcc-20120418-1/gcc/expmed.c:3654 thanks, Dinar. 30.patch Description: Binary data
Re: divide 64-bit by constant for 32-bit target machines
Hi, Here is new version of patch based up on Paolo review, again tested on arm-7l, mips-32r2 (74k), i686 without new regressions. thanks, Dinar. On Sat, May 26, 2012 at 4:45 PM, Paolo Bonzini bonz...@gnu.org wrote: Il 26/05/2012 14:35, Paolo Bonzini ha scritto: /* We have to return z2 + ((u0 + u1) GET_MODE_BITSIZE (word_mode)). u0 + u1 are the upper two words of the three-word intermediate result and they could have up to 2 * GET_MODE_BITSIZE (word_mode) + 1 bits of precision. We compute the extra bit by checking for carry, and add 1 GET_MODE_BITSIZE (word_mode) to z2 if there is carry. */ Oops, GET_MODE_BITSIZE (word_mode) is more concisely BITS_PER_WORD. + tmp = expand_binop (mode, add_optab, u0, u1, tmp, 1, OPTAB_LIB_WIDEN); + if (!tmp) + return 0; /* We have to return z2 + (tmp 32). We need + /* Checking for overflow. */ This is not overflow, it's carry (see above). + c = gen_reg_rtx (mode); + c1 = gen_reg_rtx (mode); + cres = gen_reg_rtx (mode); + + emit_store_flag_force (c, GT, u0, tmp, mode, 1, 1); + emit_store_flag_force (c1, GT, u1, tmp, mode, 1, 1); + result = expand_binop (mode, ior_optab, c, c1, cres, 1, OPTAB_LIB_WIDEN); + if (!result) + return 0; + + ccst = gen_reg_rtx (mode); + ccst = expand_shift (LSHIFT_EXPR, mode, cres, 32, ccst, 1); This 32 should be GET_MODE_BITSIZE (word_mode). Here, too. Paolo 2012-06-07 Dinar Temirbulatov dtemirbula...@gmail.com Alexey Kravets mr.kayr...@gmail.com * config/arm/arm.c (arm_rtx_costs_1): Add cost estimate for the integer double-word division operation. * config/mips/mips.c (mips_rtx_costs): Extend cost estimate for the integer double-word division operation for 32-bit targets. * gcc/expmed.c (expand_mult_highpart_optab): Allow to generate the higher multipilcation product for unsigned double-word integers using 32-bit wide registers. 28.patch Description: Binary data
Re: divide 64-bit by constant for 32-bit target machines
Il 07/06/2012 12:21, Dinar Temirbulatov ha scritto: oh, I found typo in comment in the end of patch. fixed. Great improvement, thanks! Unfortunately we're not there yet, but much closer! I could understand the new code much better so I suggest some more improvements below, both to the comments and to the code generation. diff --git a/gcc/config/arm/arm.c b/gcc/config/arm/arm.c index 8a86227..0f8120f 100644 --- a/gcc/config/arm/arm.c +++ b/gcc/config/arm/arm.c @@ -7130,6 +7130,8 @@ arm_rtx_costs_1 (rtx x, enum rtx_code outer, int* total, bool speed) *total = COSTS_N_INSNS (2); else if (TARGET_HARD_FLOAT mode == DFmode !TARGET_VFP_SINGLE) *total = COSTS_N_INSNS (4); + else if (mode == DImode) +*total = COSTS_N_INSNS (50); else *total = COSTS_N_INSNS (20); return false; diff --git a/gcc/config/mips/mips.c b/gcc/config/mips/mips.c index 5bcb7a8..57bb4cc 100644 --- a/gcc/config/mips/mips.c +++ b/gcc/config/mips/mips.c @@ -3879,8 +3879,13 @@ mips_rtx_costs (rtx x, int code, int outer_code, int opno ATTRIBUTE_UNUSED, } *total = COSTS_N_INSNS (mips_idiv_insns ()); } - else if (mode == DImode) + else if (mode == DImode) { + if (!TARGET_64BIT) +/* divide double integer libcall is expensive. */ +*total = COSTS_N_INSNS (200); + else *total = mips_cost-int_div_di; + } else *total = mips_cost-int_div_si; return false; diff --git a/gcc/expmed.c b/gcc/expmed.c index 98f7c09..bb4d7cd 100644 --- a/gcc/expmed.c +++ b/gcc/expmed.c @@ -3539,6 +3539,84 @@ expand_mult_highpart_optab (enum machine_mode mode, rtx op0, rtx op1, } } + if (unsignedp + size - 1 BITS_PER_WORD + (!optimize_size (optimize1)) Coding style: (!optimize_size optimize 1). + (4 * mul_cost[speed][mode] + 4 * add_cost[speed][mode] + + shift_cost[speed][mode][31] max_cost)) Thanks, this is now much cleaner and I could see other improvements. This should be 3 * mul_widen_cost[speed][mode] + mul_highpart_cost[speed][mode] + 4 * add_cost[speed][mode] + add_cost[speed][word_mode] That is because there is no shift really: a shift by 32 is simply moving the operand to the higher word, and an add of that value will ignore the lower word. Hence, summing carry_result is cheaper: that is add_cost[speed][word_mode]. On the other hand you also have to consider the comparison emitted by emit_store_flag_force, which will usually cost the same as an addition. That is the fourth add_cost[speed][mode]. +{ + rtx x1, x0, y1, y0, z2, z0, tmp, u0, u0tmp, u1, carry, carry_result, result; + /* Extracting the higher part of the 64-bit multiplier. */ + x1 = gen_highpart (word_mode, op0); + x1 = force_reg (word_mode, x1); + + /* Extracting the lower part of the 64-bit multiplier. */ + x0 = gen_lowpart (word_mode, op0); + x0 = force_reg (word_mode, x0); + + /* Splitting the 64-bit constant for the higher and the lower parts. */ + y0 = gen_lowpart (word_mode, op1); + y0 = force_reg (word_mode, y0); + y1 = gen_highpart_mode (word_mode, mode, op1); + + z2 = gen_reg_rtx (mode); + u0 = gen_reg_rtx (mode); You do not need the gen_reg_rtx; just pass a NULL_RTX target to expand_widening_mult. + z2 = expand_widening_mult (mode, x1, y1, z2, 1, umul_widen_optab); + Remove the empty line. Also, let's rename the values to make clear where is the multiplication of what: z2 - u11 u0 - u01 z0 - u00 u1 - u10 + u0 = expand_widening_mult (mode, x0, y1, u0, 1, umul_widen_optab); + + z0 = gen_reg_rtx (mode); + u1 = gen_reg_rtx (mode); gen_reg_rtx is not needed here too. + z0 = expand_widening_mult (mode, x0, y0, z0, 1, umul_widen_optab); + And neither is this blank line. + u1 = expand_widening_mult (mode, x1, y0, u1, 1, umul_widen_optab); + /* Compute the middle word of the three-word intermediate result. */ ^^ Oops, this is the low word, not the middle. But let's improve the comment to explain the algorithm. /* u00, u01, u10, u11 form a three-word value with the result in the top word, so we want to return this: ((u11 2*BITS_PER_WORD) + (u01 + u10 BITS_PER_WORD) + u00) 3 * BITS_PER_WORD We then rewrite it this way: (u11 + ((u01 + u10 + (u00 BITS_PER_WORD)) BITS_PER_WORD) BITS_PER_WORD where the shifts are realized with gen_highpart and a conversion back to the wider mode. */ + u0tmp = gen_highpart (word_mode, z0); + u0tmp = force_reg (word_mode, u0tmp); + u0tmp = convert_to_mode (mode, u0tmp, 1); u0tmp - u00h Put the expand_inc (u01, u00h); before the comment. The formula is now above so we can say more
Re: divide 64-bit by constant for 32-bit target machines
diff --git a/gcc/config/arm/arm.c b/gcc/config/arm/arm.c index 2cecf45..9d6983b 100644 --- a/gcc/config/arm/arm.c +++ b/gcc/config/arm/arm.c @@ -7131,6 +7131,8 @@ arm_rtx_costs_1 (rtx x, enum rtx_code outer, int* total, bool speed) *total = COSTS_N_INSNS (2); else if (TARGET_HARD_FLOAT mode == DFmode !TARGET_VFP_SINGLE) *total = COSTS_N_INSNS (4); + else if (mode == DImode) +*total = COSTS_N_INSNS (50); else *total = COSTS_N_INSNS (20); return false; diff --git a/gcc/config/mips/mips.c b/gcc/config/mips/mips.c index d48a465..b5627c2 100644 --- a/gcc/config/mips/mips.c +++ b/gcc/config/mips/mips.c @@ -3846,7 +3846,13 @@ mips_rtx_costs (rtx x, int code, int outer_code, int opno ATTRIBUTE_UNUSED, *total = COSTS_N_INSNS (mips_idiv_insns ()); } else if (mode == DImode) -*total = mips_cost-int_div_di; +{ + if (!TARGET_64BIT) +/* Divide double integer library call is expensive. */ +*total = COSTS_N_INSNS (200); + else +*total = mips_cost-int_div_di; +} else *total = mips_cost-int_div_si; return false; diff --git a/gcc/expmed.c b/gcc/expmed.c index aa24fbf..5f4c921 100644 --- a/gcc/expmed.c +++ b/gcc/expmed.c @@ -3523,6 +3523,105 @@ expand_mult_highpart_optab (enum machine_mode mode, rtx op0, rtx op1, } } + if (unsignedp (!optimize_size (optimize1)) + (size - 1 BITS_PER_WORD + BITS_PER_WORD == 32 GET_MODE_BITSIZE (mode) == 2*BITS_PER_WORD) These references to 32-bits are still wrong (and unnecessary, just remove them). + (4 * mul_cost[speed][mode] + 4 * add_cost[speed][mode] + + shift_cost[speed][mode][31] max_cost)) +{ + unsigned HOST_WIDE_INT d; + rtx x1, x0, y1, y0, z2, z0, tmp, u0, u0tmp, u1, c, c1, ccst, cres, result; + + d = (INTVAL (op1) GET_MODE_MASK (mode)); This could be a CONST_DOUBLE. But you don't need d, because you can... + /* Extracting the higher part of the 64-bit multiplier. */ + x1 = gen_highpart (word_mode, op0); + x1 = force_reg (word_mode, x1); + + /* Extracting the lower part of the 64-bit multiplier. */ + x0 = gen_lowpart (word_mode, op0); + x0 = force_reg (word_mode, x0); + + /* Splitting the 64-bit constant for the higher and the lower parts. */ + y0 = gen_int_mode(d UINT32_MAX, word_mode); + y1 = gen_int_mode(d 32, word_mode); ... use gen_lowpart and gen_highpart directly on op1. + + z2 = gen_reg_rtx (mode); + u0 = gen_reg_rtx (mode); + + /* Unsigned multiplication of the higher multiplier part + and the higher constant part. */ + z2 = expand_widening_mult (mode, x1, y1, z2, 1, umul_widen_optab); + /* Unsigned multiplication of the lower multiplier part + and the higher constant part. */ + u0 = expand_widening_mult (mode, x0, y1, u0, 1, umul_widen_optab); + + z0 = gen_reg_rtx (mode); + u1 = gen_reg_rtx (mode); + + /* Unsigned multiplication of the lower multiplier part + and the lower constant part. */ + z0 = expand_widening_mult (mode, x0, y0, z0, 1, umul_widen_optab); + + /* Unsigned multiplication of the higher multiplier part + the lower constant part. */ + u1 = expand_widening_mult (mode, x1, y0, u1, 1, umul_widen_optab); Up to here the comments are not necessary. + /* Getting the higher part of multiplication between the lower multiplier + part and the lower constant part, the lower part is not interesting + for the final result. */ + u0tmp = gen_highpart (word_mode, z0); + u0tmp = force_reg (word_mode, u0tmp); + u0tmp = convert_to_mode (mode, u0tmp, 1); + + /* Adding the higher part of multiplication between the lower multiplier + part and the lower constant part to the result of multiplication between + the lower multiplier part and the higher constant part. Please note, + that we couldn't get overflow here since in the worst case + (0x*0x)+0x we get 0xL. */ The command can simply be compute the middle word of the three-word intermediate result. Also it's not overflow, it's carry. + expand_inc (u0, u0tmp); + tmp = gen_reg_rtx (mode); + + /* Adding multiplication between the lower multiplier part and the higher + constant part with the higher part of multiplication between the lower + multiplier part and the lower constant part to the result of multiplication + between the higher multiplier part and the lower constant part. */ Here you have to explain: /* We have to return z2 + ((u0 + u1) GET_MODE_BITSIZE (word_mode)). u0 + u1 are the upper two words of the three-word intermediate result and they could
Re: divide 64-bit by constant for 32-bit target machines
Il 25/05/2012 12:20, Dinar Temirbulatov ha scritto: + emit_store_flag_force (c, GT, u0, tmp, mode, 1, 1); + emit_store_flag_force (c1, GT, u1, tmp, mode, 1, 1); + result = expand_binop (mode, ior_optab, c, c1, cres, 1, OPTAB_LIB_WIDEN); + if (!result) + return 0; Ah, you don't need the or. u0 tmp is already giving the overflow. Paolo
Re: divide 64-bit by constant for 32-bit target machines
Il 26/05/2012 14:35, Paolo Bonzini ha scritto: /* We have to return z2 + ((u0 + u1) GET_MODE_BITSIZE (word_mode)). u0 + u1 are the upper two words of the three-word intermediate result and they could have up to 2 * GET_MODE_BITSIZE (word_mode) + 1 bits of precision. We compute the extra bit by checking for carry, and add 1 GET_MODE_BITSIZE (word_mode) to z2 if there is carry. */ Oops, GET_MODE_BITSIZE (word_mode) is more concisely BITS_PER_WORD. + tmp = expand_binop (mode, add_optab, u0, u1, tmp, 1, OPTAB_LIB_WIDEN); + if (!tmp) + return 0; /* We have to return z2 + (tmp 32). We need + /* Checking for overflow. */ This is not overflow, it's carry (see above). + c = gen_reg_rtx (mode); + c1 = gen_reg_rtx (mode); + cres = gen_reg_rtx (mode); + + emit_store_flag_force (c, GT, u0, tmp, mode, 1, 1); + emit_store_flag_force (c1, GT, u1, tmp, mode, 1, 1); + result = expand_binop (mode, ior_optab, c, c1, cres, 1, OPTAB_LIB_WIDEN); + if (!result) + return 0; + + ccst = gen_reg_rtx (mode); + ccst = expand_shift (LSHIFT_EXPR, mode, cres, 32, ccst, 1); This 32 should be GET_MODE_BITSIZE (word_mode). Here, too. Paolo
Re: divide 64-bit by constant for 32-bit target machines
Hi, I have replaced expand_mult to expand_widening_mult and removed all direct references to DImode, SImode modes in the expand_mult_highpart_optab funtion. The attached patch was tested on arm-7l, mips-32r2 (74k), i686 without new regressions. Richard, do you think it is ready now? thanks, Dinar. On Tue, May 22, 2012 at 7:45 PM, Richard Henderson r...@redhat.com wrote: On 05/22/12 07:05, Dinar Temirbulatov wrote: + if ((size - 1 BITS_PER_WORD + BITS_PER_WORD == 32 mode == DImode) Do note that this patch will not go in with hard-coded SImode and DImode references. Which, really, you do not even need. GET_MODE_BITSIZE (mode) == 2*BITS_PER_WORD is what you wanted for testing for double-word-ness, and word_mode is a good substitute for SImode here. + /* Splitting the 64-bit constant for the higher and the lower parts. */ + y0 = gen_rtx_CONST_INT (DImode, dUINT32_MAX); + y1 = gen_rtx_CONST_INT (DImode, d32); Use gen_int_mode. + x1 = convert_to_mode (DImode, x1, 1); + x0 = convert_to_mode (DImode, x0, 1); + + /* Splitting the 64-bit constant for the higher and the lower parts. */ + y0 = gen_rtx_CONST_INT (DImode, dUINT32_MAX); + y1 = gen_rtx_CONST_INT (DImode, d32); + + z2 = gen_reg_rtx (DImode); + u0 = gen_reg_rtx (DImode); + + /* Unsigned multiplication of the higher multiplier part + and the higher constant part. */ + z2 = expand_mult(DImode, x1, y1, z2, 1); + /* Unsigned multiplication of the lower multiplier part + and the higher constant part. */ + u0 = expand_mult(DImode, x0, y1, u0, 1); I'm fairly sure you really want to be using expand_widening_mult here, rather than using convert_to_mode first. While combine may be able to re-generate a widening multiply out of this sequence, there's no sense making it work too hard. r~ 2012-05-25 Dinar Temirbulatov dtemirbula...@gmail.com Alexey Kravets mr.kayr...@gmail.com * config/arm/arm.c (arm_rtx_costs_1): Add cost estimate for the integer double-word division operation. * config/mips/mips.c (mips_rtx_costs): Extend cost estimate for the integer double-word division operation for 32-bit targets. * gcc/expmed.c (expand_mult_highpart_optab): Allow to generate the higher multipilcation product for unsigned double-word integers using 32-bit wide registers. 22.patch Description: Binary data
Re: divide 64-bit by constant for 32-bit target machines
Hi, Here is the new version of the patch I have fixed two errors in the previous version, on mips32 the compiler could not expand division and terminated with ICE, this change fixed the issue: /* Extrating the higher part of the sum */ tmp = gen_highpart (SImode, tmp); tmp = force_reg (SImode, tmp); + tmp = force_reg (SImode, tmp); + tmp = convert_to_mode (DImode, tmp, 1); and another error on i686 and mips32r2: I found that overflow check in multiplication was not working. For example 0xfe34b4190a392b23/257 produced wrong result. Following change resolved the issue: -emit_store_flag_force (c, GT, u0, tmp, DImode, 1, -1); +emit_store_flag_force (c, GT, u0, tmp, DImode, 1, 1); Tested this new version of the patch on -none-linux-gnu with arm-7l, mips-32r2 (74k), i686 without new regressions. On Thu, May 3, 2012 at 5:40 PM, Richard Earnshaw rearn...@arm.com wrote: On 03/05/12 11:27, Dinar Temirbulatov wrote: This clearly needs more work. Comments: Need to end with a period and two spaces. Binary Operators: Need to be surrounded with white space. sorry for this, hope I resolved such issues with the new version. As Andrew Haley has already said, some documentation of the algorithm is needed. General documentation for the issue could be found here gmplib.org/~tege/divcnst-pldi94.pdf. Multiplication to get higher 128-bit was developed by me and Alexey Kravets, I attached C version of the algorithm. Why is this restricted to BITS_PER_WORD == 32? I am checking here that we are generating code for 32-bit target with 32-bit wide general propose registers, and with 64-bit wide registers usually there is an instruction to get the higher 64-bit of 64x64-bit multiplication cheaply. Costs: This code clearly needs to understand the relative cost of doing division this way; there is a limit to the amount of inline expansion that we should tolerate, particularly at -O2 and it's not clear that this will be much better than a library call if we don't have a widening multiply operation (as is true for older ARM chips, and presumably some other CPUs). In essence, you need to work out the cost of a divide instruction (just as rtx costs for this) and the approximate cost of the expanded algorithm. Added cost calculation. Another issue that worries me is the number of live DImode values; on machines with few registers this could cause excessive spilling to start happening. The cost calculation suppose to take this into account. I also wonder whether it would be beneficial to generate custom functions for division by specific constants (using this algorithm) and then call those functions rather than the standard lib-call. On ELF systems the functions can marked as hidden and put into common sections so that we only end up with one instance of each function in a program. yes, I think it is a good approach, I could redo my patch with such intrinsic function implementation with pre-shift, post-shift, and 64-bit constant as function parameters. Finally, do you have a copyright assignment with the FSF? We can't use this code without one. yes, I do have a copyright assignment with the FSF. Also I am in process of implementing signed integer 64-bit division by constant. thanks, Dinar. 18.patch Description: Binary data unsigned long long mul(unsigned long long a, unsigned long long b) { unsigned long long x1=a32; unsigned long long x0=a0x; unsigned long long y1=b32; unsigned long long y0=b0x; unsigned long long z2, z0, tmp, u0, u1; unsigned char c=0; z2=x1*y1; z0=x0*y0; u0=x0*y1+(z032); u1=x1*y0; tmp = (u0+u1); c = (tmp u0) || (tmp u1); return z2+(tmp32)+(((unsigned long long)c)32); }
Re: divide 64-bit by constant for 32-bit target machines
On 05/22/12 07:05, Dinar Temirbulatov wrote: + if ((size - 1 BITS_PER_WORD +BITS_PER_WORD == 32 mode == DImode) Do note that this patch will not go in with hard-coded SImode and DImode references. Which, really, you do not even need. GET_MODE_BITSIZE (mode) == 2*BITS_PER_WORD is what you wanted for testing for double-word-ness, and word_mode is a good substitute for SImode here. + /* Splitting the 64-bit constant for the higher and the lower parts. */ + y0 = gen_rtx_CONST_INT (DImode, dUINT32_MAX); + y1 = gen_rtx_CONST_INT (DImode, d32); Use gen_int_mode. + x1 = convert_to_mode (DImode, x1, 1); + x0 = convert_to_mode (DImode, x0, 1); + + /* Splitting the 64-bit constant for the higher and the lower parts. */ + y0 = gen_rtx_CONST_INT (DImode, dUINT32_MAX); + y1 = gen_rtx_CONST_INT (DImode, d32); + + z2 = gen_reg_rtx (DImode); + u0 = gen_reg_rtx (DImode); + + /* Unsigned multiplication of the higher multiplier part + and the higher constant part. */ + z2 = expand_mult(DImode, x1, y1, z2, 1); + /* Unsigned multiplication of the lower multiplier part + and the higher constant part. */ + u0 = expand_mult(DImode, x0, y1, u0, 1); I'm fairly sure you really want to be using expand_widening_mult here, rather than using convert_to_mode first. While combine may be able to re-generate a widening multiply out of this sequence, there's no sense making it work too hard. r~
Re: divide 64-bit by constant for 32-bit target machines
Hi, Here is updated version of patch. I added comments describing the algorithm. Hi Dinar. I'm afraid it gives the wrong results for some dividends * 82625484914982912 / 2023346444509052928: gives 4096, should be zero * 18317463604061229328 / 2023346444509052928: gives 4109, should be 9 * 12097415865295708879 / 4545815675034402816: gives 130, should be 2 * 18195490362097456014 / 6999635335417036800: gives 10, should be 2 Oh, I have used signed multiplication instead of unsigned and that was the reason of errors above, fixed that typo. Tested on arm-7l with no new regressions. Ok for trunk? thanks, Dinar. 14.patch Description: Binary data
Re: divide 64-bit by constant for 32-bit target machines
On 03/05/12 11:27, Dinar Temirbulatov wrote: Hi, Here is updated version of patch. I added comments describing the algorithm. Hi Dinar. I'm afraid it gives the wrong results for some dividends * 82625484914982912 / 2023346444509052928: gives 4096, should be zero * 18317463604061229328 / 2023346444509052928: gives 4109, should be 9 * 12097415865295708879 / 4545815675034402816: gives 130, should be 2 * 18195490362097456014 / 6999635335417036800: gives 10, should be 2 Oh, I have used signed multiplication instead of unsigned and that was the reason of errors above, fixed that typo. Tested on arm-7l with no new regressions. Ok for trunk? This clearly needs more work. Comments: Need to end with a period and two spaces. Binary Operators: Need to be surrounded with white space. As Andrew Haley has already said, some documentation of the algorithm is needed. Why is this restricted to BITS_PER_WORD == 32? Costs: This code clearly needs to understand the relative cost of doing division this way; there is a limit to the amount of inline expansion that we should tolerate, particularly at -O2 and it's not clear that this will be much better than a library call if we don't have a widening multiply operation (as is true for older ARM chips, and presumably some other CPUs). In essence, you need to work out the cost of a divide instruction (just as rtx costs for this) and the approximate cost of the expanded algorithm. Another issue that worries me is the number of live DImode values; on machines with few registers this could cause excessive spilling to start happening. I also wonder whether it would be beneficial to generate custom functions for division by specific constants (using this algorithm) and then call those functions rather than the standard lib-call. On ELF systems the functions can marked as hidden and put into common sections so that we only end up with one instance of each function in a program. + /* Checking for owerflow, please not that we couldn't user carry-flag + here before the reload pass . +cres = (tmp u0) || (tmp u1); */ Generic code can't assume the presence of a carry flag either before or after reload, so the latter part of the comment is somewhat meaningless. Also spelling error in comment. Finally, do you have a copyright assignment with the FSF? We can't use this code without one. R.
Re: divide 64-bit by constant for 32-bit target machines
On 04/20/2012 01:57 PM, Dinar Temirbulatov wrote: Here is the patch that adds support for divide 64-bit by constant for 32-bit target machines, this patch was tested on arm-7a with no new regressions, also I am not sure on how to avoid for example i686 targets since div operation there is fast compared to over targets and it showed better performance with libc/sysdeps/wordsize-32/divdi3.c __udivdi3 vs my implementation on the compiler side, it is not clear for me by looking at the udiv_cost[speed][SImode] value. I can't approve this patch. However, I do think that a comment which explains the algorithm is needed. Andrew.
Re: divide 64-bit by constant for 32-bit target machines
On 21 April 2012 00:57, Dinar Temirbulatov dtemirbula...@gmail.com wrote: Hi, Here is the patch that adds support for divide 64-bit by constant for 32-bit target machines, this patch was tested on arm-7a with no new regressions, also I am not sure on how to avoid for example i686 targets since div operation there is fast compared to over targets and it showed better performance with libc/sysdeps/wordsize-32/divdi3.c __udivdi3 vs my implementation on the compiler side, it is not clear for me by looking at the udiv_cost[speed][SImode] value. Hi Dinar. I'm afraid it gives the wrong results for some dividends: * 82625484914982912 / 2023346444509052928: gives 4096, should be zero * 18317463604061229328 / 2023346444509052928: gives 4109, should be 9 * 12097415865295708879 / 4545815675034402816: gives 130, should be 2 * 18195490362097456014 / 6999635335417036800: gives 10, should be 2 The expanded version is very large. Perhaps it should only turn on at -O2 and always turn off at -Os? The speed increase is quite impressive - I'm seeing between 2.7 and 20x faster on a Cortex-A9 for things like x / 3. -- Michael
divide 64-bit by constant for 32-bit target machines
Hi, Here is the patch that adds support for divide 64-bit by constant for 32-bit target machines, this patch was tested on arm-7a with no new regressions, also I am not sure on how to avoid for example i686 targets since div operation there is fast compared to over targets and it showed better performance with libc/sysdeps/wordsize-32/divdi3.c __udivdi3 vs my implementation on the compiler side, it is not clear for me by looking at the udiv_cost[speed][SImode] value. thanks, Dinar. 11.patch Description: Binary data