Re: [Qemu-devel] Possible ppc comparision optimisation
Il 08/05/2013 00:56, Torbjorn Granlund ha scritto: The current ppc gen_op_cmp generates a long sequence of instructions, using a plain series of three disjoint compares. It is possible to compute the 3 result bits more cleverly. Below is a possible replacement gen_op_cmp. (It is tested by booting GNU/Linux ppx64, but not much more than that.) Surely this should be faster than the old code? OK, it is less readable, but cmp is pretty critical and should be made fast. Should one truncate things using tcg_gen_trunc_tl_i32 and do the add, xori, addi as i32 variants? (Why?) I think that would be faster on 32-bit hosts, truncs are cheap. There could be a disadvantage of this compared to the old code, since this has a chained algebraic dependency, while the old code's many instructions might have been more independent. What about these alternatives: setcond LT, t0, arg0, arg1 setcond EQ, t1, arg0, arg1 trunc s0, t0 trunc s1, t1 shli s0, s0, 1; s0 = (arg0 arg1) ? 2 : 0 subi s1, s1, 2; s1 = (arg0 != arg1) ? -2 : -1 subs0, s0, s1 ; 4 == 1 2 shli s0, s0, 1; 8 == 2 4 === setcond LT, t0, arg0, arg1 setcond NE, t1, arg0, arg1 trunc s0, t0 trunc s1, t1 add s0, s0, s1 ; 2 == 0 1 movis1, 1 add s0, s0, s1 ; 3 == 1 2 shl s1, s1, s0 ; 8 == 2 4 Paolo static inline void gen_op_cmp(TCGv arg0, TCGv arg1, int s, int crf) { TCGv t0 = tcg_temp_new(); TCGv t1 = tcg_temp_new(); TCGv_i32 s0 = tcg_temp_new_i32(); tcg_gen_trunc_tl_i32(cpu_crf[crf], cpu_so); tcg_gen_setcond_tl((s ? TCG_COND_LE: TCG_COND_LEU), t0, arg0, arg1); tcg_gen_setcond_tl((s ? TCG_COND_LT: TCG_COND_LTU), t1, arg0, arg1); tcg_gen_add_tl(t0, t0, t1); tcg_gen_xori_tl(t0, t0, 1); tcg_gen_addi_tl(t0, t0, 1); tcg_gen_trunc_tl_i32(s0, t0); tcg_gen_shli_i32(s0, s0, 1); tcg_gen_or_i32(cpu_crf[crf], cpu_crf[crf], s0); tcg_temp_free(t0); tcg_temp_free(t1); tcg_temp_free_i32(s0); }
Re: [Qemu-devel] Possible ppc comparision optimisation
Paolo Bonzini pbonz...@redhat.com writes: I think that would be faster on 32-bit hosts, truncs are cheap. And slower perhaps on 64-bit hosts, at least for operations where additional explicit trunctation will be needed (such as before comparisions and after right shifts). There could be a disadvantage of this compared to the old code, since this has a chained algebraic dependency, while the old code's many instructions might have been more independent. What about these alternatives: setcond LT, t0, arg0, arg1 setcond EQ, t1, arg0, arg1 trunc s0, t0 trunc s1, t1 shli s0, s0, 1; s0 = (arg0 arg1) ? 2 : 0 subi s1, s1, 2; s1 = (arg0 != arg1) ? -2 : -1 subs0, s0, s1 ; 4 == 1 2 shli s0, s0, 1; 8 == 2 4 === setcond LT, t0, arg0, arg1 setcond NE, t1, arg0, arg1 trunc s0, t0 trunc s1, t1 add s0, s0, s1 ; 2 == 0 1 movis1, 1 add s0, s0, s1 ; 3 == 1 2 shl s1, s1, s0 ; 8 == 2 4 Surely there are many alternative forms. Is your aim to add micro-parallelism? (Your sequences look a bit curious. Did you use a super-optimiser?) -- Torbjörn
Re: [Qemu-devel] Possible ppc comparision optimisation
Il 08/05/2013 17:44, Torbjorn Granlund ha scritto: Paolo Bonzini pbonz...@redhat.com writes: I think that would be faster on 32-bit hosts, truncs are cheap. And slower perhaps on 64-bit hosts, at least for operations where additional explicit trunctation will be needed (such as before comparisions and after right shifts). There could be a disadvantage of this compared to the old code, since this has a chained algebraic dependency, while the old code's many instructions might have been more independent. What about these alternatives: setcond LT, t0, arg0, arg1 setcond EQ, t1, arg0, arg1 trunc s0, t0 trunc s1, t1 shli s0, s0, 1; s0 = (arg0 arg1) ? 2 : 0 subi s1, s1, 2; s1 = (arg0 != arg1) ? -2 : -1 subs0, s0, s1 ; 4 == 1 2 shli s0, s0, 1; 8 == 2 4 === setcond LT, t0, arg0, arg1 setcond NE, t1, arg0, arg1 trunc s0, t0 trunc s1, t1 add s0, s0, s1 ; 2 == 0 1 movis1, 1 add s0, s0, s1 ; 3 == 1 2 shl s1, s1, s0 ; 8 == 2 4 Surely there are many alternative forms. Is your aim to add micro-parallelism? Yes, I think in this respect I think the first one is better. The second could be three instructions on machines that have a set-nth-bit instruction _and_ a zero register, but I'm not sure they exist... (Your sequences look a bit curious. Did you use a super-optimiser?) No, but I am attracted to these curious sequences from my previous life working on compilers. :) I know your superoptimizer and, in fact, we both worked on some parts of GCC (optimization of conditional branches/stores), just 20 years apart. The second is actually not too curious after you look at it for a while, it is a variant of the usual (x y) + (x = y) trick used to generate a 0/1/2 result. The first I found by trial and error based on yours; it is basically (x y) * 2 - (x == y) + 2, with some reordering to get parallelism and avoid the need for subfi-like instructions. Paolo
[Qemu-devel] Possible ppc comparision optimisation
The current ppc gen_op_cmp generates a long sequence of instructions, using a plain series of three disjoint compares. It is possible to compute the 3 result bits more cleverly. Below is a possible replacement gen_op_cmp. (It is tested by booting GNU/Linux ppx64, but not much more than that.) Surely this should be faster than the old code? OK, it is less readable, but cmp is pretty critical and should be made fast. Should one truncate things using tcg_gen_trunc_tl_i32 and do the add, xori, addi as i32 variants? (Why?) There could be a disadvantage of this compared to the old code, since this has a chained algebraic dependency, while the old code's many instructions might have been more independent. static inline void gen_op_cmp(TCGv arg0, TCGv arg1, int s, int crf) { TCGv t0 = tcg_temp_new(); TCGv t1 = tcg_temp_new(); TCGv_i32 s0 = tcg_temp_new_i32(); tcg_gen_trunc_tl_i32(cpu_crf[crf], cpu_so); tcg_gen_setcond_tl((s ? TCG_COND_LE: TCG_COND_LEU), t0, arg0, arg1); tcg_gen_setcond_tl((s ? TCG_COND_LT: TCG_COND_LTU), t1, arg0, arg1); tcg_gen_add_tl(t0, t0, t1); tcg_gen_xori_tl(t0, t0, 1); tcg_gen_addi_tl(t0, t0, 1); tcg_gen_trunc_tl_i32(s0, t0); tcg_gen_shli_i32(s0, s0, 1); tcg_gen_or_i32(cpu_crf[crf], cpu_crf[crf], s0); tcg_temp_free(t0); tcg_temp_free(t1); tcg_temp_free_i32(s0); } -- Torbjörn