Re: [Qemu-devel] Possible ppc comparison optimisation

2013-05-08 Thread Paolo Bonzini
On 08/05/2013 00:56, Torbjorn Granlund wrote:
 The current ppc gen_op_cmp generates a long sequence of instructions,
 using a plain series of three disjoint compares.
 
 It is possible to compute the 3 result bits more cleverly.  Below is a
 possible replacement gen_op_cmp.  (It is tested by booting GNU/Linux
ppc64, but not much more than that.)
 
 Surely this should be faster than the old code?  OK, it is less
 readable, but cmp is pretty critical and should be made fast.
 
 Should one truncate things using tcg_gen_trunc_tl_i32 and do the add,
 xori, addi as i32 variants?  (Why?)

I think that would be faster on 32-bit hosts, truncs are cheap.

 There could be a disadvantage of this compared to the old code, since
 this has a chained algebraic dependency, while the old code's many
 instructions might have been more independent.

What about these alternatives:

setcond LT, t0, arg0, arg1
setcond EQ, t1, arg0, arg1
trunc   s0, t0
trunc   s1, t1
shli    s0, s0, 1   ; s0 = (arg0 < arg1) ? 2 : 0
subi    s1, s1, 2   ; s1 = (arg0 != arg1) ? -2 : -1
sub     s0, s0, s1  ; < -> 4, == -> 1, > -> 2
shli    s0, s0, 1   ; < -> 8, == -> 2, > -> 4

===

setcond LT, t0, arg0, arg1
setcond NE, t1, arg0, arg1
trunc   s0, t0
trunc   s1, t1
add     s0, s0, s1  ; < -> 2, == -> 0, > -> 1
movi    s1, 1
add     s0, s0, s1  ; < -> 3, == -> 1, > -> 2
shl     s1, s1, s0  ; < -> 8, == -> 2, > -> 4
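
For concreteness, the first sequence might look like this as actual
tcg_gen_* calls, keeping the shape of your gen_op_cmp (an untested
sketch; gen_op_cmp_alt and the extra s1 temporary are just names made
up for the illustration):

static inline void gen_op_cmp_alt(TCGv arg0, TCGv arg1, int s, int crf)
{
    TCGv t0 = tcg_temp_new();
    TCGv t1 = tcg_temp_new();
    TCGv_i32 s0 = tcg_temp_new_i32();
    TCGv_i32 s1 = tcg_temp_new_i32();

    tcg_gen_trunc_tl_i32(cpu_crf[crf], cpu_so);

    tcg_gen_setcond_tl((s ? TCG_COND_LT : TCG_COND_LTU), t0, arg0, arg1);
    tcg_gen_setcond_tl(TCG_COND_EQ, t1, arg0, arg1);
    tcg_gen_trunc_tl_i32(s0, t0);
    tcg_gen_trunc_tl_i32(s1, t1);
    tcg_gen_shli_i32(s0, s0, 1);     /* s0 = (arg0 < arg1) ? 2 : 0    */
    tcg_gen_subi_i32(s1, s1, 2);     /* s1 = (arg0 != arg1) ? -2 : -1 */
    tcg_gen_sub_i32(s0, s0, s1);     /* < -> 4, == -> 1, > -> 2       */
    tcg_gen_shli_i32(s0, s0, 1);     /* < -> 8, == -> 2, > -> 4       */
    tcg_gen_or_i32(cpu_crf[crf], cpu_crf[crf], s0);

    tcg_temp_free(t0);
    tcg_temp_free(t1);
    tcg_temp_free_i32(s0);
    tcg_temp_free_i32(s1);
}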

Paolo

 static inline void gen_op_cmp(TCGv arg0, TCGv arg1, int s, int crf)
 {
 TCGv t0 = tcg_temp_new();
 TCGv t1 = tcg_temp_new();
 TCGv_i32 s0 = tcg_temp_new_i32();
 
 tcg_gen_trunc_tl_i32(cpu_crf[crf], cpu_so);
 
 tcg_gen_setcond_tl((s ? TCG_COND_LE : TCG_COND_LEU), t0, arg0, arg1);
 tcg_gen_setcond_tl((s ? TCG_COND_LT : TCG_COND_LTU), t1, arg0, arg1);
 tcg_gen_add_tl(t0, t0, t1);
 tcg_gen_xori_tl(t0, t0, 1);
 tcg_gen_addi_tl(t0, t0, 1);
 tcg_gen_trunc_tl_i32(s0, t0);
 tcg_gen_shli_i32(s0, s0, 1);
 tcg_gen_or_i32(cpu_crf[crf], cpu_crf[crf], s0);
 
 tcg_temp_free(t0);
 tcg_temp_free(t1);
 tcg_temp_free_i32(s0);
 }
 




Re: [Qemu-devel] Possible ppc comparison optimisation

2013-05-08 Thread Torbjorn Granlund
Paolo Bonzini pbonz...@redhat.com writes:

  I think that would be faster on 32-bit hosts, truncs are cheap.
  
And perhaps slower on 64-bit hosts, at least for operations where
additional explicit truncation will be needed (such as before
comparisons and after right shifts).

   There could be a disadvantage of this compared to the old code, since
   this has a chained algebraic dependency, while the old code's many
   instructions might have been more independent.
  
  What about these alternatives:
  
  setcond LT, t0, arg0, arg1
  setcond EQ, t1, arg0, arg1
  trunc   s0, t0
  trunc   s1, t1
  shli    s0, s0, 1   ; s0 = (arg0 < arg1) ? 2 : 0
  subi    s1, s1, 2   ; s1 = (arg0 != arg1) ? -2 : -1
  sub     s0, s0, s1  ; < -> 4, == -> 1, > -> 2
  shli    s0, s0, 1   ; < -> 8, == -> 2, > -> 4
  
  ===
  
  setcond LT, t0, arg0, arg1
  setcond NE, t1, arg0, arg1
  trunc   s0, t0
  trunc   s1, t1
  add     s0, s0, s1  ; < -> 2, == -> 0, > -> 1
  movi    s1, 1
  add     s0, s0, s1  ; < -> 3, == -> 1, > -> 2
  shl     s1, s1, s0  ; < -> 8, == -> 2, > -> 4
  
Surely there are many alternative forms.
Is your aim to add micro-parallelism?

(Your sequences look a bit curious.  Did you use a super-optimiser?)

-- 
Torbjörn



Re: [Qemu-devel] Possible ppc comparison optimisation

2013-05-08 Thread Paolo Bonzini
On 08/05/2013 17:44, Torbjorn Granlund wrote:
 Paolo Bonzini pbonz...@redhat.com writes:
 
   I think that would be faster on 32-bit hosts, truncs are cheap.
   
 And perhaps slower on 64-bit hosts, at least for operations where
 additional explicit truncation will be needed (such as before
 comparisons and after right shifts).
 
There could be a disadvantage of this compared to the old code, since
this has a chained algebraic dependency, while the old code's many
instructions might have been more independent.
   
   What about these alternatives:
   
   setcond LT, t0, arg0, arg1
   setcond EQ, t1, arg0, arg1
   trunc   s0, t0
   trunc   s1, t1
   shli    s0, s0, 1   ; s0 = (arg0 < arg1) ? 2 : 0
   subi    s1, s1, 2   ; s1 = (arg0 != arg1) ? -2 : -1
   sub     s0, s0, s1  ; < -> 4, == -> 1, > -> 2
   shli    s0, s0, 1   ; < -> 8, == -> 2, > -> 4
   
   ===
   
   setcond LT, t0, arg0, arg1
   setcond NE, t1, arg0, arg1
   trunc   s0, t0
   trunc   s1, t1
   add     s0, s0, s1  ; < -> 2, == -> 0, > -> 1
   movi    s1, 1
   add     s0, s0, s1  ; < -> 3, == -> 1, > -> 2
   shl     s1, s1, s0  ; < -> 8, == -> 2, > -> 4
   
 Surely there are many alternative forms.
 Is your aim to add micro-parallelism?

Yes, and in this respect I think the first one is better.  The
second could be three instructions on machines that have a set-nth-bit
instruction _and_ a zero register, but I'm not sure they exist...

 (Your sequences look a bit curious.  Did you use a super-optimiser?)

No, but I am attracted to these curious sequences from my previous life
working on compilers. :)  I know your superoptimizer and, in fact, we
both worked on some parts of GCC (optimization of conditional
branches/stores), just 20 years apart.

The second is actually not too curious after you look at it for a while:
it is a variant of the usual (x < y) + (x <= y) trick used to generate a
0/1/2 result.  The first I found by trial and error based on yours; it
is basically (x < y) * 2 - (x == y) + 2, with some reordering to get
parallelism and avoid the need for subfi-like instructions.
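
Spelled out as plain C, the two mappings can be sanity-checked with a
throwaway program like the one below (illustrative only, not part of
any patch; classic and first_seq are made-up names for the two
formulas):

/* Throwaway check of the two mappings, outside QEMU. */
#include <assert.h>
#include <stdint.h>

static int classic(int64_t x, int64_t y)   /* 0 for >, 1 for ==, 2 for < */
{
    return (x < y) + (x <= y);
}

static int first_seq(int64_t x, int64_t y) /* 2 for >, 1 for ==, 4 for < */
{
    return (x < y) * 2 - (x == y) + 2;
}

int main(void)
{
    assert(classic(3, 1) == 0 && classic(1, 1) == 1 && classic(1, 3) == 2);
    /* One more shift left turns 4/1/2 into the CR bits 8/2/4. */
    assert(first_seq(3, 1) == 2 && first_seq(1, 1) == 1 && first_seq(1, 3) == 4);
    return 0;
}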

Paolo



[Qemu-devel] Possible ppc comparison optimisation

2013-05-07 Thread Torbjorn Granlund
The current ppc gen_op_cmp generates a long sequence of instructions,
using a plain series of three disjoint compares.

It is possible to compute the 3 result bits more cleverly.  Below is a
possible replacement gen_op_cmp.  (It is tested by booting GNU/Linux
ppc64, but not much more than that.)

Surely this should be faster than the old code?  OK, it is less
readable, but cmp is pretty critical and should be made fast.

Should one truncate things using tcg_gen_trunc_tl_i32 and do the add,
xori, addi as i32 variants?  (Why?)

There could be a disadvantage of this compared to the old code, since
this has a chained algebraic dependency, while the old code's many
instructions might have been more independent.

static inline void gen_op_cmp(TCGv arg0, TCGv arg1, int s, int crf)
{
    TCGv t0 = tcg_temp_new();
    TCGv t1 = tcg_temp_new();
    TCGv_i32 s0 = tcg_temp_new_i32();

    /* Start the CR field from the SO bit.  */
    tcg_gen_trunc_tl_i32(cpu_crf[crf], cpu_so);

    /* t0 + t1 is 0 for >, 1 for ==, 2 for <.  */
    tcg_gen_setcond_tl((s ? TCG_COND_LE : TCG_COND_LEU), t0, arg0, arg1);
    tcg_gen_setcond_tl((s ? TCG_COND_LT : TCG_COND_LTU), t1, arg0, arg1);
    tcg_gen_add_tl(t0, t0, t1);

    /* xor 1, add 1: > -> 2, == -> 1, < -> 4.  */
    tcg_gen_xori_tl(t0, t0, 1);
    tcg_gen_addi_tl(t0, t0, 1);
    tcg_gen_trunc_tl_i32(s0, t0);

    /* Shift into place (GT = 4, EQ = 2, LT = 8) and OR into the CR field.  */
    tcg_gen_shli_i32(s0, s0, 1);
    tcg_gen_or_i32(cpu_crf[crf], cpu_crf[crf], s0);

    tcg_temp_free(t0);
    tcg_temp_free(t1);
    tcg_temp_free_i32(s0);
}
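
The arithmetic can be sanity-checked outside QEMU with a few lines of
plain C (illustrative only; cr_bits is a made-up helper that mirrors
the le/lt, xor, add, and shift steps above):

/* Standalone sanity check of the le/lt arithmetic above (not QEMU code). */
#include <assert.h>
#include <stdint.h>

static unsigned cr_bits(int64_t a, int64_t b)
{
    unsigned t0 = (a <= b);      /* TCG_COND_LE                */
    unsigned t1 = (a < b);       /* TCG_COND_LT                */
    unsigned r  = t0 + t1;       /* 0 for >, 1 for ==, 2 for < */
    r ^= 1;                      /* 1 for >, 0 for ==, 3 for < */
    r += 1;                      /* 2 for >, 1 for ==, 4 for < */
    return r << 1;               /* GT = 4, EQ = 2, LT = 8     */
}

int main(void)
{
    assert(cr_bits(5, 3) == 4);  /* GT */
    assert(cr_bits(3, 3) == 2);  /* EQ */
    assert(cr_bits(3, 5) == 8);  /* LT */
    return 0;
}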

-- 
Torbjörn