Re: [Qemu-devel] [PATCH] tcg: increase MAX_OP_PER_INSTR to 395

2016-09-23 Thread Joseph Myers
On Fri, 23 Sep 2016, Richard Henderson wrote:

> While increasing the max per insn is indeed one way to approach this, aarch64
> is being remarkably inefficient in this case.  With the following, I see a
> reduction from 387 ops to 261 ops; for a 64-bit host, the reduction is from
> 258 ops to 195 ops.

261 ops, plus the ops generated in gen_intermediate_code_a64 after the
loop, plus ops added by optimization, may of course still require an
increase from 266.  (I don't know how to bound the number of ops for
which space must still be available after translating an instruction
has resulted in tcg_op_buf_full() being true, but my testing had cases
where it was at least 8.)
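
For reference, the rough shape of the relevant loop (paraphrased from
gen_intermediate_code_a64, not the exact source) is:

    do {
        tcg_gen_insn_start(dc->pc, 0, 0);   /* 1 op per guest insn */
        disas_a64_insn(env, dc);            /* up to MAX_OP_PER_INSTR ops */
        num_insns++;
    } while (!dc->is_jmp && !tcg_op_buf_full() && num_insns < max_insns);

tcg_op_buf_full() is only consulted between instructions, so the ops
emitted after the loop (and anything the optimizer adds) must fit in
whatever headroom remains after the instruction that made it return
true.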

-- 
Joseph S. Myers
jos...@codesourcery.com



Re: [Qemu-devel] [PATCH] tcg: increase MAX_OP_PER_INSTR to 395

2016-09-23 Thread Richard Henderson

On 09/22/2016 04:53 PM, Joseph Myers wrote:

> MAX_OP_PER_INSTR is currently 266, reported in commit
> 14dcdac82f398cbac874c8579b9583fab31c67bf to be the worst case for the
> ARM A64 decoder.
> 
> Whether or not it was in fact the worst case at that time in 2014, I'm
> observing the instruction 0x4c006020 (st1 {v0.16b-v2.16b}, [x1])
> generate 386 ops from disas_ldst_multiple_struct with current sources,


For the record, I reproduce your results on a 32-bit host with v0-v3.  I assume 
the v2 here is a typo.


While increasing the max per insn is indeed one way to approach this, aarch64 
is being remarkably inefficient in this case.  With the following, I see a 
reduction from 387 ops to 261 ops; for a 64-bit host, the reduction is from 258 
ops to 195 ops.


I should also note that the implementation of this insn should be even
simpler.  I see this insn as performing 8 64-bit, little-endian,
unaligned stores.  We should be able to implement this insn for a
64-bit host in about 25 ops, which implies that the current code is
nearly 8 times too large.
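
As a minimal sketch of what I mean (hypothetical, not the current code;
it assumes the fp_reg_offset()/fp_reg_hi_offset() and get_mem_index()
helpers already used elsewhere in translate-a64.c):

    TCGv_i64 tcg_tmp = tcg_temp_new_i64();
    for (r = 0; r < 4; r++) {
        int reg = (rt + r) % 32;
        /* store the low 64 bits of the vector register ... */
        tcg_gen_ld_i64(tcg_tmp, cpu_env, fp_reg_offset(s, reg, MO_64));
        tcg_gen_qemu_st_i64(tcg_tmp, tcg_addr, get_mem_index(s), MO_LEQ);
        tcg_gen_addi_i64(tcg_addr, tcg_addr, 8);
        /* ... then the high 64 bits */
        tcg_gen_ld_i64(tcg_tmp, cpu_env, fp_reg_hi_offset(s, reg));
        tcg_gen_qemu_st_i64(tcg_tmp, tcg_addr, get_mem_index(s), MO_LEQ);
        tcg_gen_addi_i64(tcg_addr, tcg_addr, 8);
    }
    tcg_temp_free_i64(tcg_tmp);

That is eight env loads, eight guest stores and eight address adds,
i.e. roughly 25 ops on a 64-bit host.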


The same should be true for other combinations of sizes for ldst.  I
recognize that it gets more complicated for a big-endian guest and for
element sizes larger than 1, but for element sizes larger than 1 we
automatically have <= half of the number of ops seen here.
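
Concretely, with elements = (is_q ? 128 : 64) / (8 << size) as computed
in disas_ldst_multiple_struct, a 128-bit register holds 16 elements at
size 0, 8 at size 1, 4 at size 2 and 2 at size 3, so the per-element
loop runs at most half as often as soon as the element size exceeds one
byte.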



r~
diff --git a/target-arm/translate-a64.c b/target-arm/translate-a64.c
index ddf52f5..e44bf96 100644
--- a/target-arm/translate-a64.c
+++ b/target-arm/translate-a64.c
@@ -2536,7 +2536,7 @@ static void disas_ldst_multiple_struct(DisasContext *s, uint32_t insn)
     bool is_store = !extract32(insn, 22, 1);
     bool is_postidx = extract32(insn, 23, 1);
     bool is_q = extract32(insn, 30, 1);
-    TCGv_i64 tcg_addr, tcg_rn;
+    TCGv_i64 tcg_addr, tcg_rn, tcg_ebytes;
 
     int ebytes = 1 << size;
     int elements = (is_q ? 128 : 64) / (8 << size);
@@ -2601,6 +2601,7 @@ static void disas_ldst_multiple_struct(DisasContext *s, uint32_t insn)
     tcg_rn = cpu_reg_sp(s, rn);
     tcg_addr = tcg_temp_new_i64();
     tcg_gen_mov_i64(tcg_addr, tcg_rn);
+    tcg_ebytes = tcg_const_i64(ebytes);
 
     for (r = 0; r < rpt; r++) {
         int e;
@@ -2624,7 +2625,7 @@ static void disas_ldst_multiple_struct(DisasContext *s, uint32_t insn)
                         clear_vec_high(s, tt);
                     }
                 }
-                tcg_gen_addi_i64(tcg_addr, tcg_addr, ebytes);
+                tcg_gen_add_i64(tcg_addr, tcg_addr, tcg_ebytes);
                 tt = (tt + 1) % 32;
             }
         }
@@ -2638,6 +2639,7 @@ static void disas_ldst_multiple_struct(DisasContext *s, uint32_t insn)
             tcg_gen_add_i64(tcg_rn, tcg_rn, cpu_reg(s, rm));
         }
     }
+    tcg_temp_free_i64(tcg_ebytes);
     tcg_temp_free_i64(tcg_addr);
 }
 
@@ -2680,7 +2682,7 @@ static void disas_ldst_single_struct(DisasContext *s, uint32_t insn)
     bool replicate = false;
     int index = is_q << 3 | S << 2 | size;
     int ebytes, xs;
-    TCGv_i64 tcg_addr, tcg_rn;
+    TCGv_i64 tcg_addr, tcg_rn, tcg_ebytes;
 
     switch (scale) {
     case 3:
@@ -2733,6 +2735,7 @@ static void disas_ldst_single_struct(DisasContext *s, uint32_t insn)
     tcg_rn = cpu_reg_sp(s, rn);
     tcg_addr = tcg_temp_new_i64();
     tcg_gen_mov_i64(tcg_addr, tcg_rn);
+    tcg_ebytes = tcg_const_i64(ebytes);
 
     for (xs = 0; xs < selem; xs++) {
         if (replicate) {
@@ -2776,7 +2779,7 @@ static void disas_ldst_single_struct(DisasContext *s, uint32_t insn)
                 do_vec_st(s, rt, index, tcg_addr, s->be_data + scale);
             }
         }
-        tcg_gen_addi_i64(tcg_addr, tcg_addr, ebytes);
+        tcg_gen_add_i64(tcg_addr, tcg_addr, tcg_ebytes);
         rt = (rt + 1) % 32;
     }
 
@@ -2788,6 +2791,7 @@ static void disas_ldst_single_struct(DisasContext *s, uint32_t insn)
             tcg_gen_add_i64(tcg_rn, tcg_rn, cpu_reg(s, rm));
         }
     }
+    tcg_temp_free_i64(tcg_ebytes);
     tcg_temp_free_i64(tcg_addr);
 }
 


Re: [Qemu-devel] [PATCH] tcg: increase MAX_OP_PER_INSTR to 395

2016-09-23 Thread Joseph Myers
On Fri, 23 Sep 2016, Laurent Desnogues wrote:

> Hello,
> 
> On Fri, Sep 23, 2016 at 1:53 AM, Joseph Myers  wrote:
> > MAX_OP_PER_INSTR is currently 266, reported in commit
> > 14dcdac82f398cbac874c8579b9583fab31c67bf to be the worst case for the
> > ARM A64 decoder.
> >
> > Whether or not it was in fact the worst case at that time in 2014, I'm
> > observing the instruction 0x4c006020 (st1 {v0.16b-v2.16b}, [x1])
> > generate 386 ops from disas_ldst_multiple_struct with current sources,
> 
> Something's odd: I get exactly half of that, 193.

Does the number of ops depend on the host for which TCG is generating
code?  (I'm building QEMU for 32-bit x86.)  If 32-bit hosts require
twice as many ops as 64-bit hosts, maybe the existing value should be
doubled to 532 (plus whatever is needed to allow for extra ops from
optimization etc.), or made conditional on the host for which code is
generated.
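
For instance, mirroring the HOST_LONG_BITS conditional already present
in tcg.h, it might look like this (with the numbers taken from this
thread rather than from any verified bound):

    #if HOST_LONG_BITS == 32
    #define MAX_OP_PER_INSTR 532
    #else
    #define MAX_OP_PER_INSTR 266
    #endif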

> That being said, st1 {v0.16b-v3.16b}, [x1], #64 generates even more
> ops: 258.

My empirical observations are only from examining cases where QEMU gives 
errors running GCC tests; it's quite possible some instructions aren't 
covered, or that the relevant tests got lucky and avoided buffer overruns 
despite generating too many ops.

-- 
Joseph S. Myers
jos...@codesourcery.com



Re: [Qemu-devel] [PATCH] tcg: increase MAX_OP_PER_INSTR to 395

2016-09-22 Thread Laurent Desnogues
Hello,

On Fri, Sep 23, 2016 at 1:53 AM, Joseph Myers  wrote:
> MAX_OP_PER_INSTR is currently 266, reported in commit
> 14dcdac82f398cbac874c8579b9583fab31c67bf to be the worst case for the
> ARM A64 decoder.
>
> Whether or not it was in fact the worst case at that time in 2014, I'm
> observing the instruction 0x4c006020 (st1 {v0.16b-v2.16b}, [x1])
> generate 386 ops from disas_ldst_multiple_struct with current sources,

Something's odd: I get exactly half of that, 193.

That being said, st1 {v0.16b-v3.16b}, [x1], #64 generates even more ops: 258.

Thanks,

Laurent

> plus one op from the call to tcg_gen_insn_start in the loop in
> gen_intermediate_code_a64.  Furthermore, I see six ops generated after
> the loop in gen_intermediate_code_a64, and at least two added
> subsequently in optimization, so MAX_OP_PER_INSTR needs to be at least
> 395.  I do not know whether other instructions, or code during or
> after the loop in gen_intermediate_code_a64, might actually require
> the value to be bigger than 395 (possibly depending on the
> instructions translated before the one generating 386 ops), just that
> 395 is definitely needed for a GCC testcase that generates that
> particular instruction.  So if there is a reliable methodology to
> determine the maximum number of ops that might be generated in (one
> pass through that loop, plus the code after that loop, plus
> optimization), it should be used instead, and might result in a higher
> figure (or maybe a higher figure would be safer anyway).
>
> Signed-off-by: Joseph Myers 
>
> ---
>
> diff --git a/tcg/tcg.h b/tcg/tcg.h
> index c9949aa..a7fa452 100644
> --- a/tcg/tcg.h
> +++ b/tcg/tcg.h
> @@ -32,7 +32,7 @@
>  #include "tcg-target.h"
>
>  /* XXX: make safe guess about sizes */
> -#define MAX_OP_PER_INSTR 266
> +#define MAX_OP_PER_INSTR 395
>
>  #if HOST_LONG_BITS == 32
>  #define MAX_OPC_PARAM_PER_ARG 2
>
> --
> Joseph S. Myers
> jos...@codesourcery.com
>