Re: [Qemu-devel] [PATCH 2/7] nios2: Add architecture emulation support

2016-10-19 Thread Marek Vasut
On 10/19/2016 06:18 PM, Richard Henderson wrote:
> On 10/18/2016 07:31 PM, Marek Vasut wrote:
>>> Processing a little more data can be preferable to fewer branch
>>> prediction failures.  And the best way to avoid those is to not have the
>>> branch at all. Especially when it's unlikely that the data will be
>>> created in the first place.
>>
>> OK, but I need to make sure stores to ZERO register are ignored, right ?
> 
> Yes.
> 
> Just remember that since nios2 doesn't have condition codes, where side
> effects of an instruction can be useful alone, the *only* likely case of
> ZERO as a destination register is in the (designated) NOP psuedo-op.

Got it, thanks.

-- 
Best regards,
Marek Vasut



Re: [Qemu-devel] [PATCH 2/7] nios2: Add architecture emulation support

2016-10-19 Thread Richard Henderson

On 10/18/2016 07:31 PM, Marek Vasut wrote:

Processing a little more data can be preferable to fewer branch
prediction failures.  And the best way to avoid those is to not have the
branch at all. Especially when it's unlikely that the data will be
created in the first place.


OK, but I need to make sure stores to ZERO register are ignored, right ?


Yes.

Just remember that since nios2 doesn't have condition codes, where side effects 
of an instruction can be useful alone, the *only* likely case of ZERO as a 
destination register is in the (designated) NOP psuedo-op.



r~



Re: [Qemu-devel] [PATCH 2/7] nios2: Add architecture emulation support

2016-10-18 Thread Marek Vasut
On 10/19/2016 03:24 AM, Richard Henderson wrote:
> On 10/18/2016 03:05 PM, Marek Vasut wrote:
 Thanks, I hope this is fixed now, although I mostly special-case the
 R_ZERO handling throughout the code. Any writes to R_ZERO are now
 ignored and any usage is converted to mov/movi instructions where
 applicable.
>>>
>>> We've done that in the past, but in the end it is much cleaner to
>>> minimize the number of places in which you have to check for R_ZERO.
>>
>> Isn't it a bit more performant if you generate as little TCG
>> instructions as possible ?
> 
> Well, yes and no.
> 
> We're always going to run the tcg optimizers, so the resulting code
> should be the same either way.

I see, I'd have to dig deeper into the TCG.

> Processing a little more data can be preferable to fewer branch
> prediction failures.  And the best way to avoid those is to not have the
> branch at all. Especially when it's unlikely that the data will be
> created in the first place.

OK, but I need to make sure stores to ZERO register are ignored, right ?

-- 
Best regards,
Marek Vasut



Re: [Qemu-devel] [PATCH 2/7] nios2: Add architecture emulation support

2016-10-18 Thread Richard Henderson

On 10/18/2016 03:05 PM, Marek Vasut wrote:

Thanks, I hope this is fixed now, although I mostly special-case the
R_ZERO handling throughout the code. Any writes to R_ZERO are now
ignored and any usage is converted to mov/movi instructions where
applicable.


We've done that in the past, but in the end it is much cleaner to
minimize the number of places in which you have to check for R_ZERO.


Isn't it a bit more performant if you generate as little TCG
instructions as possible ?


Well, yes and no.

We're always going to run the tcg optimizers, so the resulting code should be 
the same either way.


Processing a little more data can be preferable to fewer branch prediction 
failures.  And the best way to avoid those is to not have the branch at all. 
Especially when it's unlikely that the data will be created in the first place.



r~



Re: [Qemu-devel] [PATCH 2/7] nios2: Add architecture emulation support

2016-10-18 Thread Marek Vasut
On 10/18/2016 10:44 PM, Richard Henderson wrote:
> On 10/18/2016 11:32 AM, Marek Vasut wrote:
>> But the instruction encoding does, so I can use the field from the
>> instruction to directly index the register array.
> 
> Well, no, you can't.
> 
> In fact, it would be cleaner to have multiple arrays -- one for general
> registers and one for control registers -- that you *could* actually
> directly index.  There are only two instructions that touch the control
> registers, rdctl and wrctl.  In those instructions you're having to
> explicitly add 32.

Well yes, that's true.

>> Thanks, I hope this is fixed now, although I mostly special-case the
>> R_ZERO handling throughout the code. Any writes to R_ZERO are now
>> ignored and any usage is converted to mov/movi instructions where
>> applicable.
> 
> We've done that in the past, but in the end it is much cleaner to
> minimize the number of places in which you have to check for R_ZERO.

Isn't it a bit more performant if you generate as little TCG
instructions as possible ?

> r~


-- 
Best regards,
Marek Vasut



Re: [Qemu-devel] [PATCH 2/7] nios2: Add architecture emulation support

2016-10-18 Thread Richard Henderson

On 10/18/2016 11:32 AM, Marek Vasut wrote:

But the instruction encoding does, so I can use the field from the
instruction to directly index the register array.


Well, no, you can't.

In fact, it would be cleaner to have multiple arrays -- one for general 
registers and one for control registers -- that you *could* actually directly 
index.  There are only two instructions that touch the control registers, rdctl 
and wrctl.  In those instructions you're having to explicitly add 32.



Thanks, I hope this is fixed now, although I mostly special-case the
R_ZERO handling throughout the code. Any writes to R_ZERO are now
ignored and any usage is converted to mov/movi instructions where
applicable.


We've done that in the past, but in the end it is much cleaner to minimize the 
number of places in which you have to check for R_ZERO.



r~



Re: [Qemu-devel] [PATCH 2/7] nios2: Add architecture emulation support

2016-10-18 Thread Marek Vasut
On 10/18/2016 05:31 PM, Richard Henderson wrote:
> On 10/17/2016 08:58 PM, Marek Vasut wrote:
>>> There's no particular reason why R_PC needs to be 64; if you change it
>>> to 32, you can simplify this.
>>
>> I believe this is in fact needed, see [1] page 18 (section 2, Register
>> file), quote:
>>
>> "
>> The Nios II architecture supports a flat register file, consisting of
>> thirty-two 32-bit general-purpose integer
>> registers, and up to thirty-two 32-bit control registers. The
>> architecture supports supervisor and user
>> modes that allow system code to protect the control registers from
>> errant applications.
>> "
>>
>> So the CPU supports 32 general purpose regs (R_ZERO..R_RA), then up-to
>> 32 Control registers (CR_STATUS..CR_MPUACC) and then the PC .
> 
> The architecture manual doesn't imply anything about the numbering of
> these registers.

But the instruction encoding does, so I can use the field from the
instruction to directly index the register array.

>>> You're not doing anything to make sure that r0 reads as 0, and ignores
>>> writes. You need to look at target-alpha, target-mips, or target-sparc
>>> to see various ways in which a zero register may be handled.
>>
>> Any hint on this one ?
> 
> Well, there's three hints right there.  But look at target-alpha, and
> the functions load_gpr and dest_gpr (note that for alpha, 31 is the zero
> register).

Thanks, I hope this is fixed now, although I mostly special-case the
R_ZERO handling throughout the code. Any writes to R_ZERO are now
ignored and any usage is converted to mov/movi instructions where
applicable.

>>> Since divide-by-zero is undefined, this can be done with
>>>
>>>   TCGv t0 = tcg_const_tl(0);
>>>   tcg_gen_setcond_tl(TCG_COND_EQ, t0, dc->cpu_R[instr->b], t0);
>>>   tcg_gen_or_tl(t0, t0, dc->cpu_R[instr->b]);
>>>   tcg_gen_divu_tl(dc->cpu_R[instr->c], dc->cpu_R[instr->a], t0);
>>>   tcg_temp_free_tl(t0);
>>
>> Just so I get it completely, isn't this handled somehow by the
>> tcg_gen_divu_tl() itself ?
> 
> No, tcg_gen_divu_tl is a plain host divide instruction.
> 
> For guests that require exceptions to be raised, we handle those
> manually beforehand.  At which point extra checks in tcg_gen_divu_tl are
> wasted.

OK, got it, thanks!

-- 
Best regards,
Marek Vasut



Re: [Qemu-devel] [PATCH 2/7] nios2: Add architecture emulation support

2016-10-18 Thread Richard Henderson

On 10/17/2016 08:58 PM, Marek Vasut wrote:

There's no particular reason why R_PC needs to be 64; if you change it
to 32, you can simplify this.


I believe this is in fact needed, see [1] page 18 (section 2, Register
file), quote:

"
The Nios II architecture supports a flat register file, consisting of
thirty-two 32-bit general-purpose integer
registers, and up to thirty-two 32-bit control registers. The
architecture supports supervisor and user
modes that allow system code to protect the control registers from
errant applications.
"

So the CPU supports 32 general purpose regs (R_ZERO..R_RA), then up-to
32 Control registers (CR_STATUS..CR_MPUACC) and then the PC .


The architecture manual doesn't imply anything about the numbering of these 
registers.



You're not doing anything to make sure that r0 reads as 0, and ignores
writes. You need to look at target-alpha, target-mips, or target-sparc
to see various ways in which a zero register may be handled.


Any hint on this one ?


Well, there's three hints right there.  But look at target-alpha, and the 
functions load_gpr and dest_gpr (note that for alpha, 31 is the zero register).



Since divide-by-zero is undefined, this can be done with

  TCGv t0 = tcg_const_tl(0);
  tcg_gen_setcond_tl(TCG_COND_EQ, t0, dc->cpu_R[instr->b], t0);
  tcg_gen_or_tl(t0, t0, dc->cpu_R[instr->b]);
  tcg_gen_divu_tl(dc->cpu_R[instr->c], dc->cpu_R[instr->a], t0);
  tcg_temp_free_tl(t0);


Just so I get it completely, isn't this handled somehow by the
tcg_gen_divu_tl() itself ?


No, tcg_gen_divu_tl is a plain host divide instruction.

For guests that require exceptions to be raised, we handle those manually 
beforehand.  At which point extra checks in tcg_gen_divu_tl are wasted.




r~



Re: [Qemu-devel] [PATCH 2/7] nios2: Add architecture emulation support

2016-10-17 Thread Marek Vasut
On 09/29/2016 03:05 AM, Richard Henderson wrote:
>>  hw/nios2/cpu_pic.c |   70 +++
> 
> Why is this in this patch?

Ah, good catch, moved to 10m50 support patch.

>>  target-nios2/instruction.c | 1427
>> 
>>  target-nios2/instruction.h |  279 +
>>  target-nios2/translate.c   |  242 
> 
> Why are these files separate?

Remnant of the old code, will merge them.

>> +if (n < 32)/* GP regs */
>> +return gdb_get_reg32(mem_buf, env->regs[n]);
>> +else if (n == 32)/* PC */
>> +return gdb_get_reg32(mem_buf, env->regs[R_PC]);
>> +else if (n < 49)/* Status regs */
>> +return gdb_get_reg32(mem_buf, env->regs[n - 1]);
> 
> Use checkpatch.pl; the formatting is wrong.

Fixed across the entire series, thanks.

> There's no particular reason why R_PC needs to be 64; if you change it
> to 32, you can simplify this.

I believe this is in fact needed, see [1] page 18 (section 2, Register
file), quote:

"
The Nios II architecture supports a flat register file, consisting of
thirty-two 32-bit general-purpose integer
registers, and up to thirty-two 32-bit control registers. The
architecture supports supervisor and user
modes that allow system code to protect the control registers from
errant applications.
"

So the CPU supports 32 general purpose regs (R_ZERO..R_RA), then up-to
32 Control registers (CR_STATUS..CR_MPUACC) and then the PC .


[1]
https://www.altera.com/content/dam/altera-www/global/en_US/pdfs/literature/hb/nios2/n2cpu_nii5v1.pdf

>> +struct CPUNios2State {
>> +uint32_t regs[NUM_CORE_REGS];
>> +
>> +/* Addresses that are hard-coded in the FPGA build settings */
>> +uint32_t reset_addr;
>> +uint32_t exception_addr;
>> +uint32_t fast_tlb_miss_addr;
> 
> These three should go after ...
> 
>> +
>> +#if !defined(CONFIG_USER_ONLY)
>> +Nios2MMU mmu;
>> +
>> +uint32_t irq_pending;
>> +#endif
>> +
>> +CPU_COMMON
> 
> ... here, or even into ...
> 
>> +};
>> +
>> +/**
>> + * Nios2CPU:
>> + * @env: #CPUNios2State
>> + *
>> + * A Nios2 CPU.
>> + */
>> +typedef struct Nios2CPU {
>> +/*< private >*/
>> +CPUState parent_obj;
>> +/*< public >*/
>> +
>> +CPUNios2State env;
>> +bool mmu_present;
> 
> ... here.

Moved here, it looks more sensible as it's not something one can change
at runtime.

>> +static inline void t_gen_helper_raise_exception(DisasContext *dc,
>> +uint32_t index)
> 
> Remove all of the inline markers and let the compiler decide.

Done globally, thanks.

>> +static void gen_check_supervisor(DisasContext *dc, TCGLabel *label)
>> +{
>> +TCGLabel *l1 = gen_new_label();
>> +
>> +TCGv_i32 tmp = tcg_temp_new();
>> +tcg_gen_andi_tl(tmp, dc->cpu_R[CR_STATUS], CR_STATUS_U);
>> +tcg_gen_brcond_tl(TCG_COND_EQ, dc->cpu_R[R_ZERO], tmp, l1);
>> +t_gen_helper_raise_exception(dc, EXCP_SUPERI);
>> +tcg_gen_br(label);
> 
> The supervisor check should be done at translate time by checking
> dc->tb->flags & CR_STATUS_U.

Fixed.

>> +#ifdef CALL_TRACING
>> +TCGv_i32 tmp = tcg_const_i32(dc->pc);
>> +TCGv_i32 tmp2 = tcg_const_i32((dc->pc & 0xF000) |
>> (instr->imm26 * 4));
>> +gen_helper_call_status(tmp, tmp2);
>> +tcg_temp_free_i32(tmp);
>> +tcg_temp_free_i32(tmp2);
>> +#endif
> 
> What's the point of this?

Seems like some stale debug helper, removed, thanks.

>> +/*
>> + * I-Type instructions
>> + */
>> +
>> +/* rB <- 0x00 : Mem8[rA + @(IMM16)] */
>> +static void ldbu(DisasContext *dc, uint32_t code)
>> +{
>> +I_TYPE(instr, code);
>> +
>> +TCGv addr = tcg_temp_new();
>> +tcg_gen_addi_tl(addr, dc->cpu_R[instr->a],
>> +(int32_t)((int16_t)instr->imm16));
>> +
>> +tcg_gen_qemu_ld8u(dc->cpu_R[instr->b], addr, dc->mem_idx);
>> +
>> +tcg_temp_free(addr);
>> +}
> 
> You should have helper functions so that you don't have to replicate 12
> line functions.  Use the tcg_gen_qemu_ld_tl function, and the TCGMemOp
> flags to help merge all load instructions, and all store instructions.

Oh nice, thanks.

>> +/* rB <- rA + IMM16 */
>> +static void addi(DisasContext *dc, uint32_t code)
>> +{
>> +I_TYPE(instr, code);
>> +
>> +TCGv imm = tcg_temp_new();
>> +tcg_gen_addi_tl(dc->cpu_R[instr->b], dc->cpu_R[instr->a],
>> +(int32_t)((int16_t)instr->imm16));
> 
> The double cast is pointless, as are the extra parenthesis.

The int16_t cast is needed as the imm16 is signed value. The int32()
cast is indeed pointless, so removed. Thanks.

> You're not doing anything to make sure that r0 reads as 0, and ignores
> writes. You need to look at target-alpha, target-mips, or target-sparc
> to see various ways in which a zero register may be handled.

Any hint on this one ?

> You should handle the special case of movi, documented as addi rb, r0,
> imm, so that the common method of loading a small value does not require
> extra 

Re: [Qemu-devel] [PATCH 2/7] nios2: Add architecture emulation support

2016-09-28 Thread Richard Henderson

 hw/nios2/cpu_pic.c |   70 +++


Why is this in this patch?


 target-nios2/instruction.c | 1427 
 target-nios2/instruction.h |  279 +
 target-nios2/translate.c   |  242 


Why are these files separate?


+if (n < 32) /* GP regs */
+return gdb_get_reg32(mem_buf, env->regs[n]);
+else if (n == 32)  /* PC */
+return gdb_get_reg32(mem_buf, env->regs[R_PC]);
+else if (n < 49)/* Status regs */
+return gdb_get_reg32(mem_buf, env->regs[n - 1]);


Use checkpatch.pl; the formatting is wrong.

There's no particular reason why R_PC needs to be 64; if you change it to 32, 
you can simplify this.



+struct CPUNios2State {
+uint32_t regs[NUM_CORE_REGS];
+
+/* Addresses that are hard-coded in the FPGA build settings */
+uint32_t reset_addr;
+uint32_t exception_addr;
+uint32_t fast_tlb_miss_addr;


These three should go after ...


+
+#if !defined(CONFIG_USER_ONLY)
+Nios2MMU mmu;
+
+uint32_t irq_pending;
+#endif
+
+CPU_COMMON


... here, or even into ...


+};
+
+/**
+ * Nios2CPU:
+ * @env: #CPUNios2State
+ *
+ * A Nios2 CPU.
+ */
+typedef struct Nios2CPU {
+/*< private >*/
+CPUState parent_obj;
+/*< public >*/
+
+CPUNios2State env;
+bool mmu_present;


... here.


+static inline void t_gen_helper_raise_exception(DisasContext *dc,
+uint32_t index)


Remove all of the inline markers and let the compiler decide.


+static void gen_check_supervisor(DisasContext *dc, TCGLabel *label)
+{
+TCGLabel *l1 = gen_new_label();
+
+TCGv_i32 tmp = tcg_temp_new();
+tcg_gen_andi_tl(tmp, dc->cpu_R[CR_STATUS], CR_STATUS_U);
+tcg_gen_brcond_tl(TCG_COND_EQ, dc->cpu_R[R_ZERO], tmp, l1);
+t_gen_helper_raise_exception(dc, EXCP_SUPERI);
+tcg_gen_br(label);


The supervisor check should be done at translate time by checking dc->tb->flags 
& CR_STATUS_U.



+#ifdef CALL_TRACING
+TCGv_i32 tmp = tcg_const_i32(dc->pc);
+TCGv_i32 tmp2 = tcg_const_i32((dc->pc & 0xF000) | (instr->imm26 * 4));
+gen_helper_call_status(tmp, tmp2);
+tcg_temp_free_i32(tmp);
+tcg_temp_free_i32(tmp2);
+#endif


What's the point of this?


+/*
+ * I-Type instructions
+ */
+
+/* rB <- 0x00 : Mem8[rA + @(IMM16)] */
+static void ldbu(DisasContext *dc, uint32_t code)
+{
+I_TYPE(instr, code);
+
+TCGv addr = tcg_temp_new();
+tcg_gen_addi_tl(addr, dc->cpu_R[instr->a],
+(int32_t)((int16_t)instr->imm16));
+
+tcg_gen_qemu_ld8u(dc->cpu_R[instr->b], addr, dc->mem_idx);
+
+tcg_temp_free(addr);
+}


You should have helper functions so that you don't have to replicate 12 line 
functions.  Use the tcg_gen_qemu_ld_tl function, and the TCGMemOp flags to help 
merge all load instructions, and all store instructions.



+/* rB <- rA + IMM16 */
+static void addi(DisasContext *dc, uint32_t code)
+{
+I_TYPE(instr, code);
+
+TCGv imm = tcg_temp_new();
+tcg_gen_addi_tl(dc->cpu_R[instr->b], dc->cpu_R[instr->a],
+(int32_t)((int16_t)instr->imm16));


The double cast is pointless, as are the extra parenthesis.

You're not doing anything to make sure that r0 reads as 0, and ignores writes. 
You need to look at target-alpha, target-mips, or target-sparc to see various 
ways in which a zero register may be handled.


You should handle the special case of movi, documented as addi rb, r0, imm, so 
that the common method of loading a small value does not require extra tcg opcodes.


Similarly for the other (common) aliases, like mov and nop, which are both 
aliased to add.



+/* Initializes the data cache line currently caching address rA + @(IMM16) */
+static void initda(DisasContext *dc __attribute__((unused)),
+   uint32_t code __attribute__((unused)))
+{
+/* TODO */
+}


This will always be a nop for qemu.


+static void blt(DisasContext *dc, uint32_t code)
+{
+I_TYPE(instr, code);
+
+TCGLabel *l1 = gen_new_label();
+
+tcg_gen_brcond_tl(TCG_COND_LT, dc->cpu_R[instr->a],
+  dc->cpu_R[instr->b], l1);
+
+gen_goto_tb(dc, 0, dc->pc + 4);
+
+gen_set_label(l1);
+
+gen_goto_tb(dc, 1, dc->pc + 4 + (int16_t)(instr->imm16 & 0xFFFC));
+
+dc->is_jmp = DISAS_TB_JUMP;
+}


These really require a helper function.


+/* rB <- 0x00 : Mem8[rA + @(IMM16)] (bypassing cache) */
+static void ldbuio(DisasContext *dc, uint32_t code)


Reuse ldbu; qemu will never implement caches.


+/* rB <- (rA * @(IMM16))(31..0) */
+static void muli(DisasContext *dc, uint32_t code)
+{
+I_TYPE(instr, code);
+
+TCGv imm = tcg_temp_new();
+tcg_gen_muli_i32(dc->cpu_R[instr->b], dc->cpu_R[instr->a],
+ (int32_t)((int16_t)instr->imm16));
+tcg_temp_free(imm);


imm is unused.


+static const Nios2Instruction i_type_instructions[I_TYPE_COUNT] = {
+[CALL]= INSTRUCTION(call),
+[JMPI]=