Re: qemu questions about x86

2022-04-26 Thread Wei Li
Dear 项晨东

On Sat, Apr 23, 2022 at 3:57 PM 项晨东  wrote:

> Dear QEMU developers:
> Hello~ I'm Xiang Chen dong, a student from Tsinghua University. Recently I
> have been trying to implement a new x86 feature named user interrupts, which
> can be viewed here:
> <https://www.intel.com/content/dam/develop/external/us/en/documents/architecture-instruction-set-extensions-programming-reference.pdf>
> I have worked on this for some time and have reached the point where the new
> MSRs are added and access to them works well; I have also added the new CPUID
> information to qemu64, and I can catch the new instructions by modifying
> `translate.c`. My code can be found here <https://github.com/Xiang-cd/qemu>,
> and the corresponding Linux kernel version can be found here
> <https://github.com/intel/uintr-linux-kernel>.
> But now I have some problems when trying to implement the instructions named
> SENDUIPI and UIRET.
> For SENDUIPI, its main function is to send the user interrupt. In detail, the
> machine accesses memory (at an address saved in a new MSR), reads another
> address from that memory, and then writes some content back. I read the QEMU
> source code and found functions like tcg_gen_qemu_ld, but when I click into
> them from my IDE (VS Code) I cannot find the function body (maybe due to
> macros). So I don't understand how the function works or how I can write a
> new function to access guest machine memory and write back in QEMU.
>

TCG frontend: gen_op_ld_v --> tcg_gen_qemu_ld_tl --> tcg_gen_qemu_ld_i64
(tcg/tcg-op.c) --> gen_ldst_i64
TCG backend: case INDEX_op_qemu_ld_i64: --> tcg_out_qemu_ld
(tcg/i386/tcg-target.c.inc)
You only need to focus on the frontend, and you can learn from how other
instructions are translated.
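
As a minimal sketch (within the QEMU tree, includes omitted), this is roughly
how a new instruction in target/i386/tcg/translate.c could load from and store
to guest virtual memory. The function name and the choice of a 64-bit access
with MO_LEUQ (spelled MO_LEQ in older trees) are my own illustration, not the
actual UINTR code:

    static void gen_uintr_mem_access_sketch(DisasContext *s, TCGv addr)
    {
        TCGv_i64 val = tcg_temp_new_i64();

        /* 8-byte load from the guest virtual address held in 'addr'; this
         * goes through the softmmu TLB, and the backend's tcg_out_qemu_ld
         * emits the host code for it. */
        tcg_gen_qemu_ld_i64(val, addr, s->mem_index, MO_LEUQ);

        /* ... operate on 'val' here ... */

        /* Write the (possibly modified) value back to the same address. */
        tcg_gen_qemu_st_i64(val, addr, s->mem_index, MO_LEUQ);

        tcg_temp_free_i64(val);
    }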

> Another problem is that I don't quite get the idea of how interrupts are
> implemented. I can find functions like raise_interrupt and raise_exception,
> but I don't understand how they interact with the APIC (how the control flow
> is switched to other functions; I found cpu_loop_exit_restore but could not
> find the function body), or how the interrupt is handled.
>

Hardware interrupt raising:
pc_i8259_create --> i8259_init --> x86_allocate_cpu_irq --> pic_irq_request
pic_irq_request --> cpu_interrupt(cs, CPU_INTERRUPT_HARD)
  --> softmmu/cpus.c: cpu_interrupt --> tcg_handle_interrupt
  --> cpu_reset_interrupt --> hw/core/cpu-common.c: cpu_reset_interrupt

Hardware interrupt handling:
cpu_exec --> cpu_handle_interrupt --> cc->tcg_ops->cpu_exec_interrupt --> x86_cpu_exec_interrupt
  --> cpu_get_pic_interrupt --> pic_read_irq
  --> do_interrupt_x86_hardirq --> do_interrupt_all --> do_interrupt_protected
(the jump back into the execution loop uses sigsetjmp/siglongjmp)

Exception handling:
cpu_handle_exception --> cc->tcg_ops->fake_user_interrupt --> x86_cpu_do_interrupt --> do_interrupt_all
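
For a rough illustration of the two entry points above (a sketch within the
QEMU tree, includes omitted; the function names below are made up, while
qemu_irq_raise(), cpu_interrupt() and raise_exception_err_ra() are real calls):

    /* Device side: asserting a qemu_irq line eventually reaches
     * cpu_interrupt(cs, CPU_INTERRUPT_HARD) through pic_irq_request(). */
    static void sketch_assert_irq_line(qemu_irq irq)
    {
        qemu_irq_raise(irq);
        /* ... later: qemu_irq_lower(irq); */
    }

    /* Helper side: an architectural exception raised from a TCG helper.
     * raise_exception_err_ra() does not return; it ends in
     * cpu_loop_exit_restore(), which siglongjmps back into cpu_exec(). */
    static void sketch_raise_gp(CPUX86State *env, uintptr_t retaddr)
    {
        raise_exception_err_ra(env, EXCP0D_GPF, 0, retaddr);
    }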


>
>
> The problem is difficult in some ways; I discussed it with my classmates and
> friends, but there is no answer.
> So I'm hoping to get important information from you. Is my way of reading the
> code right? Are there any tools for development (finding the function
> body)? How can I accomplish this quickly?
> Thank you very very much!
> Best wishes!
> Xiang Chen Dong
>

Everything here may contain some mistakes.
I hope it is useful for you.
-- 
best wishes!

Wei Li


Re: [PATCH 0/2] target/i386: Some mmx/sse instructions don't require CR0.TS=0

2022-04-04 Thread Wei Li
Ping..

And the full title is: target/i386: Some mmx/sse instructions don't require
CR0.TS=0

On Fri, Mar 25, 2022 at 10:55 PM Wei Li  wrote:

> Resolves: https://gitlab.com/qemu-project/qemu/-/issues/427
>
> All instructions decoded by 'gen_sse' are assumed to require CR0.TS=0. But
> according to the SDM, CRC32 doesn't require it. In fact, EMMS, FEMMS and some
> mmx/sse instructions (0F38F[0-F], 0F3AF[0-F]) don't require it.
>
> To solve the problem, first move EMMS and FEMMS out of gen_sse. Then
> instructions in 'gen_sse' require it only when (modrm & 0xF0) is zero.
>
> Wei Li (2):
>   Move EMMS and FEMMS instructions out of gen_sse
>   Some mmx/sse instructions in 'gen_sse' don't require CR0.TS=0
>
>  target/i386/tcg/translate.c | 45 +
>  1 file changed, 21 insertions(+), 24 deletions(-)
>
> --
> 2.30.2
>
>
>
Thanks.
--
Wei Li


[PATCH 2/2] Some mmx/sse instructions in 'gen_sse' don't require CR0.TS=0

2022-03-25 Thread Wei Li
Some instructions in 'gen_sse' don't require CR0.TS=0; their opcodes are
0F38F[0-F] and 0F3AF[0-F].

Signed-off-by: Wei Li 
---
 target/i386/tcg/translate.c | 17 +
 1 file changed, 9 insertions(+), 8 deletions(-)

diff --git a/target/i386/tcg/translate.c b/target/i386/tcg/translate.c
index fe9fcdae96..14cf11771c 100644
--- a/target/i386/tcg/translate.c
+++ b/target/i386/tcg/translate.c
@@ -3139,8 +3139,16 @@ static void gen_sse(CPUX86State *env, DisasContext *s, int b,
 is_xmm = 1;
 }
 }
+
+modrm = x86_ldub_code(env, s);
+reg = ((modrm >> 3) & 7);
+if (is_xmm) {
+reg |= REX_R(s);
+}
+mod = (modrm >> 6) & 3;
 /* simple MMX/SSE operation */
-if (s->flags & HF_TS_MASK) {
+if ((s->flags & HF_TS_MASK)
+&& (!(modrm & 0xF0))) {
 gen_exception(s, EXCP07_PREX, pc_start - s->cs_base);
 return;
 }
@@ -3159,13 +3167,6 @@ static void gen_sse(CPUX86State *env, DisasContext *s, int b,
 if (!is_xmm) {
 gen_helper_enter_mmx(cpu_env);
 }
-
-modrm = x86_ldub_code(env, s);
-reg = ((modrm >> 3) & 7);
-if (is_xmm) {
-reg |= REX_R(s);
-}
-mod = (modrm >> 6) & 3;
 if (sse_fn_epp == SSE_SPECIAL) {
 b |= (b1 << 8);
 switch(b) {
-- 
2.30.2




[PATCH v4 1/1] fix cmpxchg and lock cmpxchg instruction

2022-03-22 Thread Wei Li
This patch fixes a bug reported in issue #508.
The problem is that cmpxchg and lock cmpxchg would touch the accumulator when
the comparison is equal.

Signed-off-by: Wei Li 
---
 target/i386/tcg/translate.c | 41 +
 1 file changed, 19 insertions(+), 22 deletions(-)

diff --git a/target/i386/tcg/translate.c b/target/i386/tcg/translate.c
index 2a94d33742..9677f9576b 100644
--- a/target/i386/tcg/translate.c
+++ b/target/i386/tcg/translate.c
@@ -5339,7 +5339,7 @@ static target_ulong disas_insn(DisasContext *s, CPUState *cpu)
 case 0x1b0:
 case 0x1b1: /* cmpxchg Ev, Gv */
 {
-TCGv oldv, newv, cmpv;
+TCGv oldv, newv, cmpv, temp;
 
 ot = mo_b_d(b, dflag);
 modrm = x86_ldub_code(env, s);
@@ -5348,42 +5348,38 @@ static target_ulong disas_insn(DisasContext *s, CPUState *cpu)
 oldv = tcg_temp_new();
 newv = tcg_temp_new();
 cmpv = tcg_temp_new();
+temp = tcg_temp_new();
 gen_op_mov_v_reg(s, ot, newv, reg);
 tcg_gen_mov_tl(cmpv, cpu_regs[R_EAX]);
+tcg_gen_mov_tl(temp, cpu_regs[R_EAX]);
 
-if (s->prefix & PREFIX_LOCK) {
+if ((s->prefix & PREFIX_LOCK) ||
+(mod != 3)) {
+/* Use the tcg_gen_atomic_cmpxchg_tl path whenever mod != 3.
+   While an unlocked cmpxchg need not be atomic, it is not
+   required to be non-atomic either. */
 if (mod == 3) {
 goto illegal_op;
 }
 gen_lea_modrm(env, s, modrm);
 tcg_gen_atomic_cmpxchg_tl(oldv, s->A0, cmpv, newv,
   s->mem_index, ot | MO_LE);
-gen_op_mov_reg_v(s, ot, R_EAX, oldv);
+gen_extu(ot, oldv);
+gen_extu(ot, cmpv);
 } else {
-if (mod == 3) {
-rm = (modrm & 7) | REX_B(s);
-gen_op_mov_v_reg(s, ot, oldv, rm);
-} else {
-gen_lea_modrm(env, s, modrm);
-gen_op_ld_v(s, ot, oldv, s->A0);
-rm = 0; /* avoid warning */
-}
+rm = (modrm & 7) | REX_B(s);
+gen_op_mov_v_reg(s, ot, oldv, rm);
 gen_extu(ot, oldv);
 gen_extu(ot, cmpv);
 /* store value = (old == cmp ? new : old);  */
 tcg_gen_movcond_tl(TCG_COND_EQ, newv, oldv, cmpv, newv, oldv);
-if (mod == 3) {
-gen_op_mov_reg_v(s, ot, R_EAX, oldv);
-gen_op_mov_reg_v(s, ot, rm, newv);
-} else {
-/* Perform an unconditional store cycle like physical cpu;
-   must be before changing accumulator to ensure
-   idempotency if the store faults and the instruction
-   is restarted */
-gen_op_st_v(s, ot, newv, s->A0);
-gen_op_mov_reg_v(s, ot, R_EAX, oldv);
-}
+gen_op_mov_reg_v(s, ot, rm, newv);
 }
+/* Perform the merge into %al or %ax as required by ot. */
+gen_op_mov_reg_v(s, ot, R_EAX, oldv);
+/* Undo the entire modification to %rax if comparison equal. */
+tcg_gen_movcond_tl(TCG_COND_EQ, cpu_regs[R_EAX], oldv, cmpv,
+temp, cpu_regs[R_EAX]);
 tcg_gen_mov_tl(cpu_cc_src, oldv);
 tcg_gen_mov_tl(s->cc_srcT, cmpv);
 tcg_gen_sub_tl(cpu_cc_dst, cmpv, oldv);
@@ -5391,6 +5387,7 @@ static target_ulong disas_insn(DisasContext *s, CPUState *cpu)
 tcg_temp_free(oldv);
 tcg_temp_free(newv);
 tcg_temp_free(cmpv);
+tcg_temp_free(temp);
 }
 break;
 case 0x1c7: /* cmpxchg8b */
-- 
2.30.2




[PATCH v3 1/1] fix cmpxchg and lock cmpxchg instruction

2022-03-22 Thread Wei Li
Give a better code structure to reduce code duplication, according to the
discussion on patch v2.

Signed-off-by: Wei Li 
---
 target/i386/tcg/translate.c | 41 +
 1 file changed, 19 insertions(+), 22 deletions(-)

diff --git a/target/i386/tcg/translate.c b/target/i386/tcg/translate.c
index 2a94d33742..9677f9576b 100644
--- a/target/i386/tcg/translate.c
+++ b/target/i386/tcg/translate.c
@@ -5339,7 +5339,7 @@ static target_ulong disas_insn(DisasContext *s, CPUState *cpu)
 case 0x1b0:
 case 0x1b1: /* cmpxchg Ev, Gv */
 {
-TCGv oldv, newv, cmpv;
+TCGv oldv, newv, cmpv, temp;
 
 ot = mo_b_d(b, dflag);
 modrm = x86_ldub_code(env, s);
@@ -5348,42 +5348,38 @@ static target_ulong disas_insn(DisasContext *s, CPUState *cpu)
 oldv = tcg_temp_new();
 newv = tcg_temp_new();
 cmpv = tcg_temp_new();
+temp = tcg_temp_new();
 gen_op_mov_v_reg(s, ot, newv, reg);
 tcg_gen_mov_tl(cmpv, cpu_regs[R_EAX]);
+tcg_gen_mov_tl(temp, cpu_regs[R_EAX]);
 
-if (s->prefix & PREFIX_LOCK) {
+if ((s->prefix & PREFIX_LOCK) ||
+(mod != 3)) {
+/* Use the tcg_gen_atomic_cmpxchg_tl path whenever mod != 3.
+   While an unlocked cmpxchg need not be atomic, it is not
+   required to be non-atomic either. */
 if (mod == 3) {
 goto illegal_op;
 }
 gen_lea_modrm(env, s, modrm);
 tcg_gen_atomic_cmpxchg_tl(oldv, s->A0, cmpv, newv,
   s->mem_index, ot | MO_LE);
-gen_op_mov_reg_v(s, ot, R_EAX, oldv);
+gen_extu(ot, oldv);
+gen_extu(ot, cmpv);
 } else {
-if (mod == 3) {
-rm = (modrm & 7) | REX_B(s);
-gen_op_mov_v_reg(s, ot, oldv, rm);
-} else {
-gen_lea_modrm(env, s, modrm);
-gen_op_ld_v(s, ot, oldv, s->A0);
-rm = 0; /* avoid warning */
-}
+rm = (modrm & 7) | REX_B(s);
+gen_op_mov_v_reg(s, ot, oldv, rm);
 gen_extu(ot, oldv);
 gen_extu(ot, cmpv);
 /* store value = (old == cmp ? new : old);  */
 tcg_gen_movcond_tl(TCG_COND_EQ, newv, oldv, cmpv, newv, oldv);
-if (mod == 3) {
-gen_op_mov_reg_v(s, ot, R_EAX, oldv);
-gen_op_mov_reg_v(s, ot, rm, newv);
-} else {
-/* Perform an unconditional store cycle like physical cpu;
-   must be before changing accumulator to ensure
-   idempotency if the store faults and the instruction
-   is restarted */
-gen_op_st_v(s, ot, newv, s->A0);
-gen_op_mov_reg_v(s, ot, R_EAX, oldv);
-}
+gen_op_mov_reg_v(s, ot, rm, newv);
 }
+/* Perform the merge into %al or %ax as required by ot. */
+gen_op_mov_reg_v(s, ot, R_EAX, oldv);
+/* Undo the entire modification to %rax if comparison equal. */
+tcg_gen_movcond_tl(TCG_COND_EQ, cpu_regs[R_EAX], oldv, cmpv,
+temp, cpu_regs[R_EAX]);
 tcg_gen_mov_tl(cpu_cc_src, oldv);
 tcg_gen_mov_tl(s->cc_srcT, cmpv);
 tcg_gen_sub_tl(cpu_cc_dst, cmpv, oldv);
@@ -5391,6 +5387,7 @@ static target_ulong disas_insn(DisasContext *s, CPUState *cpu)
 tcg_temp_free(oldv);
 tcg_temp_free(newv);
 tcg_temp_free(cmpv);
+tcg_temp_free(temp);
 }
 break;
 case 0x1c7: /* cmpxchg8b */
-- 
2.30.2




[PATCH v2 1/1] fix cmpxchg and lock cmpxchg instruction

2022-03-21 Thread Wei Li
One question: we could reduce code duplication further if we used something
like the following:

if (foo) {
    tcg_gen_atomic_cmpxchg_tl(oldv, s->A0, cmpv, newv,
                              s->mem_index, ot | MO_LE);
    gen_extu(ot, oldv);
    gen_extu(ot, cmpv);
} else {
    tcg_gen_movcond_tl(TCG_COND_EQ, newv, oldv, cmpv, newv, oldv);
    gen_op_mov_reg_v(s, ot, rm, newv);
}
gen_op_mov_reg_v(s, ot, R_EAX, oldv);
tcg_gen_movcond_tl(TCG_COND_EQ, cpu_regs[R_EAX], oldv, cmpv,
                   temp, cpu_regs[R_EAX]);


The problem is that gen_op_mov_reg_v(s, ot, rm, newv) would then happen before
gen_op_mov_reg_v(s, ot, R_EAX, oldv). According to the SDM, the write to R_EAX
should happen before the write to rm, and I am not sure about the side effects
of reordering them.

All in all, if there is no side effect, we can use the code above to reduce
code duplication further. Otherwise we use the code below to ensure
correctness.

Signed-off-by: Wei Li 
---
 target/i386/tcg/translate.c | 44 +++--
 1 file changed, 23 insertions(+), 21 deletions(-)

diff --git a/target/i386/tcg/translate.c b/target/i386/tcg/translate.c
index 2a94d33742..6633d8ece6 100644
--- a/target/i386/tcg/translate.c
+++ b/target/i386/tcg/translate.c
@@ -5339,7 +5339,7 @@ static target_ulong disas_insn(DisasContext *s, CPUState *cpu)
 case 0x1b0:
 case 0x1b1: /* cmpxchg Ev, Gv */
 {
-TCGv oldv, newv, cmpv;
+TCGv oldv, newv, cmpv, temp;
 
 ot = mo_b_d(b, dflag);
 modrm = x86_ldub_code(env, s);
@@ -5348,41 +5348,42 @@ static target_ulong disas_insn(DisasContext *s, CPUState *cpu)
 oldv = tcg_temp_new();
 newv = tcg_temp_new();
 cmpv = tcg_temp_new();
+temp = tcg_temp_new();
 gen_op_mov_v_reg(s, ot, newv, reg);
 tcg_gen_mov_tl(cmpv, cpu_regs[R_EAX]);
+tcg_gen_mov_tl(temp, cpu_regs[R_EAX]);
 
-if (s->prefix & PREFIX_LOCK) {
+if ((s->prefix & PREFIX_LOCK) ||
+(mod != 3)) {
+/* Use the tcg_gen_atomic_cmpxchg_tl path whenever mod != 3.
+   While an unlocked cmpxchg need not be atomic, it is not
+   required to be non-atomic either. */
 if (mod == 3) {
 goto illegal_op;
 }
 gen_lea_modrm(env, s, modrm);
 tcg_gen_atomic_cmpxchg_tl(oldv, s->A0, cmpv, newv,
   s->mem_index, ot | MO_LE);
+gen_extu(ot, oldv);
+gen_extu(ot, cmpv);
+/* Perform the merge into %al or %ax as required by ot. */
 gen_op_mov_reg_v(s, ot, R_EAX, oldv);
+/* Undo the entire modification to %rax if comparison equal. */
+tcg_gen_movcond_tl(TCG_COND_EQ, cpu_regs[R_EAX], oldv, cmpv,
+temp, cpu_regs[R_EAX]);
 } else {
-if (mod == 3) {
-rm = (modrm & 7) | REX_B(s);
-gen_op_mov_v_reg(s, ot, oldv, rm);
-} else {
-gen_lea_modrm(env, s, modrm);
-gen_op_ld_v(s, ot, oldv, s->A0);
-rm = 0; /* avoid warning */
-}
+rm = (modrm & 7) | REX_B(s);
+gen_op_mov_v_reg(s, ot, oldv, rm);
 gen_extu(ot, oldv);
 gen_extu(ot, cmpv);
 /* store value = (old == cmp ? new : old);  */
 tcg_gen_movcond_tl(TCG_COND_EQ, newv, oldv, cmpv, newv, oldv);
-if (mod == 3) {
-gen_op_mov_reg_v(s, ot, R_EAX, oldv);
-gen_op_mov_reg_v(s, ot, rm, newv);
-} else {
-/* Perform an unconditional store cycle like physical cpu;
-   must be before changing accumulator to ensure
-   idempotency if the store faults and the instruction
-   is restarted */
-gen_op_st_v(s, ot, newv, s->A0);
-gen_op_mov_reg_v(s, ot, R_EAX, oldv);
-}
+/* Perform the merge into %al or %ax as required by ot. */
+gen_op_mov_reg_v(s, ot, R_EAX, oldv);
+/* Undo the entire modification to %rax if comparison equal. */
+tcg_gen_movcond_tl(TCG_COND_EQ, cpu_regs[R_EAX], oldv, cmpv,
+temp, cpu_regs[R_EAX]);
+gen_op_mov_reg_v(s, ot, rm, newv);
 }
 tcg_gen_mov_tl(cpu_cc_src, oldv);
 tcg_gen_mov_tl(s->cc_srcT, cmpv);
@@ -5391,6 +5392,7 @@ static target_ulong disas_insn(DisasContext *s, CPUState *cpu)
 tcg_temp_free(oldv);
 tcg_temp_free(newv);
 tcg_temp_free(cmpv);
+tcg_temp_free(temp);
 }
  

[PATCH v2 0/1] cmpxchg and lock cmpxchg should not touch accumulator

2022-03-21 Thread Wei Li
Bug: https://gitlab.com/qemu-project/qemu/-/issues/508

This series fixes a bug reported in issue 508.
The problem is that cmpxchg and lock cmpxchg would touch the accumulator when
they should not.
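
For illustration only, a guest-side check of the symptom could look like the
sketch below. This is my own reconstruction, not code from the issue report;
it relies on the fact that, per the SDM, a successful 32-bit cmpxchg does not
write the accumulator at all, so the upper half of RAX must survive, whereas an
emulation that writes EAX back anyway zero-extends it:

    #include <stdint.h>
    #include <stdio.h>

    int main(void)
    {
        uint32_t mem = 0x11223344;            /* equal to EAX, so the compare succeeds */
        uint64_t rax = 0xdeadbeef11223344ULL; /* upper 32 bits should be preserved */

        /* cmpxchgl %ebx, mem: EAX == mem, so ZF=1 and mem <- EBX; the
         * accumulator must be left untouched. */
        __asm__ volatile("cmpxchgl %%ebx, %[m]"
                         : "+a"(rax), [m] "+m"(mem)
                         : "b"(0x55667788u)
                         : "cc");

        printf("rax = 0x%016llx (expected 0xdeadbeef11223344)\n",
               (unsigned long long)rax);
        printf("mem = 0x%08x (expected 0x55667788)\n", mem);
        return 0;
    }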

Changes from v1
* cmpxchg uses the lock cmpxchg path whenever mod != 3 to reduce code
  duplication.
* lock cmpxchg uses movcond to replace branch.
* Combine the two patches into one patch because cmpxchg uses the lock
  cmpxchg path.

v1 link:
https://lists.gnu.org/archive/html/qemu-devel/2022-03/msg05023.html

Wei Li (1):
  fix cmpxchg and lock cmpxchg instruction

 target/i386/tcg/translate.c | 44 +++--
 1 file changed, 23 insertions(+), 21 deletions(-)

-- 
2.30.2




Re: [Qemu-devel] Following up questions related to QEMU and I/O Thread

2019-05-03 Thread Wei Li
Hi Paolo,

That would be great; I would like to hear more details about the design and
implementation once you have those ready.

Thanks a lot,
Wei

On 5/3/19, 11:05 AM, "Paolo Bonzini"  wrote:

On 03/05/19 10:21, Wei Li wrote:
> Got it, thanks Stefan for your clarification!

Hi Wei,

Stefan and I should be posting a patch to add Linux SCSI driver
batching, and an implementation for virtio-scsi.

Paolo

> Wei
> 
> On 5/1/19, 9:36 AM, "Stefan Hajnoczi"  wrote:
> 
> On Mon, Apr 29, 2019 at 10:56:31AM -0700, Wei Li wrote:
> >Does this mean the performance could be improved via adding Batch 
I/O submission support in Guest driver side which will be able to reduce the 
number of virtqueue kicks?
> 
> Yes, I think so.  It's not obvious to me how a Linux SCSI driver is
> supposed to implement batching though.  The .queuecommand API doesn't
> seem to include information relevant to batching.
> 
> Stefan
> 
> 
> 
> 







Re: [Qemu-devel] Following up questions related to QEMU and I/O Thread

2019-05-03 Thread Wei Li
Got it, thanks Stefan for your clarification!

Wei

On 5/1/19, 9:36 AM, "Stefan Hajnoczi"  wrote:

On Mon, Apr 29, 2019 at 10:56:31AM -0700, Wei Li wrote:
>Does this mean the performance could be improved via adding Batch I/O 
submission support in Guest driver side which will be able to reduce the number 
of virtqueue kicks?

Yes, I think so.  It's not obvious to me how a Linux SCSI driver is
supposed to implement batching though.  The .queuecommand API doesn't
seem to include information relevant to batching.

Stefan







Re: [Qemu-devel] Following up questions related to QEMU and I/O Thread

2019-04-29 Thread Wei Li
Thanks Stefan!

Does this mean the performance could be improved by adding batch I/O
submission support on the guest driver side, which would be able to reduce the
number of virtqueue kicks?

Thanks,
Wei

On 4/29/19, 6:40 AM, "Stefan Hajnoczi"  wrote:

On Fri, Apr 26, 2019 at 10:14:16AM +0200, Paolo Bonzini wrote:
> On 23/04/19 14:04, Stefan Hajnoczi wrote:
> >> In addition, does Virtio-scsi support Batch I/O Submission feature
> >> which may be able to increase the IOPS via reducing the number of
> >> system calls?
> >
> > I don't see obvious batching support in drivers/scsi/virtio_scsi.c.
> > The Linux block layer supports batching but I'm not sure if the SCSI
> > layer does.
> 
> I think he's referring to QEMU, in which case yes, virtio-scsi does
> batch I/O submission.  See virtio_scsi_handle_cmd_req_prepare and
> virtio_scsi_handle_cmd_req_submit in hw/scsi/virtio-scsi.c, they do
> blk_io_plug and blk_io_unplug in order to batch I/O requests from QEMU
> to the host kernel.

This isn't fully effective since the guest driver kicks once per
request.  Therefore QEMU-level batching you mentioned only works if QEMU
is slower at handling virtqueue kicks than the guest is at submitting
requests.

I wonder if this is something that can be improved.

Stefan






Re: [Qemu-devel] Following up questions related to QEMU and I/O Thread

2019-04-29 Thread Wei Li
Thanks Paolo for your clarification!

Just wanted to double-confirm: does this mean batch I/O submission won't apply
to aio=threads (which is the default mode)?

Thanks,
Wei


On 4/26/19, 9:25 PM, "Paolo Bonzini"  wrote:


> Thanks Stefan and Paolo for your response and advice!
> 
> Hi Paolo,
> 
> As to the virtio-scsi batch I/O submission feature in QEMU which you
> mentioned, is this feature turned on by default in QEMU 2.9 or there is a
> tunable parameters to turn on/off the feature?

Yes, it is available by default since 2.2.0.  It cannot be turned off, 
however
it is only possible to batch I/O with aio=native (and, since 2.12.0, with 
the NVMe
backend).

Paolo






Re: [Qemu-devel] Following up questions related to QEMU and I/O Thread

2019-04-26 Thread Wei Li
Thanks Stefan and Paolo for your response and advice!

Hi Paolo,

As to the virtio-scsi batch I/O submission feature in QEMU which you mentioned,
is this feature turned on by default in QEMU 2.9, or is there a tunable
parameter to turn the feature on/off?

Thanks,
Wei

On 4/26/19, 1:14 AM, "Paolo Bonzini"  wrote:

On 23/04/19 14:04, Stefan Hajnoczi wrote:
>>In addition, does Virtio-scsi support Batch I/O Submission feature
>>which may be able to increase the IOPS via reducing the number of
>>system calls?
>
>I don't see obvious batching support in drivers/scsi/virtio_scsi.c.
>The Linux block layer supports batching but I'm not sure if the SCSI
>layer does.

I think he's referring to QEMU, in which case yes, virtio-scsi does
batch I/O submission.  See virtio_scsi_handle_cmd_req_prepare and
virtio_scsi_handle_cmd_req_submit in hw/scsi/virtio-scsi.c, they do
blk_io_plug and blk_io_unplug in order to batch I/O requests from QEMU
to the host kernel.
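
A rough sketch of the plug/unplug batching pattern described above (a sketch
within the QEMU tree, includes omitted; the function and parameter names are
made up, while blk_io_plug(), blk_io_unplug() and blk_aio_pwritev() are the
real block-layer calls):

    static void sketch_submit_batch(BlockBackend *blk, int64_t *offsets,
                                    QEMUIOVector *qiovs, int n,
                                    BlockCompletionFunc *cb)
    {
        blk_io_plug(blk);                 /* begin coalescing requests */
        for (int i = 0; i < n; i++) {
            blk_aio_pwritev(blk, offsets[i], &qiovs[i], 0, cb, NULL);
        }
        blk_io_unplug(blk);               /* submit the whole batch to the host */
    }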





Re: [Qemu-devel] Following up questions related to QEMU and I/O Thread

2019-04-22 Thread Wei Li
Hi Stefan,

I did the investigation per your advice; please see inline for the details and
questions.
 
   1. Compare "iostat -dx 1" inside the guest and host.  Are the I/O
   patterns comparable?  blktrace(8) can give you even more detail on
   the exact I/O patterns.  If the guest and host have different I/O
   patterns (blocksize, IOPS, queue depth) then request merging or
   I/O scheduler effects could be responsible for the difference.

[wei]: That's a good point. I compared "iostat -dx 1" between guest and host,
but I have not found an obvious difference between guest and host that could be
responsible for the gap.

2. kvm_stat or perf record -a -e kvm:\* counters for vmexits and
   interrupt injections.  If these counters vary greatly between queue
   sizes, then that is usually a clue.  It's possible to get higher
   performance by spending more CPU cycles although your system doesn't
   have many CPUs available, so I'm not sure if this is the case.

[wei]: vmexits look like a reason. I am using the fio tool to read/write block
storage via the sample command below; the interesting thing is that the
kvm:kvm_exit count decreased from 846K to 395K after I increased num_queues
from 2 to 4, while the vCPU count is 2.
   1) Does this mean using more queues than the vCPU count may increase
IOPS by spending more CPU cycles?
   2) Could you please help me better understand how more queues are able
to spend more CPU cycles? Thanks!
   FIO command: fio --filename=/dev/sdb --direct=1 --rw=randrw --bs=4k 
--ioengine=libaio --iodepth=64 --numjobs=4 --time_based --group_reporting 
--name=iops --runtime=60 --eta-newline=1

3. Power management and polling (kvm.ko halt_poll_ns, tuned profiles,
   and QEMU iothread poll-max-ns).  It's expensive to wake a CPU when it
   goes into a low power mode due to idle.  There are several features
   that can keep the CPU awake or even poll so that request latency is
   reduced.  The reason why the number of queues may matter is that
   kicking multiple queues may keep the CPU awake more than batching
   multiple requests onto a small number of queues.
[wei]: CPU wakeups could be another reason. I noticed that the
kvm:kvm_vcpu_wakeup count decreased from 151K to 47K after I increased
num_queues from 2 to 4, while the vCPU count is 2.
   1) Does this mean more queues may keep the CPU more busy and awake,
which reduces the vCPU wakeup count?
   2) If using more num_queues than the vCPU count gives higher IOPS in
this case, is it safe to use 4 queues while there are only 2 vCPUs, or is there
any concern or impact of using more queues than vCPUs that I should keep in
mind?

In addition, does virtio-scsi support a batch I/O submission feature, which may
be able to increase the IOPS by reducing the number of system calls?

Thanks,
Wei

On 4/16/19, 6:42 PM, "Wei Li"  wrote:

Thanks Stefan and Dongli for your feedback and advices!

I will do the further investigation per your advices and get back to you 
later on.

Thanks, 
-Wei

On 4/16/19, 2:20 AM, "Stefan Hajnoczi"  wrote:

On Tue, Apr 16, 2019 at 07:23:38AM +0800, Dongli Zhang wrote:
> 
        > 
> On 4/16/19 1:34 AM, Wei Li wrote:
> > Hi @Paolo Bonzini & @Stefan Hajnoczi,
> > 
> > Would you please help confirm whether @Paolo Bonzini's multiqueue 
feature change will benefit virtio-scsi or not? Thanks!
> > 
> > @Stefan Hajnoczi,
> > I also spent some time on exploring the virtio-scsi multi-queue 
features via num_queues parameter as below, here are what we found:
> > 
> > 1. Increase number of Queues from one to the same number as CPU 
will get better IOPS increase.
> > 2. Increase number of Queues to the number (e.g. 8) larger than the 
number of vCPU (e.g. 2) can get even better IOPS increase.
> 
> As mentioned in below link, when the number of hw queues is larger 
than
> nr_cpu_ids, the blk-mq layer would limit and only use at most 
nr_cpu_ids queues
> (e.g., /sys/block/sda/mq/).
> 
> That is, when the num_queus=4 while vcpus is 2, there should be only 
2 queues
> available /sys/block/sda/mq/
> 
> 
https://lore.kernel.org/lkml/1553682995-5682-1-git-send-email-dongli.zh...@oracle.com/
> 
> I am just curious how increasing the num_queues from 2 to 4 would 
double the
> iops, while there are only 2 vcpus available...

I don't know the answer.  It's especially hard to guess without seeing
the benchmark (fio?) configuration and QEMU command-line.

Common things to

Re: [Qemu-devel] Following up questions related to QEMU and I/O Thread

2019-04-17 Thread Wei Li
Sounds good, let's keep in touch.

Thanks,
Wei

On 4/17/19, 5:17 AM, "Paolo Bonzini"  wrote:

On 17/04/19 03:38, Wei Li wrote:
> Thanks Paolo for your response and clarification.
> 
> Btw, is there any rough schedule about when are you planning to start
> working on the multi queue feature?  Once you start working on the
> feature, I would like to hear more details about the design and
> better understand how this feature will benefit the performance of
> virtio-scsi.

I wish I knew... :)  However, hopefully I will share the details soon
with Sergio and start flushing that queue in 4.1.

Paolo






Re: [Qemu-devel] Following up questions related to QEMU and I/O Thread

2019-04-16 Thread Wei Li
Thanks Stefan and Dongli for your feedback and advices!

I will do the further investigation per your advices and get back to you later 
on.

Thanks, 
-Wei

On 4/16/19, 2:20 AM, "Stefan Hajnoczi"  wrote:

On Tue, Apr 16, 2019 at 07:23:38AM +0800, Dongli Zhang wrote:
> 
> 
> On 4/16/19 1:34 AM, Wei Li wrote:
> > Hi @Paolo Bonzini & @Stefan Hajnoczi,
> > 
> > Would you please help confirm whether @Paolo Bonzini's multiqueue 
feature change will benefit virtio-scsi or not? Thanks!
> > 
> > @Stefan Hajnoczi,
> > I also spent some time on exploring the virtio-scsi multi-queue 
features via num_queues parameter as below, here are what we found:
> > 
> > 1. Increase number of Queues from one to the same number as CPU will 
get better IOPS increase.
> > 2. Increase number of Queues to the number (e.g. 8) larger than the 
number of vCPU (e.g. 2) can get even better IOPS increase.
> 
> As mentioned in below link, when the number of hw queues is larger than
> nr_cpu_ids, the blk-mq layer would limit and only use at most nr_cpu_ids 
queues
> (e.g., /sys/block/sda/mq/).
> 
> That is, when the num_queus=4 while vcpus is 2, there should be only 2 
queues
> available /sys/block/sda/mq/
> 
> 
https://lore.kernel.org/lkml/1553682995-5682-1-git-send-email-dongli.zh...@oracle.com/
> 
> I am just curious how increasing the num_queues from 2 to 4 would double 
the
> iops, while there are only 2 vcpus available...

I don't know the answer.  It's especially hard to guess without seeing
the benchmark (fio?) configuration and QEMU command-line.

Common things to look at are:

1. Compare "iostat -dx 1" inside the guest and host.  Are the I/O
   patterns comparable?  blktrace(8) can give you even more detail on
   the exact I/O patterns.  If the guest and host have different I/O
   patterns (blocksize, IOPS, queue depth) then request merging or
   I/O scheduler effects could be responsible for the difference.

2. kvm_stat or perf record -a -e kvm:\* counters for vmexits and
   interrupt injections.  If these counters vary greatly between queue
   sizes, then that is usually a clue.  It's possible to get higher
   performance by spending more CPU cycles although your system doesn't
   have many CPUs available, so I'm not sure if this is the case.

3. Power management and polling (kvm.ko halt_poll_ns, tuned profiles,
   and QEMU iothread poll-max-ns).  It's expensive to wake a CPU when it
   goes into a low power mode due to idle.  There are several features
   that can keep the CPU awake or even poll so that request latency is
   reduced.  The reason why the number of queues may matter is that
   kicking multiple queues may keep the CPU awake more than batching
   multiple requests onto a small number of queues.

Stefan






Re: [Qemu-devel] Following up questions related to QEMU and I/O Thread

2019-04-16 Thread Wei Li
Thanks Paolo for your response and clarification. 

Btw, is there a rough schedule for when you are planning to start working on
the multi-queue feature? Once you start working on it, I would like to hear
more details about the design and to better understand how this feature will
benefit the performance of virtio-scsi.

Thanks again,
Wei

On 4/16/19, 7:01 AM, "Paolo Bonzini"  wrote:

On 05/04/19 23:09, Wei Li wrote:
> Thanks Stefan for your quick response!
> 
> Hi Paolo, Could you please send us a link related to the multiqueue
> feature which you are working on so that we could start getting some
> details about the feature.

I have never gotten to the point of multiqueue, a prerequisite for that
was to make the block layer thread safe.

The latest state of the work is at github.com/bonzini/qemu, branch
dataplane7.

Paolo






Re: [Qemu-devel] Following up questions related to QEMU and I/O Thread

2019-04-15 Thread Wei Li
Hi @Paolo Bonzini & @Stefan Hajnoczi,

Would you please help confirm whether @Paolo Bonzini's multiqueue feature 
change will benefit virtio-scsi or not? Thanks!

@Stefan Hajnoczi,
I also spent some time exploring the virtio-scsi multi-queue feature via the
num_queues parameter, as below; here is what we found:

1. Increasing the number of queues from one to the same number as the vCPUs
gives a good IOPS increase.
2. Increasing the number of queues to a number (e.g. 8) larger than the number
of vCPUs (e.g. 2) gives an even better IOPS increase.

In addition, it seems QEMU can get better IOPS when the attachment uses more
queues than the number of vCPUs; how is that possible? Could you please help us
better understand the behavior? Thanks a lot!


Host CPU Configuration:
CPU(s):2
Thread(s) per core:2
Core(s) per socket:1
Socket(s): 1

Commands for multi queue Setup:
(QEMU)  device_add driver=virtio-scsi-pci num_queues=1 id=test1
(QEMU)  device_add driver=virtio-scsi-pci num_queues=2 id=test2
(QEMU)  device_add driver=virtio-scsi-pci num_queues=4 id=test4
(QEMU)  device_add driver=virtio-scsi-pci num_queues=8 id=test8


Result:
         |  8 Queues  |  4 Queues  |  2 Queues  |  Single Queue
    IOPS |    +29%    |    +27%    |    +11%    |    Baseline

Thanks,
Wei

On 4/5/19, 2:09 PM, "Wei Li"  wrote:

Thanks Stefan for your quick response!

Hi Paolo,
Could you please send us a link related to the multiqueue feature which you 
are working on so that we could start getting some details about the feature.

Thanks again,
Wei 

On 4/1/19, 3:54 AM, "Stefan Hajnoczi"  wrote:

On Fri, Mar 29, 2019 at 08:16:36AM -0700, Wei Li wrote:
> Thanks Stefan for your reply and guidance!
> 
> We spent some time on exploring the multiple I/O Threads approach per 
your feedback. Based on the perf measurement data, we did see some IOPS 
improvement for multiple volumes, which is great. :)
> 
> In addition, IOPS for single Volume will still be a bottleneck, it 
seems like multiqueue block layer feature which Paolo is working on may be able 
to help improving the IOPS for single volume.
> 
> @Paolo, @Stefan, 
> Would you mind sharing the multiqueue feature code branch with us? So 
that we could get some rough idea about this feature and maybe start doing some 
exploration? 

Paolo last worked on this code, so he may be able to send you a link.

Stefan







Re: [Qemu-devel] Following up questions related to QEMU and I/O Thread

2019-04-05 Thread Wei Li
Thanks Stefan for your quick response!

Hi Paolo,
Could you please send us a link related to the multiqueue feature which you are 
working on so that we could start getting some details about the feature.

Thanks again,
Wei 

On 4/1/19, 3:54 AM, "Stefan Hajnoczi"  wrote:

On Fri, Mar 29, 2019 at 08:16:36AM -0700, Wei Li wrote:
> Thanks Stefan for your reply and guidance!
> 
> We spent some time on exploring the multiple I/O Threads approach per 
your feedback. Based on the perf measurement data, we did see some IOPS 
improvement for multiple volumes, which is great. :)
> 
> In addition, IOPS for single Volume will still be a bottleneck, it seems 
like multiqueue block layer feature which Paolo is working on may be able to 
help improving the IOPS for single volume.
> 
> @Paolo, @Stefan, 
> Would you mind sharing the multiqueue feature code branch with us? So 
that we could get some rough idea about this feature and maybe start doing some 
exploration? 

Paolo last worked on this code, so he may be able to send you a link.

Stefan






Re: [Qemu-devel] Following up questions related to QEMU and I/O Thread

2019-03-29 Thread Wei Li
Thanks Stefan for your reply and guidance!

We spent some time on exploring the multiple I/O Threads approach per your 
feedback. Based on the perf measurement data, we did see some IOPS improvement 
for multiple volumes, which is great. :)

In addition, IOPS for a single volume will still be a bottleneck; it seems like
the multiqueue block layer feature which Paolo is working on may be able to
help improve the IOPS for a single volume.

@Paolo, @Stefan, 
Would you mind sharing the multiqueue feature code branch with us? So that we 
could get some rough idea about this feature and maybe start doing some 
exploration? 

Thanks a lot!
Wei

On 3/5/19, 9:29 AM, "Stefan Hajnoczi"  wrote:

On Mon, Mar 04, 2019 at 09:33:26AM -0800, Wei Li wrote:
> While @Stefan mentioned about additional iothread object support of 
virtio-blk, Is the feature also supported by virtio-scsi? I am trying to 
exploring the perf multiple IO threads / per VM via followings:
> QMP setup example to create 2 io threads in QEMU, one io thread per 
device:
> 
> (QEMU) object-add qom-type=iothread id=iothread0
> 
> (QEMU) object-add qom-type=iothread id=iothread1
> 
>  
> 
> (QEMU) device_add driver=virtio-scsi-pci id=test0 iothread=iothread0
> 
> (QEMU) device_add driver=virtio-scsi-pci id=test1 iothread=iothread1
> 
>  
> 
> (QEMU) device_add driver=scsi-block drive=none0 id=v0 bus=test0.0
> 
> (QEMU) device_add driver=scsi-block drive=none1 id=v1 bus=test1.0

Yes, each virtio-scsi-pci device can be assigned to an iothread.

> You mentioned about the multi-queue devices feature, it seems like the 
multi-queue feature will help improve the IOPS of  single Device. Could you 
please provide more details?
> What’s the current plan of support multi-queue device? Which release will 
include the support or it has already been included in any existing release 
newer than 2.9?
> Is there any feature branch which I would get more details about the code 
and in progress status?

I have CCed Paolo, who has worked on multiqueue block layer support in
QEMU.  This feature is not yet complete.

The virtio-scsi device also supports multiqueue, but the QEMU block
layer will still be a single queue.

Stefan






Re: [Qemu-devel] Following up questions related to QEMU and I/O Thread

2019-03-04 Thread Wei Li
Hi Stefan and all,

 

I have spent some time getting familiar with QEMU and relevant concepts. My
project is using QEMU 2.9 with a virtio-scsi backend, and I am exploring a
proper way to improve its IOPS.

 

Thanks @Stefan for the response and advices!

 

Could you please help review and clarify the following questions?
While @Stefan mentioned additional iothread object support for virtio-blk, is
the feature also supported by virtio-scsi? I am trying to explore multiple I/O
threads per VM via the following.
QMP setup example to create 2 iothreads in QEMU, one iothread per device:

(QEMU) object-add qom-type=iothread id=iothread0

(QEMU) object-add qom-type=iothread id=iothread1

 

(QEMU) device_add driver=virtio-scsi-pci id=test0 iothread=iothread0

(QEMU) device_add driver=virtio-scsi-pci id=test1 iothread=iothread1

 

(QEMU) device_add driver=scsi-block drive=none0 id=v0 bus=test0.0

(QEMU) device_add driver=scsi-block drive=none1 id=v1 bus=test1.0
You mentioned the multi-queue devices feature; it seems the multi-queue feature
will help improve the IOPS of a single device. Could you please provide more
details?
What's the current plan for supporting multi-queue devices? Which release will
include the support, or has it already been included in an existing release
newer than 2.9?
Is there any feature branch where I could get more details about the code and
its in-progress status?
In addition, someone posted results related to multi-queue at
https://marc.info/?l=linux-virtualization=135583400026151=2, but they only
measure bandwidth; do we have any perf results about the IOPS improvement of
the multi-queue approach?
 

Thanks again,

Wei

 

 

On 2/18/19, 2:24 AM, "Stefan Hajnoczi"  wrote:

 

    On Thu, Feb 14, 2019 at 08:21:30AM -0800, Wei Li wrote:

    > I learnt that the QEMU iothread architecture has one QEMU thread per vCPU
    and a dedicated event loop thread, the iothread, and I want to better
    understand whether there is any specific reason to have a single iothread
    instead of multiple iothreads?

    > Given that the single iothread becomes a performance bottleneck in my
    project, is there any proper way to support multiple iothreads? E.g. have
    one iothread per volume attachment instead of a single iothread per host?
    But I am not quite sure whether it is feasible or not. Please let me know
    if you have any advice.

    Hi,

    Please send general questions to qemu-devel@nongnu.org and CC me in the
    future.  That way others can participate in the discussion and it will
    be archived so someone searching for the same question will find the
    answer in the future.

    QEMU supports additional IOThread objects:

      -object iothread,id=iothread0
      -device virtio-blk-pci,iothread=iothread0,drive=drive0

    This virtio-blk device will perform device emulation and I/O in
    iothread0 instead of the main loop thread.

    Currently only 1:1 device<->IOThread association is possible.  In the
    future 1:N should be possible and will allow multi-queue devices to
    achieve better performance.

    Stefan