Re: qemu questions about x86
Dear 项晨东,

On Sat, Apr 23, 2022 at 3:57 PM 项晨东 wrote:
> Dear qemu developers:
> hello~ I'm Xiang Chen Dong, a student from Tsinghua University. Recently I
> have been trying to implement a new x86 feature named user interrupts,
> which can be viewed here:
> <https://www.intel.com/content/dam/develop/external/us/en/documents/architecture-instruction-set-extensions-programming-reference.pdf>
>
> I have worked on it for some time and reached the point where the new MSRs
> are added and MSR access works well; I have also added the new CPUID
> information to qemu64, and I can catch the new instructions by modifying
> `translate.c`. My code can be found here <https://github.com/Xiang-cd/qemu>,
> and the corresponding Linux kernel version can be found here
> <https://github.com/intel/uintr-linux-kernel>.
>
> But now I have some problems implementing the instructions named SENDUIPI
> and UIRET. The main function of SENDUIPI is sending a user interrupt. In
> detail, the machine accesses memory (at an address saved in a new MSR),
> reads another address from that memory, and then writes some content to
> it. Reading the QEMU source code I found a lot of functions like
> tcg_gen_qemu_ld, but when I jump to the definition from my IDE (VS Code)
> I cannot find the function body (maybe due to macros). So I don't
> understand how the function works, or how to write a new function that
> reads guest memory and writes it back in QEMU.

TCG frontend:
    gen_op_ld_v --> tcg_gen_qemu_ld_tl --> tcg_gen_qemu_ld_i64 (tcg/tcg-op.c) --> gen_ldst_i64
TCG backend:
    case INDEX_op_qemu_ld_i64: --> tcg_out_qemu_ld (tcg/i386/tcg-target.c.inc)

You only need to focus on the frontend, and you can learn from how other
instructions are translated.
> Another problem is that I don't quite understand how interrupts are
> implemented. I can find functions like raise_interrupt and
> raise_exception, but I don't understand how they interact with the APIC
> (how control flow switches to other functions -- I found
> cpu_loop_exit_restore, but cannot find the function body), or how the
> interrupt is handled.

Hardware interrupt production:
    pc_i8259_create --> i8259_init --> x86_allocate_cpu_irq --> pic_irq_request
    pic_irq_request --> cpu_interrupt(cs, CPU_INTERRUPT_HARD)
        --> softmmu/cpus.c:cpu_interrupt --> tcg_handle_interrupt
        --> cpu_reset_interrupt --> hw/core/cpu-common.c:cpu_reset_interrupt

Hardware interrupt handling:
    cpu_exec --> cpu_handle_interrupt --> cc->tcg_ops->cpu_exec_interrupt --> x86_cpu_exec_interrupt
        --> cpu_get_pic_interrupt --> pic_read_irq
        --> do_interrupt_x86_hardirq --> do_interrupt_all --> do_interrupt_protected
        --> uses siglongjmp / sigsetjmp

Exception handling:
    cpu_handle_exception --> cc->tcg_ops->fake_user_interrupt --> x86_cpu_do_interrupt --> do_interrupt_all

> The problem is difficult in some ways. I discussed it with my classmates
> and friends, but there was no answer, so I'm hoping to get important
> information from you. Is my way of reading the code right? Are there any
> tools for development (for finding function bodies)? How can I accomplish
> this quickly?
> Thank you very much!
> best wishes!
> Xiang Chen Dong

Everything here may have some mistakes. I hope it is useful for you.

--
best wishes!
Wei Li
Re: [PATCH 0/2] target/i386: Some mmx/sse instructions don't require CR0.TS=0
Ping.

(The full title is: target/i386: Some mmx/sse instructions don't require CR0.TS=0)

On Fri, Mar 25, 2022 at 10:55 PM Wei Li wrote:
> Resolves: https://gitlab.com/qemu-project/qemu/-/issues/427
>
> All instructions decoded by 'gen_sse' are assumed to require CR0.TS=0. But
> according to the SDM, CRC32 doesn't require it. In fact, EMMS, FEMMS and
> some mmx/sse instructions (0F38F[0-F], 0F3AF[0-F]) don't require it.
>
> To solve the problem, first move EMMS and FEMMS out of gen_sse. Then
> instructions in 'gen_sse' require it only when modrm & 0xF0 is zero.
>
> Wei Li (2):
>   Move EMMS and FEMMS instructions out of gen_sse
>   Some mmx/sse instructions in 'gen_sse' don't require CR0.TS=0
>
>  target/i386/tcg/translate.c | 45 ++++++++++++---------------
>  1 file changed, 21 insertions(+), 24 deletions(-)
>
> --
> 2.30.2

Thanks.

--
Wei Li
[PATCH 2/2] Some mmx/sse instructions in 'gen_sse' don't require CR0.TS=0
Some instructions in 'gen_sse' don't require CR0.TS=0; their opcodes are
0F38F[0-F] and 0F3AF[0-F].

Signed-off-by: Wei Li
---
 target/i386/tcg/translate.c | 17 +++++++++------
 1 file changed, 9 insertions(+), 8 deletions(-)

diff --git a/target/i386/tcg/translate.c b/target/i386/tcg/translate.c
index fe9fcdae96..14cf11771c 100644
--- a/target/i386/tcg/translate.c
+++ b/target/i386/tcg/translate.c
@@ -3139,8 +3139,16 @@ static void gen_sse(CPUX86State *env, DisasContext *s, int b,
             is_xmm = 1;
         }
     }
+
+    modrm = x86_ldub_code(env, s);
+    reg = ((modrm >> 3) & 7);
+    if (is_xmm) {
+        reg |= REX_R(s);
+    }
+    mod = (modrm >> 6) & 3;
     /* simple MMX/SSE operation */
-    if (s->flags & HF_TS_MASK) {
+    if ((s->flags & HF_TS_MASK)
+        && (!(modrm & 0xF0))) {
         gen_exception(s, EXCP07_PREX, pc_start - s->cs_base);
         return;
     }
@@ -3159,13 +3167,6 @@ static void gen_sse(CPUX86State *env, DisasContext *s, int b,
     if (!is_xmm) {
         gen_helper_enter_mmx(cpu_env);
     }
-
-    modrm = x86_ldub_code(env, s);
-    reg = ((modrm >> 3) & 7);
-    if (is_xmm) {
-        reg |= REX_R(s);
-    }
-    mod = (modrm >> 6) & 3;
     if (sse_fn_epp == SSE_SPECIAL) {
         b |= (b1 << 8);
         switch(b) {
--
2.30.2
[PATCH v4 1/1] fix cmpxchg and lock cmpxchg instruction
This patch fixes the bug reported in issue #508. The problem is that
cmpxchg and lock cmpxchg would touch the accumulator when the comparison
is equal.

Signed-off-by: Wei Li
---
 target/i386/tcg/translate.c | 41 +++++++++++++++----------------
 1 file changed, 19 insertions(+), 22 deletions(-)

diff --git a/target/i386/tcg/translate.c b/target/i386/tcg/translate.c
index 2a94d33742..9677f9576b 100644
--- a/target/i386/tcg/translate.c
+++ b/target/i386/tcg/translate.c
@@ -5339,7 +5339,7 @@ static target_ulong disas_insn(DisasContext *s, CPUState *cpu)
     case 0x1b0:
     case 0x1b1: /* cmpxchg Ev, Gv */
         {
-            TCGv oldv, newv, cmpv;
+            TCGv oldv, newv, cmpv, temp;

             ot = mo_b_d(b, dflag);
             modrm = x86_ldub_code(env, s);
@@ -5348,42 +5348,38 @@ static target_ulong disas_insn(DisasContext *s, CPUState *cpu)
             oldv = tcg_temp_new();
             newv = tcg_temp_new();
             cmpv = tcg_temp_new();
+            temp = tcg_temp_new();
             gen_op_mov_v_reg(s, ot, newv, reg);
             tcg_gen_mov_tl(cmpv, cpu_regs[R_EAX]);
+            tcg_gen_mov_tl(temp, cpu_regs[R_EAX]);

-            if (s->prefix & PREFIX_LOCK) {
+            if ((s->prefix & PREFIX_LOCK) ||
+                (mod != 3)) {
+                /* Use the tcg_gen_atomic_cmpxchg_tl path whenever mod != 3.
+                   While an unlocked cmpxchg need not be atomic, it is not
+                   required to be non-atomic either. */
                 if (mod == 3) {
                     goto illegal_op;
                 }
                 gen_lea_modrm(env, s, modrm);
                 tcg_gen_atomic_cmpxchg_tl(oldv, s->A0, cmpv, newv,
                                           s->mem_index, ot | MO_LE);
-                gen_op_mov_reg_v(s, ot, R_EAX, oldv);
+                gen_extu(ot, oldv);
+                gen_extu(ot, cmpv);
             } else {
-                if (mod == 3) {
-                    rm = (modrm & 7) | REX_B(s);
-                    gen_op_mov_v_reg(s, ot, oldv, rm);
-                } else {
-                    gen_lea_modrm(env, s, modrm);
-                    gen_op_ld_v(s, ot, oldv, s->A0);
-                    rm = 0; /* avoid warning */
-                }
+                rm = (modrm & 7) | REX_B(s);
+                gen_op_mov_v_reg(s, ot, oldv, rm);
                 gen_extu(ot, oldv);
                 gen_extu(ot, cmpv);
                 /* store value = (old == cmp ? new : old); */
                 tcg_gen_movcond_tl(TCG_COND_EQ, newv, oldv, cmpv, newv, oldv);
-                if (mod == 3) {
-                    gen_op_mov_reg_v(s, ot, R_EAX, oldv);
-                    gen_op_mov_reg_v(s, ot, rm, newv);
-                } else {
-                    /* Perform an unconditional store cycle like physical cpu;
-                       must be before changing accumulator to ensure
-                       idempotency if the store faults and the instruction
-                       is restarted */
-                    gen_op_st_v(s, ot, newv, s->A0);
-                    gen_op_mov_reg_v(s, ot, R_EAX, oldv);
-                }
+                gen_op_mov_reg_v(s, ot, rm, newv);
             }
+            /* Perform the merge into %al or %ax as required by ot. */
+            gen_op_mov_reg_v(s, ot, R_EAX, oldv);
+            /* Undo the entire modification to %rax if comparison equal. */
+            tcg_gen_movcond_tl(TCG_COND_EQ, cpu_regs[R_EAX], oldv, cmpv,
+                               temp, cpu_regs[R_EAX]);
             tcg_gen_mov_tl(cpu_cc_src, oldv);
             tcg_gen_mov_tl(s->cc_srcT, cmpv);
             tcg_gen_sub_tl(cpu_cc_dst, cmpv, oldv);
@@ -5391,6 +5387,7 @@ static target_ulong disas_insn(DisasContext *s, CPUState *cpu)
             tcg_temp_free(oldv);
             tcg_temp_free(newv);
             tcg_temp_free(cmpv);
+            tcg_temp_free(temp);
         }
         break;
     case 0x1c7: /* cmpxchg8b */
--
2.30.2
[PATCH v3 1/1] fix cmpxchg and lock cmpxchg instruction
Give a better code structure to reduce more code duplication, according to
the discussion on patch v2.

Signed-off-by: Wei Li
---
 target/i386/tcg/translate.c | 41 +++++++++++++++----------------
 1 file changed, 19 insertions(+), 22 deletions(-)

diff --git a/target/i386/tcg/translate.c b/target/i386/tcg/translate.c
index 2a94d33742..9677f9576b 100644
--- a/target/i386/tcg/translate.c
+++ b/target/i386/tcg/translate.c
@@ -5339,7 +5339,7 @@ static target_ulong disas_insn(DisasContext *s, CPUState *cpu)
     case 0x1b0:
     case 0x1b1: /* cmpxchg Ev, Gv */
         {
-            TCGv oldv, newv, cmpv;
+            TCGv oldv, newv, cmpv, temp;

             ot = mo_b_d(b, dflag);
             modrm = x86_ldub_code(env, s);
@@ -5348,42 +5348,38 @@ static target_ulong disas_insn(DisasContext *s, CPUState *cpu)
             oldv = tcg_temp_new();
             newv = tcg_temp_new();
             cmpv = tcg_temp_new();
+            temp = tcg_temp_new();
             gen_op_mov_v_reg(s, ot, newv, reg);
             tcg_gen_mov_tl(cmpv, cpu_regs[R_EAX]);
+            tcg_gen_mov_tl(temp, cpu_regs[R_EAX]);

-            if (s->prefix & PREFIX_LOCK) {
+            if ((s->prefix & PREFIX_LOCK) ||
+                (mod != 3)) {
+                /* Use the tcg_gen_atomic_cmpxchg_tl path whenever mod != 3.
+                   While an unlocked cmpxchg need not be atomic, it is not
+                   required to be non-atomic either. */
                 if (mod == 3) {
                     goto illegal_op;
                 }
                 gen_lea_modrm(env, s, modrm);
                 tcg_gen_atomic_cmpxchg_tl(oldv, s->A0, cmpv, newv,
                                           s->mem_index, ot | MO_LE);
-                gen_op_mov_reg_v(s, ot, R_EAX, oldv);
+                gen_extu(ot, oldv);
+                gen_extu(ot, cmpv);
             } else {
-                if (mod == 3) {
-                    rm = (modrm & 7) | REX_B(s);
-                    gen_op_mov_v_reg(s, ot, oldv, rm);
-                } else {
-                    gen_lea_modrm(env, s, modrm);
-                    gen_op_ld_v(s, ot, oldv, s->A0);
-                    rm = 0; /* avoid warning */
-                }
+                rm = (modrm & 7) | REX_B(s);
+                gen_op_mov_v_reg(s, ot, oldv, rm);
                 gen_extu(ot, oldv);
                 gen_extu(ot, cmpv);
                 /* store value = (old == cmp ? new : old); */
                 tcg_gen_movcond_tl(TCG_COND_EQ, newv, oldv, cmpv, newv, oldv);
-                if (mod == 3) {
-                    gen_op_mov_reg_v(s, ot, R_EAX, oldv);
-                    gen_op_mov_reg_v(s, ot, rm, newv);
-                } else {
-                    /* Perform an unconditional store cycle like physical cpu;
-                       must be before changing accumulator to ensure
-                       idempotency if the store faults and the instruction
-                       is restarted */
-                    gen_op_st_v(s, ot, newv, s->A0);
-                    gen_op_mov_reg_v(s, ot, R_EAX, oldv);
-                }
+                gen_op_mov_reg_v(s, ot, rm, newv);
             }
+            /* Perform the merge into %al or %ax as required by ot. */
+            gen_op_mov_reg_v(s, ot, R_EAX, oldv);
+            /* Undo the entire modification to %rax if comparison equal. */
+            tcg_gen_movcond_tl(TCG_COND_EQ, cpu_regs[R_EAX], oldv, cmpv,
+                               temp, cpu_regs[R_EAX]);
             tcg_gen_mov_tl(cpu_cc_src, oldv);
             tcg_gen_mov_tl(s->cc_srcT, cmpv);
             tcg_gen_sub_tl(cpu_cc_dst, cmpv, oldv);
@@ -5391,6 +5387,7 @@ static target_ulong disas_insn(DisasContext *s, CPUState *cpu)
             tcg_temp_free(oldv);
             tcg_temp_free(newv);
             tcg_temp_free(cmpv);
+            tcg_temp_free(temp);
         }
         break;
     case 0x1c7: /* cmpxchg8b */
--
2.30.2
[PATCH v2 1/1] fix cmpxchg and lock cmpxchg instruction
One question: we could reduce more code duplication if we used

    if (foo) {
        tcg_gen_atomic_cmpxchg_tl(oldv, s->A0, cmpv, newv,
                                  s->mem_index, ot | MO_LE);
        gen_extu(ot, oldv);
        gen_extu(ot, cmpv);
    } else {
        tcg_gen_movcond_tl(TCG_COND_EQ, newv, oldv, cmpv, newv, oldv);
        gen_op_mov_reg_v(s, ot, rm, newv);
    }
    gen_op_mov_reg_v(s, ot, R_EAX, oldv);
    tcg_gen_movcond_tl(TCG_COND_EQ, cpu_regs[R_EAX], oldv, cmpv,
                       temp, cpu_regs[R_EAX]);

The problem is that gen_op_mov_reg_v(s, ot, rm, newv) would then happen
before gen_op_mov_reg_v(s, ot, R_EAX, oldv). According to the SDM, the
write to R_EAX should happen before the write to rm, and I am not sure
about the side effects. All in all, if there are no side effects we can
use the code above to reduce more duplication; otherwise we use the code
below to ensure correctness.

Signed-off-by: Wei Li
---
 target/i386/tcg/translate.c | 44 +++++++++++++++++++---------------
 1 file changed, 23 insertions(+), 21 deletions(-)

diff --git a/target/i386/tcg/translate.c b/target/i386/tcg/translate.c
index 2a94d33742..6633d8ece6 100644
--- a/target/i386/tcg/translate.c
+++ b/target/i386/tcg/translate.c
@@ -5339,7 +5339,7 @@ static target_ulong disas_insn(DisasContext *s, CPUState *cpu)
     case 0x1b0:
     case 0x1b1: /* cmpxchg Ev, Gv */
         {
-            TCGv oldv, newv, cmpv;
+            TCGv oldv, newv, cmpv, temp;

             ot = mo_b_d(b, dflag);
             modrm = x86_ldub_code(env, s);
@@ -5348,41 +5348,42 @@ static target_ulong disas_insn(DisasContext *s, CPUState *cpu)
             oldv = tcg_temp_new();
             newv = tcg_temp_new();
             cmpv = tcg_temp_new();
+            temp = tcg_temp_new();
             gen_op_mov_v_reg(s, ot, newv, reg);
             tcg_gen_mov_tl(cmpv, cpu_regs[R_EAX]);
+            tcg_gen_mov_tl(temp, cpu_regs[R_EAX]);

-            if (s->prefix & PREFIX_LOCK) {
+            if ((s->prefix & PREFIX_LOCK) ||
+                (mod != 3)) {
+                /* Use the tcg_gen_atomic_cmpxchg_tl path whenever mod != 3.
+                   While an unlocked cmpxchg need not be atomic, it is not
+                   required to be non-atomic either. */
                 if (mod == 3) {
                     goto illegal_op;
                 }
                 gen_lea_modrm(env, s, modrm);
                 tcg_gen_atomic_cmpxchg_tl(oldv, s->A0, cmpv, newv,
                                           s->mem_index, ot | MO_LE);
+                gen_extu(ot, oldv);
+                gen_extu(ot, cmpv);
+                /* Perform the merge into %al or %ax as required by ot. */
                 gen_op_mov_reg_v(s, ot, R_EAX, oldv);
+                /* Undo the entire modification to %rax if comparison equal. */
+                tcg_gen_movcond_tl(TCG_COND_EQ, cpu_regs[R_EAX], oldv, cmpv,
+                                   temp, cpu_regs[R_EAX]);
             } else {
-                if (mod == 3) {
-                    rm = (modrm & 7) | REX_B(s);
-                    gen_op_mov_v_reg(s, ot, oldv, rm);
-                } else {
-                    gen_lea_modrm(env, s, modrm);
-                    gen_op_ld_v(s, ot, oldv, s->A0);
-                    rm = 0; /* avoid warning */
-                }
+                rm = (modrm & 7) | REX_B(s);
+                gen_op_mov_v_reg(s, ot, oldv, rm);
                 gen_extu(ot, oldv);
                 gen_extu(ot, cmpv);
                 /* store value = (old == cmp ? new : old); */
                 tcg_gen_movcond_tl(TCG_COND_EQ, newv, oldv, cmpv, newv, oldv);
-                if (mod == 3) {
-                    gen_op_mov_reg_v(s, ot, R_EAX, oldv);
-                    gen_op_mov_reg_v(s, ot, rm, newv);
-                } else {
-                    /* Perform an unconditional store cycle like physical cpu;
-                       must be before changing accumulator to ensure
-                       idempotency if the store faults and the instruction
-                       is restarted */
-                    gen_op_st_v(s, ot, newv, s->A0);
-                    gen_op_mov_reg_v(s, ot, R_EAX, oldv);
-                }
+                /* Perform the merge into %al or %ax as required by ot. */
+                gen_op_mov_reg_v(s, ot, R_EAX, oldv);
+                /* Undo the entire modification to %rax if comparison equal. */
+                tcg_gen_movcond_tl(TCG_COND_EQ, cpu_regs[R_EAX], oldv, cmpv,
+                                   temp, cpu_regs[R_EAX]);
+                gen_op_mov_reg_v(s, ot, rm, newv);
             }
             tcg_gen_mov_tl(cpu_cc_src, oldv);
             tcg_gen_mov_tl(s->cc_srcT, cmpv);
@@ -5391,6 +5392,7 @@ static target_ulong disas_insn(DisasContext *s, CPUState *cpu)
             tcg_temp_free(oldv);
             tcg_temp_free(newv);
             tcg_temp_free(cmpv);
+            tcg_temp_free(temp);
         }
[PATCH v2 0/1] cmpxchg and lock cmpxchg should not touch accumulator
Bug: https://gitlab.com/qemu-project/qemu/-/issues/508

This series fixes the bug reported in issue #508. The problem is that
cmpxchg and lock cmpxchg would touch the accumulator when they should not.

Changes from v1:
* cmpxchg uses the lock cmpxchg path whenever mod != 3, to reduce code
  duplication.
* lock cmpxchg uses movcond to replace a branch.
* Combine the two patches into one patch, because cmpxchg now uses the
  lock cmpxchg path.

v1 link: https://lists.gnu.org/archive/html/qemu-devel/2022-03/msg05023.html

Wei Li (1):
  fix cmpxchg and lock cmpxchg instruction

 target/i386/tcg/translate.c | 44 +++++++++++++++----------------
 1 file changed, 23 insertions(+), 21 deletions(-)

--
2.30.2
Re: [Qemu-devel] Following up questions related to QEMU and I/O Thread
Hi Paolo,

That will be great. I would like to hear more details about the design and
implementation once you get those ready.

Thanks a lot,
Wei

On 5/3/19, 11:05 AM, "Paolo Bonzini" wrote:

    On 03/05/19 10:21, Wei Li wrote:
    > Got it, thanks Stefan for your clarification!

    Hi Wei,

    Stefan and I should be posting a patch to add Linux SCSI driver
    batching, and an implementation for virtio-scsi.

    Paolo
Re: [Qemu-devel] Following up questions related to QEMU and I/O Thread
Got it, thanks Stefan for your clarification!

Wei

On 5/1/19, 9:36 AM, "Stefan Hajnoczi" wrote:

    On Mon, Apr 29, 2019 at 10:56:31AM -0700, Wei Li wrote:
    > Does this mean the performance could be improved via adding batch I/O
    > submission support on the guest driver side, which would reduce the
    > number of virtqueue kicks?

    Yes, I think so. It's not obvious to me how a Linux SCSI driver is
    supposed to implement batching, though. The .queuecommand API doesn't
    seem to include information relevant to batching.

    Stefan
Re: [Qemu-devel] Following up questions related to QEMU and I/O Thread
Thanks Stefan!

Does this mean the performance could be improved by adding batch I/O
submission support on the guest driver side, which would reduce the number
of virtqueue kicks?

Thanks,
Wei

On 4/29/19, 6:40 AM, "Stefan Hajnoczi" wrote:

    On Fri, Apr 26, 2019 at 10:14:16AM +0200, Paolo Bonzini wrote:
    > On 23/04/19 14:04, Stefan Hajnoczi wrote:
    > >> In addition, does virtio-scsi support a batch I/O submission
    > >> feature which may be able to increase IOPS by reducing the number
    > >> of system calls?
    > >
    > > I don't see obvious batching support in drivers/scsi/virtio_scsi.c.
    > > The Linux block layer supports batching but I'm not sure if the SCSI
    > > layer does.
    >
    > I think he's referring to QEMU, in which case yes, virtio-scsi does
    > batch I/O submission. See virtio_scsi_handle_cmd_req_prepare and
    > virtio_scsi_handle_cmd_req_submit in hw/scsi/virtio-scsi.c; they do
    > blk_io_plug and blk_io_unplug in order to batch I/O requests from
    > QEMU to the host kernel.

    This isn't fully effective, since the guest driver kicks once per
    request. Therefore the QEMU-level batching you mentioned only works if
    QEMU is slower at handling virtqueue kicks than the guest is at
    submitting requests. I wonder if this is something that can be
    improved.

    Stefan
Re: [Qemu-devel] Following up questions related to QEMU and I/O Thread
Thanks Paolo for your clarification!

Just wanted to double-confirm: does this mean batch I/O submission won't
apply to aio=threads (which is the default mode)?

Thanks,
Wei

On 4/26/19, 9:25 PM, "Paolo Bonzini" wrote:

    > Thanks Stefan and Paolo for your response and advice!
    >
    > Hi Paolo,
    >
    > As to the virtio-scsi batch I/O submission feature in QEMU which you
    > mentioned, is this feature turned on by default in QEMU 2.9, or is
    > there a tunable parameter to turn the feature on/off?

    Yes, it is available by default since 2.2.0. It cannot be turned off;
    however, it is only possible to batch I/O with aio=native (and, since
    2.12.0, with the NVMe backend).

    Paolo
Re: [Qemu-devel] Following up questions related to QEMU and I/O Thread
Thanks Stefan and Paolo for your response and advice!

Hi Paolo,

As to the virtio-scsi batch I/O submission feature in QEMU which you
mentioned, is this feature turned on by default in QEMU 2.9, or is there a
tunable parameter to turn the feature on/off?

Thanks,
Wei

On 4/26/19, 1:14 AM, "Paolo Bonzini" wrote:

    On 23/04/19 14:04, Stefan Hajnoczi wrote:
    >> In addition, does virtio-scsi support a batch I/O submission feature
    >> which may be able to increase IOPS by reducing the number of system
    >> calls?
    >
    > I don't see obvious batching support in drivers/scsi/virtio_scsi.c.
    > The Linux block layer supports batching but I'm not sure if the SCSI
    > layer does.

    I think he's referring to QEMU, in which case yes, virtio-scsi does
    batch I/O submission. See virtio_scsi_handle_cmd_req_prepare and
    virtio_scsi_handle_cmd_req_submit in hw/scsi/virtio-scsi.c; they do
    blk_io_plug and blk_io_unplug in order to batch I/O requests from QEMU
    to the host kernel.
Re: [Qemu-devel] Following up questions related to QEMU and I/O Thread
Hi Stefan,

I did the investigation per your advice; please see inline for the details
and questions.

1. Compare "iostat -dx 1" inside the guest and host. Are the I/O patterns
   comparable? blktrace(8) can give you even more detail on the exact I/O
   patterns. If the guest and host have different I/O patterns (blocksize,
   IOPS, queue depth) then request merging or I/O scheduler effects could
   be responsible for the difference.

[wei]: That's a good point. I compared "iostat -dx 1" between guest and
host, but I have not found an obvious difference that could be responsible
for the gap.

2. kvm_stat or perf record -a -e kvm:\* counters for vmexits and interrupt
   injections. If these counters vary greatly between queue sizes, then
   that is usually a clue. It's possible to get higher performance by
   spending more CPU cycles, although your system doesn't have many CPUs
   available, so I'm not sure if this is the case.

[wei]: vmexits look like a reason. I am using fio to read/write block
storage via the sample command below. Interestingly, the kvm:kvm_exit
count decreased from 846K to 395K after I increased num_queues from 2 to 4
while the vCPU count is 2.
1). Does this mean using more queues than the vCPU count may increase IOPS
    by spending more CPU cycles?
2). Could you please help me better understand how more queues are able to
    spend more CPU cycles? Thanks!

FIO command:
    fio --filename=/dev/sdb --direct=1 --rw=randrw --bs=4k \
        --ioengine=libaio --iodepth=64 --numjobs=4 --time_based \
        --group_reporting --name=iops --runtime=60 --eta-newline=1

3. Power management and polling (kvm.ko halt_poll_ns, tuned profiles, and
   QEMU iothread poll-max-ns). It's expensive to wake a CPU when it goes
   into a low power mode due to idle. There are several features that can
   keep the CPU awake or even poll so that request latency is reduced. The
   reason why the number of queues may matter is that kicking multiple
   queues may keep the CPU awake more than batching multiple requests onto
   a small number of queues.

[wei]: CPU wakeups could be another reason. I noticed that the
kvm:kvm_vcpu_wakeup count decreased from 151K to 47K after I increased
num_queues from 2 to 4 while the vCPU count is 2.
1). Does this mean more queues may keep the CPU busier and more awake,
    which reduced the vCPU wakeups?
2). If using more num_queues than the vCPU count gets higher IOPS in this
    case, is it safe to use 4 queues with only 2 vCPUs, or is there any
    concern or impact of using more queues than vCPUs that I should keep
    in mind?

In addition, does virtio-scsi support a batch I/O submission feature which
may be able to increase IOPS by reducing the number of system calls?

Thanks,
Wei

On 4/16/19, 6:42 PM, "Wei Li" wrote:

    Thanks Stefan and Dongli for your feedback and advice! I will do the
    further investigation per your advice and get back to you later on.

    Thanks,
    -Wei
Re: [Qemu-devel] Following up questions related to QEMU and I/O Thread
Sounds good, let's keep in touch.

Thanks,
Wei

On 4/17/19, 5:17 AM, "Paolo Bonzini" wrote:

    On 17/04/19 03:38, Wei Li wrote:
    > Thanks Paolo for your response and clarification.
    >
    > Btw, is there any rough schedule for when you are planning to start
    > working on the multiqueue feature? Once you start working on the
    > feature, I would like to hear more details about the design and
    > better understand how this feature will benefit the performance of
    > virtio-scsi.

    I wish I knew... :) However, hopefully I will share the details soon
    with Sergio and start flushing that queue in 4.1.

    Paolo
Re: [Qemu-devel] Following up questions related to QEMU and I/O Thread
Thanks Stefan and Dongli for your feedback and advice!

I will do the further investigation per your advice and get back to you
later on.

Thanks,
-Wei

On 4/16/19, 2:20 AM, "Stefan Hajnoczi" wrote:

    On Tue, Apr 16, 2019 at 07:23:38AM +0800, Dongli Zhang wrote:
    > On 4/16/19 1:34 AM, Wei Li wrote:
    > > Hi @Paolo Bonzini & @Stefan Hajnoczi,
    > >
    > > Would you please help confirm whether @Paolo Bonzini's multiqueue
    > > feature change will benefit virtio-scsi or not? Thanks!
    > >
    > > @Stefan Hajnoczi,
    > > I also spent some time exploring the virtio-scsi multi-queue
    > > features via the num_queues parameter as below; here is what we
    > > found:
    > >
    > > 1. Increasing the number of queues from one to the same number as
    > >    the vCPUs gets a better IOPS increase.
    > > 2. Increasing the number of queues to a number (e.g. 8) larger than
    > >    the number of vCPUs (e.g. 2) gets an even better IOPS increase.
    >
    > As mentioned in the link below, when the number of hw queues is
    > larger than nr_cpu_ids, the blk-mq layer will limit and only use at
    > most nr_cpu_ids queues (e.g., /sys/block/sda/mq/).
    >
    > That is, when num_queues=4 while there are 2 vcpus, there should be
    > only 2 queues available in /sys/block/sda/mq/.
    >
    > https://lore.kernel.org/lkml/1553682995-5682-1-git-send-email-dongli.zh...@oracle.com/
    >
    > I am just curious how increasing num_queues from 2 to 4 would double
    > the iops, while there are only 2 vcpus available...

    I don't know the answer. It's especially hard to guess without seeing
    the benchmark (fio?) configuration and QEMU command-line. Common
    things to look at are:

    1. Compare "iostat -dx 1" inside the guest and host. Are the I/O
       patterns comparable? blktrace(8) can give you even more detail on
       the exact I/O patterns. If the guest and host have different I/O
       patterns (blocksize, IOPS, queue depth) then request merging or
       I/O scheduler effects could be responsible for the difference.

    2. kvm_stat or perf record -a -e kvm:\* counters for vmexits and
       interrupt injections. If these counters vary greatly between queue
       sizes, then that is usually a clue. It's possible to get higher
       performance by spending more CPU cycles, although your system
       doesn't have many CPUs available, so I'm not sure if this is the
       case.

    3. Power management and polling (kvm.ko halt_poll_ns, tuned profiles,
       and QEMU iothread poll-max-ns). It's expensive to wake a CPU when
       it goes into a low power mode due to idle. There are several
       features that can keep the CPU awake or even poll so that request
       latency is reduced. The reason why the number of queues may matter
       is that kicking multiple queues may keep the CPU awake more than
       batching multiple requests onto a small number of queues.

    Stefan
Re: [Qemu-devel] Following up questions related to QEMU and I/O Thread
Thanks Paolo for your response and clarification.

Btw, is there any rough schedule for when you are planning to start working
on the multiqueue feature? Once you start working on the feature, I would
like to hear more details about the design and better understand how this
feature will benefit the performance of virtio-scsi.

Thanks again,
Wei

On 4/16/19, 7:01 AM, "Paolo Bonzini" wrote:

    On 05/04/19 23:09, Wei Li wrote:
    > Thanks Stefan for your quick response!
    >
    > Hi Paolo, could you please send us a link related to the multiqueue
    > feature which you are working on, so that we can start getting some
    > details about the feature?

    I have never gotten to the point of multiqueue; a prerequisite for
    that was to make the block layer thread safe. The latest state of the
    work is at github.com/bonzini/qemu, branch dataplane7.

    Paolo
Re: [Qemu-devel] Following up questions related to QEMU and I/O Thread
Hi @Paolo Bonzini & @Stefan Hajnoczi,

Would you please help confirm whether @Paolo Bonzini's multiqueue feature
change will benefit virtio-scsi or not? Thanks!

@Stefan Hajnoczi,
I also spent some time exploring the virtio-scsi multi-queue features via
the num_queues parameter as below; here is what we found:

1. Increasing the number of queues from one to the same number as the
   vCPUs gets a better IOPS increase.
2. Increasing the number of queues to a number (e.g. 8) larger than the
   number of vCPUs (e.g. 2) gets an even better IOPS increase.

In addition, it seems QEMU can get better IOPS when the attachment uses
more queues than the number of vCPUs -- how is that possible? Could you
please help us better understand the behavior? Thanks a lot!

Host CPU configuration:
    CPU(s):              2
    Thread(s) per core:  2
    Core(s) per socket:  1
    Socket(s):           1

Commands for multi-queue setup:
    (QEMU) device_add driver=virtio-scsi-pci num_queues=1 id=test1
    (QEMU) device_add driver=virtio-scsi-pci num_queues=2 id=test2
    (QEMU) device_add driver=virtio-scsi-pci num_queues=4 id=test4
    (QEMU) device_add driver=virtio-scsi-pci num_queues=8 id=test8

Result:
         | 8 Queues | 4 Queues | 2 Queues | Single Queue
    IOPS | +29%     | +27%     | +11%     | Baseline

Thanks,
Wei

On 4/5/19, 2:09 PM, "Wei Li" wrote:

    Thanks Stefan for your quick response!

    Hi Paolo, could you please send us a link related to the multiqueue
    feature which you are working on, so that we can start getting some
    details about the feature?

    Thanks again,
    Wei

    On 4/1/19, 3:54 AM, "Stefan Hajnoczi" wrote:

        On Fri, Mar 29, 2019 at 08:16:36AM -0700, Wei Li wrote:
        > We spent some time exploring the multiple I/O threads approach
        > per your feedback. Based on the perf measurement data, we did
        > see some IOPS improvement for multiple volumes, which is great.

        Paolo last worked on this code, so he may be able to send you a
        link.

        Stefan
Re: [Qemu-devel] Following up questions related to QEMU and I/O Thread
Thanks Stefan for your quick response!

Hi Paolo, could you please send us a link related to the multiqueue feature
which you are working on, so that we can start getting some details about
the feature?

Thanks again,
Wei

On 4/1/19, 3:54 AM, "Stefan Hajnoczi" wrote:

    On Fri, Mar 29, 2019 at 08:16:36AM -0700, Wei Li wrote:
    > Thanks Stefan for your reply and guidance!
    >
    > We spent some time exploring the multiple I/O threads approach per
    > your feedback. Based on the perf measurement data, we did see some
    > IOPS improvement for multiple volumes, which is great. :)
    >
    > In addition, IOPS for a single volume will still be a bottleneck;
    > it seems like the multiqueue block layer feature which Paolo is
    > working on may be able to help improve the IOPS for a single volume.
    >
    > @Paolo, @Stefan,
    > Would you mind sharing the multiqueue feature code branch with us,
    > so that we can get some rough idea about this feature and maybe
    > start doing some exploration?

    Paolo last worked on this code, so he may be able to send you a link.

    Stefan
Re: [Qemu-devel] Following up questions related to QEMU and I/O Thread
Thanks Stefan for your reply and guidance!

We spent some time exploring the multiple I/O threads approach per your feedback. Based on the perf measurement data, we did see some IOPS improvement for multiple volumes, which is great. :)

In addition, IOPS for a single volume will still be a bottleneck. It seems the multiqueue block layer feature which Paolo is working on may help improve IOPS for a single volume.

@Paolo, @Stefan,
Would you mind sharing the multiqueue feature code branch with us? That way we could get a rough idea of the feature and maybe start doing some exploration.

Thanks a lot!
Wei

On 3/5/19, 9:29 AM, "Stefan Hajnoczi" wrote:

On Mon, Mar 04, 2019 at 09:33:26AM -0800, Wei Li wrote:
> While @Stefan mentioned additional iothread object support for virtio-blk, is the feature also supported by virtio-scsi? I am exploring multiple I/O threads per VM via the following:
>
> QMP setup example to create 2 I/O threads in QEMU, one I/O thread per device:
>
> (QEMU) object-add qom-type=iothread id=iothread0
> (QEMU) object-add qom-type=iothread id=iothread1
>
> (QEMU) device_add driver=virtio-scsi-pci id=test0 iothread=iothread0
> (QEMU) device_add driver=virtio-scsi-pci id=test1 iothread=iothread1
>
> (QEMU) device_add driver=scsi-block drive=none0 id=v0 bus=test0.0
> (QEMU) device_add driver=scsi-block drive=none1 id=v1 bus=test1.0

Yes, each virtio-scsi-pci device can be assigned to an iothread.

> You mentioned the multi-queue devices feature; it seems multi-queue will help improve the IOPS of a single device. Could you please provide more details?
> What is the current plan for multi-queue device support? Which release will include it, or has it already been included in a release newer than 2.9?
> Is there a feature branch where I could get more details about the code and its in-progress status?
I have CCed Paolo, who has worked on multiqueue block layer support in QEMU. This feature is not yet complete.

The virtio-scsi device also supports multiqueue, but the QEMU block layer will still be a single queue.

Stefan
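For completeness, the num_queues experiment from the top of the thread can also be expressed directly on the QEMU command line rather than via QMP hot-plug. A hypothetical sketch (the image path, IDs, and drive options are ours, chosen only for illustration):

```shell
# Hypothetical invocation: one virtio-scsi controller exposing 4 request
# queues to the guest. Note that, per Stefan's point above, the QEMU
# block layer behind the controller is still a single queue.
qemu-system-x86_64 \
    -smp 2 -m 2G \
    -drive if=none,id=none0,file=disk0.img,format=raw,cache=none,aio=native \
    -device virtio-scsi-pci,id=test4,num_queues=4 \
    -device scsi-hd,drive=none0,bus=test4.0
```

(This is a configuration fragment, not something runnable without a disk image and guest; scsi-hd is used here instead of scsi-block, which requires passing through a real host SCSI device.)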
Re: [Qemu-devel] Following up questions related to QEMU and I/O Thread
Hi Stefan and all,

I spent some time getting familiar with QEMU and the relevant concepts. My project uses QEMU 2.9 with a virtio-scsi backend, and I am exploring the proper way to improve its IOPS. Thanks @Stefan for the response and advice! Could you please help review and clarify the following questions:

While @Stefan mentioned additional iothread object support for virtio-blk, is the feature also supported by virtio-scsi? I am exploring multiple I/O threads per VM via the following:

QMP setup example to create 2 I/O threads in QEMU, one I/O thread per device:

    (QEMU) object-add qom-type=iothread id=iothread0
    (QEMU) object-add qom-type=iothread id=iothread1

    (QEMU) device_add driver=virtio-scsi-pci id=test0 iothread=iothread0
    (QEMU) device_add driver=virtio-scsi-pci id=test1 iothread=iothread1

    (QEMU) device_add driver=scsi-block drive=none0 id=v0 bus=test0.0
    (QEMU) device_add driver=scsi-block drive=none1 id=v1 bus=test1.0

You mentioned the multi-queue devices feature; it seems multi-queue will help improve the IOPS of a single device. Could you please provide more details?

What is the current plan for multi-queue device support? Which release will include it, or has it already been included in a release newer than 2.9?

Is there a feature branch where I could get more details about the code and its in-progress status?

In addition, someone posted multi-queue results at https://marc.info/?l=linux-virtualization&m=135583400026151&w=2, but they only measure bandwidth. Do we have any perf results about the IOPS improvement of the multi-queue approach?

Thanks again,
Wei

On 2/18/19, 2:24 AM, "Stefan Hajnoczi" wrote:

On Thu, Feb 14, 2019 at 08:21:30AM -0800, Wei Li wrote:
> I learnt that the QEMU iothread architecture has one QEMU thread per vCPU and a dedicated event loop thread, the iothread. I want to better understand whether there is any specific reason to have a single iothread instead of multiple iothreads.
> Given that the single iothread becomes a performance bottleneck in my project, is there a proper way to support multiple iothreads? E.g. one iothread per volume attachment instead of a single iothread per host? I am not sure whether that is feasible; please let me know if you have any advice.

Hi,
Please send general questions to qemu-devel@nongnu.org and CC me in the future. That way others can participate in the discussion and it will be archived, so someone searching for the same question will find the answer.

QEMU supports additional IOThread objects:

    -object iothread,id=iothread0
    -device virtio-blk-pci,iothread=iothread0,drive=drive0

This virtio-blk device will perform device emulation and I/O in iothread0 instead of the main loop thread.

Currently only a 1:1 device<->IOThread association is possible. In the future 1:N should be possible and will allow multi-queue devices to achieve better performance.

Stefan
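Combining the pieces from this thread, the one-iothread-per-device QMP setup can be written as a single command line. A hypothetical sketch (image paths, IDs, and drive options are ours; scsi-hd stands in for scsi-block, which needs a real host SCSI device):

```shell
# Hypothetical invocation: two IOThread objects, one virtio-scsi
# controller pinned to each (1:1 device<->IOThread, per the note above),
# and one disk behind each controller.
qemu-system-x86_64 \
    -object iothread,id=iothread0 \
    -object iothread,id=iothread1 \
    -drive if=none,id=none0,file=disk0.img,format=raw,cache=none,aio=native \
    -drive if=none,id=none1,file=disk1.img,format=raw,cache=none,aio=native \
    -device virtio-scsi-pci,id=test0,iothread=iothread0 \
    -device virtio-scsi-pci,id=test1,iothread=iothread1 \
    -device scsi-hd,drive=none0,bus=test0.0 \
    -device scsi-hd,drive=none1,bus=test1.0
```

(Configuration fragment only; it needs real disk images and a guest to run. Each controller's device emulation and I/O then runs in its own iothread instead of the main loop thread.)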