[Qemu-devel] When is PCIUnregisterFunc called?
Hi QEMU developers,

I'm trying to inject some operations into the emulated device teardown phase. For an emulated PCIe device such as NVMe or IVSHMEM, I notice that QEMU registers PCIDeviceClass pc->init and pc->exit functions for the device. The ->init() functions (e.g. nvme_init() or ivshmem_init()) are marked as constructors and called before QEMU main() runs. However, I failed to find where the ->exit() function is called. When I add some printf() calls to the ->exit() function, they are never printed. Could anyone give me any pointers on this? Thanks.

Best,
Huaicheng
[Qemu-devel] map host MMIO address to guest
Hi all,

I'm trying to map a host MMIO region (a host PCIe device BAR) into guest physical address space. The goal is to enable direct control over that host MMIO region from the guest OS by accessing a certain GPA. I know the address of the host MMIO region (one page).

First, I map the page into the QEMU process address space and get a QEMU buffer. Then I use "memory_region_init_ram_ptr(); memory_region_add_subregion_overlap(system_memory, 512MB, my_mr, 1)" to map the QEMU buffer into guest physical address space (from 512MB to 512MB+4KB). When I read/write the QEMU buffer, I can observe that the correct MMIO region access is triggered.

However, when I try to access the mapped MMIO region from the guest OS (using a guest kernel module that accesses GPA 512MB directly), the host kernel panic below is triggered. I don't understand why this happens: when I use the same method to map a host memory page (instead of a host MMIO page) into the guest, it works fine.

I'd appreciate it if anyone can help analyze this. Thanks in advance.
Best,
Huaicheng

[  323.844213] BUG: unable to handle kernel paging request at ea0003faf460
[  323.845671] IP: gup_pgd_range+0x2f5/0x860
[  323.846615] PGD 23f7ed067 P4D 23f7ed067 PUD 23f7ec067 PMD 0
[  323.847848] Oops: [#1] SMP
[  323.848692] Modules linked in: wpt(O) kvm_intel kvm irqbypass
[  323.850085] CPU: 2 PID: 4994 Comm: qemu-system-x86 Tainted: G O 4.15.0-rc4+ #10
[  323.853002] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.10.2-0-g5f4c7b1-prebuilt.qemu-project.org 04/01/2014
[  323.855029] RIP: 0010:gup_pgd_range+0x2f5/0x860
[  323.855792] RSP: 0018:c90004fdbae0 EFLAGS: 00010002
[  323.856648] RAX: 03faf440 RBX: 555b8c6db000 RCX: 3000
[  323.857618] RDX: ea0003faf440 RSI: 88022d4c36d0 RDI: 8000febd1067
[  323.858598] RBP: c90004fdbb7c R08: 0400 R09: ea00
[  323.859428] R10: 3000 R11: 8000febd1067 R12: 00022d4c3067
[  323.860216] R13: c90004fdbba8 R14: 555b8c6da000 R15: 0007
[  323.860979] FS: 7f4a3a527700() GS:88023fc8() knlGS:
[  323.861899] CS: 0010 DS: ES: CR0: 80050033
[  323.862534] CR2: ea0003faf460 CR3: 000232723001 CR4: 000626e0
[  323.863312] Call Trace:
[  323.863667]  __get_user_pages_fast+0x6b/0x90
[  323.864234]  __gfn_to_pfn_memslot+0xf5/0x3b0 [kvm]
[  323.864858]  ? kvm_irq_delivery_to_apic+0x51/0x2a0 [kvm]
[  323.865502]  try_async_pf+0x53/0x1f0 [kvm]
[  323.866039]  tdp_page_fault+0x112/0x280 [kvm]
[  323.866609]  kvm_mmu_page_fault+0x53/0x130 [kvm]
[  323.867201]  vmx_handle_exit+0x9b/0x1510 [kvm_intel]
[  323.867823]  ? atomic_switch_perf_msrs+0x5f/0x80 [kvm_intel]
[  323.868504]  ? vmx_vcpu_run+0x30a/0x4b0 [kvm_intel]
[  323.869101]  kvm_arch_vcpu_ioctl_run+0xa79/0x1570 [kvm]
[  323.869748]  ? kvm_vcpu_ioctl+0x2eb/0x570 [kvm]
[  323.870332]  kvm_vcpu_ioctl+0x2eb/0x570 [kvm]
[  323.870897]  ? kvm_vm_ioctl+0x142/0x7e0 [kvm]
[  323.871457]  do_vfs_ioctl+0x8f/0x5b0
[  323.871955]  ? native_write_msr+0x6/0x20
[  323.872476]  ? security_file_ioctl+0x3e/0x60
[  323.873024]  SyS_ioctl+0x74/0x80
[  323.873468]  entry_SYSCALL_64_fastpath+0x1a/0x7d
[  323.874043] RIP: 0033:0x7f4a3c897f47
[  323.874506] RSP: 002b:7f4a3a526a78 EFLAGS: 0246 ORIG_RAX: 0010
[  323.875439] RAX: ffda RBX: ae80 RCX: 7f4a3c897f47
[  323.876234] RDX: RSI: ae80 RDI: 000f
[  323.877093] RBP: 555b8c5ef450 R08: 555b89bc1a70 R09:
[  323.877931] R10: fee0 R11: 0246 R12:
[  323.878760] R13: 7f4a3e2db000 R14: 0006 R15: 555b8c5ef450
[  323.879555] Code: 00 00 d3 e2 85 c2 75 ae 4c 85 c7 0f 85 d0 00 00 00 f7 c7 00 02 00 00 75 9d 48 89 f8 66 66 66 90 4c 21 d0 48 c1 e8 06 4a 8d 14 08 <48> 8b 42 20 4c 8d 58 ff a8 01 4c 0f 44 da 41 8b 43 1c 85 c0 0f
[  323.881713] RIP: gup_pgd_range+0x2f5/0x860 RSP: c90004fdbae0
[  323.882509] CR2: ea0003faf460
[  323.883069] ---[ end trace 2427ffda7b3b2a32 ]---
Re: [Qemu-devel] About cpu_physical_memory_map()
Hi Peter,

Just a follow-up on my previous question. I have figured it out by trying it out with QEMU. I'm writing to thank you again for your help! I really appreciate it. Thank you!

Best,
Huaicheng

On Fri, Jun 1, 2018 at 1:00 AM Huaicheng Li wrote:
> Hi Peter,
>
> Thank you a lot for the analysis!
>
>> So it'll be simpler if you start with the buffer in the host QEMU
>> process, map this in to the guest's physical address space at some GPA,
>> tell the guest kernel that that's the GPA to use, and have the guest
>> kernel map that GPA into the guest userspace process's virtual address
>> space. (Think of how you would map a framebuffer, for instance.)
>
> This makes sense to me. Could you help provide a pointer where I can refer
> to similar implementations?
> Should I do something like this during system memory initialization:
>
> memory_region_init_ram_ptr(my_mr, owner, "mybuf", buf_size, buf); //
> where buf is the buffer in QEMU AS
> memory_region_add_subregion(system_memory, GPA_OFFSET, my_mr);
>
> If I set guest memory to be "-m 1G", can I make "GPA_OFFSET" beyond 1GB
> (e.g. 2GB)? This way, the guest OS
> won't be able to access my buffer and use it like other regular RAM.
>
> Thanks!
>
> Best,
> Huaicheng
>
> On Thu, May 31, 2018 at 3:11 AM Peter Maydell wrote:
>> On 30 May 2018 at 01:24, Huaicheng Li wrote:
>>> Dear QEMU/KVM developers,
>>>
>>> I was trying to map a buffer in the host QEMU process to a guest user
>>> space application. I tried to achieve this by allocating a buffer in
>>> the guest application first, then mapping this buffer to QEMU process
>>> address space via GVA -> GPA -> HVA (GPA to HVA is done via
>>> cpu_physical_memory_map). Last, I wrote a host kernel driver to walk
>>> QEMU process's page table and change corresponding page table entries
>>> of HVA to the HPA of the target buffer.
>>
>> This seems like the wrong way round to try to do this. As a rule
>> of thumb, you'll have an easier life if you have things behave
>> similarly to how they would in real hardware. So it'll be simpler
>> if you start with the buffer in the host QEMU process, map this
>> in to the guest's physical address space at some GPA, tell the
>> guest kernel that that's the GPA to use, and have the guest kernel
>> map that GPA into the guest userspace process's virtual address space.
>> (Think of how you would map a framebuffer, for instance.)
>>
>> Changing the host page table entries for QEMU under its feet seems
>> like it's never going to work reliably.
>>
>> (I think the specific problem you're running into is that guest memory
>> is both mapped into the QEMU host process and also exposed to the
>> guest VM. The former is controlled by the page tables for the
>> QEMU host process, but the latter is a different set of page tables,
>> which QEMU asks the kernel to configure, using KVM_SET_USER_MEMORY_REGION
>> ioctls.)
>>
>> thanks
>> -- PMM
Re: [Qemu-devel] About cpu_physical_memory_map()
Hi Peter,

Thank you a lot for the analysis!

> So it'll be simpler if you start with the buffer in the host QEMU
> process, map this in to the guest's physical address space at some GPA,
> tell the guest kernel that that's the GPA to use, and have the guest
> kernel map that GPA into the guest userspace process's virtual address
> space. (Think of how you would map a framebuffer, for instance.)

This makes sense to me. Could you help provide a pointer where I can refer to similar implementations? Should I do something like this during system memory initialization:

memory_region_init_ram_ptr(my_mr, owner, "mybuf", buf_size, buf); // where buf is the buffer in QEMU AS
memory_region_add_subregion(system_memory, GPA_OFFSET, my_mr);

If I set guest memory to be "-m 1G", can I make "GPA_OFFSET" beyond 1GB (e.g. 2GB)? This way, the guest OS won't be able to access my buffer and use it like other regular RAM.

Thanks!

Best,
Huaicheng

On Thu, May 31, 2018 at 3:11 AM Peter Maydell wrote:
> On 30 May 2018 at 01:24, Huaicheng Li wrote:
>> Dear QEMU/KVM developers,
>>
>> I was trying to map a buffer in the host QEMU process to a guest user
>> space application. I tried to achieve this by allocating a buffer in
>> the guest application first, then mapping this buffer to QEMU process
>> address space via GVA -> GPA -> HVA (GPA to HVA is done via
>> cpu_physical_memory_map). Last, I wrote a host kernel driver to walk
>> QEMU process's page table and change corresponding page table entries
>> of HVA to the HPA of the target buffer.
>
> This seems like the wrong way round to try to do this. As a rule
> of thumb, you'll have an easier life if you have things behave
> similarly to how they would in real hardware. So it'll be simpler
> if you start with the buffer in the host QEMU process, map this
> in to the guest's physical address space at some GPA, tell the
> guest kernel that that's the GPA to use, and have the guest kernel
> map that GPA into the guest userspace process's virtual address space.
> (Think of how you would map a framebuffer, for instance.)
>
> Changing the host page table entries for QEMU under its feet seems
> like it's never going to work reliably.
>
> (I think the specific problem you're running into is that guest memory
> is both mapped into the QEMU host process and also exposed to the
> guest VM. The former is controlled by the page tables for the
> QEMU host process, but the latter is a different set of page tables,
> which QEMU asks the kernel to configure, using KVM_SET_USER_MEMORY_REGION
> ioctls.)
>
> thanks
> -- PMM
[Qemu-devel] About cpu_physical_memory_map()
Dear QEMU/KVM developers,

I was trying to map a buffer in the host QEMU process to a guest user space application. I tried to achieve this by allocating a buffer in the guest application first, then mapping this buffer into the QEMU process address space via GVA -> GPA -> HVA (GPA to HVA is done via cpu_physical_memory_map). Last, I wrote a host kernel driver to walk the QEMU process's page table and change the corresponding page table entries of the HVA to the HPA of the target buffer.

Basically, the idea is to keep the GVA -> GPA -> HVA mapping (this step maps the guest buffer into QEMU's address space) and change HVA -> HPA1 (HPA1 is the physical base address of the buffer malloc'ed by the guest application) to HVA -> HPA2 (HPA2 is the physical base address of the target buffer we want to remap into the guest application's address space). The change is done by my kernel module, which edits the page table entries in QEMU's page table on the host system. In this case, I expect GVA to point to HPA2 instead of HPA1.

With the above change, when I access the HVA in the QEMU process, I find it indeed points to HPA2. However, inside the guest OS, the application's GVA still points to HPA1. I have flushed the TLBs for the QEMU process's page table as well as the guest application's page table (in the guest OS) accordingly, but the guest application's GVA is still mapped to HPA1, instead of HPA2.

Does QEMU maintain a fixed GPA to HVA mapping? After going through the code of cpu_physical_memory_map(), I think the HVA is calculated as ramblock->host + GPA, since guest RAM is an mmap'ed area in QEMU's address space and the GPA is an offset within that area. Thus the GPA -> HVA mapping is fixed during runtime. Is QEMU/KVM doing another layer of TLB caching such that the guest application picks up the old mapping to HPA1 instead of HPA2?

Any comments are appreciated. Thank you!

Best,
Huaicheng
[Qemu-devel] [PATCH] hw/block/nvme: Add doorbell buffer config support
This patch adds Doorbell Buffer Config support (NVMe 1.3) to the QEMU NVMe device, based on Mihai Rusu / Lin Ming's Google vendor extension patch [1]. The basic idea of this optimization is to use a buffer shared between the guest OS and QEMU to reduce the number of MMIO operations (doorbell writes). This patch ports the original code to work with current QEMU and also makes it work with SPDK.

Unlike the Linux kernel NVMe driver, which builds the shadow buffer first and then creates SQs/CQs, SPDK first creates SQs/CQs and then issues this command to create the shadow buffer. Thus, this implementation also tries to associate a shadow buffer entry with each SQ/CQ during queue initialization.

[1] http://lists.nongnu.org/archive/html/qemu-devel/2015-11/msg04127.html

Performance results using a **ramdisk**-backed virtual NVMe device in guest Linux 4.14 are below. Note: "QEMU" represents stock QEMU and "+dbbuf" is QEMU with this patch. For psync, QD represents the number of threads being used.

IOPS (Linux kernel NVMe driver):

          psync            libaio
  QD   QEMU  +dbbuf     QEMU  +dbbuf
   1    47k     50k      45k     47k
   4    86k    107k      59k    143k
  16    95k    198k      58k    185k
  64    97k    259k      59k    216k

IOPS (SPDK):

  QD   QEMU  +dbbuf
   1    62k     71k
   4    61k    191k
  16    60k    319k
  64    62k    364k

We can see that this patch greatly increases IOPS (and lowers latency, not shown): 2.7x for psync, 3.7x for libaio, and 5.9x for SPDK.
==Setup==:

(1) VM script:

x86_64-softmmu/qemu-system-x86_64 \
    -name "nvme-FEMU-test" \
    -enable-kvm \
    -cpu host \
    -smp 4 \
    -m 8G \
    -drive file=$IMGDIR/u14s.qcow2,if=ide,aio=native,cache=none,format=qcow2,id=hd0 \
    -drive file=/mnt/tmpfs/test1.raw,if=none,aio=threads,format=raw,id=id0 \
    -device nvme,drive=id0,serial=serial0,id=nvme0 \
    -net user,hostfwd=tcp::8080-:22 \
    -net nic,model=virtio \
    -nographic \

(2) FIO configuration:

[global]
ioengine=libaio
filename=/dev/nvme0n1
thread=1
group_reporting=1
direct=1
verify=0
time_based=1
ramp_time=0
runtime=30
;size=1G
iodepth=16
rw=randread
bs=4k

[test]
numjobs=1

Signed-off-by: Huaicheng Li <huaich...@cs.uchicago.edu>
---
 hw/block/nvme.c      | 97 +---
 hw/block/nvme.h      |  7
 include/block/nvme.h |  2 ++
 3 files changed, 102 insertions(+), 4 deletions(-)

diff --git a/hw/block/nvme.c b/hw/block/nvme.c
index 85d2406400..3882037e36 100644
--- a/hw/block/nvme.c
+++ b/hw/block/nvme.c
@@ -9,7 +9,7 @@
  */

 /**
- * Reference Specs: http://www.nvmexpress.org, 1.2, 1.1, 1.0e
+ * Reference Specs: http://www.nvmexpress.org, 1.3, 1.2, 1.1, 1.0e
  *
  * http://www.nvmexpress.org/resources/
  */
@@ -33,6 +33,7 @@
 #include "qapi/error.h"
 #include "qapi/visitor.h"
 #include "sysemu/block-backend.h"
+#include "exec/memory.h"
 #include "qemu/log.h"

 #include "trace.h"
@@ -244,6 +245,14 @@ static uint16_t nvme_dma_read_prp(NvmeCtrl *n, uint8_t *ptr, uint32_t len,
     return status;
 }

+static void nvme_update_cq_head(NvmeCQueue *cq)
+{
+    if (cq->db_addr) {
+        pci_dma_read(&cq->ctrl->parent_obj, cq->db_addr, &cq->head,
+                     sizeof(cq->head));
+    }
+}
+
 static void nvme_post_cqes(void *opaque)
 {
     NvmeCQueue *cq = opaque;
@@ -254,6 +263,8 @@ static void nvme_post_cqes(void *opaque)
         NvmeSQueue *sq;
         hwaddr addr;

+        nvme_update_cq_head(cq);
+
         if (nvme_cq_full(cq)) {
             break;
         }
@@ -461,6 +472,7 @@ static uint16_t nvme_del_sq(NvmeCtrl *n, NvmeCmd *cmd)
 static void nvme_init_sq(NvmeSQueue *sq, NvmeCtrl *n, uint64_t dma_addr,
     uint16_t sqid, uint16_t cqid, uint16_t size)
 {
+    uint32_t stride = 4 << NVME_CAP_DSTRD(n->bar.cap);
     int i;
     NvmeCQueue *cq;
@@ -480,6 +492,11 @@ static void nvme_init_sq(NvmeSQueue *sq, NvmeCtrl *n, uint64_t dma_addr,
     }
     sq->timer = timer_new_ns(QEMU_CLOCK_VIRTUAL, nvme_process_sq, sq);

+    if (sqid && n->dbbuf_dbs && n->dbbuf_eis) {
+        sq->db_addr = n->dbbuf_dbs + 2 * sqid * stride;
+        sq->ei_addr = n->dbbuf_eis + 2 * sqid * stride;
+    }
+
     assert(n->cq[cqid]);
     cq = n->cq[cqid];
     QTAILQ_INSERT_TAIL(&(cq->sq_list), sq, entry);
@@ -559,6 +576,8 @@ static uint16_t nvme_del_cq(NvmeCtrl *n, NvmeCmd *cmd)
 static void nvme_init_cq(NvmeCQueue *cq, NvmeCtrl *n, uint64_t dma_addr,
     uint16_t cqid, uint16_t vector, uint16_t size, uint16_t irq_enabled)
 {
+    uint32_t stride = 4 << NVME_CAP_DSTRD(n->bar.cap);
+
     cq->ctrl = n;
     cq->cqid = cqid;
     cq->size = size;
@@ -569,11 +588,51 @@ static void nvme_init_cq(NvmeCQueue *cq, NvmeCtrl *n, uint64_t dma_addr,
     cq->head = cq->tail = 0;
     QTAILQ_INIT(&cq->req_list);
     QTAILQ_INIT(&cq->sq_list);
+    if (cqid && n->dbbuf_dbs && n->dbbuf_eis) {
+        cq->db_addr = n->dbbuf_dbs + (2 * cqid +
Re: [Qemu-devel] QEMU GSoC 2018 Project Idea (Apply polling to QEMU NVMe)
Hi Stefan,

Paolo and I have filled out the template, and Paolo has helped update the wiki with this proposed project idea. Thanks.

Best,
Huaicheng

On Tue, Feb 27, 2018 at 10:35 AM, Stefan Hajnoczi <stefa...@gmail.com> wrote:
> On Tue, Feb 27, 2018 at 12:04:48PM +0100, Paolo Bonzini wrote:
> > On 27/02/2018 10:05, Huaicheng Li wrote:
> > > Great to know that you'd like to mentor the project! If so, can we
> > > make it an official project idea and put it on QEMU GSoC page?
> >
> > Submissions need not come from the QEMU GSoC page. You are free to
> > submit any idea that you think can be worthwhile.
>
> Please follow the process described here:
> https://wiki.qemu.org/Google_Summer_of_Code_2018#How_to_propose_a_custom_project_idea
>
> The project idea needs to be posted on the wiki page.
>
> Huaicheng & Paolo, please fill out this template so we can add it to the
> wiki page:
>
> === TITLE ===
>
> '''Summary:''' Short description of the project
>
> Detailed description of the project.
>
> '''Links:'''
> * Wiki links to relevant material
> * External links to mailing lists or web sites
>
> '''Details:'''
> * Skill level: beginner or intermediate or advanced
> * Language: C
> * Mentor: Email address and IRC nick
> * Suggested by: Person who suggested the idea
Re: [Qemu-devel] QEMU GSoC 2018 Project Idea (Apply polling to QEMU NVMe)
Sounds great. Thanks!

On Tue, Feb 27, 2018 at 5:04 AM, Paolo Bonzini <pbonz...@redhat.com> wrote:
> On 27/02/2018 10:05, Huaicheng Li wrote:
> > > Including a RAM disk backend in QEMU would be nice too, and it may
> > > interest you as it would reduce the delta between upstream QEMU and
> > > FEMU. So this could be another idea.
> >
> > Glad you're also interested in this part. This can definitely be part of
> > the project.
> >
> > > For (3), there is work in progress to add multiqueue support to QEMU's
> > > block device layer. We're hoping to get the infrastructure part in
> > > (removing the AioContext lock) during the first half of 2018. As you
> > > say, we can see what the workload will be.
> >
> > Thanks for letting me know this. Could you provide a link to the on-going
> > multiqueue implementation? I would like to learn how this is done. :)
>
> Well, there is no multiqueue implementation yet, but for now you can see
> a lot of work in block/ regarding making drivers and BlockDriverState
> thread safe. We can't just do it for null-co:// so we have a little
> preparatory work to do. :)
>
> > > However, the main issue that I'd love to see tackled is interrupt
> > > mitigation. With higher rates of I/O ops and high queue depth (e.g.
> > > 32), it's common for the guest to become slower when you introduce
> > > optimizations in QEMU. The reason is that lower latency causes higher
> > > interrupt rates and that in turn slows down the guest. If you have any
> > > ideas on how to work around this, I would love to hear about it.
> >
> > Yeah, indeed interrupt overhead (host-to-guest notification) is a
> > headache. I thought about this, and one intuitive optimization in my
> > mind is to add interrupt coalescing support into QEMU NVMe. We may use
> > some heuristic to batch I/O completions back to guest, thus reducing
> > # of interrupts. The heuristic can be time-window based (i.e., for
> > I/Os completed in the same time window, we only do one interrupt for
> > each CQ).
> >
> > I believe there are several research papers that can achieve direct
> > interrupt delivery without exits for para-virtual devices, but those
> > need KVM side modifications. It might not be a good fit here.
>
> No, indeed. But the RAM disk backend and interrupt coalescing (for
> either NVMe or virtio-blk... or maybe a generic scheme that can be
> reused by virtio-net and others too!) is a good idea for the third part
> of the project.
>
> > > In any case, I would very much like to mentor this project. Let me know
> > > if you have any more ideas on how to extend it!
> >
> > Great to know that you'd like to mentor the project! If so, can we make
> > it an official project idea and put it on QEMU GSoC page?
>
> Submissions need not come from the QEMU GSoC page. You are free to
> submit any idea that you think can be worthwhile.
>
> Paolo
Re: [Qemu-devel] QEMU GSoC 2018 Project Idea (Apply polling to QEMU NVMe)
Hi Paolo,

> Slightly rephrased:
>
> (1) add shadow doorbell buffer and ioeventfd support into QEMU NVMe
> emulation, which will reduce # of VM-exits and make them less expensive
> (reduce VCPU latency).
>
> (2) add iothread support to QEMU NVMe emulation. This can also be used
> to eliminate VM-exits because iothreads can do adaptive polling.
>
> (1) and (2) seem okay for at most 1.5 months, especially if you already
> have experience with QEMU.

Thanks a lot for rephrasing it to make it clearer. Yes, I think (1)(2) should be achievable in 1-1.5 months. What needs to be added on top of FEMU includes: ioeventfd support for QEMU NVMe, and using an iothread for polling (the current FEMU implementation uses a periodic timer to poll the shadow buffer directly; moving to an iothread would deliver better performance).

> Including a RAM disk backend in QEMU would be nice too, and it may
> interest you as it would reduce the delta between upstream QEMU and
> FEMU. So this could be another idea.

Glad you're also interested in this part. This can definitely be part of the project.

> For (3), there is work in progress to add multiqueue support to QEMU's
> block device layer. We're hoping to get the infrastructure part in
> (removing the AioContext lock) during the first half of 2018. As you
> say, we can see what the workload will be.

Thanks for letting me know this. Could you provide a link to the on-going multiqueue implementation? I would like to learn how this is done. :)

> However, the main issue that I'd love to see tackled is interrupt
> mitigation. With higher rates of I/O ops and high queue depth (e.g.
> 32), it's common for the guest to become slower when you introduce
> optimizations in QEMU. The reason is that lower latency causes higher
> interrupt rates and that in turn slows down the guest. If you have any
> ideas on how to work around this, I would love to hear about it.

Yeah, indeed interrupt overhead (host-to-guest notification) is a headache. I thought about this, and one intuitive optimization in my mind is to add interrupt coalescing support into QEMU NVMe. We may use some heuristic to batch I/O completions back to the guest, thus reducing the number of interrupts. The heuristic can be time-window based (i.e., for I/Os completed in the same time window, we only do one interrupt for each CQ).

I believe there are several research papers that can achieve direct interrupt delivery without exits for para-virtual devices, but those need KVM-side modifications. It might not be a good fit here.

> In any case, I would very much like to mentor this project. Let me know
> if you have any more ideas on how to extend it!

Great to know that you'd like to mentor the project! If so, can we make it an official project idea and put it on the QEMU GSoC page?

Thank you so much for the feedback and for agreeing to be a potential mentor for this project. I'm happy to see that you also think this is something worth putting effort into.

Best,
Huaicheng

On Mon, Feb 26, 2018 at 2:45 AM, Paolo Bonzini <pbonz...@redhat.com> wrote:
> On 25/02/2018 23:52, Huaicheng Li wrote:
> > I remember there were some discussions back in 2015 about this, but I
> > don't see it finally done. For this project, I think we can go in three
> > steps: (1) add the shadow doorbell buffer support into QEMU NVMe
> > emulation; this will reduce # of VM-exits. (2) replace current timers
> > used by QEMU NVMe with a separate polling thread, thus we can completely
> > eliminate VM-exits. (3) Even further, we can adapt the architecture to
> > use one polling thread for each NVMe queue pair, thus it's possible to
> > provide more performance. (Step 3 can be left for next year if the
> > workload is too much for 3 months.)
>
> Slightly rephrased:
>
> (1) add shadow doorbell buffer and ioeventfd support into QEMU NVMe
> emulation, which will reduce # of VM-exits and make them less expensive
> (reduce VCPU latency).
>
> (2) add iothread support to QEMU NVMe emulation. This can also be used
> to eliminate VM-exits because iothreads can do adaptive polling.
>
> (1) and (2) seem okay for at most 1.5 months, especially if you already
> have experience with QEMU.
>
> For (3), there is work in progress to add multiqueue support to QEMU's
> block device layer. We're hoping to get the infrastructure part in
> (removing the AioContext lock) during the first half of 2018. As you
> say, we can see what the workload will be.
>
> Including a RAM disk backend in QEMU would be nice too, and it may
> interest you as it would reduce the delta between upstream QEMU and
> FEMU. So this could be another idea.
>
> However, the main issue that I'd love to see tackled is interrupt
> mitigation. With higher rates of I/O ops and high queue depth (e.g.
> 32), it's common for the guest to become slo
[Qemu-devel] QEMU GSoC 2018 Project Idea (Apply polling to QEMU NVMe)
Hi all,

The project would be about utilizing the shadow doorbell buffer feature in NVMe 1.3 to enable QEMU-side polling for the virtualized NVMe device, thus achieving performance comparable to virtio-dataplane.

**Why not virtio?** Many industrial/academic researchers use QEMU NVMe as a performance platform for research/product prototyping. The NVMe interface is richer in features than the virtio interface. If we can make QEMU NVMe performance competitive with virtio, it will benefit a lot of communities.

**Doable?** NVMe spec 1.3 introduces a shadow doorbell buffer aimed at virtual NVMe controller optimizations. QEMU can certainly utilize this feature to reduce or even eliminate VM-exits triggered by doorbell writes. I remember there were some discussions back in 2015 about this, but I don't see it finally done. For this project, I think we can go in three steps: (1) add shadow doorbell buffer support into QEMU NVMe emulation; this will reduce # of VM-exits. (2) replace the current timers used by QEMU NVMe with a separate polling thread, so we can completely eliminate VM-exits. (3) Even further, adapt the architecture to use one polling thread for each NVMe queue pair, making it possible to deliver more performance. (Step 3 can be left for next year if the workload is too much for 3 months.)

Actually, I have an initial implementation of steps (1)(2) and would like to work more on it to push it upstream. More information is in this paper (Section 3.1 and Figure 2-left): http://ucare.cs.uchicago.edu/pdf/fast18-femu.pdf

Comments are welcome. Thanks.

Best,
Huaicheng
[Qemu-devel] irqfd for QEMU NVMe
Hi all,

I'm writing to ask whether it's possible to use the irqfd mechanism in QEMU's NVMe virtual controller implementation. My search results show that back in 2015 there was a discussion on improving QEMU NVMe performance by utilizing eventfd for guest-to-host notification, so I guess irqfd should also work? If so, could you briefly describe how to do that, and how much we could save on the host-to-guest notification path by using irqfd? Thanks for your time.

Best,
Huaicheng
Re: [Qemu-devel] use timer for adding latency to each block I/O
> On May 16, 2016, at 11:33 AM, Stefan Hajnoczi wrote:
>
> The way it's done in the "null" block driver is:
>
> static coroutine_fn int null_co_common(BlockDriverState *bs)
> {
>     BDRVNullState *s = bs->opaque;
>
>     if (s->latency_ns) {
>         co_aio_sleep_ns(bdrv_get_aio_context(bs), QEMU_CLOCK_REALTIME,
>                         s->latency_ns);
>     }
>     return 0;
> }

Thanks so much, Stefan! It seems this is what I need.

Best,
[Qemu-devel] use timer for adding latency to each block I/O
Hi all,

My goal is to add latency to each I/O without blocking the submission path. Via a model, I can now determine how long each I/O should wait before it's submitted to the AIO queue. The question is how to make each I/O wait that long before it's finally handled by the worker threads. I'm thinking about adding a timer for each block I/O whose callback function does the submission. Below are some thoughts and questions about this idea:

(1) The callback will be triggered by the thread where the corresponding event queue lives, not the VCPU thread that previously submitted the I/O. How can I safely submit I/O from the event-queue thread?

(2) If each I/O is associated with a timer, and we also put per-I/O timers into the event queue, will the event queue be capable of handling intensive I/O timers with accurate timing?

(3) Which device interface would be better to work on? From my perspective, virtio-dataplane would be a good choice since it has its own thread and event queue.

(4) Any other concerns about this idea?

Any suggestions will be greatly appreciated. Thanks for all your time.

Best,
Huaicheng
Re: [Qemu-devel] about correctness of IDE emulation
> On Apr 13, 2016, at 1:07 PM, John Snow wrote:
>
> Why do you want to use IDE? If you are looking for performance, why not
> a virtio device?

I'm just trying to understand how IDE emulation works and see where the overhead comes in. Thank you for the detailed explanation. I really appreciate it.

Best,
Huaicheng
Re: [Qemu-devel] about correctness of IDE emulation
> On Mar 14, 2016, at 10:09 PM, Huaicheng Li <lhc...@gmail.com> wrote:
>
>> On Mar 13, 2016, at 8:42 PM, Fam Zheng <f...@redhat.com> wrote:
>>
>> On Sun, 03/13 14:37, Huaicheng Li (coperd) wrote:
>>> Hi all,
>>>
>>> What I'm confused about is that:
>>>
>>> If one I/O is too large and may need several rounds (say 2) of DMA
>>> transfers, it seems the second round transfer begins only after the
>>> completion of the first part, by reading data from **IDEState**. But
>>> the IDEState info may have been changed by VCPU threads (by writing
>>> new I/Os to it) when the first transfer finishes. From the code, I see
>>> that the IDE r/w callback function will continue the second transfer
>>> by referencing IDEState's information. Wouldn't this be problematic?
>>> Am I missing anything here?
>>
>> Can you give a concrete example? I/O in VCPU threads that changes IDEState
>> must also take care of the DMA transfers, for example ide_reset() has
>> blk_aio_cancel and clears s->nsectors. If an I/O handler fails to do so,
>> it is a bug.
>>
>> Fam
>
> I get it now. ide_exec_cmd() can only proceed when BUSY_STAT|DRQ_STAT is
> not set. When the 2nd DMA transfer continues, BUSY_STAT|DRQ_STAT is
> already set, i.e., no other new ide_exec_cmd() can enter. BUSY or DRQ is
> removed only when all DMA transfers are done, after which new writes to
> the IDE are allowed. Thus it's safe.
>
> Thanks, Fam & Stefan.

Hi all,

I have some further puzzles about IDE emulation:

(1) IDE can only handle I/Os one by one, so in the AIO queue there will always be only **ONE** I/O from this IDE, right? The big I/Os that need several rounds of DMA transfers are also served one by one (after one DMA transfer [as an AIO] is finished, another DMA transfer is submitted, and so on). Here I want to confirm that there is no batch submission in the IDE path at all. True?

(2) When the guest kernel prepares a big I/O that needs multiple rounds of DMA transfers, will each DMA transfer round (one PRD entry) be trapped and trigger one IDE emulation, or will the IDE handle all the PRD entries in one shot?

(3) I traced the execution of my guest application with big I/Os (each read is 2MB); in the IDE layer, I found each I/O is split into 512KB chunks per DMA transfer. Why 512KB? From the BMDMA spec, the PRD table can represent at most 64KB/8 bytes = 8192 buffers, each of which can be an at most 64KB contiguous buffer. This would give us 8192 * 64KB = 512MB for each DMA. Am I missing anything here?

Thanks for your attention.

Best,
Huaicheng
Re: [Qemu-devel] about correctness of IDE emulation
> On Mar 13, 2016, at 8:42 PM, Fam Zheng <f...@redhat.com> wrote:
>
> On Sun, 03/13 14:37, Huaicheng Li (coperd) wrote:
>> Hi all,
>>
>> What I’m confused about is that:
>>
>> If one I/O is too large and may need several rounds (say 2) of DMA transfers,
>> it seems the second round transfer begins only after the completion of the
>> first part, by reading data from **IDEState**. But the IDEState info may have
>> been changed by VCPU threads (by writing new I/Os to it) when the first
>> transfer finishes. From the code, I see that the IDE r/w callback function will
>> continue the second transfer by referencing IDEState’s information. Wouldn’t
>> this be problematic? Am I missing anything here?
>
> Can you give a concrete example? I/O in VCPU threads that changes IDEState
> must also take care of the DMA transfers; for example, ide_reset() has
> blk_aio_cancel and clears s->nsectors. If an I/O handler fails to do so, it is
> a bug.
>
> Fam

I get it now. ide_exec_cmd() can only proceed when BUSY_STAT|DRQ_STAT is not set. When the 2nd DMA transfer continues, BUSY_STAT | DRQ_STAT is already set, i.e., no other new ide_exec_cmd() can enter. BUSY or DRQ is cleared only when all DMA transfers are done, after which new writes to IDE are allowed. Thus it’s safe.

Thanks, Fam & Stefan.
[Qemu-devel] about correctness of IDE emulation
Hi all,

I’m having some trouble understanding IDE emulation:

(1) I/O down path (in the VCPU thread): upon KVM_EXIT_IO, the corresponding disk ioport write function writes I/O info to IDEState; the IDE read callback function then eventually splits it into **several DMA transfers** and submits them to the AIO request list for handling.

(2) I/O up path (worker thread -> QEMU main loop thread): when a request in the AIO request list has been successfully handled, the worker thread signals the QEMU main thread about this I/O completion event, which is later handled by its callback (posix_aio_read). posix_aio_read then eventually returns to the IDE callback function, where a virtual interrupt is generated to signal the guest about I/O completion.

What I’m confused about is that:

If one I/O is too large and needs several rounds (say 2) of DMA transfers, it seems the second round begins only after the completion of the first part, by reading data from **IDEState**. But the IDEState info may have been changed by VCPU threads (by writing new I/Os to it) by the time the first transfer finishes. From the code, I see that the IDE r/w callback function will continue the second transfer by referencing IDEState’s information. Wouldn’t this be problematic? Am I missing anything here?

Thanks.

Best,
Huaicheng
Re: [Qemu-devel] qemu AIO worker threads change causes Guest OS hangup
> On Mar 5, 2016, at 8:42 PM, Huaicheng Li (coperd) <lhc...@gmail.com> wrote:
>
>> On Mar 1, 2016, at 3:01 PM, Paolo Bonzini <pbonz...@redhat.com> wrote:
>>
>> This is done
>> because the worker threads only care about the queued request list, not
>> about active or completed requests.
>
> Do you think it would be useful to add an API for inserting one request back
> into the queued list? For example, in case of request failure, we could insert
> it back into the list for re-handling according to some rule before returning
> it directly to the guest OS.
>
> Best,
> Huaicheng

Thank you for the help.
Re: [Qemu-devel] qemu AIO worker threads change causes Guest OS hangup
> On Mar 1, 2016, at 3:01 PM, Paolo Bonzini wrote:
>
> This is done
> because the worker threads only care about the queued request list, not
> about active or completed requests.

Do you think it would be useful to add an API for inserting one request back into the queued list? For example, in case of request failure, we could insert it back into the list for re-handling according to some rule before returning it directly to the guest OS.

Best,
Huaicheng
Re: [Qemu-devel] qemu AIO worker threads change causes Guest OS hangup
> On Mar 1, 2016, at 3:34 PM, Stefan Hajnoczi wrote:
>
> Have you seen Linux Documentation/device-mapper/delay.txt?
>
> You could set up a loopback block device and put the device-mapper delay
> target on top to simulate latency.

I’m working on an idea to emulate the latency of SSD reads/writes, which changes *dynamically* according to the status of the emulated flash media. Thanks for the suggestion.
[Qemu-devel] qemu AIO worker threads change causes Guest OS hangup
Hi all,

I’m trying to add some latency conditionally to I/O requests (qemu_paiocb, from **IDE** disk emulation, **raw** image file). My idea is to add this to the worker thread:

* First, set a timer for each incoming qemu_paiocb structure (e.g. 2ms).
* When a worker thread handles this I/O, it first checks whether the timer has expired. If so, it goes on to the normal r/w handling against the image file on the host. Otherwise, it inserts the I/O request back into `request_list` via `qemu_paio_submit`. Here I just want to skip the I/O until the timer condition is satisfied.

Logically, I think this should be right. But after I run some I/O tests inside the guest OS, the guest OS hangs (freezes) with “INFO: task xxx blocked for more than 120 seconds”. From the guest OS’s perspective, the disk seems to be very busy, so the kernel keeps waiting for I/O and becomes unresponsive to other tasks. So I guess it is still a problem with the worker threads. My questions are:

* Is it safe to call `qemu_paio_submit` from a worker thread? Since all `request_list` accesses are protected by a lock, I think this is OK.
* What are the possible reasons the guest OS hangs? My understanding is that, although the worker threads will be busy skipping I/Os many times over, they will eventually finish the work (the guest OS freezes after my r/w test program runs successfully; then it becomes unresponsive).
* Any thoughts on debugging? Currently I’m doing some checking (e.g. the `request_list` length, number of threads) via printf. For this part it seems hard to use gdb, because the guest OS will trigger timeouts if I sit at a breakpoint for “too long”.

Any suggestions would be appreciated. Thanks.

Best,
Huaicheng
Re: [Qemu-devel] Save IDEState data to files when VM shutdown
> Why are you trying to save the state during shutdown?

The structure I added to IDEState keeps being updated while the VM is up, so I think it’s safe to do this during shutdown. When the VM is started again, it can continue from the status saved during the last shutdown. Thanks for your help. I will look into the code first.

> On Dec 9, 2015, at 3:20 AM, Dr. David Alan Gilbert <dgilb...@redhat.com>
> wrote:
>
> * Huaicheng Li (lhc...@gmail.com) wrote:
>> Hi all,
>>
>> Please correct me if I’m wrong.
>>
>> I made some changes to IDE emulation (added some extra structures to “struct
>> IDEState") and want to save this info to files when the VM shuts down, so I
>> can reload it from files next time the VM starts. According to my
>> understanding, one IDEState structure corresponds to one disk of the VM,
>> and all available drives are probed/initialised by ide_init2() in hw/ide.c
>> (I use qemu v0.11) during VM startup. It seems the IDEState structure is
>> saved to the QEMUFile structure via pci_ide_save(), but I can only trace up
>> to register_savevm(), where pci_ide_save() is registered as a callback. I
>> can’t find where exactly this function starts execution or is called. My
>> questions are:
>
> Version 0.11 is *ancient* - please start with something newer; pci_ide_save
> was removed 6 years ago.
>
>> (1) Does the QEMUFile structure represent a running VM instance, through
>> which I can access the IDE drive (struct IDEState) pointers?
>>
>> (2) When does qemu execute pci_ide_save()?
>
> QEMUFile is part of the migration code; it forms a stream of data containing
> all of the device and RAM state during migration. See savevm.c for what
> drives this (in migration/savevm.c in modern qemu).
> Extracting the state of one device from the stream isn't that easy.
>
>> (3) How does qemu handle VM shutdown? It seems an ACPI event is sent to the
>> VM so the guest OS will shut down the way a real OS does on real hardware.
>> But how and where exactly does qemu handle this? I think I need to add my
>> code here.
>
> I don't know the detail of that; I suggest following
> the code from qemu_system_powerdown.
>
> Why are you trying to save the state during shutdown?
>
> Dave
>
>>
>> Any hints or suggestions would be appreciated. Thanks.
>>
>> Huaicheng Li
> --
> Dr. David Alan Gilbert / dgilb...@redhat.com / Manchester, UK
[Qemu-devel] Save IDEState data to files when VM shutdown
Hi all,

Please correct me if I’m wrong.

I made some changes to IDE emulation (added some extra structures to “struct IDEState") and want to save this info to files when the VM shuts down, so I can reload it from files next time the VM starts. According to my understanding, one IDEState structure corresponds to one disk of the VM, and all available drives are probed/initialised by ide_init2() in hw/ide.c (I use qemu v0.11) during VM startup. It seems the IDEState structure is saved to the QEMUFile structure via pci_ide_save(), but I can only trace up to register_savevm(), where pci_ide_save() is registered as a callback. I can’t find where exactly this function starts execution or is called. My questions are:

(1) Does the QEMUFile structure represent a running VM instance, through which I can access the IDE drive (struct IDEState) pointers?

(2) When does qemu execute pci_ide_save()?

(3) How does qemu handle VM shutdown? It seems an ACPI event is sent to the VM so the guest OS will shut down the way a real OS does on real hardware. But how and where exactly does qemu handle this? I think I need to add my code here.

Any hints or suggestions would be appreciated. Thanks.

Huaicheng Li
[Qemu-devel] nested VMX with IA32_FEATURE_CONTROL MSR(addr: 0x3a) value of ZERO
Hi all,

I have a Linux 3.8 kernel (host) and run QEMU 1.5.3 on it. I want to test another hypervisor in qemu, so I enabled KVM’s nested VMX function (by passing the nested=1 parameter to the kvm module) and then started a guest machine. In the guest, I can see the vmx flag in /proc/cpuinfo and the kvm module can be inserted correctly. But when I read the value of the IA32_FEATURE_CONTROL MSR using msr-tools, it shows _0_, while the correct value should be _5_, since bit 0 (the lock bit) and bit 2 of that MSR must be set to enable the virtualization functionality. In my VMware Workstation guest with nested virtualization enabled, the value of that MSR is indeed _5_, as it is on the physical machine (of course). Here I want to ask:

* Am I missing anything needed to fully enable the nested virtualization function? (I googled a lot and it seems there are no additional steps.)
* Since the IA32_FEATURE_CONTROL MSR value should be set by the BIOS and kept unchanged at runtime, is there a modified BIOS that qemu can use to set it? Currently my qemu uses the default one.

--
Best Regards
Huaicheng Li