[Qemu-devel] When is PCIUnregisterFunc called?

2018-07-24 Thread Huaicheng Li
Hi QEMU developers,

I'm trying to inject some operations during the emulated device teardown
phase.

For an emulated PCIe device, such as NVMe or IVSHMEM, I notice that QEMU
registers PCIDeviceClass pc->init and pc->exit functions for that device.
The ->init() functions (e.g. nvme_init() or ivshmem_init()) are marked as
constructors and called before QEMU's main() runs. However, I failed to find
where the ->exit() function is called: when I add a printf() to the ->exit()
function, it is never printed. Could anyone give me some pointers on this?

Thanks.

Best,
Huaicheng


[Qemu-devel] map host MMIO address to guest

2018-06-25 Thread Huaicheng Li
Hi all,

I'm trying to map a host MMIO region (a host PCIe device BAR) into guest
physical address space. The goal is to enable direct control of that host
MMIO region from the guest OS by accessing a fixed GPA.

I know the address of the host MMIO region (one page). First I mmap the page
into the QEMU process's address space to get a QEMU-side buffer. Then I use
"memory_region_init_ram_ptr();
memory_region_add_subregion_overlap(system_memory, 512MB, my_mr, 1)" to expose
the QEMU buffer as part of guest physical address space (from 512MB to
512MB+4KB).
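For reference, the steps described above might be sketched inside QEMU as
follows (an untested sketch of the approach in this mail; `bar_fd`, `my_mr`
and the `"my-bar"` name are placeholders, while the mmap/MemoryRegion calls
are the QEMU APIs mentioned here):

```c
/* Untested sketch of the mapping described above. bar_fd, my_mr and the
 * "my-bar" name are placeholders. */
#include "qemu/osdep.h"
#include "exec/memory.h"
#include <sys/mman.h>

#define MY_GPA  (512ULL * 1024 * 1024)  /* guest-physical base of the window */
#define MY_SIZE 4096                    /* one page */

static MemoryRegion my_mr;

static void map_host_bar_page(Object *owner, MemoryRegion *system_memory,
                              int bar_fd /* fd of the host BAR mapping */)
{
    /* 1. Map the host MMIO page into the QEMU process. */
    void *buf = mmap(NULL, MY_SIZE, PROT_READ | PROT_WRITE,
                     MAP_SHARED, bar_fd, 0);

    /* 2. Wrap the pointer in a MemoryRegion and splice it into guest
     *    physical address space at 512MB with priority 1. */
    memory_region_init_ram_ptr(&my_mr, owner, "my-bar", MY_SIZE, buf);
    memory_region_add_subregion_overlap(system_memory, MY_GPA, &my_mr, 1);
}
```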

When I read/write the QEMU buffer, I can observe that the correct MMIO region
access is triggered. However, when I try to access the mapped MMIO region
from the guest OS (using a guest kernel module to access GPA 512MB directly),
the host kernel panic below is triggered.

I don't understand why this happens. When I use the same method to map a
host memory page (instead of a host MMIO page) into the guest, it works fine.
I would appreciate it if anyone could help analyze this. Thanks in advance.

Best,
Huaicheng

[  323.844213] BUG: unable to handle kernel paging request at
ea0003faf460
[  323.845671] IP: gup_pgd_range+0x2f5/0x860
[  323.846615] PGD 23f7ed067 P4D 23f7ed067 PUD 23f7ec067 PMD 0
[  323.847848] Oops:  [#1] SMP
[  323.848692] Modules linked in: wpt(O) kvm_intel kvm irqbypass
[  323.850085] CPU: 2 PID: 4994 Comm: qemu-system-x86 Tainted: G
 O 4.15.0-rc4+ #10
[  323.853002] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS
rel-1.10.2-0-g5f4c7b1-prebuilt.qemu-project.org 04/01/2014
[  323.855029] RIP: 0010:gup_pgd_range+0x2f5/0x860
[  323.855792] RSP: 0018:c90004fdbae0 EFLAGS: 00010002
[  323.856648] RAX: 03faf440 RBX: 555b8c6db000 RCX:
3000
[  323.857618] RDX: ea0003faf440 RSI: 88022d4c36d0 RDI:
8000febd1067
[  323.858598] RBP: c90004fdbb7c R08: 0400 R09:
ea00
[  323.859428] R10: 3000 R11: 8000febd1067 R12:
00022d4c3067
[  323.860216] R13: c90004fdbba8 R14: 555b8c6da000 R15:
0007
[  323.860979] FS:  7f4a3a527700() GS:88023fc8()
knlGS:
[  323.861899] CS:  0010 DS:  ES:  CR0: 80050033
[  323.862534] CR2: ea0003faf460 CR3: 000232723001 CR4:
000626e0
[  323.863312] Call Trace:
[  323.863667]  __get_user_pages_fast+0x6b/0x90
[  323.864234]  __gfn_to_pfn_memslot+0xf5/0x3b0 [kvm]
[  323.864858]  ? kvm_irq_delivery_to_apic+0x51/0x2a0 [kvm]
[  323.865502]  try_async_pf+0x53/0x1f0 [kvm]
[  323.866039]  tdp_page_fault+0x112/0x280 [kvm]
[  323.866609]  kvm_mmu_page_fault+0x53/0x130 [kvm]
[  323.867201]  vmx_handle_exit+0x9b/0x1510 [kvm_intel]
[  323.867823]  ? atomic_switch_perf_msrs+0x5f/0x80 [kvm_intel]
[  323.868504]  ? vmx_vcpu_run+0x30a/0x4b0 [kvm_intel]
[  323.869101]  kvm_arch_vcpu_ioctl_run+0xa79/0x1570 [kvm]
[  323.869748]  ? kvm_vcpu_ioctl+0x2eb/0x570 [kvm]
[  323.870332]  kvm_vcpu_ioctl+0x2eb/0x570 [kvm]
[  323.870897]  ? kvm_vm_ioctl+0x142/0x7e0 [kvm]
[  323.871457]  do_vfs_ioctl+0x8f/0x5b0
[  323.871955]  ? native_write_msr+0x6/0x20
[  323.872476]  ? security_file_ioctl+0x3e/0x60
[  323.873024]  SyS_ioctl+0x74/0x80
[  323.873468]  entry_SYSCALL_64_fastpath+0x1a/0x7d
[  323.874043] RIP: 0033:0x7f4a3c897f47
[  323.874506] RSP: 002b:7f4a3a526a78 EFLAGS: 0246 ORIG_RAX:
0010
[  323.875439] RAX: ffda RBX: ae80 RCX:
7f4a3c897f47
[  323.876234] RDX:  RSI: ae80 RDI:
000f
[  323.877093] RBP: 555b8c5ef450 R08: 555b89bc1a70 R09:

[  323.877931] R10: fee0 R11: 0246 R12:

[  323.878760] R13: 7f4a3e2db000 R14: 0006 R15:
555b8c5ef450
[  323.879555] Code: 00 00 d3 e2 85 c2 75 ae 4c 85 c7 0f 85 d0 00 00 00 f7
c7 00 02 00 00 75 9d 48 89 f8 66 66 66 90 4c 21 d0 48 c1 e8 06 4a 8d 14 08
<48> 8b 42 20 4c 8d 58 ff a8 01 4c 0f 44
da 41 8b 43 1c 85 c0 0f
[  323.881713] RIP: gup_pgd_range+0x2f5/0x860 RSP: c90004fdbae0
[  323.882509] CR2: ea0003faf460
[  323.883069] ---[ end trace 2427ffda7b3b2a32 ]---


Re: [Qemu-devel] About cpu_physical_memory_map()

2018-06-05 Thread Huaicheng Li
Hi Peter,

Just a follow-up on my previous question: I have figured it out by trying
it out with QEMU.
I'm writing to thank you again for your help! I really appreciate it.

Thank you!

Best,
Huaicheng

On Fri, Jun 1, 2018 at 1:00 AM Huaicheng Li wrote:

> Hi Peter,
>
> Thank you very much for the analysis!
>
> So it'll be simpler
>> if you start with the buffer in the host QEMU process, map this
>> in to the guest's physical address space at some GPA, tell the
>> guest kernel that that's the GPA to use, and have the guest kernel
>> map that GPA into the guest userspace process's virtual address space.
>> (Think of how you would map a framebuffer, for instance.)
>
>
> This makes sense to me. Could you point me to similar implementations
> that I can refer to?
> Should I do something like this during system memory initialization:
>
> memory_region_init_ram_ptr(my_mr, owner, "mybuf", buf_size, buf); //
> where buf is the buffer in QEMU AS
> memory_region_add_subregion(system_memory, GPA_OFFSET, my_mr);
>
> If I set guest memory to be "-m 1G", can I make "GPA_OFFSET" beyond 1GB
> (e.g. 2GB)? This way, the guest OS
> won't be able to access my buffer and use it like other regular RAM.
>
> Thanks!
>
> Best,
> Huaicheng
>
>
>
>
> On Thu, May 31, 2018 at 3:11 AM Peter Maydell wrote:
>
>> On 30 May 2018 at 01:24, Huaicheng Li  wrote:
>> > Dear QEMU/KVM developers,
>> >
>> > I was trying to map a buffer in host QEMU process to a guest user space
>> > application. I tried to achieve this
>> > by allocating a buffer in the guest application first, then map this
>> buffer
>> > to QEMU process address space via
>> > GVA -> GPA --> HVA (GPA to HVA is done via cpu_physical_memory_map).
>> Last,
>> > I wrote a host kernel driver to
>> > walk QEMU process's page table and change corresponding page table
>> entries
>> > of HVA to the HPA of the target
>> > buffer.
>>
>> This seems like the wrong way round to try to do this. As a rule
>> of thumb, you'll have an easier life if you have things behave
>> similarly to how they would in real hardware. So it'll be simpler
>> if you start with the buffer in the host QEMU process, map this
>> in to the guest's physical address space at some GPA, tell the
>> guest kernel that that's the GPA to use, and have the guest kernel
>> map that GPA into the guest userspace process's virtual address space.
>> (Think of how you would map a framebuffer, for instance.)
>>
>> Changing the host page table entries for QEMU under its feet seems
>> like it's never going to work reliably.
>>
>> (I think the specific problem you're running into is that guest memory
>> is both mapped into the QEMU host process and also exposed to the
>> guest VM. The former is controlled by the page tables for the
>> QEMU host process, but the latter is a different set of page tables,
>> which QEMU asks the kernel to configure, using KVM_SET_USER_MEMORY_REGION
>> ioctls.)
>>
>> thanks
>> -- PMM
>>
>


Re: [Qemu-devel] About cpu_physical_memory_map()

2018-06-01 Thread Huaicheng Li
Hi Peter,

Thank you very much for the analysis!

So it'll be simpler
> if you start with the buffer in the host QEMU process, map this
> in to the guest's physical address space at some GPA, tell the
> guest kernel that that's the GPA to use, and have the guest kernel
> map that GPA into the guest userspace process's virtual address space.
> (Think of how you would map a framebuffer, for instance.)


This makes sense to me. Could you point me to similar implementations that
I can refer to?
Should I do something like this during system memory initialization:

memory_region_init_ram_ptr(my_mr, owner, "mybuf", buf_size, buf); //
where buf is the buffer in QEMU AS
memory_region_add_subregion(system_memory, GPA_OFFSET, my_mr);

If I set guest memory to be "-m 1G", can I make "GPA_OFFSET" beyond 1GB
(e.g. 2GB)? This way, the guest OS
won't be able to access my buffer and use it like other regular RAM.

Thanks!

Best,
Huaicheng




On Thu, May 31, 2018 at 3:11 AM Peter Maydell wrote:

> On 30 May 2018 at 01:24, Huaicheng Li  wrote:
> > Dear QEMU/KVM developers,
> >
> > I was trying to map a buffer in host QEMU process to a guest user space
> > application. I tried to achieve this
> > by allocating a buffer in the guest application first, then map this
> buffer
> > to QEMU process address space via
> > GVA -> GPA --> HVA (GPA to HVA is done via cpu_physical_memory_map).
> Last,
> > I wrote a host kernel driver to
> > walk QEMU process's page table and change corresponding page table
> entries
> > of HVA to the HPA of the target
> > buffer.
>
> This seems like the wrong way round to try to do this. As a rule
> of thumb, you'll have an easier life if you have things behave
> similarly to how they would in real hardware. So it'll be simpler
> if you start with the buffer in the host QEMU process, map this
> in to the guest's physical address space at some GPA, tell the
> guest kernel that that's the GPA to use, and have the guest kernel
> map that GPA into the guest userspace process's virtual address space.
> (Think of how you would map a framebuffer, for instance.)
>
> Changing the host page table entries for QEMU under its feet seems
> like it's never going to work reliably.
>
> (I think the specific problem you're running into is that guest memory
> is both mapped into the QEMU host process and also exposed to the
> guest VM. The former is controlled by the page tables for the
> QEMU host process, but the latter is a different set of page tables,
> which QEMU asks the kernel to configure, using KVM_SET_USER_MEMORY_REGION
> ioctls.)
>
> thanks
> -- PMM
>


[Qemu-devel] About cpu_physical_memory_map()

2018-05-29 Thread Huaicheng Li
Dear QEMU/KVM developers,

I was trying to map a buffer in the host QEMU process into a guest user space
application. I tried to achieve this by allocating a buffer in the guest
application first, then mapping this buffer into the QEMU process's address
space via GVA -> GPA -> HVA (GPA to HVA is done via cpu_physical_memory_map).
Last, I wrote a host kernel driver to walk the QEMU process's page table and
change the corresponding page table entries of the HVA to the HPA of the
target buffer.

Basically, the idea is to keep

GVA --> GPA --> HVA mapping (this step is to map guest buffer into QEMU's
address space)

and change

HVA --> HPA1 (HPA1 is the physical base addr of the buffer malloc'ed by the
guest application)
to
HVA --> HPA2 (HPA2 is the physical base addr of the target buffer we want
to remap into guest application's address space)

The above change is done by my kernel module, which modifies the page table
entries in QEMU's page table on the host. With this change, I expect the GVA
to point to HPA2 instead of HPA1.

With the above change, when I access the HVA in the QEMU process, I find it
indeed points to HPA2. However, inside the guest OS, the application's GVA
still points to HPA1.

I have flushed the TLBs for the QEMU process's page table as well as the
guest application's page table (in the guest OS) accordingly, but the guest
application's GVA is still mapped to HPA1, instead of HPA2.

Does QEMU maintain a fixed GPA-to-HVA mapping? After going through the code
of cpu_physical_memory_map(), I think the HVA is calculated as
ramblock->host + GPA, since guest RAM is an mmap'ed area in QEMU's address
space and the GPA is an offset within that area. Thus, the GPA -> HVA mapping
is fixed at runtime. Is QEMU/KVM doing another layer of TLB caching such that
the guest application picks up the old mapping to HPA1 instead of HPA2?
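For reference, the fixed GPA-to-HVA relationship described above can be
modeled as plain offset arithmetic (a toy model of what the RAM case of
cpu_physical_memory_map() effectively computes, not actual QEMU code):

```c
#include <assert.h>
#include <stdint.h>

/* Toy model of the fixed GPA->HVA translation described above: guest RAM
 * is one mmap'ed area in QEMU's address space, so an HVA is just the
 * RAMBlock's host base plus the GPA's offset into that block. */
static uint64_t gpa_to_hva(uint64_t ramblock_host, uint64_t ramblock_base,
                           uint64_t gpa)
{
    return ramblock_host + (gpa - ramblock_base);
}
```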

Any comments are appreciated.

Thank you!

Best,
Huaicheng


[Qemu-devel] [PATCH] hw/block/nvme: Add doorbell buffer config support

2018-03-05 Thread Huaicheng Li

This patch adds Doorbell Buffer Config support (NVMe 1.3) to QEMU NVMe,
based on Mihai Rusu's / Lin Ming's Google vendor extension patch [1]. The
basic idea of this optimization is to use a buffer shared between the guest
OS and QEMU to reduce the number of MMIO operations (doorbell writes). This
patch ports the original code to current QEMU and also makes it work with
SPDK.

Unlike the Linux kernel NVMe driver, which builds the shadow buffer first and
then creates the SQs/CQs, SPDK first creates the SQs/CQs and then issues this
command to create the shadow buffer. Thus, this implementation also
associates a shadow buffer entry with each SQ/CQ during queue initialization.

[1] http://lists.nongnu.org/archive/html/qemu-devel/2015-11/msg04127.html
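For reference, the shadow-entry association described above reduces to offset
arithmetic into the shared buffers, following the NVMe 1.3 doorbell layout
(SQ tail entries at even slots, CQ head entries at odd slots); a
self-contained sketch:

```c
#include <assert.h>
#include <stdint.h>

/* Doorbell stride in bytes: 4 << CAP.DSTRD (DSTRD = 0 gives a 4-byte stride). */
static uint32_t db_stride(uint32_t cap_dstrd)
{
    return 4u << cap_dstrd;
}

/* In the shadow doorbell buffer, SQ tail entries sit at even slots... */
static uint64_t sq_shadow_addr(uint64_t dbbuf, uint16_t sqid, uint32_t stride)
{
    return dbbuf + 2u * sqid * stride;
}

/* ...and CQ head entries at odd slots, mirroring the real doorbell layout. */
static uint64_t cq_shadow_addr(uint64_t dbbuf, uint16_t cqid, uint32_t stride)
{
    return dbbuf + (2u * cqid + 1u) * stride;
}
```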

Performance results using a **ramdisk**-backed virtual NVMe device in guest
Linux 4.14 are shown below.

Note: "QEMU" represents stock QEMU and "+dbbuf" is QEMU with this patch.
For psync, QD represents the number of threads used.


IOPS (Linux kernel NVMe driver)
           psync            libaio
QD     QEMU    +dbbuf    QEMU    +dbbuf
1      47k     50k       45k     47k
4      86k     107k      59k     143k
16     95k     198k      58k     185k
64     97k     259k      59k     216k


IOPS (SPDK)
QD     QEMU    +dbbuf
1      62k     71k
4      61k     191k
16     60k     319k
64     62k     364k

We can see that this patch greatly increases IOPS (and lowers latency, not
shown): 2.7x for psync, 3.7x for libaio and 5.9x for SPDK.

==Setup==:

(1) VM script:
x86_64-softmmu/qemu-system-x86_64 \
-name "nvme-FEMU-test" \
-enable-kvm \
-cpu host \
-smp 4 \
-m 8G \
-drive file=$IMGDIR/u14s.qcow2,if=ide,aio=native,cache=none,format=qcow2,id=hd0 
\
-drive file=/mnt/tmpfs/test1.raw,if=none,aio=threads,format=raw,id=id0 \
-device nvme,drive=id0,serial=serial0,id=nvme0 \
-net user,hostfwd=tcp::8080-:22 \
-net nic,model=virtio \
-nographic \

(2) FIO configuration:

[global]
ioengine=libaio
filename=/dev/nvme0n1
thread=1
group_reporting=1
direct=1
verify=0
time_based=1
ramp_time=0
runtime=30
;size=1G
iodepth=16
rw=randread
bs=4k

[test]
numjobs=1


Signed-off-by: Huaicheng Li <huaich...@cs.uchicago.edu>
---
hw/block/nvme.c  | 97 +---
hw/block/nvme.h  |  7 
include/block/nvme.h |  2 ++
3 files changed, 102 insertions(+), 4 deletions(-)

diff --git a/hw/block/nvme.c b/hw/block/nvme.c
index 85d2406400..3882037e36 100644
--- a/hw/block/nvme.c
+++ b/hw/block/nvme.c
@@ -9,7 +9,7 @@
 */

/**
- * Reference Specs: http://www.nvmexpress.org, 1.2, 1.1, 1.0e
+ * Reference Specs: http://www.nvmexpress.org, 1.3, 1.2, 1.1, 1.0e
 *
 *  http://www.nvmexpress.org/resources/
 */
@@ -33,6 +33,7 @@
#include "qapi/error.h"
#include "qapi/visitor.h"
#include "sysemu/block-backend.h"
+#include "exec/memory.h"

#include "qemu/log.h"
#include "trace.h"
@@ -244,6 +245,14 @@ static uint16_t nvme_dma_read_prp(NvmeCtrl *n, uint8_t *ptr, uint32_t len,
     return status;
 }

+static void nvme_update_cq_head(NvmeCQueue *cq)
+{
+    if (cq->db_addr) {
+        pci_dma_read(&cq->ctrl->parent_obj, cq->db_addr, &cq->head,
+                     sizeof(cq->head));
+    }
+}
+
static void nvme_post_cqes(void *opaque)
{
NvmeCQueue *cq = opaque;
@@ -254,6 +263,8 @@ static void nvme_post_cqes(void *opaque)
NvmeSQueue *sq;
hwaddr addr;

+    nvme_update_cq_head(cq);
+
if (nvme_cq_full(cq)) {
break;
}
@@ -461,6 +472,7 @@ static uint16_t nvme_del_sq(NvmeCtrl *n, NvmeCmd *cmd)
static void nvme_init_sq(NvmeSQueue *sq, NvmeCtrl *n, uint64_t dma_addr,
uint16_t sqid, uint16_t cqid, uint16_t size)
{
+    uint32_t stride = 4 << NVME_CAP_DSTRD(n->bar.cap);
int i;
NvmeCQueue *cq;

@@ -480,6 +492,11 @@ static void nvme_init_sq(NvmeSQueue *sq, NvmeCtrl *n, uint64_t dma_addr,
}
sq->timer = timer_new_ns(QEMU_CLOCK_VIRTUAL, nvme_process_sq, sq);

+    if (sqid && n->dbbuf_dbs && n->dbbuf_eis) {
+        sq->db_addr = n->dbbuf_dbs + 2 * sqid * stride;
+        sq->ei_addr = n->dbbuf_eis + 2 * sqid * stride;
+    }
+
assert(n->cq[cqid]);
cq = n->cq[cqid];
QTAILQ_INSERT_TAIL(&(cq->sq_list), sq, entry);
@@ -559,6 +576,8 @@ static uint16_t nvme_del_cq(NvmeCtrl *n, NvmeCmd *cmd)
static void nvme_init_cq(NvmeCQueue *cq, NvmeCtrl *n, uint64_t dma_addr,
uint16_t cqid, uint16_t vector, uint16_t size, uint16_t irq_enabled)
{
+    uint32_t stride = 4 << NVME_CAP_DSTRD(n->bar.cap);
+
cq->ctrl = n;
cq->cqid = cqid;
cq->size = size;
@@ -569,11 +588,51 @@ static void nvme_init_cq(NvmeCQueue *cq, NvmeCtrl *n, uint64_t dma_addr,
     cq->head = cq->tail = 0;
     QTAILQ_INIT(&cq->req_list);
     QTAILQ_INIT(&cq->sq_list);
+    if (cqid && n->dbbuf_dbs && n->dbbuf_eis) {
+        cq->db_addr = n->dbbuf_dbs + (2 * cqid + 

Re: [Qemu-devel] QEMU GSoC 2018 Project Idea (Apply polling to QEMU NVMe)

2018-03-01 Thread Huaicheng Li
Hi Stefan,

Paolo and I have filled the template. And Paolo has helped update the wiki
with this proposed project idea.

Thanks.

Best,
Huaicheng


On Tue, Feb 27, 2018 at 10:35 AM, Stefan Hajnoczi <stefa...@gmail.com>
wrote:

> On Tue, Feb 27, 2018 at 12:04:48PM +0100, Paolo Bonzini wrote:
> > On 27/02/2018 10:05, Huaicheng Li wrote:
> > > Great to know that you'd like to mentor the project! If so, can we
> make it
> > > an official project idea and put it on QEMU GSoC page?
> >
> > Submissions need not come from the QEMU GSoC page.  You are free to
> > submit any idea that you think can be worthwhile.
>
> Please follow the process described here:
>
> https://wiki.qemu.org/Google_Summer_of_Code_2018#How_to_propose_a_custom_project_idea
>
> The project idea needs to be posted on the wiki page.
>
> Huaicheng & Paolo, please fill out this template so we can add it to the
> wiki page:
>
> === TITLE ===
>
>  '''Summary:''' Short description of the project
>
>  Detailed description of the project.
>
>  '''Links:'''
>  * Wiki links to relevant material
>  * External links to mailing lists or web sites
>
>  '''Details:'''
>  * Skill level: beginner or intermediate or advanced
>  * Language: C
>  * Mentor: Email address and IRC nick
>  * Suggested by: Person who suggested the idea
>


Re: [Qemu-devel] QEMU GSoC 2018 Project Idea (Apply polling to QEMU NVMe)

2018-02-27 Thread Huaicheng Li
Sounds great. Thanks!

On Tue, Feb 27, 2018 at 5:04 AM, Paolo Bonzini <pbonz...@redhat.com> wrote:

> On 27/02/2018 10:05, Huaicheng Li wrote:
> > Including a RAM disk backend in QEMU would be nice too, and it may
> > interest you as it would reduce the delta between upstream QEMU and
> > FEMU.  So this could be another idea.
> >
> > Glad you're also interested in this part. This can definitely be part of
> the
> > project.
> >
> > For (3), there is work in progress to add multiqueue support to
> QEMU's
> > block device layer.  We're hoping to get the infrastructure part in
> > (removing the AioContext lock) during the first half of 2018.  As you
> > say, we can see what the workload will be.
> >
> > Thanks for letting me know this. Could you provide a link to the on-going
> > multiqueue implementation? I would like to learn how this is done. :)
>
> Well, there is no multiqueue implementation yet, but for now you can see
> a lot of work in block/ regarding making drivers and BlockDriverState
> thread safe.  We can't just do it for null-co:// so we have a little
> preparatory work to do. :)
>
> > However, the main issue that I'd love to see tackled is interrupt
> > mitigation.  With higher rates of I/O ops and high queue depth (e.g.
> > 32), it's common for the guest to become slower when you introduce
> > optimizations in QEMU.  The reason is that lower latency causes
> higher
> > interrupt rates and that in turn slows down the guest.  If you have
> any
> > ideas on how to work around this, I would love to hear about it.
> >
> > Yeah, indeed interrupt overhead (host-to-guest notification) is a
> headache.
> > I thought about this, and one intuitive optimization in my mind is to add
> > interrupt coalescing support into QEMU NVMe. We may use some heuristic
> to batch
> > I/O completions back to guest, thus reducing # of interrupts. The
> heuristic
> > can be time-window based (i.e., for I/Os completed in the same time
> window,
> > we only do one interrupt for each CQ).
> >
> > I believe there are several research papers that can achieve direct
> interrupt
> > delivery without exits for para-virtual devices, but those need KVM side
> > modifications. It might be not a good fit here.
>
> No, indeed.  But the RAM disk backend and interrupt coalescing (for
> either NVMe or virtio-blk... or maybe a generic scheme that can be
> reused by virtio-net and others too!) is a good idea for the third part
> of the project.
>
> > In any case, I would very much like to mentor this project.  Let me
> know
> > if you have any more ideas on how to extend it!
> >
> >
> > Great to know that you'd like to mentor the project! If so, can we make
> it
> > an official project idea and put it on QEMU GSoC page?
>
> Submissions need not come from the QEMU GSoC page.  You are free to
> submit any idea that you think can be worthwhile.
>
> Paolo
>


Re: [Qemu-devel] QEMU GSoC 2018 Project Idea (Apply polling to QEMU NVMe)

2018-02-27 Thread Huaicheng Li
Hi Paolo,

Slightly rephrased:
> (1) add shadow doorbell buffer and ioeventfd support into QEMU NVMe
> emulation, which will reduce # of VM-exits and make them less expensive
> (reduce VCPU latency).
> (2) add iothread support to QEMU NVMe emulation.  This can also be used
> to eliminate VM-exits because iothreads can do adaptive polling.
> (1) and (2) seem okay for at most 1.5 months, especially if you already
> have experience with QEMU.


Thanks a lot for rephrasing it to make it more clear.

Yes, I think (1) and (2) should be achievable in 1-1.5 months. What needs to
be added on top of FEMU: ioeventfd support for QEMU NVMe, and using an
iothread for polling (the current FEMU implementation uses a periodic timer
to poll the shadow buffer directly; moving to an iothread should deliver
better performance).

Including a RAM disk backend in QEMU would be nice too, and it may
> interest you as it would reduce the delta between upstream QEMU and
> FEMU.  So this could be another idea.


Glad you're also interested in this part. This can definitely be part of the
project.

For (3), there is work in progress to add multiqueue support to QEMU's
> block device layer.  We're hoping to get the infrastructure part in
> (removing the AioContext lock) during the first half of 2018.  As you
> say, we can see what the workload will be.


Thanks for letting me know this. Could you provide a link to the on-going
multiqueue implementation? I would like to learn how this is done. :)

However, the main issue that I'd love to see tackled is interrupt
> mitigation.  With higher rates of I/O ops and high queue depth (e.g.
> 32), it's common for the guest to become slower when you introduce
> optimizations in QEMU.  The reason is that lower latency causes higher
> interrupt rates and that in turn slows down the guest.  If you have any
> ideas on how to work around this, I would love to hear about it.


Yeah, indeed interrupt overhead (host-to-guest notification) is a headache.
I thought about this, and one intuitive optimization in my mind is to add
interrupt
coalescing support into QEMU NVMe. We may use some heuristic to batch
I/O completions back to guest, thus reducing # of interrupts. The heuristic
can be time-window based (i.e., for I/Os completed in the same time window,
we only do one interrupt for each CQ).
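A time-window heuristic like the one sketched above could look roughly like
this (a hypothetical illustration, not an existing QEMU mechanism; the state
layout and names are assumptions):

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical per-CQ coalescing state: raise at most one interrupt per
 * time window; completions inside the window are batched. */
typedef struct {
    uint64_t window_ns;     /* coalescing window length */
    uint64_t last_irq_ns;   /* when we last raised an interrupt */
    uint32_t pending;       /* completions batched since the last interrupt */
} CoalesceState;

/* Returns true if an interrupt should be raised now for this completion. */
static bool coalesce_should_irq(CoalesceState *cs, uint64_t now_ns)
{
    cs->pending++;
    if (now_ns - cs->last_irq_ns >= cs->window_ns) {
        cs->last_irq_ns = now_ns;
        cs->pending = 0;
        return true;
    }
    return false;   /* batched; a timer would flush stragglers later */
}
```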

I believe there are several research papers that can achieve direct
interrupt
delivery without exits for para-virtual devices, but those need KVM side
modifications. It might be not a good fit here.


In any case, I would very much like to mentor this project.  Let me know
> if you have any more ideas on how to extend it!


Great to know that you'd like to mentor the project! If so, can we make it
an official project idea and put it on QEMU GSoC page?

Thank you so much for the feedback and for agreeing to be a potential mentor
for this project. I'm happy to see that you also think this is something
worth putting effort into.

Best,
Huaicheng

On Mon, Feb 26, 2018 at 2:45 AM, Paolo Bonzini <pbonz...@redhat.com> wrote:

> On 25/02/2018 23:52, Huaicheng Li wrote:
> > I remember there were some discussions back in 2015 about this, but I
> > don't see it finally done. For this project, I think we can go in three
> > steps: (1). add the shadow doorbell buffer support into QEMU NVMe
> > emulation, this will reduce # of VM-exits. (2). replace current timers
> > used by QEMU NVMe with a separate polling thread, thus we can completely
> > eliminate VM-exits. (3). Even further, we can adapt the architecture to
> > use one polling thread for each NVMe queue pair, thus it's possible to
> > provide more performance. (step 3 can be left for next year if the
> > workload is too much for 3 months).
>
> Slightly rephrased:
>
> (1) add shadow doorbell buffer and ioeventfd support into QEMU NVMe
> emulation, which will reduce # of VM-exits and make them less expensive
> (reduce VCPU latency).
>
> (2) add iothread support to QEMU NVMe emulation.  This can also be used
> to eliminate VM-exits because iothreads can do adaptive polling.
>
> (1) and (2) seem okay for at most 1.5 months, especially if you already
> have experience with QEMU.
>
> For (3), there is work in progress to add multiqueue support to QEMU's
> block device layer.  We're hoping to get the infrastructure part in
> (removing the AioContext lock) during the first half of 2018.  As you
> say, we can see what the workload will be.
>
> Including a RAM disk backend in QEMU would be nice too, and it may
> interest you as it would reduce the delta between upstream QEMU and
> FEMU.  So this could be another idea.
>
> However, the main issue that I'd love to see tackled is interrupt
> mitigation.  With higher rates of I/O ops and high queue depth (e.g.
> 32), it's common for the guest to become slo

[Qemu-devel] QEMU GSoC 2018 Project Idea (Apply polling to QEMU NVMe)

2018-02-25 Thread Huaicheng Li
Hi all,

The project would be about utilizing shadow doorbell buffer features in
NVMe 1.3 to enable QEMU side polling for virtualized NVMe device, thus
achieving comparable performance as in virtio-dataplane.

**Why not virtio?**
The reason is that many industrial/academic researchers use QEMU NVMe as a
performance platform for research/product prototyping. The NVMe interface
provides richer features than the virtio interface. If we can make QEMU NVMe
performance competitive with virtio, it will benefit a lot of communities.

**Doable?**
NVMe spec 1.3 introduces a shadow doorbell buffer aimed at virtual NVMe
controller optimizations. QEMU can certainly utilize this feature to reduce
or even eliminate the VM-exits triggered by doorbell writes.

I remember there were some discussions back in 2015 about this, but I don't
see it finally done. For this project, I think we can go in three steps:
(1). add the shadow doorbell buffer support into QEMU NVMe emulation, this
will reduce # of VM-exits. (2). replace current timers used by QEMU NVMe
with a separate polling thread, thus we can completely eliminate VM-exits.
(3). Even further, we can adapt the architecture to use one polling thread
for each NVMe queue pair, thus it's possible to provide more performance.
(step 3 can be left for next year if the workload is too much for 3 months).

Actually, I have an initial implementation of steps (1)(2) and would like to
work more on it to push it upstream. More information is in this paper
(Section 3.1 and Figure 2-left):
http://ucare.cs.uchicago.edu/pdf/fast18-femu.pdf

Comments are welcome.

Thanks.

Best,
Huaicheng


[Qemu-devel] irqfd for QEMU NVMe

2017-09-08 Thread Huaicheng Li
Hi all,

I'm writing to ask if it's possible to use the irqfd mechanism in QEMU's NVMe
virtual controller implementation. My search results show that back in 2015
there was a discussion on improving QEMU NVMe performance by utilizing
eventfd for guest-to-host notification, so I guess irqfd should also work?
If so, could you briefly describe how to do that, and how much we can save
on the host-to-guest notification path by using irqfd?

Thanks for your time.

Best,
Huaicheng


Re: [Qemu-devel] use timer for adding latency to each block I/O

2016-05-26 Thread Huaicheng Li

> On May 16, 2016, at 11:33 AM, Stefan Hajnoczi  wrote:
> 
> The way it's done in the "null" block driver is:
> 
> static coroutine_fn int null_co_common(BlockDriverState *bs)
> {
>BDRVNullState *s = bs->opaque;
> 
>if (s->latency_ns) {
>co_aio_sleep_ns(bdrv_get_aio_context(bs), QEMU_CLOCK_REALTIME,
>s->latency_ns);
>}
>return 0;
> }

Thanks so much,  Stefan! It seems this is what I need.

Best,




[Qemu-devel] use timer for adding latency to each block I/O

2016-05-12 Thread Huaicheng Li
Hi all,

My goal is to add latency to each I/O without blocking the submission path.
Via a model, I can already compute how long each I/O should wait before it is
submitted to the AIO queue. The question is how to make each I/O wait that
long before it is finally handled by the worker threads.

I'm thinking about adding a timer for each block I/O whose callback function
does the submission. Below are some thoughts and questions about this idea:

(1). The callback will be triggered by the thread where the corresponding
event queue lives, not the VCPU thread that originally submitted the I/O.
How can I safely submit I/O from the event-queue thread?

(2). If each I/O is associated with a timer, and we also put the per-I/O
timers into the event queue, will the event queue be capable of handling
intensive I/O timers with accurate timing?

(3). Which device interface would be better to work on? From my perspective,
virtio-dataplane would be a good choice since it has its own thread and
event queue.

(4). Any other concerns about this idea?
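For reference, the per-I/O timer idea above might look roughly like this
with QEMU's timer API (an untested sketch; MyReq and my_submit() are
hypothetical placeholders for the request type and the real submission path):

```c
/* Untested sketch: one QEMUTimer per I/O; the callback performs the
 * deferred submission. MyReq and my_submit() are placeholders. */
#include "qemu/osdep.h"
#include "qemu/timer.h"

typedef struct MyReq MyReq;
void my_submit(MyReq *req);   /* assumed to do the actual AIO submission */

typedef struct {
    QEMUTimer *timer;
    MyReq *req;
} DelayedIO;

/* Runs in the thread that owns the timer's context, not the VCPU thread. */
static void delayed_io_cb(void *opaque)
{
    DelayedIO *d = opaque;
    my_submit(d->req);
    timer_free(d->timer);
    g_free(d);
}

static void submit_with_latency(MyReq *req, int64_t latency_ns)
{
    DelayedIO *d = g_new0(DelayedIO, 1);
    d->req = req;
    d->timer = timer_new_ns(QEMU_CLOCK_REALTIME, delayed_io_cb, d);
    timer_mod(d->timer, qemu_clock_get_ns(QEMU_CLOCK_REALTIME) + latency_ns);
}
```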


Any suggestions will be greatly appreciated. Thanks for all your time.

Best,
Huaicheng

 


Re: [Qemu-devel] about correctness of IDE emulation

2016-04-13 Thread Huaicheng Li

> On Apr 13, 2016, at 1:07 PM, John Snow  wrote:
> 
> Why do you want to use IDE? If you are looking for performance, why not
> a virtio device?

I’m just trying to understand how IDE emulation works and see where the
overhead comes in. Thank you for the detailed explanation. I really appreciate 
that.

Best,
Huaicheng


Re: [Qemu-devel] about correctness of IDE emulation

2016-04-13 Thread Huaicheng Li (coperd)

> On Mar 14, 2016, at 10:09 PM, Huaicheng Li <lhc...@gmail.com> wrote:
> 
> 
>> On Mar 13, 2016, at 8:42 PM, Fam Zheng <f...@redhat.com> wrote:
>> 
>> On Sun, 03/13 14:37, Huaicheng Li (coperd) wrote:
>>> Hi all, 
>>> 
>>> What I’m confused about is that:
>>> 
>>> If one I/O is too large and may need several rounds (say 2) of DMA 
>>> transfers,
>>> it seems the second round transfer begins only after the completion of the
>>> first part, by reading data from **IDEState**. But the IDEState info may 
>>> have
>>> been changed by VCPU threads (by writing new I/Os to it) when the first
>>> transfer finishes. From the code, I see that IDE r/w call back function will
>>> continue the second transfer by referencing IDEState’s information. Wouldn’t
>>> this be problematic? Am I missing anything here?
>> 
>> Can you give an concrete example? I/O in VCPU threads that changes IDEState
>> must also take care of the DMA transfers, for example ide_reset() has
>> blk_aio_cancel and clears s->nsectors. If an I/O handler fails to do so, it 
>> is
>> a bug.
>> 
>> Fam
> 
> I get it now. ide_exec_cmd() can only proceed when BUSY_STAT|DRQ_STAT is
> not set. When the 2nd DMA transfer continues, BUSY_STAT | DRQ_STAT is
> already set, i.e., no other new ide_exec_cmd() can enter. BUSY or DRQ is
> cleared only when all DMA transfers are done, after which new writes to
> the IDE device are allowed. Thus it's safe.
> 
> Thanks, Fam & Stefan.

Hi all, I have some further puzzles about IDE emulation:

  (1). IDE can only handle I/Os one by one. So in the AIO queue there will always
be only **ONE** I/O from this IDE, right? Big I/Os which need to be split into
several rounds of DMA transfers are also served one by one (after one DMA
transfer [as an AIO] is finished, another DMA transfer is submitted, and so on).
What I want to convey is that there is no batch submission in the IDE path at
all. True?
  (2). When the guest kernel prepares to do a big I/O which needs multiple rounds
of DMA transfers, will each DMA transfer round (one PRD entry) be trapped and
trigger one round of IDE emulation, or will IDE handle all the PRDs in one shot?
  (3). I traced the execution of my guest application doing big I/Os (each read
is 2MB), and in the IDE layer I found each one is split into 512KB chunks, one
per DMA transfer. Why 512KB? From the BMDMA spec, a PRD table can hold at most
64KB/8bytes = 8192 entries, each of which can describe an at most 64KB
contiguous buffer. This would give us 8192*64KB = 512MB for each DMA.

Am I missing anything here?  

Thanks for your attention.

Best,
Huaicheng





Re: [Qemu-devel] about correctness of IDE emulation

2016-03-14 Thread Huaicheng Li

> On Mar 13, 2016, at 8:42 PM, Fam Zheng <f...@redhat.com> wrote:
> 
> On Sun, 03/13 14:37, Huaicheng Li (coperd) wrote:
>> Hi all, 
>> 
>> What I’m confused about is that:
>> 
>> If one I/O is too large and may need several rounds (say 2) of DMA transfers,
>> it seems the second round begins only after the completion of the first part,
>> by reading data from **IDEState**. But the IDEState info may have been changed
>> by VCPU threads (by writing new I/Os to it) by the time the first transfer
>> finishes. From the code, I see that the IDE r/w callback function continues
>> the second transfer by referencing IDEState’s information. Wouldn’t this be
>> problematic? Am I missing anything here?
> 
> Can you give a concrete example? I/O in VCPU threads that changes IDEState
> must also take care of the DMA transfers, for example ide_reset() has
> blk_aio_cancel and clears s->nsectors. If an I/O handler fails to do so, it is
> a bug.
> 
> Fam

I get it now. ide_exec_cmd() can only proceed when BUSY_STAT|DRQ_STAT is not
set. When the 2nd DMA transfer continues, BUSY_STAT|DRQ_STAT is already set,
i.e., no other new ide_exec_cmd() can enter. BUSY or DRQ is cleared only when
all DMA transfers are done, after which new writes to the IDE device are
allowed. Thus it’s safe.

Thanks, Fam & Stefan.


[Qemu-devel] about correctness of IDE emulation

2016-03-13 Thread Huaicheng Li (coperd)
Hi all, 

I meet some trouble in understanding IDE emulation:

(1). IDE I/O Down Path (in the VCPU thread):
Upon KVM_EXIT_IO, the corresponding disk ioport write function writes the I/O
info into IDEState; the IDE read callback function then splits it into
**several DMA transfers** and submits them to the AIO request list for
handling.

(2). I/O Up Path (worker thread -> QEMU main loop thread):
When a request in the AIO request list has been handled successfully, the
worker thread signals the I/O completion event to the QEMU main thread, where
it is handled by a callback (posix_aio_read). posix_aio_read eventually
returns to the IDE callback function, where a virtual interrupt is raised to
signal the guest about I/O completion.

What I’m confused about is that:

If one I/O is too large and needs several rounds (say 2) of DMA transfers,
it seems the second round begins only after the completion of the first part,
by reading data from **IDEState**. But the IDEState info may have been changed
by VCPU threads (by writing new I/Os to it) by the time the first transfer
finishes. From the code, I see that the IDE r/w callback function continues
the second transfer by referencing IDEState’s information. Wouldn’t this be
problematic? Am I missing anything here?

Thanks.

Best,
Huaicheng


Re: [Qemu-devel] qemu AIO worker threads change causes Guest OS hangup

2016-03-07 Thread Huaicheng Li (coperd)

> On Mar 5, 2016, at 8:42 PM, Huaicheng Li (coperd) <lhc...@gmail.com> wrote:
> 
> 
>> On Mar 1, 2016, at 3:01 PM, Paolo Bonzini <pbonz...@redhat.com> wrote:
>> 
>> This is done
>> because the worker threads only care about the queued request list, not
>> about active or completed requests.
> 
> Do you think it would be useful to add an API for inserting one request back
> into the queued list? For example, in case of a request failure, we could
> insert it back into the list for re-handling according to some rule before
> returning it directly to the guest OS.
> 
> Best,
> Huaicheng
> 

Thank you for the help. 

Re: [Qemu-devel] qemu AIO worker threads change causes Guest OS hangup

2016-03-05 Thread Huaicheng Li (coperd)

> On Mar 1, 2016, at 3:01 PM, Paolo Bonzini  wrote:
> 
> This is done
> because the worker threads only care about the queued request list, not
> about active or completed requests.

Do you think it would be useful to add an API for inserting one request back
into the queued list? For example, in case of a request failure, we could
insert it back into the list for re-handling according to some rule before
returning it directly to the guest OS.

Best,
Huaicheng



Re: [Qemu-devel] qemu AIO worker threads change causes Guest OS hangup

2016-03-05 Thread Huaicheng Li (coperd)

> On Mar 1, 2016, at 3:34 PM, Stefan Hajnoczi  wrote:
> 
> Have you seen Linux Documentation/device-mapper/delay.txt?
> 
> You could set up a loopback block device and put the device-mapper delay 
> target on top to simulate latency.


I’m working on an idea to emulate the latency of SSD reads/writes,
which changes *dynamically* according to the status of the emulated
flash media. Thanks for the suggestion.

[Qemu-devel] qemu AIO worker threads change causes Guest OS hangup

2016-03-01 Thread Huaicheng Li
Hi all,

I’m trying to add some latency conditionally to I/O requests (qemu_paiocb, from
**IDE** disk emulation, **raw** image file).
My idea is to add this into the worker thread:

  * First, set a timer for each incoming qemu_paiocb structure (e.g. 2ms).
  * When a worker thread handles this I/O, it first checks whether the timer
has expired. If so, it goes on to the normal r/w handling against the image
file on the host. Otherwise, it inserts the I/O request back into
`request_list` via `qemu_paio_submit`. Here, I just want to skip the I/O until
the timer condition is satisfied.

Logically, I think this should be right. 

But after I run some I/O tests inside the guest OS, the guest hangs (freezes)
with “INFO: task xxx blocked for more than 120 seconds”.
From the guest OS’s perspective, the disk seems to be very busy, so the kernel
keeps waiting for I/O and becomes unresponsive to other tasks. So I guess it
should still be a problem with the worker threads.


My questions are:

  * Is it safe to call `qemu_paio_submit` from a worker thread? Since all
request_list accesses are protected by a lock, I think this is OK.

  * What are the possible reasons why the guest OS hangs? My understanding is
that, although the worker threads will be busy skipping the I/O many times,
they will eventually finish the task (the guest OS freezes after my r/w test
program runs successfully; only then does it become unresponsive).

  * Any thoughts on debugging? Currently I’m doing some checking (e.g. the
request_list length, number of threads) via printf. For this part it seems
hard to use gdb, because the guest OS will trigger timeouts if I sit at a
breakpoint for “too long”.

Any suggestions would be appreciated. 

Thanks.

Best,
Huaicheng










Re: [Qemu-devel] Save IDEState data to files when VM shutdown

2015-12-09 Thread Huaicheng Li
> Why are you trying to save the state during shutdown?


The structure I added to IDEState keeps being updated while the VM is up, so I
think it’s safe to do this during shutdown. When the VM is started again, it
can continue from the state saved during the last shutdown.

Thanks for your help. I will look into the code first.


> On Dec 9, 2015, at 3:20 AM, Dr. David Alan Gilbert <dgilb...@redhat.com> 
> wrote:
> 
> * Huaicheng Li (lhc...@gmail.com) wrote:
>> Hi all,
>> 
>> Please correct me if I’m wrong.
>> 
>> I made some changes to IDE emulation (added some extra structures to “struct
>> IDEState”) and want to save this info to files when the VM shuts down, so I
>> can reload it from files the next time the VM starts. According to my
>> understanding, one IDEState structure corresponds to one disk of the VM, and
>> all available drives are probed/initialized by ide_init2() in hw/ide.c (I
>> used qemu v0.11) during VM startup. It seems the IDEState structure is saved
>> to a QEMUFile structure via pci_ide_save(), but I can only trace up to
>> register_savevm(), where pci_ide_save() is registered as a callback. I can’t
>> find where exactly this function is called. My questions are:
> 
> Version 0.11 is *ancient* - please start with something newer; pci_ide_save 
> was removed 6 years ago.
> 
>> (1). Does QEMUFile structure represent a running VM instance, through which 
>> I can access the IDE drive (struct IDEState) pointers ?
>> 
>> (2). When does qemu execute pci_ide_save()? 
> 
> QEMUFile is part of the migration code; it forms a stream of data containing
> all of the device and RAM state during migration. See savevm.c for what
> drives this (migration/savevm.c in modern qemu).
> Extracting the state of one device from the stream isn't that easy.
> 
>> (3). How does qemu handle VM shutdown? It seems an ACPI event is sent to the
>> VM so the guest OS shuts down the way a real OS on real hardware would. But
>> how and where exactly does qemu handle this? I think I need to add my code
>> here.
> 
> I don't know the detail of that; I suggest following
> the code from qemu_system_powerdown.
> 
> Why are you trying to save the state during shutdown?
> 
> Dave
> 
>> 
>> Any hints, suggestions would be appreciated. Thanks.
>> 
>> Huaicheng Li
> --
> Dr. David Alan Gilbert / dgilb...@redhat.com / Manchester, UK




[Qemu-devel] Save IDEState data to files when VM shutdown

2015-12-08 Thread Huaicheng Li
Hi all,

Please correct me if I’m wrong.

I made some changes to IDE emulation (added some extra structures to “struct
IDEState”) and want to save this info to files when the VM shuts down, so I
can reload it from files the next time the VM starts. According to my
understanding, one IDEState structure corresponds to one disk of the VM, and
all available drives are probed/initialized by ide_init2() in hw/ide.c (I use
qemu v0.11) during VM startup. It seems the IDEState structure is saved to a
QEMUFile structure via pci_ide_save(), but I can only trace up to
register_savevm(), where pci_ide_save() is registered as a callback. I can’t
find where exactly this function is called. My questions are:

(1). Does the QEMUFile structure represent a running VM instance, through
which I can access the IDE drive (struct IDEState) pointers?

(2). When does qemu execute pci_ide_save()? 

(3). How does qemu handle VM shutdown? It seems an ACPI event is sent to the
VM so the guest OS shuts down the way a real OS on real hardware would. But
how and where exactly does qemu handle this? I think I need to add my code
here.

Any hints, suggestions would be appreciated. Thanks.

Huaicheng Li


[Qemu-devel] nested VMX with IA32_FEATURE_CONTROL MSR(addr: 0x3a) value of ZERO

2015-01-13 Thread Huaicheng Li
Hi, all

I have a Linux 3.8 kernel (host) and run QEMU 1.5.3 on it. I want to test
another hypervisor in QEMU, so I enabled KVM's nested VMX support (by passing
the nested=1 parameter to the kvm module) and then started a guest machine.
In the guest, I can see the vmx flag in /proc/cpuinfo, and the kvm module can
be inserted correctly. But when I read the IA32_FEATURE_CONTROL MSR using
msr-tools, it shows _0_, whereas the correct value should be _5_, since bit 0
(the lock bit) and bit 2 (enable VMX outside SMX) of that MSR must be set to
enable virtualization. In my VMware Workstation guest with nested
virtualization enabled, the value of that MSR is indeed _5_, as it is on the
physical machine (of course). Here, I want to ask:

* Am I missing anything needed to fully enable nested virtualization? (I
googled a lot and it seems there are no additional steps.)

* Since the IA32_FEATURE_CONTROL MSR value should be set by the BIOS and is
kept unchanged at runtime, is there a modified BIOS that qemu can use to set
it? Currently my qemu uses the default one.
-- 
Best Regards
Huaicheng Li