Re: [PATCH] hw/misc: Add a virtual pci device to dynamically attach memory to QEMU

2021-10-15 Thread david.dai
On Fri, Oct 15, 2021 at 11:27:02AM +0200, David Hildenbrand (da...@redhat.com) 
wrote:
> 
> 
> On 15.10.21 11:10, david.dai wrote:
> > On Wed, Oct 13, 2021 at 10:33:39AM +0200, David Hildenbrand 
> > (da...@redhat.com) wrote:
> >>
> >>
> >> On 13.10.21 10:13, david.dai wrote:
> >>> On Mon, Oct 11, 2021 at 09:43:53AM +0200, David Hildenbrand 
> >>> (da...@redhat.com) wrote:
> >>>>
> >>>>
> >>>>
> >>>>>> virtio-mem currently relies on having a single sparse memory region (anon
> >>>>>> mmap, mmaped file, mmaped huge pages, mmap shmem) per VM. Although we can
> >>>>>> share memory with other processes, sharing with other VMs is not intended.
> >>>>>> Instead of actually mmaping parts dynamically (which can be quite
> >>>>>> expensive), virtio-mem relies on punching holes into the backend and
> >>>>>> dynamically allocating memory/file blocks/... on access.
> >>>>>>
> >>>>>> So the easy way to make it work is:
> >>>>>>
> >>>>>> a) Exposing the CXL memory to the buddy via dax/kmem, resulting in device
> >>>>>> memory getting managed by the buddy on a separate NUMA node.
> >>>>>>
> >>>>>
> >>>>> The Linux kernel buddy system? How do we guarantee that other applications
> >>>>> don't allocate memory from it?
> >>>>
> >>>> Excellent question. Usually, you would online the memory to ZONE_MOVABLE,
> >>>> such that even if some other allocation ended up there, that it could
> >>>> get migrated somewhere else.
> >>>>
> >>>> For example, "daxctl reconfigure-device" tries doing that as default:
> >>>>
> >>>> https://pmem.io/ndctl/daxctl-reconfigure-device.html
> >>>>
> >>>> However, I agree that we might actually want to tell the system to not
> >>>> use this CPU-less node as fallback for other allocations, and that we
> >>>> might not want to swap out such memory etc.
> >>>>
> >>>>
> >>>> But, in the end all that virtio-mem needs to work in the hypervisor is
> >>>>
> >>>> a) A sparse memmap (anonymous RAM, memfd, file)
> >>>> b) A way to populate memory within that sparse memmap (e.g., on fault,
> >>>> using madvise(MADV_POPULATE_WRITE), fallocate())
> >>>> c) A way to discard memory (madvise(MADV_DONTNEED),
> >>>> fallocate(FALLOC_FL_PUNCH_HOLE))
> >>>>
> >>>> So instead of using anonymous memory+mbind, you can also mmap a sparse file
> >>>> and rely on populate-on-demand. One alternative for your use case would be
> >>>> to create a DAX filesystem on that CXL memory (IIRC that should work) and
> >>>> simply provide virtio-mem with a sparse file located on that filesystem.
> >>>>
> >>>> Of course, you can also use some other mechanism as you might have in
> >>>> your approach, as long as it supports a,b,c.
> >>>>
> >>>>>
> >>>>>>
> >>>>>> b) (optional) allocate huge pages on that separate NUMA node.
> >>>>>> c) Use ordinary memory-device-ram or memory-device-memfd (for huge pages),
> >>>>>> *binding* the memory backend to that special NUMA node.
> >>>>>>
> >>>>> "-object memory-backend/device-ram or memory-device-memfd, id=mem0, 
> >>>>> size=768G"
> >>>>> How to bind backend memory to NUMA node
> >>>>>
> >>>>
> >>>> I think the syntax is "policy=bind,host-nodes=X"
> >>>>
> >>>> whereby X is a node mask. So for node "0" you'd use "host-nodes=0x1", for "5"
> >>>> "host-nodes=0x20", etc.
> >>>>

Re: [PATCH] hw/misc: Add a virtual pci device to dynamically attach memory to QEMU

2021-10-15 Thread david.dai
On Wed, Oct 13, 2021 at 10:33:39AM +0200, David Hildenbrand (da...@redhat.com) 
wrote:
> 
> 
> On 13.10.21 10:13, david.dai wrote:
> > On Mon, Oct 11, 2021 at 09:43:53AM +0200, David Hildenbrand 
> > (da...@redhat.com) wrote:
> > > 
> > > 
> > > 
> > > > > virtio-mem currently relies on having a single sparse memory region (anon
> > > > > mmap, mmaped file, mmaped huge pages, mmap shmem) per VM. Although we can
> > > > > share memory with other processes, sharing with other VMs is not intended.
> > > > > Instead of actually mmaping parts dynamically (which can be quite
> > > > > expensive), virtio-mem relies on punching holes into the backend and
> > > > > dynamically allocating memory/file blocks/... on access.
> > > > > 
> > > > > So the easy way to make it work is:
> > > > > 
> > > > > a) Exposing the CXL memory to the buddy via dax/kmem, resulting in device
> > > > > memory getting managed by the buddy on a separate NUMA node.
> > > > > 
> > > > 
> > > > The Linux kernel buddy system? How do we guarantee that other applications
> > > > don't allocate memory from it?
> > > 
> > > Excellent question. Usually, you would online the memory to ZONE_MOVABLE,
> > > such that even if some other allocation ended up there, that it could
> > > get migrated somewhere else.
> > > 
> > > For example, "daxctl reconfigure-device" tries doing that as default:
> > > 
> > > https://pmem.io/ndctl/daxctl-reconfigure-device.html
> > > 
> > > However, I agree that we might actually want to tell the system to not
> > > use this CPU-less node as fallback for other allocations, and that we
> > > might not want to swap out such memory etc.
> > > 
> > > 
> > > But, in the end all that virtio-mem needs to work in the hypervisor is
> > > 
> > > a) A sparse memmap (anonymous RAM, memfd, file)
> > > b) A way to populate memory within that sparse memmap (e.g., on fault,
> > > using madvise(MADV_POPULATE_WRITE), fallocate())
> > > c) A way to discard memory (madvise(MADV_DONTNEED),
> > > fallocate(FALLOC_FL_PUNCH_HOLE))
> > > 
> > > So instead of using anonymous memory+mbind, you can also mmap a sparse file
> > > and rely on populate-on-demand. One alternative for your use case would be
> > > to create a DAX filesystem on that CXL memory (IIRC that should work) and
> > > simply provide virtio-mem with a sparse file located on that filesystem.
> > > 
> > > Of course, you can also use some other mechanism as you might have in
> > > your approach, as long as it supports a,b,c.
> > > 
> > > > 
> > > > > 
> > > > > b) (optional) allocate huge pages on that separate NUMA node.
> > > > > c) Use ordinary memory-device-ram or memory-device-memfd (for huge pages),
> > > > > *binding* the memory backend to that special NUMA node.
> > > > > 
> > > > "-object memory-backend/device-ram or memory-device-memfd, id=mem0, 
> > > > size=768G"
> > > > How to bind backend memory to NUMA node
> > > > 
> > > 
> > > I think the syntax is "policy=bind,host-nodes=X"
> > > 
> > > whereby X is a node mask. So for node "0" you'd use "host-nodes=0x1", for "5"
> > > "host-nodes=0x20", etc.
> > > 
> > > > > 
> > > > > This will dynamically allocate memory from that special NUMA node, 
> > > > > resulting
> > > > > in the virtio-mem device completely being backed by that device 
> > > > > memory,
> > > > > being able to dynamically resize the memory allocation.
> > > > > 
> > > > > 
> > > > > Exposing an actual devdax to the virtio-mem device, shared by 
> > > > > multiple VMs
> > > > > isn't really what we want and won't work without major design 
> > > > > changes. Also,
> > > > > I'm not so sure it's a very clean design: exposing memo

Re: [PATCH] hw/misc: Add a virtual pci device to dynamically attach memory to QEMU

2021-10-13 Thread david.dai
On Mon, Oct 11, 2021 at 09:43:53AM +0200, David Hildenbrand (da...@redhat.com) 
wrote:
> 
> 
> 
> > > virtio-mem currently relies on having a single sparse memory region (anon
> > > mmap, mmaped file, mmaped huge pages, mmap shmem) per VM. Although we can
> > > share memory with other processes, sharing with other VMs is not intended.
> > > Instead of actually mmaping parts dynamically (which can be quite
> > > expensive), virtio-mem relies on punching holes into the backend and
> > > dynamically allocating memory/file blocks/... on access.
> > > 
> > > So the easy way to make it work is:
> > > 
> > > a) Exposing the CXL memory to the buddy via dax/kmem, resulting in device
> > > memory getting managed by the buddy on a separate NUMA node.
> > > 
> > 
> > The Linux kernel buddy system? How do we guarantee that other applications
> > don't allocate memory from it?
> 
> Excellent question. Usually, you would online the memory to ZONE_MOVABLE,
> such that even if some other allocation ended up there, that it could
> get migrated somewhere else.
> 
> For example, "daxctl reconfigure-device" tries doing that as default:
> 
> https://pmem.io/ndctl/daxctl-reconfigure-device.html
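> 
> A typical invocation (the device name dax0.0 is just a placeholder) boils
> down to something like:
> 
>   # hand the devdax device to kmem and online it as system RAM (movable)
>   daxctl reconfigure-device --mode=system-ram dax0.0
>   # the memory then shows up as a new CPU-less NUMA node
>   numactl --hardware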
> 
> However, I agree that we might actually want to tell the system to not
> use this CPU-less node as fallback for other allocations, and that we
> might not want to swap out such memory etc.
> 
> 
> But, in the end all that virtio-mem needs to work in the hypervisor is
> 
> a) A sparse memmap (anonymous RAM, memfd, file)
> b) A way to populate memory within that sparse memmap (e.g., on fault,
>    using madvise(MADV_POPULATE_WRITE), fallocate())
> c) A way to discard memory (madvise(MADV_DONTNEED),
>    fallocate(FALLOC_FL_PUNCH_HOLE))
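> 
> As a rough sketch of a)-c) for an anonymous backing (illustrative only, no
> error handling; MADV_POPULATE_WRITE needs Linux 5.14+ and recent headers):
> 
>   #include <stddef.h>
>   #include <sys/mman.h>
> 
>   /* a) reserve a large sparse region; nothing is populated yet */
>   static void *reserve_sparse(size_t size)
>   {
>       return mmap(NULL, size, PROT_READ | PROT_WRITE,
>                   MAP_PRIVATE | MAP_ANONYMOUS | MAP_NORESERVE, -1, 0);
>   }
> 
>   /* b) preallocate backing memory for one block before exposing it */
>   static int populate_block(void *region, size_t off, size_t len)
>   {
>       return madvise((char *)region + off, len, MADV_POPULATE_WRITE);
>   }
> 
>   /* c) discard a block again, returning the backing memory to the host */
>   static int discard_block(void *region, size_t off, size_t len)
>   {
>       return madvise((char *)region + off, len, MADV_DONTNEED);
>   }
> 
> For a file or memfd backing, b) maps to fallocate(fd, 0, off, len) and c) to
> fallocate(fd, FALLOC_FL_PUNCH_HOLE | FALLOC_FL_KEEP_SIZE, off, len).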
> 
> So instead of using anonymous memory+mbind, you can also mmap a sparse file
> and rely on populate-on-demand. One alternative for your use case would be
> to create a DAX filesystem on that CXL memory (IIRC that should work) and
> simply provide virtio-mem with a sparse file located on that filesystem.
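> 
> Roughly (device and mount point names are placeholders; XFS needs reflink
> disabled for DAX):
> 
>   mkfs.xfs -m reflink=0 /dev/pmem0
>   mount -o dax /dev/pmem0 /mnt/cxl
>   truncate -s 768G /mnt/cxl/vm0-mem   # sparse file handed to virtio-mem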
> 
> Of course, you can also use some other mechanism as you might have in
> your approach, as long as it supports a,b,c.
> 
> > 
> > > 
> > > b) (optional) allocate huge pages on that separate NUMA node.
> > > c) Use ordinary memory-device-ram or memory-device-memfd (for huge pages),
> > > *binding* the memory backend to that special NUMA node.
> > > 
> > "-object memory-backend/device-ram or memory-device-memfd, id=mem0, 
> > size=768G"
> > How to bind backend memory to NUMA node
> > 
> 
> I think the syntax is "policy=bind,host-nodes=X"
> 
> whereby X is a node mask. So for node "0" you'd use "host-nodes=0x1" for "5"
> "host-nodes=0x20" etc.
> 
> > > 
> > > This will dynamically allocate memory from that special NUMA node, 
> > > resulting
> > > in the virtio-mem device completely being backed by that device memory,
> > > being able to dynamically resize the memory allocation.
> > > 
> > > 
> > > Exposing an actual devdax to the virtio-mem device, shared by multiple VMs
> > > isn't really what we want and won't work without major design changes. 
> > > Also,
> > > I'm not so sure it's a very clean design: exposing memory belonging to 
> > > other
> > > VMs to unrelated QEMU processes. This sounds like a serious security hole:
> > > if you managed to escalate to the QEMU process from inside the VM, you can
> > > access unrelated VM memory quite happily. You want an abstraction
> > > in-between, that makes sure each VM/QEMU process only sees private memory:
> > > for example, the buddy via dax/kmem.
> > > 
> > Hi David
> > Thanks for your suggestion, also sorry for my delayed reply due to my long 
> > vacation.
> > How does current virtio-mem dynamically attach memory to guest, via page 
> > fault?
> 
> Essentially you have a large sparse mmap. Within that mmap, memory is
> populated on demand. Instead of mmap/munmap you perform a single large
> mmap and then dynamically populate memory/discard memory.
> 
> Right now, memory is populated via page faults on access. This is
> sub-optimal when dealing with limited resources (i.e., hugetlbfs,
> file blocks) and you might run out of backend memory.
> 
> I'm working on a "prealloc" mode, which will preallocate/populate memory
> necessary for exposing the next block of memory to the VM, and which
> fails gracefully if preallocation/population fails in the case of such
> limited resources.
> 
> The patch resides on:
>   https://github.com/davidhildenbrand/qemu/tree/virtio-mem-next
> 
> commit ded0e302c14ae1b68bdce9059dcca344e0a5f5f0
> Author: David Hildenbrand 
> Date:   Mon Aug 2 19:51:36 2021 +0200
> 
> virtio-mem: support "prealloc=on" option
> Especially for hugetlb, but also for file-based memory backends, we'd
> like to be able to prealloc memory, especially to make user errors less
> severe: crashing the VM when there are not sufficient huge pages around.
> A common option for hugetlb will be using "reserve=off,prealloc=off" for
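> 
> With that in place, a hugetlb setup might look something like this (a sketch,
> assuming the patch lands as described; IDs and sizes made up):
> 
>   -object memory-backend-memfd,id=mem0,hugetlb=on,size=768G,reserve=off,prealloc=off \
>   -device virtio-mem-pci,id=vmem0,memdev=mem0,node=0,prealloc=on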
> 

Re: [PATCH] hw/misc: Add a virtual pci device to dynamically attach memory to QEMU

2021-10-09 Thread david.dai
On Thu, Sep 30, 2021 at 12:33:30PM +0200, David Hildenbrand (da...@redhat.com) 
wrote:
> 
> 
> On 30.09.21 11:40, david.dai wrote:
> > On Wed, Sep 29, 2021 at 11:30:53AM +0200, David Hildenbrand 
> > (da...@redhat.com) wrote:
> > > 
> > > On 27.09.21 14:28, david.dai wrote:
> > > > On Mon, Sep 27, 2021 at 11:07:43AM +0200, David Hildenbrand 
> > > > (da...@redhat.com) wrote:
> > > > > 
> > > > > 
> > > > > On 27.09.21 10:27, Stefan Hajnoczi wrote:
> > > > > > On Sun, Sep 26, 2021 at 10:16:14AM +0800, David Dai wrote:
> > > > > > > Add a virtual PCI device to QEMU. The PCI device is used to
> > > > > > > dynamically attach memory to a VM, so a driver in the guest can
> > > > > > > request host memory on the fly without the help of virtualization
> > > > > > > management software, such as libvirt/manager. The attached memory is
> > > > > 
> > > > > We do have virtio-mem to dynamically attach memory to a VM. It could 
> > > > > be
> > > > > extended by a mechanism for the VM to request more/less memory, that's
> > > > > already a planned feature. But yeah, virtio-mem memory is exposed as
> > > > > ordinary system RAM, not only via a BAR to mostly be managed by user 
> > > > > space
> > > > > completely.
> > > 
> > > There is a virtio-pmem spec proposal to expose the memory region via a PCI
> > > BAR. We could do something similar for virtio-mem, however, we would have 
> > > to
> > > wire that new model up differently in QEMU (it's no longer a "memory 
> > > device"
> > > like a DIMM then).
> > > 
> > > > > 
> > > > 
> > > > I wish virtio-mem could solve our problem, but it is a dynamic allocation
> > > > mechanism for system RAM in virtualization. In heterogeneous computing
> > > > environments, the attached memory usually comes from a computing device and
> > > > should be managed separately; we don't want the Linux MM to control it.
> > > 
> > > If that heterogeneous memory would have a dedicated node (which usually is
> > > the case IIRC), and you let it be managed by the Linux kernel (dax/kmem), you
> > > can bind the memory backend of virtio-mem to that special NUMA node. So 
> > > all
> > > memory managed by that virtio-mem device would come from that 
> > > heterogeneous
> > > memory.
> > > 
> > 
> > Yes, CXL type 2/3 devices expose memory to the host as a dedicated node; the
> > node is marked as soft_reserved_memory, and dax/kmem can take over the node to
> > create a dax device. This dax device can be regarded as the memory backend of
> > virtio-mem.
> > 
> > I'm not sure whether a dax device can be opened by multiple VMs or host
> > applications.
> 
> virtio-mem currently relies on having a single sparse memory region (anon
> mmap, mmaped file, mmaped huge pages, mmap shmem) per VM. Although we can
> share memory with other processes, sharing with other VMs is not intended.
> Instead of actually mmaping parts dynamically (which can be quite
> expensive), virtio-mem relies on punching holes into the backend and
> dynamically allocating memory/file blocks/... on access.
> 
> So the easy way to make it work is:
> 
> a) Exposing the CXL memory to the buddy via dax/kmem, resulting in device
> memory getting managed by the buddy on a separate NUMA node.
>

The Linux kernel buddy system? How do we guarantee that other applications don't
allocate memory from it?

>
> b) (optional) allocate huge pages on that separate NUMA node.
> c) Use ordinary memory-device-ram or memory-device-memfd (for huge pages),
> *binding* the memory backend to that special NUMA node.
>
 
"-object memory-backend/device-ram or memory-device-memfd, id=mem0, size=768G"
How do we bind the backend memory to a NUMA node?

>
> This will dynamically allocate memory from that special NUMA node, resulting
> in the virtio-mem device completely being backed by that device memory,
> being able to dynamically resize the memory allocation.
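> 
> Wiring that up might look roughly like this (a sketch, IDs and sizes made up):
> 
>   -object memory-backend-memfd,id=mem0,size=768G,policy=bind,host-nodes=1 \
>   -device virtio-mem-pci,id=vmem0,memdev=mem0,node=0,requested-size=0
> 
> and the allocation is then resized at runtime by changing requested-size,
> e.g. "qom-set vmem0 requested-size 16G" via HMP/QMP.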
> 
> 
> Exposing an actual devdax to the virtio-mem device, shared by multiple VMs
> isn't really what we want and won't work without major design changes. Also,
> I'm not so sure it's a very clean design: exposing memory belonging to other
> VMs to unrelated QEMU processes. This sounds like a serious security hole:
> if you managed to escalate to the QEMU process from inside the VM, you can
> access unrelated VM memory quite happily. You want an abstraction
> in-between, that makes sure each VM/QEMU process only sees private memory:
> for example, the buddy via dax/kmem.
> 
Hi David,
Thanks for your suggestion, and sorry for my delayed reply due to my long vacation.
How does the current virtio-mem dynamically attach memory to the guest, via page
faults?

Thanks,
David 


> -- 
> Thanks,
> 
> David / dhildenb
> 
> 





Re: [PATCH] hw/misc: Add a virtual pci device to dynamically attach memory to QEMU

2021-09-30 Thread david.dai
On Wed, Sep 29, 2021 at 11:30:53AM +0200, David Hildenbrand (da...@redhat.com) 
wrote: 
> 
> On 27.09.21 14:28, david.dai wrote:
> > On Mon, Sep 27, 2021 at 11:07:43AM +0200, David Hildenbrand 
> > (da...@redhat.com) wrote:
> > > 
> > > 
> > > On 27.09.21 10:27, Stefan Hajnoczi wrote:
> > > > On Sun, Sep 26, 2021 at 10:16:14AM +0800, David Dai wrote:
> > > > > Add a virtual PCI device to QEMU. The PCI device is used to dynamically
> > > > > attach memory to a VM, so a driver in the guest can request host memory
> > > > > on the fly without the help of virtualization management software, such
> > > > > as libvirt/manager. The attached memory is
> > > 
> > > We do have virtio-mem to dynamically attach memory to a VM. It could be
> > > extended by a mechanism for the VM to request more/less memory, that's
> > > already a planned feature. But yeah, virtio-mem memory is exposed as
> > > ordinary system RAM, not only via a BAR to mostly be managed by user space
> > > completely.
> 
> There is a virtio-pmem spec proposal to expose the memory region via a PCI
> BAR. We could do something similar for virtio-mem, however, we would have to
> wire that new model up differently in QEMU (it's no longer a "memory device"
> like a DIMM then).
> 
> > > 
> > 
> > I wish virtio-mem could solve our problem, but it is a dynamic allocation
> > mechanism for system RAM in virtualization. In heterogeneous computing
> > environments, the attached memory usually comes from a computing device and
> > should be managed separately; we don't want the Linux MM to control it.
> 
> If that heterogeneous memory would have a dedicated node (which usually is
> the case IIRC), and you let it be managed by the Linux kernel (dax/kmem), you
> can bind the memory backend of virtio-mem to that special NUMA node. So all
> memory managed by that virtio-mem device would come from that heterogeneous
> memory.
> 

Yes, CXL type 2/3 devices expose memory to the host as a dedicated node; the node
is marked as soft_reserved_memory, and dax/kmem can take over the node to create a
dax device. This dax device can be regarded as the memory backend of virtio-mem.

I'm not sure whether a dax device can be opened by multiple VMs or host
applications.
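
As an illustration (names are examples), on the host this typically shows up as
a "Soft Reserved" range that the dax drivers claim:

  grep -i "soft reserved" /proc/iomem
  daxctl list -u        # shows the resulting daxX.Y device and its target node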

> You could then further use a separate NUMA node for that virtio-mem device
> inside the VM. But to the VM it would look like System memory with different
> performance characteristics. That would work for some use cases I guess,
> but not sure for which not (I assume you can tell :) ).
> 

If the NUMA node in the guest can be dynamically expanded by virtio-mem, that may
be a good fit, because we will develop our own memory management driver to manage
the device memory.
   
> We could even write an alternative virtio-mem mode, where device memory
> isn't exposed to the buddy but exposed to user space in some different way.
> 
> > > > > isolated from System RAM; it can be used in heterogeneous memory
> > > > > management for virtualization. Multiple VMs dynamically share the same
> > > > > computing device memory without memory overcommit.
> > > 
> > > This sounds a lot like MemExpand/MemLego ... am I right that this is the
> > > original design? I recall that VMs share a memory region and dynamically
> > > agree upon which part of the memory region a VM uses. I further recall 
> > > that
> > > there were malloc() hooks that would dynamically allocate such memory in
> > > user space from the shared memory region.
> > > 
> > 
> > Thank you for telling me about MemExpand/MemLego; I have carefully read the
> > paper. Some of its ideas are the same as in this patch, such as the software
> > model and stack, but it may have a security risk in that the whole shared
> > memory is visible to all VMs.
> 
> How will you make sure that not all shared memory can be accessed by the
> other VMs? IOW, emulate !shared memory on shared memory?
> 
> > --------------------------
> >         application
> > --------------------------
> >  memory management driver
> > --------------------------
> >         pci driver
> > --------------------------
> >     virtual pci device
> > --------------------------
> > 
> > > I can see some use cases f

Re: [PATCH] hw/misc: Add a virtual pci device to dynamically attach memory to QEMU

2021-09-27 Thread david.dai
On Mon, Sep 27, 2021 at 11:07:43AM +0200, David Hildenbrand (da...@redhat.com) 
wrote:
> 
> 
> On 27.09.21 10:27, Stefan Hajnoczi wrote:
> > On Sun, Sep 26, 2021 at 10:16:14AM +0800, David Dai wrote:
> > > Add a virtual PCI device to QEMU. The PCI device is used to dynamically
> > > attach memory to a VM, so a driver in the guest can request host memory on
> > > the fly without the help of virtualization management software, such as
> > > libvirt/manager. The attached memory is
> 
> We do have virtio-mem to dynamically attach memory to a VM. It could be
> extended by a mechanism for the VM to request more/less memory, that's
> already a planned feature. But yeah, virtio-mem memory is exposed as
> ordinary system RAM, not only via a BAR to mostly be managed by user space
> completely.
>

I wish virtio-mem could solve our problem, but it is a dynamic allocation
mechanism for system RAM in virtualization. In heterogeneous computing
environments, the attached memory usually comes from a computing device and
should be managed separately; we don't want the Linux MM to control it.
 
> > > isolated from System RAM; it can be used in heterogeneous memory management
> > > for virtualization. Multiple VMs dynamically share the same computing device
> > > memory without memory overcommit.
> 
> This sounds a lot like MemExpand/MemLego ... am I right that this is the
> original design? I recall that VMs share a memory region and dynamically
> agree upon which part of the memory region a VM uses. I further recall that
> there were malloc() hooks that would dynamically allocate such memory in
> user space from the shared memory region.
>

Thank you for telling me about MemExpand/MemLego; I have carefully read the paper.
Some of its ideas are the same as in this patch, such as the software model and
stack, but it may have a security risk in that the whole shared memory is visible
to all VMs.
--------------------------
        application
--------------------------
 memory management driver
--------------------------
        pci driver
--------------------------
    virtual pci device
--------------------------

> I can see some use cases for it, although the shared memory design isn't
> what you typically want in most VM environments.
>

The original design of this patch is to share a computing device among multiple
VMs. Each VM runs a computing application (for example, an OpenCL application), and
our computing device can support a few applications in parallel. In addition, it
supports SVM (shared virtual memory) via IOMMU/ATS/PASID/PRI. The device exposes its
memory to the host via a PCIe BAR or CXL.mem; the host constructs a memory pool to
uniformly manage the device memory, then attaches device memory to a VM via a
virtual PCI device. But we don't know how much memory should be assigned when
creating the VM, so we want memory to be attached to the VM on demand. The driver
in the guest triggers memory attaching, not outside virtualization management
software. So the original requirements are:
1> The managed memory comes from the device; it should be isolated from system RAM
2> The memory can be dynamically attached to the VM on the fly
3> The attached memory supports SVM and DMA operations with the IOMMU

Thank you very much. 


Best Regards,
David Dai

> -- 
> Thanks,
> 
> David / dhildenb
> 
> 





Re: [PATCH] hw/misc: Add a virtual pci device to dynamically attach memory to QEMU

2021-09-27 Thread david.dai
On Mon, Sep 27, 2021 at 10:27:06AM +0200, Stefan Hajnoczi (stefa...@redhat.com) 
wrote:
> On Sun, Sep 26, 2021 at 10:16:14AM +0800, David Dai wrote:
> > Add a virtual PCI device to QEMU. The PCI device is used to dynamically attach
> > memory to a VM, so a driver in the guest can request host memory on the fly
> > without the help of virtualization management software, such as libvirt/manager.
> > The attached memory is isolated from System RAM; it can be used in heterogeneous
> > memory management for virtualization. Multiple VMs dynamically share the same
> > computing device memory without memory overcommit.
> > 
> > Signed-off-by: David Dai 
> 
> CCing David Hildenbrand (virtio-balloon and virtio-mem) and Igor
> Mammedov (host memory backend).
> 
> > ---
> >  docs/devel/dynamic_mdev.rst | 122 ++
> >  hw/misc/Kconfig |   5 +
> >  hw/misc/dynamic_mdev.c  | 456 
> >  hw/misc/meson.build |   1 +
> >  4 files changed, 584 insertions(+)
> >  create mode 100644 docs/devel/dynamic_mdev.rst
> >  create mode 100644 hw/misc/dynamic_mdev.c
> > 
> > diff --git a/docs/devel/dynamic_mdev.rst b/docs/devel/dynamic_mdev.rst
> > new file mode 100644
> > index 00..8e2edb6600
> > --- /dev/null
> > +++ b/docs/devel/dynamic_mdev.rst
> > @@ -0,0 +1,122 @@
> > +Motivation:
> > +In heterogeneous computing system, accelorator generally exposes its device
> 
> s/accelorator/accelerator/
> 
> (There are missing articles and small grammar tweaks that could be made,
> but I'm skipping the English language stuff for now.)
> 

Thank you for your review.

> > +memory to host via PCIe and CXL.mem(Compute Express Link) to share memory
> > +between host and device, and these memory generally are uniformly managed 
> > by
> > +host, they are called HDM (host managed device memory), further SVA (share
> > +virtual address) can be achieved on this base. One computing device may be 
> > used
> 
> Is this Shared Virtual Addressing (SVA) (also known as Shared Virtual
> Memory)? If yes, please use the exact name ("Shared Virtual Addressing",
> not "share virtual address") so that's clear and the reader can easily
> find out more information through a web search.
>
 
Yes, you are right.

> > +by multiple virtual machines if it supports SRIOV, to efficiently use 
> > device
> > +memory in virtualization, each VM allocates device memory on-demand without
> > +overcommit, but how to dynamically attach host memory resource to VM. A 
> > virtual
> 
> I cannot parse this sentence. Can you rephrase it and/or split it into
> multiple sentences?
> 
> > +PCI device, dynamic_mdev, is introduced to achieve this target. 
> > dynamic_mdev
> 
> I suggest calling it "memdev" instead of "mdev" to prevent confusion
> with VFIO mdev.
>

I agree with your suggestion.
I will make changes according to your comments in the new patch.

> > +has a big bar space which size can be assigned by user when creating VM, 
> > the
> > +bar doesn't have backend memory at initialization stage, later driver in 
> > guest
> > +triggers QEMU to map host memory to the bar space. how much memory, when 
> > and
> > +where memory will be mapped to are determined by guest driver, after device
> > +memory has been attached to the virtual PCI bar, application in guest can
> > +access device memory by the virtual PCI bar. Memory allocation and 
> > negotiation
> > +are left to guest driver and memory backend implementation. dynamic_mdev 
> > is a
> > +mechanism which provides significant benefits to heterogeneous memory
> > +virtualization.
> 
> David and Igor: please review this design. I'm not familiar enough with
> the various memory hotplug and ballooning mechanisms to give feedback on
> this.
> 
> > +Implementation:
> > +dynamic_mdev device has two bars, bar0 and bar2. bar0 is a 32-bit register 
> > bar
> > +which used to host control register for control and message communication, 
> > Bar2
> > +is a 64-bit mmio bar, which is used to attach host memory to, the bar size 
> > can
> > +be assigned via parameter when creating VM. Host memory is attached to 
> > this bar
> > +via mmap API.
> > +
> > +
> > +  VM1   VM2
> > + -----
> > +|  application  |  | application  |
> > +|   |  |  |
> > +|---|  |--|
> > +| guest driver  |  | guest driver |
> > +|   |--||  |   | -|   |
> > +|   | pci mem bar  ||  |   | pci mem bar  |   |
> > + ---|--|-   ---|--|---
> > +    --- --   --
> > +|| |   |   |  | |  |
> > +    --- --   --
> > +\  /
> > + \/
> > +  \  /
> > +   \