[virtio-dev] Re: [RFC] vhost: introduce mdev based hardware vhost backend
On Tue, Apr 10, 2018 at 10:52:52AM +0800, Jason Wang wrote: > On 2018年04月02日 23:23, Tiwei Bie wrote: > > This patch introduces a mdev (mediated device) based hardware > > vhost backend. This backend is an abstraction of the various > > hardware vhost accelerators (potentially any device that uses > > virtio ring can be used as a vhost accelerator). Some generic > > mdev parent ops are provided for accelerator drivers to support > > generating mdev instances. > > > > What's this > > === > > > > The idea is that we can setup a virtio ring compatible device > > with the messages available at the vhost-backend. Originally, > > these messages are used to implement a software vhost backend, > > but now we will use these messages to setup a virtio ring > > compatible hardware device. Then the hardware device will be > > able to work with the guest virtio driver in the VM just like > > what the software backend does. That is to say, we can implement > > a hardware based vhost backend in QEMU, and any virtio ring > > compatible devices potentially can be used with this backend. > > (We also call it vDPA -- vhost Data Path Acceleration). > > > > One problem is that, different virtio ring compatible devices > > may have different device interfaces. That is to say, we will > > need different drivers in QEMU. It could be troublesome. And > > that's what this patch trying to fix. The idea behind this > > patch is very simple: mdev is a standard way to emulate device > > in kernel. > > So you just move the abstraction layer from qemu to kernel, and you still > need different drivers in kernel for different device interfaces of > accelerators. This looks even more complex than leaving it in qemu. As you > said, another idea is to implement userspace vhost backend for accelerators > which seems easier and could co-work with other parts of qemu without > inventing new type of messages. I'm not quite sure. Do you think it's acceptable to add various vendor specific hardware drivers in QEMU? > > Need careful thought here to seek a best solution here. Yeah, definitely! :) And your opinions would be very helpful! > > > So we defined a standard device based on mdev, which > > is able to accept vhost messages. When the mdev emulation code > > (i.e. the generic mdev parent ops provided by this patch) gets > > vhost messages, it will parse and deliver them to accelerator > > drivers. Drivers can use these messages to setup accelerators. > > > > That is to say, the generic mdev parent ops (e.g. read()/write()/ > > ioctl()/...) will be provided for accelerator drivers to register > > accelerators as mdev parent devices. And each accelerator device > > will support generating standard mdev instance(s). > > > > With this standard device interface, we will be able to just > > develop one userspace driver to implement the hardware based > > vhost backend in QEMU. > > > > Difference between vDPA and PCI passthru > > > > > > The key difference between vDPA and PCI passthru is that, in > > vDPA only the data path of the device (e.g. DMA ring, notify > > region and queue interrupt) is pass-throughed to the VM, the > > device control path (e.g. PCI configuration space and MMIO > > regions) is still defined and emulated by QEMU. > > > > The benefits of keeping virtio device emulation in QEMU compared > > with virtio device PCI passthru include (but not limit to): > > > > - consistent device interface for guest OS in the VM; > > - max flexibility on the hardware design, especially the > >accelerator for each vhost backend doesn't have to be a > >full PCI device; > > - leveraging the existing virtio live-migration framework; > > > > The interface of this mdev based device > > === > > > > 1. BAR0 > > > > The MMIO region described by BAR0 is the main control > > interface. Messages will be written to or read from > > this region. > > > > The message type is determined by the `request` field > > in message header. The message size is encoded in the > > message header too. The message format looks like this: > > > > struct vhost_vfio_op { > > __u64 request; > > __u32 flags; > > /* Flag values: */ > > #define VHOST_VFIO_NEED_REPLY 0x1 /* Whether need reply */ > > __u32 size; > > union { > > __u64 u64; > > struct vhost_vring_state state; > > struct vhost_vring_addr addr; > > struct vhost_memory memory; > > } payload; > > }; > > > > The existing vhost-kernel ioctl cmds are reused as > > the message requests in above structure. > > > > Each message will be written to or read from this > > region at offset 0: > > > > int vhost_vfio_write(struct vhost_dev *dev, struct vhost_vfio_op *op) > > { > > int count = VHOST_VFIO_OP_HDR_SIZE + op->size; > > struct vhost_vfio *vfio = dev->opaque; > > int ret; > > > > ret = pwrite64(vfio->device_fd, op, count,
[virtio-dev] Re: [RFC] vhost: introduce mdev based hardware vhost backend
On 2018年04月02日 23:23, Tiwei Bie wrote: This patch introduces a mdev (mediated device) based hardware vhost backend. This backend is an abstraction of the various hardware vhost accelerators (potentially any device that uses virtio ring can be used as a vhost accelerator). Some generic mdev parent ops are provided for accelerator drivers to support generating mdev instances. What's this === The idea is that we can setup a virtio ring compatible device with the messages available at the vhost-backend. Originally, these messages are used to implement a software vhost backend, but now we will use these messages to setup a virtio ring compatible hardware device. Then the hardware device will be able to work with the guest virtio driver in the VM just like what the software backend does. That is to say, we can implement a hardware based vhost backend in QEMU, and any virtio ring compatible devices potentially can be used with this backend. (We also call it vDPA -- vhost Data Path Acceleration). One problem is that, different virtio ring compatible devices may have different device interfaces. That is to say, we will need different drivers in QEMU. It could be troublesome. And that's what this patch trying to fix. The idea behind this patch is very simple: mdev is a standard way to emulate device in kernel. So you just move the abstraction layer from qemu to kernel, and you still need different drivers in kernel for different device interfaces of accelerators. This looks even more complex than leaving it in qemu. As you said, another idea is to implement userspace vhost backend for accelerators which seems easier and could co-work with other parts of qemu without inventing new type of messages. Need careful thought here to seek a best solution here. So we defined a standard device based on mdev, which is able to accept vhost messages. When the mdev emulation code (i.e. the generic mdev parent ops provided by this patch) gets vhost messages, it will parse and deliver them to accelerator drivers. Drivers can use these messages to setup accelerators. That is to say, the generic mdev parent ops (e.g. read()/write()/ ioctl()/...) will be provided for accelerator drivers to register accelerators as mdev parent devices. And each accelerator device will support generating standard mdev instance(s). With this standard device interface, we will be able to just develop one userspace driver to implement the hardware based vhost backend in QEMU. Difference between vDPA and PCI passthru The key difference between vDPA and PCI passthru is that, in vDPA only the data path of the device (e.g. DMA ring, notify region and queue interrupt) is pass-throughed to the VM, the device control path (e.g. PCI configuration space and MMIO regions) is still defined and emulated by QEMU. The benefits of keeping virtio device emulation in QEMU compared with virtio device PCI passthru include (but not limit to): - consistent device interface for guest OS in the VM; - max flexibility on the hardware design, especially the accelerator for each vhost backend doesn't have to be a full PCI device; - leveraging the existing virtio live-migration framework; The interface of this mdev based device === 1. BAR0 The MMIO region described by BAR0 is the main control interface. Messages will be written to or read from this region. The message type is determined by the `request` field in message header. The message size is encoded in the message header too. The message format looks like this: struct vhost_vfio_op { __u64 request; __u32 flags; /* Flag values: */ #define VHOST_VFIO_NEED_REPLY 0x1 /* Whether need reply */ __u32 size; union { __u64 u64; struct vhost_vring_state state; struct vhost_vring_addr addr; struct vhost_memory memory; } payload; }; The existing vhost-kernel ioctl cmds are reused as the message requests in above structure. Each message will be written to or read from this region at offset 0: int vhost_vfio_write(struct vhost_dev *dev, struct vhost_vfio_op *op) { int count = VHOST_VFIO_OP_HDR_SIZE + op->size; struct vhost_vfio *vfio = dev->opaque; int ret; ret = pwrite64(vfio->device_fd, op, count, vfio->bar0_offset); if (ret != count) return -1; return 0; } int vhost_vfio_read(struct vhost_dev *dev, struct vhost_vfio_op *op) { int count = VHOST_VFIO_OP_HDR_SIZE + op->size; struct vhost_vfio *vfio = dev->opaque; uint64_t request = op->request; int ret; ret = pread64(vfio->device_fd, op, count, vfio->bar0_offset); if (ret != count || request != op->request) return -1; return 0; } It's quite straightforward to set things to the device. Just need to write the message to
[virtio-dev] [PATCH v32 4/4] virtio-balloon: VIRTIO_BALLOON_F_PAGE_POISON
The VIRTIO_BALLOON_F_PAGE_POISON feature bit is used to indicate if the guest is using page poisoning. Guest writes to the poison_val config field to tell host about the page poisoning value in use. Signed-off-by: Wei WangSuggested-by: Michael S. Tsirkin Cc: Michael S. Tsirkin Cc: Michal Hocko Cc: Andrew Morton --- drivers/virtio/virtio_balloon.c | 10 ++ include/uapi/linux/virtio_balloon.h | 3 +++ 2 files changed, 13 insertions(+) diff --git a/drivers/virtio/virtio_balloon.c b/drivers/virtio/virtio_balloon.c index 197f865..aea91ab 100644 --- a/drivers/virtio/virtio_balloon.c +++ b/drivers/virtio/virtio_balloon.c @@ -730,6 +730,7 @@ static struct file_system_type balloon_fs = { static int virtballoon_probe(struct virtio_device *vdev) { struct virtio_balloon *vb; + __u32 poison_val; int err; if (!vdev->config->get) { @@ -775,6 +776,11 @@ static int virtballoon_probe(struct virtio_device *vdev) vb->stop_cmd_id = cpu_to_virtio32(vb->vdev, VIRTIO_BALLOON_FREE_PAGE_REPORT_STOP_ID); INIT_WORK(>report_free_page_work, report_free_page_func); + if (virtio_has_feature(vdev, VIRTIO_BALLOON_F_PAGE_POISON)) { + memset(_val, PAGE_POISON, sizeof(poison_val)); + virtio_cwrite(vb->vdev, struct virtio_balloon_config, + poison_val, _val); + } } vb->nb.notifier_call = virtballoon_oom_notify; @@ -893,6 +899,9 @@ static int virtballoon_restore(struct virtio_device *vdev) static int virtballoon_validate(struct virtio_device *vdev) { + if (!page_poisoning_enabled()) + __virtio_clear_bit(vdev, VIRTIO_BALLOON_F_PAGE_POISON); + __virtio_clear_bit(vdev, VIRTIO_F_IOMMU_PLATFORM); return 0; } @@ -902,6 +911,7 @@ static unsigned int features[] = { VIRTIO_BALLOON_F_STATS_VQ, VIRTIO_BALLOON_F_DEFLATE_ON_OOM, VIRTIO_BALLOON_F_FREE_PAGE_HINT, + VIRTIO_BALLOON_F_PAGE_POISON, }; static struct virtio_driver virtio_balloon_driver = { diff --git a/include/uapi/linux/virtio_balloon.h b/include/uapi/linux/virtio_balloon.h index b2d86c2..8b93581 100644 --- a/include/uapi/linux/virtio_balloon.h +++ b/include/uapi/linux/virtio_balloon.h @@ -35,6 +35,7 @@ #define VIRTIO_BALLOON_F_STATS_VQ 1 /* Memory Stats virtqueue */ #define VIRTIO_BALLOON_F_DEFLATE_ON_OOM2 /* Deflate balloon on OOM */ #define VIRTIO_BALLOON_F_FREE_PAGE_HINT3 /* VQ to report free pages */ +#define VIRTIO_BALLOON_F_PAGE_POISON 4 /* Guest is using page poisoning */ /* Size of a PFN in the balloon interface. */ #define VIRTIO_BALLOON_PFN_SHIFT 12 @@ -47,6 +48,8 @@ struct virtio_balloon_config { __u32 actual; /* Free page report command id, readonly by guest */ __u32 free_page_report_cmd_id; + /* Stores PAGE_POISON if page poisoning is in use */ + __u32 poison_val; }; #define VIRTIO_BALLOON_S_SWAP_IN 0 /* Amount of memory swapped in */ -- 2.7.4 - To unsubscribe, e-mail: virtio-dev-unsubscr...@lists.oasis-open.org For additional commands, e-mail: virtio-dev-h...@lists.oasis-open.org
[virtio-dev] [PATCH v32 3/4] mm/page_poison: expose page_poisoning_enabled to kernel modules
In some usages, e.g. virtio-balloon, a kernel module needs to know if page poisoning is in use. This patch exposes the page_poisoning_enabled function to kernel modules. Signed-off-by: Wei WangCc: Andrew Morton Cc: Michal Hocko Cc: Michael S. Tsirkin Acked-by: Andrew Morton --- mm/page_poison.c | 6 ++ 1 file changed, 6 insertions(+) diff --git a/mm/page_poison.c b/mm/page_poison.c index e83fd44..762b472 100644 --- a/mm/page_poison.c +++ b/mm/page_poison.c @@ -17,6 +17,11 @@ static int early_page_poison_param(char *buf) } early_param("page_poison", early_page_poison_param); +/** + * page_poisoning_enabled - check if page poisoning is enabled + * + * Return true if page poisoning is enabled, or false if not. + */ bool page_poisoning_enabled(void) { /* @@ -29,6 +34,7 @@ bool page_poisoning_enabled(void) (!IS_ENABLED(CONFIG_ARCH_SUPPORTS_DEBUG_PAGEALLOC) && debug_pagealloc_enabled())); } +EXPORT_SYMBOL_GPL(page_poisoning_enabled); static void poison_page(struct page *page) { -- 2.7.4 - To unsubscribe, e-mail: virtio-dev-unsubscr...@lists.oasis-open.org For additional commands, e-mail: virtio-dev-h...@lists.oasis-open.org
[virtio-dev] [PATCH v32 1/4] mm: support reporting free page blocks
This patch adds support to walk through the free page blocks in the system and report them via a callback function. Some page blocks may leave the free list after zone->lock is released, so it is the caller's responsibility to either detect or prevent the use of such pages. One use example of this patch is to accelerate live migration by skipping the transfer of free pages reported from the guest. A popular method used by the hypervisor to track which part of memory is written during live migration is to write-protect all the guest memory. So, those pages that are reported as free pages but are written after the report function returns will be captured by the hypervisor, and they will be added to the next round of memory transfer. Signed-off-by: Wei WangSigned-off-by: Liang Li Cc: Michal Hocko Cc: Andrew Morton Cc: Michael S. Tsirkin Acked-by: Michal Hocko --- include/linux/mm.h | 6 mm/page_alloc.c| 97 ++ 2 files changed, 103 insertions(+) diff --git a/include/linux/mm.h b/include/linux/mm.h index f945dff..4698df1 100644 --- a/include/linux/mm.h +++ b/include/linux/mm.h @@ -1951,6 +1951,12 @@ extern void free_area_init_node(int nid, unsigned long * zones_size, unsigned long zone_start_pfn, unsigned long *zholes_size); extern void free_initmem(void); +extern int walk_free_mem_block(void *opaque, + int min_order, + int (*report_pfn_range)(void *opaque, + unsigned long pfn, + unsigned long num)); + /* * Free reserved pages within range [PAGE_ALIGN(start), end & PAGE_MASK) * into the buddy system. The freed pages will be poisoned with pattern diff --git a/mm/page_alloc.c b/mm/page_alloc.c index 4ea0182..83e50fd 100644 --- a/mm/page_alloc.c +++ b/mm/page_alloc.c @@ -4912,6 +4912,103 @@ void show_free_areas(unsigned int filter, nodemask_t *nodemask) show_swap_cache_info(); } +/* + * Walk through a free page list and report the found pfn range via the + * callback. + * + * Return 0 if it completes the reporting. Otherwise, return the non-zero + * value returned from the callback. + */ +static int walk_free_page_list(void *opaque, + struct zone *zone, + int order, + enum migratetype mt, + int (*report_pfn_range)(void *, + unsigned long, + unsigned long)) +{ + struct page *page; + struct list_head *list; + unsigned long pfn, flags; + int ret = 0; + + spin_lock_irqsave(>lock, flags); + list = >free_area[order].free_list[mt]; + list_for_each_entry(page, list, lru) { + pfn = page_to_pfn(page); + ret = report_pfn_range(opaque, pfn, 1 << order); + if (ret) + break; + } + spin_unlock_irqrestore(>lock, flags); + + return ret; +} + +/** + * walk_free_mem_block - Walk through the free page blocks in the system + * @opaque: the context passed from the caller + * @min_order: the minimum order of free lists to check + * @report_pfn_range: the callback to report the pfn range of the free pages + * + * If the callback returns a non-zero value, stop iterating the list of free + * page blocks. Otherwise, continue to report. + * + * Please note that there are no locking guarantees for the callback and + * that the reported pfn range might be freed or disappear after the + * callback returns so the caller has to be very careful how it is used. + * + * The callback itself must not sleep or perform any operations which would + * require any memory allocations directly (not even GFP_NOWAIT/GFP_ATOMIC) + * or via any lock dependency. It is generally advisable to implement + * the callback as simple as possible and defer any heavy lifting to a + * different context. + * + * There is no guarantee that each free range will be reported only once + * during one walk_free_mem_block invocation. + * + * pfn_to_page on the given range is strongly discouraged and if there is + * an absolute need for that make sure to contact MM people to discuss + * potential problems. + * + * The function itself might sleep so it cannot be called from atomic + * contexts. + * + * In general low orders tend to be very volatile and so it makes more + * sense to query larger ones first for various optimizations which like + * ballooning etc... This will reduce the overhead as well. + * + * Return 0 if it completes the reporting. Otherwise, return the non-zero + * value returned from the callback. + */ +int walk_free_mem_block(void
[virtio-dev] Re: [RFC PATCH 2/3] netdev: kernel-only IFF_HIDDEN netdevice
On Mon, Apr 9, 2018 at 4:03 PM, Stephen Hemmingerwrote: > On Mon, 9 Apr 2018 15:30:42 -0700 > Siwei Liu wrote: > >> On Mon, Apr 9, 2018 at 3:15 PM, Andrew Lunn wrote: >> >> No, implementation wise I'd avoid changing the class on the fly. What >> >> I'm looking to is a means to add a secondary class or class aliasing >> >> mechanism for netdevs that allows mapping for a kernel device >> >> namespace (/class/net-kernel) to userspace (/class/net). Imagine >> >> creating symlinks between these two namespaces as an analogy. All >> >> userspace visible netdevs today will have both a kernel name and a >> >> userspace visible name, having one (/class/net) referecing the other >> >> (/class/net-kernel) in its own namespace. The newly introduced >> >> IFF_AUTO_MANAGED device will have a kernel name only >> >> (/class/net-kernel). As a result, the existing applications using >> >> /class/net don't break, while we're adding the kernel namespace that >> >> allows IFF_AUTO_MANAGED devices which will not be exposed to userspace >> >> at all. >> > >> > My gut feeling is this whole scheme will not fly. You really should be >> > talking to GregKH. >> >> Will do. Before spreading it out loudly I'd run it within netdev to >> clarify the need for why not exposing the lower netdevs is critical >> for cloud service providers in the face of introducing a new feature, >> and we are not hiding anything but exposing it in a way that don't >> break existing userspace applications while introducing feature is >> possible with the limitation of keeping old userspace still. >> >> > >> > Anyway, please remember that IFF_AUTO_MANAGED will need to be dynamic. >> > A device can start out as a normal device, and will change to being >> > automatic later, when the user on top of it probes. >> >> Sure. In whatever form it's still a netdev, and changing the namespace >> should be more dynamic than changing the class. >> >> Thanks a lot, >> -Siwei >> >> > >> > Andrew > > Also, remember for netdev's /sys is really a third class API. > The primary API's are netlink and ioctl. Also why not use existing > network namespaces rather than inventing a new abstraction? Because we want to leave old userspace unmodified while making SR-IOV live migration transparent to users. Specifically, we'd want old udevd to skip through uevents for the lower netdevs, while also making new udevd able to name the bypass_master interface by referencing the pci slot information which is only present in the sysfs entry for IFF_AUTO_MANAGED net device. The problem of using network namespace is that, no sysfs entry will be around for IFF_AUTO_MANAGED netdev if we isolate it out to a separate netns, unless we generalize the scope for what netns is designed for (isolation I mean). For auto-managed netdevs we don't neccessarily wants strict isolation, but rather a way of sticking to 1-netdev model for strict backward compatibility requiement of the old userspace, while exposing the information in a way new userspace understands. Thanks, -Siwei - To unsubscribe, e-mail: virtio-dev-unsubscr...@lists.oasis-open.org For additional commands, e-mail: virtio-dev-h...@lists.oasis-open.org
[virtio-dev] Re: [RFC PATCH 2/3] netdev: kernel-only IFF_HIDDEN netdevice
On Mon, Apr 9, 2018 at 3:15 PM, Andrew Lunnwrote: >> No, implementation wise I'd avoid changing the class on the fly. What >> I'm looking to is a means to add a secondary class or class aliasing >> mechanism for netdevs that allows mapping for a kernel device >> namespace (/class/net-kernel) to userspace (/class/net). Imagine >> creating symlinks between these two namespaces as an analogy. All >> userspace visible netdevs today will have both a kernel name and a >> userspace visible name, having one (/class/net) referecing the other >> (/class/net-kernel) in its own namespace. The newly introduced >> IFF_AUTO_MANAGED device will have a kernel name only >> (/class/net-kernel). As a result, the existing applications using >> /class/net don't break, while we're adding the kernel namespace that >> allows IFF_AUTO_MANAGED devices which will not be exposed to userspace >> at all. > > My gut feeling is this whole scheme will not fly. You really should be > talking to GregKH. Will do. Before spreading it out loudly I'd run it within netdev to clarify the need for why not exposing the lower netdevs is critical for cloud service providers in the face of introducing a new feature, and we are not hiding anything but exposing it in a way that don't break existing userspace applications while introducing feature is possible with the limitation of keeping old userspace still. > > Anyway, please remember that IFF_AUTO_MANAGED will need to be dynamic. > A device can start out as a normal device, and will change to being > automatic later, when the user on top of it probes. Sure. In whatever form it's still a netdev, and changing the namespace should be more dynamic than changing the class. Thanks a lot, -Siwei > > Andrew - To unsubscribe, e-mail: virtio-dev-unsubscr...@lists.oasis-open.org For additional commands, e-mail: virtio-dev-h...@lists.oasis-open.org
[virtio-dev] Re: [RFC PATCH 2/3] netdev: kernel-only IFF_HIDDEN netdevice
On Fri, Apr 6, 2018 at 8:19 PM, Andrew Lunnwrote: > Hi Siwei > >> I think everyone seems to agree not to fiddle with the ":" prefix, but >> rather have a new class of network subsystem under /sys/class thus a >> separate device namespace e.g. /sys/class/net-kernel for those >> auto-managed lower netdevs is needed. > > How do you get a device into this new class? I don't know the Linux > driver model too well, but to get a device out of one class and into > another, i think you need to device_del(dev). modify dev->class and > then device_add(dev). However, device_add() says you are not allowed > to do this. No, implementation wise I'd avoid changing the class on the fly. What I'm looking to is a means to add a secondary class or class aliasing mechanism for netdevs that allows mapping for a kernel device namespace (/class/net-kernel) to userspace (/class/net). Imagine creating symlinks between these two namespaces as an analogy. All userspace visible netdevs today will have both a kernel name and a userspace visible name, having one (/class/net) referecing the other (/class/net-kernel) in its own namespace. The newly introduced IFF_AUTO_MANAGED device will have a kernel name only (/class/net-kernel). As a result, the existing applications using /class/net don't break, while we're adding the kernel namespace that allows IFF_AUTO_MANAGED devices which will not be exposed to userspace at all. As this requires changing the internals of driver model core it's a rather big hammer approach I'd think. If there exists a better implementation than this to allow adding a separate layer of in-kernel device namespace, I'd more than welcome to hear about. > > And i don't even see how this helps. Are you also not going to call > list_netdevice()? Are you going to add some other list for these > devices in a different class? list_netdevice() is still called. I think with the current RFC patch, I've added two lists for netdevs under the kernel namespace: dev_cmpl_list and name_cmpl_hlist. As a result of that, all userspace netdevs get registered will be added to two types of lists: the userspace list for e.g. dev_list, and also the kernelspace list e.g. dev_cmpl_list (I can rename it to something more accurate). The IFF_AUTO_MANAGED device will be only added to kernelspace list e.g. dev_cmpl_list. Hope all your questions are answered. Thanks, -Siwei > >Andrew - To unsubscribe, e-mail: virtio-dev-unsubscr...@lists.oasis-open.org For additional commands, e-mail: virtio-dev-h...@lists.oasis-open.org
Re: [virtio-dev] Re: [PATCH v2] virtio_balloon: export hugetlb page allocation counts
On 03/22/2018 07:38 PM, Jason Wang wrote: On 2018年03月22日 11:10, Michael S. Tsirkin wrote: On Thu, Mar 22, 2018 at 09:52:18AM +0800, Jason Wang wrote: On 2018年03月20日 12:26, Jonathan Helman wrote: On Mar 19, 2018, at 7:31 PM, Jason Wangwrote: On 2018年03月20日 06:14, Jonathan Helman wrote: Export the number of successful and failed hugetlb page allocations via the virtio balloon driver. These 2 counts come directly from the vm_events HTLB_BUDDY_PGALLOC and HTLB_BUDDY_PGALLOC_FAIL. Signed-off-by: Jonathan Helman Reviewed-by: Jason Wang Thanks. --- drivers/virtio/virtio_balloon.c | 6 ++ include/uapi/linux/virtio_balloon.h | 4 +++- 2 files changed, 9 insertions(+), 1 deletion(-) diff --git a/drivers/virtio/virtio_balloon.c b/drivers/virtio/virtio_balloon.c index dfe5684..6b237e3 100644 --- a/drivers/virtio/virtio_balloon.c +++ b/drivers/virtio/virtio_balloon.c @@ -272,6 +272,12 @@ static unsigned int update_balloon_stats(struct virtio_balloon *vb) pages_to_bytes(events[PSWPOUT])); update_stat(vb, idx++, VIRTIO_BALLOON_S_MAJFLT, events[PGMAJFAULT]); update_stat(vb, idx++, VIRTIO_BALLOON_S_MINFLT, events[PGFAULT]); +#ifdef CONFIG_HUGETLB_PAGE + update_stat(vb, idx++, VIRTIO_BALLOON_S_HTLB_PGALLOC, + events[HTLB_BUDDY_PGALLOC]); + update_stat(vb, idx++, VIRTIO_BALLOON_S_HTLB_PGFAIL, + events[HTLB_BUDDY_PGALLOC_FAIL]); +#endif #endif update_stat(vb, idx++, VIRTIO_BALLOON_S_MEMFREE, pages_to_bytes(i.freeram)); diff --git a/include/uapi/linux/virtio_balloon.h b/include/uapi/linux/virtio_balloon.h index 4e8b830..40297a3 100644 --- a/include/uapi/linux/virtio_balloon.h +++ b/include/uapi/linux/virtio_balloon.h @@ -53,7 +53,9 @@ struct virtio_balloon_config { #define VIRTIO_BALLOON_S_MEMTOT 5 /* Total amount of memory */ #define VIRTIO_BALLOON_S_AVAIL 6 /* Available memory as in /proc */ #define VIRTIO_BALLOON_S_CACHES 7 /* Disk caches */ -#define VIRTIO_BALLOON_S_NR 8 +#define VIRTIO_BALLOON_S_HTLB_PGALLOC 8 /* Hugetlb page allocations */ +#define VIRTIO_BALLOON_S_HTLB_PGFAIL 9 /* Hugetlb page allocation failures */ +#define VIRTIO_BALLOON_S_NR 10 /* * Memory statistics structure. Not for this patch, but it looks to me that exporting such nr through uapi is fragile. Sorry, can you explain what you mean here? Jon Spec said "Within an output buffer submitted to the statsq, the device MUST ignore entries with tag values that it does not recognize". So exporting VIRTIO_BALLOON_S_NR seems useless and device implementation can not depend on such number in uapi. Thanks Suggestions? I don't like to break build for people ... Didn't have a good idea. But maybe we should keep VIRTIO_BALLOON_S_NR unchanged, and add a comment here. Thanks I think Jason's comment is for a future patch. Didn't see this patch get applied, so wondering if it could be. Thanks, Jon - To unsubscribe, e-mail: virtio-dev-unsubscr...@lists.oasis-open.org For additional commands, e-mail: virtio-dev-h...@lists.oasis-open.org
[virtio-dev] Re: [virtio] [PATCH 4/5] packed-ring: reposition drivernormative on driver notifications
On 09/04/2018 20:58, Halil Pasic wrote: > The driver has to be careful to expose the new \field{flags} > value before checking if notifications are suppressed. This paragraph should also be reworked to be part of the normative text below, I think? The memory barrier must be "between two things", and only with this paragraph you know that it's between the flags write and the suppression structure read. Thanks, Paolo > +\drivernormative{\paragraph}{Notifying The Device}{Basic Facilities of a > Virtio Device / Packed Virtqueues / Supplying Buffers to The Device / > Notifying The Device} > +The driver MUST perform a suitable memory barrier before reading > +the Driver Event Suppression structure, to avoid missing a notification. > + - To unsubscribe, e-mail: virtio-dev-unsubscr...@lists.oasis-open.org For additional commands, e-mail: virtio-dev-h...@lists.oasis-open.org
[virtio-dev] [PATCH 5/5] packed-ring: tweak driver code example
This patch does three things. Emphasises that the example is pseudo-code as we tend to stick to C otherwise and tweaks the wording, fixes a typo in the pseudo code, and removes a line of code that I neither can see the necessity for nor the sense of. Signed-off-by: Halil Pasic--- packed-ring.tex |9 + 1 files changed, 5 insertions(+), 4 deletions(-) diff --git a/packed-ring.tex b/packed-ring.tex index 388850b..f619069 100644 --- a/packed-ring.tex +++ b/packed-ring.tex @@ -604,9 +604,11 @@ value before checking if notifications are suppressed. The driver MUST perform a suitable memory barrier before reading the Driver Event Suppression structure, to avoid missing a notification. -\subsubsection{Implementation Example}\label{sec:Basic Facilities of a Virtio Device / Packed Virtqueues / Supplying Buffers to The Device / Implementation Example} +\subsubsection{Pseudo-Code Example}\label{sec:Basic Facilities of a +Virtio Device / Packed Virtqueues / Supplying Buffers to The Device / +Pseudo-Code Example} -Below is a driver code example. It does not attempt to reduce +Below is driver pseudo-code example supplying buffers to the device. It does not attempt to reduce the number of device interrupts, neither does it support the VIRTIO_F_RING_EVENT_IDX feature. @@ -621,7 +623,6 @@ sgs = 0; for (each buffer element b) { sgs++; -vq->ids[vq->next_avail] = -1; vq->desc[vq->next_avail].address = get_addr(b); vq->desc[vq->next_avail].len = get_len(b); @@ -645,7 +646,7 @@ for (each buffer element b) { if (vq->next_avail >= vq->size) { vq->next_avail = 0; -vq->avail_wrap_count \^= 1; +vq->avail_wrap_count ^= 1; } } vq->sgs[id] = sgs; -- 1.7.1 - To unsubscribe, e-mail: virtio-dev-unsubscr...@lists.oasis-open.org For additional commands, e-mail: virtio-dev-h...@lists.oasis-open.org
[virtio-dev] [PATCH 2/5] consistency: clean up stale wording for v1.1 terms
Occasionally we still use used and available ring instead of Device and Driver Area respectively. Same goes for descriptor table with the difference that depending on the context we need either Descriptor Area or descriptor ring. Let's clean these up, at least where the cleanup is trivial. Signed-off-by: Halil Pasic--- content.tex | 14 +++--- packed-ring.tex | 10 ++ 2 files changed, 13 insertions(+), 11 deletions(-) diff --git a/content.tex b/content.tex index 7a92cb1..e3ac662 100644 --- a/content.tex +++ b/content.tex @@ -1247,7 +1247,7 @@ The driver typically does this as follows, for each virtqueue a device has: \item Optionally, select a smaller virtqueue size and write it to \field{queue_size}. -\item Allocate and zero Descriptor Table, Available and Used rings for the +\item Allocate and zero Descriptor Area, Device Area and Driver Area for the virtqueue in contiguous physical memory. \item Optionally, if MSI-X capability is present and enabled on the @@ -1769,9 +1769,9 @@ nor behaviour: writing to \field{QueueSel}. } \hline - \mmioreg{QueueAlign}{Used Ring alignment in the virtual queue}{0x03c}{W}{% + \mmioreg{QueueAlign}{Device Area alignment in the virtual queue}{0x03c}{W}{% Writing to this register notifies the device about alignment -boundary of the Used Ring in bytes. This value should be a power +boundary of the Device Area in bytes. This value should be a power of 2 and applies to the queue selected by writing to \field{QueueSel}. } \hline @@ -1779,7 +1779,7 @@ nor behaviour: Writing to this register notifies the device about location of the virtual queue in the Guest's physical address space. This value is the index number of a page starting with the queue -Descriptor Table. Value zero (0x0) means physical address zero +Descriptor Area. Value zero (0x0) means physical address zero (0x) and is illegal. When the driver stops using the queue it writes zero (0x0) to this register. Reading from this register returns the currently used page @@ -1830,7 +1830,7 @@ The virtual queue is configured as follows: queue is not available. \item Allocate and zero the queue pages in contiguous virtual - memory, aligning the Used Ring to an optimal boundary (usually + memory, aligning the Device Area to an optimal boundary (usually page size). The driver should choose a queue size smaller than or equal to \field{QueueNumMax}. @@ -2066,8 +2066,8 @@ struct vq_info_block { \end{lstlisting} \field{desc}, \field{driver} and \field{device} contain the guest -addresses for the descriptor area, -available area and used area for queue \field{index}, respectively. The actual +addresses for the Descriptor Area, +Driver Area and Device Area for queue \field{index}, respectively. The actual virtqueue size (number of allocated buffers) is transmitted in \field{num}. \devicenormative{\paragraph}{Configuring a Virtqueue}{Virtio Transport Options / Virtio over channel I/O / Device Initialization / Configuring a Virtqueue} diff --git a/packed-ring.tex b/packed-ring.tex index eb006b0..e656fed 100644 --- a/packed-ring.tex +++ b/packed-ring.tex @@ -445,7 +445,9 @@ struct pvirtq_event_suppress { }; \end{lstlisting} -\devicenormative{\subsection}{The Virtqueue Descriptor Table}{Basic Facilities of a Virtio Device / Packed Virtqueues / The Virtqueue Descriptor Table} +\devicenormative{\subsection}{The Virtqueue Descriptor Ring}{Basic +Facilities of a Virtio Device / Packed Virtqueues / The Virtqueue +Descriptor Ring} A device MUST NOT write to a device-readable buffer, and a device SHOULD NOT read a device-writable buffer. A device MUST NOT use a descriptor unless it observes @@ -454,7 +456,7 @@ the VIRTQ_DESC_F_AVAIL bit in its \field{flags} being changed A device MUST NOT change a descriptor after changing it's the VIRTQ_DESC_F_USED bit in its \field{flags}. -\drivernormative{\subsection}{The Virtqueue Descriptor Table}{Basic Facilities of a Virtio Device / PAcked Virtqueues / The Virtqueue Descriptor Table} +\drivernormative{\subsection}{The Virtqueue Descriptor Ring}{Basic Facilities of a Virtio Device / PAcked Virtqueues / The Virtqueue Descriptor Ring} A driver MUST NOT change a descriptor unless it observes the VIRTQ_DESC_F_USED bit in its \field{flags} being changed. A driver MUST NOT change a descriptor after changing @@ -498,7 +500,7 @@ were made available by the driver. The device MAY limit the number of buffers it will allow in a list. -\drivernormative{\subsection}{Indirect Descriptors}{Basic Facilities of a Virtio Device / Packed Virtqueues / The Virtqueue Descriptor Table / Indirect Descriptors} +\drivernormative{\subsection}{Indirect Descriptors}{Basic Facilities of a Virtio Device / Packed Virtqueues / The Virtqueue Descriptor Ring / Indirect Descriptors} The driver MUST NOT set the DESC_F_INDIRECT flag
[virtio-dev] [PATCH 1/5] packed-ring: fix used descriptors but meant buffers
The subsection is talking about virtio buffers, that is possibly as scatter-gather list mapped by multiple descriptors. Mistakenly we at places we say descriptors instead of buffers. While at it fix the title case capitalization. We seem to use title case capitalization for subsections in most cases. Signed-off-by: Halil Pasic--- packed-ring.tex |6 +++--- 1 files changed, 3 insertions(+), 3 deletions(-) diff --git a/packed-ring.tex b/packed-ring.tex index 2ef9559..eb006b0 100644 --- a/packed-ring.tex +++ b/packed-ring.tex @@ -272,10 +272,10 @@ Buffer ID is also reserved and is ignored by the device. In descriptors with VIRTQ_DESC_F_INDIRECT set VIRTQ_DESC_F_WRITE is reserved and is ignored by the device. -\subsection{In-order use of descriptors} -\label{sec:Packed Virtqueues / In-order use of descriptors} +\subsection{In-Order Use of Buffers} +\label{sec:Packed Virtqueues / In-Order Use of Buffers} -Some devices always use descriptors in the same order in which +Some devices always use buffers in the same order in which they have been made available. These devices can offer the VIRTIO_F_IN_ORDER feature. If negotiated, this knowledge allows devices to notify the use of a batch of buffers to the driver by -- 1.7.1 - To unsubscribe, e-mail: virtio-dev-unsubscr...@lists.oasis-open.org For additional commands, e-mail: virtio-dev-h...@lists.oasis-open.org
[virtio-dev] [PATCH 4/5] packed-ring: reposition drivernormative on driver notifications
The drivernormative was simply in the wrong place. It does not belong to the code example. Signed-off-by: Halil Pasic--- packed-ring.tex |9 - 1 files changed, 4 insertions(+), 5 deletions(-) diff --git a/packed-ring.tex b/packed-ring.tex index d93df3b..388850b 100644 --- a/packed-ring.tex +++ b/packed-ring.tex @@ -600,6 +600,10 @@ Suppression Structure Format}. The driver has to be careful to expose the new \field{flags} value before checking if notifications are suppressed. +\drivernormative{\paragraph}{Notifying The Device}{Basic Facilities of a Virtio Device / Packed Virtqueues / Supplying Buffers to The Device / Notifying The Device} +The driver MUST perform a suitable memory barrier before reading +the Driver Event Suppression structure, to avoid missing a notification. + \subsubsection{Implementation Example}\label{sec:Basic Facilities of a Virtio Device / Packed Virtqueues / Supplying Buffers to The Device / Implementation Example} Below is a driver code example. It does not attempt to reduce @@ -658,11 +662,6 @@ if (vq->device_event.flags != RING_EVENT_FLAGS_DISABLE) { \end{lstlisting} - -\drivernormative{\paragraph}{Notifying The Device}{Basic Facilities of a Virtio Device / Packed Virtqueues / Supplying Buffers to The Device / Notifying The Device} -The driver MUST perform a suitable memory barrier before reading -the Driver Event Suppression structure, to avoid missing a notification. - \subsection{Receiving Used Buffers From The Device}\label{sec:Basic Facilities of a Virtio Device / Packed Virtqueues / Receiving Used Buffers From The Device} Once the device has used buffers referred to by a descriptor (read from or written to them, or -- 1.7.1 - To unsubscribe, e-mail: virtio-dev-unsubscr...@lists.oasis-open.org For additional commands, e-mail: virtio-dev-h...@lists.oasis-open.org
[virtio-dev] [PATCH 3/5] packed-packed: fix supplying buffers description
There was a discrepancy between the pseudo code and the textual description of the algorithm describing how shall a driver supply buffers to the device. The textual description did not make sense to me, so I've changed that one to be more in sync with the pseudo code. Signed-off-by: Halil Pasic--- Still a bit shaky. Please review carefully! This patch may as well be rooted in a terrible misunderstanding. --- packed-ring.tex | 64 ++- 1 files changed, 35 insertions(+), 29 deletions(-) diff --git a/packed-ring.tex b/packed-ring.tex index e656fed..d93df3b 100644 --- a/packed-ring.tex +++ b/packed-ring.tex @@ -527,27 +527,12 @@ when using the packed virtqueue format in more detail. The driver offers buffers to one of the device's virtqueues as follows: \begin{enumerate} -\item The driver places the buffer into free descriptor(s) in the Descriptor Ring. - -\item The driver performs a suitable memory barrier to ensure that it updates - the descriptor(s) before checking for notification suppression. - -\item If notifications are not suppressed, the driver notifies the device -of the new available buffers. -\end{enumerate} - -What follows are the requirements of each stage in more detail. - -\subsubsection{Placing Available Buffers Into The Descriptor Ring}\label{sec:Basic Facilities of a Virtio Device / Virtqueues / Supplying Buffers to The Device / Placing Available Buffers Into The Descriptor Ring} - -For each buffer element, b: - -\begin{enumerate} -\item Get the next descriptor ring entry, d \item Get the next free buffer id value +\item For each buffer element, b: +\begin{enumerate} +\item Get the next free descriptor ring entry, d \item Set \field{d.addr} to the physical address of the start of b \item Set \field{d.len} to the length of b. -\item Set \field{d.id} to the buffer id \item Calculate the flags as follows: \begin{enumerate} \item If b is device-writable, set the VIRTQ_DESC_F_WRITE bit to 1, otherwise 0 @@ -556,20 +541,41 @@ For each buffer element, b: \end{enumerate} \item Perform a memory barrier to ensure that the descriptor has been initialized -\item Set \field{d.flags} to the calculated flags value +\item If b is not the first buffer element set \field{d.flags} to the + calculated flags value, otherwise save it \item If d is the last descriptor in the ring, toggle the - Driver Ring Wrap Counter + Driver Ring Wrap Counter and reset d to the head of the ring \item Otherwise, increment d to point at the next descriptor \end{enumerate} +\item Point d to the descriptor corresponding to the last buffer element + and set \field{d.id} to the buffer id +\item Save some value associated with the buffer id that makes the + skipping forward described in section \ref{sec:Packed Virtqueues / + Indirect Flag: Scatter-Gather Support} and in section \ref{sec:Packed + Virtqueues / In-Order Use of Buffers} +\item Point d to the descriptor corresponding to the first buffer element + and set \field{d.flags} to the value calculated and saved +\item The driver performs a suitable memory barrier to ensure that all + writes done up to this point precede all writhes done after this + point as observed by the device. +\item Point d to the descriptor corresponding to the first buffer element + and set \field{d.flags} to the value calculated and saved +\item The driver performs a suitable memory barrier to ensure that it updates + the descriptor(s) before checking for notification suppression. +\item If notifications are not suppressed, the driver notifies the device +of the new available buffers. +\end{enumerate} -This makes a single descriptor buffer available. However, in -general the driver MAY make use of a batch of descriptors as part +This makes a single virtio buffer available. However, in +general the driver MAY make use of a batch of buffers as part of a single request. In that case, it defers updating -the descriptor flags for the first descriptor -(and the previous memory barrier) until after the rest of -the descriptors have been initialized. +the descriptor flags for the first descriptor of the first +virtio buffer (and the preceding memory barrier) until after the rest of +the descriptors have been written. The notification steps are performed +only after the after the flags of the first descriptor of the first +virtio buffer are set. -Once the descriptor \field{flags} field is updated by the driver, this exposes +Once the \field{flags} field of the first descriptor is updated by the driver, this exposes the descriptor and its contents. The device MAY access the descriptor and any following descriptors the driver created and the memory they refer to immediately. @@ -577,9 +583,9 @@ memory they refer to immediately. \drivernormative{\paragraph}{Updating flags}{Basic Facilities of a Virtio Device / Packed Virtqueues /
[virtio-dev] [PATCH 0/5] fixes and tweaks for virtio v1.1
Some of these are more and some are less important. Please be careful, there is always a chance that I misunderstood something and that I'm trying to change for the worse. I've identified some other weak point's too, but tried to proceed in order the more grave and the better I think I understand it (and thus propose a fix) the higher the priority. Some stuff I've left for my notifiers patch set. This one and that one will have to get merged if both turn out to be good. Halil Pasic (5): packed-ring: fix used descriptors but meant buffers consistency: clean up stale wording for v1.1 terms packed-packed: fix supplying buffers description packed-ring: reposition drivernormative on driver notifications packed-ring: tweak driver code example content.tex | 14 packed-ring.tex | 96 +- 2 files changed, 59 insertions(+), 51 deletions(-) - To unsubscribe, e-mail: virtio-dev-unsubscr...@lists.oasis-open.org For additional commands, e-mail: virtio-dev-h...@lists.oasis-open.org
[virtio-dev] Re: [RFC PATCH net-next v5 3/4] virtio_net: Extend virtio to use VF datapath when available
On 4/9/2018 1:07 AM, Jiri Pirko wrote: Sat, Apr 07, 2018 at 12:59:14AM CEST, sridhar.samudr...@intel.com wrote: On 4/6/2018 5:48 AM, Jiri Pirko wrote: Thu, Apr 05, 2018 at 11:08:22PM CEST, sridhar.samudr...@intel.com wrote: [...] +static int virtnet_bypass_join_child(struct net_device *bypass_netdev, +struct net_device *child_netdev) +{ + struct virtnet_bypass_info *vbi; + bool backup; + + vbi = netdev_priv(bypass_netdev); + backup = (child_netdev->dev.parent == bypass_netdev->dev.parent); + if (backup ? rtnl_dereference(vbi->backup_netdev) : + rtnl_dereference(vbi->active_netdev)) { + netdev_info(bypass_netdev, + "%s attempting to join bypass dev when %s already present\n", + child_netdev->name, backup ? "backup" : "active"); Bypass module should check if there is already some other netdev enslaved and refuse right there. This will work for virtio-net with 3 netdev model, but this check has to be done by netvsc as its bypass_netdev is same as the backup_netdev. Will add a flag while registering with the bypass module to indicate if the driver is doing a 2 netdev or 3 netdev model and based on that flag this check can be done in bypass module for 3 netdev scenario. Just let me undestand it clearly. What I expect the difference would be between 2netdev and3 netdev model is this: 2netdev: bypass_master / / VF_slave 3netdev: bypass_master / \ / \ VF_slave backup_slave Is that correct? If not, how does it look like? Looks correct. VF_slave and backup_slave are the original netdevs and are present in both the models. In the 3 netdev model, bypass_master netdev is created and VF_slave and backup_slave are marked as the 2 slaves of this new netdev. In the 2 netdev model, backup_slave acts as bypass_master and the bypass module doesn't have access to netdev_priv of the backup_slave. Once i moved all the ndo_ops of the master netdev to bypass module, i realized that we can move the create/destroy of the upper netdev also to bypass.c. That way the changes to virtio_net become very minimal. With these updates, bypass module now supports both the models by exporting 2 sets of functions. 3 netdev: int bypass_master_create(struct net_device *backup_netdev, struct bypass_master **pbypass_master); void bypass_master_destroy(struct bypass_master *bypass_master); 2 netdev: int bypass_master_register(struct net_device *backup_netdev, struct bypass_ops *ops, struct bypass_master **pbypass_master); void bypass_master_unregister(struct bypass_master *bypass_master); Will send the next revision in a day or two. Thanks Sridhar - To unsubscribe, e-mail: virtio-dev-unsubscr...@lists.oasis-open.org For additional commands, e-mail: virtio-dev-h...@lists.oasis-open.org
RE: [virtio-dev] Dynamic assignment of vhost-user to a VM
Thank you Paolo Avi > -Original Message- > From: Paolo Bonzini [mailto:pbonz...@redhat.com] > Sent: Monday, 09 April, 2018 12:43 PM > To: Avi Cohen (A); virtio-dev@lists.oasis-open.org > Subject: Re: [virtio-dev] Dynamic assignment of vhost-user to a VM > > On 09/04/2018 08:33, Avi Cohen (A) wrote: > > Hello All, > > Is it possible to assign a vhost-user frontend to a VM *after* the VM was > launched ? > > Best Regards > > Avi > > If you're referring to QEMU then yes, you can dynamically create all of > character devices (to connect to the vhost-user server), network devices (to > create a vhost-user client) and NICs (that peer with the vhost-user server). > > If you're using libvirt to run QEMU, you can in fact invoke a single "virsh > attach- > device" command to handle all three steps. > > Thanks, > > Paolo
[virtio-dev] Re: [PATCH v31 2/4] virtio-balloon: VIRTIO_BALLOON_F_FREE_PAGE_HINT
On 04/09/2018 02:03 PM, Michael S. Tsirkin wrote: On Fri, Apr 06, 2018 at 08:17:23PM +0800, Wei Wang wrote: Negotiation of the VIRTIO_BALLOON_F_FREE_PAGE_HINT feature indicates the support of reporting hints of guest free pages to host via virtio-balloon. Host requests the guest to report free page hints by sending a new cmd id to the guest via the free_page_report_cmd_id configuration register. When the guest starts to report, the first element added to the free page vq is the cmd id given by host. When the guest finishes the reporting of all the free pages, VIRTIO_BALLOON_FREE_PAGE_REPORT_STOP_ID is added to the vq to tell host that the reporting is done. Host polls the free page vq after sending the starting cmd id, so the guest doesn't need to kick after filling an element to the vq. Host may also requests the guest to stop the reporting in advance by sending the stop cmd id to the guest via the configuration register. Signed-off-by: Wei WangSigned-off-by: Liang Li Cc: Michael S. Tsirkin Cc: Michal Hocko Pretty good by now, Minor comments below. Thanks for the comments. --- drivers/virtio/virtio_balloon.c | 272 +++- include/uapi/linux/virtio_balloon.h | 4 + 2 files changed, 240 insertions(+), 36 deletions(-) diff --git a/drivers/virtio/virtio_balloon.c b/drivers/virtio/virtio_balloon.c index dfe5684..aef73ee 100644 --- a/drivers/virtio/virtio_balloon.c +++ b/drivers/virtio/virtio_balloon.c @@ -51,9 +51,22 @@ MODULE_PARM_DESC(oom_pages, "pages to free on OOM"); static struct vfsmount *balloon_mnt; #endif +enum virtio_balloon_vq { + VIRTIO_BALLOON_VQ_INFLATE, + VIRTIO_BALLOON_VQ_DEFLATE, + VIRTIO_BALLOON_VQ_STATS, + VIRTIO_BALLOON_VQ_FREE_PAGE, + VIRTIO_BALLOON_VQ_MAX +}; + struct virtio_balloon { struct virtio_device *vdev; - struct virtqueue *inflate_vq, *deflate_vq, *stats_vq; + struct virtqueue *inflate_vq, *deflate_vq, *stats_vq, *free_page_vq; + + /* Balloon's own wq for cpu-intensive work items */ + struct workqueue_struct *balloon_wq; + /* The free page reporting work item submitted to the balloon wq */ + struct work_struct report_free_page_work; /* The balloon servicing is delegated to a freezable workqueue. */ struct work_struct update_balloon_stats_work; @@ -63,6 +76,13 @@ struct virtio_balloon { spinlock_t stop_update_lock; bool stop_update; + /* The new cmd id received from host */ + uint32_t cmd_id_received; + /* The cmd id that is in use */ + __virtio32 cmd_id_use; I'd prefer cmd_id_active but it's not critical. OK, will change. + +static void report_free_page_func(struct work_struct *work) +{ + struct virtio_balloon *vb; + struct virtqueue *vq; + unsigned int unused; + int ret; + + vb = container_of(work, struct virtio_balloon, report_free_page_work); + vq = vb->free_page_vq; + + /* Start by sending the received cmd id to host with an outbuf. */ + ret = send_start_cmd_id(vb, vb->cmd_id_received); + if (unlikely(ret)) + goto err; + + ret = walk_free_mem_block(vb, 0, _balloon_send_free_pages); + if (unlikely(ret == -EIO)) + goto err; why is EIO special? I think you should special-case EINTR maybe. Actually EINTR isn't an error which needs to bail out. That's just the case that the vq is full, that hint isn't added. Maybe it is not necessary to treat the "vq full" case as an error. How about just returning "0" when the vq is full, instead of returning "EINTR"? (The next hint will continue to be added) + + /* End by sending a stop id to host with an outbuf. */ + ret = send_stop_cmd_id(vb); + if (likely(!ret)) { What happens on failure? Don't we need to detach anyway? Yes. Please see below, we could make some more change. + /* Ending: detach all the used buffers from the vq. */ + while (vq->num_free != virtqueue_get_vring_size(vq)) + virtqueue_get_buf(vq, ); This isn't all that happens here. It also waits for buffers to be consumed. Is this by design? And why is it good idea to busy poll while doing it? Because host and guest operations happen asynchronously. When the guest reaches here, host may have not put anything to the vq yet. How about doing this via the free page vq handler? Host will send a free page vq interrupt before exiting the optimization. I think this would be nicer. Best, Wei - To unsubscribe, e-mail: virtio-dev-unsubscr...@lists.oasis-open.org For additional commands, e-mail: virtio-dev-h...@lists.oasis-open.org
Re: [virtio-dev] Dynamic assignment of vhost-user to a VM
On 09/04/2018 08:33, Avi Cohen (A) wrote: > Hello All, > Is it possible to assign a vhost-user frontend to a VM *after* the VM was > launched ? > Best Regards > Avi If you're referring to QEMU then yes, you can dynamically create all of character devices (to connect to the vhost-user server), network devices (to create a vhost-user client) and NICs (that peer with the vhost-user server). If you're using libvirt to run QEMU, you can in fact invoke a single "virsh attach-device" command to handle all three steps. Thanks, Paolo - To unsubscribe, e-mail: virtio-dev-unsubscr...@lists.oasis-open.org For additional commands, e-mail: virtio-dev-h...@lists.oasis-open.org
[virtio-dev] Dynamic assignment of vhost-user to a VM
Hello All, Is it possible to assign a vhost-user frontend to a VM *after* the VM was launched ? Best Regards Avi - To unsubscribe, e-mail: virtio-dev-unsubscr...@lists.oasis-open.org For additional commands, e-mail: virtio-dev-h...@lists.oasis-open.org
[virtio-dev] Re: [PATCH v31 2/4] virtio-balloon: VIRTIO_BALLOON_F_FREE_PAGE_HINT
On Fri, Apr 06, 2018 at 08:17:23PM +0800, Wei Wang wrote: > Negotiation of the VIRTIO_BALLOON_F_FREE_PAGE_HINT feature indicates the > support of reporting hints of guest free pages to host via virtio-balloon. > > Host requests the guest to report free page hints by sending a new cmd > id to the guest via the free_page_report_cmd_id configuration register. > > When the guest starts to report, the first element added to the free page > vq is the cmd id given by host. When the guest finishes the reporting > of all the free pages, VIRTIO_BALLOON_FREE_PAGE_REPORT_STOP_ID is added > to the vq to tell host that the reporting is done. Host polls the free > page vq after sending the starting cmd id, so the guest doesn't need to > kick after filling an element to the vq. > > Host may also requests the guest to stop the reporting in advance by > sending the stop cmd id to the guest via the configuration register. > > Signed-off-by: Wei Wang> Signed-off-by: Liang Li > Cc: Michael S. Tsirkin > Cc: Michal Hocko Pretty good by now, Minor comments below. > --- > drivers/virtio/virtio_balloon.c | 272 > +++- > include/uapi/linux/virtio_balloon.h | 4 + > 2 files changed, 240 insertions(+), 36 deletions(-) > > diff --git a/drivers/virtio/virtio_balloon.c b/drivers/virtio/virtio_balloon.c > index dfe5684..aef73ee 100644 > --- a/drivers/virtio/virtio_balloon.c > +++ b/drivers/virtio/virtio_balloon.c > @@ -51,9 +51,22 @@ MODULE_PARM_DESC(oom_pages, "pages to free on OOM"); > static struct vfsmount *balloon_mnt; > #endif > > +enum virtio_balloon_vq { > + VIRTIO_BALLOON_VQ_INFLATE, > + VIRTIO_BALLOON_VQ_DEFLATE, > + VIRTIO_BALLOON_VQ_STATS, > + VIRTIO_BALLOON_VQ_FREE_PAGE, > + VIRTIO_BALLOON_VQ_MAX > +}; > + > struct virtio_balloon { > struct virtio_device *vdev; > - struct virtqueue *inflate_vq, *deflate_vq, *stats_vq; > + struct virtqueue *inflate_vq, *deflate_vq, *stats_vq, *free_page_vq; > + > + /* Balloon's own wq for cpu-intensive work items */ > + struct workqueue_struct *balloon_wq; > + /* The free page reporting work item submitted to the balloon wq */ > + struct work_struct report_free_page_work; > > /* The balloon servicing is delegated to a freezable workqueue. */ > struct work_struct update_balloon_stats_work; > @@ -63,6 +76,13 @@ struct virtio_balloon { > spinlock_t stop_update_lock; > bool stop_update; > > + /* The new cmd id received from host */ > + uint32_t cmd_id_received; > + /* The cmd id that is in use */ > + __virtio32 cmd_id_use; I'd prefer cmd_id_active but it's not critical. > + /* Buffer to store the stop sign */ > + __virtio32 stop_cmd_id; > + > /* Waiting for host to ack the pages we released. */ > wait_queue_head_t acked; > > @@ -320,17 +340,6 @@ static void stats_handle_request(struct virtio_balloon > *vb) > virtqueue_kick(vq); > } > > -static void virtballoon_changed(struct virtio_device *vdev) > -{ > - struct virtio_balloon *vb = vdev->priv; > - unsigned long flags; > - > - spin_lock_irqsave(>stop_update_lock, flags); > - if (!vb->stop_update) > - queue_work(system_freezable_wq, >update_balloon_size_work); > - spin_unlock_irqrestore(>stop_update_lock, flags); > -} > - > static inline s64 towards_target(struct virtio_balloon *vb) > { > s64 target; > @@ -347,6 +356,34 @@ static inline s64 towards_target(struct virtio_balloon > *vb) > return target - vb->num_pages; > } > > +static void virtballoon_changed(struct virtio_device *vdev) > +{ > + struct virtio_balloon *vb = vdev->priv; > + unsigned long flags; > + s64 diff = towards_target(vb); > + > + if (diff) { > + spin_lock_irqsave(>stop_update_lock, flags); > + if (!vb->stop_update) > + queue_work(system_freezable_wq, > +>update_balloon_size_work); > + spin_unlock_irqrestore(>stop_update_lock, flags); > + } > + > + if (virtio_has_feature(vdev, VIRTIO_BALLOON_F_FREE_PAGE_HINT)) { > + virtio_cread(vdev, struct virtio_balloon_config, > + free_page_report_cmd_id, >cmd_id_received); > + if (vb->cmd_id_received != > + VIRTIO_BALLOON_FREE_PAGE_REPORT_STOP_ID) { > + spin_lock_irqsave(>stop_update_lock, flags); > + if (!vb->stop_update) > + queue_work(vb->balloon_wq, > +>report_free_page_work); > + spin_unlock_irqrestore(>stop_update_lock, flags); > + } > + } > +} > + > static void update_balloon_size(struct virtio_balloon *vb) > { > u32 actual = vb->num_pages; > @@ -421,42 +458,178 @@ static void