[PATCH v33 4/4] virtio-balloon: VIRTIO_BALLOON_F_PAGE_POISON
The VIRTIO_BALLOON_F_PAGE_POISON feature bit is used to indicate if the guest is using page poisoning. Guest writes to the poison_val config field to tell host about the page poisoning value that is in use. Suggested-by: Michael S. Tsirkin Signed-off-by: Wei Wang Cc: Michael S. Tsirkin Cc: Michal Hocko Cc: Andrew Morton --- drivers/virtio/virtio_balloon.c | 10 ++ include/uapi/linux/virtio_balloon.h | 3 +++ 2 files changed, 13 insertions(+) diff --git a/drivers/virtio/virtio_balloon.c b/drivers/virtio/virtio_balloon.c index 582a03b..c59bb380 100644 --- a/drivers/virtio/virtio_balloon.c +++ b/drivers/virtio/virtio_balloon.c @@ -634,6 +634,7 @@ static struct file_system_type balloon_fs = { static int virtballoon_probe(struct virtio_device *vdev) { struct virtio_balloon *vb; + __u32 poison_val; int err; if (!vdev->config->get) { @@ -671,6 +672,11 @@ static int virtballoon_probe(struct virtio_device *vdev) goto out_del_vqs; } INIT_WORK(>report_free_page_work, report_free_page_func); + if (virtio_has_feature(vdev, VIRTIO_BALLOON_F_PAGE_POISON)) { + memset(_val, PAGE_POISON, sizeof(poison_val)); + virtio_cwrite(vb->vdev, struct virtio_balloon_config, + poison_val, _val); + } vb->hints = kmalloc(FREE_PAGE_HINT_MEM_SIZE, GFP_KERNEL); if (!vb->hints) { err = -ENOMEM; @@ -796,6 +802,9 @@ static int virtballoon_restore(struct virtio_device *vdev) static int virtballoon_validate(struct virtio_device *vdev) { + if (!page_poisoning_enabled()) + __virtio_clear_bit(vdev, VIRTIO_BALLOON_F_PAGE_POISON); + __virtio_clear_bit(vdev, VIRTIO_F_IOMMU_PLATFORM); return 0; } @@ -805,6 +814,7 @@ static unsigned int features[] = { VIRTIO_BALLOON_F_STATS_VQ, VIRTIO_BALLOON_F_DEFLATE_ON_OOM, VIRTIO_BALLOON_F_FREE_PAGE_HINT, + VIRTIO_BALLOON_F_PAGE_POISON, }; static struct virtio_driver virtio_balloon_driver = { diff --git a/include/uapi/linux/virtio_balloon.h b/include/uapi/linux/virtio_balloon.h index 99b8416..f3b6191 100644 --- a/include/uapi/linux/virtio_balloon.h +++ b/include/uapi/linux/virtio_balloon.h @@ -35,6 +35,7 @@ #define VIRTIO_BALLOON_F_STATS_VQ 1 /* Memory Stats virtqueue */ #define VIRTIO_BALLOON_F_DEFLATE_ON_OOM2 /* Deflate balloon on OOM */ #define VIRTIO_BALLOON_F_FREE_PAGE_HINT3 /* VQ to report free pages */ +#define VIRTIO_BALLOON_F_PAGE_POISON 4 /* Guest is using page poisoning */ /* Size of a PFN in the balloon interface. */ #define VIRTIO_BALLOON_PFN_SHIFT 12 @@ -47,6 +48,8 @@ struct virtio_balloon_config { __u32 actual; /* Command sent from host */ __u32 host_cmd; + /* Stores PAGE_POISON if page poisoning is in use */ + __u32 poison_val; }; struct virtio_balloon_free_page_hints { -- 2.7.4 ___ Virtualization mailing list Virtualization@lists.linux-foundation.org https://lists.linuxfoundation.org/mailman/listinfo/virtualization
[PATCH v33 3/4] mm/page_poison: expose page_poisoning_enabled to kernel modules
In some usages, e.g. virtio-balloon, a kernel module needs to know if page poisoning is in use. This patch exposes the page_poisoning_enabled function to kernel modules. Signed-off-by: Wei Wang Cc: Andrew Morton Cc: Michal Hocko Cc: Michael S. Tsirkin Acked-by: Andrew Morton --- mm/page_poison.c | 6 ++ 1 file changed, 6 insertions(+) diff --git a/mm/page_poison.c b/mm/page_poison.c index aa2b3d3..830f604 100644 --- a/mm/page_poison.c +++ b/mm/page_poison.c @@ -17,6 +17,11 @@ static int __init early_page_poison_param(char *buf) } early_param("page_poison", early_page_poison_param); +/** + * page_poisoning_enabled - check if page poisoning is enabled + * + * Return true if page poisoning is enabled, or false if not. + */ bool page_poisoning_enabled(void) { /* @@ -29,6 +34,7 @@ bool page_poisoning_enabled(void) (!IS_ENABLED(CONFIG_ARCH_SUPPORTS_DEBUG_PAGEALLOC) && debug_pagealloc_enabled())); } +EXPORT_SYMBOL_GPL(page_poisoning_enabled); static void poison_page(struct page *page) { -- 2.7.4 ___ Virtualization mailing list Virtualization@lists.linux-foundation.org https://lists.linuxfoundation.org/mailman/listinfo/virtualization
[PATCH v33 2/4] virtio-balloon: VIRTIO_BALLOON_F_FREE_PAGE_HINT
Negotiation of the VIRTIO_BALLOON_F_FREE_PAGE_HINT feature indicates the support of reporting hints of guest free pages to host via virtio-balloon. Host requests the guest to report free page hints by sending a command to the guest via setting the VIRTIO_BALLOON_HOST_CMD_FREE_PAGE_HINT bit of the host_cmd config register. As the first step here, virtio-balloon only reports free page hints from the max order (10) free page list to host. This has generated similar good results as reporting all free page hints during our tests. TODO: - support reporting free page hints from smaller order free page lists when there is a need/request from users. Signed-off-by: Wei Wang Signed-off-by: Liang Li Cc: Michael S. Tsirkin Cc: Michal Hocko Cc: Andrew Morton --- drivers/virtio/virtio_balloon.c | 187 +--- include/uapi/linux/virtio_balloon.h | 13 +++ 2 files changed, 163 insertions(+), 37 deletions(-) diff --git a/drivers/virtio/virtio_balloon.c b/drivers/virtio/virtio_balloon.c index 6b237e3..582a03b 100644 --- a/drivers/virtio/virtio_balloon.c +++ b/drivers/virtio/virtio_balloon.c @@ -43,6 +43,9 @@ #define OOM_VBALLOON_DEFAULT_PAGES 256 #define VIRTBALLOON_OOM_NOTIFY_PRIORITY 80 +/* The size of memory in bytes allocated for reporting free page hints */ +#define FREE_PAGE_HINT_MEM_SIZE (PAGE_SIZE * 16) + static int oom_pages = OOM_VBALLOON_DEFAULT_PAGES; module_param(oom_pages, int, S_IRUSR | S_IWUSR); MODULE_PARM_DESC(oom_pages, "pages to free on OOM"); @@ -51,9 +54,22 @@ MODULE_PARM_DESC(oom_pages, "pages to free on OOM"); static struct vfsmount *balloon_mnt; #endif +enum virtio_balloon_vq { + VIRTIO_BALLOON_VQ_INFLATE, + VIRTIO_BALLOON_VQ_DEFLATE, + VIRTIO_BALLOON_VQ_STATS, + VIRTIO_BALLOON_VQ_FREE_PAGE, + VIRTIO_BALLOON_VQ_MAX +}; + struct virtio_balloon { struct virtio_device *vdev; - struct virtqueue *inflate_vq, *deflate_vq, *stats_vq; + struct virtqueue *inflate_vq, *deflate_vq, *stats_vq, *free_page_vq; + + /* Balloon's own wq for cpu-intensive work items */ + struct workqueue_struct *balloon_wq; + /* The free page reporting work item submitted to the balloon wq */ + struct work_struct report_free_page_work; /* The balloon servicing is delegated to a freezable workqueue. */ struct work_struct update_balloon_stats_work; @@ -63,6 +79,8 @@ struct virtio_balloon { spinlock_t stop_update_lock; bool stop_update; + struct virtio_balloon_free_page_hints *hints; + /* Waiting for host to ack the pages we released. */ wait_queue_head_t acked; @@ -326,17 +344,6 @@ static void stats_handle_request(struct virtio_balloon *vb) virtqueue_kick(vq); } -static void virtballoon_changed(struct virtio_device *vdev) -{ - struct virtio_balloon *vb = vdev->priv; - unsigned long flags; - - spin_lock_irqsave(>stop_update_lock, flags); - if (!vb->stop_update) - queue_work(system_freezable_wq, >update_balloon_size_work); - spin_unlock_irqrestore(>stop_update_lock, flags); -} - static inline s64 towards_target(struct virtio_balloon *vb) { s64 target; @@ -353,6 +360,32 @@ static inline s64 towards_target(struct virtio_balloon *vb) return target - vb->num_pages; } +static void virtballoon_changed(struct virtio_device *vdev) +{ + struct virtio_balloon *vb = vdev->priv; + unsigned long flags; + uint32_t host_cmd; + s64 diff = towards_target(vb); + + if (diff) { + spin_lock_irqsave(>stop_update_lock, flags); + if (!vb->stop_update) + queue_work(system_freezable_wq, + >update_balloon_size_work); + spin_unlock_irqrestore(>stop_update_lock, flags); + } + + virtio_cread(vdev, struct virtio_balloon_config, host_cmd, _cmd); + if ((host_cmd & VIRTIO_BALLOON_HOST_CMD_FREE_PAGE_HINT) && +virtio_has_feature(vdev, VIRTIO_BALLOON_F_FREE_PAGE_HINT)) { + spin_lock_irqsave(>stop_update_lock, flags); + if (!vb->stop_update) + queue_work(vb->balloon_wq, + >report_free_page_work); + spin_unlock_irqrestore(>stop_update_lock, flags); + } +} + static void update_balloon_size(struct virtio_balloon *vb) { u32 actual = vb->num_pages; @@ -425,44 +458,98 @@ static void update_balloon_size_func(struct work_struct *work) queue_work(system_freezable_wq, work); } +static void free_page_vq_cb(struct virtqueue *vq) +{ + unsigned int unused; + + while (virtqueue_get_buf(vq, )) + ; +} + static int init_vqs(struct virtio_balloon *vb) { - struct virtqueue *vqs[3]; - vq_callback_t *callbacks[] = { balloon_ack, balloon_ack, stats_request }; - static const char * const names[] = {
[PATCH v33 1/4] mm: add a function to get free page blocks
This patch adds a function to get free pages blocks from a free page list. The obtained free page blocks are hints about free pages, because there is no guarantee that they are still on the free page list after the function returns. One use example of this patch is to accelerate live migration by skipping the transfer of free pages reported from the guest. A popular method used by the hypervisor to track which part of memory is written during live migration is to write-protect all the guest memory. So, those pages that are hinted as free pages but are written after this function returns will be captured by the hypervisor, and they will be added to the next round of memory transfer. Suggested-by: Linus Torvalds Signed-off-by: Wei Wang Signed-off-by: Liang Li Cc: Michal Hocko Cc: Andrew Morton Cc: Michael S. Tsirkin Cc: Linus Torvalds --- include/linux/mm.h | 1 + mm/page_alloc.c| 52 2 files changed, 53 insertions(+) diff --git a/include/linux/mm.h b/include/linux/mm.h index 0e49388..c58b4e5 100644 --- a/include/linux/mm.h +++ b/include/linux/mm.h @@ -2002,6 +2002,7 @@ extern void free_area_init(unsigned long * zones_size); extern void free_area_init_node(int nid, unsigned long * zones_size, unsigned long zone_start_pfn, unsigned long *zholes_size); extern void free_initmem(void); +uint32_t get_from_free_page_list(int order, __le64 buf[], uint32_t size); /* * Free reserved pages within range [PAGE_ALIGN(start), end & PAGE_MASK) diff --git a/mm/page_alloc.c b/mm/page_alloc.c index 07b3c23..7c816d9 100644 --- a/mm/page_alloc.c +++ b/mm/page_alloc.c @@ -5043,6 +5043,58 @@ void show_free_areas(unsigned int filter, nodemask_t *nodemask) show_swap_cache_info(); } +/** + * get_from_free_page_list - get free page blocks from a free page list + * @order: the order of the free page list to check + * @buf: the array to store the physical addresses of the free page blocks + * @size: the array size + * + * This function offers hints about free pages. There is no guarantee that + * the obtained free pages are still on the free page list after the function + * returns. pfn_to_page on the obtained free pages is strongly discouraged + * and if there is an absolute need for that, make sure to contact MM people + * to discuss potential problems. + * + * The addresses are currently stored to the array in little endian. This + * avoids the overhead of converting endianness by the caller who needs data + * in the little endian format. Big endian support can be added on demand in + * the future. + * + * Return the number of free page blocks obtained from the free page list. + * The maximum number of free page blocks that can be obtained is limited to + * the caller's array size. + */ +uint32_t get_from_free_page_list(int order, __le64 buf[], uint32_t size) +{ + struct zone *zone; + enum migratetype mt; + struct page *page; + struct list_head *list; + unsigned long addr, flags; + uint32_t index = 0; + + for_each_populated_zone(zone) { + spin_lock_irqsave(>lock, flags); + for (mt = 0; mt < MIGRATE_TYPES; mt++) { + list = >free_area[order].free_list[mt]; + list_for_each_entry(page, list, lru) { + addr = page_to_pfn(page) << PAGE_SHIFT; + if (likely(index < size)) { + buf[index++] = cpu_to_le64(addr); + } else { + spin_unlock_irqrestore(>lock, + flags); + return index; + } + } + } + spin_unlock_irqrestore(>lock, flags); + } + + return index; +} +EXPORT_SYMBOL_GPL(get_from_free_page_list); + static void zoneref_set_zone(struct zone *zone, struct zoneref *zoneref) { zoneref->zone = zone; -- 2.7.4 ___ Virtualization mailing list Virtualization@lists.linux-foundation.org https://lists.linuxfoundation.org/mailman/listinfo/virtualization
[PATCH v33 0/4] Virtio-balloon: support free page reporting
This patch series is separated from the previous "Virtio-balloon Enhancement" series. The new feature, VIRTIO_BALLOON_F_FREE_PAGE_HINT, implemented by this series enables the virtio-balloon driver to report hints of guest free pages to the host. It can be used to accelerate live migration of VMs. Here is an introduction of this usage: Live migration needs to transfer the VM's memory from the source machine to the destination round by round. For the 1st round, all the VM's memory is transferred. From the 2nd round, only the pieces of memory that were written by the guest (after the 1st round) are transferred. One method that is popularly used by the hypervisor to track which part of memory is written is to write-protect all the guest memory. This feature enables the optimization by skipping the transfer of guest free pages during VM live migration. It is not concerned that the memory pages are used after they are given to the hypervisor as a hint of the free pages, because they will be tracked by the hypervisor and transferred in the subsequent round if they are used and written. * Tests - Test Environment Host: Intel(R) Xeon(R) CPU E5-2699 v4 @ 2.20GHz Guest: 8G RAM, 4 vCPU Migration setup: migrate_set_speed 100G, migrate_set_downtime 2 second - Test Results - Idle Guest Live Migration Time (results are averaged over 10 runs): - Optimization v.s. Legacy = 278ms vs 1757ms --> ~84% reduction - Guest with Linux Compilation Workload (make bzImage -j4): - Live Migration Time (average) Optimization v.s. Legacy = 1408ms v.s. 2528ms --> ~44% reduction - Linux Compilation Time Optimization v.s. Legacy = 5min3s v.s. 5min12s --> no obvious difference ChangeLog: v32->v33: - mm/get_from_free_page_list: The new implementation to get free page hints based on the suggestions from Linus: https://lkml.org/lkml/2018/6/11/764 This avoids the complex call chain, and looks more prudent. - virtio-balloon: - use a fix-sized buffer to get free page hints; - remove the cmd id related interface. Now host can just send a free page hint command to the guest (via the host_cmd config register) to start the reporting. Currentlty the guest reports only the max order free page hints to host, which has generated similar good results as before. But the interface used by virtio-balloon to report can support reporting more orders in the future when there is a need. v31->v32: - virtio-balloon: - rename cmd_id_use to cmd_id_active; - report_free_page_func: detach used buffers after host sends a vq interrupt, instead of busy waiting for used buffers. v30->v31: - virtio-balloon: - virtio_balloon_send_free_pages: return -EINTR rather than 1 to indicate an active stop requested by host; and add more comments to explain about access to cmd_id_received without locks; - add_one_sg: add TODO to comments about possible improvement. v29->v30: - mm/walk_free_mem_block: add cond_sched() for each order v28->v29: - mm/page_poison: only expose page_poison_enabled(), rather than more changes did in v28, as we are not 100% confident about that for now. - virtio-balloon: use a separate buffer for the stop cmd, instead of having the start and stop cmd use the same buffer. This avoids the corner case that the start cmd is overridden by the stop cmd when the host has a delay in reading the start cmd. v27->v28: - mm/page_poison: Move PAGE_POISON to page_poison.c and add a function to expose page poison val to kernel modules. v26->v27: - add a new patch to expose page_poisoning_enabled to kernel modules - virtio-balloon: set poison_val to 0x, instead of 0xaa v25->v26: virtio-balloon changes only - remove kicking free page vq since the host now polls the vq after initiating the reporting - report_free_page_func: detach all the used buffers after sending the stop cmd id. This avoids leaving the detaching burden (i.e. overhead) to the next cmd id. Detaching here isn't considered overhead since the stop cmd id has been sent, and host has already moved formard. v24->v25: - mm: change walk_free_mem_block to return 0 (instead of true) on completing the report, and return a non-zero value from the callabck, which stops the reporting. - virtio-balloon: - use enum instead of define for VIRTIO_BALLOON_VQ_INFLATE etc. - avoid __virtio_clear_bit when bailing out; - a new method to avoid reporting the some cmd id to host twice - destroy_workqueue can cancel free page work when the feature is negotiated; - fail probe when the free page vq size is less than 2. v23->v24: - change feature name VIRTIO_BALLOON_F_FREE_PAGE_VQ to
Re: [PULL] vhost: cleanups and fixes
On 06/14/2018 11:01 PM, Nitesh Narayan Lal wrote: Hi Wei, On 06/12/2018 07:05 AM, Wei Wang wrote: On 06/12/2018 09:59 AM, Linus Torvalds wrote: On Mon, Jun 11, 2018 at 6:36 PM Michael S. Tsirkin wrote: Maybe it will help to have GFP_NONE which will make any allocation fail if attempted. Linus, would this address your comment? It would definitely have helped me initially overlook that call chain. But then when I started looking at the whole dma_map_page() thing, it just raised my hackles again. I would seriously suggest having a much simpler version for the "no allocation, no dma mapping" case, so that it's *obvious* that that never happens. So instead of having virtio_balloon_send_free_pages() call a really generic complex chain of functions that in _some_ cases can do memory allocation, why isn't there a short-circuited "vitruque_add_datum()" that is guaranteed to never do anything like that? Honestly, I look at "add_one_sg()" and it really doesn't make me happy. It looks hacky as hell. If I read the code right, you're really trying to just queue up a simple tuple of , except you encode it as a page pointer in order to play games with the SG logic, and then you hmap that to the ring, except in this case it's all a fake ring that just adds the cpu-physical address instead. And to figuer that out, it's like five layers of indirection through different helper functions that *can* do more generic things but in this case don't. And you do all of this from a core VM callback function with some _really_ core VM locks held. That makes no sense to me. How about this: - get rid of all that code - make the core VM callback save the "these are the free memory regions" in a fixed and limited array. One that DOES JUST THAT. No crazy "SG IO dma-mapping function crap". Just a plain array of a fixed size, pre-allocated for that virtio instance. - make it obvious that what you do in that sequence is ten instructions and no allocations ("Look ma, I wrote a value to an array and incremented the array idex, and I'M DONE") - then in that workqueue entry that you start *anyway*, you empty the array and do all the crazy virtio stuff. In fact, while at it, just simplify the VM interface too. Instead of traversing a random number of buddy lists, just trraverse *one* - the top-level one. Are you seriously ever going to shrink or mark read-only anythin *but* something big enough to be in the maximum order? MAX_ORDER is what, 11? So we're talking 8MB blocks. Do you *really* want the balloon code to work on smaller things, particularly since the whole interface is fundamentally racy and opportunistic to begin with? OK, I will implement a new version based on the suggestions. Thanks. I have been working on a similar series [1] that is more generic, which solves the problem of giving unused memory back to the host and could be used to solve the migration problem as well. Can you take a look and see if you can use my series in some way? Hi Nitesh, I actually checked the last version, which dates back to last year. It seems the new version does not have fundamental differences. Actually there are obvious differences between the two series. This series provides a simple lightweight method (will continue to post out a new version with simpler interfaces based on the above suggestions) to offer free pages hints, and the hints are quit helpful for usages like accelerating live migration and guest snapshot. If I read that correctly, that series seems to provide true (guaranteed) free pages with much more heavyweight logics, but true free pages are not necessary for the live migration optimization, which is the goal and motivation of this work. And from my point of view, that series seems more like an alternative function to ballooning, which takes out free pages (or say guest unused pages) via allocation. I will join the discussion in that thread. Probably we would need to think about other new usages for that series. Best, Wei ___ Virtualization mailing list Virtualization@lists.linux-foundation.org https://lists.linuxfoundation.org/mailman/listinfo/virtualization
Re: [PATCH 0/3] Use sbitmap instead of percpu_ida
On Thu, Jun 14, 2018 at 10:06:58PM -0400, Martin K. Petersen wrote: > > Matthew, > > > Removing the percpu_ida code nets over 400 lines of removal. It's not > > as spectacular as deleting an entire architecture, but it's still a > > worthy reduction in lines of code. > > Since most of the changes are in scsi or target, should I take this > series through my tree? I'd welcome that. Nick seems to be inactive as target maintainer; his tree on kernel.org hasn't seen any updates in five months. Thanks! ___ Virtualization mailing list Virtualization@lists.linux-foundation.org https://lists.linuxfoundation.org/mailman/listinfo/virtualization
Re: [virtio-dev] Re: [Qemu-devel] [PATCH] qemu: Introduce VIRTIO_NET_F_STANDBY feature bit to virtio_net
On Thu, Jun 14, 2018 at 12:02:31PM +0200, Cornelia Huck wrote: > So, do you know from the outset that there will be such a coupled > device? I.e., is it a property of the VM definition? > > Can there be a 'prepared' virtio-net device that presents the STANDBY > feature even if there currently is no vfio-handled device available -- > but making it possible to simply hotplug that device later? > Should it be possible to add a virtio/vfio pair later on? Yes, that's the plan, more or less. > > >>> I think Qemu should check if guest virtio-net supports this feature and > > >>> provide a mechanism for > > >>> an upper layer indicating if the STANDBY feature is successfully > > >>> negotiated or not. > > >>> The upper layer can then decide if it should hot plug a VF with the same > > >>> MAC and manage the 2 links. > > >>> If VF is successfully hot plugged, virtio-net link should be disabled. BTW I see no reason to do this last part. The role of the standby device is to be always there. > > >> > > >> Did you even talk to upper layer management about it? > > >> Just list the steps they need to do and you will see > > >> that's a lot of machinery to manage by the upper layer. > > >> > > >> What do we gain in flexibility? As far as I can see the > > >> only gain is some resources saved for legacy VMs. > > >> > > >> That's not a lot as tenant of the upper layer probably already has > > >> at least a hunch that it's a new guest otherwise > > >> why bother specifying the feature at all - you > > >> save even more resources without it. > > >> > > > > > > I am not all that familiar with how Qemu manages network devices. If we > > > can > > > do all the > > > required management of the primary/standby devices within Qemu, that is > > > definitely a better > > > approach without upper layer involvement. > > > > Right. I would imagine in the extreme case the upper layer doesn't > > have to be involved at all if QEMU manages all hot plug/unplug logic. > > The management tool can supply passthrough device and virtio with the > > same group UUID, QEMU auto-manages the presence of the primary, and > > hot plug the device as needed before or after the migration. > > I do not really see how you can manage that kind of stuff in QEMU only. So right now failover is limited to pci passthrough devices only. The idea is to realize the vfio device but not expose it to guest. Have a separate command to expose it to guest. Hotunplug would also hide it from guest but not unrealize it. This will help ensure that e.g. on migration failure we can re-expose the device without risk of running out of resources. -- MST ___ Virtualization mailing list Virtualization@lists.linux-foundation.org https://lists.linuxfoundation.org/mailman/listinfo/virtualization
Re: [PATCH 0/3] Use sbitmap instead of percpu_ida
Matthew, > Removing the percpu_ida code nets over 400 lines of removal. It's not > as spectacular as deleting an entire architecture, but it's still a > worthy reduction in lines of code. Since most of the changes are in scsi or target, should I take this series through my tree? -- Martin K. Petersen Oracle Linux Engineering ___ Virtualization mailing list Virtualization@lists.linux-foundation.org https://lists.linuxfoundation.org/mailman/listinfo/virtualization
Re: [virtio-dev] Re: [Qemu-devel] [PATCH] qemu: Introduce VIRTIO_NET_F_STANDBY feature bit to virtio_net
Thank you for sharing your thoughts, Cornelia. With questions below, I think you raised really good points, some of which I don't have answer yet and would also like to explore here. First off, I don't want to push the discussion to the extreme at this point, or sell anything about having QEMU manage everything automatically. Don't get me wrong, it's not there yet. Let's don't assume we are tied to a specific or concerte solution. I think the key for our discussion might be to define or refine the boundary between VM and guest, e.g. what each layer is expected to control and manage exactly. In my view, there might be possibly 3 different options to represent the failover device conceipt to QEMU and libvirt (or any upper layer software): a. Seperate device: in this model, virtio and passthough remains separate devices just as today. QEMU exposes the standby feature bit for virtio, and publish status/event around the negotiation process of this feature bit for libvirt to react upon. Since Libvirt has the pairing relationship itself, maybe through MAC address or something else, it can control the presence of primary by hot plugging or unplugging the passthrough device, although it has to work tightly with virtio's feature negotation process. Not just for migration but also various corner scenarios (driver/feature ok, device reset, reboot, legacy guest etc) along virtio's feature negotiation. b. Coupled device: in this model, virtio and passthough devices are weakly coupled using some group ID, i.e. QEMU match the passthough device for a standby virtio instance by comparing the group ID value present behind each device's bridge. Libvirt provides QEMU the group ID for both type of devices, and only deals with hot plug for migration, by checking some migration status exposed (e.g. the feature negotiation status on the virtio device) by QEMU. QEMU manages the visibility of the primary in guest along virtio's feature negotiation process. c. Fully combined device: in this model, virtio and passthough devices are viewed as a single VM interface altogther. QEMU not just controls the visibility of the primary in guest, but can also manage the exposure of the passthrough for migratability. It can be like that libvirt supplies the group ID to QEMU. Or libvirt does not even have to provide group ID for grouping the two devices, if just one single combined device is exposed by QEMU. In either case, QEMU manages all aspect of such internal construct, including virtio feature negotiation, presence of the primary, and live migration. It looks like to me that, in your opinion, you seem to prefer go with (a). While I'm actually okay with either (b) or (c). Do I understand your point correctly? The reason that I feel that (a) might not be ideal, just as Michael alluded to (quoting below), is that as management stack, it really doesn't need to care about the detailed process of feature negotiation (if we view the guest presence of the primary as part of feature negotiation at an extended level not just virtio). All it needs to be done is to hand in the required devices to QEMU and that's all. Why do we need to addd various hooks, events for whichever happens internally within the guest? '' Primary device is added with a special "primary-failover" flag. A virtual machine is then initialized with just a standby virtio device. Primary is not yet added. Later QEMU detects that guest driver device set DRIVER_OK. It then exposes the primary device to the guest, and triggers a device addition event (hot-plug event) for it. If QEMU detects guest driver removal, it initiates a hot-unplug sequence to remove the primary driver. In particular, if QEMU detects guest re-initialization (e.g. by detecting guest reset) it immediately removes the primary device. '' and, '' management just wants to give the primary to guest and later take it back, it really does not care about the details of the process, so I don't see what does pushing it up the stack buy you. So I don't think it *needs* to be done in libvirt. It probably can if you add a bunch of hooks so it knows whenever vm reboots, driver binds and unbinds from device, and can check that backup flag was set. If you are pushing for a setup like that please get a buy-in from libvirt maintainers or better write a patch. '' Let me know if my clarifications sound clear to you now. Thanks, -Siwei On Thu, Jun 14, 2018 at 3:02 AM, Cornelia Huck wrote: > I've been pointed to this discussion (which I had missed previously) > and I'm getting a headache. Let me first summarize how I understand how > this feature is supposed to work, then I'll respond to some individual > points. > > The basic idea is to enable guests to migrate seamlessly, while still > making it possible for them to use a passed-through device for more > performance etc. The means to do so is to hook a virtio-net device > together with a network device passed through via vfio. The > vfio-handled device is there for
Re: [PATCH v5 2/3] x86/asm: add _ASM_ARG* constants for argument registers to
On 06/14/18 13:59, Nick Desaulniers wrote: > On Thu, Jun 14, 2018 at 1:48 PM H. Peter Anvin wrote: >> >> On 06/13/18 14:05, Nick Desaulniers wrote: >>> From: "H. Peter Anvin" >>> >>> i386 and x86-64 uses different registers for arguments; make them >>> available so we don't have to #ifdef in the actual code. >>> >>> Native size and specified size (q, l, w, b) versions are provided. >>> >>> Suggested-by: Sedat Dilek >>> Signed-off-by: H. Peter Anvin >>> Signed-off-by: Nick Desaulniers >> >> I still object to Suggested-by: here. Sedat did a correction, which is >> a Reviewed-by:, but unless I'm completely out to sea, there was no >> suggestion on Sedat's part of this; and I had certainly not seen it when >> I wrote the code. > > I'm fine with changing it from a Suggested-by to a Reviewed-by. Can > you do that when you apply the set, or should I send a v6? > I'm not handling patch mechanics for x86 at the moment. -hpa ___ Virtualization mailing list Virtualization@lists.linux-foundation.org https://lists.linuxfoundation.org/mailman/listinfo/virtualization
Re: [PATCH v5 2/3] x86/asm: add _ASM_ARG* constants for argument registers to
On 06/13/18 14:05, Nick Desaulniers wrote: > From: "H. Peter Anvin" > > i386 and x86-64 uses different registers for arguments; make them > available so we don't have to #ifdef in the actual code. > > Native size and specified size (q, l, w, b) versions are provided. > > Suggested-by: Sedat Dilek > Signed-off-by: H. Peter Anvin > Signed-off-by: Nick Desaulniers I still object to Suggested-by: here. Sedat did a correction, which is a Reviewed-by:, but unless I'm completely out to sea, there was no suggestion on Sedat's part of this; and I had certainly not seen it when I wrote the code. -hpa ___ Virtualization mailing list Virtualization@lists.linux-foundation.org https://lists.linuxfoundation.org/mailman/listinfo/virtualization
Re: [virtio-dev] Re: [Qemu-devel] [PATCH] qemu: Introduce VIRTIO_NET_F_STANDBY feature bit to virtio_net
On Wed, Jun 13, 2018 at 06:02:01PM -0700, Siwei Liu wrote: > >> And it's the guest that needs failover support not the VM. > > > > > > Isn't guest and VM synonymous? Guest is whatever software is running on top of the hypervisor. The virtual machine is the interface between the two. -- MST ___ Virtualization mailing list Virtualization@lists.linux-foundation.org https://lists.linuxfoundation.org/mailman/listinfo/virtualization
Re: [virtio-dev] Re: [Qemu-devel] [PATCH] qemu: Introduce VIRTIO_NET_F_STANDBY feature bit to virtio_net
I've been pointed to this discussion (which I had missed previously) and I'm getting a headache. Let me first summarize how I understand how this feature is supposed to work, then I'll respond to some individual points. The basic idea is to enable guests to migrate seamlessly, while still making it possible for them to use a passed-through device for more performance etc. The means to do so is to hook a virtio-net device together with a network device passed through via vfio. The vfio-handled device is there for performance, the virtio device for migratability. We have a new virtio feature bit for that which needs to be negotiated for that 'combined' device to be available. We have to consider two cases: - Older guests that do not support the new feature bit. We presume that those guests will be confused if they get two network devices with the same MAC, so the idea is to not show them the vfio-handled device at all. - Guests that negotiate the feature bit. We only know positively that they (a) know the feature bit and (b) are prepared to handle the consequences of negotiating it after they set the FEATURES_OK bit. This is therefore the earliest point in time that the vfio-handled device should be visible or usable for the guest. On Wed, 13 Jun 2018 18:02:01 -0700 Siwei Liu wrote: > On Tue, Jun 12, 2018 at 5:08 PM, Samudrala, Sridhar > wrote: > > On 6/12/2018 4:34 AM, Michael S. Tsirkin wrote: > >> > >> On Mon, Jun 11, 2018 at 10:02:45PM -0700, Samudrala, Sridhar wrote: > >>> > >>> On 6/11/2018 7:17 PM, Michael S. Tsirkin wrote: > > On Tue, Jun 12, 2018 at 09:54:44AM +0800, Jason Wang wrote: > > > > On 2018年06月12日 01:26, Michael S. Tsirkin wrote: > >> > >> On Mon, May 07, 2018 at 04:09:54PM -0700, Sridhar Samudrala wrote: > >>> > >>> This feature bit can be used by hypervisor to indicate virtio_net > >>> device to > >>> act as a standby for another device with the same MAC address. > >>> > >>> I tested this with a small change to the patch to mark the STANDBY > >>> feature 'true' > >>> by default as i am using libvirt to start the VMs. > >>> Is there a way to pass the newly added feature bit 'standby' to qemu > >>> via libvirt > >>> XML file? > >>> > >>> Signed-off-by: Sridhar Samudrala > >> > >> So I do not think we can commit to this interface: we > >> really need to control visibility of the primary device. > > > > The problem is legacy guest won't use primary device at all if we do > > this. > > And that's by design - I think it's the only way to ensure the > legacy guest isn't confused. > >>> > >>> Yes. I think so. But i am not sure if Qemu is the right place to control > >>> the visibility > >>> of the primary device. The primary device may not be specified as an > >>> argument to Qemu. It > >>> may be plugged in later. > >>> The cloud service provider is providing a feature that enables low > >>> latency datapath and live > >>> migration capability. > >>> A tenant can use this feature only if he is running a VM that has > >>> virtio-net with failover support. So, do you know from the outset that there will be such a coupled device? I.e., is it a property of the VM definition? Can there be a 'prepared' virtio-net device that presents the STANDBY feature even if there currently is no vfio-handled device available -- but making it possible to simply hotplug that device later? Should it be possible to add a virtio/vfio pair later on? > >> > >> Well live migration is there already. The new feature is low latency > >> data path. > > > > > > we get live migration with just virtio. But I meant live migration with VF > > as > > primary device. > > > >> > >> And it's the guest that needs failover support not the VM. > > > > > > Isn't guest and VM synonymous? I think we need to be really careful to not mix up the two: The VM contains the definitions, but it is up to the guest how it uses them. > > > > > >> > >> > >>> I think Qemu should check if guest virtio-net supports this feature and > >>> provide a mechanism for > >>> an upper layer indicating if the STANDBY feature is successfully > >>> negotiated or not. > >>> The upper layer can then decide if it should hot plug a VF with the same > >>> MAC and manage the 2 links. > >>> If VF is successfully hot plugged, virtio-net link should be disabled. > >> > >> Did you even talk to upper layer management about it? > >> Just list the steps they need to do and you will see > >> that's a lot of machinery to manage by the upper layer. > >> > >> What do we gain in flexibility? As far as I can see the > >> only gain is some resources saved for legacy VMs. > >> > >> That's not a lot as tenant of the upper layer probably already has > >> at least a hunch that it's a new guest otherwise > >> why bother specifying the feature at all - you > >> save even more resources without it. > >> > > > > I