Re: [question] VFIO Device Migration: The vCPU may be paused during vfio device DMA in iommu nested stage mode && vSVA
On 9/24/2021 12:17 PM, Tian, Kevin wrote:
> From: Kunkun Jiang Sent: Friday, September 24, 2021 2:19 PM
>> Hi all,
>>
>> I encountered a problem in vfio device migration testing: the vCPU may
>> be paused during vfio-pci DMA in iommu nested stage mode with vSVA.
>> This may lead to migration failure and to other problems that depend on
>> the device hardware and driver implementation. It may be a bit early to
>> discuss this issue, since iommu nested stage mode and vSVA are not yet
>> mature, but judging from the current implementation we will definitely
>> encounter this problem in the future.
>
> Yes, this is a known limitation of supporting migration with vSVA.
>
>> This is the current process of handling a translation fault with vSVA
>> in iommu nested stage mode (taking SMMU as an example):
>>
>>   guest os:  4. handle translation fault    5. send CMD_RESUME to vSMMU
>>   qemu:      3. inject fault into guest os  6. deliver response to host os (vfio/vsmmu)
>>   host os:   2. notify qemu                 7. send CMD_RESUME to SMMU (vfio/smmu)
>>   SMMU:      1. address translation fault   8. retry or terminate
>>
>> The order is 1 -> 8. Currently, QEMU may pause the vCPU at any step; in
>> particular it can pause the vCPU at steps 1-5, i.e. in the middle of a
>> DMA. This may lead to migration failure and to other problems that
>> depend on the device hardware and driver implementation. For example,
>> the device state cannot be changed from RUNNING && SAVING to SAVING,
>> because the device DMA is not over. As far as I can see, a vCPU should
>> not be paused during a device I/O process such as DMA. However, live
>> migration currently does not pay attention to the state of the vfio
>> device when pausing the vCPU, and if the vCPU is not paused, the vfio
>> device keeps running. This looks like a *deadlock*.
>
> Basically this requires:
>
> 1) stopping the vCPU after stopping the device (this sequence could be
>    enabled selectively for vSVA);

I don't think this change is required. When the vCPUs are halted, the
vCPU states are already saved; step 4 or 5 will be taken care of by that.
Then, when the device is transitioned to the SAVING state, save the QEMU
and host OS state in the migration stream, i.e. the state at steps 2 and
3, and depending on that, decide on resume whether step 6 or 7 should
run.

Thanks,
Kirti

> 2) when stopping the device, the driver should block new requests from
>    the vCPU (queuing them on a pending list) and then drain all
>    in-flight requests, including faults;
>    * blocking further requires switching the cmd portal from the fast
>      path to the slow trap-emulation path before stopping the device;
> 3) save the pending requests in the VM image and replay them after the
>    VM is resumed;
>    * finally, disable blocking by switching the cmd portal back to the
>      fast path;
>
>> Do you have any ideas to solve this problem? Looking forward to your
>> reply.
>
> We verified that the above flow works in our internal POC.
>
> Thanks
> Kevin
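The constraint under discussion, that the RUNNING && SAVING -> SAVING transition must be refused until in-flight faults are drained, can be modeled in a few lines of C. This is an illustrative sketch with made-up names, not QEMU or kernel code:

```c
#include <assert.h>
#include <stdbool.h>

/* Toy model; flag layout mirrors the v1 VFIO migration protocol. */
enum { DEV_RUNNING = 1 << 0, DEV_SAVING = 1 << 1 };

struct toy_dev {
    unsigned state;
    int inflight_faults; /* DMAs stalled at steps 1-5 of the fault flow */
};

/* RUNNING && SAVING -> SAVING is only legal once DMA is quiesced. */
static bool try_stop_and_copy(struct toy_dev *d)
{
    if (d->inflight_faults > 0) {
        return false; /* pausing vCPUs here is the deadlock in question */
    }
    d->state = DEV_SAVING;
    return true;
}

/* The proposed step 2): block new requests, drain in-flight faults
 * (wait for CMD_RESUME, step 7), and only then stop the device. */
static void drain_faults(struct toy_dev *d)
{
    while (d->inflight_faults > 0) {
        d->inflight_faults--; /* fault resolved; DMA retried or terminated */
    }
}
```

The sketch only captures the ordering argument: stopping the device (and hence the vCPU) must wait for the drain, otherwise the transition can never complete.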
Re: [PATCH 07/16] vfio: Avoid error_propagate() after migrate_add_blocker()
On 7/20/2021 6:23 PM, Markus Armbruster wrote:
> When migrate_add_blocker(blocker, &errp) is followed by
> error_propagate(errp, err), we can often just as well do
> migrate_add_blocker(..., errp). This is the case in
> vfio_migration_probe().
>
> Prior art: commit 386f6c07d2 "error: Avoid error_propagate() after
> migrate_add_blocker()".
>
> Cc: Kirti Wankhede
> Cc: Alex Williamson
> Signed-off-by: Markus Armbruster
> ---
>  hw/vfio/migration.c | 6 ++----
>  1 file changed, 2 insertions(+), 4 deletions(-)
>
> diff --git a/hw/vfio/migration.c b/hw/vfio/migration.c
> index 82f654afb6..ff6b45de6b 100644
> --- a/hw/vfio/migration.c
> +++ b/hw/vfio/migration.c
> @@ -858,7 +858,6 @@ int vfio_migration_probe(VFIODevice *vbasedev, Error **errp)
>  {
>      VFIOContainer *container = vbasedev->group->container;
>      struct vfio_region_info *info = NULL;
> -    Error *local_err = NULL;
>      int ret = -ENOTSUP;
>
>      if (!vbasedev->enable_migration || !container->dirty_pages_supported) {
> @@ -885,9 +884,8 @@ add_blocker:
>                 "VFIO device doesn't support migration");
>      g_free(info);
>
> -    ret = migrate_add_blocker(vbasedev->migration_blocker, &local_err);
> -    if (local_err) {
> -        error_propagate(errp, local_err);
> +    ret = migrate_add_blocker(vbasedev->migration_blocker, errp);
> +    if (ret < 0) {
>          error_free(vbasedev->migration_blocker);
>          vbasedev->migration_blocker = NULL;
>      }

Reviewed-by: Kirti Wankhede
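The simplification in the patch is a general errp pattern. A minimal stand-alone sketch (using a stand-in Error type, not QEMU's real qapi/error.h API): when the callee's return value already signals failure, the caller can hand its errp straight through and branch on the return value, with no local_err and no error_propagate():

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Minimal stand-in for QEMU's Error; illustrative only. */
typedef struct Error { char msg[64]; } Error;

static void error_setg_toy(Error **errp, const char *msg)
{
    if (errp && !*errp) {
        *errp = calloc(1, sizeof(Error)); /* zeroed, so msg stays terminated */
        strncpy((*errp)->msg, msg, sizeof((*errp)->msg) - 1);
    }
}

/* Callee in the migrate_add_blocker() mold: fills *errp and returns < 0. */
static int add_blocker_toy(int fail, Error **errp)
{
    if (fail) {
        error_setg_toy(errp, "blocker rejected");
        return -1;
    }
    return 0;
}

/* Caller after the patch: pass errp through, branch on the return value. */
static int probe_toy(int fail, Error **errp)
{
    int ret = add_blocker_toy(fail, errp);
    if (ret < 0) {
        return ret; /* cleanup that used to follow error_propagate() */
    }
    return 0;
}
```

The local_err dance is only needed when the caller must inspect the error before forwarding it; here the return value carries all the information the caller uses.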
Re: [PATCH v2 17/21] contrib/gitdm: add domain-map for NVIDIA
On 7/14/2021 11:50 PM, Alex Bennée wrote:
> Signed-off-by: Alex Bennée
> Cc: Kirti Wankhede
> Cc: Yishai Hadas
> Message-Id: <20210714093638.21077-18-alex.ben...@linaro.org>
> ---
>  contrib/gitdm/domain-map | 1 +
>  1 file changed, 1 insertion(+)
>
> diff --git a/contrib/gitdm/domain-map b/contrib/gitdm/domain-map
> index 0b0cd9feee..329ff09029 100644
> --- a/contrib/gitdm/domain-map
> +++ b/contrib/gitdm/domain-map
> @@ -24,6 +24,7 @@ microsoft.com   Microsoft
>  mvista.com      MontaVista
>  nokia.com       Nokia
>  nuviainc.com    NUVIA
> +nvidia.com      NVIDIA
>  oracle.com      Oracle
>  proxmox.com     Proxmox
>  quicinc.com     Qualcomm Innovation Center

Reviewed-by: Kirti Wankhede
Re: [PATCH v1 1/1] vfio: Make migration support non experimental by default.
On 7/10/2021 1:14 PM, Claudio Fontana wrote:
> On 3/8/21 5:09 PM, Tarun Gupta wrote:
>> VFIO migration support in QEMU is experimental as of now, which was done
>> to provide soak time and resolve concerns regarding the bit-stream. But
>> with the patches discussed in
>> https://www.mail-archive.com/qemu-devel@nongnu.org/msg784931.html
>> we have corrected the ordering of saving the PCI config space and the
>> bit-stream. So this patch proposes enabling vfio migration support in
>> QEMU by default.
>>
>> Tested by successfully migrating an mdev device.
>>
>> Signed-off-by: Tarun Gupta
>> Signed-off-by: Kirti Wankhede
>> ---
>>  hw/vfio/pci.c | 2 +-
>>  1 file changed, 1 insertion(+), 1 deletion(-)
>>
>> diff --git a/hw/vfio/pci.c b/hw/vfio/pci.c
>> index f74be78209..15e26f460b 100644
>> --- a/hw/vfio/pci.c
>> +++ b/hw/vfio/pci.c
>> @@ -3199,7 +3199,7 @@ static Property vfio_pci_dev_properties[] = {
>>      DEFINE_PROP_BIT("x-igd-opregion", VFIOPCIDevice, features,
>>                      VFIO_FEATURE_ENABLE_IGD_OPREGION_BIT, false),
>>      DEFINE_PROP_BOOL("x-enable-migration", VFIOPCIDevice,
>> -                     vbasedev.enable_migration, false),
>> +                     vbasedev.enable_migration, true),
>>      DEFINE_PROP_BOOL("x-no-mmap", VFIOPCIDevice, vbasedev.no_mmap, false),
>>      DEFINE_PROP_BOOL("x-balloon-allowed", VFIOPCIDevice,
>>                       vbasedev.ram_block_discard_allowed, false),
>
> Hello, has plain snapshot been tested?

Yes.

> If I issue the HMP command "savevm", and then "loadvm", will things work
> fine?

Yes.

Thanks,
Kirti
Re: [PATCH v1 1/1] vfio/migration: Correct device state from vmstate change for savevm case.
CCing more Nvidia folks who are testing this patch. Gentle ping for review.

Thanks,
Kirti

On 6/9/2021 12:07 AM, Kirti Wankhede wrote:
> Set the _SAVING flag for the device state from the vmstate change
> handler when it gets called for savevm.
>
> Currently the state transition for savevm/suspend is seen as:
>     _RUNNING -> _STOP -> Stop-and-copy -> _STOP
> The state transition for savevm/suspend should be:
>     _RUNNING -> Stop-and-copy -> _STOP
>
> The transition from _RUNNING to _STOP occurs in vfio_vmstate_change():
> when the vmstate changes from running to !running, the _RUNNING flag is
> reset, but at the same time, when vfio_vmstate_change() is called for
> RUN_STATE_SAVE_VM, the _SAVING bit should be set.
>
> Reported-by: Yishai Hadas
> Signed-off-by: Kirti Wankhede
> ---
>  hw/vfio/migration.c | 11 ++++++++++-
>  1 file changed, 10 insertions(+), 1 deletion(-)
>
> diff --git a/hw/vfio/migration.c b/hw/vfio/migration.c
> index 384576cfc051..33242b2313b9 100644
> --- a/hw/vfio/migration.c
> +++ b/hw/vfio/migration.c
> @@ -725,7 +725,16 @@ static void vfio_vmstate_change(void *opaque, bool running, RunState state)
>           * _RUNNING bit
>           */
>          mask = ~VFIO_DEVICE_STATE_RUNNING;
> -        value = 0;
> +
> +        /*
> +         * When the VM state transitions to stop for the savevm command,
> +         * the device should start saving data.
> +         */
> +        if (state == RUN_STATE_SAVE_VM) {
> +            value = VFIO_DEVICE_STATE_SAVING;
> +        } else {
> +            value = 0;
> +        }
>      }
>
>      ret = vfio_migration_set_state(vbasedev, mask, value);
[PATCH v1 1/1] vfio/migration: Correct device state from vmstate change for savevm case.
Set the _SAVING flag for the device state from the vmstate change handler
when it gets called for savevm.

Currently the state transition for savevm/suspend is seen as:
    _RUNNING -> _STOP -> Stop-and-copy -> _STOP
The state transition for savevm/suspend should be:
    _RUNNING -> Stop-and-copy -> _STOP

The transition from _RUNNING to _STOP occurs in vfio_vmstate_change():
when the vmstate changes from running to !running, the _RUNNING flag is
reset, but at the same time, when vfio_vmstate_change() is called for
RUN_STATE_SAVE_VM, the _SAVING bit should be set.

Reported-by: Yishai Hadas
Signed-off-by: Kirti Wankhede
---
 hw/vfio/migration.c | 11 ++++++++++-
 1 file changed, 10 insertions(+), 1 deletion(-)

diff --git a/hw/vfio/migration.c b/hw/vfio/migration.c
index 384576cfc051..33242b2313b9 100644
--- a/hw/vfio/migration.c
+++ b/hw/vfio/migration.c
@@ -725,7 +725,16 @@ static void vfio_vmstate_change(void *opaque, bool running, RunState state)
          * _RUNNING bit
          */
         mask = ~VFIO_DEVICE_STATE_RUNNING;
-        value = 0;
+
+        /*
+         * When the VM state transitions to stop for the savevm command,
+         * the device should start saving data.
+         */
+        if (state == RUN_STATE_SAVE_VM) {
+            value = VFIO_DEVICE_STATE_SAVING;
+        } else {
+            value = 0;
+        }
     }

     ret = vfio_migration_set_state(vbasedev, mask, value);
-- 
2.7.0
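The mask/value update the patch relies on can be checked in isolation. A hedged sketch (flag bit positions as in the v1 VFIO migration protocol; function names are illustrative, not QEMU's):

```c
#include <assert.h>

/* Flag bits laid out as in the v1 VFIO migration protocol. */
#define TOY_STATE_RUNNING (1u << 0)
#define TOY_STATE_SAVING  (1u << 1)

/* vfio_migration_set_state()-style update: keep masked bits, OR in value. */
static unsigned toy_set_state(unsigned old, unsigned mask, unsigned value)
{
    return (old & mask) | value;
}

/* The handler's stop path after the patch: always clear _RUNNING, and
 * additionally set _SAVING when the run state is the savevm one. */
static unsigned toy_vmstate_stop(unsigned old, int is_savevm)
{
    unsigned mask = ~TOY_STATE_RUNNING;
    unsigned value = is_savevm ? TOY_STATE_SAVING : 0;
    return toy_set_state(old, mask, value);
}
```

With this, a savevm stop yields _SAVING directly (stop-and-copy) instead of a bare _STOP, matching the corrected transition in the commit message.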
Re: [PATCH] vfio: Fix unregister SaveVMHandler in vfio_migration_finalize
On 5/28/2021 7:34 AM, Kunkun Jiang wrote:
> Hi Philippe,
> On 2021/5/27 21:44, Philippe Mathieu-Daudé wrote:
>> On 5/27/21 2:31 PM, Kunkun Jiang wrote:
>>> In vfio_migration_init(), the SaveVMHandlers is registered for the
>>> VFIO device, but the corresponding 'unregister' operation is missing.
>>> This leads to a 'Segmentation fault (core dumped)' in
>>> qemu_savevm_state_setup() when performing live migration after a VFIO
>>> device has been hot-removed.
>>>
>>> Fixes: 7c2f5f75f94 (vfio: Register SaveVMHandlers for VFIO device)
>>> Reported-by: Qixin Gan
>>> Signed-off-by: Kunkun Jiang
>>> Cc: qemu-sta...@nongnu.org
>>> ---
>>>  hw/vfio/migration.c | 1 +
>>>  1 file changed, 1 insertion(+)
>>>
>>> diff --git a/hw/vfio/migration.c b/hw/vfio/migration.c
>>> index 201642d75e..ef397ebe6c 100644
>>> --- a/hw/vfio/migration.c
>>> +++ b/hw/vfio/migration.c
>>> @@ -892,6 +892,7 @@ void vfio_migration_finalize(VFIODevice *vbasedev)
>>>          remove_migration_state_change_notifier(&migration->migration_state);
>>>          qemu_del_vm_change_state_handler(migration->vm_state);
>>> +        unregister_savevm(VMSTATE_IF(vbasedev->dev), "vfio", vbasedev);
>>
>> Hmm what about devices using "%s/vfio" id?
>
> unregister_savevm() takes a 'VMStateIf *obj'. If we pass a non-null
> 'obj' to unregister_savevm(), it handles devices using a "%s/vfio" id
> with the following code:
>
>     if (obj) {
>         char *oid = vmstate_if_get_id(obj);
>         if (oid) {
>             pstrcpy(id, sizeof(id), oid);
>             pstrcat(id, sizeof(id), "/");
>             g_free(oid);
>         }
>     }
>     pstrcat(id, sizeof(id), idstr);

This fix seems fine to me.

> By the way, I'm puzzled that register_savevm_live() and
> unregister_savevm() handle devices using a "%s/vfio" id differently, so
> I looked into the commit history of both. In the beginning, both needed
> a 'DeviceState *dev', which was replaced with VMStateIf in 3cad405babb.
> Later, in ce62df5378b, the 'dev' parameter was removed from
> register_savevm_live(), because no caller needed to pass a non-null
> 'dev' at that time. So now vfio devices need to construct the 'id'
> first and then call register_savevm_live().
> I am wondering whether we need to add a 'VMStateIf *obj' parameter back
> to register_savevm_live(). What do you think?

I think the proposed change above is independent of this fix; I'll defer
to other experts.

Reviewed-by: Kirti Wankhede
Re: [PATCH v3 1/3] vfio: Move the saving of the config space to the right place in VFIO migration
Reviewed-by: Kirti Wankhede On 2/23/2021 7:52 AM, Shenming Lu wrote: On ARM64 the VFIO SET_IRQS ioctl is dependent on the VM interrupt setup, if the restoring of the VFIO PCI device config space is before the VGIC, an error might occur in the kernel. So we move the saving of the config space to the non-iterable process, thus it will be called after the VGIC according to their priorities. As for the possible dependence of the device specific migration data on it's config space, we can let the vendor driver to include any config info it needs in its own data stream. Signed-off-by: Shenming Lu --- hw/vfio/migration.c | 25 +++-- 1 file changed, 15 insertions(+), 10 deletions(-) diff --git a/hw/vfio/migration.c b/hw/vfio/migration.c index 00daa50ed8..f5bf67f642 100644 --- a/hw/vfio/migration.c +++ b/hw/vfio/migration.c @@ -575,11 +575,6 @@ static int vfio_save_complete_precopy(QEMUFile *f, void *opaque) return ret; } -ret = vfio_save_device_config_state(f, opaque); -if (ret) { -return ret; -} - ret = vfio_update_pending(vbasedev); if (ret) { return ret; @@ -620,6 +615,19 @@ static int vfio_save_complete_precopy(QEMUFile *f, void *opaque) return ret; } +static void vfio_save_state(QEMUFile *f, void *opaque) +{ +VFIODevice *vbasedev = opaque; +int ret; + +ret = vfio_save_device_config_state(f, opaque); +if (ret) { +error_report("%s: Failed to save device config space", + vbasedev->name); +qemu_file_set_error(f, ret); +} +} + static int vfio_load_setup(QEMUFile *f, void *opaque) { VFIODevice *vbasedev = opaque; @@ -670,11 +678,7 @@ static int vfio_load_state(QEMUFile *f, void *opaque, int version_id) switch (data) { case VFIO_MIG_FLAG_DEV_CONFIG_STATE: { -ret = vfio_load_device_config_state(f, opaque); -if (ret) { -return ret; -} -break; +return vfio_load_device_config_state(f, opaque); } case VFIO_MIG_FLAG_DEV_SETUP_STATE: { @@ -720,6 +724,7 @@ static SaveVMHandlers savevm_vfio_handlers = { .save_live_pending = vfio_save_pending, .save_live_iterate = vfio_save_iterate, 
.save_live_complete_precopy = vfio_save_complete_precopy, +.save_state = vfio_save_state, .load_setup = vfio_load_setup, .load_cleanup = vfio_load_cleanup, .load_state = vfio_load_state,
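The ordering the patch depends on, iterable device data first and config space in the non-iterable phase after higher-priority devices such as the VGIC, can be sketched with a tiny handler table. Illustrative only; the struct mimics the shape of, but is not, QEMU's SaveVMHandlers:

```c
#include <assert.h>
#include <stddef.h>

/* Mimics the two SaveVMHandlers callbacks relevant to the patch. */
typedef struct {
    int  (*save_live_complete_precopy)(void *opaque); /* iterable: device data */
    void (*save_state)(void *opaque);                 /* non-iterable: config space */
} ToyHandlers;

static int call_log[4];
static int ncalls;

static int toy_complete_precopy(void *opaque) { call_log[ncalls++] = 1; return 0; }
static void toy_save_state(void *opaque)      { call_log[ncalls++] = 2; }

/* Migration core reduced to the two phases that matter here: the iterable
 * phase completes first; the non-iterable .save_state runs later, so on
 * the destination the config restore lands after the VGIC is set up. */
static void toy_migrate(ToyHandlers *h)
{
    h->save_live_complete_precopy(NULL);
    /* ... VGIC and other prioritized non-iterable state goes here ... */
    h->save_state(NULL);
}
```

The point of the patch is exactly this relocation: moving the config-space save out of save_live_complete_precopy and into save_state shifts it to the later phase.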
Re: [RFC PATCH v2 1/3] vfio: Move the saving of the config space to the right place in VFIO migration
On 1/27/2021 3:06 AM, Alex Williamson wrote: On Thu, 10 Dec 2020 10:21:21 +0800 Shenming Lu wrote: On 2020/12/10 2:34, Alex Williamson wrote: On Wed, 9 Dec 2020 13:29:47 +0100 Cornelia Huck wrote: On Wed, 9 Dec 2020 16:09:17 +0800 Shenming Lu wrote: On ARM64 the VFIO SET_IRQS ioctl is dependent on the VM interrupt setup, if the restoring of the VFIO PCI device config space is before the VGIC, an error might occur in the kernel. So we move the saving of the config space to the non-iterable process, so that it will be called after the VGIC according to their priorities. As for the possible dependence of the device specific migration data on it's config space, we can let the vendor driver to include any config info it needs in its own data stream. (Should we note this in the header file linux-headers/linux/vfio.h?) Given that the header is our primary source about how this interface should act, we need to properly document expectations about what will be saved/restored when there (well, in the source file in the kernel.) That goes in both directions: what a userspace must implement, and what a vendor driver can rely on. Yeah, in order to make the vendor driver and QEMU cooperate better, we might need to document some expectations about the data section in the migration region... [Related, but not a todo for you: I think we're still missing proper documentation of the whole migration feature.] Yes, we never saw anything past v1 of the documentation patch. Thanks, I'll get back on this and send next version soon. By the way, is there anything unproper with this patch? Wish your suggestion. :-) I'm really hoping for some feedback from Kirti, I understand the NVIDIA vGPU driver to have some dependency on this. Thanks, NVIDIA driver doesn't use device config space value/information during device data restoration, so we are good with this change. 
Thanks, Kirti Alex Signed-off-by: Shenming Lu --- hw/vfio/migration.c | 25 +++-- 1 file changed, 15 insertions(+), 10 deletions(-) .
Re: [RFC PATCH v2 1/3] vfio: Move the saving of the config space to the right place in VFIO migration
On 12/9/2020 1:39 PM, Shenming Lu wrote: On ARM64 the VFIO SET_IRQS ioctl is dependent on the VM interrupt setup, if the restoring of the VFIO PCI device config space is before the VGIC, an error might occur in the kernel. So we move the saving of the config space to the non-iterable process, so that it will be called after the VGIC according to their priorities. As for the possible dependence of the device specific migration data on it's config space, we can let the vendor driver to include any config info it needs in its own data stream. (Should we note this in the header file linux-headers/linux/vfio.h?) Signed-off-by: Shenming Lu --- hw/vfio/migration.c | 25 +++-- 1 file changed, 15 insertions(+), 10 deletions(-) diff --git a/hw/vfio/migration.c b/hw/vfio/migration.c index 00daa50ed8..3b9de1353a 100644 --- a/hw/vfio/migration.c +++ b/hw/vfio/migration.c @@ -575,11 +575,6 @@ static int vfio_save_complete_precopy(QEMUFile *f, void *opaque) return ret; } -ret = vfio_save_device_config_state(f, opaque); -if (ret) { -return ret; -} - ret = vfio_update_pending(vbasedev); if (ret) { return ret; @@ -620,6 +615,19 @@ static int vfio_save_complete_precopy(QEMUFile *f, void *opaque) return ret; } +static void vfio_save_state(QEMUFile *f, void *opaque) +{ +VFIODevice *vbasedev = opaque; +int ret; + +/* The device specific data is migrated in the iterable process. */ +ret = vfio_save_device_config_state(f, opaque); +if (ret) { +error_report("%s: Failed to save device config space", + vbasedev->name); +} +} + Since error is not propagated, set error in migration stream for migration to fail, use qemu_file_set_error() on error. 
Thanks, Kirti static int vfio_load_setup(QEMUFile *f, void *opaque) { VFIODevice *vbasedev = opaque; @@ -670,11 +678,7 @@ static int vfio_load_state(QEMUFile *f, void *opaque, int version_id) switch (data) { case VFIO_MIG_FLAG_DEV_CONFIG_STATE: { -ret = vfio_load_device_config_state(f, opaque); -if (ret) { -return ret; -} -break; +return vfio_load_device_config_state(f, opaque); } case VFIO_MIG_FLAG_DEV_SETUP_STATE: { @@ -720,6 +724,7 @@ static SaveVMHandlers savevm_vfio_handlers = { .save_live_pending = vfio_save_pending, .save_live_iterate = vfio_save_iterate, .save_live_complete_precopy = vfio_save_complete_precopy, +.save_state = vfio_save_state, .load_setup = vfio_load_setup, .load_cleanup = vfio_load_cleanup, .load_state = vfio_load_state,
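The review comment hinges on save_state being a void callback: with no return path, a failure must be recorded on the migration stream itself. A toy model of that pattern (stand-in types and names; only the qemu_file_set_error() idea is taken from QEMU):

```c
#include <assert.h>

/* Stand-in for QEMUFile: the stream carries a sticky error code. */
struct toy_qemufile {
    int last_error;
};

/* Models the qemu_file_set_error() idea: record an error on the stream. */
static void toy_file_set_error(struct toy_qemufile *f, int ret)
{
    if (f->last_error == 0) { /* keep the first failure */
        f->last_error = ret;
    }
}

/* A void save_state hook: on config-save failure, poison the stream so
 * the migration core notices the error and fails the migration. */
static void toy_save_state(struct toy_qemufile *f, int config_ret)
{
    if (config_ret < 0) {
        toy_file_set_error(f, config_ret);
    }
}
```

Without the set-error call, the failed config save would be silently dropped and migration would "succeed" with an incomplete stream, which is what the review asks to prevent.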
Re: [RFC PATCH v2 1/3] vfio: Move the saving of the config space to the right place in VFIO migration
On 1/27/2021 3:06 AM, Alex Williamson wrote: On Thu, 10 Dec 2020 10:21:21 +0800 Shenming Lu wrote: On 2020/12/10 2:34, Alex Williamson wrote: On Wed, 9 Dec 2020 13:29:47 +0100 Cornelia Huck wrote: On Wed, 9 Dec 2020 16:09:17 +0800 Shenming Lu wrote: On ARM64 the VFIO SET_IRQS ioctl is dependent on the VM interrupt setup, if the restoring of the VFIO PCI device config space is before the VGIC, an error might occur in the kernel. So we move the saving of the config space to the non-iterable process, so that it will be called after the VGIC according to their priorities. As for the possible dependence of the device specific migration data on it's config space, we can let the vendor driver to include any config info it needs in its own data stream. (Should we note this in the header file linux-headers/linux/vfio.h?) Given that the header is our primary source about how this interface should act, we need to properly document expectations about what will be saved/restored when there (well, in the source file in the kernel.) That goes in both directions: what a userspace must implement, and what a vendor driver can rely on. Yeah, in order to make the vendor driver and QEMU cooperate better, we might need to document some expectations about the data section in the migration region... [Related, but not a todo for you: I think we're still missing proper documentation of the whole migration feature.] Yes, we never saw anything past v1 of the documentation patch. Thanks, By the way, is there anything unproper with this patch? Wish your suggestion. :-) I'm really hoping for some feedback from Kirti, I understand the NVIDIA vGPU driver to have some dependency on this. Thanks, I need to verify this patch. Spare me a day to verify this. Thanks, Kirti Alex Signed-off-by: Shenming Lu --- hw/vfio/migration.c | 25 +++-- 1 file changed, 15 insertions(+), 10 deletions(-) .
Re: [PATCH] vfio/migrate: Move switch of dirty tracking into vfio_memory_listener
On 1/11/2021 1:04 PM, Keqian Zhu wrote:
> For now the switch of vfio dirty page tracking is integrated into the
> vfio_save_handler, which causes some problems [1].

Sorry, I missed the [1] mail; somehow it didn't land in my inbox.

> The object of dirty tracking is guest memory, but the object of the
> vfio_save_handler is device state. This mixed logic produces unnecessary
> coupling and conflicts:
>
> 1. Coupling: Their saving granularity is different (per-VM vs
> per-device). vfio will enable dirty_page_tracking for each device, when
> actually once is enough.

That's correct, enabling dirty page tracking once is enough. But
log_start and log_stop get called on address space update transactions,
region_add() or region_del(), and at that point migration may not be
active. We don't want to allocate bitmap memory in the kernel for the
lifetime of the VM without knowing whether a migration will happen; the
vfio_iommu_type1 module should allocate bitmap memory only while
migration is active.

Paolo's suggestion to use the log_global_start and log_global_stop
callbacks seems correct. But at that point the vfio device state has not
yet changed to _SAVING, as you had identified in [1]. We could start
tracking the bitmap in the iommu_type1 module while the device is not
yet _SAVING, but querying the dirty bitmap while the device is not yet
in the _SAVING|_RUNNING state doesn't seem like an optimal solution.

Pasting here your question from [1]:

> Before starting dirty tracking, we will check and ensure that the
> device is in the _SAVING state and return an error otherwise. But the
> question is: what is the rationale? Why does the
> VFIO_IOMMU_DIRTY_PAGES ioctl have anything to do with the device state?

Let's walk through the types of devices we are supporting:

1. mdev devices without an IOMMU-backed device
   The vendor driver pins pages as and when required during runtime. We
   can say the vendor driver is smart, in that it identifies the pages
   to pin. We are good here.

2. mdev devices with an IOMMU-backed device
   This is similar to vfio-pci, a directly assigned device, where all
   pages are pinned at VM boot. The vendor driver is not smart, so a
   bitmap query always reports all pages dirty. If --auto-converge is
   not set, the VM is stuck in the pre-copy phase indefinitely. This is
   known to us.

3. mdev devices with an IOMMU-backed device and a smart vendor driver
   Here too all pages are pinned at VM boot, but the vendor driver is
   smart enough to identify pages and pin them explicitly. Pages can be
   pinned at any time, i.e. during normal VM runtime, on setting the
   _SAVING flag (entering the pre-copy phase), or while in the iterative
   pre-copy phase; there is no restriction on when vfio_pin_pages() may
   be called. The vendor driver can start pinning pages based on its
   device state when the _SAVING flag is set. In that case, if the dirty
   bitmap is queried before that, it will report all of sysmem as dirty,
   causing an unnecessary copy of sysmem.

As an optimal solution, I think it is better to query the bitmap only
after all vfio devices are in the pre-copy phase, i.e. after the _SAVING
flag is set.

> 2. Conflicts: ram_save_setup() traverses all memory_listeners to
> execute their log_start() and log_sync() hooks to get the first-round
> dirty bitmap, which is used by the bulk stage of RAM saving. However,
> it can't get a dirty bitmap from vfio, as @savevm_ram_handlers is
> registered before @vfio_save_handler.

Right, but it can get the dirty bitmap from the vfio device in its
iterative callback: ram_save_pending -> migration_bitmap_sync_precopy()
.. -> vfio_listerner_log_sync

Thanks,
Kirti

> Moving the switch of vfio dirty_page_tracking into vfio_memory_listener
> can solve the above problems. Besides, do not require devices to be in
> the SAVING state for vfio_sync_dirty_bitmap().
[1] https://www.spinics.net/lists/kvm/msg229967.html Reported-by: Zenghui Yu Signed-off-by: Keqian Zhu --- hw/vfio/common.c| 53 + hw/vfio/migration.c | 35 -- 2 files changed, 44 insertions(+), 44 deletions(-) diff --git a/hw/vfio/common.c b/hw/vfio/common.c index 6ff1daa763..9128cd7ee1 100644 --- a/hw/vfio/common.c +++ b/hw/vfio/common.c @@ -311,7 +311,7 @@ bool vfio_mig_active(void) return true; } -static bool vfio_devices_all_saving(VFIOContainer *container) +static bool vfio_devices_all_dirty_tracking(VFIOContainer *container) { VFIOGroup *group; VFIODevice *vbasedev; @@ -329,13 +329,8 @@ static bool vfio_devices_all_saving(VFIOContainer *container) return false; } -if (migration->device_state & VFIO_DEVICE_STATE_SAVING) { -if ((vbasedev->pre_copy_dirty_page_tracking == ON_OFF_AUTO_OFF) -&& (migration->device_state & VFIO_DEVICE_STATE_RUNNING)) { -return false; -} -continue; -} else { +if ((vbasedev->pre_copy_dirty_page_tracking == ON_OFF_AUTO_OFF
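The device classes discussed above differ only in which pages they report dirty. A toy one-word bitmap illustrates the two behaviors being contrasted (illustrative names; the real interface is the VFIO_IOMMU_DIRTY_PAGES ioctl on the container):

```c
#include <assert.h>
#include <stdint.h>

#define TOY_PAGE_SHIFT 12 /* 4 KiB pages, one bit per page, 64 pages total */

/* A "smart" vendor driver marks only the pages it actually pinned. */
static void toy_mark_dirty(uint64_t *bitmap, uint64_t iova)
{
    *bitmap |= 1ull << (iova >> TOY_PAGE_SHIFT);
}

/* log_sync-style query: report and reset, modeling a read-and-clear
 * dirty-bitmap query done once per iteration. */
static uint64_t toy_get_and_clear(uint64_t *bitmap)
{
    uint64_t d = *bitmap;
    *bitmap = 0;
    return d;
}

/* A non-smart, IOMMU-backed device has everything pinned at VM boot, so
 * a query reports all of guest memory dirty -- the wasted sysmem copy
 * the mail argues against querying too early. */
static uint64_t toy_all_pinned_dirty(void)
{
    return ~0ull;
}
```

Querying only after _SAVING is set gives the smart driver (case 3) a chance to narrow the reported set from "everything" down to the pages it really pinned.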
[PATCH v2 1/1] Fix to show vfio migration stat in migration status
Header file where CONFIG_VFIO is defined is not included in migration.c file. Moved populate_vfio_info() to hw/vfio/common.c file. Added its stub in stubs/vfio.c file. Updated header files and meson file accordingly. Fixes: 3710586caa5d ("qapi: Add VFIO devices migration stats in Migration stats") Signed-off-by: Kirti Wankhede --- hw/vfio/common.c | 12 +++- include/hw/vfio/vfio-common.h | 1 - include/hw/vfio/vfio.h| 2 ++ migration/migration.c | 16 +--- stubs/meson.build | 1 + stubs/vfio.c | 7 +++ 6 files changed, 22 insertions(+), 17 deletions(-) create mode 100644 stubs/vfio.c diff --git a/hw/vfio/common.c b/hw/vfio/common.c index 6ff1daa763f8..4868c0fef504 100644 --- a/hw/vfio/common.c +++ b/hw/vfio/common.c @@ -25,6 +25,7 @@ #endif #include +#include "qapi/qapi-types-migration.h" #include "hw/vfio/vfio-common.h" #include "hw/vfio/vfio.h" #include "exec/address-spaces.h" @@ -292,7 +293,7 @@ const MemoryRegionOps vfio_region_ops = { * Device state interfaces */ -bool vfio_mig_active(void) +static bool vfio_mig_active(void) { VFIOGroup *group; VFIODevice *vbasedev; @@ -311,6 +312,15 @@ bool vfio_mig_active(void) return true; } +void populate_vfio_info(MigrationInfo *info) +{ +if (vfio_mig_active()) { +info->has_vfio = true; +info->vfio = g_malloc0(sizeof(*info->vfio)); +info->vfio->transferred = vfio_mig_bytes_transferred(); +} +} + static bool vfio_devices_all_saving(VFIOContainer *container) { VFIOGroup *group; diff --git a/include/hw/vfio/vfio-common.h b/include/hw/vfio/vfio-common.h index 6141162d7aea..cc47bd7d4456 100644 --- a/include/hw/vfio/vfio-common.h +++ b/include/hw/vfio/vfio-common.h @@ -205,7 +205,6 @@ extern const MemoryRegionOps vfio_region_ops; typedef QLIST_HEAD(VFIOGroupList, VFIOGroup) VFIOGroupList; extern VFIOGroupList vfio_group_list; -bool vfio_mig_active(void); int64_t vfio_mig_bytes_transferred(void); #ifdef CONFIG_LINUX diff --git a/include/hw/vfio/vfio.h b/include/hw/vfio/vfio.h index 86248f54360a..d1e6f4b26f35 100644 --- 
a/include/hw/vfio/vfio.h +++ b/include/hw/vfio/vfio.h @@ -4,4 +4,6 @@ bool vfio_eeh_as_ok(AddressSpace *as); int vfio_eeh_as_op(AddressSpace *as, uint32_t op); +void populate_vfio_info(MigrationInfo *info); + #endif diff --git a/migration/migration.c b/migration/migration.c index 87a9b59f83f4..c164594c1d8d 100644 --- a/migration/migration.c +++ b/migration/migration.c @@ -56,10 +56,7 @@ #include "net/announce.h" #include "qemu/queue.h" #include "multifd.h" - -#ifdef CONFIG_VFIO -#include "hw/vfio/vfio-common.h" -#endif +#include "hw/vfio/vfio.h" #define MAX_THROTTLE (128 << 20) /* Migration transfer speed throttling */ @@ -1041,17 +1038,6 @@ static void populate_disk_info(MigrationInfo *info) } } -static void populate_vfio_info(MigrationInfo *info) -{ -#ifdef CONFIG_VFIO -if (vfio_mig_active()) { -info->has_vfio = true; -info->vfio = g_malloc0(sizeof(*info->vfio)); -info->vfio->transferred = vfio_mig_bytes_transferred(); -} -#endif -} - static void fill_source_migration_info(MigrationInfo *info) { MigrationState *s = migrate_get_current(); diff --git a/stubs/meson.build b/stubs/meson.build index 82b7ba60abe5..909956674847 100644 --- a/stubs/meson.build +++ b/stubs/meson.build @@ -53,3 +53,4 @@ if have_system stub_ss.add(files('semihost.c')) stub_ss.add(files('xen-hw-stub.c')) endif +stub_ss.add(files('vfio.c')) diff --git a/stubs/vfio.c b/stubs/vfio.c new file mode 100644 index ..9cc8753cd102 --- /dev/null +++ b/stubs/vfio.c @@ -0,0 +1,7 @@ +#include "qemu/osdep.h" +#include "qapi/qapi-types-migration.h" +#include "hw/vfio/vfio.h" + +void populate_vfio_info(MigrationInfo *info) +{ +} -- 2.7.0
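The patch's key move is the stub pattern: migration.c always calls populate_vfio_info(), and the build links either the real hw/vfio/common.c definition or the empty stubs/vfio.c one. A single-file sketch of the idea, with a compile-time switch standing in for meson's two source lists (types, names, and values here are illustrative, not QEMU's):

```c
#include <assert.h>
#include <stdbool.h>

typedef struct ToyMigrationInfo {
    bool has_vfio;
    long vfio_transferred;
} ToyMigrationInfo;

#ifdef CONFIG_VFIO_SKETCH
/* hw/vfio/common.c equivalent: fill in stats when VFIO migration is active. */
void populate_vfio_info_toy(ToyMigrationInfo *info)
{
    info->has_vfio = true;
    info->vfio_transferred = 4096; /* stands in for vfio_mig_bytes_transferred() */
}
#else
/* stubs/vfio.c equivalent: linked on builds without VFIO; a no-op, so the
 * caller needs no #ifdef CONFIG_VFIO around the call site. */
void populate_vfio_info_toy(ToyMigrationInfo *info)
{
    (void)info;
}
#endif
```

This is why the stub approach beats plumbing CONFIG_DEVICES into migration.c: the caller is oblivious to the configuration, and the linker resolves the difference.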
Re: [PATCH 1/1] Fix to show vfio migration stat in migration status
On 11/26/2020 12:33 AM, Dr. David Alan Gilbert wrote: * Kirti Wankhede (kwankh...@nvidia.com) wrote: On 11/25/2020 3:00 PM, Dr. David Alan Gilbert wrote: * Kirti Wankhede (kwankh...@nvidia.com) wrote: Header file where CONFIG_VFIO is defined is not included in migration.c file. Include config devices header file in migration.c. Fixes: 3710586caa5d ("qapi: Add VFIO devices migration stats in Migration stats") Signed-off-by: Kirti Wankhede Given it's got build problems; I suggest actually something cleaner would be to swing populate_vfio_info into one of the vfio specific files, add a stubs/ entry somewhere and then migration.c doesn't need to include the device or header stuff. Still function prototype for populate_vfio_info() and its stub has to be placed in some header file. Which header file isn't that important; Any recommendation which header file to use? Thanks, Kirti and the stub goes in a file in stubs/ Earlier I used CONFIG_LINUX instead of CONFIG_VFIO which works here. Should I change it back to CONFIG_LINUX? No. I'm not very much aware of meson build system, I tested by configuring specific target, but I think by default if target build is not specified during configuration, it builds for multiple target that's where this build is failing. Any help on how to fix it would be helpful. With my suggestion you don't have to do anything clever to meson (which I don't know much about either). 
Dave Thanks, Kirti Dave --- meson.build | 1 + migration/migration.c | 1 + 2 files changed, 2 insertions(+) diff --git a/meson.build b/meson.build index 7ddf983ff7f5..24526499cfb5 100644 --- a/meson.build +++ b/meson.build @@ -1713,6 +1713,7 @@ common_ss.add_all(when: 'CONFIG_USER_ONLY', if_true: user_ss) common_all = common_ss.apply(config_all, strict: false) common_all = static_library('common', + c_args:'-DCONFIG_DEVICES="@0@-config-devices.h"'.format(target) , build_by_default: false, sources: common_all.sources() + genh, dependencies: common_all.dependencies(), diff --git a/migration/migration.c b/migration/migration.c index 87a9b59f83f4..650efb81daad 100644 --- a/migration/migration.c +++ b/migration/migration.c @@ -57,6 +57,7 @@ #include "qemu/queue.h" #include "multifd.h" +#include CONFIG_DEVICES #ifdef CONFIG_VFIO #include "hw/vfio/vfio-common.h" #endif -- 2.7.0
Re: [PATCH 1/1] Fix to show vfio migration stat in migration status
On 11/25/2020 3:00 PM, Dr. David Alan Gilbert wrote: * Kirti Wankhede (kwankh...@nvidia.com) wrote: Header file where CONFIG_VFIO is defined is not included in migration.c file. Include config devices header file in migration.c. Fixes: 3710586caa5d ("qapi: Add VFIO devices migration stats in Migration stats") Signed-off-by: Kirti Wankhede Given it's got build problems; I suggest actually something cleaner would be to swing populate_vfio_info into one of the vfio specific files, add a stubs/ entry somewhere and then migration.c doesn't need to include the device or header stuff. Still function prototype for populate_vfio_info() and its stub has to be placed in some header file. Earlier I used CONFIG_LINUX instead of CONFIG_VFIO which works here. Should I change it back to CONFIG_LINUX? I'm not very much aware of meson build system, I tested by configuring specific target, but I think by default if target build is not specified during configuration, it builds for multiple target that's where this build is failing. Any help on how to fix it would be helpful. Thanks, Kirti Dave --- meson.build | 1 + migration/migration.c | 1 + 2 files changed, 2 insertions(+) diff --git a/meson.build b/meson.build index 7ddf983ff7f5..24526499cfb5 100644 --- a/meson.build +++ b/meson.build @@ -1713,6 +1713,7 @@ common_ss.add_all(when: 'CONFIG_USER_ONLY', if_true: user_ss) common_all = common_ss.apply(config_all, strict: false) common_all = static_library('common', + c_args:'-DCONFIG_DEVICES="@0@-config-devices.h"'.format(target) , build_by_default: false, sources: common_all.sources() + genh, dependencies: common_all.dependencies(), diff --git a/migration/migration.c b/migration/migration.c index 87a9b59f83f4..650efb81daad 100644 --- a/migration/migration.c +++ b/migration/migration.c @@ -57,6 +57,7 @@ #include "qemu/queue.h" #include "multifd.h" +#include CONFIG_DEVICES #ifdef CONFIG_VFIO #include "hw/vfio/vfio-common.h" #endif -- 2.7.0
Re: [PATCH 1/1] Fix to show vfio migration stat in migration status
On 11/23/2020 10:03 PM, Alex Williamson wrote: On Thu, 19 Nov 2020 01:58:47 +0530 Kirti Wankhede wrote: Header file where CONFIG_VFIO is defined is not included in migration.c file. Include config devices header file in migration.c. Fixes: 3710586caa5d ("qapi: Add VFIO devices migration stats in Migration stats") Signed-off-by: Kirti Wankhede --- meson.build | 1 + migration/migration.c | 1 + 2 files changed, 2 insertions(+) diff --git a/meson.build b/meson.build index 7ddf983ff7f5..24526499cfb5 100644 --- a/meson.build +++ b/meson.build @@ -1713,6 +1713,7 @@ common_ss.add_all(when: 'CONFIG_USER_ONLY', if_true: user_ss) common_all = common_ss.apply(config_all, strict: false) common_all = static_library('common', + c_args:'-DCONFIG_DEVICES="@0@-config-devices.h"'.format(target) , build_by_default: false, sources: common_all.sources() + genh, dependencies: common_all.dependencies(), diff --git a/migration/migration.c b/migration/migration.c index 87a9b59f83f4..650efb81daad 100644 --- a/migration/migration.c +++ b/migration/migration.c @@ -57,6 +57,7 @@ #include "qemu/queue.h" #include "multifd.h" +#include CONFIG_DEVICES #ifdef CONFIG_VFIO #include "hw/vfio/vfio-common.h" #endif Fails to build... I didn't see this in my testing. Any specific configuration/build which fails? Thanks, Kirti [1705/8465] Compiling C object libcommon.fa.p/migration_postcopy-ram.c.o [1706/8465] Compiling C object libcommon.fa.p/migration_migration.c.o FAILED: libcommon.fa.p/migration_migration.c.o cc -Ilibcommon.fa.p -I. -I.. 
-I../slirp -I../slirp/src -Iqapi -Itrace -Iui -Iui/shader -I/usr/include/libpng16 -I/usr/include/capstone -I/usr/include/SDL2 -I/usr/include/gtk-3.0 -I/usr/include/pango-1.0 -I/usr/include/glib-2.0 -I/usr/lib64/glib-2.0/include -I/usr/include/harfbuzz -I/usr/include/fribidi -I/usr/include/freetype2 -I/usr/include/cairo -I/usr/include/pixman-1 -I/usr/include/gdk-pixbuf-2.0 -I/usr/include/libmount -I/usr/include/blkid -I/usr/include/gio-unix-2.0 -I/usr/include/atk-1.0 -I/usr/include/at-spi2-atk/2.0 -I/usr/include/dbus-1.0 -I/usr/lib64/dbus-1.0/include -I/usr/include/at-spi-2.0 -I/usr/include/spice-1 -I/usr/include/spice-server -I/usr/include/cacard -I/usr/include/nss3 -I/usr/include/nspr4 -I/usr/include/vte-2.91 -I/usr/include/virgl -I/usr/include/libusb-1.0 -fdiagnostics-color=auto -pipe -Wall -Winvalid-pch -std=gnu99 -O2 -g -U_FORTIFY_SOURCE -D_FORTIFY_SOURCE=2 -m64 -mcx16 -D_GNU_SOURCE -D_FILE_OFFSET_BITS=64 -D_LARGEFILE_SOURCE -Wstrict-prototypes -Wredu ndant-decls -Wundef -Wwrite-strings -Wmissing-prototypes -fno-strict-aliasing -fno-common -fwrapv -Wold-style-declaration -Wold-style-definition -Wtype-limits -Wformat-security -Wformat-y2k -Winit-self -Wignored-qualifiers -Wempty-body -Wnested-externs -Wendif-labels -Wexpansion-to-defined -Wno-missing-include-dirs -Wno-shift-negative-value -Wno-psabi -fstack-protector-strong -isystem /tmp/tmp.HlKsni7iGC/linux-headers -isystem linux-headers -iquote /tmp/tmp.HlKsni7iGC/tcg/i386 -iquote . 
-iquote /tmp/tmp.HlKsni7iGC -iquote /tmp/tmp.HlKsni7iGC/accel/tcg -iquote /tmp/tmp.HlKsni7iGC/include -iquote /tmp/tmp.HlKsni7iGC/disas/libvixl -pthread -fPIC -DSTRUCT_IOVEC_DEFINED -D_DEFAULT_SOURCE -D_XOPEN_SOURCE=600 -DNCURSES_WIDECHAR -Wno-undef -D_REENTRANT '-DCONFIG_DEVICES="xtensa-linux-user-config-devices.h"' -MD -MQ libcommon.fa.p/migration_migration.c.o -MF libcommon.fa.p/migration_migration.c.o.d -o libcommon.fa.p/migration_migration.c.o -c ../migration/migration.c : fatal error: xtensa-linux-user-config-devices.h: No such file or directory compilation terminated. [1707/8465] Compiling C object libcommon.fa.p/hw_pci-bridge_dec.c.o [1708/8465] Compiling C object libcommon.fa.p/backends_hostmem-memfd.c.o [1709/8465] Compiling C object libcommon.fa.p/hw_display_edid-region.c.o [1710/8465] Compiling C object libcommon.fa.p/ui_gtk-gl-area.c.o [1711/8465] Compiling C object libcommon.fa.p/disas_s390.c.o [1712/8465] Compiling C object libcommon.fa.p/hw_pci-host_gpex-acpi.c.o [1713/8465] Compiling C object libcommon.fa.p/hw_misc_macio_macio.c.o [1714/8465] Compiling C object libcommon.fa.p/hw_misc_bcm2835_mbox.c.o [1715/8465] Compiling C object libcommon.fa.p/hw_pci-bridge_xio3130_upstream.c.o [1716/8465] Compiling C object libcommon.fa.p/hw_display_qxl-logger.c.o [1717/8465] Compiling C object libcommon.fa.p/hw_net_net_tx_pkt.c.o [1718/8465] Compiling C object libcommon.fa.p/hw_char_xen_console.c.o [1719/8465] Compiling C object libqemu-mips64el-softmmu.fa.p/target_mips_msa_helper.c.o [1720/8465] Compiling C object libqemu-mips64el-softmmu.fa.p/target_mips_translate.c.o [1721/8465] Compiling C++ object libcommon.fa.p/disas_nanomips.cpp.o ninja: build st
[PATCH v2 1/1] vfio: Change default dirty pages tracking behavior during migration
By default dirty pages tracking is enabled during iterative phase (pre-copy phase). Added per device opt-out option 'pre-copy-dirty-page-tracking' to disable dirty pages tracking during iterative phase. If the option 'pre-copy-dirty-page-tracking=off' is set for any VFIO device, dirty pages tracking during iterative phase will be disabled. Signed-off-by: Kirti Wankhede --- hw/vfio/common.c | 11 +++ hw/vfio/pci.c | 3 +++ include/hw/vfio/vfio-common.h | 1 + 3 files changed, 11 insertions(+), 4 deletions(-) diff --git a/hw/vfio/common.c b/hw/vfio/common.c index c1fdbf17f2e6..6ff1daa763f8 100644 --- a/hw/vfio/common.c +++ b/hw/vfio/common.c @@ -311,7 +311,7 @@ bool vfio_mig_active(void) return true; } -static bool vfio_devices_all_stopped_and_saving(VFIOContainer *container) +static bool vfio_devices_all_saving(VFIOContainer *container) { VFIOGroup *group; VFIODevice *vbasedev; @@ -329,8 +329,11 @@ static bool vfio_devices_all_stopped_and_saving(VFIOContainer *container) return false; } -if ((migration->device_state & VFIO_DEVICE_STATE_SAVING) && -!(migration->device_state & VFIO_DEVICE_STATE_RUNNING)) { +if (migration->device_state & VFIO_DEVICE_STATE_SAVING) { +if ((vbasedev->pre_copy_dirty_page_tracking == ON_OFF_AUTO_OFF) +&& (migration->device_state & VFIO_DEVICE_STATE_RUNNING)) { +return false; +} continue; } else { return false; @@ -1125,7 +1128,7 @@ static void vfio_listerner_log_sync(MemoryListener *listener, return; } -if (vfio_devices_all_stopped_and_saving(container)) { +if (vfio_devices_all_saving(container)) { vfio_sync_dirty_bitmap(container, section); } } diff --git a/hw/vfio/pci.c b/hw/vfio/pci.c index 58c0ce8971e3..5601df6d6241 100644 --- a/hw/vfio/pci.c +++ b/hw/vfio/pci.c @@ -3182,6 +3182,9 @@ static void vfio_instance_init(Object *obj) static Property vfio_pci_dev_properties[] = { DEFINE_PROP_PCI_HOST_DEVADDR("host", VFIOPCIDevice, host), DEFINE_PROP_STRING("sysfsdev", VFIOPCIDevice, vbasedev.sysfsdev), 
+DEFINE_PROP_ON_OFF_AUTO("x-pre-copy-dirty-page-tracking", VFIOPCIDevice, +vbasedev.pre_copy_dirty_page_tracking, +ON_OFF_AUTO_ON), DEFINE_PROP_ON_OFF_AUTO("display", VFIOPCIDevice, display, ON_OFF_AUTO_OFF), DEFINE_PROP_UINT32("xres", VFIOPCIDevice, display_xres, 0), diff --git a/include/hw/vfio/vfio-common.h b/include/hw/vfio/vfio-common.h index baeb4dcff102..267cf854bbba 100644 --- a/include/hw/vfio/vfio-common.h +++ b/include/hw/vfio/vfio-common.h @@ -129,6 +129,7 @@ typedef struct VFIODevice { unsigned int flags; VFIOMigration *migration; Error *migration_blocker; +OnOffAuto pre_copy_dirty_page_tracking; } VFIODevice; struct VFIODeviceOps { -- 2.7.0
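With the patch applied, the opt-out is set per device on the command line. A hypothetical invocation (the host device address is an example):

```shell
# Disable pre-copy dirty page tracking for one assigned device, using
# the experimental x- property added by this patch. The PCI address
# 0000:65:00.0 is illustrative.
qemu-system-x86_64 \
    -machine q35,accel=kvm \
    -device vfio-pci,host=0000:65:00.0,x-pre-copy-dirty-page-tracking=off
```

Note that vfio_devices_all_saving() returns false as soon as any device in the container has tracking off while still in the RUNNING state, so opting a single device out effectively skips pre-copy dirty bitmap sync for the whole container.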
Re: [v2 1/1] vfio: Change default dirty pages tracking behavior during migration
Sorry for spam, resending it again with 'PATCH' in the subject. Kirti. On 11/23/2020 7:38 PM, Kirti Wankhede wrote: By default dirty pages tracking is enabled during iterative phase (pre-copy phase). Added per device opt-out option 'pre-copy-dirty-page-tracking' to disable dirty pages tracking during iterative phase. If the option 'pre-copy-dirty-page-tracking=off' is set for any VFIO device, dirty pages tracking during iterative phase will be disabled. Signed-off-by: Kirti Wankhede --- hw/vfio/common.c | 11 +++ hw/vfio/pci.c | 3 +++ include/hw/vfio/vfio-common.h | 1 + 3 files changed, 11 insertions(+), 4 deletions(-) diff --git a/hw/vfio/common.c b/hw/vfio/common.c index c1fdbf17f2e6..6ff1daa763f8 100644 --- a/hw/vfio/common.c +++ b/hw/vfio/common.c @@ -311,7 +311,7 @@ bool vfio_mig_active(void) return true; } -static bool vfio_devices_all_stopped_and_saving(VFIOContainer *container) +static bool vfio_devices_all_saving(VFIOContainer *container) { VFIOGroup *group; VFIODevice *vbasedev; @@ -329,8 +329,11 @@ static bool vfio_devices_all_stopped_and_saving(VFIOContainer *container) return false; } -if ((migration->device_state & VFIO_DEVICE_STATE_SAVING) && -!(migration->device_state & VFIO_DEVICE_STATE_RUNNING)) { +if (migration->device_state & VFIO_DEVICE_STATE_SAVING) { +if ((vbasedev->pre_copy_dirty_page_tracking == ON_OFF_AUTO_OFF) +&& (migration->device_state & VFIO_DEVICE_STATE_RUNNING)) { +return false; +} continue; } else { return false; @@ -1125,7 +1128,7 @@ static void vfio_listerner_log_sync(MemoryListener *listener, return; } -if (vfio_devices_all_stopped_and_saving(container)) { +if (vfio_devices_all_saving(container)) { vfio_sync_dirty_bitmap(container, section); } } diff --git a/hw/vfio/pci.c b/hw/vfio/pci.c index 58c0ce8971e3..5601df6d6241 100644 --- a/hw/vfio/pci.c +++ b/hw/vfio/pci.c @@ -3182,6 +3182,9 @@ static void vfio_instance_init(Object *obj) static Property vfio_pci_dev_properties[] = { DEFINE_PROP_PCI_HOST_DEVADDR("host", VFIOPCIDevice, host), 
DEFINE_PROP_STRING("sysfsdev", VFIOPCIDevice, vbasedev.sysfsdev), +DEFINE_PROP_ON_OFF_AUTO("x-pre-copy-dirty-page-tracking", VFIOPCIDevice, +vbasedev.pre_copy_dirty_page_tracking, +ON_OFF_AUTO_ON), DEFINE_PROP_ON_OFF_AUTO("display", VFIOPCIDevice, display, ON_OFF_AUTO_OFF), DEFINE_PROP_UINT32("xres", VFIOPCIDevice, display_xres, 0), diff --git a/include/hw/vfio/vfio-common.h b/include/hw/vfio/vfio-common.h index baeb4dcff102..267cf854bbba 100644 --- a/include/hw/vfio/vfio-common.h +++ b/include/hw/vfio/vfio-common.h @@ -129,6 +129,7 @@ typedef struct VFIODevice { unsigned int flags; VFIOMigration *migration; Error *migration_blocker; +OnOffAuto pre_copy_dirty_page_tracking; } VFIODevice; struct VFIODeviceOps {
[v2 1/1] vfio: Change default dirty pages tracking behavior during migration
By default dirty pages tracking is enabled during iterative phase (pre-copy phase). Added per device opt-out option 'pre-copy-dirty-page-tracking' to disable dirty pages tracking during iterative phase. If the option 'pre-copy-dirty-page-tracking=off' is set for any VFIO device, dirty pages tracking during iterative phase will be disabled. Signed-off-by: Kirti Wankhede --- hw/vfio/common.c | 11 +++ hw/vfio/pci.c | 3 +++ include/hw/vfio/vfio-common.h | 1 + 3 files changed, 11 insertions(+), 4 deletions(-) diff --git a/hw/vfio/common.c b/hw/vfio/common.c index c1fdbf17f2e6..6ff1daa763f8 100644 --- a/hw/vfio/common.c +++ b/hw/vfio/common.c @@ -311,7 +311,7 @@ bool vfio_mig_active(void) return true; } -static bool vfio_devices_all_stopped_and_saving(VFIOContainer *container) +static bool vfio_devices_all_saving(VFIOContainer *container) { VFIOGroup *group; VFIODevice *vbasedev; @@ -329,8 +329,11 @@ static bool vfio_devices_all_stopped_and_saving(VFIOContainer *container) return false; } -if ((migration->device_state & VFIO_DEVICE_STATE_SAVING) && -!(migration->device_state & VFIO_DEVICE_STATE_RUNNING)) { +if (migration->device_state & VFIO_DEVICE_STATE_SAVING) { +if ((vbasedev->pre_copy_dirty_page_tracking == ON_OFF_AUTO_OFF) +&& (migration->device_state & VFIO_DEVICE_STATE_RUNNING)) { +return false; +} continue; } else { return false; @@ -1125,7 +1128,7 @@ static void vfio_listerner_log_sync(MemoryListener *listener, return; } -if (vfio_devices_all_stopped_and_saving(container)) { +if (vfio_devices_all_saving(container)) { vfio_sync_dirty_bitmap(container, section); } } diff --git a/hw/vfio/pci.c b/hw/vfio/pci.c index 58c0ce8971e3..5601df6d6241 100644 --- a/hw/vfio/pci.c +++ b/hw/vfio/pci.c @@ -3182,6 +3182,9 @@ static void vfio_instance_init(Object *obj) static Property vfio_pci_dev_properties[] = { DEFINE_PROP_PCI_HOST_DEVADDR("host", VFIOPCIDevice, host), DEFINE_PROP_STRING("sysfsdev", VFIOPCIDevice, vbasedev.sysfsdev), 
+DEFINE_PROP_ON_OFF_AUTO("x-pre-copy-dirty-page-tracking", VFIOPCIDevice, +vbasedev.pre_copy_dirty_page_tracking, +ON_OFF_AUTO_ON), DEFINE_PROP_ON_OFF_AUTO("display", VFIOPCIDevice, display, ON_OFF_AUTO_OFF), DEFINE_PROP_UINT32("xres", VFIOPCIDevice, display_xres, 0), diff --git a/include/hw/vfio/vfio-common.h b/include/hw/vfio/vfio-common.h index baeb4dcff102..267cf854bbba 100644 --- a/include/hw/vfio/vfio-common.h +++ b/include/hw/vfio/vfio-common.h @@ -129,6 +129,7 @@ typedef struct VFIODevice { unsigned int flags; VFIOMigration *migration; Error *migration_blocker; +OnOffAuto pre_copy_dirty_page_tracking; } VFIODevice; struct VFIODeviceOps { -- 2.7.0
Re: [PATCH RFC] vfio: Move the saving of the config space to the right place in VFIO migration
On 11/14/2020 2:47 PM, Shenming Lu wrote: When running VFIO migration, I found that the restoring of VFIO PCI device’s config space is before VGIC on ARM64 target. But generally, interrupt controllers need to be restored before PCI devices. Is there any other way by which VGIC can be restored before PCI device? Besides, if a VFIO PCI device is configured to have directly-injected MSIs (VLPIs), the restoring of its config space will trigger the configuring of these VLPIs (in kernel), where it would return an error as I saw due to the dependency on kvm’s vgic. Can this be fixed in kernel to re-initialize the kernel state? To avoid this, we can move the saving of the config space from the iterable process to the non-iterable process, so that it will be called after VGIC according to their priorities. With this change, on the resume side, pre-copy phase data would reach the destination without the config space having been restored. The VFIO device on the destination might need its config space set up and validated before it can accept further VFIO device-specific migration state. This also changes the bit-stream, so it would break migration with the original migration patch-set. 
Thanks, Kirti Signed-off-by: Shenming Lu --- hw/vfio/migration.c | 22 ++ 1 file changed, 6 insertions(+), 16 deletions(-) diff --git a/hw/vfio/migration.c b/hw/vfio/migration.c index 3ce285ea39..028da35a25 100644 --- a/hw/vfio/migration.c +++ b/hw/vfio/migration.c @@ -351,7 +351,7 @@ static int vfio_update_pending(VFIODevice *vbasedev) return 0; } -static int vfio_save_device_config_state(QEMUFile *f, void *opaque) +static void vfio_save_device_config_state(QEMUFile *f, void *opaque) { VFIODevice *vbasedev = opaque; @@ -365,13 +365,14 @@ static int vfio_save_device_config_state(QEMUFile *f, void *opaque) trace_vfio_save_device_config_state(vbasedev->name); -return qemu_file_get_error(f); +if (qemu_file_get_error(f)) +error_report("%s: Failed to save device config space", + vbasedev->name); } static int vfio_load_device_config_state(QEMUFile *f, void *opaque) { VFIODevice *vbasedev = opaque; -uint64_t data; if (vbasedev->ops && vbasedev->ops->vfio_load_config) { int ret; @@ -384,15 +385,8 @@ static int vfio_load_device_config_state(QEMUFile *f, void *opaque) } } -data = qemu_get_be64(f); -if (data != VFIO_MIG_FLAG_END_OF_STATE) { -error_report("%s: Failed loading device config space, " - "end flag incorrect 0x%"PRIx64, vbasedev->name, data); -return -EINVAL; -} - trace_vfio_load_device_config_state(vbasedev->name); -return qemu_file_get_error(f); +return 0; } static int vfio_set_dirty_page_tracking(VFIODevice *vbasedev, bool start) @@ -575,11 +569,6 @@ static int vfio_save_complete_precopy(QEMUFile *f, void *opaque) return ret; } -ret = vfio_save_device_config_state(f, opaque); -if (ret) { -return ret; -} - ret = vfio_update_pending(vbasedev); if (ret) { return ret; @@ -720,6 +709,7 @@ static SaveVMHandlers savevm_vfio_handlers = { .save_live_pending = vfio_save_pending, .save_live_iterate = vfio_save_iterate, .save_live_complete_precopy = vfio_save_complete_precopy, +.save_state = vfio_save_device_config_state, .load_setup = vfio_load_setup, .load_cleanup = 
vfio_load_cleanup, .load_state = vfio_load_state,
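The ordering constraint Shenming describes — interrupt controller state must exist before the device config-space load triggers VLPI setup — comes down to restore priorities. A toy model of priority-ordered restore (the names and priority values are illustrative, not QEMU's actual MigrationPriority entries):

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Toy model: restore entries run highest-priority first, so giving the
 * interrupt controller a higher priority than vfio-pci guarantees its
 * state exists before the device config space is loaded. */
struct vmstate_entry {
    const char *name;
    int priority;  /* higher restores earlier */
};

static int by_priority_desc(const void *a, const void *b)
{
    const struct vmstate_entry *ea = a, *eb = b;
    return eb->priority - ea->priority;
}

/* Sort the registered entries into the order restore would visit them. */
void order_restore(struct vmstate_entry *entries, size_t n)
{
    qsort(entries, n, sizeof(*entries), by_priority_desc);
}
```

Under this model, registering the GIC with a higher priority than vfio-pci — rather than relying on registration order — is what makes the restore sequence deterministic.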
[PATCH 1/1] vfio: Change default dirty pages tracking behavior during migration
By default dirty pages tracking is enabled during iterative phase (pre-copy phase). Added per device opt-out option 'pre-copy-dirty-page-tracking' to disable dirty pages tracking during iterative phase. If the option 'pre-copy-dirty-page-tracking=off' is set for any VFIO device, dirty pages tracking during iterative phase will be disabled. Signed-off-by: Kirti Wankhede --- hw/vfio/common.c | 11 +++ hw/vfio/pci.c | 3 +++ include/hw/vfio/vfio-common.h | 1 + 3 files changed, 11 insertions(+), 4 deletions(-) diff --git a/hw/vfio/common.c b/hw/vfio/common.c index c1fdbf17f2e6..6ff1daa763f8 100644 --- a/hw/vfio/common.c +++ b/hw/vfio/common.c @@ -311,7 +311,7 @@ bool vfio_mig_active(void) return true; } -static bool vfio_devices_all_stopped_and_saving(VFIOContainer *container) +static bool vfio_devices_all_saving(VFIOContainer *container) { VFIOGroup *group; VFIODevice *vbasedev; @@ -329,8 +329,11 @@ static bool vfio_devices_all_stopped_and_saving(VFIOContainer *container) return false; } -if ((migration->device_state & VFIO_DEVICE_STATE_SAVING) && -!(migration->device_state & VFIO_DEVICE_STATE_RUNNING)) { +if (migration->device_state & VFIO_DEVICE_STATE_SAVING) { +if ((vbasedev->pre_copy_dirty_page_tracking == ON_OFF_AUTO_OFF) +&& (migration->device_state & VFIO_DEVICE_STATE_RUNNING)) { +return false; +} continue; } else { return false; @@ -1125,7 +1128,7 @@ static void vfio_listerner_log_sync(MemoryListener *listener, return; } -if (vfio_devices_all_stopped_and_saving(container)) { +if (vfio_devices_all_saving(container)) { vfio_sync_dirty_bitmap(container, section); } } diff --git a/hw/vfio/pci.c b/hw/vfio/pci.c index 58c0ce8971e3..5bea4b3e71f5 100644 --- a/hw/vfio/pci.c +++ b/hw/vfio/pci.c @@ -3182,6 +3182,9 @@ static void vfio_instance_init(Object *obj) static Property vfio_pci_dev_properties[] = { DEFINE_PROP_PCI_HOST_DEVADDR("host", VFIOPCIDevice, host), DEFINE_PROP_STRING("sysfsdev", VFIOPCIDevice, vbasedev.sysfsdev), 
+DEFINE_PROP_ON_OFF_AUTO("pre-copy-dirty-page-tracking", VFIOPCIDevice, +vbasedev.pre_copy_dirty_page_tracking, +ON_OFF_AUTO_ON), DEFINE_PROP_ON_OFF_AUTO("display", VFIOPCIDevice, display, ON_OFF_AUTO_OFF), DEFINE_PROP_UINT32("xres", VFIOPCIDevice, display_xres, 0), diff --git a/include/hw/vfio/vfio-common.h b/include/hw/vfio/vfio-common.h index baeb4dcff102..267cf854bbba 100644 --- a/include/hw/vfio/vfio-common.h +++ b/include/hw/vfio/vfio-common.h @@ -129,6 +129,7 @@ typedef struct VFIODevice { unsigned int flags; VFIOMigration *migration; Error *migration_blocker; +OnOffAuto pre_copy_dirty_page_tracking; } VFIODevice; struct VFIODeviceOps { -- 2.7.0
Re: [PATCH RFC] vfio: Set the priority of VFIO VM state change handler explicitly
On 11/17/2020 7:10 AM, Shenming Lu wrote: In VFIO VM state change handler, VFIO devices are transitioned in _SAVING state, which should keep them from sending interrupts. Then we can save the pending states of all interrupts in GIC VM state change handler (on ARM). So we have to set the priority of VFIO VM state change handler explicitly (like virtio devices) to ensure it is called before GIC's in saving. Signed-off-by: Shenming Lu --- hw/vfio/migration.c | 3 ++- 1 file changed, 2 insertions(+), 1 deletion(-) diff --git a/hw/vfio/migration.c b/hw/vfio/migration.c index 55261562d4..d0d30864ba 100644 --- a/hw/vfio/migration.c +++ b/hw/vfio/migration.c @@ -857,7 +857,8 @@ static int vfio_migration_init(VFIODevice *vbasedev, register_savevm_live(id, VMSTATE_INSTANCE_ID_ANY, 1, &savevm_vfio_handlers, vbasedev); -migration->vm_state = qemu_add_vm_change_state_handler(vfio_vmstate_change, +migration->vm_state = qdev_add_vm_change_state_handler(vbasedev->dev, + vfio_vmstate_change, vbasedev); migration->migration_state.notify = vfio_migration_state_notifier; add_migration_state_change_notifier(&migration->migration_state); Looks good to me. Reviewed-by: Kirti Wankhede
[PATCH 1/1] Fix to show vfio migration stat in migration status
Header file where CONFIG_VFIO is defined is not included in migration.c file. Include config devices header file in migration.c. Fixes: 3710586caa5d ("qapi: Add VFIO devices migration stats in Migration stats") Signed-off-by: Kirti Wankhede --- meson.build | 1 + migration/migration.c | 1 + 2 files changed, 2 insertions(+) diff --git a/meson.build b/meson.build index 7ddf983ff7f5..24526499cfb5 100644 --- a/meson.build +++ b/meson.build @@ -1713,6 +1713,7 @@ common_ss.add_all(when: 'CONFIG_USER_ONLY', if_true: user_ss) common_all = common_ss.apply(config_all, strict: false) common_all = static_library('common', + c_args:'-DCONFIG_DEVICES="@0@-config-devices.h"'.format(target) , build_by_default: false, sources: common_all.sources() + genh, dependencies: common_all.dependencies(), diff --git a/migration/migration.c b/migration/migration.c index 87a9b59f83f4..650efb81daad 100644 --- a/migration/migration.c +++ b/migration/migration.c @@ -57,6 +57,7 @@ #include "qemu/queue.h" #include "multifd.h" +#include CONFIG_DEVICES #ifdef CONFIG_VFIO #include "hw/vfio/vfio-common.h" #endif -- 2.7.0
Re: [RFC PATCH for-QEMU-5.2] vfio: Make migration support experimental
On 11/10/2020 2:40 PM, Dr. David Alan Gilbert wrote: * Alex Williamson (alex.william...@redhat.com) wrote: On Mon, 9 Nov 2020 19:44:17 + "Dr. David Alan Gilbert" wrote: * Alex Williamson (alex.william...@redhat.com) wrote: Per the proposed documentation for vfio device migration: Dirty pages are tracked when device is in stop-and-copy phase because if pages are marked dirty during pre-copy phase and content is transfered from source to destination, there is no way to know newly dirtied pages from the point they were copied earlier until device stops. To avoid repeated copy of same content, pinned pages are marked dirty only during stop-and-copy phase. Essentially, since we don't have hardware dirty page tracking for assigned devices at this point, we consider any page that is pinned by an mdev vendor driver or pinned and mapped through the IOMMU to be perpetually dirty. In the worst case, this may result in all of guest memory being considered dirty during every iteration of live migration. The current vfio implementation of migration has chosen to mask device dirtied pages until the final stages of migration in order to avoid this worst case scenario. Allowing the device to implement a policy decision to prioritize reduced migration data like this jeopardizes QEMU's overall ability to implement any degree of service level guarantees during migration. For example, any estimates towards achieving acceptable downtime margins cannot be trusted when such a device is present. The vfio device should participate in dirty page tracking to the best of its ability throughout migration, even if that means the dirty footprint of the device impedes migration progress, allowing both QEMU and higher level management tools to decide whether to continue the migration or abort due to failure to achieve the desired behavior. 
I don't feel particularly badly about the decision to squash it in during the stop-and-copy phase; for devices where the pinned memory is large, I don't think doing it during the main phase makes much sense; especially if you then have to deal with tracking changes in pinning. AFAIK the kernel support for tracking changes in page pinning already exists, this is largely the vfio device in QEMU that decides when to start exposing the device dirty footprint to QEMU. I'm a bit surprised by this answer though, we don't really know what the device memory footprint is. It might be large, it might be nothing, but by not participating in dirty page tracking until the VM is stopped, we can't know what the footprint is and how it will affect downtime. Is it really the place of a QEMU device driver to impose this sort of policy? If it could actually track changes then I'd agree we shouldn't impose any policy; but if it's just marking the whole area as dirty we're going to need a bodge somewhere; this bodge doesn't look any worse than the others to me. Having said that, I agree with marking it as experimental, because I'm dubious how useful it will be for the same reason, I worry about whether the downtime will be so large to make it pointless. Not all device state is large, for example NIC might only report currently mapped RX buffers which usually not more than a 1GB and could be as low as 10's of MB. GPU might or might not have large data, that depends on its use cases. TBH I think that's the wrong reason to mark it experimental. There's clearly demand for vfio device migration and even if the practical use cases are initially small, they will expand over time and hardware will get better. My objection is that the current behavior masks the hardware and device limitations, leading to unrealistic expectations. 
If the user expects minimal downtime, configures convergence to account for that, QEMU thinks it can achieve it, and then the device marks everything dirty, that's not supportable. Yes, agreed. Yes, there is demand for vfio device migration and many device owners have started scoping and developing migration support. Instead of marking the whole migration support as experimental, we can have an opt-in option to decide whether to mark sys mem pages dirty during the iterative phase (pre-copy phase) of migration. Thanks, Kirti OTOH if the vfio device participates in dirty tracking through pre-copy, then the practical use cases will find themselves as migrations will either be aborted because downtime tolerances cannot be achieved or downtimes will be configured to match reality. Thanks, Without a way to prioritise the unpinned memory during that period, we're going to be repeatedly sending the pinned memory which is going to lead to a much larger bandwidth usage than required; so that's going in completely the wrong direction and also wrong from the point of view of the user. Dave Alex Reviewed-by: Dr. David Alan Gilbert Link: ht
Re: [PATCH v1] docs/devel: Add VFIO device migration documentation
On 11/6/2020 2:56 AM, Alex Williamson wrote: On Fri, 6 Nov 2020 02:22:11 +0530 Kirti Wankhede wrote: On 11/6/2020 12:41 AM, Alex Williamson wrote: On Fri, 6 Nov 2020 00:29:36 +0530 Kirti Wankhede wrote: On 11/4/2020 6:15 PM, Alex Williamson wrote: On Wed, 4 Nov 2020 13:25:40 +0530 Kirti Wankhede wrote: On 11/4/2020 1:57 AM, Alex Williamson wrote: On Wed, 4 Nov 2020 01:18:12 +0530 Kirti Wankhede wrote: On 10/30/2020 12:35 AM, Alex Williamson wrote: On Thu, 29 Oct 2020 23:11:16 +0530 Kirti Wankhede wrote: +System memory dirty pages tracking +-- + +A ``log_sync`` memory listener callback is added to mark system memory pages s/is added to mark/marks those/ +as dirty which are used for DMA by VFIO device. Dirty pages bitmap is queried s/by/by the/ s/Dirty/The dirty/ +per container. All pages pinned by vendor driver through vfio_pin_pages() s/by/by the/ +external API have to be marked as dirty during migration. When there are CPU +writes, CPU dirty page tracking can identify dirtied pages, but any page pinned +by vendor driver can also be written by device. There is currently no device s/by/by the/ (x2) +which has hardware support for dirty page tracking. So all pages which are +pinned by vendor driver are considered as dirty. +Dirty pages are tracked when device is in stop-and-copy phase because if pages +are marked dirty during pre-copy phase and content is transfered from source to +destination, there is no way to know newly dirtied pages from the point they +were copied earlier until device stops. To avoid repeated copy of same content, +pinned pages are marked dirty only during stop-and-copy phase. Let me take a quick stab at rewriting this paragraph (not sure if I understood it correctly): "Dirty pages are tracked when the device is in the stop-and-copy phase. 
During the pre-copy phase, it is not possible to distinguish a dirty page that has been transferred from the source to the destination from newly dirtied pages, which would lead to repeated copying of the same content. Therefore, pinned pages are only marked dirty during the stop-and-copy phase." ? I think above rephrase only talks about repeated copying in pre-copy phase. Used "copied earlier until device stops" to indicate both pre-copy and stop-and-copy till device stops. Now I'm confused, I thought we had abandoned the idea that we can only report pinned pages during stop-and-copy. Doesn't the device needs to expose its dirty memory footprint during the iterative phase regardless of whether that causes repeat copies? If QEMU iterates and sees that all memory is still dirty, it may have transferred more data, but it can actually predict if it can achieve its downtime tolerances. Which is more important, less data transfer or predictability? Thanks, Even if QEMU copies and transfers content of all sys mem pages during pre-copy (worst case with IOMMU backed mdev device when its vendor driver is not smart to pin pages explicitly and all sys mem pages are marked dirty), then also its prediction about downtime tolerance will not be correct, because during stop-and-copy again all pages need to be copied as device can write to any of those pinned pages. I think you're only reiterating my point. If QEMU copies all of guest memory during the iterative phase and each time it sees that all memory is dirty, such as if CPUs or devices (including assigned devices) are dirtying pages as fast as it copies them (or continuously marks them dirty), then QEMU can predict that downtime will require copying all pages. But as of now there is no way to know if device has dirtied pages during iterative phase. This claim doesn't make any sense, pinned pages are considered persistently dirtied, during the iterative phase and while stopped. 
If instead devices don't mark dirty pages until the VM is stopped, then QEMU might iterate through memory copy and predict a short downtime because not much memory is dirty, only to be surprised that all of memory is suddenly dirty. At that point it's too late, the VM is already stopped, the predicted short downtime takes far longer than expected. This is exactly why we made the kernel interface mark pinned pages persistently dirty when it was proposed that we only report pinned pages once. Thanks, Since there is no way to know if device dirtied pages during iterative phase, QEMU should query pinned pages in stop-and-copy phase. As above, I don't believe this is true. Whenever there will be hardware support or some software mechanism to report pages dirtied by device then we will add a capability bit in migration capability and based on that capability bit qemu/user space app should decide to query dirty pages
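Alex's predictability argument can be made concrete with back-of-envelope arithmetic (the numbers below are illustrative, not from any measured device):

```c
#include <assert.h>
#include <stdint.h>

/* Rough downtime estimate: remaining dirty bytes divided by migration
 * bandwidth. If a device masks its pinned footprint until the VM is
 * stopped, QEMU computes this from CPU-dirtied pages only and badly
 * under-estimates the real downtime. */
double estimated_downtime_s(uint64_t dirty_bytes, double bandwidth_bytes_per_s)
{
    return (double)dirty_bytes / bandwidth_bytes_per_s;
}
```

With ~100 MiB of CPU-dirty pages over a 10 Gbit/s link (~1.25 GB/s), QEMU would predict well under 0.1 s of downtime; if 16 GiB of pinned memory is only reported dirty once the VM stops, the actual copy takes closer to 13.7 s — far past any tolerance computed during pre-copy.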
[PATCH v2 1/1] Fix use after free in vfio_migration_probe
Fixes Coverity issue: CID 1436126: Memory - illegal accesses (USE_AFTER_FREE) Fixes: a9e271ec9b36 ("vfio: Add migration region initialization and finalize function") Signed-off-by: Kirti Wankhede Reviewed-by: David Edmondson Reviewed-by: Alex Bennée Reviewed-by: Philippe Mathieu-Daudé --- hw/vfio/migration.c | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/hw/vfio/migration.c b/hw/vfio/migration.c index 3ce285ea395d..55261562d4f3 100644 --- a/hw/vfio/migration.c +++ b/hw/vfio/migration.c @@ -897,8 +897,8 @@ int vfio_migration_probe(VFIODevice *vbasedev, Error **errp) goto add_blocker; } -g_free(info); trace_vfio_migration_probe(vbasedev->name, info->index); +g_free(info); return 0; add_blocker: -- 2.7.0
[PATCH 1/1] Change the order of g_free(info) and tracepoint
Fixes Coverity issue: CID 1436126: Memory - illegal accesses (USE_AFTER_FREE) Fixes: a9e271ec9b36 ("vfio: Add migration region initialization and finalize function") Signed-off-by: Kirti Wankhede --- hw/vfio/migration.c | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/hw/vfio/migration.c b/hw/vfio/migration.c index 3ce285ea395d..55261562d4f3 100644 --- a/hw/vfio/migration.c +++ b/hw/vfio/migration.c @@ -897,8 +897,8 @@ int vfio_migration_probe(VFIODevice *vbasedev, Error **errp) goto add_blocker; } -g_free(info); trace_vfio_migration_probe(vbasedev->name, info->index); +g_free(info); return 0; add_blocker: -- 2.7.0
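The Coverity finding is a plain ordering slip: the tracepoint reads info->index after g_free(info). A minimal reproduction of the corrected pattern, with plain malloc/free standing in for glib and a stand-in struct for vfio_region_info:

```c
#include <assert.h>
#include <stdlib.h>

/* Minimal stand-in for struct vfio_region_info; only the field the
 * tracepoint reads. */
struct region_info {
    unsigned int index;
};

/* Consume everything the caller needs from *info, then free it.
 * The original bug performed these two steps in the opposite order,
 * so the trace call dereferenced freed memory. */
unsigned int trace_index_then_free(struct region_info *info)
{
    unsigned int index = info->index;  /* read while still valid */
    free(info);                        /* free only after the last use */
    return index;
}
```

Tools like Coverity and AddressSanitizer both flag the reversed ordering; swapping the two statements, as the patch does, is the whole fix.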
Re: [PATCH v1] docs/devel: Add VFIO device migration documentation
On 11/6/2020 12:41 AM, Alex Williamson wrote: On Fri, 6 Nov 2020 00:29:36 +0530 Kirti Wankhede wrote: On 11/4/2020 6:15 PM, Alex Williamson wrote: On Wed, 4 Nov 2020 13:25:40 +0530 Kirti Wankhede wrote: On 11/4/2020 1:57 AM, Alex Williamson wrote: On Wed, 4 Nov 2020 01:18:12 +0530 Kirti Wankhede wrote: On 10/30/2020 12:35 AM, Alex Williamson wrote: On Thu, 29 Oct 2020 23:11:16 +0530 Kirti Wankhede wrote: +System memory dirty pages tracking +-- + +A ``log_sync`` memory listener callback is added to mark system memory pages s/is added to mark/marks those/ +as dirty which are used for DMA by VFIO device. Dirty pages bitmap is queried s/by/by the/ s/Dirty/The dirty/ +per container. All pages pinned by vendor driver through vfio_pin_pages() s/by/by the/ +external API have to be marked as dirty during migration. When there are CPU +writes, CPU dirty page tracking can identify dirtied pages, but any page pinned +by vendor driver can also be written by device. There is currently no device s/by/by the/ (x2) +which has hardware support for dirty page tracking. So all pages which are +pinned by vendor driver are considered as dirty. +Dirty pages are tracked when device is in stop-and-copy phase because if pages +are marked dirty during pre-copy phase and content is transfered from source to +destination, there is no way to know newly dirtied pages from the point they +were copied earlier until device stops. To avoid repeated copy of same content, +pinned pages are marked dirty only during stop-and-copy phase. Let me take a quick stab at rewriting this paragraph (not sure if I understood it correctly): "Dirty pages are tracked when the device is in the stop-and-copy phase. During the pre-copy phase, it is not possible to distinguish a dirty page that has been transferred from the source to the destination from newly dirtied pages, which would lead to repeated copying of the same content. Therefore, pinned pages are only marked dirty during the stop-and-copy phase." ? 
I think above rephrase only talks about repeated copying in pre-copy phase. Used "copied earlier until device stops" to indicate both pre-copy and stop-and-copy till device stops. Now I'm confused, I thought we had abandoned the idea that we can only report pinned pages during stop-and-copy. Doesn't the device needs to expose its dirty memory footprint during the iterative phase regardless of whether that causes repeat copies? If QEMU iterates and sees that all memory is still dirty, it may have transferred more data, but it can actually predict if it can achieve its downtime tolerances. Which is more important, less data transfer or predictability? Thanks, Even if QEMU copies and transfers content of all sys mem pages during pre-copy (worst case with IOMMU backed mdev device when its vendor driver is not smart to pin pages explicitly and all sys mem pages are marked dirty), then also its prediction about downtime tolerance will not be correct, because during stop-and-copy again all pages need to be copied as device can write to any of those pinned pages. I think you're only reiterating my point. If QEMU copies all of guest memory during the iterative phase and each time it sees that all memory is dirty, such as if CPUs or devices (including assigned devices) are dirtying pages as fast as it copies them (or continuously marks them dirty), then QEMU can predict that downtime will require copying all pages. But as of now there is no way to know if device has dirtied pages during iterative phase. This claim doesn't make any sense, pinned pages are considered persistently dirtied, during the iterative phase and while stopped. If instead devices don't mark dirty pages until the VM is stopped, then QEMU might iterate through memory copy and predict a short downtime because not much memory is dirty, only to be surprised that all of memory is suddenly dirty. At that point it's too late, the VM is already stopped, the predicted short downtime takes far longer than expected. 
This is exactly why we made the kernel interface mark pinned pages persistently dirty when it was proposed that we only report pinned pages once. Thanks, Since there is no way to know if device dirtied pages during iterative phase, QEMU should query pinned pages in stop-and-copy phase. As above, I don't believe this is true. Whenever there will be hardware support or some software mechanism to report pages dirtied by device then we will add a capability bit in migration capability and based on that capability bit qemu/user space app should decide to query dirty pages in iterative phase. Yes, we could advertise support for fine granularity dirty page tracking, but I completely disagree that we should consider p
Re: [PATCH v1] docs/devel: Add VFIO device migration documentation
On 11/4/2020 6:15 PM, Alex Williamson wrote: On Wed, 4 Nov 2020 13:25:40 +0530 Kirti Wankhede wrote: On 11/4/2020 1:57 AM, Alex Williamson wrote: On Wed, 4 Nov 2020 01:18:12 +0530 Kirti Wankhede wrote: On 10/30/2020 12:35 AM, Alex Williamson wrote: On Thu, 29 Oct 2020 23:11:16 +0530 Kirti Wankhede wrote: +System memory dirty pages tracking +-- + +A ``log_sync`` memory listener callback is added to mark system memory pages s/is added to mark/marks those/ +as dirty which are used for DMA by VFIO device. Dirty pages bitmap is queried s/by/by the/ s/Dirty/The dirty/ +per container. All pages pinned by vendor driver through vfio_pin_pages() s/by/by the/ +external API have to be marked as dirty during migration. When there are CPU +writes, CPU dirty page tracking can identify dirtied pages, but any page pinned +by vendor driver can also be written by device. There is currently no device s/by/by the/ (x2) +which has hardware support for dirty page tracking. So all pages which are +pinned by vendor driver are considered as dirty. +Dirty pages are tracked when device is in stop-and-copy phase because if pages +are marked dirty during pre-copy phase and content is transfered from source to +destination, there is no way to know newly dirtied pages from the point they +were copied earlier until device stops. To avoid repeated copy of same content, +pinned pages are marked dirty only during stop-and-copy phase. Let me take a quick stab at rewriting this paragraph (not sure if I understood it correctly): "Dirty pages are tracked when the device is in the stop-and-copy phase. During the pre-copy phase, it is not possible to distinguish a dirty page that has been transferred from the source to the destination from newly dirtied pages, which would lead to repeated copying of the same content. Therefore, pinned pages are only marked dirty during the stop-and-copy phase." ? I think above rephrase only talks about repeated copying in pre-copy phase. 
Used "copied earlier until device stops" to indicate both pre-copy and stop-and-copy till device stops. Now I'm confused, I thought we had abandoned the idea that we can only report pinned pages during stop-and-copy. Doesn't the device needs to expose its dirty memory footprint during the iterative phase regardless of whether that causes repeat copies? If QEMU iterates and sees that all memory is still dirty, it may have transferred more data, but it can actually predict if it can achieve its downtime tolerances. Which is more important, less data transfer or predictability? Thanks, Even if QEMU copies and transfers content of all sys mem pages during pre-copy (worst case with IOMMU backed mdev device when its vendor driver is not smart to pin pages explicitly and all sys mem pages are marked dirty), then also its prediction about downtime tolerance will not be correct, because during stop-and-copy again all pages need to be copied as device can write to any of those pinned pages. I think you're only reiterating my point. If QEMU copies all of guest memory during the iterative phase and each time it sees that all memory is dirty, such as if CPUs or devices (including assigned devices) are dirtying pages as fast as it copies them (or continuously marks them dirty), then QEMU can predict that downtime will require copying all pages. But as of now there is no way to know if device has dirtied pages during iterative phase. This claim doesn't make any sense, pinned pages are considered persistently dirtied, during the iterative phase and while stopped. If instead devices don't mark dirty pages until the VM is stopped, then QEMU might iterate through memory copy and predict a short downtime because not much memory is dirty, only to be surprised that all of memory is suddenly dirty. At that point it's too late, the VM is already stopped, the predicted short downtime takes far longer than expected. 
This is exactly why we made the kernel interface mark pinned pages persistently dirty when it was proposed that we only report pinned pages once. Thanks, Since there is no way to know if device dirtied pages during iterative phase, QEMU should query pinned pages in stop-and-copy phase. As above, I don't believe this is true. Whenever there will be hardware support or some software mechanism to report pages dirtied by device then we will add a capability bit in migration capability and based on that capability bit qemu/user space app should decide to query dirty pages in iterative phase. Yes, we could advertise support for fine granularity dirty page tracking, but I completely disagree that we should consider pinned pages clean until suddenly exposing them as dirty once the VM is stopped. Thanks, Should QEMU copy dirtied pages twice, during iterative phase and then when VM is stopped? Thanks, Kirti
Re: [PATCH v1] docs/devel: Add VFIO device migration documentation
On 11/4/2020 1:57 AM, Alex Williamson wrote: On Wed, 4 Nov 2020 01:18:12 +0530 Kirti Wankhede wrote: On 10/30/2020 12:35 AM, Alex Williamson wrote: On Thu, 29 Oct 2020 23:11:16 +0530 Kirti Wankhede wrote: +System memory dirty pages tracking +-- + +A ``log_sync`` memory listener callback is added to mark system memory pages s/is added to mark/marks those/ +as dirty which are used for DMA by VFIO device. Dirty pages bitmap is queried s/by/by the/ s/Dirty/The dirty/ +per container. All pages pinned by vendor driver through vfio_pin_pages() s/by/by the/ +external API have to be marked as dirty during migration. When there are CPU +writes, CPU dirty page tracking can identify dirtied pages, but any page pinned +by vendor driver can also be written by device. There is currently no device s/by/by the/ (x2) +which has hardware support for dirty page tracking. So all pages which are +pinned by vendor driver are considered as dirty. +Dirty pages are tracked when device is in stop-and-copy phase because if pages +are marked dirty during pre-copy phase and content is transfered from source to +destination, there is no way to know newly dirtied pages from the point they +were copied earlier until device stops. To avoid repeated copy of same content, +pinned pages are marked dirty only during stop-and-copy phase. Let me take a quick stab at rewriting this paragraph (not sure if I understood it correctly): "Dirty pages are tracked when the device is in the stop-and-copy phase. During the pre-copy phase, it is not possible to distinguish a dirty page that has been transferred from the source to the destination from newly dirtied pages, which would lead to repeated copying of the same content. Therefore, pinned pages are only marked dirty during the stop-and-copy phase." ? I think above rephrase only talks about repeated copying in pre-copy phase. Used "copied earlier until device stops" to indicate both pre-copy and stop-and-copy till device stops. 
Now I'm confused, I thought we had abandoned the idea that we can only report pinned pages during stop-and-copy. Doesn't the device need to expose its dirty memory footprint during the iterative phase regardless of whether that causes repeat copies? If QEMU iterates and sees that all memory is still dirty, it may have transferred more data, but it can actually predict whether it can achieve its downtime tolerances. Which is more important, less data transfer or predictability? Thanks, Even if QEMU copies and transfers the content of all system memory pages during pre-copy (the worst case with an IOMMU-backed mdev device whose vendor driver is not smart enough to pin pages explicitly, so all system memory pages are marked dirty), its prediction about downtime tolerance will still not be correct, because during stop-and-copy all pages need to be copied again, since the device can write to any of those pinned pages. I think you're only reiterating my point. If QEMU copies all of guest memory during the iterative phase and each time it sees that all memory is dirty, such as if CPUs or devices (including assigned devices) are dirtying pages as fast as it copies them (or continuously marks them dirty), then QEMU can predict that downtime will require copying all pages. But as of now there is no way to know if the device has dirtied pages during the iterative phase. If instead devices don't mark dirty pages until the VM is stopped, then QEMU might iterate through memory copy and predict a short downtime because not much memory is dirty, only to be surprised that all of memory is suddenly dirty. At that point it's too late, the VM is already stopped, the predicted short downtime takes far longer than expected. This is exactly why we made the kernel interface mark pinned pages persistently dirty when it was proposed that we only report pinned pages once. Thanks, Since there is no way to know if the device dirtied pages during the iterative phase, QEMU should query pinned pages in the stop-and-copy phase.
Whenever hardware support or some software mechanism to report pages dirtied by the device becomes available, we will add a capability bit to the migration capabilities; based on that capability bit, QEMU or the user-space app should decide whether to query dirty pages in the iterative phase. Thanks, Kirti
Re: [PATCH v1] docs/devel: Add VFIO device migration documentation
On 10/30/2020 12:35 AM, Alex Williamson wrote: On Thu, 29 Oct 2020 23:11:16 +0530 Kirti Wankhede wrote: +System memory dirty pages tracking +-- + +A ``log_sync`` memory listener callback is added to mark system memory pages s/is added to mark/marks those/ +as dirty which are used for DMA by VFIO device. Dirty pages bitmap is queried s/by/by the/ s/Dirty/The dirty/ +per container. All pages pinned by vendor driver through vfio_pin_pages() s/by/by the/ +external API have to be marked as dirty during migration. When there are CPU +writes, CPU dirty page tracking can identify dirtied pages, but any page pinned +by vendor driver can also be written by device. There is currently no device s/by/by the/ (x2) +which has hardware support for dirty page tracking. So all pages which are +pinned by vendor driver are considered as dirty. +Dirty pages are tracked when device is in stop-and-copy phase because if pages +are marked dirty during pre-copy phase and content is transfered from source to +destination, there is no way to know newly dirtied pages from the point they +were copied earlier until device stops. To avoid repeated copy of same content, +pinned pages are marked dirty only during stop-and-copy phase. Let me take a quick stab at rewriting this paragraph (not sure if I understood it correctly): "Dirty pages are tracked when the device is in the stop-and-copy phase. During the pre-copy phase, it is not possible to distinguish a dirty page that has been transferred from the source to the destination from newly dirtied pages, which would lead to repeated copying of the same content. Therefore, pinned pages are only marked dirty during the stop-and-copy phase." ? I think above rephrase only talks about repeated copying in pre-copy phase. Used "copied earlier until device stops" to indicate both pre-copy and stop-and-copy till device stops. Now I'm confused, I thought we had abandoned the idea that we can only report pinned pages during stop-and-copy. 
Doesn't the device need to expose its dirty memory footprint during the iterative phase regardless of whether that causes repeat copies? If QEMU iterates and sees that all memory is still dirty, it may have transferred more data, but it can actually predict whether it can achieve its downtime tolerances. Which is more important, less data transfer or predictability? Thanks, Even if QEMU copies and transfers the content of all system memory pages during pre-copy (the worst case with an IOMMU-backed mdev device whose vendor driver is not smart enough to pin pages explicitly, so all system memory pages are marked dirty), its prediction about downtime tolerance will still not be correct, because during stop-and-copy all pages need to be copied again, since the device can write to any of those pinned pages. Thanks, Kirti
Re: Out-of-Process Device Emulation session at KVM Forum 2020
On 10/29/2020 10:12 PM, Daniel P. Berrangé wrote: On Thu, Oct 29, 2020 at 04:15:30PM +, David Edmondson wrote: On Thursday, 2020-10-29 at 21:02:05 +08, Jason Wang wrote: 2) Did qemu even try to migrate opaque blobs before? It's probably a bad design of migration protocol as well. The TPM emulator backend migrates blobs that are only understood by swtpm. The separate slirp-helper net backend does the same too IIUC When system memory pages are marked dirty and their content is copied to the destination, that content is also opaque to QEMU. Thanks, Kirti
Re: [PATCH v1] docs/devel: Add VFIO device migration documentation
Thanks for corrections Cornelia. I had done the corrections you suggested I had not replied, see my comments on couple of places where I disagree. On 10/29/2020 5:22 PM, Cornelia Huck wrote: On Thu, 29 Oct 2020 11:23:11 +0530 Kirti Wankhede wrote: Document interfaces used for VFIO device migration. Added flow of state changes during live migration with VFIO device. Signed-off-by: Kirti Wankhede --- MAINTAINERS | 1 + docs/devel/vfio-migration.rst | 119 ++ You probably want to include this into the Developer's Guide via index.rst. Ok. 2 files changed, 120 insertions(+) create mode 100644 docs/devel/vfio-migration.rst diff --git a/MAINTAINERS b/MAINTAINERS index 6a197bd358d6..6f3fcffc6b3d 100644 --- a/MAINTAINERS +++ b/MAINTAINERS @@ -1728,6 +1728,7 @@ M: Alex Williamson S: Supported F: hw/vfio/* F: include/hw/vfio/ +F: docs/devel/vfio-migration.rst vfio-ccw M: Cornelia Huck diff --git a/docs/devel/vfio-migration.rst b/docs/devel/vfio-migration.rst new file mode 100644 index ..dab9127825e4 --- /dev/null +++ b/docs/devel/vfio-migration.rst @@ -0,0 +1,119 @@ += +VFIO device Migration += + +VFIO devices use iterative approach for migration because certain VFIO devices s/use/use an/ ? +(e.g. GPU) have large amount of data to be transfered. The iterative pre-copy +phase of migration allows for the guest to continue whilst the VFIO device state +is transferred to destination, this helps to reduce the total downtime of the s/to destination,/to the destination;/ +VM. VFIO devices can choose to skip the pre-copy phase of migration by returning +pending_bytes as zero during pre-copy phase. s/during/during the/ + +Detailed description of UAPI for VFIO device for migration is in the comment +above ``vfio_device_migration_info`` structure definition in header file +linux-headers/linux/vfio.h. I think I'd copy that to this file. If I'm looking at the documentation, I'd rather not go hunting for source code to find out what structure you are talking about. 
Plus, as it's UAPI, I don't expect it to change much, so it should be easy to keep the definitions in sync (famous last words). I feel its duplication of documentation. I would like to know others views as well. + +VFIO device hooks for iterative approach: +- A ``save_setup`` function that setup migration region, sets _SAVING flag in s/setup/sets up the/ s/in/in the/ +VFIO device state and inform VFIO IOMMU module to start dirty page tracking. s/inform/informs the/ + +- A ``load_setup`` function that setup migration region on the destination and s/setup/sets up the/ +sets _RESUMING flag in VFIO device state. s/in/in the/ + +- A ``save_live_pending`` function that reads pending_bytes from vendor driver +that indicate how much more data the vendor driver yet to save for the VFIO +device. "A ``save_live_pending`` function that reads pending_bytes from the vendor driver, which indicates the amount of data that the vendor driver has yet to save for the VFIO device." ? + +- A ``save_live_iterate`` function that reads VFIO device's data from vendor s/reads/reads the/ s/from/from the/ +driver through migration region during iterative phase. s/through/through the/ + +- A ``save_live_complete_precopy`` function that resets _RUNNING flag from VFIO s/from/from the/ +device state, saves device config space, if any, and iteratively copies s/saves/saves the/ +remaining data for VFIO device till pending_bytes returned by vendor driver +is zero. "...and interactively copies the remaining data for the VFIO device until the vendor driver indicates that no data remains (pending_bytes is zero)." ? + +- A ``load_state`` function loads config section and data sections generated by +above save functions. "A ``load_state`` function that loads the config section and the data sections that are generated by the save functions above." ? + +- ``cleanup`` functions for both save and load that unmap migration region. 
..."that perform any migration-related cleanup, including unmapping the migration region." ? + +VM state change handler is registered to change VFIO device state based on VM +state change. "A VM state change handler is registered to change the VFIO device state when the VM state changes." ? + +Similarly, a migration state change notifier is added to get a notification on s/added/registered/ ? +migration state change. These states are translated to VFIO device state and +conveyed to vendor driver. + +System memory dirty pages tracking +-- + +A ``log_sync`` memory listener callback is added to mark system memory pages s/is added to mark/marks those/ +as dirty which are used for DMA by VFIO device. Dirty pages bitmap is queried s/by/b
[PATCH v1] docs/devel: Add VFIO device migration documentation
Document interfaces used for VFIO device migration. Added flow of state changes during live migration with VFIO device. Signed-off-by: Kirti Wankhede --- MAINTAINERS | 1 + docs/devel/vfio-migration.rst | 119 ++ 2 files changed, 120 insertions(+) create mode 100644 docs/devel/vfio-migration.rst diff --git a/MAINTAINERS b/MAINTAINERS index 6a197bd358d6..6f3fcffc6b3d 100644 --- a/MAINTAINERS +++ b/MAINTAINERS @@ -1728,6 +1728,7 @@ M: Alex Williamson S: Supported F: hw/vfio/* F: include/hw/vfio/ +F: docs/devel/vfio-migration.rst vfio-ccw M: Cornelia Huck diff --git a/docs/devel/vfio-migration.rst b/docs/devel/vfio-migration.rst new file mode 100644 index ..dab9127825e4 --- /dev/null +++ b/docs/devel/vfio-migration.rst @@ -0,0 +1,119 @@ += +VFIO device Migration += + +VFIO devices use iterative approach for migration because certain VFIO devices +(e.g. GPU) have large amount of data to be transfered. The iterative pre-copy +phase of migration allows for the guest to continue whilst the VFIO device state +is transferred to destination, this helps to reduce the total downtime of the +VM. VFIO devices can choose to skip the pre-copy phase of migration by returning +pending_bytes as zero during pre-copy phase. + +Detailed description of UAPI for VFIO device for migration is in the comment +above ``vfio_device_migration_info`` structure definition in header file +linux-headers/linux/vfio.h. + +VFIO device hooks for iterative approach: +- A ``save_setup`` function that setup migration region, sets _SAVING flag in +VFIO device state and inform VFIO IOMMU module to start dirty page tracking. + +- A ``load_setup`` function that setup migration region on the destination and +sets _RESUMING flag in VFIO device state. + +- A ``save_live_pending`` function that reads pending_bytes from vendor driver +that indicate how much more data the vendor driver yet to save for the VFIO +device. 
+ +- A ``save_live_iterate`` function that reads VFIO device's data from vendor +driver through migration region during iterative phase. + +- A ``save_live_complete_precopy`` function that resets _RUNNING flag from VFIO +device state, saves device config space, if any, and iteratively copies +remaining data for VFIO device till pending_bytes returned by vendor driver +is zero. + +- A ``load_state`` function loads config section and data sections generated by +above save functions. + +- ``cleanup`` functions for both save and load that unmap migration region. + +VM state change handler is registered to change VFIO device state based on VM +state change. + +Similarly, a migration state change notifier is added to get a notification on +migration state change. These states are translated to VFIO device state and +conveyed to vendor driver. + +System memory dirty pages tracking +-- + +A ``log_sync`` memory listener callback is added to mark system memory pages +as dirty which are used for DMA by VFIO device. Dirty pages bitmap is queried +per container. All pages pinned by vendor driver through vfio_pin_pages() +external API have to be marked as dirty during migration. When there are CPU +writes, CPU dirty page tracking can identify dirtied pages, but any page pinned +by vendor driver can also be written by device. There is currently no device +which has hardware support for dirty page tracking. So all pages which are +pinned by vendor driver are considered as dirty. +Dirty pages are tracked when device is in stop-and-copy phase because if pages +are marked dirty during pre-copy phase and content is transfered from source to +destination, there is no way to know newly dirtied pages from the point they +were copied earlier until device stops. To avoid repeated copy of same content, +pinned pages are marked dirty only during stop-and-copy phase. 
+ +System memory dirty pages tracking when vIOMMU is enabled +- +With vIOMMU, IO virtual address range can get unmapped while in pre-copy phase +of migration. In that case, unmap ioctl returns pages pinned in that range and +QEMU reports corresponding guest physical pages dirty. +During stop-and-copy phase, an IOMMU notifier is used to get a callback for +mapped pages and then dirty pages bitmap is fetched from VFIO IOMMU modules for +those mapped ranges. + +Flow of state changes during Live migration +=== +Below is the flow of state change during live migration where states in brackets +represent VM state, migration state and VFIO device state as: +(VM state, MIGRATION_STATUS, VFIO_DEVICE_STATE) + +Live migration save path + +QEMU normal running state +(RUNNING, _NONE, _RUNNING) +| + migrate_init s
Re: [PATCH v29 05/17] vfio: Add VM state change handler to know state of VM
On 10/26/2020 7:32 PM, Alex Williamson wrote: On Mon, 26 Oct 2020 19:18:51 +0530 Kirti Wankhede wrote: On 10/26/2020 6:30 PM, Alex Williamson wrote: On Mon, 26 Oct 2020 15:06:15 +0530 Kirti Wankhede wrote: VM state change handler is called on change in VM's state. Based on VM state, VFIO device state should be changed. Added read/write helper functions for migration region. Added function to set device_state. Signed-off-by: Kirti Wankhede Reviewed-by: Neo Jia Reviewed-by: Dr. David Alan Gilbert Reviewed-by: Cornelia Huck --- hw/vfio/migration.c | 158 ++ hw/vfio/trace-events | 2 + include/hw/vfio/vfio-common.h | 4 ++ 3 files changed, 164 insertions(+) diff --git a/hw/vfio/migration.c b/hw/vfio/migration.c index fd7faf423cdc..65ce735d667b 100644 --- a/hw/vfio/migration.c +++ b/hw/vfio/migration.c [snip] @@ -64,6 +216,9 @@ static int vfio_migration_init(VFIODevice *vbasedev, ret = -EINVAL; goto err; } + +migration->vm_state = qemu_add_vm_change_state_handler(vfio_vmstate_change, + vbasedev); return 0; err: Fails to build, @migration is not defined. We could use vbasedev->migration or pull defining and setting @migration from patch 06. Thanks, Pulling and setting migration from patch 06 seems better option. Should I resend patch 5 & 6 only? 
I've resolved this locally as patch 05: @@ -38,6 +190,7 @@ static int vfio_migration_init(VFIODevice *vbasedev, { int ret; Object *obj; +VFIOMigration *migration; if (!vbasedev->ops->vfio_get_object) { return -EINVAL; @@ -64,6 +217,10 @@ static int vfio_migration_init(VFIODevice *vbasedev, ret = -EINVAL; goto err; } + +migration = vbasedev->migration; +migration->vm_state = qemu_add_vm_change_state_handler(vfio_vmstate_change, + vbasedev); return 0; err: patch 06: @@ -219,8 +243,11 @@ static int vfio_migration_init(VFIODevice *vbasedev, } migration = vbasedev->migration; +migration->vbasedev = vbasedev; migration->vm_state = qemu_add_vm_change_state_handler(vfio_vmstate_change, vbasedev); +migration->migration_state.notify = vfio_migration_state_notifier; +add_migration_state_change_notifier(&migration->migration_state); return 0; err: If you're satisfied with that, no need to resend. Thanks, Yes, this is exactly I was going to send. Thanks for fixing it. Thanks, Kirti
Re: [PATCH v29 05/17] vfio: Add VM state change handler to know state of VM
On 10/26/2020 6:30 PM, Alex Williamson wrote: On Mon, 26 Oct 2020 15:06:15 +0530 Kirti Wankhede wrote: VM state change handler is called on change in VM's state. Based on VM state, VFIO device state should be changed. Added read/write helper functions for migration region. Added function to set device_state. Signed-off-by: Kirti Wankhede Reviewed-by: Neo Jia Reviewed-by: Dr. David Alan Gilbert Reviewed-by: Cornelia Huck --- hw/vfio/migration.c | 158 ++ hw/vfio/trace-events | 2 + include/hw/vfio/vfio-common.h | 4 ++ 3 files changed, 164 insertions(+) diff --git a/hw/vfio/migration.c b/hw/vfio/migration.c index fd7faf423cdc..65ce735d667b 100644 --- a/hw/vfio/migration.c +++ b/hw/vfio/migration.c [snip] @@ -64,6 +216,9 @@ static int vfio_migration_init(VFIODevice *vbasedev, ret = -EINVAL; goto err; } + +migration->vm_state = qemu_add_vm_change_state_handler(vfio_vmstate_change, + vbasedev); return 0; err: Fails to build, @migration is not defined. We could use vbasedev->migration or pull defining and setting @migration from patch 06. Thanks, Pulling and setting migration from patch 06 seems better option. Should I resend patch 5 & 6 only? Thanks, Kirti
[PATCH v29 14/17] vfio: Dirty page tracking when vIOMMU is enabled
When vIOMMU is enabled, register MAP notifier from log_sync when all devices in container are in stop and copy phase of migration. Call replay and get dirty pages from notifier callback. Suggested-by: Alex Williamson Signed-off-by: Kirti Wankhede Reviewed-by: Yan Zhao --- hw/vfio/common.c | 88 hw/vfio/trace-events | 1 + 2 files changed, 83 insertions(+), 6 deletions(-) diff --git a/hw/vfio/common.c b/hw/vfio/common.c index 2634387df948..c0b5b6245a47 100644 --- a/hw/vfio/common.c +++ b/hw/vfio/common.c @@ -442,8 +442,8 @@ static bool vfio_listener_skipped_section(MemoryRegionSection *section) } /* Called with rcu_read_lock held. */ -static bool vfio_get_vaddr(IOMMUTLBEntry *iotlb, void **vaddr, - bool *read_only) +static bool vfio_get_xlat_addr(IOMMUTLBEntry *iotlb, void **vaddr, + ram_addr_t *ram_addr, bool *read_only) { MemoryRegion *mr; hwaddr xlat; @@ -474,8 +474,17 @@ static bool vfio_get_vaddr(IOMMUTLBEntry *iotlb, void **vaddr, return false; } -*vaddr = memory_region_get_ram_ptr(mr) + xlat; -*read_only = !writable || mr->readonly; +if (vaddr) { +*vaddr = memory_region_get_ram_ptr(mr) + xlat; +} + +if (ram_addr) { +*ram_addr = memory_region_get_ram_addr(mr) + xlat; +} + +if (read_only) { +*read_only = !writable || mr->readonly; +} return true; } @@ -485,7 +494,6 @@ static void vfio_iommu_map_notify(IOMMUNotifier *n, IOMMUTLBEntry *iotlb) VFIOGuestIOMMU *giommu = container_of(n, VFIOGuestIOMMU, n); VFIOContainer *container = giommu->container; hwaddr iova = iotlb->iova + giommu->iommu_offset; -bool read_only; void *vaddr; int ret; @@ -501,7 +509,9 @@ static void vfio_iommu_map_notify(IOMMUNotifier *n, IOMMUTLBEntry *iotlb) rcu_read_lock(); if ((iotlb->perm & IOMMU_RW) != IOMMU_NONE) { -if (!vfio_get_vaddr(iotlb, &vaddr, &read_only)) { +bool read_only; + +if (!vfio_get_xlat_addr(iotlb, &vaddr, NULL, &read_only)) { goto out; } /* @@ -899,11 +909,77 @@ err_out: return ret; } +typedef struct { +IOMMUNotifier n; +VFIOGuestIOMMU *giommu; +} 
vfio_giommu_dirty_notifier; + +static void vfio_iommu_map_dirty_notify(IOMMUNotifier *n, IOMMUTLBEntry *iotlb) +{ +vfio_giommu_dirty_notifier *gdn = container_of(n, +vfio_giommu_dirty_notifier, n); +VFIOGuestIOMMU *giommu = gdn->giommu; +VFIOContainer *container = giommu->container; +hwaddr iova = iotlb->iova + giommu->iommu_offset; +ram_addr_t translated_addr; + +trace_vfio_iommu_map_dirty_notify(iova, iova + iotlb->addr_mask); + +if (iotlb->target_as != &address_space_memory) { +error_report("Wrong target AS \"%s\", only system memory is allowed", + iotlb->target_as->name ? iotlb->target_as->name : "none"); +return; +} + +rcu_read_lock(); +if (vfio_get_xlat_addr(iotlb, NULL, &translated_addr, NULL)) { +int ret; + +ret = vfio_get_dirty_bitmap(container, iova, iotlb->addr_mask + 1, +translated_addr); +if (ret) { +error_report("vfio_iommu_map_dirty_notify(%p, 0x%"HWADDR_PRIx", " + "0x%"HWADDR_PRIx") = %d (%m)", + container, iova, + iotlb->addr_mask + 1, ret); +} +} +rcu_read_unlock(); +} + static int vfio_sync_dirty_bitmap(VFIOContainer *container, MemoryRegionSection *section) { ram_addr_t ram_addr; +if (memory_region_is_iommu(section->mr)) { +VFIOGuestIOMMU *giommu; + +QLIST_FOREACH(giommu, &container->giommu_list, giommu_next) { +if (MEMORY_REGION(giommu->iommu) == section->mr && +giommu->n.start == section->offset_within_region) { +Int128 llend; +vfio_giommu_dirty_notifier gdn = { .giommu = giommu }; +int idx = memory_region_iommu_attrs_to_index(giommu->iommu, + MEMTXATTRS_UNSPECIFIED); + +llend = int128_add(int128_make64(section->offset_within_region), + section->size); +llend = int128_sub(llend, int128_one()); + +iommu_notifier_init(&gdn.n, +vfio_iommu_map_dirty_notify, +IOMMU_NOTIFIER_MAP, +section->offset_within_region, +int128_get64(llend), +
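The vfio_giommu_dirty_notifier wrapper above exists so the callback can recover the VFIOGuestIOMMU from the bare IOMMUNotifier pointer the memory core hands it. A minimal sketch of that container_of pattern, using simplified stand-ins for the QEMU types (the struct shapes here are illustrative, not the real definitions):

```c
#include <assert.h>
#include <stddef.h>

/* Simplified stand-ins for the QEMU types (illustrative only). */
typedef struct IOMMUNotifier { int start; int end; } IOMMUNotifier;
typedef struct VFIOGuestIOMMU { int id; } VFIOGuestIOMMU;

/* The wrapper from the patch: a notifier plus a back-pointer to the giommu. */
typedef struct {
    IOMMUNotifier n;
    VFIOGuestIOMMU *giommu;
} vfio_giommu_dirty_notifier;

/* QEMU's container_of: step back from the embedded member to the wrapper. */
#define container_of(ptr, type, member) \
    ((type *)((char *)(ptr) - offsetof(type, member)))

/* What vfio_iommu_map_dirty_notify() does on entry: map the IOMMUNotifier*
 * it receives back to the wrapper to reach the VFIOGuestIOMMU. */
static VFIOGuestIOMMU *giommu_from_notifier(IOMMUNotifier *n)
{
    vfio_giommu_dirty_notifier *gdn =
        container_of(n, vfio_giommu_dirty_notifier, n);
    return gdn->giommu;
}

static int demo(void)
{
    VFIOGuestIOMMU g = { 42 };
    vfio_giommu_dirty_notifier gdn = { .giommu = &g };
    return giommu_from_notifier(&gdn.n)->id;
}
```

Because the wrapper lives on the stack of vfio_sync_dirty_bitmap(), the notifier is only valid for the duration of the replay, which is fine since replay is synchronous.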
[PATCH v29 11/17] vfio: Get migration capability flags for container
Added helper functions to get IOMMU info capability chain. Added function to get migration capability information from that capability chain for IOMMU container. Similar change was proposed earlier: https://lists.gnu.org/archive/html/qemu-devel/2018-05/msg03759.html Disable migration for devices if IOMMU module doesn't support migration capability. Signed-off-by: Kirti Wankhede Cc: Shameer Kolothum Cc: Eric Auger --- hw/vfio/common.c | 90 +++ hw/vfio/migration.c | 7 +++- include/hw/vfio/vfio-common.h | 3 ++ 3 files changed, 91 insertions(+), 9 deletions(-) diff --git a/hw/vfio/common.c b/hw/vfio/common.c index c6e98b8d61be..d4959c036dd1 100644 --- a/hw/vfio/common.c +++ b/hw/vfio/common.c @@ -1228,6 +1228,75 @@ static int vfio_init_container(VFIOContainer *container, int group_fd, return 0; } +static int vfio_get_iommu_info(VFIOContainer *container, + struct vfio_iommu_type1_info **info) +{ + +size_t argsz = sizeof(struct vfio_iommu_type1_info); + +*info = g_new0(struct vfio_iommu_type1_info, 1); +again: +(*info)->argsz = argsz; + +if (ioctl(container->fd, VFIO_IOMMU_GET_INFO, *info)) { +g_free(*info); +*info = NULL; +return -errno; +} + +if (((*info)->argsz > argsz)) { +argsz = (*info)->argsz; +*info = g_realloc(*info, argsz); +goto again; +} + +return 0; +} + +static struct vfio_info_cap_header * +vfio_get_iommu_info_cap(struct vfio_iommu_type1_info *info, uint16_t id) +{ +struct vfio_info_cap_header *hdr; +void *ptr = info; + +if (!(info->flags & VFIO_IOMMU_INFO_CAPS)) { +return NULL; +} + +for (hdr = ptr + info->cap_offset; hdr != ptr; hdr = ptr + hdr->next) { +if (hdr->id == id) { +return hdr; +} +} + +return NULL; +} + +static void vfio_get_iommu_info_migration(VFIOContainer *container, + struct vfio_iommu_type1_info *info) +{ +struct vfio_info_cap_header *hdr; +struct vfio_iommu_type1_info_cap_migration *cap_mig; + +hdr = vfio_get_iommu_info_cap(info, VFIO_IOMMU_TYPE1_INFO_CAP_MIGRATION); +if (!hdr) { +return; +} + +cap_mig = container_of(hdr, struct 
vfio_iommu_type1_info_cap_migration, +header); + +/* + * cpu_physical_memory_set_dirty_lebitmap() expects pages in bitmap of + * TARGET_PAGE_SIZE to mark those dirty. + */ +if (cap_mig->pgsize_bitmap & TARGET_PAGE_SIZE) { +container->dirty_pages_supported = true; +container->max_dirty_bitmap_size = cap_mig->max_dirty_bitmap_size; +container->dirty_pgsizes = cap_mig->pgsize_bitmap; +} +} + static int vfio_connect_container(VFIOGroup *group, AddressSpace *as, Error **errp) { @@ -1297,6 +1366,7 @@ static int vfio_connect_container(VFIOGroup *group, AddressSpace *as, container->space = space; container->fd = fd; container->error = NULL; +container->dirty_pages_supported = false; QLIST_INIT(&container->giommu_list); QLIST_INIT(&container->hostwin_list); @@ -1309,7 +1379,7 @@ static int vfio_connect_container(VFIOGroup *group, AddressSpace *as, case VFIO_TYPE1v2_IOMMU: case VFIO_TYPE1_IOMMU: { -struct vfio_iommu_type1_info info; +struct vfio_iommu_type1_info *info; /* * FIXME: This assumes that a Type1 IOMMU can map any 64-bit @@ -1318,15 +1388,19 @@ static int vfio_connect_container(VFIOGroup *group, AddressSpace *as, * existing Type1 IOMMUs generally support any IOVA we're * going to actually try in practice. 
*/ -info.argsz = sizeof(info); -ret = ioctl(fd, VFIO_IOMMU_GET_INFO, &info); -/* Ignore errors */ -if (ret || !(info.flags & VFIO_IOMMU_INFO_PGSIZES)) { +ret = vfio_get_iommu_info(container, &info); + +if (ret || !(info->flags & VFIO_IOMMU_INFO_PGSIZES)) { /* Assume 4k IOVA page size */ -info.iova_pgsizes = 4096; +info->iova_pgsizes = 4096; } -vfio_host_win_add(container, 0, (hwaddr)-1, info.iova_pgsizes); -container->pgsizes = info.iova_pgsizes; +vfio_host_win_add(container, 0, (hwaddr)-1, info->iova_pgsizes); +container->pgsizes = info->iova_pgsizes; + +if (!ret) { +vfio_get_iommu_info_migration(container, info); +} +g_free(info); break; } case VFIO_SPAPR_TCE_v2_IOMMU: diff --git a/hw/vfio/migration.c b/hw/vfio/migration.c index 6ac72b46a88b..93f8fe7bd869 100644 --- a/hw/vfio/migration.c +++ b/hw/vfio/migration.c @@ -832,9 +832,14 @@ err: int vfio_migration_probe(VFIODevice *vbasedev, Error *
[PATCH v29 17/17] qapi: Add VFIO devices migration stats in Migration stats
Added amount of bytes transferred to the VM at destination by all VFIO devices Signed-off-by: Kirti Wankhede Reviewed-by: Dr. David Alan Gilbert --- hw/vfio/common.c | 19 +++ hw/vfio/migration.c | 9 + include/hw/vfio/vfio-common.h | 3 +++ migration/migration.c | 17 + monitor/hmp-cmds.c| 6 ++ qapi/migration.json | 17 + 6 files changed, 71 insertions(+) diff --git a/hw/vfio/common.c b/hw/vfio/common.c index 49c68a5253ae..56f6fee66a55 100644 --- a/hw/vfio/common.c +++ b/hw/vfio/common.c @@ -292,6 +292,25 @@ const MemoryRegionOps vfio_region_ops = { * Device state interfaces */ +bool vfio_mig_active(void) +{ +VFIOGroup *group; +VFIODevice *vbasedev; + +if (QLIST_EMPTY(&vfio_group_list)) { +return false; +} + +QLIST_FOREACH(group, &vfio_group_list, next) { +QLIST_FOREACH(vbasedev, &group->device_list, next) { +if (vbasedev->migration_blocker) { +return false; +} +} +} +return true; +} + static bool vfio_devices_all_stopped_and_saving(VFIOContainer *container) { VFIOGroup *group; diff --git a/hw/vfio/migration.c b/hw/vfio/migration.c index ffedbcca179d..2d657289c68e 100644 --- a/hw/vfio/migration.c +++ b/hw/vfio/migration.c @@ -45,6 +45,8 @@ #define VFIO_MIG_FLAG_DEV_SETUP_STATE (0xef13ULL) #define VFIO_MIG_FLAG_DEV_DATA_STATE(0xef14ULL) +static int64_t bytes_transferred; + static inline int vfio_mig_access(VFIODevice *vbasedev, void *val, int count, off_t off, bool iswrite) { @@ -255,6 +257,7 @@ static int vfio_save_buffer(QEMUFile *f, VFIODevice *vbasedev, uint64_t *size) *size = data_size; } +bytes_transferred += data_size; return ret; } @@ -785,6 +788,7 @@ static void vfio_migration_state_notifier(Notifier *notifier, void *data) case MIGRATION_STATUS_CANCELLING: case MIGRATION_STATUS_CANCELLED: case MIGRATION_STATUS_FAILED: +bytes_transferred = 0; ret = vfio_migration_set_state(vbasedev, ~(VFIO_DEVICE_STATE_SAVING | VFIO_DEVICE_STATE_RESUMING), VFIO_DEVICE_STATE_RUNNING); @@ -866,6 +870,11 @@ err: /* -- */ +int64_t vfio_mig_bytes_transferred(void) +{ +return 
bytes_transferred; +} + int vfio_migration_probe(VFIODevice *vbasedev, Error **errp) { VFIOContainer *container = vbasedev->group->container; diff --git a/include/hw/vfio/vfio-common.h b/include/hw/vfio/vfio-common.h index b1c1b18fd228..24e299d97425 100644 --- a/include/hw/vfio/vfio-common.h +++ b/include/hw/vfio/vfio-common.h @@ -203,6 +203,9 @@ extern const MemoryRegionOps vfio_region_ops; typedef QLIST_HEAD(VFIOGroupList, VFIOGroup) VFIOGroupList; extern VFIOGroupList vfio_group_list; +bool vfio_mig_active(void); +int64_t vfio_mig_bytes_transferred(void); + #ifdef CONFIG_LINUX int vfio_get_region_info(VFIODevice *vbasedev, int index, struct vfio_region_info **info); diff --git a/migration/migration.c b/migration/migration.c index 0575ecb37953..995ccd96a774 100644 --- a/migration/migration.c +++ b/migration/migration.c @@ -57,6 +57,10 @@ #include "qemu/queue.h" #include "multifd.h" +#ifdef CONFIG_VFIO +#include "hw/vfio/vfio-common.h" +#endif + #define MAX_THROTTLE (128 << 20) /* Migration transfer speed throttling */ /* Amount of time to allocate to each "chunk" of bandwidth-throttled @@ -1002,6 +1006,17 @@ static void populate_disk_info(MigrationInfo *info) } } +static void populate_vfio_info(MigrationInfo *info) +{ +#ifdef CONFIG_VFIO +if (vfio_mig_active()) { +info->has_vfio = true; +info->vfio = g_malloc0(sizeof(*info->vfio)); +info->vfio->transferred = vfio_mig_bytes_transferred(); +} +#endif +} + static void fill_source_migration_info(MigrationInfo *info) { MigrationState *s = migrate_get_current(); @@ -1026,6 +1041,7 @@ static void fill_source_migration_info(MigrationInfo *info) populate_time_info(info, s); populate_ram_info(info, s); populate_disk_info(info); +populate_vfio_info(info); break; case MIGRATION_STATUS_COLO: info->has_status = true; @@ -1034,6 +1050,7 @@ static void fill_source_migration_info(MigrationInfo *info) case MIGRATION_STATUS_COMPLETED: populate_time_info(info, s); populate_ram_info(info, s); +populate_vfio_info(info); break; case 
MIGRATION_STATUS_FAILED: info->has_status = true; diff --git a/monitor/hmp-cmds.c b/monitor/hmp-cmds.c index 9789f4277f50..56e9bad33d94 100644 --- a/monitor/hmp-cmds.c +++ b/monitor/hmp-cmds.c @@ -357,6 +357,12 @@
[PATCH v29 09/17] vfio: Add load state functions to SaveVMHandlers
Sequence during _RESUMING device state: While data for this device is available, repeat below steps: a. read data_offset from where user application should write data. b. write data of data_size to migration region from data_offset. c. write data_size which indicates vendor driver that data is written in staging buffer. For user, data is opaque. User should write data in the same order as received. Signed-off-by: Kirti Wankhede Reviewed-by: Neo Jia Reviewed-by: Dr. David Alan Gilbert Reviewed-by: Yan Zhao --- hw/vfio/migration.c | 195 +++ hw/vfio/trace-events | 4 ++ 2 files changed, 199 insertions(+) diff --git a/hw/vfio/migration.c b/hw/vfio/migration.c index 41d568558479..6ac72b46a88b 100644 --- a/hw/vfio/migration.c +++ b/hw/vfio/migration.c @@ -257,6 +257,77 @@ static int vfio_save_buffer(QEMUFile *f, VFIODevice *vbasedev, uint64_t *size) return ret; } +static int vfio_load_buffer(QEMUFile *f, VFIODevice *vbasedev, +uint64_t data_size) +{ +VFIORegion *region = &vbasedev->migration->region; +uint64_t data_offset = 0, size, report_size; +int ret; + +do { +ret = vfio_mig_read(vbasedev, &data_offset, sizeof(data_offset), + region->fd_offset + VFIO_MIG_STRUCT_OFFSET(data_offset)); +if (ret < 0) { +return ret; +} + +if (data_offset + data_size > region->size) { +/* + * If data_size is greater than the data section of migration region + * then iterate the write buffer operation. This case can occur if + * size of migration region at destination is smaller than size of + * migration region at source. 
+ */ +report_size = size = region->size - data_offset; +data_size -= size; +} else { +report_size = size = data_size; +data_size = 0; +} + +trace_vfio_load_state_device_data(vbasedev->name, data_offset, size); + +while (size) { +void *buf; +uint64_t sec_size; +bool buf_alloc = false; + +buf = get_data_section_size(region, data_offset, size, &sec_size); + +if (!buf) { +buf = g_try_malloc(sec_size); +if (!buf) { +error_report("%s: Error allocating buffer ", __func__); +return -ENOMEM; +} +buf_alloc = true; +} + +qemu_get_buffer(f, buf, sec_size); + +if (buf_alloc) { +ret = vfio_mig_write(vbasedev, buf, sec_size, +region->fd_offset + data_offset); +g_free(buf); + +if (ret < 0) { +return ret; +} +} +size -= sec_size; +data_offset += sec_size; +} + +ret = vfio_mig_write(vbasedev, &report_size, sizeof(report_size), +region->fd_offset + VFIO_MIG_STRUCT_OFFSET(data_size)); +if (ret < 0) { +return ret; +} +} while (data_size); + +return 0; +} + static int vfio_update_pending(VFIODevice *vbasedev) { VFIOMigration *migration = vbasedev->migration; @@ -293,6 +364,33 @@ static int vfio_save_device_config_state(QEMUFile *f, void *opaque) return qemu_file_get_error(f); } +static int vfio_load_device_config_state(QEMUFile *f, void *opaque) +{ +VFIODevice *vbasedev = opaque; +uint64_t data; + +if (vbasedev->ops && vbasedev->ops->vfio_load_config) { +int ret; + +ret = vbasedev->ops->vfio_load_config(vbasedev, f); +if (ret) { +error_report("%s: Failed to load device config space", + vbasedev->name); +return ret; +} +} + +data = qemu_get_be64(f); +if (data != VFIO_MIG_FLAG_END_OF_STATE) { +error_report("%s: Failed loading device config space, " + "end flag incorrect 0x%"PRIx64, vbasedev->name, data); +return -EINVAL; +} + +trace_vfio_load_device_config_state(vbasedev->name); +return qemu_file_get_error(f); +} + static void vfio_migration_cleanup(VFIODevice *vbasedev) { VFIOMigration *migration = vbasedev->migration; @@ -483,12 +581,109 @@ static int vfio_save_complete_precopy(QEMUFile 
*f, void *opaque) return ret; } +static int vfio_load_setup(QEMUFile *f, void *opaque) +{ +VFIODevice *vbasedev = opaque; +VFIOMigration *migration = vbasedev->migration; +int ret = 0; + +if (migration->region.mmaps) { +ret = vfio_region_mmap(&migration->region); +if (ret) { +error_report("%s: Failed to mmap VFIO migration region %d: %s"
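The resume-side loop in vfio_load_buffer() splits an incoming chunk whenever it does not fit in the destination's migration-region data section (the destination region may be smaller than the source's). A simplified model of just that chunking arithmetic — note the real code re-reads data_offset from the vendor driver on every pass, whereas this sketch holds it fixed:

```c
#include <assert.h>
#include <stdint.h>

/* Count how many write iterations a chunk of data_size needs when the
 * data section spans [data_offset, region_size). Mirrors the do/while
 * split in vfio_load_buffer() under the fixed-offset assumption. */
static int count_chunks(uint64_t region_size, uint64_t data_offset,
                        uint64_t data_size)
{
    int chunks = 0;

    do {
        uint64_t size;

        if (data_offset + data_size > region_size) {
            /* Only part fits: write it and iterate on the remainder. */
            size = region_size - data_offset;
            data_size -= size;
        } else {
            size = data_size;
            data_size = 0;
        }
        chunks++;
    } while (data_size);

    return chunks;
}
```

For example, with a 0x1000-byte region and data starting at offset 0x100, a 0x500-byte chunk fits in one pass, while 0x2000 bytes need three.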
[PATCH v29 16/17] vfio: Make vfio-pci device migration capable
If the device is not a failover primary device, call vfio_migration_probe() and vfio_migration_finalize() to enable migration support for those devices that support it respectively to tear it down again. Removed migration blocker from VFIO PCI device specific structure and use migration blocker from generic structure of VFIO device. Signed-off-by: Kirti Wankhede Reviewed-by: Neo Jia Reviewed-by: Dr. David Alan Gilbert Reviewed-by: Cornelia Huck --- hw/vfio/pci.c | 28 hw/vfio/pci.h | 1 - 2 files changed, 8 insertions(+), 21 deletions(-) diff --git a/hw/vfio/pci.c b/hw/vfio/pci.c index e27c88be6d85..58c0ce8971e3 100644 --- a/hw/vfio/pci.c +++ b/hw/vfio/pci.c @@ -2791,17 +2791,6 @@ static void vfio_realize(PCIDevice *pdev, Error **errp) return; } -if (!pdev->failover_pair_id) { -error_setg(&vdev->migration_blocker, -"VFIO device doesn't support migration"); -ret = migrate_add_blocker(vdev->migration_blocker, errp); -if (ret) { -error_free(vdev->migration_blocker); -vdev->migration_blocker = NULL; -return; -} -} - vdev->vbasedev.name = g_path_get_basename(vdev->vbasedev.sysfsdev); vdev->vbasedev.ops = &vfio_pci_ops; vdev->vbasedev.type = VFIO_DEVICE_TYPE_PCI; @@ -3069,6 +3058,13 @@ static void vfio_realize(PCIDevice *pdev, Error **errp) } } +if (!pdev->failover_pair_id) { +ret = vfio_migration_probe(&vdev->vbasedev, errp); +if (ret) { +error_report("%s: Migration disabled", vdev->vbasedev.name); +} +} + vfio_register_err_notifier(vdev); vfio_register_req_notifier(vdev); vfio_setup_resetfn_quirk(vdev); @@ -3083,11 +3079,6 @@ out_teardown: vfio_bars_exit(vdev); error: error_prepend(errp, VFIO_MSG_PREFIX, vdev->vbasedev.name); -if (vdev->migration_blocker) { -migrate_del_blocker(vdev->migration_blocker); -error_free(vdev->migration_blocker); -vdev->migration_blocker = NULL; -} } static void vfio_instance_finalize(Object *obj) @@ -3099,10 +3090,6 @@ static void vfio_instance_finalize(Object *obj) vfio_bars_finalize(vdev); g_free(vdev->emulated_config_bits); 
g_free(vdev->rom); -if (vdev->migration_blocker) { -migrate_del_blocker(vdev->migration_blocker); -error_free(vdev->migration_blocker); -} /* * XXX Leaking igd_opregion is not an oversight, we can't remove the * fw_cfg entry therefore leaking this allocation seems like the safest @@ -3130,6 +3117,7 @@ static void vfio_exitfn(PCIDevice *pdev) } vfio_teardown_msi(vdev); vfio_bars_exit(vdev); +vfio_migration_finalize(&vdev->vbasedev); } static void vfio_pci_reset(DeviceState *dev) diff --git a/hw/vfio/pci.h b/hw/vfio/pci.h index bce71a9ac93f..1574ef983f8f 100644 --- a/hw/vfio/pci.h +++ b/hw/vfio/pci.h @@ -172,7 +172,6 @@ struct VFIOPCIDevice { bool no_vfio_ioeventfd; bool enable_ramfb; VFIODisplay *dpy; -Error *migration_blocker; Notifier irqchip_change_notifier; }; -- 2.7.0
[PATCH v29 06/17] vfio: Add migration state change notifier
Added migration state change notifier to get notification on migration state change. These states are translated to VFIO device state and conveyed to vendor driver. Signed-off-by: Kirti Wankhede Reviewed-by: Neo Jia Reviewed-by: Dr. David Alan Gilbert Reviewed-by: Cornelia Huck --- hw/vfio/migration.c | 30 ++ hw/vfio/trace-events | 1 + include/hw/vfio/vfio-common.h | 2 ++ 3 files changed, 33 insertions(+) diff --git a/hw/vfio/migration.c b/hw/vfio/migration.c index 65ce735d667b..888a615d39ea 100644 --- a/hw/vfio/migration.c +++ b/hw/vfio/migration.c @@ -175,6 +175,30 @@ static void vfio_vmstate_change(void *opaque, int running, RunState state) (migration->device_state & mask) | value); } +static void vfio_migration_state_notifier(Notifier *notifier, void *data) +{ +MigrationState *s = data; +VFIOMigration *migration = container_of(notifier, VFIOMigration, +migration_state); +VFIODevice *vbasedev = migration->vbasedev; +int ret; + +trace_vfio_migration_state_notifier(vbasedev->name, +MigrationStatus_str(s->state)); + +switch (s->state) { +case MIGRATION_STATUS_CANCELLING: +case MIGRATION_STATUS_CANCELLED: +case MIGRATION_STATUS_FAILED: +ret = vfio_migration_set_state(vbasedev, + ~(VFIO_DEVICE_STATE_SAVING | VFIO_DEVICE_STATE_RESUMING), + VFIO_DEVICE_STATE_RUNNING); +if (ret) { +error_report("%s: Failed to set state RUNNING", vbasedev->name); +} +} +} + static void vfio_migration_exit(VFIODevice *vbasedev) { VFIOMigration *migration = vbasedev->migration; @@ -190,6 +214,7 @@ static int vfio_migration_init(VFIODevice *vbasedev, { int ret; Object *obj; +VFIOMigration *migration; if (!vbasedev->ops->vfio_get_object) { return -EINVAL; @@ -217,8 +242,12 @@ static int vfio_migration_init(VFIODevice *vbasedev, goto err; } +migration = vbasedev->migration; +migration->vbasedev = vbasedev; migration->vm_state = qemu_add_vm_change_state_handler(vfio_vmstate_change, vbasedev); +migration->migration_state.notify = vfio_migration_state_notifier; 
+add_migration_state_change_notifier(&migration->migration_state); return 0; err: @@ -268,6 +297,7 @@ void vfio_migration_finalize(VFIODevice *vbasedev) if (vbasedev->migration) { VFIOMigration *migration = vbasedev->migration; +remove_migration_state_change_notifier(&migration->migration_state); qemu_del_vm_change_state_handler(migration->vm_state); vfio_migration_exit(vbasedev); } diff --git a/hw/vfio/trace-events b/hw/vfio/trace-events index 41de81f12f60..78d7d83b5ef8 100644 --- a/hw/vfio/trace-events +++ b/hw/vfio/trace-events @@ -150,3 +150,4 @@ vfio_display_edid_write_error(void) "" vfio_migration_probe(const char *name, uint32_t index) " (%s) Region %d" vfio_migration_set_state(const char *name, uint32_t state) " (%s) state %d" vfio_vmstate_change(const char *name, int running, const char *reason, uint32_t dev_state) " (%s) running %d reason %s device state %d" +vfio_migration_state_notifier(const char *name, const char *state) " (%s) state %s" diff --git a/include/hw/vfio/vfio-common.h b/include/hw/vfio/vfio-common.h index 9a571f1fb552..2bd593ba38bb 100644 --- a/include/hw/vfio/vfio-common.h +++ b/include/hw/vfio/vfio-common.h @@ -59,10 +59,12 @@ typedef struct VFIORegion { } VFIORegion; typedef struct VFIOMigration { +struct VFIODevice *vbasedev; VMChangeStateEntry *vm_state; VFIORegion region; uint32_t device_state; int vm_running; +Notifier migration_state; } VFIOMigration; typedef struct VFIOAddressSpace { -- 2.7.0
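The mask/value calling convention of vfio_migration_set_state() is worth spelling out: the new device state is computed as `(current & mask) | value`, so the notifier's `~(_SAVING | _RESUMING)` mask clears those bits while `_RUNNING` is OR-ed in. A sketch of that bit manipulation, using the v1 migration uapi bit values from linux/vfio.h (an assumption worth checking against the header):

```c
#include <assert.h>
#include <stdint.h>

/* VFIO v1 migration device-state bits (assumed from linux/vfio.h). */
#define VFIO_DEVICE_STATE_RUNNING  (1u << 0)
#define VFIO_DEVICE_STATE_SAVING   (1u << 1)
#define VFIO_DEVICE_STATE_RESUMING (1u << 2)

/* vfio_migration_set_state() computes the new state this way before
 * writing it to the migration region's device_state field. */
static uint32_t next_state(uint32_t cur, uint32_t mask, uint32_t value)
{
    return (cur & mask) | value;
}

/* What the notifier does on CANCELLING/CANCELLED/FAILED: drop any
 * _SAVING/_RESUMING bits and return the device to _RUNNING. */
static uint32_t cancel_migration(uint32_t cur)
{
    return next_state(cur,
                      ~(VFIO_DEVICE_STATE_SAVING |
                        VFIO_DEVICE_STATE_RESUMING),
                      VFIO_DEVICE_STATE_RUNNING);
}
```

So a device caught mid-save in _SAVING | _RUNNING, or mid-resume in _RESUMING, both land back in plain _RUNNING.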
[PATCH v29 10/17] memory: Set DIRTY_MEMORY_MIGRATION when IOMMU is enabled
mr->ram_block is NULL when mr->is_iommu is true, so fr.dirty_log_mask was not set correctly and the memory listener's log_sync callback never got called for IOMMU regions. This patch returns a log_mask with DIRTY_MEMORY_MIGRATION set when the IOMMU is enabled.

Signed-off-by: Kirti Wankhede
Reviewed-by: Yan Zhao
---
 softmmu/memory.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/softmmu/memory.c b/softmmu/memory.c
index 403ff3abc99b..94f606e9d9d9 100644
--- a/softmmu/memory.c
+++ b/softmmu/memory.c
@@ -1792,7 +1792,7 @@ bool memory_region_is_ram_device(MemoryRegion *mr)
 uint8_t memory_region_get_dirty_log_mask(MemoryRegion *mr)
 {
     uint8_t mask = mr->dirty_log_mask;
-    if (global_dirty_log && mr->ram_block) {
+    if (global_dirty_log && (mr->ram_block || memory_region_is_iommu(mr))) {
         mask |= (1 << DIRTY_MEMORY_MIGRATION);
     }
     return mask;
-- 
2.7.0
[PATCH v29 13/17] vfio: Add vfio_listener_log_sync to mark dirty pages
vfio_listener_log_sync gets list of dirty pages from container using VFIO_IOMMU_GET_DIRTY_BITMAP ioctl and mark those pages dirty when all devices are stopped and saving state. Return early for the RAM block section of mapped MMIO region. Signed-off-by: Kirti Wankhede Reviewed-by: Neo Jia --- hw/vfio/common.c | 116 +++ hw/vfio/trace-events | 1 + 2 files changed, 117 insertions(+) diff --git a/hw/vfio/common.c b/hw/vfio/common.c index d4959c036dd1..2634387df948 100644 --- a/hw/vfio/common.c +++ b/hw/vfio/common.c @@ -29,6 +29,7 @@ #include "hw/vfio/vfio.h" #include "exec/address-spaces.h" #include "exec/memory.h" +#include "exec/ram_addr.h" #include "hw/hw.h" #include "qemu/error-report.h" #include "qemu/main-loop.h" @@ -37,6 +38,7 @@ #include "sysemu/reset.h" #include "trace.h" #include "qapi/error.h" +#include "migration/migration.h" VFIOGroupList vfio_group_list = QLIST_HEAD_INITIALIZER(vfio_group_list); @@ -287,6 +289,39 @@ const MemoryRegionOps vfio_region_ops = { }; /* + * Device state interfaces + */ + +static bool vfio_devices_all_stopped_and_saving(VFIOContainer *container) +{ +VFIOGroup *group; +VFIODevice *vbasedev; +MigrationState *ms = migrate_get_current(); + +if (!migration_is_setup_or_active(ms->state)) { +return false; +} + +QLIST_FOREACH(group, &container->group_list, container_next) { +QLIST_FOREACH(vbasedev, &group->device_list, next) { +VFIOMigration *migration = vbasedev->migration; + +if (!migration) { +return false; +} + +if ((migration->device_state & VFIO_DEVICE_STATE_SAVING) && +!(migration->device_state & VFIO_DEVICE_STATE_RUNNING)) { +continue; +} else { +return false; +} +} +} +return true; +} + +/* * DMA - Mapping and unmapping for the "type1" IOMMU interface used on x86 */ static int vfio_dma_unmap(VFIOContainer *container, @@ -812,9 +847,90 @@ static void vfio_listener_region_del(MemoryListener *listener, } } +static int vfio_get_dirty_bitmap(VFIOContainer *container, uint64_t iova, + uint64_t size, ram_addr_t ram_addr) +{ +struct 
vfio_iommu_type1_dirty_bitmap *dbitmap; +struct vfio_iommu_type1_dirty_bitmap_get *range; +uint64_t pages; +int ret; + +dbitmap = g_malloc0(sizeof(*dbitmap) + sizeof(*range)); + +dbitmap->argsz = sizeof(*dbitmap) + sizeof(*range); +dbitmap->flags = VFIO_IOMMU_DIRTY_PAGES_FLAG_GET_BITMAP; +range = (struct vfio_iommu_type1_dirty_bitmap_get *)&dbitmap->data; +range->iova = iova; +range->size = size; + +/* + * cpu_physical_memory_set_dirty_lebitmap() expects pages in bitmap of + * TARGET_PAGE_SIZE to mark those dirty. Hence set bitmap's pgsize to + * TARGET_PAGE_SIZE. + */ +range->bitmap.pgsize = TARGET_PAGE_SIZE; + +pages = TARGET_PAGE_ALIGN(range->size) >> TARGET_PAGE_BITS; +range->bitmap.size = ROUND_UP(pages, sizeof(__u64) * BITS_PER_BYTE) / + BITS_PER_BYTE; +range->bitmap.data = g_try_malloc0(range->bitmap.size); +if (!range->bitmap.data) { +ret = -ENOMEM; +goto err_out; +} + +ret = ioctl(container->fd, VFIO_IOMMU_DIRTY_PAGES, dbitmap); +if (ret) { +error_report("Failed to get dirty bitmap for iova: 0x%llx " +"size: 0x%llx err: %d", +range->iova, range->size, errno); +goto err_out; +} + +cpu_physical_memory_set_dirty_lebitmap((uint64_t *)range->bitmap.data, +ram_addr, pages); + +trace_vfio_get_dirty_bitmap(container->fd, range->iova, range->size, +range->bitmap.size, ram_addr); +err_out: +g_free(range->bitmap.data); +g_free(dbitmap); + +return ret; +} + +static int vfio_sync_dirty_bitmap(VFIOContainer *container, + MemoryRegionSection *section) +{ +ram_addr_t ram_addr; + +ram_addr = memory_region_get_ram_addr(section->mr) + + section->offset_within_region; + +return vfio_get_dirty_bitmap(container, + TARGET_PAGE_ALIGN(section->offset_within_address_space), + int128_get64(section->size), ram_addr); +} + +static void vfio_listerner_log_sync(MemoryListener *listener, +MemoryRegionSection *section) +{ +VFIOContainer *container = container_of(listener, VFIOContainer, listener); + +if (vfio_listener_skipped_section(section) || +!container->
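The bitmap sizing in vfio_get_dirty_bitmap() allocates one bit per TARGET_PAGE_SIZE page, with the byte count rounded up to a multiple of sizeof(__u64) as the kernel requires. The arithmetic is easy to get wrong, so here it is isolated (assuming 4 KiB target pages, as on x86):

```c
#include <assert.h>
#include <stdint.h>

#define TARGET_PAGE_BITS 12                      /* assumes 4 KiB pages */
#define TARGET_PAGE_SIZE (1ull << TARGET_PAGE_BITS)
#define TARGET_PAGE_ALIGN(x) \
    (((x) + TARGET_PAGE_SIZE - 1) & ~(TARGET_PAGE_SIZE - 1))
#define BITS_PER_BYTE 8
#define ROUND_UP(n, d) ((((n) + (d) - 1) / (d)) * (d))

/* Same arithmetic as the patch: one bit per target page, byte count
 * rounded up to a multiple of sizeof(uint64_t) * 8 bits. */
static uint64_t dirty_bitmap_bytes(uint64_t range_size)
{
    uint64_t pages = TARGET_PAGE_ALIGN(range_size) >> TARGET_PAGE_BITS;

    return ROUND_UP(pages, sizeof(uint64_t) * BITS_PER_BYTE) / BITS_PER_BYTE;
}
```

A single page still costs 8 bytes of bitmap (the uint64_t granularity), and 1 MiB of IOVA space costs 32 bytes (256 pages rounded to 256 bits).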
[PATCH v29 07/17] vfio: Register SaveVMHandlers for VFIO device
Define flags to be used as delimiter in migration stream for VFIO devices. Added .save_setup and .save_cleanup functions. Map & unmap migration region from these functions at source during saving or pre-copy phase. Set VFIO device state depending on VM's state. During live migration, VM is running when .save_setup is called, _SAVING | _RUNNING state is set for VFIO device. During save-restore, VM is paused, _SAVING state is set for VFIO device. Signed-off-by: Kirti Wankhede Reviewed-by: Neo Jia Reviewed-by: Cornelia Huck Reviewed-by: Yan Zhao --- hw/vfio/migration.c | 102 +++ hw/vfio/trace-events | 2 + 2 files changed, 104 insertions(+) diff --git a/hw/vfio/migration.c b/hw/vfio/migration.c index 888a615d39ea..d3ef9e18f39c 100644 --- a/hw/vfio/migration.c +++ b/hw/vfio/migration.c @@ -8,12 +8,15 @@ */ #include "qemu/osdep.h" +#include "qemu/main-loop.h" +#include "qemu/cutils.h" #include #include "sysemu/runstate.h" #include "hw/vfio/vfio-common.h" #include "cpu.h" #include "migration/migration.h" +#include "migration/vmstate.h" #include "migration/qemu-file.h" #include "migration/register.h" #include "migration/blocker.h" @@ -25,6 +28,22 @@ #include "trace.h" #include "hw/hw.h" +/* + * Flags to be used as unique delimiters for VFIO devices in the migration + * stream. These flags are composed as: + * 0x => MSB 32-bit all 1s + * 0xef10 => Magic ID, represents emulated (virtual) function IO + * 0x => 16-bits reserved for flags + * + * The beginning of state information is marked by _DEV_CONFIG_STATE, + * _DEV_SETUP_STATE, or _DEV_DATA_STATE, respectively. The end of a + * certain state information is marked by _END_OF_STATE. 
+ */ +#define VFIO_MIG_FLAG_END_OF_STATE (0xef11ULL) +#define VFIO_MIG_FLAG_DEV_CONFIG_STATE (0xef12ULL) +#define VFIO_MIG_FLAG_DEV_SETUP_STATE (0xef13ULL) +#define VFIO_MIG_FLAG_DEV_DATA_STATE(0xef14ULL) + static inline int vfio_mig_access(VFIODevice *vbasedev, void *val, int count, off_t off, bool iswrite) { @@ -129,6 +148,75 @@ static int vfio_migration_set_state(VFIODevice *vbasedev, uint32_t mask, return 0; } +static void vfio_migration_cleanup(VFIODevice *vbasedev) +{ +VFIOMigration *migration = vbasedev->migration; + +if (migration->region.mmaps) { +vfio_region_unmap(&migration->region); +} +} + +/* -- */ + +static int vfio_save_setup(QEMUFile *f, void *opaque) +{ +VFIODevice *vbasedev = opaque; +VFIOMigration *migration = vbasedev->migration; +int ret; + +trace_vfio_save_setup(vbasedev->name); + +qemu_put_be64(f, VFIO_MIG_FLAG_DEV_SETUP_STATE); + +if (migration->region.mmaps) { +/* + * Calling vfio_region_mmap() from migration thread. Memory API called + * from this function require locking the iothread when called from + * outside the main loop thread. 
+ */ +qemu_mutex_lock_iothread(); +ret = vfio_region_mmap(&migration->region); +qemu_mutex_unlock_iothread(); +if (ret) { +error_report("%s: Failed to mmap VFIO migration region: %s", + vbasedev->name, strerror(-ret)); +error_report("%s: Falling back to slow path", vbasedev->name); +} +} + +ret = vfio_migration_set_state(vbasedev, VFIO_DEVICE_STATE_MASK, + VFIO_DEVICE_STATE_SAVING); +if (ret) { +error_report("%s: Failed to set state SAVING", vbasedev->name); +return ret; +} + +qemu_put_be64(f, VFIO_MIG_FLAG_END_OF_STATE); + +ret = qemu_file_get_error(f); +if (ret) { +return ret; +} + +return 0; +} + +static void vfio_save_cleanup(void *opaque) +{ +VFIODevice *vbasedev = opaque; + +vfio_migration_cleanup(vbasedev); +trace_vfio_save_cleanup(vbasedev->name); +} + +static SaveVMHandlers savevm_vfio_handlers = { +.save_setup = vfio_save_setup, +.save_cleanup = vfio_save_cleanup, +}; + +/* -- */ + static void vfio_vmstate_change(void *opaque, int running, RunState state) { VFIODevice *vbasedev = opaque; @@ -215,6 +303,8 @@ static int vfio_migration_init(VFIODevice *vbasedev, int ret; Object *obj; VFIOMigration *migration; +char id[256] = ""; +g_autofree char *path = NULL, *oid = NULL; if (!vbasedev->ops->vfio_get_object) { return -EINVAL; @@ -244,6 +334,18 @@ static int vfio_migration_init(VFIODevice *vbasedev, m
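The delimiter flag constants in the hunk above lost hex digits in transit (they appear as `0xef11ULL` etc.). Reconstructed per the layout described in the comment — all-ones in the high 32 bits, the 0xef10 "emulated function IO" magic, then a 16-bit flag field — and matching upstream QEMU, the values and a dispatch over them look like this (treat the exact constants as a reconstruction to verify against hw/vfio/migration.c):

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Reconstructed flag values; the quoted patch text is mangled here. */
#define VFIO_MIG_FLAG_END_OF_STATE      (0xffffffffef100001ULL)
#define VFIO_MIG_FLAG_DEV_CONFIG_STATE  (0xffffffffef100002ULL)
#define VFIO_MIG_FLAG_DEV_SETUP_STATE   (0xffffffffef100003ULL)
#define VFIO_MIG_FLAG_DEV_DATA_STATE    (0xffffffffef100004ULL)

/* How the load side classifies a delimiter read from the stream. */
static const char *flag_name(uint64_t f)
{
    switch (f) {
    case VFIO_MIG_FLAG_END_OF_STATE:     return "END_OF_STATE";
    case VFIO_MIG_FLAG_DEV_CONFIG_STATE: return "DEV_CONFIG_STATE";
    case VFIO_MIG_FLAG_DEV_SETUP_STATE:  return "DEV_SETUP_STATE";
    case VFIO_MIG_FLAG_DEV_DATA_STATE:   return "DEV_DATA_STATE";
    default:                             return "unknown";
    }
}
```

The all-ones high word makes the delimiters implausible as device data, so a desynchronized stream fails fast instead of being misparsed.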
[PATCH v29 12/17] vfio: Add function to start and stop dirty pages tracking
Call VFIO_IOMMU_DIRTY_PAGES ioctl to start and stop dirty pages tracking for VFIO devices.

Signed-off-by: Kirti Wankhede
Reviewed-by: Dr. David Alan Gilbert
---
 hw/vfio/migration.c | 36 ++++++++++++++++++++++++++++++++++++
 1 file changed, 36 insertions(+)

diff --git a/hw/vfio/migration.c b/hw/vfio/migration.c
index 93f8fe7bd869..ffedbcca179d 100644
--- a/hw/vfio/migration.c
+++ b/hw/vfio/migration.c
@@ -11,6 +11,7 @@
 #include "qemu/main-loop.h"
 #include "qemu/cutils.h"
 #include <linux/vfio.h>
+#include <sys/ioctl.h>
 
 #include "sysemu/runstate.h"
 #include "hw/vfio/vfio-common.h"
@@ -391,10 +392,40 @@ static int vfio_load_device_config_state(QEMUFile *f, void *opaque)
     return qemu_file_get_error(f);
 }
 
+static int vfio_set_dirty_page_tracking(VFIODevice *vbasedev, bool start)
+{
+    int ret;
+    VFIOMigration *migration = vbasedev->migration;
+    VFIOContainer *container = vbasedev->group->container;
+    struct vfio_iommu_type1_dirty_bitmap dirty = {
+        .argsz = sizeof(dirty),
+    };
+
+    if (start) {
+        if (migration->device_state & VFIO_DEVICE_STATE_SAVING) {
+            dirty.flags = VFIO_IOMMU_DIRTY_PAGES_FLAG_START;
+        } else {
+            return -EINVAL;
+        }
+    } else {
+        dirty.flags = VFIO_IOMMU_DIRTY_PAGES_FLAG_STOP;
+    }
+
+    ret = ioctl(container->fd, VFIO_IOMMU_DIRTY_PAGES, &dirty);
+    if (ret) {
+        error_report("Failed to set dirty tracking flag 0x%x errno: %d",
+                     dirty.flags, errno);
+        return -errno;
+    }
+    return ret;
+}
+
 static void vfio_migration_cleanup(VFIODevice *vbasedev)
 {
     VFIOMigration *migration = vbasedev->migration;
 
+    vfio_set_dirty_page_tracking(vbasedev, false);
+
     if (migration->region.mmaps) {
         vfio_region_unmap(&migration->region);
     }
@@ -435,6 +466,11 @@ static int vfio_save_setup(QEMUFile *f, void *opaque)
         return ret;
     }
 
+    ret = vfio_set_dirty_page_tracking(vbasedev, true);
+    if (ret) {
+        return ret;
+    }
+
     qemu_put_be64(f, VFIO_MIG_FLAG_END_OF_STATE);
 
     ret = qemu_file_get_error(f);
-- 
2.7.0
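The start/stop decision in vfio_set_dirty_page_tracking() has one subtlety: starting tracking is only valid once the device has entered _SAVING, while stopping is unconditional. That selection logic, isolated with the bit values assumed from the linux/vfio.h uapi (worth double-checking against the header):

```c
#include <assert.h>
#include <errno.h>
#include <stdint.h>

/* Assumed uapi bit values from linux/vfio.h. */
#define VFIO_IOMMU_DIRTY_PAGES_FLAG_START (1u << 0)
#define VFIO_IOMMU_DIRTY_PAGES_FLAG_STOP  (1u << 1)
#define VFIO_DEVICE_STATE_SAVING          (1u << 1)

/* Mirrors vfio_set_dirty_page_tracking()'s flag selection. */
static int pick_dirty_flags(uint32_t device_state, int start, uint32_t *flags)
{
    if (start) {
        if (!(device_state & VFIO_DEVICE_STATE_SAVING)) {
            return -EINVAL;    /* must already be in _SAVING */
        }
        *flags = VFIO_IOMMU_DIRTY_PAGES_FLAG_START;
    } else {
        *flags = VFIO_IOMMU_DIRTY_PAGES_FLAG_STOP;
    }
    return 0;
}

static int demo_start_blocked(void)
{
    uint32_t f = 0;
    return pick_dirty_flags(0, 1, &f);
}

static uint32_t demo_start_ok(void)
{
    uint32_t f = 0;
    pick_dirty_flags(VFIO_DEVICE_STATE_SAVING, 1, &f);
    return f;
}

static uint32_t demo_stop(void)
{
    uint32_t f = 0;
    pick_dirty_flags(0, 0, &f);
    return f;
}
```

That ordering is why the patch issues the start only from vfio_save_setup(), after _SAVING has been set, and the unconditional stop from vfio_migration_cleanup().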
[PATCH v29 15/17] vfio: Add ioctl to get dirty pages bitmap during dma unmap
With vIOMMU, an IO virtual address range can get unmapped while in the pre-copy phase of migration. In that case, the unmap ioctl should return the pages pinned in that range, and QEMU should find the corresponding guest physical addresses and report them dirty.

Suggested-by: Alex Williamson
Signed-off-by: Kirti Wankhede
Reviewed-by: Neo Jia
---
 hw/vfio/common.c | 96 +++++++++++++++++++++++++++++++++++++++++++++++++++----
 1 file changed, 92 insertions(+), 4 deletions(-)

diff --git a/hw/vfio/common.c b/hw/vfio/common.c
index c0b5b6245a47..49c68a5253ae 100644
--- a/hw/vfio/common.c
+++ b/hw/vfio/common.c
@@ -321,11 +321,94 @@ static bool vfio_devices_all_stopped_and_saving(VFIOContainer *container)
     return true;
 }
 
+static bool vfio_devices_all_running_and_saving(VFIOContainer *container)
+{
+    VFIOGroup *group;
+    VFIODevice *vbasedev;
+    MigrationState *ms = migrate_get_current();
+
+    if (!migration_is_setup_or_active(ms->state)) {
+        return false;
+    }
+
+    QLIST_FOREACH(group, &container->group_list, container_next) {
+        QLIST_FOREACH(vbasedev, &group->device_list, next) {
+            VFIOMigration *migration = vbasedev->migration;
+
+            if (!migration) {
+                return false;
+            }
+
+            if ((migration->device_state & VFIO_DEVICE_STATE_SAVING) &&
+                (migration->device_state & VFIO_DEVICE_STATE_RUNNING)) {
+                continue;
+            } else {
+                return false;
+            }
+        }
+    }
+    return true;
+}
+
+static int vfio_dma_unmap_bitmap(VFIOContainer *container,
+                                 hwaddr iova, ram_addr_t size,
+                                 IOMMUTLBEntry *iotlb)
+{
+    struct vfio_iommu_type1_dma_unmap *unmap;
+    struct vfio_bitmap *bitmap;
+    uint64_t pages = TARGET_PAGE_ALIGN(size) >> TARGET_PAGE_BITS;
+    int ret;
+
+    unmap = g_malloc0(sizeof(*unmap) + sizeof(*bitmap));
+
+    unmap->argsz = sizeof(*unmap) + sizeof(*bitmap);
+    unmap->iova = iova;
+    unmap->size = size;
+    unmap->flags |= VFIO_DMA_UNMAP_FLAG_GET_DIRTY_BITMAP;
+    bitmap = (struct vfio_bitmap *)&unmap->data;
+
+    /*
+     * cpu_physical_memory_set_dirty_lebitmap() expects pages in bitmap of
+     * TARGET_PAGE_SIZE to mark those dirty. Hence set bitmap_pgsize to
+     * TARGET_PAGE_SIZE.
+     */
+
+    bitmap->pgsize = TARGET_PAGE_SIZE;
+    bitmap->size = ROUND_UP(pages, sizeof(__u64) * BITS_PER_BYTE) /
+                   BITS_PER_BYTE;
+
+    if (bitmap->size > container->max_dirty_bitmap_size) {
+        error_report("UNMAP: Size of bitmap too big 0x%llx", bitmap->size);
+        ret = -E2BIG;
+        goto unmap_exit;
+    }
+
+    bitmap->data = g_try_malloc0(bitmap->size);
+    if (!bitmap->data) {
+        ret = -ENOMEM;
+        goto unmap_exit;
+    }
+
+    ret = ioctl(container->fd, VFIO_IOMMU_UNMAP_DMA, unmap);
+    if (!ret) {
+        cpu_physical_memory_set_dirty_lebitmap((uint64_t *)bitmap->data,
+                iotlb->translated_addr, pages);
+    } else {
+        error_report("VFIO_UNMAP_DMA with DIRTY_BITMAP : %m");
+    }
+
+    g_free(bitmap->data);
+unmap_exit:
+    g_free(unmap);
+    return ret;
+}
+
 /*
  * DMA - Mapping and unmapping for the "type1" IOMMU interface used on x86
  */
 static int vfio_dma_unmap(VFIOContainer *container,
-                          hwaddr iova, ram_addr_t size)
+                          hwaddr iova, ram_addr_t size,
+                          IOMMUTLBEntry *iotlb)
 {
     struct vfio_iommu_type1_dma_unmap unmap = {
         .argsz = sizeof(unmap),
@@ -334,6 +417,11 @@ static int vfio_dma_unmap(VFIOContainer *container,
         .size = size,
     };
 
+    if (iotlb && container->dirty_pages_supported &&
+        vfio_devices_all_running_and_saving(container)) {
+        return vfio_dma_unmap_bitmap(container, iova, size, iotlb);
+    }
+
     while (ioctl(container->fd, VFIO_IOMMU_UNMAP_DMA, &unmap)) {
         /*
          * The type1 backend has an off-by-one bug in the kernel (71a7d3d78e3c
@@ -381,7 +469,7 @@ static int vfio_dma_map(VFIOContainer *container, hwaddr iova,
      * the VGA ROM space.
      */
     if (ioctl(container->fd, VFIO_IOMMU_MAP_DMA, &map) == 0 ||
-        (errno == EBUSY && vfio_dma_unmap(container, iova, size) == 0 &&
+        (errno == EBUSY && vfio_dma_unmap(container, iova, size, NULL) == 0 &&
          ioctl(container->fd, VFIO_IOMMU_MAP_DMA, &map) == 0)) {
         return 0;
     }
@@ -531,7 +619,7 @@ static void vfio_iommu_map_notify(IOMMUNotifier *n, IOMMUTLBEntry *iotlb)
                          iotlb->addr_mask + 1, vaddr, ret);
         }
     } else {
-        ret = vfio_dma_unmap(container, iova, iotlb->addr_mask + 1);
+        ret = vfio_dma_unmap(contain
[PATCH v29 08/17] vfio: Add save state functions to SaveVMHandlers
Added .save_live_pending, .save_live_iterate and .save_live_complete_precopy functions. These functions handle the pre-copy and stop-and-copy phases.

In _SAVING|_RUNNING device state or pre-copy phase:
- read pending_bytes. If pending_bytes > 0, go through below steps.
- read data_offset - indicates kernel driver to write data to staging buffer.
- read data_size - amount of data in bytes written by vendor driver in migration region.
- read data_size bytes of data from data_offset in the migration region.
- Write data packet to file stream as below:
  {VFIO_MIG_FLAG_DEV_DATA_STATE, data_size, actual data, VFIO_MIG_FLAG_END_OF_STATE}

In _SAVING device state or stop-and-copy phase:
a. read config space of device and save to migration file stream. This doesn't need to be from vendor driver. Any other special config state from driver can be saved as data in following iteration.
b. read pending_bytes. If pending_bytes > 0, go through below steps.
c. read data_offset - indicates kernel driver to write data to staging buffer.
d. read data_size - amount of data in bytes written by vendor driver in migration region.
e. read data_size bytes of data from data_offset in the migration region.
f. Write data packet as below: {VFIO_MIG_FLAG_DEV_DATA_STATE, data_size, actual data}
g. iterate through steps b to f while (pending_bytes > 0)
h. Write {VFIO_MIG_FLAG_END_OF_STATE}

When the data region is mapped, it is the user's responsibility to read data_size bytes of data from data_offset before moving to the next steps.

Added fix suggested by Artem Polyakov to reset pending_bytes in vfio_save_iterate(). Added fix suggested by Zhi Wang to add 0 as data size in migration stream and add END_OF_STATE delimiter to indicate phase complete.
Suggested-by: Artem Polyakov Suggested-by: Zhi Wang Signed-off-by: Kirti Wankhede Reviewed-by: Neo Jia Reviewed-by: Yan Zhao --- hw/vfio/migration.c | 276 ++ hw/vfio/trace-events | 6 + include/hw/vfio/vfio-common.h | 1 + 3 files changed, 283 insertions(+) diff --git a/hw/vfio/migration.c b/hw/vfio/migration.c index d3ef9e18f39c..41d568558479 100644 --- a/hw/vfio/migration.c +++ b/hw/vfio/migration.c @@ -148,6 +148,151 @@ static int vfio_migration_set_state(VFIODevice *vbasedev, uint32_t mask, return 0; } +static void *get_data_section_size(VFIORegion *region, uint64_t data_offset, + uint64_t data_size, uint64_t *size) +{ +void *ptr = NULL; +uint64_t limit = 0; +int i; + +if (!region->mmaps) { +if (size) { +*size = MIN(data_size, region->size - data_offset); +} +return ptr; +} + +for (i = 0; i < region->nr_mmaps; i++) { +VFIOMmap *map = region->mmaps + i; + +if ((data_offset >= map->offset) && +(data_offset < map->offset + map->size)) { + +/* check if data_offset is within sparse mmap areas */ +ptr = map->mmap + data_offset - map->offset; +if (size) { +*size = MIN(data_size, map->offset + map->size - data_offset); +} +break; +} else if ((data_offset < map->offset) && + (!limit || limit > map->offset)) { +/* + * data_offset is not within sparse mmap areas, find size of + * non-mapped area. Check through all list since region->mmaps list + * is not sorted. + */ +limit = map->offset; +} +} + +if (!ptr && size) { +*size = limit ? 
MIN(data_size, limit - data_offset) : data_size; +} +return ptr; +} + +static int vfio_save_buffer(QEMUFile *f, VFIODevice *vbasedev, uint64_t *size) +{ +VFIOMigration *migration = vbasedev->migration; +VFIORegion *region = &migration->region; +uint64_t data_offset = 0, data_size = 0, sz; +int ret; + +ret = vfio_mig_read(vbasedev, &data_offset, sizeof(data_offset), + region->fd_offset + VFIO_MIG_STRUCT_OFFSET(data_offset)); +if (ret < 0) { +return ret; +} + +ret = vfio_mig_read(vbasedev, &data_size, sizeof(data_size), +region->fd_offset + VFIO_MIG_STRUCT_OFFSET(data_size)); +if (ret < 0) { +return ret; +} + +trace_vfio_save_buffer(vbasedev->name, data_offset, data_size, + migration->pending_bytes); + +qemu_put_be64(f, data_size); +sz = data_size; + +while (sz) { +void *buf; +uint64_t sec_size; +bool buf_allocated = false; + +buf = get_data_section_size(region, data_offset, sz, &sec_size); + +if (!buf) { +buf = g_try_malloc(sec_size); +if (!buf) { +error_report("%s: Error allocating buffer ",
[PATCH v29 05/17] vfio: Add VM state change handler to know state of VM
VM state change handler is called on change in VM's state. Based on VM state, VFIO device state should be changed. Added read/write helper functions for migration region. Added function to set device_state. Signed-off-by: Kirti Wankhede Reviewed-by: Neo Jia Reviewed-by: Dr. David Alan Gilbert Reviewed-by: Cornelia Huck --- hw/vfio/migration.c | 158 ++ hw/vfio/trace-events | 2 + include/hw/vfio/vfio-common.h | 4 ++ 3 files changed, 164 insertions(+) diff --git a/hw/vfio/migration.c b/hw/vfio/migration.c index fd7faf423cdc..65ce735d667b 100644 --- a/hw/vfio/migration.c +++ b/hw/vfio/migration.c @@ -10,6 +10,7 @@ #include "qemu/osdep.h" #include +#include "sysemu/runstate.h" #include "hw/vfio/vfio-common.h" #include "cpu.h" #include "migration/migration.h" @@ -22,6 +23,157 @@ #include "exec/ram_addr.h" #include "pci.h" #include "trace.h" +#include "hw/hw.h" + +static inline int vfio_mig_access(VFIODevice *vbasedev, void *val, int count, + off_t off, bool iswrite) +{ +int ret; + +ret = iswrite ? pwrite(vbasedev->fd, val, count, off) : +pread(vbasedev->fd, val, count, off); +if (ret < count) { +error_report("vfio_mig_%s %d byte %s: failed at offset 0x%lx, err: %s", + iswrite ? "write" : "read", count, + vbasedev->name, off, strerror(errno)); +return (ret < 0) ? 
ret : -EINVAL; +} +return 0; +} + +static int vfio_mig_rw(VFIODevice *vbasedev, __u8 *buf, size_t count, + off_t off, bool iswrite) +{ +int ret, done = 0; +__u8 *tbuf = buf; + +while (count) { +int bytes = 0; + +if (count >= 8 && !(off % 8)) { +bytes = 8; +} else if (count >= 4 && !(off % 4)) { +bytes = 4; +} else if (count >= 2 && !(off % 2)) { +bytes = 2; +} else { +bytes = 1; +} + +ret = vfio_mig_access(vbasedev, tbuf, bytes, off, iswrite); +if (ret) { +return ret; +} + +count -= bytes; +done += bytes; +off += bytes; +tbuf += bytes; +} +return done; +} + +#define vfio_mig_read(f, v, c, o) vfio_mig_rw(f, (__u8 *)v, c, o, false) +#define vfio_mig_write(f, v, c, o) vfio_mig_rw(f, (__u8 *)v, c, o, true) + +#define VFIO_MIG_STRUCT_OFFSET(f) \ + offsetof(struct vfio_device_migration_info, f) +/* + * Change the device_state register for device @vbasedev. Bits set in @mask + * are preserved, bits set in @value are set, and bits not set in either @mask + * or @value are cleared in device_state. If the register cannot be accessed, + * the resulting state would be invalid, or the device enters an error state, + * an error is returned. 
+ */ + +static int vfio_migration_set_state(VFIODevice *vbasedev, uint32_t mask, +uint32_t value) +{ +VFIOMigration *migration = vbasedev->migration; +VFIORegion *region = &migration->region; +off_t dev_state_off = region->fd_offset + + VFIO_MIG_STRUCT_OFFSET(device_state); +uint32_t device_state; +int ret; + +ret = vfio_mig_read(vbasedev, &device_state, sizeof(device_state), +dev_state_off); +if (ret < 0) { +return ret; +} + +device_state = (device_state & mask) | value; + +if (!VFIO_DEVICE_STATE_VALID(device_state)) { +return -EINVAL; +} + +ret = vfio_mig_write(vbasedev, &device_state, sizeof(device_state), + dev_state_off); +if (ret < 0) { +int rret; + +rret = vfio_mig_read(vbasedev, &device_state, sizeof(device_state), + dev_state_off); + +if ((rret < 0) || (VFIO_DEVICE_STATE_IS_ERROR(device_state))) { +hw_error("%s: Device in error state 0x%x", vbasedev->name, + device_state); +return rret ? rret : -EIO; +} +return ret; +} + +migration->device_state = device_state; +trace_vfio_migration_set_state(vbasedev->name, device_state); +return 0; +} + +static void vfio_vmstate_change(void *opaque, int running, RunState state) +{ +VFIODevice *vbasedev = opaque; +VFIOMigration *migration = vbasedev->migration; +uint32_t value, mask; +int ret; + +if ((vbasedev->migration->vm_running == running)) { +return; +} + +if (running) { +/* + * Here device state can have one of _SAVING, _RESUMING or _STOP bit. + * Transition from _SAVING to _RUNNING can happen if there is mig
[PATCH v29 04/17] vfio: Add migration region initialization and finalize function
Whether the VFIO device supports migration or not is decided based of migration region query. If migration region query is successful and migration region initialization is successful then migration is supported else migration is blocked. Signed-off-by: Kirti Wankhede Reviewed-by: Neo Jia Acked-by: Dr. David Alan Gilbert --- hw/vfio/meson.build | 1 + hw/vfio/migration.c | 122 ++ hw/vfio/trace-events | 3 ++ include/hw/vfio/vfio-common.h | 9 4 files changed, 135 insertions(+) create mode 100644 hw/vfio/migration.c diff --git a/hw/vfio/meson.build b/hw/vfio/meson.build index 37efa74018bc..da9af297a0c5 100644 --- a/hw/vfio/meson.build +++ b/hw/vfio/meson.build @@ -2,6 +2,7 @@ vfio_ss = ss.source_set() vfio_ss.add(files( 'common.c', 'spapr.c', + 'migration.c', )) vfio_ss.add(when: 'CONFIG_VFIO_PCI', if_true: files( 'display.c', diff --git a/hw/vfio/migration.c b/hw/vfio/migration.c new file mode 100644 index ..fd7faf423cdc --- /dev/null +++ b/hw/vfio/migration.c @@ -0,0 +1,122 @@ +/* + * Migration support for VFIO devices + * + * Copyright NVIDIA, Inc. 2020 + * + * This work is licensed under the terms of the GNU GPL, version 2. See + * the COPYING file in the top-level directory. 
+ */ + +#include "qemu/osdep.h" +#include + +#include "hw/vfio/vfio-common.h" +#include "cpu.h" +#include "migration/migration.h" +#include "migration/qemu-file.h" +#include "migration/register.h" +#include "migration/blocker.h" +#include "migration/misc.h" +#include "qapi/error.h" +#include "exec/ramlist.h" +#include "exec/ram_addr.h" +#include "pci.h" +#include "trace.h" + +static void vfio_migration_exit(VFIODevice *vbasedev) +{ +VFIOMigration *migration = vbasedev->migration; + +vfio_region_exit(&migration->region); +vfio_region_finalize(&migration->region); +g_free(vbasedev->migration); +vbasedev->migration = NULL; +} + +static int vfio_migration_init(VFIODevice *vbasedev, + struct vfio_region_info *info) +{ +int ret; +Object *obj; + +if (!vbasedev->ops->vfio_get_object) { +return -EINVAL; +} + +obj = vbasedev->ops->vfio_get_object(vbasedev); +if (!obj) { +return -EINVAL; +} + +vbasedev->migration = g_new0(VFIOMigration, 1); + +ret = vfio_region_setup(obj, vbasedev, &vbasedev->migration->region, +info->index, "migration"); +if (ret) { +error_report("%s: Failed to setup VFIO migration region %d: %s", + vbasedev->name, info->index, strerror(-ret)); +goto err; +} + +if (!vbasedev->migration->region.size) { +error_report("%s: Invalid zero-sized VFIO migration region %d", + vbasedev->name, info->index); +ret = -EINVAL; +goto err; +} +return 0; + +err: +vfio_migration_exit(vbasedev); +return ret; +} + +/* -- */ + +int vfio_migration_probe(VFIODevice *vbasedev, Error **errp) +{ +struct vfio_region_info *info = NULL; +Error *local_err = NULL; +int ret; + +ret = vfio_get_dev_region_info(vbasedev, VFIO_REGION_TYPE_MIGRATION, + VFIO_REGION_SUBTYPE_MIGRATION, &info); +if (ret) { +goto add_blocker; +} + +ret = vfio_migration_init(vbasedev, info); +if (ret) { +goto add_blocker; +} + +g_free(info); +trace_vfio_migration_probe(vbasedev->name, info->index); +return 0; + +add_blocker: +error_setg(&vbasedev->migration_blocker, + "VFIO device doesn't support migration"); 
+g_free(info); + +ret = migrate_add_blocker(vbasedev->migration_blocker, &local_err); +if (local_err) { +error_propagate(errp, local_err); +error_free(vbasedev->migration_blocker); +vbasedev->migration_blocker = NULL; +} +return ret; +} + +void vfio_migration_finalize(VFIODevice *vbasedev) +{ +if (vbasedev->migration) { +vfio_migration_exit(vbasedev); +} + +if (vbasedev->migration_blocker) { +migrate_del_blocker(vbasedev->migration_blocker); +error_free(vbasedev->migration_blocker); +vbasedev->migration_blocker = NULL; +} +} diff --git a/hw/vfio/trace-events b/hw/vfio/trace-events index a0c7b49a2ebc..9ced5ec6277c 100644 --- a/hw/vfio/trace-events +++ b/hw/vfio/trace-events @@ -145,3 +145,6 @@ vfio_display_edid_link_up(void) "" vfio_display_edid_link_down(void) "" vfio_display_edid_update(uint32_t prefx
[PATCH v29 02/17] vfio: Add vfio_get_object callback to VFIODeviceOps
Hook vfio_get_object callback for PCI devices. Signed-off-by: Kirti Wankhede Reviewed-by: Neo Jia Suggested-by: Cornelia Huck Reviewed-by: Cornelia Huck --- hw/vfio/pci.c | 8 include/hw/vfio/vfio-common.h | 1 + 2 files changed, 9 insertions(+) diff --git a/hw/vfio/pci.c b/hw/vfio/pci.c index 0d83eb0e47bb..bffd5bfe3b78 100644 --- a/hw/vfio/pci.c +++ b/hw/vfio/pci.c @@ -2394,10 +2394,18 @@ static void vfio_pci_compute_needs_reset(VFIODevice *vbasedev) } } +static Object *vfio_pci_get_object(VFIODevice *vbasedev) +{ +VFIOPCIDevice *vdev = container_of(vbasedev, VFIOPCIDevice, vbasedev); + +return OBJECT(vdev); +} + static VFIODeviceOps vfio_pci_ops = { .vfio_compute_needs_reset = vfio_pci_compute_needs_reset, .vfio_hot_reset_multi = vfio_pci_hot_reset_multi, .vfio_eoi = vfio_intx_eoi, +.vfio_get_object = vfio_pci_get_object, }; int vfio_populate_vga(VFIOPCIDevice *vdev, Error **errp) diff --git a/include/hw/vfio/vfio-common.h b/include/hw/vfio/vfio-common.h index dc95f527b583..fe99c36a693a 100644 --- a/include/hw/vfio/vfio-common.h +++ b/include/hw/vfio/vfio-common.h @@ -119,6 +119,7 @@ struct VFIODeviceOps { void (*vfio_compute_needs_reset)(VFIODevice *vdev); int (*vfio_hot_reset_multi)(VFIODevice *vdev); void (*vfio_eoi)(VFIODevice *vdev); +Object *(*vfio_get_object)(VFIODevice *vdev); }; typedef struct VFIOGroup { -- 2.7.0
[PATCH v29 03/17] vfio: Add save and load functions for VFIO PCI devices
Added functions to save and restore PCI device specific data, specifically config space of PCI device. Signed-off-by: Kirti Wankhede Reviewed-by: Neo Jia --- hw/vfio/pci.c | 51 +++ include/hw/vfio/vfio-common.h | 2 ++ 2 files changed, 53 insertions(+) diff --git a/hw/vfio/pci.c b/hw/vfio/pci.c index bffd5bfe3b78..e27c88be6d85 100644 --- a/hw/vfio/pci.c +++ b/hw/vfio/pci.c @@ -41,6 +41,7 @@ #include "trace.h" #include "qapi/error.h" #include "migration/blocker.h" +#include "migration/qemu-file.h" #define TYPE_VFIO_PCI_NOHOTPLUG "vfio-pci-nohotplug" @@ -2401,11 +2402,61 @@ static Object *vfio_pci_get_object(VFIODevice *vbasedev) return OBJECT(vdev); } +static bool vfio_msix_present(void *opaque, int version_id) +{ +PCIDevice *pdev = opaque; + +return msix_present(pdev); +} + +const VMStateDescription vmstate_vfio_pci_config = { +.name = "VFIOPCIDevice", +.version_id = 1, +.minimum_version_id = 1, +.fields = (VMStateField[]) { +VMSTATE_PCI_DEVICE(pdev, VFIOPCIDevice), +VMSTATE_MSIX_TEST(pdev, VFIOPCIDevice, vfio_msix_present), +VMSTATE_END_OF_LIST() +} +}; + +static void vfio_pci_save_config(VFIODevice *vbasedev, QEMUFile *f) +{ +VFIOPCIDevice *vdev = container_of(vbasedev, VFIOPCIDevice, vbasedev); + +vmstate_save_state(f, &vmstate_vfio_pci_config, vdev, NULL); +} + +static int vfio_pci_load_config(VFIODevice *vbasedev, QEMUFile *f) +{ +VFIOPCIDevice *vdev = container_of(vbasedev, VFIOPCIDevice, vbasedev); +PCIDevice *pdev = &vdev->pdev; +int ret; + +ret = vmstate_load_state(f, &vmstate_vfio_pci_config, vdev, 1); +if (ret) { +return ret; +} + +vfio_pci_write_config(pdev, PCI_COMMAND, + pci_get_word(pdev->config + PCI_COMMAND), 2); + +if (msi_enabled(pdev)) { +vfio_msi_enable(vdev); +} else if (msix_enabled(pdev)) { +vfio_msix_enable(vdev); +} + +return ret; +} + static VFIODeviceOps vfio_pci_ops = { .vfio_compute_needs_reset = vfio_pci_compute_needs_reset, .vfio_hot_reset_multi = vfio_pci_hot_reset_multi, .vfio_eoi = vfio_intx_eoi, .vfio_get_object = 
vfio_pci_get_object, +.vfio_save_config = vfio_pci_save_config, +.vfio_load_config = vfio_pci_load_config, }; int vfio_populate_vga(VFIOPCIDevice *vdev, Error **errp) diff --git a/include/hw/vfio/vfio-common.h b/include/hw/vfio/vfio-common.h index fe99c36a693a..ba6169cd926e 100644 --- a/include/hw/vfio/vfio-common.h +++ b/include/hw/vfio/vfio-common.h @@ -120,6 +120,8 @@ struct VFIODeviceOps { int (*vfio_hot_reset_multi)(VFIODevice *vdev); void (*vfio_eoi)(VFIODevice *vdev); Object *(*vfio_get_object)(VFIODevice *vdev); +void (*vfio_save_config)(VFIODevice *vdev, QEMUFile *f); +int (*vfio_load_config)(VFIODevice *vdev, QEMUFile *f); }; typedef struct VFIOGroup { -- 2.7.0
[PATCH v29 00/17] Add migration support for VFIO devices
ation post copy is not supported.

v28 -> v29
- Nit picks.
- Write through PCI_COMMAND register on loading PCI config space as suggested by Yan.

v27 -> v28
- Nit picks and minor changes suggested by Alex.

v26 -> v27
- Major change in Patch 3 - PCI config space save and load using VMSTATE_*
- Major change in Patch 14 - Dirty page tracking when vIOMMU is enabled using IOMMU notifier and its replay functionality - as suggested by Alex.
- Some structure changes to keep all migration related members at one place.
- Pulled fix suggested by Zhi Wang https://www.mail-archive.com/qemu-devel@nongnu.org/msg743722.html
- Add comments wherever suggested and required.

v25 -> v26
- Removed emulated_config_bits cache and vdev->pdev.wmask from config space save and load functions.
- Used VMStateDescription for config space save and load functionality.
- Major fixes from previous version review. https://www.mail-archive.com/qemu-devel@nongnu.org/msg714625.html

v23 -> v25
- Updated config space save and load to save config cache, emulated bits cache and wmask cache.
- Created idr string as suggested by Dr Dave that includes bus path.
- Updated save and load functions to read/write data to mixed regions, mapped or trapped.
- When vIOMMU is enabled, created mapped iova range list which also keeps translated address. This list is used to mark dirty pages. This reduces downtime significantly with vIOMMU enabled compared to migration patches from previous version.
- Removed get_address_limit() function from v23 patch as this is not required now.

v22 -> v23
- Fixed issue reported by Yan https://lore.kernel.org/kvm/97977ede-3c5b-c5a5-7858-7eecd7dd5...@nvidia.com/
- Sending this version to test v23 kernel version patches: https://lore.kernel.org/kvm/1589998088-3250-1-git-send-email-kwankh...@nvidia.com/

v18 -> v22
- Few fixes from v18 review, but not yet addressing all concerns. I'll address those concerns in subsequent iterations.
- Sending this version to test v22 kernel version patches: https://lore.kernel.org/kvm/1589781397-28368-1-git-send-email-kwankh...@nvidia.com/

v16 -> v18
- Nit fixes
- Get migration capability flags from container
- Added VFIO stats to MigrationInfo
- Fixed bug reported by Yan https://lists.gnu.org/archive/html/qemu-devel/2020-04/msg4.html

v9 -> v16
- KABI almost finalised on kernel patches.
- Added support for migration with vIOMMU enabled.

v8 -> v9:
- Split patch set in 2 sets, kernel and QEMU sets.
- Dirty pages bitmap is queried from IOMMU container rather than from vendor driver per device. Added 2 ioctls to achieve this.

v7 -> v8:
- Updated comments for KABI
- Added BAR address validation check during PCI device's config space load as suggested by Dr. David Alan Gilbert.
- Changed vfio_migration_set_state() to set or clear device state flags.
- Some nit fixes.

v6 -> v7:
- Fix build failures.

v5 -> v6:
- Fix build failure.

v4 -> v5:
- Added descriptive comment about the sequence of access of members of structure vfio_device_migration_info to be followed, based on Alex's suggestion.
- Updated get dirty pages sequence.
- As per Cornelia Huck's suggestion, added callbacks to VFIODeviceOps to get_object, save_config and load_config.
- Fixed multiple nit picks.
- Tested live migration with multiple vfio devices assigned to a VM.

v3 -> v4:
- Added one more bit for _RESUMING flag to be set explicitly.
- data_offset field is read-only for user space application.
- data_size is read for every iteration before reading data from migration region; that removed the assumption that data will be till end of migration region.
- If vendor driver supports mappable sparse regions, map those regions during setup state of save/load, and similarly unmap them from cleanup routines.
- Handled race condition that causes data corruption in migration region during save device state, by adding mutex and serializing save_buffer and get_dirty_pages routines.
- Skip calling get_dirty_pages routine for mapped MMIO region of device.
- Added trace events.
- Split into multiple functional patches.

v2 -> v3:
- Removed enum of VFIO device states. Defined VFIO device state with 2 bits.
- Re-structured vfio_device_migration_info to keep it minimal and defined action on read and write access on its members.

v1 -> v2:
- Defined MIGRATION region type and sub-type which should be used with region type capability.
- Re-structured vfio_device_migration_info. This structure will be placed at 0th offset of migration region.
- Replaced ioctl with read/write for trapped part of migration region.
- Added both types of access support, trapped or mmapped, for data section of the region.
- Moved PCI device functions to pci file.
- Added iteration to get dirty page bitmap until bitmap for all requested pages are copied.

Thanks,
Kirti

Kirti Wankhede (17):
  vfio: Add function to unmap VFIO region
  vfio: Add vfio_get_object callback to
[PATCH v29 01/17] vfio: Add function to unmap VFIO region
This function will be used for migration region. Migration region is mmaped when migration starts and will be unmapped when migration is complete. Signed-off-by: Kirti Wankhede Reviewed-by: Neo Jia Reviewed-by: Cornelia Huck --- hw/vfio/common.c | 32 hw/vfio/trace-events | 1 + include/hw/vfio/vfio-common.h | 1 + 3 files changed, 30 insertions(+), 4 deletions(-) diff --git a/hw/vfio/common.c b/hw/vfio/common.c index 13471ae29436..c6e98b8d61be 100644 --- a/hw/vfio/common.c +++ b/hw/vfio/common.c @@ -924,6 +924,18 @@ int vfio_region_setup(Object *obj, VFIODevice *vbasedev, VFIORegion *region, return 0; } +static void vfio_subregion_unmap(VFIORegion *region, int index) +{ +trace_vfio_region_unmap(memory_region_name(®ion->mmaps[index].mem), +region->mmaps[index].offset, +region->mmaps[index].offset + +region->mmaps[index].size - 1); +memory_region_del_subregion(region->mem, ®ion->mmaps[index].mem); +munmap(region->mmaps[index].mmap, region->mmaps[index].size); +object_unparent(OBJECT(®ion->mmaps[index].mem)); +region->mmaps[index].mmap = NULL; +} + int vfio_region_mmap(VFIORegion *region) { int i, prot = 0; @@ -954,10 +966,7 @@ int vfio_region_mmap(VFIORegion *region) region->mmaps[i].mmap = NULL; for (i--; i >= 0; i--) { -memory_region_del_subregion(region->mem, ®ion->mmaps[i].mem); -munmap(region->mmaps[i].mmap, region->mmaps[i].size); -object_unparent(OBJECT(®ion->mmaps[i].mem)); -region->mmaps[i].mmap = NULL; +vfio_subregion_unmap(region, i); } return ret; @@ -982,6 +991,21 @@ int vfio_region_mmap(VFIORegion *region) return 0; } +void vfio_region_unmap(VFIORegion *region) +{ +int i; + +if (!region->mem) { +return; +} + +for (i = 0; i < region->nr_mmaps; i++) { +if (region->mmaps[i].mmap) { +vfio_subregion_unmap(region, i); +} +} +} + void vfio_region_exit(VFIORegion *region) { int i; diff --git a/hw/vfio/trace-events b/hw/vfio/trace-events index 93a0bc2522f8..a0c7b49a2ebc 100644 --- a/hw/vfio/trace-events +++ b/hw/vfio/trace-events @@ -113,6 +113,7 @@ 
vfio_region_mmap(const char *name, unsigned long offset, unsigned long end) "Reg vfio_region_exit(const char *name, int index) "Device %s, region %d" vfio_region_finalize(const char *name, int index) "Device %s, region %d" vfio_region_mmaps_set_enabled(const char *name, bool enabled) "Region %s mmaps enabled: %d" +vfio_region_unmap(const char *name, unsigned long offset, unsigned long end) "Region %s unmap [0x%lx - 0x%lx]" vfio_region_sparse_mmap_header(const char *name, int index, int nr_areas) "Device %s region %d: %d sparse mmap entries" vfio_region_sparse_mmap_entry(int i, unsigned long start, unsigned long end) "sparse entry %d [0x%lx - 0x%lx]" vfio_get_dev_region(const char *name, int index, uint32_t type, uint32_t subtype) "%s index %d, %08x/%0x8" diff --git a/include/hw/vfio/vfio-common.h b/include/hw/vfio/vfio-common.h index c78f3ff5593c..dc95f527b583 100644 --- a/include/hw/vfio/vfio-common.h +++ b/include/hw/vfio/vfio-common.h @@ -171,6 +171,7 @@ int vfio_region_setup(Object *obj, VFIODevice *vbasedev, VFIORegion *region, int index, const char *name); int vfio_region_mmap(VFIORegion *region); void vfio_region_mmaps_set_enabled(VFIORegion *region, bool enabled); +void vfio_region_unmap(VFIORegion *region); void vfio_region_exit(VFIORegion *region); void vfio_region_finalize(VFIORegion *region); void vfio_reset_handler(void *opaque); -- 2.7.0
Re: [PATCH v28 07/17] vfio: Register SaveVMHandlers for VFIO device
On 10/24/2020 4:56 PM, Yan Zhao wrote: On Fri, Oct 23, 2020 at 04:10:33PM +0530, Kirti Wankhede wrote: Define flags to be used as delimiter in migration stream for VFIO devices. Added .save_setup and .save_cleanup functions. Map & unmap migration region from these functions at source during saving or pre-copy phase. Set VFIO device state depending on VM's state. During live migration, VM is running when .save_setup is called, _SAVING | _RUNNING state is set for VFIO device. During save-restore, VM is paused, _SAVING state is set for VFIO device. Signed-off-by: Kirti Wankhede Reviewed-by: Neo Jia --- hw/vfio/migration.c | 102 +++ hw/vfio/trace-events | 2 + 2 files changed, 104 insertions(+) diff --git a/hw/vfio/migration.c b/hw/vfio/migration.c index a0f0e79b9b73..94d2bdae5c54 100644 --- a/hw/vfio/migration.c +++ b/hw/vfio/migration.c @@ -8,12 +8,15 @@ */ #include "qemu/osdep.h" +#include "qemu/main-loop.h" +#include "qemu/cutils.h" #include #include "sysemu/runstate.h" #include "hw/vfio/vfio-common.h" #include "cpu.h" #include "migration/migration.h" +#include "migration/vmstate.h" #include "migration/qemu-file.h" #include "migration/register.h" #include "migration/blocker.h" @@ -25,6 +28,22 @@ #include "trace.h" #include "hw/hw.h" +/* + * Flags to be used as unique delimiters for VFIO devices in the migration + * stream. These flags are composed as: + * 0x => MSB 32-bit all 1s + * 0xef10 => Magic ID, represents emulated (virtual) function IO + * 0x => 16-bits reserved for flags + * + * The beginning of state information is marked by _DEV_CONFIG_STATE, + * _DEV_SETUP_STATE, or _DEV_DATA_STATE, respectively. The end of a + * certain state information is marked by _END_OF_STATE. 
+ */ +#define VFIO_MIG_FLAG_END_OF_STATE (0xef11ULL) +#define VFIO_MIG_FLAG_DEV_CONFIG_STATE (0xef12ULL) +#define VFIO_MIG_FLAG_DEV_SETUP_STATE (0xef13ULL) +#define VFIO_MIG_FLAG_DEV_DATA_STATE(0xef14ULL) + static inline int vfio_mig_access(VFIODevice *vbasedev, void *val, int count, off_t off, bool iswrite) { @@ -129,6 +148,75 @@ static int vfio_migration_set_state(VFIODevice *vbasedev, uint32_t mask, return 0; } +static void vfio_migration_cleanup(VFIODevice *vbasedev) +{ +VFIOMigration *migration = vbasedev->migration; + +if (migration->region.mmaps) { +vfio_region_unmap(&migration->region); +} +} + +/* -- */ + +static int vfio_save_setup(QEMUFile *f, void *opaque) +{ +VFIODevice *vbasedev = opaque; +VFIOMigration *migration = vbasedev->migration; +int ret; + +trace_vfio_save_setup(vbasedev->name); + +qemu_put_be64(f, VFIO_MIG_FLAG_DEV_SETUP_STATE); + +if (migration->region.mmaps) { +/* + * Calling vfio_region_mmap() from migration thread. Memory API called + * from this function require locking the iothread when called from + * outside the main loop thread. + */ +qemu_mutex_lock_iothread(); +ret = vfio_region_mmap(&migration->region); +qemu_mutex_unlock_iothread(); +if (ret) { +error_report("%s: Failed to mmap VFIO migration region: %s", + vbasedev->name, strerror(-ret)); +error_report("%s: Falling back to slow path", vbasedev->name); +} +} + +ret = vfio_migration_set_state(vbasedev, VFIO_DEVICE_STATE_MASK, + VFIO_DEVICE_STATE_SAVING); +if (ret) { +error_report("%s: Failed to set state SAVING", vbasedev->name); +return ret; +} + is it possible to call vfio_update_pending() and vfio_save_buffer() here? so that vendor driver has a chance to hook compatibility checking string early in save_setup stage and can avoid to hook the string in both precopy iteration stage and stop and copy stage. I would says its not about which stage, very first string irrespective of migration stage, it should be version compatibility check. I don't think that needed in setup. 
But I think it's ok if we agree to add this later. Besides that, Reviewed-by: Yan Zhao Thanks. Kirti +qemu_put_be64(f, VFIO_MIG_FLAG_END_OF_STATE); + +ret = qemu_file_get_error(f); +if (ret) { +return ret; +} + +return 0; +} + +static void vfio_save_cleanup(void *opaque) +{ +VFIODevice *vbasedev = opaque; + +vfio_migration_cleanup(vbasedev); +trace_vfio_save_cleanup(vbasedev->name); +} + +static SaveVMHandlers savevm_vfio_handlers = { +.save_setup = vfio_save_s
Re: [PATCH v28 05/17] vfio: Add VM state change handler to know state of VM
On 10/23/2020 5:02 PM, Cornelia Huck wrote: On Fri, 23 Oct 2020 16:10:31 +0530 Kirti Wankhede wrote: VM state change handler is called on change in VM's state. Based on VM state, VFIO device state should be changed. Added read/write helper functions for migration region. Added function to set device_state. Signed-off-by: Kirti Wankhede Reviewed-by: Neo Jia Reviewed-by: Dr. David Alan Gilbert Hm, this version looks a bit different from the one Dave gave his R-b for... does it still apply? I would defer that to Dave. --- hw/vfio/migration.c | 156 ++ hw/vfio/trace-events | 2 + include/hw/vfio/vfio-common.h | 4 ++ 3 files changed, 162 insertions(+) Reviewed-by: Cornelia Huck Thanks. Kirti
Re: [PATCH v28 04/17] vfio: Add migration region initialization and finalize function
On 10/23/2020 4:54 PM, Cornelia Huck wrote: On Fri, 23 Oct 2020 16:10:30 +0530 Kirti Wankhede wrote: Whether the VFIO device supports migration or not is decided based of migration region query. If migration region query is successful and migration region initialization is successful then migration is supported else migration is blocked. Signed-off-by: Kirti Wankhede Reviewed-by: Neo Jia Acked-by: Dr. David Alan Gilbert --- hw/vfio/meson.build | 1 + hw/vfio/migration.c | 133 ++ hw/vfio/trace-events | 3 + include/hw/vfio/vfio-common.h | 9 +++ 4 files changed, 146 insertions(+) create mode 100644 hw/vfio/migration.c (...) +static int vfio_migration_init(VFIODevice *vbasedev, + struct vfio_region_info *info) +{ +int ret; +Object *obj; +VFIOMigration *migration; + +if (!vbasedev->ops->vfio_get_object) { +return -EINVAL; +} + +obj = vbasedev->ops->vfio_get_object(vbasedev); +if (!obj) { +return -EINVAL; +} + +migration = g_new0(VFIOMigration, 1); + +ret = vfio_region_setup(obj, vbasedev, &migration->region, +info->index, "migration"); +if (ret) { +error_report("%s: Failed to setup VFIO migration region %d: %s", + vbasedev->name, info->index, strerror(-ret)); +goto err; +} + +vbasedev->migration = migration; + +if (!migration->region.size) { +error_report("%s: Invalid zero-sized of VFIO migration region %d", s/of // + vbasedev->name, info->index); +ret = -EINVAL; +goto err; +} +return 0; + +err: +vfio_migration_region_exit(vbasedev); +g_free(migration); +vbasedev->migration = NULL; +return ret; +} (...) +void vfio_migration_finalize(VFIODevice *vbasedev) +{ +VFIOMigration *migration = vbasedev->migration; I don't think you need this variable? Removing it. + +if (migration) { +vfio_migration_region_exit(vbasedev); +g_free(vbasedev->migration); +vbasedev->migration = NULL; +} + +if (vbasedev->migration_blocker) { +migrate_del_blocker(vbasedev->migration_blocker); +error_free(vbasedev->migration_blocker); +vbasedev->migration_blocker = NULL; +} +} (...)
Re: [PATCH v28 03/17] vfio: Add save and load functions for VFIO PCI devices
On 10/24/2020 7:46 PM, Alex Williamson wrote: On Sat, 24 Oct 2020 19:53:39 +0800 Yan Zhao wrote: hi when I migrating VFs, the PCI_COMMAND is not properly saved. and the target side would meet below bug root@tester:~# [ 189.360671] ++>> reset starts here: iavf_reset_task !!! [ 199.360798] iavf :00:04.0: Reset never finished (0) [ 199.380504] kernel BUG at drivers/pci/msi.c:352! [ 199.382957] invalid opcode: [#1] SMP PTI [ 199.384855] CPU: 1 PID: 419 Comm: kworker/1:2 Tainted: G OE 5.0.0-13-generic #14-Ubuntu [ 199.388204] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.14.0-0-g155821a1990b-prebuilt.qemu.org 04/01/2014 [ 199.392401] Workqueue: events iavf_reset_task [iavf] [ 199.393586] RIP: 0010:free_msi_irqs+0x17b/0x1b0 [ 199.394659] Code: 84 e1 fe ff ff 45 31 f6 eb 11 41 83 c6 01 44 39 73 14 0f 86 ce fe ff ff 8b 7b 10 44 01 f7 e8 3c 7a ba ff 48 83 78 70 00 74 e0 <0f> 0b 49 8d b5 b0 00 00 00 e8 07 27 bb ff e9 cf fe ff ff 48 8b 78 [ 199.399056] RSP: 0018:abd1006cfdb8 EFLAGS: 00010282 [ 199.400302] RAX: 9e336d8a2800 RBX: 9eb006c0 RCX: [ 199.402000] RDX: RSI: 0019 RDI: baa68100 [ 199.403168] RBP: abd1006cfde8 R08: 9e3375000248 R09: 9e3375000338 [ 199.404343] R10: R11: baa68108 R12: 9e3374ef12c0 [ 199.405526] R13: 9e3374ef1000 R14: R15: 9e3371f2d018 [ 199.406702] FS: () GS:9e3375b0() knlGS: [ 199.408027] CS: 0010 DS: ES: CR0: 80050033 [ 199.408987] CR2: CR3: 33266000 CR4: 06e0 [ 199.410155] DR0: DR1: DR2: [ 199.411321] DR3: DR6: fffe0ff0 DR7: 0400 [ 199.412437] Call Trace: [ 199.412750] pci_disable_msix+0xf3/0x120 [ 199.413227] iavf_reset_interrupt_capability.part.40+0x19/0x40 [iavf] [ 199.413998] iavf_reset_task+0x4b3/0x9d0 [iavf] [ 199.414544] process_one_work+0x20f/0x410 [ 199.415026] worker_thread+0x34/0x400 [ 199.415486] kthread+0x120/0x140 [ 199.415876] ? process_one_work+0x410/0x410 [ 199.416380] ? __kthread_parkme+0x70/0x70 [ 199.416864] ret_from_fork+0x35/0x40 I verified MSIx with SRIOV VF, and I don't see this issue at my end. 
I fixed it with below patch. commit ad3efa0eeea7edb352294bfce35b904b8d3c759c Author: Yan Zhao Date: Sat Oct 24 19:45:01 2020 +0800 msix fix. Signed-off-by: Yan Zhao diff --git a/hw/vfio/pci.c b/hw/vfio/pci.c index f63f15b553..92f71bf933 100644 --- a/hw/vfio/pci.c +++ b/hw/vfio/pci.c @@ -2423,8 +2423,14 @@ const VMStateDescription vmstate_vfio_pci_config = { static void vfio_pci_save_config(VFIODevice *vbasedev, QEMUFile *f) { VFIOPCIDevice *vdev = container_of(vbasedev, VFIOPCIDevice, vbasedev); +PCIDevice *pdev = &vdev->pdev; +uint16_t pci_cmd; + +pci_cmd = pci_default_read_config(pdev, PCI_COMMAND, 2); +qemu_put_be16(f, pci_cmd); vmstate_save_state(f, &vmstate_vfio_pci_config, vdev, NULL); + } static int vfio_pci_load_config(VFIODevice *vbasedev, QEMUFile *f) @@ -2432,6 +2438,10 @@ static int vfio_pci_load_config(VFIODevice *vbasedev, QEMUFile *f) VFIOPCIDevice *vdev = container_of(vbasedev, VFIOPCIDevice, vbasedev); PCIDevice *pdev = &vdev->pdev; int ret; +uint16_t pci_cmd; + +pci_cmd = qemu_get_be16(f); +vfio_pci_write_config(pdev, PCI_COMMAND, pci_cmd, 2); ret = vmstate_load_state(f, &vmstate_vfio_pci_config, vdev, 1); if (ret) { We need to avoid this sort of ad-hoc stuffing random fields into the config stream. The command register is already migrated in vconfig, it only needs to be written through vfio: vfio_pci_write_config(pdev, PCI_COMMAND, pci_get_word(pdev->config, PCI_COMMAND), 2); I verified at my end again. pci command value (using pci_default_read_config()) before vmstate_save_state() is 0x507 and at destination after vmstate_load_state() is also 0x507 - with pci_default_read_config() and the cached config space value using pci_get_word() - both are 0x507. VM restores successfully. Yan, can you share pci command values before and after as above? what exactly is missing? 
Thanks, Kirti Thanks, Alex On Fri, Oct 23, 2020 at 04:10:29PM +0530, Kirti Wankhede wrote: Added functions to save and restore PCI device specific data, specifically config space of PCI device. Signed-off-by: Kirti Wankhede Reviewed-by: Neo Jia --- hw/vfio/pci.c | 48 +++ include/hw/vfio/vfio-common.h | 2 ++ 2 files changed, 50 insertions(+) diff --git a/hw/vfio/pci.c b/hw/vfio/pci.c index bffd5bfe3b78..92cc25a5489f 100644 --- a/hw/vfio/pci.c +++ b/hw/vfio/pci.c @@ -41,6 +41,7 @@ #include &quo
Re: [PATCH v28 00/17] Add migration support for VFIO devices
On 10/24/2020 10:26 PM, Philippe Mathieu-Daudé wrote: Hi Kirti, On 10/23/20 12:40 PM, Kirti Wankhede wrote: Hi, This Patch set adds migration support for VFIO devices in QEMU. ... Since there is no device which has hardware support for system memory dirty bitmap tracking, right now there is no other API from the vendor driver to the VFIO IOMMU module to report dirty pages. In future, when such hardware support is implemented, an API will be required in the kernel such that the vendor driver can report dirty pages to the VFIO module during migration phases. Below is the flow of state change for live migration where states in brackets represent VM state, migration state and VFIO device state as: (VM state, MIGRATION_STATUS, VFIO_DEVICE_STATE)

Live migration save path:

    QEMU normal running state
    (RUNNING, _NONE, _RUNNING)
                |
    migrate_init spawns migration_thread.
    (RUNNING, _SETUP, _RUNNING|_SAVING)
    Migration thread then calls each device's .save_setup()
                |
    (RUNNING, _ACTIVE, _RUNNING|_SAVING)
    If device is active, get pending bytes by .save_live_pending()
    If pending bytes >= threshold_size, call .save_live_iterate()
    Data of VFIO device for pre-copy phase is copied.
    Iterate till total pending bytes converge and are less than threshold
                |
    On migration completion, vCPUs stop and .save_live_complete_precopy is
    called for each active device. The VFIO device is then transitioned
    into the _SAVING state.
    (FINISH_MIGRATE, _DEVICE, _SAVING)
    For the VFIO device, iterate in .save_live_complete_precopy until
    pending data is 0.
    (FINISH_MIGRATE, _DEVICE, _STOPPED)
                |
    (FINISH_MIGRATE, _COMPLETED, _STOPPED)
    Migration thread schedules cleanup bottom half and exits

Live migration resume path:

    Incoming migration calls .load_setup for each device
    (RESTORE_VM, _ACTIVE, _STOPPED)
                |
    For each device, .load_state is called for that device section data
    (RESTORE_VM, _ACTIVE, _RESUMING)
                |
    At the end, .load_cleanup is called for each device and vCPUs are started.
                |
    (RUNNING, _NONE, _RUNNING)

Note that:
- Migration post copy is not supported.

Can you commit this ^^^ somewhere in docs/devel/ please? (as a patch on top of this series)

Philippe, Alex, I'm going to respin this series with r-bs and the fix suggested by Yan. Should this doc be part of this series, or can we add it later after 10/27 in case review of this doc needs more iterations? Thanks, Kirti
Re: [PATCH v28 04/17] vfio: Add migration region initialization and finalize function
On 10/24/2020 7:51 PM, Alex Williamson wrote: On Sat, 24 Oct 2020 15:09:14 +0530 Kirti Wankhede wrote: On 10/23/2020 10:22 PM, Alex Williamson wrote: On Fri, 23 Oct 2020 16:10:30 +0530 Kirti Wankhede wrote: Whether the VFIO device supports migration or not is decided based of migration region query. If migration region query is successful and migration region initialization is successful then migration is supported else migration is blocked. Signed-off-by: Kirti Wankhede Reviewed-by: Neo Jia Acked-by: Dr. David Alan Gilbert --- hw/vfio/meson.build | 1 + hw/vfio/migration.c | 133 ++ hw/vfio/trace-events | 3 + include/hw/vfio/vfio-common.h | 9 +++ 4 files changed, 146 insertions(+) create mode 100644 hw/vfio/migration.c diff --git a/hw/vfio/meson.build b/hw/vfio/meson.build index 37efa74018bc..da9af297a0c5 100644 --- a/hw/vfio/meson.build +++ b/hw/vfio/meson.build @@ -2,6 +2,7 @@ vfio_ss = ss.source_set() vfio_ss.add(files( 'common.c', 'spapr.c', + 'migration.c', )) vfio_ss.add(when: 'CONFIG_VFIO_PCI', if_true: files( 'display.c', diff --git a/hw/vfio/migration.c b/hw/vfio/migration.c new file mode 100644 index ..bbe6e0b7a6cc --- /dev/null +++ b/hw/vfio/migration.c @@ -0,0 +1,133 @@ +/* + * Migration support for VFIO devices + * + * Copyright NVIDIA, Inc. 2020 + * + * This work is licensed under the terms of the GNU GPL, version 2. See + * the COPYING file in the top-level directory. 
+ */ + +#include "qemu/osdep.h" +#include + +#include "hw/vfio/vfio-common.h" +#include "cpu.h" +#include "migration/migration.h" +#include "migration/qemu-file.h" +#include "migration/register.h" +#include "migration/blocker.h" +#include "migration/misc.h" +#include "qapi/error.h" +#include "exec/ramlist.h" +#include "exec/ram_addr.h" +#include "pci.h" +#include "trace.h" + +static void vfio_migration_region_exit(VFIODevice *vbasedev) +{ +VFIOMigration *migration = vbasedev->migration; + +if (!migration) { +return; +} + +vfio_region_exit(&migration->region); +vfio_region_finalize(&migration->region); I think it would make sense to also: g_free(migration); vbasedev->migration = NULL; here as well so the callers don't need to. No, vfio_migration_init() case, err case is also hit when vbasedev->migration is not yet set but local variable migration is not-NULL. So why do we even call vfio_migration_region_exit() for that error case? It seems that could just g_free(migration); return ret; rather than goto err. Thanks, Removing temporary local variable, with that above 2 can be moved to exit function. Thanks, Kirti Alex Not worth a re-spin itself, maybe a follow-up if there's no other reason for a re-spin. 
Thanks, Alex +} + +static int vfio_migration_init(VFIODevice *vbasedev, + struct vfio_region_info *info) +{ +int ret; +Object *obj; +VFIOMigration *migration; + +if (!vbasedev->ops->vfio_get_object) { +return -EINVAL; +} + +obj = vbasedev->ops->vfio_get_object(vbasedev); +if (!obj) { +return -EINVAL; +} + +migration = g_new0(VFIOMigration, 1); + +ret = vfio_region_setup(obj, vbasedev, &migration->region, +info->index, "migration"); +if (ret) { +error_report("%s: Failed to setup VFIO migration region %d: %s", + vbasedev->name, info->index, strerror(-ret)); +goto err; +} + +vbasedev->migration = migration; + +if (!migration->region.size) { +error_report("%s: Invalid zero-sized of VFIO migration region %d", + vbasedev->name, info->index); +ret = -EINVAL; +goto err; +} +return 0; + +err: +vfio_migration_region_exit(vbasedev); +g_free(migration); +vbasedev->migration = NULL; +return ret; +} + +/* -- */ + +int vfio_migration_probe(VFIODevice *vbasedev, Error **errp) +{ +struct vfio_region_info *info = NULL; +Error *local_err = NULL; +int ret; + +ret = vfio_get_dev_region_info(vbasedev, VFIO_REGION_TYPE_MIGRATION, + VFIO_REGION_SUBTYPE_MIGRATION, &info); +if (ret) { +goto add_blocker; +} + +ret = vfio_migration_init(vbasedev, info); +if (ret) { +goto add_blocker; +} + +g_free(info); +trace_vfio_migration_probe(vbasedev->name, info->index); +return 0; + +add_blocker: +error_setg(&vbasedev->migration_blocker, + "VFIO devi
Re: [PATCH v28 04/17] vfio: Add migration region initialization and finalize function
On 10/23/2020 10:22 PM, Alex Williamson wrote: On Fri, 23 Oct 2020 16:10:30 +0530 Kirti Wankhede wrote: Whether the VFIO device supports migration or not is decided based of migration region query. If migration region query is successful and migration region initialization is successful then migration is supported else migration is blocked. Signed-off-by: Kirti Wankhede Reviewed-by: Neo Jia Acked-by: Dr. David Alan Gilbert --- hw/vfio/meson.build | 1 + hw/vfio/migration.c | 133 ++ hw/vfio/trace-events | 3 + include/hw/vfio/vfio-common.h | 9 +++ 4 files changed, 146 insertions(+) create mode 100644 hw/vfio/migration.c diff --git a/hw/vfio/meson.build b/hw/vfio/meson.build index 37efa74018bc..da9af297a0c5 100644 --- a/hw/vfio/meson.build +++ b/hw/vfio/meson.build @@ -2,6 +2,7 @@ vfio_ss = ss.source_set() vfio_ss.add(files( 'common.c', 'spapr.c', + 'migration.c', )) vfio_ss.add(when: 'CONFIG_VFIO_PCI', if_true: files( 'display.c', diff --git a/hw/vfio/migration.c b/hw/vfio/migration.c new file mode 100644 index ..bbe6e0b7a6cc --- /dev/null +++ b/hw/vfio/migration.c @@ -0,0 +1,133 @@ +/* + * Migration support for VFIO devices + * + * Copyright NVIDIA, Inc. 2020 + * + * This work is licensed under the terms of the GNU GPL, version 2. See + * the COPYING file in the top-level directory. 
+ */ + +#include "qemu/osdep.h" +#include + +#include "hw/vfio/vfio-common.h" +#include "cpu.h" +#include "migration/migration.h" +#include "migration/qemu-file.h" +#include "migration/register.h" +#include "migration/blocker.h" +#include "migration/misc.h" +#include "qapi/error.h" +#include "exec/ramlist.h" +#include "exec/ram_addr.h" +#include "pci.h" +#include "trace.h" + +static void vfio_migration_region_exit(VFIODevice *vbasedev) +{ +VFIOMigration *migration = vbasedev->migration; + +if (!migration) { +return; +} + +vfio_region_exit(&migration->region); +vfio_region_finalize(&migration->region); I think it would make sense to also: g_free(migration); vbasedev->migration = NULL; here as well so the callers don't need to. No, vfio_migration_init() case, err case is also hit when vbasedev->migration is not yet set but local variable migration is not-NULL. Thanks, Kirti Not worth a re-spin itself, maybe a follow-up if there's no other reason for a re-spin. Thanks, Alex +} + +static int vfio_migration_init(VFIODevice *vbasedev, + struct vfio_region_info *info) +{ +int ret; +Object *obj; +VFIOMigration *migration; + +if (!vbasedev->ops->vfio_get_object) { +return -EINVAL; +} + +obj = vbasedev->ops->vfio_get_object(vbasedev); +if (!obj) { +return -EINVAL; +} + +migration = g_new0(VFIOMigration, 1); + +ret = vfio_region_setup(obj, vbasedev, &migration->region, +info->index, "migration"); +if (ret) { +error_report("%s: Failed to setup VFIO migration region %d: %s", + vbasedev->name, info->index, strerror(-ret)); +goto err; +} + +vbasedev->migration = migration; + +if (!migration->region.size) { +error_report("%s: Invalid zero-sized of VFIO migration region %d", + vbasedev->name, info->index); +ret = -EINVAL; +goto err; +} +return 0; + +err: +vfio_migration_region_exit(vbasedev); +g_free(migration); +vbasedev->migration = NULL; +return ret; +} + +/* -- */ + +int vfio_migration_probe(VFIODevice *vbasedev, Error **errp) +{ +struct vfio_region_info *info = NULL; +Error 
*local_err = NULL; +int ret; + +ret = vfio_get_dev_region_info(vbasedev, VFIO_REGION_TYPE_MIGRATION, + VFIO_REGION_SUBTYPE_MIGRATION, &info); +if (ret) { +goto add_blocker; +} + +ret = vfio_migration_init(vbasedev, info); +if (ret) { +goto add_blocker; +} + +g_free(info); +trace_vfio_migration_probe(vbasedev->name, info->index); +return 0; + +add_blocker: +error_setg(&vbasedev->migration_blocker, + "VFIO device doesn't support migration"); +g_free(info); + +ret = migrate_add_blocker(vbasedev->migration_blocker, &local_err); +if (local_err) { +error_propagate(errp, local_err); +error_free(vbasedev->migration_blocker); +vbasedev->migration_blocker = NULL; +} +return ret; +} + +void vfio_migration_finaliz
[PATCH v28 17/17] qapi: Add VFIO devices migration stats in Migration stats
Added amount of bytes transferred to the VM at destination by all VFIO devices Signed-off-by: Kirti Wankhede Reviewed-by: Dr. David Alan Gilbert --- hw/vfio/common.c | 19 +++ hw/vfio/migration.c | 9 + include/hw/vfio/vfio-common.h | 3 +++ migration/migration.c | 17 + monitor/hmp-cmds.c| 6 ++ qapi/migration.json | 17 + 6 files changed, 71 insertions(+) diff --git a/hw/vfio/common.c b/hw/vfio/common.c index 49c68a5253ae..56f6fee66a55 100644 --- a/hw/vfio/common.c +++ b/hw/vfio/common.c @@ -292,6 +292,25 @@ const MemoryRegionOps vfio_region_ops = { * Device state interfaces */ +bool vfio_mig_active(void) +{ +VFIOGroup *group; +VFIODevice *vbasedev; + +if (QLIST_EMPTY(&vfio_group_list)) { +return false; +} + +QLIST_FOREACH(group, &vfio_group_list, next) { +QLIST_FOREACH(vbasedev, &group->device_list, next) { +if (vbasedev->migration_blocker) { +return false; +} +} +} +return true; +} + static bool vfio_devices_all_stopped_and_saving(VFIOContainer *container) { VFIOGroup *group; diff --git a/hw/vfio/migration.c b/hw/vfio/migration.c index d4ba24c2dfae..37390b9c05fe 100644 --- a/hw/vfio/migration.c +++ b/hw/vfio/migration.c @@ -45,6 +45,8 @@ #define VFIO_MIG_FLAG_DEV_SETUP_STATE (0xef13ULL) #define VFIO_MIG_FLAG_DEV_DATA_STATE(0xef14ULL) +static int64_t bytes_transferred; + static inline int vfio_mig_access(VFIODevice *vbasedev, void *val, int count, off_t off, bool iswrite) { @@ -255,6 +257,7 @@ static int vfio_save_buffer(QEMUFile *f, VFIODevice *vbasedev, uint64_t *size) *size = data_size; } +bytes_transferred += data_size; return ret; } @@ -785,6 +788,7 @@ static void vfio_migration_state_notifier(Notifier *notifier, void *data) case MIGRATION_STATUS_CANCELLING: case MIGRATION_STATUS_CANCELLED: case MIGRATION_STATUS_FAILED: +bytes_transferred = 0; ret = vfio_migration_set_state(vbasedev, ~(VFIO_DEVICE_STATE_SAVING | VFIO_DEVICE_STATE_RESUMING), VFIO_DEVICE_STATE_RUNNING); @@ -871,6 +875,11 @@ err: /* -- */ +int64_t vfio_mig_bytes_transferred(void) +{ +return 
bytes_transferred; +} + int vfio_migration_probe(VFIODevice *vbasedev, Error **errp) { VFIOContainer *container = vbasedev->group->container; diff --git a/include/hw/vfio/vfio-common.h b/include/hw/vfio/vfio-common.h index b1c1b18fd228..24e299d97425 100644 --- a/include/hw/vfio/vfio-common.h +++ b/include/hw/vfio/vfio-common.h @@ -203,6 +203,9 @@ extern const MemoryRegionOps vfio_region_ops; typedef QLIST_HEAD(VFIOGroupList, VFIOGroup) VFIOGroupList; extern VFIOGroupList vfio_group_list; +bool vfio_mig_active(void); +int64_t vfio_mig_bytes_transferred(void); + #ifdef CONFIG_LINUX int vfio_get_region_info(VFIODevice *vbasedev, int index, struct vfio_region_info **info); diff --git a/migration/migration.c b/migration/migration.c index 0575ecb37953..995ccd96a774 100644 --- a/migration/migration.c +++ b/migration/migration.c @@ -57,6 +57,10 @@ #include "qemu/queue.h" #include "multifd.h" +#ifdef CONFIG_VFIO +#include "hw/vfio/vfio-common.h" +#endif + #define MAX_THROTTLE (128 << 20) /* Migration transfer speed throttling */ /* Amount of time to allocate to each "chunk" of bandwidth-throttled @@ -1002,6 +1006,17 @@ static void populate_disk_info(MigrationInfo *info) } } +static void populate_vfio_info(MigrationInfo *info) +{ +#ifdef CONFIG_VFIO +if (vfio_mig_active()) { +info->has_vfio = true; +info->vfio = g_malloc0(sizeof(*info->vfio)); +info->vfio->transferred = vfio_mig_bytes_transferred(); +} +#endif +} + static void fill_source_migration_info(MigrationInfo *info) { MigrationState *s = migrate_get_current(); @@ -1026,6 +1041,7 @@ static void fill_source_migration_info(MigrationInfo *info) populate_time_info(info, s); populate_ram_info(info, s); populate_disk_info(info); +populate_vfio_info(info); break; case MIGRATION_STATUS_COLO: info->has_status = true; @@ -1034,6 +1050,7 @@ static void fill_source_migration_info(MigrationInfo *info) case MIGRATION_STATUS_COMPLETED: populate_time_info(info, s); populate_ram_info(info, s); +populate_vfio_info(info); break; case 
MIGRATION_STATUS_FAILED: info->has_status = true; diff --git a/monitor/hmp-cmds.c b/monitor/hmp-cmds.c index 9789f4277f50..56e9bad33d94 100644 --- a/monitor/hmp-cmds.c +++ b/monitor/hmp-cmds.c @@ -357,6 +357,12 @@
[PATCH v28 12/17] vfio: Add function to start and stop dirty pages tracking
Call VFIO_IOMMU_DIRTY_PAGES ioctl to start and stop dirty pages tracking for VFIO devices. Signed-off-by: Kirti Wankhede Reviewed-by: Dr. David Alan Gilbert --- hw/vfio/migration.c | 36 1 file changed, 36 insertions(+) diff --git a/hw/vfio/migration.c b/hw/vfio/migration.c index 0dc40e34a4de..d4ba24c2dfae 100644 --- a/hw/vfio/migration.c +++ b/hw/vfio/migration.c @@ -11,6 +11,7 @@ #include "qemu/main-loop.h" #include "qemu/cutils.h" #include +#include #include "sysemu/runstate.h" #include "hw/vfio/vfio-common.h" @@ -391,10 +392,40 @@ static int vfio_load_device_config_state(QEMUFile *f, void *opaque) return qemu_file_get_error(f); } +static int vfio_set_dirty_page_tracking(VFIODevice *vbasedev, bool start) +{ +int ret; +VFIOMigration *migration = vbasedev->migration; +VFIOContainer *container = vbasedev->group->container; +struct vfio_iommu_type1_dirty_bitmap dirty = { +.argsz = sizeof(dirty), +}; + +if (start) { +if (migration->device_state & VFIO_DEVICE_STATE_SAVING) { +dirty.flags = VFIO_IOMMU_DIRTY_PAGES_FLAG_START; +} else { +return -EINVAL; +} +} else { +dirty.flags = VFIO_IOMMU_DIRTY_PAGES_FLAG_STOP; +} + +ret = ioctl(container->fd, VFIO_IOMMU_DIRTY_PAGES, &dirty); +if (ret) { +error_report("Failed to set dirty tracking flag 0x%x errno: %d", + dirty.flags, errno); +return -errno; +} +return ret; +} + static void vfio_migration_cleanup(VFIODevice *vbasedev) { VFIOMigration *migration = vbasedev->migration; +vfio_set_dirty_page_tracking(vbasedev, false); + if (migration->region.mmaps) { vfio_region_unmap(&migration->region); } @@ -435,6 +466,11 @@ static int vfio_save_setup(QEMUFile *f, void *opaque) return ret; } +ret = vfio_set_dirty_page_tracking(vbasedev, true); +if (ret) { +return ret; +} + qemu_put_be64(f, VFIO_MIG_FLAG_END_OF_STATE); ret = qemu_file_get_error(f); -- 2.7.0
[PATCH v28 16/17] vfio: Make vfio-pci device migration capable
If the device is not a failover primary device, call vfio_migration_probe() to enable migration support for those devices that support it, and vfio_migration_finalize() to tear it down again. Removed the migration blocker from the VFIO PCI device specific structure and use the migration blocker from the generic VFIO device structure. Signed-off-by: Kirti Wankhede Reviewed-by: Neo Jia Reviewed-by: Dr. David Alan Gilbert Reviewed-by: Cornelia Huck --- hw/vfio/pci.c | 28 hw/vfio/pci.h | 1 - 2 files changed, 8 insertions(+), 21 deletions(-) diff --git a/hw/vfio/pci.c b/hw/vfio/pci.c index 92cc25a5489f..d2a2b5756774 100644 --- a/hw/vfio/pci.c +++ b/hw/vfio/pci.c @@ -2788,17 +2788,6 @@ static void vfio_realize(PCIDevice *pdev, Error **errp) return; } -if (!pdev->failover_pair_id) { -error_setg(&vdev->migration_blocker, -"VFIO device doesn't support migration"); -ret = migrate_add_blocker(vdev->migration_blocker, errp); -if (ret) { -error_free(vdev->migration_blocker); -vdev->migration_blocker = NULL; -return; -} -} - vdev->vbasedev.name = g_path_get_basename(vdev->vbasedev.sysfsdev); vdev->vbasedev.ops = &vfio_pci_ops; vdev->vbasedev.type = VFIO_DEVICE_TYPE_PCI; @@ -3066,6 +3055,13 @@ static void vfio_realize(PCIDevice *pdev, Error **errp) } } +if (!pdev->failover_pair_id) { +ret = vfio_migration_probe(&vdev->vbasedev, errp); +if (ret) { +error_report("%s: Migration disabled", vdev->vbasedev.name); +} +} + vfio_register_err_notifier(vdev); vfio_register_req_notifier(vdev); vfio_setup_resetfn_quirk(vdev); @@ -3080,11 +3076,6 @@ out_teardown: vfio_bars_exit(vdev); error: error_prepend(errp, VFIO_MSG_PREFIX, vdev->vbasedev.name); -if (vdev->migration_blocker) { -migrate_del_blocker(vdev->migration_blocker); -error_free(vdev->migration_blocker); -vdev->migration_blocker = NULL; -} } static void vfio_instance_finalize(Object *obj) @@ -3096,10 +3087,6 @@ static void vfio_instance_finalize(Object *obj) vfio_bars_finalize(vdev); g_free(vdev->emulated_config_bits);
g_free(vdev->rom); -if (vdev->migration_blocker) { -migrate_del_blocker(vdev->migration_blocker); -error_free(vdev->migration_blocker); -} /* * XXX Leaking igd_opregion is not an oversight, we can't remove the * fw_cfg entry therefore leaking this allocation seems like the safest @@ -3127,6 +3114,7 @@ static void vfio_exitfn(PCIDevice *pdev) } vfio_teardown_msi(vdev); vfio_bars_exit(vdev); +vfio_migration_finalize(&vdev->vbasedev); } static void vfio_pci_reset(DeviceState *dev) diff --git a/hw/vfio/pci.h b/hw/vfio/pci.h index bce71a9ac93f..1574ef983f8f 100644 --- a/hw/vfio/pci.h +++ b/hw/vfio/pci.h @@ -172,7 +172,6 @@ struct VFIOPCIDevice { bool no_vfio_ioeventfd; bool enable_ramfb; VFIODisplay *dpy; -Error *migration_blocker; Notifier irqchip_change_notifier; }; -- 2.7.0
[PATCH v28 15/17] vfio: Add ioctl to get dirty pages bitmap during dma unmap
With vIOMMU, an IO virtual address range can get unmapped while in the pre-copy phase of migration. In that case, the unmap ioctl should return the pages pinned in that range and QEMU should find their corresponding guest physical addresses and report those dirty. Suggested-by: Alex Williamson Signed-off-by: Kirti Wankhede Reviewed-by: Neo Jia --- hw/vfio/common.c | 96 +--- 1 file changed, 92 insertions(+), 4 deletions(-) diff --git a/hw/vfio/common.c b/hw/vfio/common.c index c0b5b6245a47..49c68a5253ae 100644 --- a/hw/vfio/common.c +++ b/hw/vfio/common.c @@ -321,11 +321,94 @@ static bool vfio_devices_all_stopped_and_saving(VFIOContainer *container) return true; } +static bool vfio_devices_all_running_and_saving(VFIOContainer *container) +{ +VFIOGroup *group; +VFIODevice *vbasedev; +MigrationState *ms = migrate_get_current(); + +if (!migration_is_setup_or_active(ms->state)) { +return false; +} + +QLIST_FOREACH(group, &container->group_list, container_next) { +QLIST_FOREACH(vbasedev, &group->device_list, next) { +VFIOMigration *migration = vbasedev->migration; + +if (!migration) { +return false; +} + +if ((migration->device_state & VFIO_DEVICE_STATE_SAVING) && +(migration->device_state & VFIO_DEVICE_STATE_RUNNING)) { +continue; +} else { +return false; +} +} +} +return true; +} + +static int vfio_dma_unmap_bitmap(VFIOContainer *container, + hwaddr iova, ram_addr_t size, + IOMMUTLBEntry *iotlb) +{ +struct vfio_iommu_type1_dma_unmap *unmap; +struct vfio_bitmap *bitmap; +uint64_t pages = TARGET_PAGE_ALIGN(size) >> TARGET_PAGE_BITS; +int ret; + +unmap = g_malloc0(sizeof(*unmap) + sizeof(*bitmap)); + +unmap->argsz = sizeof(*unmap) + sizeof(*bitmap); +unmap->iova = iova; +unmap->size = size; +unmap->flags |= VFIO_DMA_UNMAP_FLAG_GET_DIRTY_BITMAP; +bitmap = (struct vfio_bitmap *)&unmap->data; + +/* + * cpu_physical_memory_set_dirty_lebitmap() expects pages in bitmap of + * TARGET_PAGE_SIZE to mark those dirty. Hence set bitmap_pgsize to + * TARGET_PAGE_SIZE. 
+ */ + +bitmap->pgsize = TARGET_PAGE_SIZE; +bitmap->size = ROUND_UP(pages, sizeof(__u64) * BITS_PER_BYTE) / + BITS_PER_BYTE; + +if (bitmap->size > container->max_dirty_bitmap_size) { +error_report("UNMAP: Size of bitmap too big 0x%llx", bitmap->size); +ret = -E2BIG; +goto unmap_exit; +} + +bitmap->data = g_try_malloc0(bitmap->size); +if (!bitmap->data) { +ret = -ENOMEM; +goto unmap_exit; +} + +ret = ioctl(container->fd, VFIO_IOMMU_UNMAP_DMA, unmap); +if (!ret) { +cpu_physical_memory_set_dirty_lebitmap((uint64_t *)bitmap->data, +iotlb->translated_addr, pages); +} else { +error_report("VFIO_UNMAP_DMA with DIRTY_BITMAP : %m"); +} + +g_free(bitmap->data); +unmap_exit: +g_free(unmap); +return ret; +} + /* * DMA - Mapping and unmapping for the "type1" IOMMU interface used on x86 */ static int vfio_dma_unmap(VFIOContainer *container, - hwaddr iova, ram_addr_t size) + hwaddr iova, ram_addr_t size, + IOMMUTLBEntry *iotlb) { struct vfio_iommu_type1_dma_unmap unmap = { .argsz = sizeof(unmap), @@ -334,6 +417,11 @@ static int vfio_dma_unmap(VFIOContainer *container, .size = size, }; +if (iotlb && container->dirty_pages_supported && +vfio_devices_all_running_and_saving(container)) { +return vfio_dma_unmap_bitmap(container, iova, size, iotlb); +} + while (ioctl(container->fd, VFIO_IOMMU_UNMAP_DMA, &unmap)) { /* * The type1 backend has an off-by-one bug in the kernel (71a7d3d78e3c @@ -381,7 +469,7 @@ static int vfio_dma_map(VFIOContainer *container, hwaddr iova, * the VGA ROM space. */ if (ioctl(container->fd, VFIO_IOMMU_MAP_DMA, &map) == 0 || -(errno == EBUSY && vfio_dma_unmap(container, iova, size) == 0 && +(errno == EBUSY && vfio_dma_unmap(container, iova, size, NULL) == 0 && ioctl(container->fd, VFIO_IOMMU_MAP_DMA, &map) == 0)) { return 0; } @@ -531,7 +619,7 @@ static void vfio_iommu_map_notify(IOMMUNotifier *n, IOMMUTLBEntry *iotlb) iotlb->addr_mask + 1, vaddr, ret); } } else { -ret = vfio_dma_unmap(container, iova, iotlb->addr_mask + 1); +ret = vfio_dma_unmap(contain
[PATCH v28 11/17] vfio: Get migration capability flags for container
Added helper functions to get IOMMU info capability chain. Added function to get migration capability information from that capability chain for IOMMU container. Similar change was proposed earlier: https://lists.gnu.org/archive/html/qemu-devel/2018-05/msg03759.html Disable migration for devices if IOMMU module doesn't support migration capability. Signed-off-by: Kirti Wankhede Cc: Shameer Kolothum Cc: Eric Auger --- hw/vfio/common.c | 90 +++ hw/vfio/migration.c | 7 +++- include/hw/vfio/vfio-common.h | 3 ++ 3 files changed, 91 insertions(+), 9 deletions(-) diff --git a/hw/vfio/common.c b/hw/vfio/common.c index c6e98b8d61be..d4959c036dd1 100644 --- a/hw/vfio/common.c +++ b/hw/vfio/common.c @@ -1228,6 +1228,75 @@ static int vfio_init_container(VFIOContainer *container, int group_fd, return 0; } +static int vfio_get_iommu_info(VFIOContainer *container, + struct vfio_iommu_type1_info **info) +{ + +size_t argsz = sizeof(struct vfio_iommu_type1_info); + +*info = g_new0(struct vfio_iommu_type1_info, 1); +again: +(*info)->argsz = argsz; + +if (ioctl(container->fd, VFIO_IOMMU_GET_INFO, *info)) { +g_free(*info); +*info = NULL; +return -errno; +} + +if (((*info)->argsz > argsz)) { +argsz = (*info)->argsz; +*info = g_realloc(*info, argsz); +goto again; +} + +return 0; +} + +static struct vfio_info_cap_header * +vfio_get_iommu_info_cap(struct vfio_iommu_type1_info *info, uint16_t id) +{ +struct vfio_info_cap_header *hdr; +void *ptr = info; + +if (!(info->flags & VFIO_IOMMU_INFO_CAPS)) { +return NULL; +} + +for (hdr = ptr + info->cap_offset; hdr != ptr; hdr = ptr + hdr->next) { +if (hdr->id == id) { +return hdr; +} +} + +return NULL; +} + +static void vfio_get_iommu_info_migration(VFIOContainer *container, + struct vfio_iommu_type1_info *info) +{ +struct vfio_info_cap_header *hdr; +struct vfio_iommu_type1_info_cap_migration *cap_mig; + +hdr = vfio_get_iommu_info_cap(info, VFIO_IOMMU_TYPE1_INFO_CAP_MIGRATION); +if (!hdr) { +return; +} + +cap_mig = container_of(hdr, struct 
vfio_iommu_type1_info_cap_migration, +header); + +/* + * cpu_physical_memory_set_dirty_lebitmap() expects pages in bitmap of + * TARGET_PAGE_SIZE to mark those dirty. + */ +if (cap_mig->pgsize_bitmap & TARGET_PAGE_SIZE) { +container->dirty_pages_supported = true; +container->max_dirty_bitmap_size = cap_mig->max_dirty_bitmap_size; +container->dirty_pgsizes = cap_mig->pgsize_bitmap; +} +} + static int vfio_connect_container(VFIOGroup *group, AddressSpace *as, Error **errp) { @@ -1297,6 +1366,7 @@ static int vfio_connect_container(VFIOGroup *group, AddressSpace *as, container->space = space; container->fd = fd; container->error = NULL; +container->dirty_pages_supported = false; QLIST_INIT(&container->giommu_list); QLIST_INIT(&container->hostwin_list); @@ -1309,7 +1379,7 @@ static int vfio_connect_container(VFIOGroup *group, AddressSpace *as, case VFIO_TYPE1v2_IOMMU: case VFIO_TYPE1_IOMMU: { -struct vfio_iommu_type1_info info; +struct vfio_iommu_type1_info *info; /* * FIXME: This assumes that a Type1 IOMMU can map any 64-bit @@ -1318,15 +1388,19 @@ static int vfio_connect_container(VFIOGroup *group, AddressSpace *as, * existing Type1 IOMMUs generally support any IOVA we're * going to actually try in practice. 
*/ -info.argsz = sizeof(info); -ret = ioctl(fd, VFIO_IOMMU_GET_INFO, &info); -/* Ignore errors */ -if (ret || !(info.flags & VFIO_IOMMU_INFO_PGSIZES)) { +ret = vfio_get_iommu_info(container, &info); + +if (ret || !(info->flags & VFIO_IOMMU_INFO_PGSIZES)) { /* Assume 4k IOVA page size */ -info.iova_pgsizes = 4096; +info->iova_pgsizes = 4096; } -vfio_host_win_add(container, 0, (hwaddr)-1, info.iova_pgsizes); -container->pgsizes = info.iova_pgsizes; +vfio_host_win_add(container, 0, (hwaddr)-1, info->iova_pgsizes); +container->pgsizes = info->iova_pgsizes; + +if (!ret) { +vfio_get_iommu_info_migration(container, info); +} +g_free(info); break; } case VFIO_SPAPR_TCE_v2_IOMMU: diff --git a/hw/vfio/migration.c b/hw/vfio/migration.c index 240646592b39..0dc40e34a4de 100644 --- a/hw/vfio/migration.c +++ b/hw/vfio/migration.c @@ -837,9 +837,14 @@ err: int vfio_migration_probe(VFIODevice *vbasedev, Error *
[PATCH v28 14/17] vfio: Dirty page tracking when vIOMMU is enabled
When vIOMMU is enabled, add MAP notifier from log_sync when all devices in container are in stop and copy phase of migration. Call replay and then from notifier callback, get dirty pages. Suggested-by: Alex Williamson Signed-off-by: Kirti Wankhede --- hw/vfio/common.c | 88 hw/vfio/trace-events | 1 + 2 files changed, 83 insertions(+), 6 deletions(-) diff --git a/hw/vfio/common.c b/hw/vfio/common.c index 2634387df948..c0b5b6245a47 100644 --- a/hw/vfio/common.c +++ b/hw/vfio/common.c @@ -442,8 +442,8 @@ static bool vfio_listener_skipped_section(MemoryRegionSection *section) } /* Called with rcu_read_lock held. */ -static bool vfio_get_vaddr(IOMMUTLBEntry *iotlb, void **vaddr, - bool *read_only) +static bool vfio_get_xlat_addr(IOMMUTLBEntry *iotlb, void **vaddr, + ram_addr_t *ram_addr, bool *read_only) { MemoryRegion *mr; hwaddr xlat; @@ -474,8 +474,17 @@ static bool vfio_get_vaddr(IOMMUTLBEntry *iotlb, void **vaddr, return false; } -*vaddr = memory_region_get_ram_ptr(mr) + xlat; -*read_only = !writable || mr->readonly; +if (vaddr) { +*vaddr = memory_region_get_ram_ptr(mr) + xlat; +} + +if (ram_addr) { +*ram_addr = memory_region_get_ram_addr(mr) + xlat; +} + +if (read_only) { +*read_only = !writable || mr->readonly; +} return true; } @@ -485,7 +494,6 @@ static void vfio_iommu_map_notify(IOMMUNotifier *n, IOMMUTLBEntry *iotlb) VFIOGuestIOMMU *giommu = container_of(n, VFIOGuestIOMMU, n); VFIOContainer *container = giommu->container; hwaddr iova = iotlb->iova + giommu->iommu_offset; -bool read_only; void *vaddr; int ret; @@ -501,7 +509,9 @@ static void vfio_iommu_map_notify(IOMMUNotifier *n, IOMMUTLBEntry *iotlb) rcu_read_lock(); if ((iotlb->perm & IOMMU_RW) != IOMMU_NONE) { -if (!vfio_get_vaddr(iotlb, &vaddr, &read_only)) { +bool read_only; + +if (!vfio_get_xlat_addr(iotlb, &vaddr, NULL, &read_only)) { goto out; } /* @@ -899,11 +909,77 @@ err_out: return ret; } +typedef struct { +IOMMUNotifier n; +VFIOGuestIOMMU *giommu; +} vfio_giommu_dirty_notifier; + +static void 
vfio_iommu_map_dirty_notify(IOMMUNotifier *n, IOMMUTLBEntry *iotlb) +{ +vfio_giommu_dirty_notifier *gdn = container_of(n, +vfio_giommu_dirty_notifier, n); +VFIOGuestIOMMU *giommu = gdn->giommu; +VFIOContainer *container = giommu->container; +hwaddr iova = iotlb->iova + giommu->iommu_offset; +ram_addr_t translated_addr; + +trace_vfio_iommu_map_dirty_notify(iova, iova + iotlb->addr_mask); + +if (iotlb->target_as != &address_space_memory) { +error_report("Wrong target AS \"%s\", only system memory is allowed", + iotlb->target_as->name ? iotlb->target_as->name : "none"); +return; +} + +rcu_read_lock(); +if (vfio_get_xlat_addr(iotlb, NULL, &translated_addr, NULL)) { +int ret; + +ret = vfio_get_dirty_bitmap(container, iova, iotlb->addr_mask + 1, +translated_addr); +if (ret) { +error_report("vfio_iommu_map_dirty_notify(%p, 0x%"HWADDR_PRIx", " + "0x%"HWADDR_PRIx") = %d (%m)", + container, iova, + iotlb->addr_mask + 1, ret); +} +} +rcu_read_unlock(); +} + static int vfio_sync_dirty_bitmap(VFIOContainer *container, MemoryRegionSection *section) { ram_addr_t ram_addr; +if (memory_region_is_iommu(section->mr)) { +VFIOGuestIOMMU *giommu; + +QLIST_FOREACH(giommu, &container->giommu_list, giommu_next) { +if (MEMORY_REGION(giommu->iommu) == section->mr && +giommu->n.start == section->offset_within_region) { +Int128 llend; +vfio_giommu_dirty_notifier gdn = { .giommu = giommu }; +int idx = memory_region_iommu_attrs_to_index(giommu->iommu, + MEMTXATTRS_UNSPECIFIED); + +llend = int128_add(int128_make64(section->offset_within_region), + section->size); +llend = int128_sub(llend, int128_one()); + +iommu_notifier_init(&gdn.n, +vfio_iommu_map_dirty_notify, +IOMMU_NOTIFIER_MAP, +section->offset_within_region, +int128_get64(llend), +
[PATCH v28 08/17] vfio: Add save state functions to SaveVMHandlers
Added .save_live_pending, .save_live_iterate and .save_live_complete_precopy functions. These functions handle the pre-copy and stop-and-copy phases.

In the _SAVING|_RUNNING device state, i.e. the pre-copy phase:
- read pending_bytes. If pending_bytes > 0, go through the steps below.
- read data_offset - this indicates to the kernel driver that it should write data to the staging buffer.
- read data_size - the amount of data in bytes written by the vendor driver in the migration region.
- read data_size bytes of data from data_offset in the migration region.
- write the data packet to the file stream as below:
  {VFIO_MIG_FLAG_DEV_DATA_STATE, data_size, actual data, VFIO_MIG_FLAG_END_OF_STATE}

In the _SAVING device state, i.e. the stop-and-copy phase:
a. read the config space of the device and save it to the migration file stream. This doesn't need to come from the vendor driver. Any other special config state from the driver can be saved as data in a following iteration.
b. read pending_bytes. If pending_bytes > 0, go through the steps below.
c. read data_offset - this indicates to the kernel driver that it should write data to the staging buffer.
d. read data_size - the amount of data in bytes written by the vendor driver in the migration region.
e. read data_size bytes of data from data_offset in the migration region.
f. write the data packet as below: {VFIO_MIG_FLAG_DEV_DATA_STATE, data_size, actual data}
g. iterate through steps b to f while (pending_bytes > 0)
h. write {VFIO_MIG_FLAG_END_OF_STATE}

When the data region is mapped, it is the user's responsibility to read data_size bytes of data from data_offset before moving to the next step.

Added a fix suggested by Artem Polyakov to reset pending_bytes in vfio_save_iterate(). Added a fix suggested by Zhi Wang to write 0 as the data size in the migration stream and add the END_OF_STATE delimiter to indicate that the phase is complete.
Suggested-by: Artem Polyakov Suggested-by: Zhi Wang Signed-off-by: Kirti Wankhede Reviewed-by: Neo Jia --- hw/vfio/migration.c | 276 ++ hw/vfio/trace-events | 6 + include/hw/vfio/vfio-common.h | 1 + 3 files changed, 283 insertions(+) diff --git a/hw/vfio/migration.c b/hw/vfio/migration.c index 94d2bdae5c54..be9e4aba541d 100644 --- a/hw/vfio/migration.c +++ b/hw/vfio/migration.c @@ -148,6 +148,151 @@ static int vfio_migration_set_state(VFIODevice *vbasedev, uint32_t mask, return 0; } +static void *get_data_section_size(VFIORegion *region, uint64_t data_offset, + uint64_t data_size, uint64_t *size) +{ +void *ptr = NULL; +uint64_t limit = 0; +int i; + +if (!region->mmaps) { +if (size) { +*size = MIN(data_size, region->size - data_offset); +} +return ptr; +} + +for (i = 0; i < region->nr_mmaps; i++) { +VFIOMmap *map = region->mmaps + i; + +if ((data_offset >= map->offset) && +(data_offset < map->offset + map->size)) { + +/* check if data_offset is within sparse mmap areas */ +ptr = map->mmap + data_offset - map->offset; +if (size) { +*size = MIN(data_size, map->offset + map->size - data_offset); +} +break; +} else if ((data_offset < map->offset) && + (!limit || limit > map->offset)) { +/* + * data_offset is not within sparse mmap areas, find size of + * non-mapped area. Check through all list since region->mmaps list + * is not sorted. + */ +limit = map->offset; +} +} + +if (!ptr && size) { +*size = limit ? 
MIN(data_size, limit - data_offset) : data_size; +} +return ptr; +} + +static int vfio_save_buffer(QEMUFile *f, VFIODevice *vbasedev, uint64_t *size) +{ +VFIOMigration *migration = vbasedev->migration; +VFIORegion *region = &migration->region; +uint64_t data_offset = 0, data_size = 0, sz; +int ret; + +ret = vfio_mig_read(vbasedev, &data_offset, sizeof(data_offset), + region->fd_offset + VFIO_MIG_STRUCT_OFFSET(data_offset)); +if (ret < 0) { +return ret; +} + +ret = vfio_mig_read(vbasedev, &data_size, sizeof(data_size), +region->fd_offset + VFIO_MIG_STRUCT_OFFSET(data_size)); +if (ret < 0) { +return ret; +} + +trace_vfio_save_buffer(vbasedev->name, data_offset, data_size, + migration->pending_bytes); + +qemu_put_be64(f, data_size); +sz = data_size; + +while (sz) { +void *buf; +uint64_t sec_size; +bool buf_allocated = false; + +buf = get_data_section_size(region, data_offset, sz, &sec_size); + +if (!buf) { +buf = g_try_malloc(sec_size); +if (!buf) { +error_report("%s: Error allocating buffer ",
[PATCH v28 06/17] vfio: Add migration state change notifier
Added migration state change notifier to get notification on migration state change. These states are translated to VFIO device state and conveyed to vendor driver. Signed-off-by: Kirti Wankhede Reviewed-by: Neo Jia Reviewed-by: Dr. David Alan Gilbert --- hw/vfio/migration.c | 28 hw/vfio/trace-events | 1 + include/hw/vfio/vfio-common.h | 2 ++ 3 files changed, 31 insertions(+) diff --git a/hw/vfio/migration.c b/hw/vfio/migration.c index 9b6949439f8e..a0f0e79b9b73 100644 --- a/hw/vfio/migration.c +++ b/hw/vfio/migration.c @@ -175,6 +175,30 @@ static void vfio_vmstate_change(void *opaque, int running, RunState state) (migration->device_state & mask) | value); } +static void vfio_migration_state_notifier(Notifier *notifier, void *data) +{ +MigrationState *s = data; +VFIOMigration *migration = container_of(notifier, VFIOMigration, +migration_state); +VFIODevice *vbasedev = migration->vbasedev; +int ret; + +trace_vfio_migration_state_notifier(vbasedev->name, +MigrationStatus_str(s->state)); + +switch (s->state) { +case MIGRATION_STATUS_CANCELLING: +case MIGRATION_STATUS_CANCELLED: +case MIGRATION_STATUS_FAILED: +ret = vfio_migration_set_state(vbasedev, + ~(VFIO_DEVICE_STATE_SAVING | VFIO_DEVICE_STATE_RESUMING), + VFIO_DEVICE_STATE_RUNNING); +if (ret) { +error_report("%s: Failed to set state RUNNING", vbasedev->name); +} +} +} + static void vfio_migration_region_exit(VFIODevice *vbasedev) { VFIOMigration *migration = vbasedev->migration; @@ -222,8 +246,11 @@ static int vfio_migration_init(VFIODevice *vbasedev, goto err; } +migration->vbasedev = vbasedev; migration->vm_state = qemu_add_vm_change_state_handler(vfio_vmstate_change, vbasedev); +migration->migration_state.notify = vfio_migration_state_notifier; +add_migration_state_change_notifier(&migration->migration_state); return 0; err: @@ -275,6 +302,7 @@ void vfio_migration_finalize(VFIODevice *vbasedev) VFIOMigration *migration = vbasedev->migration; if (migration) { 
+remove_migration_state_change_notifier(&migration->migration_state); qemu_del_vm_change_state_handler(migration->vm_state); vfio_migration_region_exit(vbasedev); g_free(vbasedev->migration); diff --git a/hw/vfio/trace-events b/hw/vfio/trace-events index 41de81f12f60..78d7d83b5ef8 100644 --- a/hw/vfio/trace-events +++ b/hw/vfio/trace-events @@ -150,3 +150,4 @@ vfio_display_edid_write_error(void) "" vfio_migration_probe(const char *name, uint32_t index) " (%s) Region %d" vfio_migration_set_state(const char *name, uint32_t state) " (%s) state %d" vfio_vmstate_change(const char *name, int running, const char *reason, uint32_t dev_state) " (%s) running %d reason %s device state %d" +vfio_migration_state_notifier(const char *name, const char *state) " (%s) state %s" diff --git a/include/hw/vfio/vfio-common.h b/include/hw/vfio/vfio-common.h index 9a571f1fb552..2bd593ba38bb 100644 --- a/include/hw/vfio/vfio-common.h +++ b/include/hw/vfio/vfio-common.h @@ -59,10 +59,12 @@ typedef struct VFIORegion { } VFIORegion; typedef struct VFIOMigration { +struct VFIODevice *vbasedev; VMChangeStateEntry *vm_state; VFIORegion region; uint32_t device_state; int vm_running; +Notifier migration_state; } VFIOMigration; typedef struct VFIOAddressSpace { -- 2.7.0
[PATCH v28 07/17] vfio: Register SaveVMHandlers for VFIO device
Define flags to be used as delimiters in the migration stream for VFIO devices. Added .save_setup and .save_cleanup functions. Map and unmap the migration region from these functions at the source during the saving or pre-copy phase. Set the VFIO device state depending on the VM's state: during live migration the VM is running when .save_setup is called, so the _SAVING | _RUNNING state is set for the VFIO device; during save-restore the VM is paused, so the _SAVING state is set. Signed-off-by: Kirti Wankhede Reviewed-by: Neo Jia --- hw/vfio/migration.c | 102 +++ hw/vfio/trace-events | 2 + 2 files changed, 104 insertions(+) diff --git a/hw/vfio/migration.c b/hw/vfio/migration.c index a0f0e79b9b73..94d2bdae5c54 100644 --- a/hw/vfio/migration.c +++ b/hw/vfio/migration.c @@ -8,12 +8,15 @@ */ #include "qemu/osdep.h" +#include "qemu/main-loop.h" +#include "qemu/cutils.h" #include <linux/vfio.h> #include "sysemu/runstate.h" #include "hw/vfio/vfio-common.h" #include "cpu.h" #include "migration/migration.h" +#include "migration/vmstate.h" #include "migration/qemu-file.h" #include "migration/register.h" #include "migration/blocker.h" @@ -25,6 +28,22 @@ #include "trace.h" #include "hw/hw.h" +/* + * Flags to be used as unique delimiters for VFIO devices in the migration + * stream. These flags are composed as: + * 0xffffffff => MSB 32-bit all 1s + * 0xef10 => Magic ID, represents emulated (virtual) function IO + * 0x0000 => 16-bits reserved for flags + * + * The beginning of state information is marked by _DEV_CONFIG_STATE, + * _DEV_SETUP_STATE, or _DEV_DATA_STATE, respectively. The end of a + * certain state information is marked by _END_OF_STATE. 
+ */ +#define VFIO_MIG_FLAG_END_OF_STATE (0xffffffffef100001ULL) +#define VFIO_MIG_FLAG_DEV_CONFIG_STATE (0xffffffffef100002ULL) +#define VFIO_MIG_FLAG_DEV_SETUP_STATE (0xffffffffef100003ULL) +#define VFIO_MIG_FLAG_DEV_DATA_STATE (0xffffffffef100004ULL) + static inline int vfio_mig_access(VFIODevice *vbasedev, void *val, int count, off_t off, bool iswrite) { @@ -129,6 +148,75 @@ static int vfio_migration_set_state(VFIODevice *vbasedev, uint32_t mask, return 0; } +static void vfio_migration_cleanup(VFIODevice *vbasedev) +{ +VFIOMigration *migration = vbasedev->migration; + +if (migration->region.mmaps) { +vfio_region_unmap(&migration->region); +} +} + +/* -- */ + +static int vfio_save_setup(QEMUFile *f, void *opaque) +{ +VFIODevice *vbasedev = opaque; +VFIOMigration *migration = vbasedev->migration; +int ret; + +trace_vfio_save_setup(vbasedev->name); + +qemu_put_be64(f, VFIO_MIG_FLAG_DEV_SETUP_STATE); + +if (migration->region.mmaps) { +/* + * Calling vfio_region_mmap() from migration thread. Memory API called + * from this function require locking the iothread when called from + * outside the main loop thread. 
+ */ +qemu_mutex_lock_iothread(); +ret = vfio_region_mmap(&migration->region); +qemu_mutex_unlock_iothread(); +if (ret) { +error_report("%s: Failed to mmap VFIO migration region: %s", + vbasedev->name, strerror(-ret)); +error_report("%s: Falling back to slow path", vbasedev->name); +} +} + +ret = vfio_migration_set_state(vbasedev, VFIO_DEVICE_STATE_MASK, + VFIO_DEVICE_STATE_SAVING); +if (ret) { +error_report("%s: Failed to set state SAVING", vbasedev->name); +return ret; +} + +qemu_put_be64(f, VFIO_MIG_FLAG_END_OF_STATE); + +ret = qemu_file_get_error(f); +if (ret) { +return ret; +} + +return 0; +} + +static void vfio_save_cleanup(void *opaque) +{ +VFIODevice *vbasedev = opaque; + +vfio_migration_cleanup(vbasedev); +trace_vfio_save_cleanup(vbasedev->name); +} + +static SaveVMHandlers savevm_vfio_handlers = { +.save_setup = vfio_save_setup, +.save_cleanup = vfio_save_cleanup, +}; + +/* -- */ + static void vfio_vmstate_change(void *opaque, int running, RunState state) { VFIODevice *vbasedev = opaque; @@ -217,6 +305,8 @@ static int vfio_migration_init(VFIODevice *vbasedev, int ret; Object *obj; VFIOMigration *migration; +char id[256] = ""; +g_autofree char *path = NULL, *oid = NULL; if (!vbasedev->ops->vfio_get_object) { return -EINVAL; @@ -247,6 +337,18 @@ static int vfio_migration_init(VFIODevice *vbasedev, } migration->vbasedev = vbasedev; + +oid =
[PATCH v28 09/17] vfio: Add load state functions to SaveVMHandlers
Sequence during _RESUMING device state: While data for this device is available, repeat below steps: a. read data_offset from where user application should write data. b. write data of data_size to migration region from data_offset. c. write data_size which indicates vendor driver that data is written in staging buffer. For user, data is opaque. User should write data in the same order as received. Signed-off-by: Kirti Wankhede Reviewed-by: Neo Jia Reviewed-by: Dr. David Alan Gilbert --- hw/vfio/migration.c | 195 +++ hw/vfio/trace-events | 4 ++ 2 files changed, 199 insertions(+) diff --git a/hw/vfio/migration.c b/hw/vfio/migration.c index be9e4aba541d..240646592b39 100644 --- a/hw/vfio/migration.c +++ b/hw/vfio/migration.c @@ -257,6 +257,77 @@ static int vfio_save_buffer(QEMUFile *f, VFIODevice *vbasedev, uint64_t *size) return ret; } +static int vfio_load_buffer(QEMUFile *f, VFIODevice *vbasedev, +uint64_t data_size) +{ +VFIORegion *region = &vbasedev->migration->region; +uint64_t data_offset = 0, size, report_size; +int ret; + +do { +ret = vfio_mig_read(vbasedev, &data_offset, sizeof(data_offset), + region->fd_offset + VFIO_MIG_STRUCT_OFFSET(data_offset)); +if (ret < 0) { +return ret; +} + +if (data_offset + data_size > region->size) { +/* + * If data_size is greater than the data section of migration region + * then iterate the write buffer operation. This case can occur if + * size of migration region at destination is smaller than size of + * migration region at source. 
+ */ +report_size = size = region->size - data_offset; +data_size -= size; +} else { +report_size = size = data_size; +data_size = 0; +} + +trace_vfio_load_state_device_data(vbasedev->name, data_offset, size); + +while (size) { +void *buf; +uint64_t sec_size; +bool buf_alloc = false; + +buf = get_data_section_size(region, data_offset, size, &sec_size); + +if (!buf) { +buf = g_try_malloc(sec_size); +if (!buf) { +error_report("%s: Error allocating buffer ", __func__); +return -ENOMEM; +} +buf_alloc = true; +} + +qemu_get_buffer(f, buf, sec_size); + +if (buf_alloc) { +ret = vfio_mig_write(vbasedev, buf, sec_size, +region->fd_offset + data_offset); +g_free(buf); + +if (ret < 0) { +return ret; +} +} +size -= sec_size; +data_offset += sec_size; +} + +ret = vfio_mig_write(vbasedev, &report_size, sizeof(report_size), +region->fd_offset + VFIO_MIG_STRUCT_OFFSET(data_size)); +if (ret < 0) { +return ret; +} +} while (data_size); + +return 0; +} + static int vfio_update_pending(VFIODevice *vbasedev) { VFIOMigration *migration = vbasedev->migration; @@ -293,6 +364,33 @@ static int vfio_save_device_config_state(QEMUFile *f, void *opaque) return qemu_file_get_error(f); } +static int vfio_load_device_config_state(QEMUFile *f, void *opaque) +{ +VFIODevice *vbasedev = opaque; +uint64_t data; + +if (vbasedev->ops && vbasedev->ops->vfio_load_config) { +int ret; + +ret = vbasedev->ops->vfio_load_config(vbasedev, f); +if (ret) { +error_report("%s: Failed to load device config space", + vbasedev->name); +return ret; +} +} + +data = qemu_get_be64(f); +if (data != VFIO_MIG_FLAG_END_OF_STATE) { +error_report("%s: Failed loading device config space, " + "end flag incorrect 0x%"PRIx64, vbasedev->name, data); +return -EINVAL; +} + +trace_vfio_load_device_config_state(vbasedev->name); +return qemu_file_get_error(f); +} + static void vfio_migration_cleanup(VFIODevice *vbasedev) { VFIOMigration *migration = vbasedev->migration; @@ -483,12 +581,109 @@ static int vfio_save_complete_precopy(QEMUFile 
*f, void *opaque) return ret; } +static int vfio_load_setup(QEMUFile *f, void *opaque) +{ +VFIODevice *vbasedev = opaque; +VFIOMigration *migration = vbasedev->migration; +int ret = 0; + +if (migration->region.mmaps) { +ret = vfio_region_mmap(&migration->region); +if (ret) { +error_report("%s: Failed to mmap VFIO migration region %d: %s", +
[PATCH v28 04/17] vfio: Add migration region initialization and finalize function
Whether the VFIO device supports migration is decided based on the migration region query. If the migration region query and the migration region initialization both succeed, migration is supported; otherwise migration is blocked. Signed-off-by: Kirti Wankhede Reviewed-by: Neo Jia Acked-by: Dr. David Alan Gilbert --- hw/vfio/meson.build | 1 + hw/vfio/migration.c | 133 ++ hw/vfio/trace-events | 3 + include/hw/vfio/vfio-common.h | 9 +++ 4 files changed, 146 insertions(+) create mode 100644 hw/vfio/migration.c diff --git a/hw/vfio/meson.build b/hw/vfio/meson.build index 37efa74018bc..da9af297a0c5 100644 --- a/hw/vfio/meson.build +++ b/hw/vfio/meson.build @@ -2,6 +2,7 @@ vfio_ss = ss.source_set() vfio_ss.add(files( 'common.c', 'spapr.c', + 'migration.c', )) vfio_ss.add(when: 'CONFIG_VFIO_PCI', if_true: files( 'display.c', diff --git a/hw/vfio/migration.c b/hw/vfio/migration.c new file mode 100644 index 000000000000..bbe6e0b7a6cc --- /dev/null +++ b/hw/vfio/migration.c @@ -0,0 +1,133 @@ +/* + * Migration support for VFIO devices + * + * Copyright NVIDIA, Inc. 2020 + * + * This work is licensed under the terms of the GNU GPL, version 2. See + * the COPYING file in the top-level directory. 
+ */ + +#include "qemu/osdep.h" +#include + +#include "hw/vfio/vfio-common.h" +#include "cpu.h" +#include "migration/migration.h" +#include "migration/qemu-file.h" +#include "migration/register.h" +#include "migration/blocker.h" +#include "migration/misc.h" +#include "qapi/error.h" +#include "exec/ramlist.h" +#include "exec/ram_addr.h" +#include "pci.h" +#include "trace.h" + +static void vfio_migration_region_exit(VFIODevice *vbasedev) +{ +VFIOMigration *migration = vbasedev->migration; + +if (!migration) { +return; +} + +vfio_region_exit(&migration->region); +vfio_region_finalize(&migration->region); +} + +static int vfio_migration_init(VFIODevice *vbasedev, + struct vfio_region_info *info) +{ +int ret; +Object *obj; +VFIOMigration *migration; + +if (!vbasedev->ops->vfio_get_object) { +return -EINVAL; +} + +obj = vbasedev->ops->vfio_get_object(vbasedev); +if (!obj) { +return -EINVAL; +} + +migration = g_new0(VFIOMigration, 1); + +ret = vfio_region_setup(obj, vbasedev, &migration->region, +info->index, "migration"); +if (ret) { +error_report("%s: Failed to setup VFIO migration region %d: %s", + vbasedev->name, info->index, strerror(-ret)); +goto err; +} + +vbasedev->migration = migration; + +if (!migration->region.size) { +error_report("%s: Invalid zero-sized of VFIO migration region %d", + vbasedev->name, info->index); +ret = -EINVAL; +goto err; +} +return 0; + +err: +vfio_migration_region_exit(vbasedev); +g_free(migration); +vbasedev->migration = NULL; +return ret; +} + +/* -- */ + +int vfio_migration_probe(VFIODevice *vbasedev, Error **errp) +{ +struct vfio_region_info *info = NULL; +Error *local_err = NULL; +int ret; + +ret = vfio_get_dev_region_info(vbasedev, VFIO_REGION_TYPE_MIGRATION, + VFIO_REGION_SUBTYPE_MIGRATION, &info); +if (ret) { +goto add_blocker; +} + +ret = vfio_migration_init(vbasedev, info); +if (ret) { +goto add_blocker; +} + +g_free(info); +trace_vfio_migration_probe(vbasedev->name, info->index); +return 0; + +add_blocker: 
+error_setg(&vbasedev->migration_blocker, + "VFIO device doesn't support migration"); +g_free(info); + +ret = migrate_add_blocker(vbasedev->migration_blocker, &local_err); +if (local_err) { +error_propagate(errp, local_err); +error_free(vbasedev->migration_blocker); +vbasedev->migration_blocker = NULL; +} +return ret; +} + +void vfio_migration_finalize(VFIODevice *vbasedev) +{ +VFIOMigration *migration = vbasedev->migration; + +if (migration) { +vfio_migration_region_exit(vbasedev); +g_free(vbasedev->migration); +vbasedev->migration = NULL; +} + +if (vbasedev->migration_blocker) { +migrate_del_blocker(vbasedev->migration_blocker); +error_free(vbasedev->migration_blocker); +vbasedev->migration_blocker = NULL; +} +} diff --git a/hw/vfio/trace-events b/hw/vfio/trace-events index a0c7b49a2ebc..9ced5ec6277c 10
[PATCH v28 13/17] vfio: Add vfio_listener_log_sync to mark dirty pages
vfio_listener_log_sync gets list of dirty pages from container using VFIO_IOMMU_GET_DIRTY_BITMAP ioctl and mark those pages dirty when all devices are stopped and saving state. Return early for the RAM block section of mapped MMIO region. Signed-off-by: Kirti Wankhede Reviewed-by: Neo Jia --- hw/vfio/common.c | 116 +++ hw/vfio/trace-events | 1 + 2 files changed, 117 insertions(+) diff --git a/hw/vfio/common.c b/hw/vfio/common.c index d4959c036dd1..2634387df948 100644 --- a/hw/vfio/common.c +++ b/hw/vfio/common.c @@ -29,6 +29,7 @@ #include "hw/vfio/vfio.h" #include "exec/address-spaces.h" #include "exec/memory.h" +#include "exec/ram_addr.h" #include "hw/hw.h" #include "qemu/error-report.h" #include "qemu/main-loop.h" @@ -37,6 +38,7 @@ #include "sysemu/reset.h" #include "trace.h" #include "qapi/error.h" +#include "migration/migration.h" VFIOGroupList vfio_group_list = QLIST_HEAD_INITIALIZER(vfio_group_list); @@ -287,6 +289,39 @@ const MemoryRegionOps vfio_region_ops = { }; /* + * Device state interfaces + */ + +static bool vfio_devices_all_stopped_and_saving(VFIOContainer *container) +{ +VFIOGroup *group; +VFIODevice *vbasedev; +MigrationState *ms = migrate_get_current(); + +if (!migration_is_setup_or_active(ms->state)) { +return false; +} + +QLIST_FOREACH(group, &container->group_list, container_next) { +QLIST_FOREACH(vbasedev, &group->device_list, next) { +VFIOMigration *migration = vbasedev->migration; + +if (!migration) { +return false; +} + +if ((migration->device_state & VFIO_DEVICE_STATE_SAVING) && +!(migration->device_state & VFIO_DEVICE_STATE_RUNNING)) { +continue; +} else { +return false; +} +} +} +return true; +} + +/* * DMA - Mapping and unmapping for the "type1" IOMMU interface used on x86 */ static int vfio_dma_unmap(VFIOContainer *container, @@ -812,9 +847,90 @@ static void vfio_listener_region_del(MemoryListener *listener, } } +static int vfio_get_dirty_bitmap(VFIOContainer *container, uint64_t iova, + uint64_t size, ram_addr_t ram_addr) +{ +struct 
vfio_iommu_type1_dirty_bitmap *dbitmap; +struct vfio_iommu_type1_dirty_bitmap_get *range; +uint64_t pages; +int ret; + +dbitmap = g_malloc0(sizeof(*dbitmap) + sizeof(*range)); + +dbitmap->argsz = sizeof(*dbitmap) + sizeof(*range); +dbitmap->flags = VFIO_IOMMU_DIRTY_PAGES_FLAG_GET_BITMAP; +range = (struct vfio_iommu_type1_dirty_bitmap_get *)&dbitmap->data; +range->iova = iova; +range->size = size; + +/* + * cpu_physical_memory_set_dirty_lebitmap() expects pages in bitmap of + * TARGET_PAGE_SIZE to mark those dirty. Hence set bitmap's pgsize to + * TARGET_PAGE_SIZE. + */ +range->bitmap.pgsize = TARGET_PAGE_SIZE; + +pages = TARGET_PAGE_ALIGN(range->size) >> TARGET_PAGE_BITS; +range->bitmap.size = ROUND_UP(pages, sizeof(__u64) * BITS_PER_BYTE) / + BITS_PER_BYTE; +range->bitmap.data = g_try_malloc0(range->bitmap.size); +if (!range->bitmap.data) { +ret = -ENOMEM; +goto err_out; +} + +ret = ioctl(container->fd, VFIO_IOMMU_DIRTY_PAGES, dbitmap); +if (ret) { +error_report("Failed to get dirty bitmap for iova: 0x%llx " +"size: 0x%llx err: %d", +range->iova, range->size, errno); +goto err_out; +} + +cpu_physical_memory_set_dirty_lebitmap((uint64_t *)range->bitmap.data, +ram_addr, pages); + +trace_vfio_get_dirty_bitmap(container->fd, range->iova, range->size, +range->bitmap.size, ram_addr); +err_out: +g_free(range->bitmap.data); +g_free(dbitmap); + +return ret; +} + +static int vfio_sync_dirty_bitmap(VFIOContainer *container, + MemoryRegionSection *section) +{ +ram_addr_t ram_addr; + +ram_addr = memory_region_get_ram_addr(section->mr) + + section->offset_within_region; + +return vfio_get_dirty_bitmap(container, + TARGET_PAGE_ALIGN(section->offset_within_address_space), + int128_get64(section->size), ram_addr); +} + +static void vfio_listerner_log_sync(MemoryListener *listener, +MemoryRegionSection *section) +{ +VFIOContainer *container = container_of(listener, VFIOContainer, listener); + +if (vfio_listener_skipped_section(section) || +!container->
[PATCH v28 03/17] vfio: Add save and load functions for VFIO PCI devices
Added functions to save and restore PCI device specific data, specifically config space of PCI device. Signed-off-by: Kirti Wankhede Reviewed-by: Neo Jia --- hw/vfio/pci.c | 48 +++ include/hw/vfio/vfio-common.h | 2 ++ 2 files changed, 50 insertions(+) diff --git a/hw/vfio/pci.c b/hw/vfio/pci.c index bffd5bfe3b78..92cc25a5489f 100644 --- a/hw/vfio/pci.c +++ b/hw/vfio/pci.c @@ -41,6 +41,7 @@ #include "trace.h" #include "qapi/error.h" #include "migration/blocker.h" +#include "migration/qemu-file.h" #define TYPE_VFIO_PCI_NOHOTPLUG "vfio-pci-nohotplug" @@ -2401,11 +2402,58 @@ static Object *vfio_pci_get_object(VFIODevice *vbasedev) return OBJECT(vdev); } +static bool vfio_msix_present(void *opaque, int version_id) +{ +PCIDevice *pdev = opaque; + +return msix_present(pdev); +} + +const VMStateDescription vmstate_vfio_pci_config = { +.name = "VFIOPCIDevice", +.version_id = 1, +.minimum_version_id = 1, +.fields = (VMStateField[]) { +VMSTATE_PCI_DEVICE(pdev, VFIOPCIDevice), +VMSTATE_MSIX_TEST(pdev, VFIOPCIDevice, vfio_msix_present), +VMSTATE_END_OF_LIST() +} +}; + +static void vfio_pci_save_config(VFIODevice *vbasedev, QEMUFile *f) +{ +VFIOPCIDevice *vdev = container_of(vbasedev, VFIOPCIDevice, vbasedev); + +vmstate_save_state(f, &vmstate_vfio_pci_config, vdev, NULL); +} + +static int vfio_pci_load_config(VFIODevice *vbasedev, QEMUFile *f) +{ +VFIOPCIDevice *vdev = container_of(vbasedev, VFIOPCIDevice, vbasedev); +PCIDevice *pdev = &vdev->pdev; +int ret; + +ret = vmstate_load_state(f, &vmstate_vfio_pci_config, vdev, 1); +if (ret) { +return ret; +} + +if (msi_enabled(pdev)) { +vfio_msi_enable(vdev); +} else if (msix_enabled(pdev)) { +vfio_msix_enable(vdev); +} + +return ret; +} + static VFIODeviceOps vfio_pci_ops = { .vfio_compute_needs_reset = vfio_pci_compute_needs_reset, .vfio_hot_reset_multi = vfio_pci_hot_reset_multi, .vfio_eoi = vfio_intx_eoi, .vfio_get_object = vfio_pci_get_object, +.vfio_save_config = vfio_pci_save_config, +.vfio_load_config = vfio_pci_load_config, 
}; int vfio_populate_vga(VFIOPCIDevice *vdev, Error **errp) diff --git a/include/hw/vfio/vfio-common.h b/include/hw/vfio/vfio-common.h index fe99c36a693a..ba6169cd926e 100644 --- a/include/hw/vfio/vfio-common.h +++ b/include/hw/vfio/vfio-common.h @@ -120,6 +120,8 @@ struct VFIODeviceOps { int (*vfio_hot_reset_multi)(VFIODevice *vdev); void (*vfio_eoi)(VFIODevice *vdev); Object *(*vfio_get_object)(VFIODevice *vdev); +void (*vfio_save_config)(VFIODevice *vdev, QEMUFile *f); +int (*vfio_load_config)(VFIODevice *vdev, QEMUFile *f); }; typedef struct VFIOGroup { -- 2.7.0
[PATCH v28 01/17] vfio: Add function to unmap VFIO region
This function will be used for migration region. Migration region is mmaped when migration starts and will be unmapped when migration is complete. Signed-off-by: Kirti Wankhede Reviewed-by: Neo Jia Reviewed-by: Cornelia Huck --- hw/vfio/common.c | 32 hw/vfio/trace-events | 1 + include/hw/vfio/vfio-common.h | 1 + 3 files changed, 30 insertions(+), 4 deletions(-) diff --git a/hw/vfio/common.c b/hw/vfio/common.c index 13471ae29436..c6e98b8d61be 100644 --- a/hw/vfio/common.c +++ b/hw/vfio/common.c @@ -924,6 +924,18 @@ int vfio_region_setup(Object *obj, VFIODevice *vbasedev, VFIORegion *region, return 0; } +static void vfio_subregion_unmap(VFIORegion *region, int index) +{ +trace_vfio_region_unmap(memory_region_name(®ion->mmaps[index].mem), +region->mmaps[index].offset, +region->mmaps[index].offset + +region->mmaps[index].size - 1); +memory_region_del_subregion(region->mem, ®ion->mmaps[index].mem); +munmap(region->mmaps[index].mmap, region->mmaps[index].size); +object_unparent(OBJECT(®ion->mmaps[index].mem)); +region->mmaps[index].mmap = NULL; +} + int vfio_region_mmap(VFIORegion *region) { int i, prot = 0; @@ -954,10 +966,7 @@ int vfio_region_mmap(VFIORegion *region) region->mmaps[i].mmap = NULL; for (i--; i >= 0; i--) { -memory_region_del_subregion(region->mem, ®ion->mmaps[i].mem); -munmap(region->mmaps[i].mmap, region->mmaps[i].size); -object_unparent(OBJECT(®ion->mmaps[i].mem)); -region->mmaps[i].mmap = NULL; +vfio_subregion_unmap(region, i); } return ret; @@ -982,6 +991,21 @@ int vfio_region_mmap(VFIORegion *region) return 0; } +void vfio_region_unmap(VFIORegion *region) +{ +int i; + +if (!region->mem) { +return; +} + +for (i = 0; i < region->nr_mmaps; i++) { +if (region->mmaps[i].mmap) { +vfio_subregion_unmap(region, i); +} +} +} + void vfio_region_exit(VFIORegion *region) { int i; diff --git a/hw/vfio/trace-events b/hw/vfio/trace-events index 93a0bc2522f8..a0c7b49a2ebc 100644 --- a/hw/vfio/trace-events +++ b/hw/vfio/trace-events @@ -113,6 +113,7 @@ 
vfio_region_mmap(const char *name, unsigned long offset, unsigned long end) "Reg vfio_region_exit(const char *name, int index) "Device %s, region %d" vfio_region_finalize(const char *name, int index) "Device %s, region %d" vfio_region_mmaps_set_enabled(const char *name, bool enabled) "Region %s mmaps enabled: %d" +vfio_region_unmap(const char *name, unsigned long offset, unsigned long end) "Region %s unmap [0x%lx - 0x%lx]" vfio_region_sparse_mmap_header(const char *name, int index, int nr_areas) "Device %s region %d: %d sparse mmap entries" vfio_region_sparse_mmap_entry(int i, unsigned long start, unsigned long end) "sparse entry %d [0x%lx - 0x%lx]" vfio_get_dev_region(const char *name, int index, uint32_t type, uint32_t subtype) "%s index %d, %08x/%0x8" diff --git a/include/hw/vfio/vfio-common.h b/include/hw/vfio/vfio-common.h index c78f3ff5593c..dc95f527b583 100644 --- a/include/hw/vfio/vfio-common.h +++ b/include/hw/vfio/vfio-common.h @@ -171,6 +171,7 @@ int vfio_region_setup(Object *obj, VFIODevice *vbasedev, VFIORegion *region, int index, const char *name); int vfio_region_mmap(VFIORegion *region); void vfio_region_mmaps_set_enabled(VFIORegion *region, bool enabled); +void vfio_region_unmap(VFIORegion *region); void vfio_region_exit(VFIORegion *region); void vfio_region_finalize(VFIORegion *region); void vfio_reset_handler(void *opaque); -- 2.7.0
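As a quick sanity model of the unmap path added in this patch — illustrative C with stand-in types, not QEMU's actual structs — the key invariant is that vfio_region_unmap() touches only subregions that are currently mapped and clears each mmap pointer, so a repeated call is a harmless no-op:

```c
#include <assert.h>
#include <stddef.h>

/* Simplified model of VFIORegion's per-subregion mmap bookkeeping.
 * Names and layout are illustrative only. */
#define NR_MMAPS 4

struct mmap_area {
    void *mmap;   /* NULL when not mapped */
};

struct region {
    struct mmap_area mmaps[NR_MMAPS];
    int nr_mmaps;
};

/* Mirrors the vfio_region_unmap() loop: unmap only what is mapped and
 * clear the pointer (the real code also calls munmap(),
 * memory_region_del_subregion() and object_unparent()). */
static int region_unmap_all(struct region *r)
{
    int unmapped = 0;
    for (int i = 0; i < r->nr_mmaps; i++) {
        if (r->mmaps[i].mmap) {
            r->mmaps[i].mmap = NULL;
            unmapped++;
        }
    }
    return unmapped;
}
```

Because the pointer is NULLed, the same region can be mmapped again for a later migration attempt without any stale state.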
[PATCH v28 10/17] memory: Set DIRTY_MEMORY_MIGRATION when IOMMU is enabled
mr->ram_block is NULL when mr->is_iommu is true, so fr.dirty_log_mask was not set correctly and the memory listener's log_sync callback never got called. This patch returns a log_mask with DIRTY_MEMORY_MIGRATION set when the IOMMU is enabled. Signed-off-by: Kirti Wankhede --- softmmu/memory.c | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/softmmu/memory.c b/softmmu/memory.c index 403ff3abc99b..94f606e9d9d9 100644 --- a/softmmu/memory.c +++ b/softmmu/memory.c @@ -1792,7 +1792,7 @@ bool memory_region_is_ram_device(MemoryRegion *mr) uint8_t memory_region_get_dirty_log_mask(MemoryRegion *mr) { uint8_t mask = mr->dirty_log_mask; -if (global_dirty_log && mr->ram_block) { +if (global_dirty_log && (mr->ram_block || memory_region_is_iommu(mr))) { mask |= (1 << DIRTY_MEMORY_MIGRATION); } return mask; -- 2.7.0
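The one-line change above can be isolated into a pure function for clarity. This is an illustrative re-implementation, not QEMU's code; the bit position assumes QEMU's DIRTY_MEMORY_MIGRATION value of 2:

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Assumed to match QEMU's DIRTY_MEMORY_MIGRATION. */
#define DIRTY_MEMORY_MIGRATION 2

static uint8_t dirty_log_mask(uint8_t base_mask, bool global_dirty_log,
                              bool has_ram_block, bool is_iommu)
{
    uint8_t mask = base_mask;
    /* Before the patch only RAM-backed regions got the migration bit;
     * IOMMU regions (ram_block == NULL) were skipped, so log_sync was
     * never invoked for them. The patch adds the is_iommu case. */
    if (global_dirty_log && (has_ram_block || is_iommu)) {
        mask |= (1 << DIRTY_MEMORY_MIGRATION);
    }
    return mask;
}
```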
[PATCH v28 05/17] vfio: Add VM state change handler to know state of VM
VM state change handler is called on change in VM's state. Based on VM state, VFIO device state should be changed. Added read/write helper functions for migration region. Added function to set device_state. Signed-off-by: Kirti Wankhede Reviewed-by: Neo Jia Reviewed-by: Dr. David Alan Gilbert --- hw/vfio/migration.c | 156 ++ hw/vfio/trace-events | 2 + include/hw/vfio/vfio-common.h | 4 ++ 3 files changed, 162 insertions(+) diff --git a/hw/vfio/migration.c b/hw/vfio/migration.c index bbe6e0b7a6cc..9b6949439f8e 100644 --- a/hw/vfio/migration.c +++ b/hw/vfio/migration.c @@ -10,6 +10,7 @@ #include "qemu/osdep.h" #include +#include "sysemu/runstate.h" #include "hw/vfio/vfio-common.h" #include "cpu.h" #include "migration/migration.h" @@ -22,6 +23,157 @@ #include "exec/ram_addr.h" #include "pci.h" #include "trace.h" +#include "hw/hw.h" + +static inline int vfio_mig_access(VFIODevice *vbasedev, void *val, int count, + off_t off, bool iswrite) +{ +int ret; + +ret = iswrite ? pwrite(vbasedev->fd, val, count, off) : +pread(vbasedev->fd, val, count, off); +if (ret < count) { +error_report("vfio_mig_%s %d byte %s: failed at offset 0x%lx, err: %s", + iswrite ? "write" : "read", count, + vbasedev->name, off, strerror(errno)); +return (ret < 0) ? 
ret : -EINVAL; +} +return 0; +} + +static int vfio_mig_rw(VFIODevice *vbasedev, __u8 *buf, size_t count, + off_t off, bool iswrite) +{ +int ret, done = 0; +__u8 *tbuf = buf; + +while (count) { +int bytes = 0; + +if (count >= 8 && !(off % 8)) { +bytes = 8; +} else if (count >= 4 && !(off % 4)) { +bytes = 4; +} else if (count >= 2 && !(off % 2)) { +bytes = 2; +} else { +bytes = 1; +} + +ret = vfio_mig_access(vbasedev, tbuf, bytes, off, iswrite); +if (ret) { +return ret; +} + +count -= bytes; +done += bytes; +off += bytes; +tbuf += bytes; +} +return done; +} + +#define vfio_mig_read(f, v, c, o) vfio_mig_rw(f, (__u8 *)v, c, o, false) +#define vfio_mig_write(f, v, c, o) vfio_mig_rw(f, (__u8 *)v, c, o, true) + +#define VFIO_MIG_STRUCT_OFFSET(f) \ + offsetof(struct vfio_device_migration_info, f) +/* + * Change the device_state register for device @vbasedev. Bits set in @mask + * are preserved, bits set in @value are set, and bits not set in either @mask + * or @value are cleared in device_state. If the register cannot be accessed, + * the resulting state would be invalid, or the device enters an error state, + * an error is returned. 
+ */ + +static int vfio_migration_set_state(VFIODevice *vbasedev, uint32_t mask, +uint32_t value) +{ +VFIOMigration *migration = vbasedev->migration; +VFIORegion *region = &migration->region; +off_t dev_state_off = region->fd_offset + + VFIO_MIG_STRUCT_OFFSET(device_state); +uint32_t device_state; +int ret; + +ret = vfio_mig_read(vbasedev, &device_state, sizeof(device_state), +dev_state_off); +if (ret < 0) { +return ret; +} + +device_state = (device_state & mask) | value; + +if (!VFIO_DEVICE_STATE_VALID(device_state)) { +return -EINVAL; +} + +ret = vfio_mig_write(vbasedev, &device_state, sizeof(device_state), + dev_state_off); +if (ret < 0) { +int rret; + +rret = vfio_mig_read(vbasedev, &device_state, sizeof(device_state), + dev_state_off); + +if ((rret < 0) || (VFIO_DEVICE_STATE_IS_ERROR(device_state))) { +hw_error("%s: Device in error state 0x%x", vbasedev->name, + device_state); +return rret ? rret : -EIO; +} +return ret; +} + +migration->device_state = device_state; +trace_vfio_migration_set_state(vbasedev->name, device_state); +return 0; +} + +static void vfio_vmstate_change(void *opaque, int running, RunState state) +{ +VFIODevice *vbasedev = opaque; +VFIOMigration *migration = vbasedev->migration; +uint32_t value, mask; +int ret; + +if ((vbasedev->migration->vm_running == running)) { +return; +} + +if (running) { +/* + * Here device state can have one of _SAVING, _RESUMING or _STOP bit. + * Transition from _SAVING to _RUNNING can happen if there is migration + * failure,
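The mask/value update rule documented in the comment above (preserve bits in @mask, set bits in @value, clear everything else) can be checked in isolation. The bit values below follow the v1 VFIO migration uAPI in linux/vfio.h; the helper name is mine:

```c
#include <assert.h>
#include <stdint.h>

/* Device state bits from the v1 VFIO migration uAPI. */
#define VFIO_DEVICE_STATE_STOP     (0U << 0)
#define VFIO_DEVICE_STATE_RUNNING  (1U << 0)
#define VFIO_DEVICE_STATE_SAVING   (1U << 1)
#define VFIO_DEVICE_STATE_RESUMING (1U << 2)
#define VFIO_DEVICE_STATE_MASK     (VFIO_DEVICE_STATE_RUNNING | \
                                    VFIO_DEVICE_STATE_SAVING | \
                                    VFIO_DEVICE_STATE_RESUMING)

/* The register update performed by vfio_migration_set_state():
 * keep the bits in @mask, set the bits in @value, clear the rest. */
static uint32_t next_device_state(uint32_t current, uint32_t mask,
                                  uint32_t value)
{
    return (current & mask) | value;
}
```

For example, entering pre-copy from _RUNNING with mask = STATE_MASK and value = _SAVING yields _RUNNING | _SAVING, while entering stop-and-copy with mask = ~_RUNNING and value = _SAVING drops the running bit and leaves _SAVING alone.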
[PATCH v28 00/17] Add migration support for VFIO devices
ation post copy is not supported. v27 -> v28 - Nit picks and minor changes suggested by Alex. v26 -> v27 - Major change in Patch 3 - PCI config space save and load using VMSTATE_* - Major change in Patch 14 - Dirty page tracking when vIOMMU is enabled using IOMMU notifier and its replay functionality - as suggested by Alex. - Some structure changes to keep all migration-related members in one place. - Pulled fix suggested by Zhi Wang https://www.mail-archive.com/qemu-devel@nongnu.org/msg743722.html - Added comments wherever suggested and required. v25 -> v26 - Removed emulated_config_bits cache and vdev->pdev.wmask from config space save/load functions. - Used VMStateDescription for config space save and load functionality. - Major fixes from previous version review. https://www.mail-archive.com/qemu-devel@nongnu.org/msg714625.html v23 -> v25 - Updated config space save and load to save config cache, emulated bits cache and wmask cache. - Created idr string as suggested by Dr Dave that includes bus path. - Updated save and load functions to read/write data to mixed regions, mapped or trapped. - When vIOMMU is enabled, created a mapped iova range list which also keeps the translated address. This list is used to mark dirty pages. This reduces downtime significantly with vIOMMU enabled compared to the migration patches from the previous version. - Removed get_address_limit() function from the v23 patch as this is not required now. v22 -> v23 - Fixed issue reported by Yan https://lore.kernel.org/kvm/97977ede-3c5b-c5a5-7858-7eecd7dd5...@nvidia.com/ - Sending this version to test v23 kernel version patches: https://lore.kernel.org/kvm/1589998088-3250-1-git-send-email-kwankh...@nvidia.com/ v18 -> v22 - A few fixes from the v18 review, but not all concerns are fixed yet. I'll address them in subsequent iterations.
- Sending this version to test v22 kernel version patches: https://lore.kernel.org/kvm/1589781397-28368-1-git-send-email-kwankh...@nvidia.com/ v16 -> v18 - Nit fixes - Get migration capability flags from container - Added VFIO stats to MigrationInfo - Fixed bug reported by Yan https://lists.gnu.org/archive/html/qemu-devel/2020-04/msg4.html v9 -> v16 - KABI almost finalised on kernel patches. - Added support for migration with vIOMMU enabled. v8 -> v9: - Split patch set into 2 sets, kernel and QEMU sets. - Dirty pages bitmap is queried from the IOMMU container rather than from the vendor driver per device. Added 2 ioctls to achieve this. v7 -> v8: - Updated comments for KABI - Added BAR address validation check during PCI device's config space load as suggested by Dr. David Alan Gilbert. - Changed vfio_migration_set_state() to set or clear device state flags. - Some nit fixes. v6 -> v7: - Fix build failures. v5 -> v6: - Fix build failure. v4 -> v5: - Added descriptive comment about the sequence of access of members of structure vfio_device_migration_info, to be followed based on Alex's suggestion - Updated get dirty pages sequence. - As per Cornelia Huck's suggestion, added callbacks to VFIODeviceOps for get_object, save_config and load_config. - Fixed multiple nit picks. - Tested live migration with multiple vfio devices assigned to a VM. v3 -> v4: - Added one more bit for the _RESUMING flag to be set explicitly. - data_offset field is read-only for the user space application. - data_size is read on every iteration before reading data from the migration region; this removes the assumption that data runs until the end of the migration region. - If the vendor driver supports mappable sparse regions, map those regions during the setup state of save/load, and similarly unmap them from the cleanup routines. - Handled a race condition that caused data corruption in the migration region during device state save by adding a mutex and serializing the save_buffer and get_dirty_pages routines.
- Skipped calling the get_dirty_pages routine for the device's mapped MMIO region. - Added trace events. - Split into multiple functional patches. v2 -> v3: - Removed enum of VFIO device states. Defined VFIO device state with 2 bits. - Re-structured vfio_device_migration_info to keep it minimal and defined actions on read and write access to its members. v1 -> v2: - Defined MIGRATION region type and sub-type which should be used with the region type capability. - Re-structured vfio_device_migration_info. This structure will be placed at the 0th offset of the migration region. - Replaced ioctl with read/write for the trapped part of the migration region. - Added both types of access support, trapped or mmapped, for the data section of the region. - Moved PCI device functions to the pci file. - Added iteration to get the dirty page bitmap until the bitmap for all requested pages is copied. Thanks, Kirti Kirti Wankhede (17): vfio: Add function to unmap VFIO region vfio: Add vfio_get_object callback to VFIODeviceOps vfio: Add save and load functions for VFIO PCI devices vfio: Add migration region initialization
[PATCH v28 02/17] vfio: Add vfio_get_object callback to VFIODeviceOps
Hook vfio_get_object callback for PCI devices. Signed-off-by: Kirti Wankhede Reviewed-by: Neo Jia Suggested-by: Cornelia Huck Reviewed-by: Cornelia Huck --- hw/vfio/pci.c | 8 include/hw/vfio/vfio-common.h | 1 + 2 files changed, 9 insertions(+) diff --git a/hw/vfio/pci.c b/hw/vfio/pci.c index 0d83eb0e47bb..bffd5bfe3b78 100644 --- a/hw/vfio/pci.c +++ b/hw/vfio/pci.c @@ -2394,10 +2394,18 @@ static void vfio_pci_compute_needs_reset(VFIODevice *vbasedev) } } +static Object *vfio_pci_get_object(VFIODevice *vbasedev) +{ +VFIOPCIDevice *vdev = container_of(vbasedev, VFIOPCIDevice, vbasedev); + +return OBJECT(vdev); +} + static VFIODeviceOps vfio_pci_ops = { .vfio_compute_needs_reset = vfio_pci_compute_needs_reset, .vfio_hot_reset_multi = vfio_pci_hot_reset_multi, .vfio_eoi = vfio_intx_eoi, +.vfio_get_object = vfio_pci_get_object, }; int vfio_populate_vga(VFIOPCIDevice *vdev, Error **errp) diff --git a/include/hw/vfio/vfio-common.h b/include/hw/vfio/vfio-common.h index dc95f527b583..fe99c36a693a 100644 --- a/include/hw/vfio/vfio-common.h +++ b/include/hw/vfio/vfio-common.h @@ -119,6 +119,7 @@ struct VFIODeviceOps { void (*vfio_compute_needs_reset)(VFIODevice *vdev); int (*vfio_hot_reset_multi)(VFIODevice *vdev); void (*vfio_eoi)(VFIODevice *vdev); +Object *(*vfio_get_object)(VFIODevice *vdev); }; typedef struct VFIOGroup { -- 2.7.0
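vfio_pci_get_object() in the patch above relies on the standard container_of() idiom: given a pointer to the embedded VFIODevice, recover the enclosing VFIOPCIDevice. A stand-alone sketch with stand-in types (not QEMU's) shows why the pointer arithmetic is exact:

```c
#include <assert.h>
#include <stddef.h>

/* Classic container_of(): subtract the member's offset from the member
 * pointer to get the start of the containing struct. */
#define container_of(ptr, type, member) \
    ((type *)((char *)(ptr) - offsetof(type, member)))

struct vfio_device { int fd; };

struct vfio_pci_device {
    int magic;                     /* some PCI-specific state */
    struct vfio_device vbasedev;   /* embedded base object */
};

/* Analogue of vfio_pci_get_object(): base pointer in, derived out. */
static struct vfio_pci_device *to_pci_device(struct vfio_device *vbasedev)
{
    return container_of(vbasedev, struct vfio_pci_device, vbasedev);
}
```

This is what lets generic migration code hold only a VFIODevice * yet hand back the full QOM object for the PCI device.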
Re: [PATCH v27 17/17] qapi: Add VFIO devices migration stats in Migration stats
On 10/23/2020 3:48 AM, Alex Williamson wrote: On Thu, 22 Oct 2020 16:42:07 +0530 Kirti Wankhede wrote: Added amount of bytes transferred to the VM at destination by all VFIO devices Signed-off-by: Kirti Wankhede Reviewed-by: Dr. David Alan Gilbert --- hw/vfio/common.c| 20 hw/vfio/migration.c | 10 ++ include/qemu/vfio-helpers.h | 3 +++ migration/migration.c | 14 ++ monitor/hmp-cmds.c | 6 ++ qapi/migration.json | 17 + 6 files changed, 70 insertions(+) diff --git a/hw/vfio/common.c b/hw/vfio/common.c index 9c879e5c0f62..8d0758eda9fa 100644 --- a/hw/vfio/common.c +++ b/hw/vfio/common.c @@ -39,6 +39,7 @@ #include "trace.h" #include "qapi/error.h" #include "migration/migration.h" +#include "qemu/vfio-helpers.h" VFIOGroupList vfio_group_list = QLIST_HEAD_INITIALIZER(vfio_group_list); @@ -292,6 +293,25 @@ const MemoryRegionOps vfio_region_ops = { * Device state interfaces */ +bool vfio_mig_active(void) +{ +VFIOGroup *group; +VFIODevice *vbasedev; + +if (QLIST_EMPTY(&vfio_group_list)) { +return false; +} + +QLIST_FOREACH(group, &vfio_group_list, next) { +QLIST_FOREACH(vbasedev, &group->device_list, next) { +if (vbasedev->migration_blocker) { +return false; +} +} +} +return true; +} + static bool vfio_devices_all_stopped_and_saving(VFIOContainer *container) { VFIOGroup *group; diff --git a/hw/vfio/migration.c b/hw/vfio/migration.c index 77ee60a43ea5..b23e21c6de2b 100644 --- a/hw/vfio/migration.c +++ b/hw/vfio/migration.c @@ -28,6 +28,7 @@ #include "pci.h" #include "trace.h" #include "hw/hw.h" +#include "qemu/vfio-helpers.h" /* * Flags to be used as unique delimiters for VFIO devices in the migration @@ -45,6 +46,8 @@ #define VFIO_MIG_FLAG_DEV_SETUP_STATE (0xef13ULL) #define VFIO_MIG_FLAG_DEV_DATA_STATE(0xef14ULL) +static int64_t bytes_transferred; + static inline int vfio_mig_access(VFIODevice *vbasedev, void *val, int count, off_t off, bool iswrite) { @@ -255,6 +258,7 @@ static int vfio_save_buffer(QEMUFile *f, VFIODevice *vbasedev, uint64_t *size) *size = data_size; } 
+bytes_transferred += data_size; return ret; } @@ -776,6 +780,7 @@ static void vfio_migration_state_notifier(Notifier *notifier, void *data) case MIGRATION_STATUS_CANCELLING: case MIGRATION_STATUS_CANCELLED: case MIGRATION_STATUS_FAILED: +bytes_transferred = 0; ret = vfio_migration_set_state(vbasedev, ~(VFIO_DEVICE_STATE_SAVING | VFIO_DEVICE_STATE_RESUMING), VFIO_DEVICE_STATE_RUNNING); @@ -862,6 +867,11 @@ err: /* -- */ +int64_t vfio_mig_bytes_transferred(void) +{ +return bytes_transferred; +} + int vfio_migration_probe(VFIODevice *vbasedev, Error **errp) { VFIOContainer *container = vbasedev->group->container; diff --git a/include/qemu/vfio-helpers.h b/include/qemu/vfio-helpers.h index 4491c8e1a6e9..7f7a46e6ef2d 100644 --- a/include/qemu/vfio-helpers.h +++ b/include/qemu/vfio-helpers.h @@ -29,4 +29,7 @@ void qemu_vfio_pci_unmap_bar(QEMUVFIOState *s, int index, void *bar, int qemu_vfio_pci_init_irq(QEMUVFIOState *s, EventNotifier *e, int irq_type, Error **errp); +bool vfio_mig_active(void); +int64_t vfio_mig_bytes_transferred(void); + #endif I don't think vfio-helpers is the right place for this, this header is specifically for using util/vfio-helpers.c. Would include/hw/vfio/vfio-common.h work? Yes, works with CONFIG_VFIO check. Changing it. diff --git a/migration/migration.c b/migration/migration.c index 0575ecb37953..8b2865d25ef4 100644 --- a/migration/migration.c +++ b/migration/migration.c @@ -56,6 +56,7 @@ #include "net/announce.h" #include "qemu/queue.h" #include "multifd.h" +#include "qemu/vfio-helpers.h" #define MAX_THROTTLE (128 << 20) /* Migration transfer speed throttling */ @@ -1002,6 +1003,17 @@ static void populate_disk_info(MigrationInfo *info) } } +static void populate_vfio_info(MigrationInfo *info) +{ +#ifdef CONFIG_LINUX Use CONFIG_VFIO? I get a build failure on qemu-system-avr /usr/bin/ld: /tmp/tmp.3QbqxgbENl/build/../migration/migration.c:1012: undefined reference to `vfio_mig_bytes_transferred'. Thanks, Ok Changing it. 
Alex +if (vfio_mig_active()) { +info->has_vfio = true; +info->vfio = g_malloc0(sizeof(*info->vfio)); +info->vfio->transferred = vfio_mig_bytes_transfer
Re: [PATCH v27 09/17] vfio: Add load state functions to SaveVMHandlers
On 10/23/2020 1:20 AM, Alex Williamson wrote: On Thu, 22 Oct 2020 16:41:59 +0530 Kirti Wankhede wrote: Sequence during _RESUMING device state: While data for this device is available, repeat below steps: a. read data_offset from where user application should write data. b. write data of data_size to migration region from data_offset. c. write data_size which indicates vendor driver that data is written in staging buffer. For user, data is opaque. User should write data in the same order as received. Signed-off-by: Kirti Wankhede Reviewed-by: Neo Jia Reviewed-by: Dr. David Alan Gilbert --- hw/vfio/migration.c | 192 +++ hw/vfio/trace-events | 3 + 2 files changed, 195 insertions(+) diff --git a/hw/vfio/migration.c b/hw/vfio/migration.c index 5506cef15d88..46d05d230e2a 100644 --- a/hw/vfio/migration.c +++ b/hw/vfio/migration.c @@ -257,6 +257,77 @@ static int vfio_save_buffer(QEMUFile *f, VFIODevice *vbasedev, uint64_t *size) return ret; } +static int vfio_load_buffer(QEMUFile *f, VFIODevice *vbasedev, +uint64_t data_size) +{ +VFIORegion *region = &vbasedev->migration->region; +uint64_t data_offset = 0, size, report_size; +int ret; + +do { +ret = vfio_mig_read(vbasedev, &data_offset, sizeof(data_offset), + region->fd_offset + VFIO_MIG_STRUCT_OFFSET(data_offset)); +if (ret < 0) { +return ret; +} + +if (data_offset + data_size > region->size) { +/* + * If data_size is greater than the data section of migration region + * then iterate the write buffer operation. This case can occur if + * size of migration region at destination is smaller than size of + * migration region at source. 
+ */ +report_size = size = region->size - data_offset; +data_size -= size; +} else { +report_size = size = data_size; +data_size = 0; +} + +trace_vfio_load_state_device_data(vbasedev->name, data_offset, size); + +while (size) { +void *buf; +uint64_t sec_size; +bool buf_alloc = false; + +buf = get_data_section_size(region, data_offset, size, &sec_size); + +if (!buf) { +buf = g_try_malloc(sec_size); +if (!buf) { +error_report("%s: Error allocating buffer ", __func__); +return -ENOMEM; +} +buf_alloc = true; +} + +qemu_get_buffer(f, buf, sec_size); + +if (buf_alloc) { +ret = vfio_mig_write(vbasedev, buf, sec_size, +region->fd_offset + data_offset); +g_free(buf); + +if (ret < 0) { +return ret; +} +} +size -= sec_size; +data_offset += sec_size; +} + +ret = vfio_mig_write(vbasedev, &report_size, sizeof(report_size), +region->fd_offset + VFIO_MIG_STRUCT_OFFSET(data_size)); +if (ret < 0) { +return ret; +} +} while (data_size); + +return 0; +} + static int vfio_update_pending(VFIODevice *vbasedev) { VFIOMigration *migration = vbasedev->migration; @@ -293,6 +364,33 @@ static int vfio_save_device_config_state(QEMUFile *f, void *opaque) return qemu_file_get_error(f); } +static int vfio_load_device_config_state(QEMUFile *f, void *opaque) +{ +VFIODevice *vbasedev = opaque; +uint64_t data; + +if (vbasedev->ops && vbasedev->ops->vfio_load_config) { +int ret; + +ret = vbasedev->ops->vfio_load_config(vbasedev, f); +if (ret) { +error_report("%s: Failed to load device config space", + vbasedev->name); +return ret; +} +} + +data = qemu_get_be64(f); +if (data != VFIO_MIG_FLAG_END_OF_STATE) { +error_report("%s: Failed loading device config space, " + "end flag incorrect 0x%"PRIx64, vbasedev->name, data); +return -EINVAL; +} + +trace_vfio_load_device_config_state(vbasedev->name); +return qemu_file_get_error(f); +} + /* -- */ static int vfio_save_setup(QEMUFile *f, void *opaque) @@ -477,12 +575,106 @@ static int vfio_save_complete_precopy(QEMUFile *f, void *opaque) return ret; } +static int 
vfio_load_setup(QEMUFile *f, void *opaque) +{ +VFIODevice *vbasedev = opaque; +VFIOMigration *migration = vbasedev->migration; +int ret = 0; + +if (migration->region.mmaps) { +ret = vfio_region_mmap(
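The region-size handling at the top of vfio_load_buffer() quoted above reduces to a small sizing rule that can be modeled on its own: if the incoming data_size does not fit in the region's data section, write and report only the part that fits, then loop. (Simplified sketch; the real code re-reads data_offset from the device on each iteration, so it may differ between chunks.)

```c
#include <assert.h>
#include <stdint.h>

/* One iteration of the vfio_load_buffer() sizing decision.
 * Returns the amount written/reported this pass and decrements
 * *data_size by the same amount. */
static uint64_t load_chunk(uint64_t region_size, uint64_t data_offset,
                           uint64_t *data_size)
{
    uint64_t report_size;

    if (data_offset + *data_size > region_size) {
        /* Destination region is smaller than the incoming blob:
         * take only what fits and iterate. */
        report_size = region_size - data_offset;
        *data_size -= report_size;
    } else {
        report_size = *data_size;
        *data_size = 0;
    }
    return report_size;
}
```

This is exactly the case the commit message calls out: the migration region at the destination may be smaller than at the source, so one qemu_get_buffer() blob can require several write-back passes.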
Re: [PATCH v27 14/17] vfio: Dirty page tracking when vIOMMU is enabled
On 10/23/2020 2:07 AM, Alex Williamson wrote: On Thu, 22 Oct 2020 16:42:04 +0530 Kirti Wankhede wrote: When vIOMMU is enabled, register MAP notifier from log_sync when all devices in container are in stop and copy phase of migration. Call replay and get dirty pages from notifier callback. Suggested-by: Alex Williamson Signed-off-by: Kirti Wankhede --- hw/vfio/common.c | 95 --- hw/vfio/trace-events | 1 + include/hw/vfio/vfio-common.h | 1 + 3 files changed, 91 insertions(+), 6 deletions(-) diff --git a/hw/vfio/common.c b/hw/vfio/common.c index 2634387df948..98c2b1f9b190 100644 --- a/hw/vfio/common.c +++ b/hw/vfio/common.c @@ -442,8 +442,8 @@ static bool vfio_listener_skipped_section(MemoryRegionSection *section) } /* Called with rcu_read_lock held. */ -static bool vfio_get_vaddr(IOMMUTLBEntry *iotlb, void **vaddr, - bool *read_only) +static bool vfio_get_xlat_addr(IOMMUTLBEntry *iotlb, void **vaddr, + ram_addr_t *ram_addr, bool *read_only) { MemoryRegion *mr; hwaddr xlat; @@ -474,8 +474,17 @@ static bool vfio_get_vaddr(IOMMUTLBEntry *iotlb, void **vaddr, return false; } -*vaddr = memory_region_get_ram_ptr(mr) + xlat; -*read_only = !writable || mr->readonly; +if (vaddr) { +*vaddr = memory_region_get_ram_ptr(mr) + xlat; +} + +if (ram_addr) { +*ram_addr = memory_region_get_ram_addr(mr) + xlat; +} + +if (read_only) { +*read_only = !writable || mr->readonly; +} return true; } @@ -485,7 +494,6 @@ static void vfio_iommu_map_notify(IOMMUNotifier *n, IOMMUTLBEntry *iotlb) VFIOGuestIOMMU *giommu = container_of(n, VFIOGuestIOMMU, n); VFIOContainer *container = giommu->container; hwaddr iova = iotlb->iova + giommu->iommu_offset; -bool read_only; void *vaddr; int ret; @@ -501,7 +509,9 @@ static void vfio_iommu_map_notify(IOMMUNotifier *n, IOMMUTLBEntry *iotlb) rcu_read_lock(); if ((iotlb->perm & IOMMU_RW) != IOMMU_NONE) { -if (!vfio_get_vaddr(iotlb, &vaddr, &read_only)) { +bool read_only; + +if (!vfio_get_xlat_addr(iotlb, &vaddr, NULL, &read_only)) { goto out; } /* @@ -899,11 
+909,84 @@ err_out: return ret; } +static void vfio_iommu_map_dirty_notify(IOMMUNotifier *n, IOMMUTLBEntry *iotlb) +{ +VFIOGuestIOMMU *giommu = container_of(n, VFIOGuestIOMMU, dirty_notify); +VFIOContainer *container = giommu->container; +hwaddr iova = iotlb->iova + giommu->iommu_offset; +ram_addr_t translated_addr; + +trace_vfio_iommu_map_dirty_notify(iova, iova + iotlb->addr_mask); + +if (iotlb->target_as != &address_space_memory) { +error_report("Wrong target AS \"%s\", only system memory is allowed", + iotlb->target_as->name ? iotlb->target_as->name : "none"); +return; +} + +rcu_read_lock(); + +if (vfio_get_xlat_addr(iotlb, NULL, &translated_addr, NULL)) { +int ret; + +ret = vfio_get_dirty_bitmap(container, iova, iotlb->addr_mask + 1, +translated_addr); +if (ret) { +error_report("vfio_iommu_map_dirty_notify(%p, 0x%"HWADDR_PRIx", " + "0x%"HWADDR_PRIx") = %d (%m)", + container, iova, + iotlb->addr_mask + 1, ret); +} +} + +rcu_read_unlock(); +} + static int vfio_sync_dirty_bitmap(VFIOContainer *container, MemoryRegionSection *section) { ram_addr_t ram_addr; +if (memory_region_is_iommu(section->mr)) { +VFIOGuestIOMMU *giommu; +int ret = 0; + +QLIST_FOREACH(giommu, &container->giommu_list, giommu_next) { +if (MEMORY_REGION(giommu->iommu) == section->mr && +giommu->n.start == section->offset_within_region) { +Int128 llend; +Error *err = NULL; +int idx = memory_region_iommu_attrs_to_index(giommu->iommu, + MEMTXATTRS_UNSPECIFIED); + +llend = int128_add(int128_make64(section->offset_within_region), + section->size); +llend = int128_sub(llend, int128_one()); + +iommu_notifier_init(&giommu->dirty_notify, +vfio_iommu_map_dirty_notify, +IOMMU_NOTIFIER_MAP, +section->offset_within_region, +int128_get64(llend), +
Re: [PATCH v27 07/17] vfio: Register SaveVMHandlers for VFIO device
On 10/23/2020 12:21 AM, Alex Williamson wrote: On Thu, 22 Oct 2020 16:41:57 +0530 Kirti Wankhede wrote: Define flags to be used as delimiter in migration stream for VFIO devices. Added .save_setup and .save_cleanup functions. Map & unmap migration region from these functions at source during saving or pre-copy phase. Set VFIO device state depending on VM's state. During live migration, VM is running when .save_setup is called, _SAVING | _RUNNING state is set for VFIO device. During save-restore, VM is paused, _SAVING state is set for VFIO device. Signed-off-by: Kirti Wankhede Reviewed-by: Neo Jia --- hw/vfio/migration.c | 96 hw/vfio/trace-events | 2 ++ 2 files changed, 98 insertions(+) diff --git a/hw/vfio/migration.c b/hw/vfio/migration.c index 7c4fa0d08ea6..2e1054bf7f43 100644 --- a/hw/vfio/migration.c +++ b/hw/vfio/migration.c @@ -8,12 +8,15 @@ */ #include "qemu/osdep.h" +#include "qemu/main-loop.h" +#include "qemu/cutils.h" #include #include "sysemu/runstate.h" #include "hw/vfio/vfio-common.h" #include "cpu.h" #include "migration/migration.h" +#include "migration/vmstate.h" #include "migration/qemu-file.h" #include "migration/register.h" #include "migration/blocker.h" @@ -25,6 +28,22 @@ #include "trace.h" #include "hw/hw.h" +/* + * Flags to be used as unique delimiters for VFIO devices in the migration + * stream. These flags are composed as: + * 0x => MSB 32-bit all 1s + * 0xef10 => Magic ID, represents emulated (virtual) function IO + * 0x => 16-bits reserved for flags + * + * The beginning of state information is marked by _DEV_CONFIG_STATE, + * _DEV_SETUP_STATE, or _DEV_DATA_STATE, respectively. The end of a + * certain state information is marked by _END_OF_STATE. 
+ */ +#define VFIO_MIG_FLAG_END_OF_STATE (0xef11ULL) +#define VFIO_MIG_FLAG_DEV_CONFIG_STATE (0xef12ULL) +#define VFIO_MIG_FLAG_DEV_SETUP_STATE (0xef13ULL) +#define VFIO_MIG_FLAG_DEV_DATA_STATE(0xef14ULL) + static inline int vfio_mig_access(VFIODevice *vbasedev, void *val, int count, off_t off, bool iswrite) { @@ -129,6 +148,69 @@ static int vfio_migration_set_state(VFIODevice *vbasedev, uint32_t mask, return 0; } +/* -- */ + +static int vfio_save_setup(QEMUFile *f, void *opaque) +{ +VFIODevice *vbasedev = opaque; +VFIOMigration *migration = vbasedev->migration; +int ret; + +trace_vfio_save_setup(vbasedev->name); + +qemu_put_be64(f, VFIO_MIG_FLAG_DEV_SETUP_STATE); + +if (migration->region.mmaps) { +/* + * vfio_region_mmap() called from migration thread. Memory API called + * from vfio_regio_mmap() need it when called from outdide the main loop + * thread. + */ Thanks for adding this detail, maybe refine slightly as: Calling vfio_region_mmap() from migration thread. Memory APIs called from this function require locking the iothread when called from outside the main loop thread. Does that capture the intent? Ok. 
+qemu_mutex_lock_iothread(); +ret = vfio_region_mmap(&migration->region); +qemu_mutex_unlock_iothread(); +if (ret) { +error_report("%s: Failed to mmap VFIO migration region: %s", + vbasedev->name, strerror(-ret)); +error_report("%s: Falling back to slow path", vbasedev->name); +} +} + +ret = vfio_migration_set_state(vbasedev, VFIO_DEVICE_STATE_MASK, + VFIO_DEVICE_STATE_SAVING); +if (ret) { +error_report("%s: Failed to set state SAVING", vbasedev->name); +return ret; +} + +qemu_put_be64(f, VFIO_MIG_FLAG_END_OF_STATE); + +ret = qemu_file_get_error(f); +if (ret) { +return ret; +} + +return 0; +} + +static void vfio_save_cleanup(void *opaque) +{ +VFIODevice *vbasedev = opaque; +VFIOMigration *migration = vbasedev->migration; + +if (migration->region.mmaps) { +vfio_region_unmap(&migration->region); +} Are we in a different thread context here that we don't need that same iothread locking? qemu_savevm_state_setup() is called without holding iothread lock and qemu_savevm_state_cleanup() is called holding iothread lock, so we don't need lock here. +trace_vfio_save_cleanup(vbasedev->name); +} + +static SaveVMHandlers savevm_vfio_handlers = { +.save_setup = vfio_save_setup, +.save_cleanup = vfio_save_cleanup, +}; + +/* --
Re: [PATCH v27 05/17] vfio: Add VM state change handler to know state of VM
On 10/22/2020 10:05 PM, Alex Williamson wrote: On Thu, 22 Oct 2020 16:41:55 +0530 Kirti Wankhede wrote: VM state change handler is called on change in VM's state. Based on VM state, VFIO device state should be changed. Added read/write helper functions for migration region. Added function to set device_state. Signed-off-by: Kirti Wankhede Reviewed-by: Neo Jia Reviewed-by: Dr. David Alan Gilbert --- hw/vfio/migration.c | 158 ++ hw/vfio/trace-events | 2 + include/hw/vfio/vfio-common.h | 4 ++ 3 files changed, 164 insertions(+) diff --git a/hw/vfio/migration.c b/hw/vfio/migration.c index 5f74a3ad1d72..34f39c7e2e28 100644 --- a/hw/vfio/migration.c +++ b/hw/vfio/migration.c @@ -10,6 +10,7 @@ #include "qemu/osdep.h" #include +#include "sysemu/runstate.h" #include "hw/vfio/vfio-common.h" #include "cpu.h" #include "migration/migration.h" @@ -22,6 +23,157 @@ #include "exec/ram_addr.h" #include "pci.h" #include "trace.h" +#include "hw/hw.h" + +static inline int vfio_mig_access(VFIODevice *vbasedev, void *val, int count, + off_t off, bool iswrite) +{ +int ret; + +ret = iswrite ? pwrite(vbasedev->fd, val, count, off) : +pread(vbasedev->fd, val, count, off); +if (ret < count) { +error_report("vfio_mig_%s %d byte %s: failed at offset 0x%lx, err: %s", + iswrite ? "write" : "read", count, + vbasedev->name, off, strerror(errno)); +return (ret < 0) ? 
ret : -EINVAL; +} +return 0; +} + +static int vfio_mig_rw(VFIODevice *vbasedev, __u8 *buf, size_t count, + off_t off, bool iswrite) +{ +int ret, done = 0; +__u8 *tbuf = buf; + +while (count) { +int bytes = 0; + +if (count >= 8 && !(off % 8)) { +bytes = 8; +} else if (count >= 4 && !(off % 4)) { +bytes = 4; +} else if (count >= 2 && !(off % 2)) { +bytes = 2; +} else { +bytes = 1; +} + +ret = vfio_mig_access(vbasedev, tbuf, bytes, off, iswrite); +if (ret) { +return ret; +} + +count -= bytes; +done += bytes; +off += bytes; +tbuf += bytes; +} +return done; +} + +#define vfio_mig_read(f, v, c, o) vfio_mig_rw(f, (__u8 *)v, c, o, false) +#define vfio_mig_write(f, v, c, o) vfio_mig_rw(f, (__u8 *)v, c, o, true) + +#define VFIO_MIG_STRUCT_OFFSET(f) \ + offsetof(struct vfio_device_migration_info, f) +/* + * Change the device_state register for device @vbasedev. Bits set in @mask + * are preserved, bits set in @value are set, and bits not set in either @mask + * or @value are cleared in device_state. If the register cannot be accessed, + * the resulting state would be invalid, or the device enters an error state, + * an error is returned. 
+ */ + +static int vfio_migration_set_state(VFIODevice *vbasedev, uint32_t mask, +uint32_t value) +{ +VFIOMigration *migration = vbasedev->migration; +VFIORegion *region = &migration->region; +off_t dev_state_off = region->fd_offset + + VFIO_MIG_STRUCT_OFFSET(device_state); +uint32_t device_state; +int ret; + +ret = vfio_mig_read(vbasedev, &device_state, sizeof(device_state), +dev_state_off); +if (ret < 0) { +return ret; +} + +device_state = (device_state & mask) | value; + +if (!VFIO_DEVICE_STATE_VALID(device_state)) { +return -EINVAL; +} + +ret = vfio_mig_write(vbasedev, &device_state, sizeof(device_state), + dev_state_off); +if (ret < 0) { +int rret; + +rret = vfio_mig_read(vbasedev, &device_state, sizeof(device_state), + dev_state_off); + +if ((rret < 0) || (VFIO_DEVICE_STATE_IS_ERROR(device_state))) { +hw_error("%s: Device in error state 0x%x", vbasedev->name, + device_state); +return rret ? rret : -EIO; +} +return ret; +} + +migration->device_state = device_state; +trace_vfio_migration_set_state(vbasedev->name, device_state); +return 0; +} + +static void vfio_vmstate_change(void *opaque, int running, RunState state) +{ +VFIODevice *vbasedev = opaque; +VFIOMigration *migration = vbasedev->migration; +uint32_t value, mask; +int ret; + +if ((vbasedev->migration->vm_running == running)) { +return; +} + +if (running) { +/* + * Here device state can have one of _SAVING, _RESUMING or _STOP
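The access-width selection inside vfio_mig_rw() quoted above — pick the largest naturally aligned width for each step of the register read/write loop — can be pulled out and checked by itself (helper name is mine; the real function then calls pread/pwrite with that width):

```c
#include <assert.h>
#include <stddef.h>

/* Width selection from vfio_mig_rw(): use 8/4/2/1-byte accesses,
 * constrained by both the remaining byte count and the alignment of
 * the current offset. */
static int access_width(size_t count, long off)
{
    if (count >= 8 && !(off % 8)) {
        return 8;
    }
    if (count >= 4 && !(off % 4)) {
        return 4;
    }
    if (count >= 2 && !(off % 2)) {
        return 2;
    }
    return 1;
}
```

Starting misaligned forces small accesses until the offset reaches a natural boundary, after which the loop proceeds in the widest strides the remaining count allows.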
Re: [PATCH v27 04/17] vfio: Add migration region initialization and finalize function
On 10/22/2020 7:52 PM, Alex Williamson wrote: On Thu, 22 Oct 2020 16:41:54 +0530 Kirti Wankhede wrote: Whether the VFIO device supports migration or not is decided based on the migration region query. If the migration region query and migration region initialization both succeed, migration is supported; otherwise migration is blocked. Signed-off-by: Kirti Wankhede Reviewed-by: Neo Jia Acked-by: Dr. David Alan Gilbert --- hw/vfio/meson.build | 1 + hw/vfio/migration.c | 129 ++ hw/vfio/trace-events | 3 + include/hw/vfio/vfio-common.h | 9 +++ 4 files changed, 142 insertions(+) create mode 100644 hw/vfio/migration.c diff --git a/hw/vfio/meson.build b/hw/vfio/meson.build index 37efa74018bc..da9af297a0c5 100644 --- a/hw/vfio/meson.build +++ b/hw/vfio/meson.build @@ -2,6 +2,7 @@ vfio_ss = ss.source_set() vfio_ss.add(files( 'common.c', 'spapr.c', + 'migration.c', )) vfio_ss.add(when: 'CONFIG_VFIO_PCI', if_true: files( 'display.c', diff --git a/hw/vfio/migration.c b/hw/vfio/migration.c new file mode 100644 index ..5f74a3ad1d72 --- /dev/null +++ b/hw/vfio/migration.c @@ -0,0 +1,129 @@ +/* + * Migration support for VFIO devices + * + * Copyright NVIDIA, Inc. 2020 + * + * This work is licensed under the terms of the GNU GPL, version 2. See + * the COPYING file in the top-level directory. 
+ */ + +#include "qemu/osdep.h" +#include <linux/vfio.h> + +#include "hw/vfio/vfio-common.h" +#include "cpu.h" +#include "migration/migration.h" +#include "migration/qemu-file.h" +#include "migration/register.h" +#include "migration/blocker.h" +#include "migration/misc.h" +#include "qapi/error.h" +#include "exec/ramlist.h" +#include "exec/ram_addr.h" +#include "pci.h" +#include "trace.h" + +static void vfio_migration_region_exit(VFIODevice *vbasedev) +{ +VFIOMigration *migration = vbasedev->migration; + +if (!migration) { +return; +} + +if (migration->region.size) { +vfio_region_exit(&migration->region); +vfio_region_finalize(&migration->region); +} +} + +static int vfio_migration_init(VFIODevice *vbasedev, + struct vfio_region_info *info) +{ +int ret; +Object *obj; +VFIOMigration *migration; + +if (!vbasedev->ops->vfio_get_object) { +return -EINVAL; +} + +obj = vbasedev->ops->vfio_get_object(vbasedev); +if (!obj) { +return -EINVAL; +} + +migration = g_new0(VFIOMigration, 1); + +ret = vfio_region_setup(obj, vbasedev, &migration->region, +info->index, "migration"); +if (ret) { +error_report("%s: Failed to setup VFIO migration region %d: %s", + vbasedev->name, info->index, strerror(-ret)); +goto err; +} + +if (!migration->region.size) { +error_report("%s: Invalid zero-sized VFIO migration region %d", + vbasedev->name, info->index); +ret = -EINVAL; +goto err; +} + +vbasedev->migration = migration; +return 0; + +err: +vfio_migration_region_exit(vbasedev); We can't get here with vbasedev->migration set, did you intend to set vbasedev->migration before testing region.size? Thanks, Oh yes, I missed addressing this when I moved the migration variable to VFIODevice. Moving vbasedev->migration before the region.size check. Also removing the region.size check in vfio_migration_region_exit(), since that is handled by vfio_region_exit() and vfio_region_finalize(). 
Thanks, Kirti Alex +g_free(migration); +return ret; +} + +/* -- */ + +int vfio_migration_probe(VFIODevice *vbasedev, Error **errp) +{ +struct vfio_region_info *info = NULL; +Error *local_err = NULL; +int ret; + +ret = vfio_get_dev_region_info(vbasedev, VFIO_REGION_TYPE_MIGRATION, + VFIO_REGION_SUBTYPE_MIGRATION, &info); +if (ret) { +goto add_blocker; +} + +ret = vfio_migration_init(vbasedev, info); +if (ret) { +goto add_blocker; +} + +trace_vfio_migration_probe(vbasedev->name, info->index); +g_free(info); +return 0; + +add_blocker: +error_setg(&vbasedev->migration_blocker, + "VFIO device doesn't support migration"); +g_free(info); + +ret = migrate_add_blocker(vbasedev->migration_blocker, &local_err); +if (local_err) { +error_propagate(errp, local_err); +error_free(vbasedev->migration_blocker); +vbasedev->migration_blocker = NULL; +} +return ret; +} + +void vfio_migratio
Re: [PATCH v27 03/17] vfio: Add save and load functions for VFIO PCI devices
On 10/22/2020 7:36 PM, Alex Williamson wrote: On Thu, 22 Oct 2020 16:41:53 +0530 Kirti Wankhede wrote: Added functions to save and restore PCI device specific data, specifically config space of PCI device. Signed-off-by: Kirti Wankhede Reviewed-by: Neo Jia --- hw/vfio/pci.c | 48 +++ include/hw/vfio/vfio-common.h | 2 ++ 2 files changed, 50 insertions(+) diff --git a/hw/vfio/pci.c b/hw/vfio/pci.c index bffd5bfe3b78..1036a5332772 100644 --- a/hw/vfio/pci.c +++ b/hw/vfio/pci.c @@ -41,6 +41,7 @@ #include "trace.h" #include "qapi/error.h" #include "migration/blocker.h" +#include "migration/qemu-file.h" #define TYPE_VFIO_PCI_NOHOTPLUG "vfio-pci-nohotplug" @@ -2401,11 +2402,58 @@ static Object *vfio_pci_get_object(VFIODevice *vbasedev) return OBJECT(vdev); } +static bool vfio_msix_enabled(void *opaque, int version_id) +{ +PCIDevice *pdev = opaque; + +return msix_enabled(pdev); Why msix_enabled() rather than msix_present()? It seems that even if MSI-X is not enabled at the point in time where this is called, there's still emulated state in the vector table. For example if the guest has written the vectors but has not yet enabled the capability at the point where we start a migration, this test might cause the guest on the target to enable MSI-X with uninitialized data in the vector table. You're correct. Changing it to check if present. +} + +const VMStateDescription vmstate_vfio_pci_config = { +.name = "VFIOPCIDevice", +.version_id = 1, +.minimum_version_id = 1, +.fields = (VMStateField[]) { +VMSTATE_PCI_DEVICE(pdev, VFIOPCIDevice), +VMSTATE_MSIX_TEST(pdev, VFIOPCIDevice, vfio_msix_enabled), MSI (not-X) state is entirely in config space, so doesn't need a separate field, correct? Yes. Otherwise this looks quite a bit cleaner than previous version, I hope VMState experts can confirm this is sufficiently extensible within the migration framework. 
Thanks, Thanks, Kirti Alex +VMSTATE_END_OF_LIST() +} +}; + +static void vfio_pci_save_config(VFIODevice *vbasedev, QEMUFile *f) +{ +VFIOPCIDevice *vdev = container_of(vbasedev, VFIOPCIDevice, vbasedev); + +vmstate_save_state(f, &vmstate_vfio_pci_config, vdev, NULL); +} + +static int vfio_pci_load_config(VFIODevice *vbasedev, QEMUFile *f) +{ +VFIOPCIDevice *vdev = container_of(vbasedev, VFIOPCIDevice, vbasedev); +PCIDevice *pdev = &vdev->pdev; +int ret; + +ret = vmstate_load_state(f, &vmstate_vfio_pci_config, vdev, 1); +if (ret) { +return ret; +} + +if (msi_enabled(pdev)) { +vfio_msi_enable(vdev); +} else if (msix_enabled(pdev)) { +vfio_msix_enable(vdev); +} + +return ret; +} + static VFIODeviceOps vfio_pci_ops = { .vfio_compute_needs_reset = vfio_pci_compute_needs_reset, .vfio_hot_reset_multi = vfio_pci_hot_reset_multi, .vfio_eoi = vfio_intx_eoi, .vfio_get_object = vfio_pci_get_object, +.vfio_save_config = vfio_pci_save_config, +.vfio_load_config = vfio_pci_load_config, }; int vfio_populate_vga(VFIOPCIDevice *vdev, Error **errp) diff --git a/include/hw/vfio/vfio-common.h b/include/hw/vfio/vfio-common.h index fe99c36a693a..ba6169cd926e 100644 --- a/include/hw/vfio/vfio-common.h +++ b/include/hw/vfio/vfio-common.h @@ -120,6 +120,8 @@ struct VFIODeviceOps { int (*vfio_hot_reset_multi)(VFIODevice *vdev); void (*vfio_eoi)(VFIODevice *vdev); Object *(*vfio_get_object)(VFIODevice *vdev); +void (*vfio_save_config)(VFIODevice *vdev, QEMUFile *f); +int (*vfio_load_config)(VFIODevice *vdev, QEMUFile *f); }; typedef struct VFIOGroup {
Re: [PATCH v26 05/17] vfio: Add VM state change handler to know state of VM
On 10/22/2020 1:21 PM, Cornelia Huck wrote: On Wed, 21 Oct 2020 11:03:23 +0530 Kirti Wankhede wrote: On 10/20/2020 4:21 PM, Cornelia Huck wrote: On Sun, 18 Oct 2020 01:54:56 +0530 Kirti Wankhede wrote: On 9/29/2020 4:33 PM, Dr. David Alan Gilbert wrote: * Cornelia Huck (coh...@redhat.com) wrote: On Wed, 23 Sep 2020 04:54:07 +0530 Kirti Wankhede wrote: +static void vfio_vmstate_change(void *opaque, int running, RunState state) +{ +VFIODevice *vbasedev = opaque; + +if ((vbasedev->vm_running != running)) { +int ret; +uint32_t value = 0, mask = 0; + +if (running) { +value = VFIO_DEVICE_STATE_RUNNING; +if (vbasedev->device_state & VFIO_DEVICE_STATE_RESUMING) { +mask = ~VFIO_DEVICE_STATE_RESUMING; I've been staring at this for some time and I think that the desired result is - set _RUNNING - if _RESUMING was set, clear it, but leave the other bits intact Up to here, you're correct. - if _RESUMING was not set, clear everything previously set This would really benefit from a comment (or am I the only one struggling here?) Here mask should be ~0. Correcting it. Hm, now I'm confused. With value == _RUNNING, ~_RUNNING and ~0 as mask should be equivalent, shouldn't they? I too got confused after reading your comment. Let's walk through the device states and the transitions that can happen here: if running - device state could be either _SAVING or _RESUMING or _STOP. _SAVING and _RESUMING can't both be set at the same time; that is an error state. _STOP means 0. - Transition from _SAVING to _RUNNING can happen if there is migration failure; in that case we have to clear _SAVING - Transition from _RESUMING to _RUNNING can happen on resuming and we have to clear _RESUMING. - In both the above cases, we have to set _RUNNING and clear the other two bits. Then: mask = ~VFIO_DEVICE_STATE_MASK; value = VFIO_DEVICE_STATE_RUNNING; ok if !running - device state could be either _RUNNING or _SAVING|_RUNNING. Here we have to clear the _RUNNING bit. 
Then: mask = ~VFIO_DEVICE_STATE_RUNNING; value = 0; ok I'll add a comment in the code above. That will help. I'm a bit worried though that all the reasoning about which flags are set or cleared when is quite complex, and it's easy to make mistakes. Can we model this as an FSM, where an event (running state changes) transitions the device state from one state to another? I (personally) find FSMs easier to comprehend, but I'm not sure whether that change would be too invasive. If others can parse the state changes with that mask/value interface, I won't object to it. I agree an FSM would be easier and, in the long term, may be easier to maintain. But at this moment it would be an intrusive change. For now we can go ahead with this code and later we can change to an FSM model if all agree on it. Thanks, Kirti +} +} else { +mask = ~VFIO_DEVICE_STATE_RUNNING; +}
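The mask/value pairs the thread converges on can be checked in isolation. The bit definitions below mirror the v1 migration protocol in <linux/vfio.h>; next_state() is a hypothetical helper standing in for the read-modify-write that vfio_migration_set_state() performs on the device_state register:

```c
#include <stdint.h>

/* Bit layout of device_state in the v1 VFIO migration protocol. */
#define VFIO_DEVICE_STATE_STOP      (0)
#define VFIO_DEVICE_STATE_RUNNING   (1 << 0)
#define VFIO_DEVICE_STATE_SAVING    (1 << 1)
#define VFIO_DEVICE_STATE_RESUMING  (1 << 2)
#define VFIO_DEVICE_STATE_MASK      (VFIO_DEVICE_STATE_RUNNING | \
                                     VFIO_DEVICE_STATE_SAVING | \
                                     VFIO_DEVICE_STATE_RESUMING)

/* Read-modify-write rule applied by vfio_migration_set_state():
 * bits in @mask are preserved, bits in @value are set, the rest clear. */
static inline uint32_t next_state(uint32_t cur, uint32_t mask, uint32_t value)
{
    return (cur & mask) | value;
}
```

With mask = ~VFIO_DEVICE_STATE_MASK and value = _RUNNING, a device left in _SAVING (failed migration) or _RESUMING (completed resume) lands in plain _RUNNING; with mask = ~_RUNNING and value = 0, a device in _SAVING | _RUNNING drops to _SAVING, covering both cases Kirti walks through.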
[PATCH v27 17/17] qapi: Add VFIO devices migration stats in Migration stats
Added amount of bytes transferred to the VM at destination by all VFIO devices Signed-off-by: Kirti Wankhede Reviewed-by: Dr. David Alan Gilbert --- hw/vfio/common.c| 20 hw/vfio/migration.c | 10 ++ include/qemu/vfio-helpers.h | 3 +++ migration/migration.c | 14 ++ monitor/hmp-cmds.c | 6 ++ qapi/migration.json | 17 + 6 files changed, 70 insertions(+) diff --git a/hw/vfio/common.c b/hw/vfio/common.c index 9c879e5c0f62..8d0758eda9fa 100644 --- a/hw/vfio/common.c +++ b/hw/vfio/common.c @@ -39,6 +39,7 @@ #include "trace.h" #include "qapi/error.h" #include "migration/migration.h" +#include "qemu/vfio-helpers.h" VFIOGroupList vfio_group_list = QLIST_HEAD_INITIALIZER(vfio_group_list); @@ -292,6 +293,25 @@ const MemoryRegionOps vfio_region_ops = { * Device state interfaces */ +bool vfio_mig_active(void) +{ +VFIOGroup *group; +VFIODevice *vbasedev; + +if (QLIST_EMPTY(&vfio_group_list)) { +return false; +} + +QLIST_FOREACH(group, &vfio_group_list, next) { +QLIST_FOREACH(vbasedev, &group->device_list, next) { +if (vbasedev->migration_blocker) { +return false; +} +} +} +return true; +} + static bool vfio_devices_all_stopped_and_saving(VFIOContainer *container) { VFIOGroup *group; diff --git a/hw/vfio/migration.c b/hw/vfio/migration.c index 77ee60a43ea5..b23e21c6de2b 100644 --- a/hw/vfio/migration.c +++ b/hw/vfio/migration.c @@ -28,6 +28,7 @@ #include "pci.h" #include "trace.h" #include "hw/hw.h" +#include "qemu/vfio-helpers.h" /* * Flags to be used as unique delimiters for VFIO devices in the migration @@ -45,6 +46,8 @@ #define VFIO_MIG_FLAG_DEV_SETUP_STATE (0xef13ULL) #define VFIO_MIG_FLAG_DEV_DATA_STATE(0xef14ULL) +static int64_t bytes_transferred; + static inline int vfio_mig_access(VFIODevice *vbasedev, void *val, int count, off_t off, bool iswrite) { @@ -255,6 +258,7 @@ static int vfio_save_buffer(QEMUFile *f, VFIODevice *vbasedev, uint64_t *size) *size = data_size; } +bytes_transferred += data_size; return ret; } @@ -776,6 +780,7 @@ static void 
vfio_migration_state_notifier(Notifier *notifier, void *data) case MIGRATION_STATUS_CANCELLING: case MIGRATION_STATUS_CANCELLED: case MIGRATION_STATUS_FAILED: +bytes_transferred = 0; ret = vfio_migration_set_state(vbasedev, ~(VFIO_DEVICE_STATE_SAVING | VFIO_DEVICE_STATE_RESUMING), VFIO_DEVICE_STATE_RUNNING); @@ -862,6 +867,11 @@ err: /* -- */ +int64_t vfio_mig_bytes_transferred(void) +{ +return bytes_transferred; +} + int vfio_migration_probe(VFIODevice *vbasedev, Error **errp) { VFIOContainer *container = vbasedev->group->container; diff --git a/include/qemu/vfio-helpers.h b/include/qemu/vfio-helpers.h index 4491c8e1a6e9..7f7a46e6ef2d 100644 --- a/include/qemu/vfio-helpers.h +++ b/include/qemu/vfio-helpers.h @@ -29,4 +29,7 @@ void qemu_vfio_pci_unmap_bar(QEMUVFIOState *s, int index, void *bar, int qemu_vfio_pci_init_irq(QEMUVFIOState *s, EventNotifier *e, int irq_type, Error **errp); +bool vfio_mig_active(void); +int64_t vfio_mig_bytes_transferred(void); + #endif diff --git a/migration/migration.c b/migration/migration.c index 0575ecb37953..8b2865d25ef4 100644 --- a/migration/migration.c +++ b/migration/migration.c @@ -56,6 +56,7 @@ #include "net/announce.h" #include "qemu/queue.h" #include "multifd.h" +#include "qemu/vfio-helpers.h" #define MAX_THROTTLE (128 << 20) /* Migration transfer speed throttling */ @@ -1002,6 +1003,17 @@ static void populate_disk_info(MigrationInfo *info) } } +static void populate_vfio_info(MigrationInfo *info) +{ +#ifdef CONFIG_LINUX +if (vfio_mig_active()) { +info->has_vfio = true; +info->vfio = g_malloc0(sizeof(*info->vfio)); +info->vfio->transferred = vfio_mig_bytes_transferred(); +} +#endif +} + static void fill_source_migration_info(MigrationInfo *info) { MigrationState *s = migrate_get_current(); @@ -1026,6 +1038,7 @@ static void fill_source_migration_info(MigrationInfo *info) populate_time_info(info, s); populate_ram_info(info, s); populate_disk_info(info); +populate_vfio_info(info); break; case MIGRATION_STATUS_COLO: 
info->has_status = true; @@ -1034,6 +1047,7 @@ static void fill_source_migration_info(MigrationInfo *info) case MIGRATION_STATUS_COMPLETED: populate_time_info(info, s); populate_ram_info(inf
[PATCH v27 14/17] vfio: Dirty page tracking when vIOMMU is enabled
When vIOMMU is enabled, register MAP notifier from log_sync when all devices in container are in stop and copy phase of migration. Call replay and get dirty pages from notifier callback. Suggested-by: Alex Williamson Signed-off-by: Kirti Wankhede --- hw/vfio/common.c | 95 --- hw/vfio/trace-events | 1 + include/hw/vfio/vfio-common.h | 1 + 3 files changed, 91 insertions(+), 6 deletions(-) diff --git a/hw/vfio/common.c b/hw/vfio/common.c index 2634387df948..98c2b1f9b190 100644 --- a/hw/vfio/common.c +++ b/hw/vfio/common.c @@ -442,8 +442,8 @@ static bool vfio_listener_skipped_section(MemoryRegionSection *section) } /* Called with rcu_read_lock held. */ -static bool vfio_get_vaddr(IOMMUTLBEntry *iotlb, void **vaddr, - bool *read_only) +static bool vfio_get_xlat_addr(IOMMUTLBEntry *iotlb, void **vaddr, + ram_addr_t *ram_addr, bool *read_only) { MemoryRegion *mr; hwaddr xlat; @@ -474,8 +474,17 @@ static bool vfio_get_vaddr(IOMMUTLBEntry *iotlb, void **vaddr, return false; } -*vaddr = memory_region_get_ram_ptr(mr) + xlat; -*read_only = !writable || mr->readonly; +if (vaddr) { +*vaddr = memory_region_get_ram_ptr(mr) + xlat; +} + +if (ram_addr) { +*ram_addr = memory_region_get_ram_addr(mr) + xlat; +} + +if (read_only) { +*read_only = !writable || mr->readonly; +} return true; } @@ -485,7 +494,6 @@ static void vfio_iommu_map_notify(IOMMUNotifier *n, IOMMUTLBEntry *iotlb) VFIOGuestIOMMU *giommu = container_of(n, VFIOGuestIOMMU, n); VFIOContainer *container = giommu->container; hwaddr iova = iotlb->iova + giommu->iommu_offset; -bool read_only; void *vaddr; int ret; @@ -501,7 +509,9 @@ static void vfio_iommu_map_notify(IOMMUNotifier *n, IOMMUTLBEntry *iotlb) rcu_read_lock(); if ((iotlb->perm & IOMMU_RW) != IOMMU_NONE) { -if (!vfio_get_vaddr(iotlb, &vaddr, &read_only)) { +bool read_only; + +if (!vfio_get_xlat_addr(iotlb, &vaddr, NULL, &read_only)) { goto out; } /* @@ -899,11 +909,84 @@ err_out: return ret; } +static void vfio_iommu_map_dirty_notify(IOMMUNotifier *n, IOMMUTLBEntry 
*iotlb) +{ +VFIOGuestIOMMU *giommu = container_of(n, VFIOGuestIOMMU, dirty_notify); +VFIOContainer *container = giommu->container; +hwaddr iova = iotlb->iova + giommu->iommu_offset; +ram_addr_t translated_addr; + +trace_vfio_iommu_map_dirty_notify(iova, iova + iotlb->addr_mask); + +if (iotlb->target_as != &address_space_memory) { +error_report("Wrong target AS \"%s\", only system memory is allowed", + iotlb->target_as->name ? iotlb->target_as->name : "none"); +return; +} + +rcu_read_lock(); + +if (vfio_get_xlat_addr(iotlb, NULL, &translated_addr, NULL)) { +int ret; + +ret = vfio_get_dirty_bitmap(container, iova, iotlb->addr_mask + 1, +translated_addr); +if (ret) { +error_report("vfio_iommu_map_dirty_notify(%p, 0x%"HWADDR_PRIx", " + "0x%"HWADDR_PRIx") = %d (%m)", + container, iova, + iotlb->addr_mask + 1, ret); +} +} + +rcu_read_unlock(); +} + static int vfio_sync_dirty_bitmap(VFIOContainer *container, MemoryRegionSection *section) { ram_addr_t ram_addr; +if (memory_region_is_iommu(section->mr)) { +VFIOGuestIOMMU *giommu; +int ret = 0; + +QLIST_FOREACH(giommu, &container->giommu_list, giommu_next) { +if (MEMORY_REGION(giommu->iommu) == section->mr && +giommu->n.start == section->offset_within_region) { +Int128 llend; +Error *err = NULL; +int idx = memory_region_iommu_attrs_to_index(giommu->iommu, + MEMTXATTRS_UNSPECIFIED); + +llend = int128_add(int128_make64(section->offset_within_region), + section->size); +llend = int128_sub(llend, int128_one()); + +iommu_notifier_init(&giommu->dirty_notify, +vfio_iommu_map_dirty_notify, +IOMMU_NOTIFIER_MAP, +section->offset_within_region, +int128_get64(llend), +idx); +ret = memory_region_register_iommu_notifier(section->mr, + &gi
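The vfio_get_vaddr() to vfio_get_xlat_addr() refactor in this patch boils down to making each output optional, so the dirty notifier can ask only for the ram_addr while the map/unmap path keeps asking for the host vaddr. A toy model of that shape (block_host and block_ram_addr are stand-ins for the values QEMU derives from the translated MemoryRegion):

```c
#include <stdint.h>

typedef uint64_t hwaddr;
typedef uint64_t ram_addr_t;

/* One IOTLB translation, with each output NULL-able so callers
 * request only what they need, as in vfio_get_xlat_addr(). */
static void xlat_outputs(uint8_t *block_host, ram_addr_t block_ram_addr,
                         hwaddr xlat, uint8_t **vaddr, ram_addr_t *ram_addr)
{
    if (vaddr) {
        *vaddr = block_host + xlat;        /* map/unmap path: host vaddr */
    }
    if (ram_addr) {
        *ram_addr = block_ram_addr + xlat; /* dirty tracking: ram_addr */
    }
}
```

This keeps a single translation routine while letting vfio_iommu_map_dirty_notify() pass NULL for the vaddr it never uses.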
[PATCH v27 05/17] vfio: Add VM state change handler to know state of VM
VM state change handler is called on change in VM's state. Based on VM state, VFIO device state should be changed. Added read/write helper functions for migration region. Added function to set device_state. Signed-off-by: Kirti Wankhede Reviewed-by: Neo Jia Reviewed-by: Dr. David Alan Gilbert --- hw/vfio/migration.c | 158 ++ hw/vfio/trace-events | 2 + include/hw/vfio/vfio-common.h | 4 ++ 3 files changed, 164 insertions(+) diff --git a/hw/vfio/migration.c b/hw/vfio/migration.c index 5f74a3ad1d72..34f39c7e2e28 100644 --- a/hw/vfio/migration.c +++ b/hw/vfio/migration.c @@ -10,6 +10,7 @@ #include "qemu/osdep.h" #include +#include "sysemu/runstate.h" #include "hw/vfio/vfio-common.h" #include "cpu.h" #include "migration/migration.h" @@ -22,6 +23,157 @@ #include "exec/ram_addr.h" #include "pci.h" #include "trace.h" +#include "hw/hw.h" + +static inline int vfio_mig_access(VFIODevice *vbasedev, void *val, int count, + off_t off, bool iswrite) +{ +int ret; + +ret = iswrite ? pwrite(vbasedev->fd, val, count, off) : +pread(vbasedev->fd, val, count, off); +if (ret < count) { +error_report("vfio_mig_%s %d byte %s: failed at offset 0x%lx, err: %s", + iswrite ? "write" : "read", count, + vbasedev->name, off, strerror(errno)); +return (ret < 0) ? 
ret : -EINVAL; +} +return 0; +} + +static int vfio_mig_rw(VFIODevice *vbasedev, __u8 *buf, size_t count, + off_t off, bool iswrite) +{ +int ret, done = 0; +__u8 *tbuf = buf; + +while (count) { +int bytes = 0; + +if (count >= 8 && !(off % 8)) { +bytes = 8; +} else if (count >= 4 && !(off % 4)) { +bytes = 4; +} else if (count >= 2 && !(off % 2)) { +bytes = 2; +} else { +bytes = 1; +} + +ret = vfio_mig_access(vbasedev, tbuf, bytes, off, iswrite); +if (ret) { +return ret; +} + +count -= bytes; +done += bytes; +off += bytes; +tbuf += bytes; +} +return done; +} + +#define vfio_mig_read(f, v, c, o) vfio_mig_rw(f, (__u8 *)v, c, o, false) +#define vfio_mig_write(f, v, c, o) vfio_mig_rw(f, (__u8 *)v, c, o, true) + +#define VFIO_MIG_STRUCT_OFFSET(f) \ + offsetof(struct vfio_device_migration_info, f) +/* + * Change the device_state register for device @vbasedev. Bits set in @mask + * are preserved, bits set in @value are set, and bits not set in either @mask + * or @value are cleared in device_state. If the register cannot be accessed, + * the resulting state would be invalid, or the device enters an error state, + * an error is returned. 
+ */ + +static int vfio_migration_set_state(VFIODevice *vbasedev, uint32_t mask, +uint32_t value) +{ +VFIOMigration *migration = vbasedev->migration; +VFIORegion *region = &migration->region; +off_t dev_state_off = region->fd_offset + + VFIO_MIG_STRUCT_OFFSET(device_state); +uint32_t device_state; +int ret; + +ret = vfio_mig_read(vbasedev, &device_state, sizeof(device_state), +dev_state_off); +if (ret < 0) { +return ret; +} + +device_state = (device_state & mask) | value; + +if (!VFIO_DEVICE_STATE_VALID(device_state)) { +return -EINVAL; +} + +ret = vfio_mig_write(vbasedev, &device_state, sizeof(device_state), + dev_state_off); +if (ret < 0) { +int rret; + +rret = vfio_mig_read(vbasedev, &device_state, sizeof(device_state), + dev_state_off); + +if ((rret < 0) || (VFIO_DEVICE_STATE_IS_ERROR(device_state))) { +hw_error("%s: Device in error state 0x%x", vbasedev->name, + device_state); +return rret ? rret : -EIO; +} +return ret; +} + +migration->device_state = device_state; +trace_vfio_migration_set_state(vbasedev->name, device_state); +return 0; +} + +static void vfio_vmstate_change(void *opaque, int running, RunState state) +{ +VFIODevice *vbasedev = opaque; +VFIOMigration *migration = vbasedev->migration; +uint32_t value, mask; +int ret; + +if ((vbasedev->migration->vm_running == running)) { +return; +} + +if (running) { +/* + * Here device state can have one of _SAVING, _RESUMING or _STOP bit. + * Transition from _SAVING to _RUNNING can happen if there is migration + * failure,
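The chunked register access in vfio_mig_rw() above always picks the widest naturally aligned width before falling back to narrower ones. A minimal sketch of that selection (the helper name is hypothetical):

```c
#include <stddef.h>
#include <sys/types.h>

/* Widest naturally aligned access width for (off, count), mirroring
 * the 8/4/2/1-byte ladder in vfio_mig_rw(). */
static int mig_access_size(size_t count, off_t off)
{
    if (count >= 8 && !(off % 8)) {
        return 8;
    } else if (count >= 4 && !(off % 4)) {
        return 4;
    } else if (count >= 2 && !(off % 2)) {
        return 2;
    }
    return 1;
}
```

A 13-byte access at offset 3, for example, is split into a 1-byte, a 4-byte, and an 8-byte access, so every pread/pwrite issued to the migration region stays naturally aligned.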
[PATCH v27 15/17] vfio: Add ioctl to get dirty pages bitmap during dma unmap
With vIOMMU, IO virtual address range can get unmapped while in pre-copy phase of migration. In that case, unmap ioctl should return pages pinned in that range and QEMU should find its correcponding guest physical addresses and report those dirty. Suggested-by: Alex Williamson Signed-off-by: Kirti Wankhede Reviewed-by: Neo Jia --- hw/vfio/common.c | 96 +--- 1 file changed, 92 insertions(+), 4 deletions(-) diff --git a/hw/vfio/common.c b/hw/vfio/common.c index 98c2b1f9b190..9c879e5c0f62 100644 --- a/hw/vfio/common.c +++ b/hw/vfio/common.c @@ -321,11 +321,94 @@ static bool vfio_devices_all_stopped_and_saving(VFIOContainer *container) return true; } +static bool vfio_devices_all_running_and_saving(VFIOContainer *container) +{ +VFIOGroup *group; +VFIODevice *vbasedev; +MigrationState *ms = migrate_get_current(); + +if (!migration_is_setup_or_active(ms->state)) { +return false; +} + +QLIST_FOREACH(group, &container->group_list, container_next) { +QLIST_FOREACH(vbasedev, &group->device_list, next) { +VFIOMigration *migration = vbasedev->migration; + +if (!migration) { +return false; +} + +if ((migration->device_state & VFIO_DEVICE_STATE_SAVING) && +(migration->device_state & VFIO_DEVICE_STATE_RUNNING)) { +continue; +} else { +return false; +} +} +} +return true; +} + +static int vfio_dma_unmap_bitmap(VFIOContainer *container, + hwaddr iova, ram_addr_t size, + IOMMUTLBEntry *iotlb) +{ +struct vfio_iommu_type1_dma_unmap *unmap; +struct vfio_bitmap *bitmap; +uint64_t pages = TARGET_PAGE_ALIGN(size) >> TARGET_PAGE_BITS; +int ret; + +unmap = g_malloc0(sizeof(*unmap) + sizeof(*bitmap)); + +unmap->argsz = sizeof(*unmap) + sizeof(*bitmap); +unmap->iova = iova; +unmap->size = size; +unmap->flags |= VFIO_DMA_UNMAP_FLAG_GET_DIRTY_BITMAP; +bitmap = (struct vfio_bitmap *)&unmap->data; + +/* + * cpu_physical_memory_set_dirty_lebitmap() expects pages in bitmap of + * TARGET_PAGE_SIZE to mark those dirty. Hence set bitmap_pgsize to + * TARGET_PAGE_SIZE. 
+ */ + +bitmap->pgsize = TARGET_PAGE_SIZE; +bitmap->size = ROUND_UP(pages, sizeof(__u64) * BITS_PER_BYTE) / + BITS_PER_BYTE; + +if (bitmap->size > container->max_dirty_bitmap_size) { +error_report("UNMAP: Size of bitmap too big 0x%llx", bitmap->size); +ret = -E2BIG; +goto unmap_exit; +} + +bitmap->data = g_try_malloc0(bitmap->size); +if (!bitmap->data) { +ret = -ENOMEM; +goto unmap_exit; +} + +ret = ioctl(container->fd, VFIO_IOMMU_UNMAP_DMA, unmap); +if (!ret) { +cpu_physical_memory_set_dirty_lebitmap((uint64_t *)bitmap->data, +iotlb->translated_addr, pages); +} else { +error_report("VFIO_UNMAP_DMA with DIRTY_BITMAP : %m"); +} + +g_free(bitmap->data); +unmap_exit: +g_free(unmap); +return ret; +} + /* * DMA - Mapping and unmapping for the "type1" IOMMU interface used on x86 */ static int vfio_dma_unmap(VFIOContainer *container, - hwaddr iova, ram_addr_t size) + hwaddr iova, ram_addr_t size, + IOMMUTLBEntry *iotlb) { struct vfio_iommu_type1_dma_unmap unmap = { .argsz = sizeof(unmap), @@ -334,6 +417,11 @@ static int vfio_dma_unmap(VFIOContainer *container, .size = size, }; +if (iotlb && container->dirty_pages_supported && +vfio_devices_all_running_and_saving(container)) { +return vfio_dma_unmap_bitmap(container, iova, size, iotlb); +} + while (ioctl(container->fd, VFIO_IOMMU_UNMAP_DMA, &unmap)) { /* * The type1 backend has an off-by-one bug in the kernel (71a7d3d78e3c @@ -381,7 +469,7 @@ static int vfio_dma_map(VFIOContainer *container, hwaddr iova, * the VGA ROM space. */ if (ioctl(container->fd, VFIO_IOMMU_MAP_DMA, &map) == 0 || -(errno == EBUSY && vfio_dma_unmap(container, iova, size) == 0 && +(errno == EBUSY && vfio_dma_unmap(container, iova, size, NULL) == 0 && ioctl(container->fd, VFIO_IOMMU_MAP_DMA, &map) == 0)) { return 0; } @@ -531,7 +619,7 @@ static void vfio_iommu_map_notify(IOMMUNotifier *n, IOMMUTLBEntry *iotlb) iotlb->addr_mask + 1, vaddr, ret); } } else { -ret = vfio_dma_unmap(container, iova, iotlb->addr_mask + 1); +ret = vfio_dma_unmap(contain
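The bitmap sizing used by vfio_dma_unmap_bitmap() above (one bit per TARGET_PAGE_SIZE page, rounded up to whole 64-bit words) can be checked on its own; this sketch assumes 4 KiB target pages:

```c
#include <stdint.h>

#define TARGET_PAGE_BITS 12                     /* assumes 4 KiB pages */
#define TARGET_PAGE_SIZE (1ULL << TARGET_PAGE_BITS)
#define BITS_PER_BYTE    8

/* Bitmap bytes needed for a range: one bit per page, rounded up to a
 * whole __u64 word, as in vfio_dma_unmap_bitmap() and
 * vfio_get_dirty_bitmap(). */
static uint64_t dirty_bitmap_bytes(uint64_t range_size)
{
    uint64_t pages = (range_size + TARGET_PAGE_SIZE - 1) >> TARGET_PAGE_BITS;
    uint64_t word_bits = sizeof(uint64_t) * BITS_PER_BYTE;  /* 64 */
    uint64_t words = (pages + word_bits - 1) / word_bits;
    return words * sizeof(uint64_t);
}
```

So a 1 MiB unmap needs a 32-byte bitmap (256 pages rounded up to 256 bits), and even a single-page unmap allocates a full 8-byte word.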
[PATCH v27 13/17] vfio: Add vfio_listener_log_sync to mark dirty pages
vfio_listener_log_sync gets list of dirty pages from container using VFIO_IOMMU_GET_DIRTY_BITMAP ioctl and mark those pages dirty when all devices are stopped and saving state. Return early for the RAM block section of mapped MMIO region. Signed-off-by: Kirti Wankhede Reviewed-by: Neo Jia --- hw/vfio/common.c | 116 +++ hw/vfio/trace-events | 1 + 2 files changed, 117 insertions(+) diff --git a/hw/vfio/common.c b/hw/vfio/common.c index d4959c036dd1..2634387df948 100644 --- a/hw/vfio/common.c +++ b/hw/vfio/common.c @@ -29,6 +29,7 @@ #include "hw/vfio/vfio.h" #include "exec/address-spaces.h" #include "exec/memory.h" +#include "exec/ram_addr.h" #include "hw/hw.h" #include "qemu/error-report.h" #include "qemu/main-loop.h" @@ -37,6 +38,7 @@ #include "sysemu/reset.h" #include "trace.h" #include "qapi/error.h" +#include "migration/migration.h" VFIOGroupList vfio_group_list = QLIST_HEAD_INITIALIZER(vfio_group_list); @@ -287,6 +289,39 @@ const MemoryRegionOps vfio_region_ops = { }; /* + * Device state interfaces + */ + +static bool vfio_devices_all_stopped_and_saving(VFIOContainer *container) +{ +VFIOGroup *group; +VFIODevice *vbasedev; +MigrationState *ms = migrate_get_current(); + +if (!migration_is_setup_or_active(ms->state)) { +return false; +} + +QLIST_FOREACH(group, &container->group_list, container_next) { +QLIST_FOREACH(vbasedev, &group->device_list, next) { +VFIOMigration *migration = vbasedev->migration; + +if (!migration) { +return false; +} + +if ((migration->device_state & VFIO_DEVICE_STATE_SAVING) && +!(migration->device_state & VFIO_DEVICE_STATE_RUNNING)) { +continue; +} else { +return false; +} +} +} +return true; +} + +/* * DMA - Mapping and unmapping for the "type1" IOMMU interface used on x86 */ static int vfio_dma_unmap(VFIOContainer *container, @@ -812,9 +847,90 @@ static void vfio_listener_region_del(MemoryListener *listener, } } +static int vfio_get_dirty_bitmap(VFIOContainer *container, uint64_t iova, + uint64_t size, ram_addr_t ram_addr) +{ +struct 
vfio_iommu_type1_dirty_bitmap *dbitmap; +struct vfio_iommu_type1_dirty_bitmap_get *range; +uint64_t pages; +int ret; + +dbitmap = g_malloc0(sizeof(*dbitmap) + sizeof(*range)); + +dbitmap->argsz = sizeof(*dbitmap) + sizeof(*range); +dbitmap->flags = VFIO_IOMMU_DIRTY_PAGES_FLAG_GET_BITMAP; +range = (struct vfio_iommu_type1_dirty_bitmap_get *)&dbitmap->data; +range->iova = iova; +range->size = size; + +/* + * cpu_physical_memory_set_dirty_lebitmap() expects pages in bitmap of + * TARGET_PAGE_SIZE to mark those dirty. Hence set bitmap's pgsize to + * TARGET_PAGE_SIZE. + */ +range->bitmap.pgsize = TARGET_PAGE_SIZE; + +pages = TARGET_PAGE_ALIGN(range->size) >> TARGET_PAGE_BITS; +range->bitmap.size = ROUND_UP(pages, sizeof(__u64) * BITS_PER_BYTE) / + BITS_PER_BYTE; +range->bitmap.data = g_try_malloc0(range->bitmap.size); +if (!range->bitmap.data) { +ret = -ENOMEM; +goto err_out; +} + +ret = ioctl(container->fd, VFIO_IOMMU_DIRTY_PAGES, dbitmap); +if (ret) { +error_report("Failed to get dirty bitmap for iova: 0x%llx " +"size: 0x%llx err: %d", +range->iova, range->size, errno); +goto err_out; +} + +cpu_physical_memory_set_dirty_lebitmap((uint64_t *)range->bitmap.data, +ram_addr, pages); + +trace_vfio_get_dirty_bitmap(container->fd, range->iova, range->size, +range->bitmap.size, ram_addr); +err_out: +g_free(range->bitmap.data); +g_free(dbitmap); + +return ret; +} + +static int vfio_sync_dirty_bitmap(VFIOContainer *container, + MemoryRegionSection *section) +{ +ram_addr_t ram_addr; + +ram_addr = memory_region_get_ram_addr(section->mr) + + section->offset_within_region; + +return vfio_get_dirty_bitmap(container, + TARGET_PAGE_ALIGN(section->offset_within_address_space), + int128_get64(section->size), ram_addr); +} + +static void vfio_listerner_log_sync(MemoryListener *listener, +MemoryRegionSection *section) +{ +VFIOContainer *container = container_of(listener, VFIOContainer, listener); + +if (vfio_listener_skipped_section(section) || +!container->
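The per-device condition that vfio_devices_all_stopped_and_saving() checks for every device in the container reduces to a two-bit test; a hedged sketch (constants mirror <linux/vfio.h>):

```c
#include <stdbool.h>
#include <stdint.h>

#define VFIO_DEVICE_STATE_RUNNING (1 << 0)
#define VFIO_DEVICE_STATE_SAVING  (1 << 1)

/* log_sync only reports the full bitmap once every device has entered
 * stop-and-copy: _SAVING set and _RUNNING clear. */
static bool stopped_and_saving(uint32_t device_state)
{
    return (device_state & VFIO_DEVICE_STATE_SAVING) &&
           !(device_state & VFIO_DEVICE_STATE_RUNNING);
}
```

The companion vfio_devices_all_running_and_saving() added in patch 15/17 flips the second test, selecting devices still in the pre-copy phase instead.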
[PATCH v27 10/17] memory: Set DIRTY_MEMORY_MIGRATION when IOMMU is enabled
mr->ram_block is NULL when mr->is_iommu is true, so fr.dirty_log_mask was not set correctly and the memory listener's log_sync callback never got called. This patch returns the log_mask with DIRTY_MEMORY_MIGRATION set when the IOMMU is enabled. Signed-off-by: Kirti Wankhede --- softmmu/memory.c | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/softmmu/memory.c b/softmmu/memory.c index 403ff3abc99b..94f606e9d9d9 100644 --- a/softmmu/memory.c +++ b/softmmu/memory.c @@ -1792,7 +1792,7 @@ bool memory_region_is_ram_device(MemoryRegion *mr) uint8_t memory_region_get_dirty_log_mask(MemoryRegion *mr) { uint8_t mask = mr->dirty_log_mask; -if (global_dirty_log && mr->ram_block) { +if (global_dirty_log && (mr->ram_block || memory_region_is_iommu(mr))) { mask |= (1 << DIRTY_MEMORY_MIGRATION); } return mask; -- 2.7.0
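The one-line change above can be exercised standalone; this sketch models only the pieces the patch touches (the DIRTY_MEMORY_MIGRATION bit index matches QEMU's definition, everything else is a stand-in for MemoryRegion state):

```c
#include <stdbool.h>
#include <stdint.h>

#define DIRTY_MEMORY_MIGRATION 2   /* bit index, as in QEMU */

/* Patched memory_region_get_dirty_log_mask() logic: an IOMMU region
 * has no ram_block, but must still advertise DIRTY_MEMORY_MIGRATION
 * while global dirty logging is on, or log_sync is never called. */
static uint8_t dirty_log_mask(uint8_t mr_mask, bool global_dirty_log,
                              bool has_ram_block, bool is_iommu)
{
    uint8_t mask = mr_mask;

    if (global_dirty_log && (has_ram_block || is_iommu)) {
        mask |= (1 << DIRTY_MEMORY_MIGRATION);
    }
    return mask;
}
```

Before the patch, the is_iommu term was absent, so IOMMU regions returned a mask without the migration bit and were skipped by dirty-page sync.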
[PATCH v27 07/17] vfio: Register SaveVMHandlers for VFIO device
Define flags to be used as delimiters in the migration stream for VFIO devices. Added .save_setup and .save_cleanup functions. Map & unmap the migration region from these functions at the source during the saving or pre-copy phase. Set the VFIO device state depending on the VM's state. During live migration, the VM is running when .save_setup is called, so the _SAVING | _RUNNING state is set for the VFIO device. During save-restore, the VM is paused, so only the _SAVING state is set for the VFIO device. Signed-off-by: Kirti Wankhede Reviewed-by: Neo Jia --- hw/vfio/migration.c | 96 hw/vfio/trace-events | 2 ++ 2 files changed, 98 insertions(+) diff --git a/hw/vfio/migration.c b/hw/vfio/migration.c index 7c4fa0d08ea6..2e1054bf7f43 100644 --- a/hw/vfio/migration.c +++ b/hw/vfio/migration.c @@ -8,12 +8,15 @@ */ #include "qemu/osdep.h" +#include "qemu/main-loop.h" +#include "qemu/cutils.h" #include <linux/vfio.h> #include "sysemu/runstate.h" #include "hw/vfio/vfio-common.h" #include "cpu.h" #include "migration/migration.h" +#include "migration/vmstate.h" #include "migration/qemu-file.h" #include "migration/register.h" #include "migration/blocker.h" @@ -25,6 +28,22 @@ #include "trace.h" #include "hw/hw.h" +/* + * Flags to be used as unique delimiters for VFIO devices in the migration + * stream. These flags are composed as: + * 0xffffffff => MSB 32-bit all 1s + * 0xef10 => Magic ID, represents emulated (virtual) function IO + * 0x0000 => 16-bits reserved for flags + * + * The beginning of state information is marked by _DEV_CONFIG_STATE, + * _DEV_SETUP_STATE, or _DEV_DATA_STATE, respectively. The end of a + * certain state information is marked by _END_OF_STATE. 
+ */ +#define VFIO_MIG_FLAG_END_OF_STATE (0xef11ULL) +#define VFIO_MIG_FLAG_DEV_CONFIG_STATE (0xef12ULL) +#define VFIO_MIG_FLAG_DEV_SETUP_STATE (0xef13ULL) +#define VFIO_MIG_FLAG_DEV_DATA_STATE(0xef14ULL) + static inline int vfio_mig_access(VFIODevice *vbasedev, void *val, int count, off_t off, bool iswrite) { @@ -129,6 +148,69 @@ static int vfio_migration_set_state(VFIODevice *vbasedev, uint32_t mask, return 0; } +/* -- */ + +static int vfio_save_setup(QEMUFile *f, void *opaque) +{ +VFIODevice *vbasedev = opaque; +VFIOMigration *migration = vbasedev->migration; +int ret; + +trace_vfio_save_setup(vbasedev->name); + +qemu_put_be64(f, VFIO_MIG_FLAG_DEV_SETUP_STATE); + +if (migration->region.mmaps) { +/* + * vfio_region_mmap() called from migration thread. Memory API called + * from vfio_regio_mmap() need it when called from outdide the main loop + * thread. + */ +qemu_mutex_lock_iothread(); +ret = vfio_region_mmap(&migration->region); +qemu_mutex_unlock_iothread(); +if (ret) { +error_report("%s: Failed to mmap VFIO migration region: %s", + vbasedev->name, strerror(-ret)); +error_report("%s: Falling back to slow path", vbasedev->name); +} +} + +ret = vfio_migration_set_state(vbasedev, VFIO_DEVICE_STATE_MASK, + VFIO_DEVICE_STATE_SAVING); +if (ret) { +error_report("%s: Failed to set state SAVING", vbasedev->name); +return ret; +} + +qemu_put_be64(f, VFIO_MIG_FLAG_END_OF_STATE); + +ret = qemu_file_get_error(f); +if (ret) { +return ret; +} + +return 0; +} + +static void vfio_save_cleanup(void *opaque) +{ +VFIODevice *vbasedev = opaque; +VFIOMigration *migration = vbasedev->migration; + +if (migration->region.mmaps) { +vfio_region_unmap(&migration->region); +} +trace_vfio_save_cleanup(vbasedev->name); +} + +static SaveVMHandlers savevm_vfio_handlers = { +.save_setup = vfio_save_setup, +.save_cleanup = vfio_save_cleanup, +}; + +/* -- */ + static void vfio_vmstate_change(void *opaque, int running, RunState state) { VFIODevice *vbasedev = opaque; @@ -219,6 +301,8 @@ static 
int vfio_migration_init(VFIODevice *vbasedev, int ret; Object *obj; VFIOMigration *migration; +char id[256] = ""; +g_autofree char *path = NULL, *oid; if (!vbasedev->ops->vfio_get_object) { return -EINVAL; @@ -248,6 +332,18 @@ static int vfio_migration_init(VFIODevice *vbasedev, vbasedev->migration = migration; migration->vbasedev = vbasedev; + +oid = vmstate_if_get_id(VMSTATE_IF(DEVICE(obj))); +if (oid) { +path = g_strdup_printf("%s/vfio",
[PATCH v27 16/17] vfio: Make vfio-pci device migration capable
If the device is not a failover primary device, call vfio_migration_probe() to enable migration support for devices that support it, and vfio_migration_finalize() to tear it down again. Removed the migration blocker from the VFIO PCI device-specific structure and use the migration blocker from the generic VFIO device structure instead.

Signed-off-by: Kirti Wankhede
Reviewed-by: Neo Jia
Reviewed-by: Dr. David Alan Gilbert
Reviewed-by: Cornelia Huck
---
 hw/vfio/pci.c | 28 ++++++----------------------
 hw/vfio/pci.h |  1 -
 2 files changed, 8 insertions(+), 21 deletions(-)

diff --git a/hw/vfio/pci.c b/hw/vfio/pci.c
index 1036a5332772..c67fb4cced8e 100644
--- a/hw/vfio/pci.c
+++ b/hw/vfio/pci.c
@@ -2788,17 +2788,6 @@ static void vfio_realize(PCIDevice *pdev, Error **errp)
         return;
     }
 
-    if (!pdev->failover_pair_id) {
-        error_setg(&vdev->migration_blocker,
-                   "VFIO device doesn't support migration");
-        ret = migrate_add_blocker(vdev->migration_blocker, errp);
-        if (ret) {
-            error_free(vdev->migration_blocker);
-            vdev->migration_blocker = NULL;
-            return;
-        }
-    }
-
     vdev->vbasedev.name = g_path_get_basename(vdev->vbasedev.sysfsdev);
     vdev->vbasedev.ops = &vfio_pci_ops;
     vdev->vbasedev.type = VFIO_DEVICE_TYPE_PCI;
@@ -3066,6 +3055,13 @@ static void vfio_realize(PCIDevice *pdev, Error **errp)
         }
     }
 
+    if (!pdev->failover_pair_id) {
+        ret = vfio_migration_probe(&vdev->vbasedev, errp);
+        if (ret) {
+            error_report("%s: Migration disabled", vdev->vbasedev.name);
+        }
+    }
+
     vfio_register_err_notifier(vdev);
     vfio_register_req_notifier(vdev);
     vfio_setup_resetfn_quirk(vdev);
@@ -3080,11 +3076,6 @@ out_teardown:
     vfio_bars_exit(vdev);
 error:
     error_prepend(errp, VFIO_MSG_PREFIX, vdev->vbasedev.name);
-    if (vdev->migration_blocker) {
-        migrate_del_blocker(vdev->migration_blocker);
-        error_free(vdev->migration_blocker);
-        vdev->migration_blocker = NULL;
-    }
 }
 
 static void vfio_instance_finalize(Object *obj)
@@ -3096,10 +3087,6 @@ static void vfio_instance_finalize(Object *obj)
     vfio_bars_finalize(vdev);
     g_free(vdev->emulated_config_bits);
     g_free(vdev->rom);
-    if (vdev->migration_blocker) {
-        migrate_del_blocker(vdev->migration_blocker);
-        error_free(vdev->migration_blocker);
-    }
     /*
      * XXX Leaking igd_opregion is not an oversight, we can't remove the
      * fw_cfg entry therefore leaking this allocation seems like the safest
@@ -3127,6 +3114,7 @@ static void vfio_exitfn(PCIDevice *pdev)
     }
     vfio_teardown_msi(vdev);
     vfio_bars_exit(vdev);
+    vfio_migration_finalize(&vdev->vbasedev);
 }
 
 static void vfio_pci_reset(DeviceState *dev)
diff --git a/hw/vfio/pci.h b/hw/vfio/pci.h
index bce71a9ac93f..1574ef983f8f 100644
--- a/hw/vfio/pci.h
+++ b/hw/vfio/pci.h
@@ -172,7 +172,6 @@ struct VFIOPCIDevice {
     bool no_vfio_ioeventfd;
     bool enable_ramfb;
     VFIODisplay *dpy;
-    Error *migration_blocker;
     Notifier irqchip_change_notifier;
 };
-- 
2.7.0
[PATCH v27 12/17] vfio: Add function to start and stop dirty pages tracking
Call the VFIO_IOMMU_DIRTY_PAGES ioctl to start and stop dirty page tracking for VFIO devices.

Signed-off-by: Kirti Wankhede
Reviewed-by: Dr. David Alan Gilbert
---
 hw/vfio/migration.c | 36 ++++++++++++++++++++++++++++++++++++
 1 file changed, 36 insertions(+)

diff --git a/hw/vfio/migration.c b/hw/vfio/migration.c
index ea5e0f1b8489..77ee60a43ea5 100644
--- a/hw/vfio/migration.c
+++ b/hw/vfio/migration.c
@@ -11,6 +11,7 @@
 #include "qemu/main-loop.h"
 #include "qemu/cutils.h"
 #include <linux/vfio.h>
+#include <sys/ioctl.h>
 
 #include "sysemu/runstate.h"
 #include "hw/vfio/vfio-common.h"
@@ -391,6 +392,34 @@ static int vfio_load_device_config_state(QEMUFile *f, void *opaque)
     return qemu_file_get_error(f);
 }
 
+static int vfio_set_dirty_page_tracking(VFIODevice *vbasedev, bool start)
+{
+    int ret;
+    VFIOMigration *migration = vbasedev->migration;
+    VFIOContainer *container = vbasedev->group->container;
+    struct vfio_iommu_type1_dirty_bitmap dirty = {
+        .argsz = sizeof(dirty),
+    };
+
+    if (start) {
+        if (migration->device_state & VFIO_DEVICE_STATE_SAVING) {
+            dirty.flags = VFIO_IOMMU_DIRTY_PAGES_FLAG_START;
+        } else {
+            return -EINVAL;
+        }
+    } else {
+        dirty.flags = VFIO_IOMMU_DIRTY_PAGES_FLAG_STOP;
+    }
+
+    ret = ioctl(container->fd, VFIO_IOMMU_DIRTY_PAGES, &dirty);
+    if (ret) {
+        error_report("Failed to set dirty tracking flag 0x%x errno: %d",
+                     dirty.flags, errno);
+        return -errno;
+    }
+    return ret;
+}
+
 /* ---------------------------------------------------------------------- */
 
 static int vfio_save_setup(QEMUFile *f, void *opaque)
@@ -426,6 +455,11 @@ static int vfio_save_setup(QEMUFile *f, void *opaque)
         return ret;
     }
 
+    ret = vfio_set_dirty_page_tracking(vbasedev, true);
+    if (ret) {
+        return ret;
+    }
+
     qemu_put_be64(f, VFIO_MIG_FLAG_END_OF_STATE);
 
     ret = qemu_file_get_error(f);
@@ -441,6 +475,8 @@ static void vfio_save_cleanup(void *opaque)
     VFIODevice *vbasedev = opaque;
     VFIOMigration *migration = vbasedev->migration;
 
+    vfio_set_dirty_page_tracking(vbasedev, false);
+
     if (migration->region.mmaps) {
         vfio_region_unmap(&migration->region);
     }
-- 
2.7.0
[PATCH v27 11/17] vfio: Get migration capability flags for container
Added helper functions to get IOMMU info capability chain. Added function to get migration capability information from that capability chain for IOMMU container. Similar change was proposed earlier: https://lists.gnu.org/archive/html/qemu-devel/2018-05/msg03759.html Disable migration for devices if IOMMU module doesn't support migration capability. Signed-off-by: Kirti Wankhede Cc: Shameer Kolothum Cc: Eric Auger --- hw/vfio/common.c | 90 +++ hw/vfio/migration.c | 7 +++- include/hw/vfio/vfio-common.h | 3 ++ 3 files changed, 91 insertions(+), 9 deletions(-) diff --git a/hw/vfio/common.c b/hw/vfio/common.c index c6e98b8d61be..d4959c036dd1 100644 --- a/hw/vfio/common.c +++ b/hw/vfio/common.c @@ -1228,6 +1228,75 @@ static int vfio_init_container(VFIOContainer *container, int group_fd, return 0; } +static int vfio_get_iommu_info(VFIOContainer *container, + struct vfio_iommu_type1_info **info) +{ + +size_t argsz = sizeof(struct vfio_iommu_type1_info); + +*info = g_new0(struct vfio_iommu_type1_info, 1); +again: +(*info)->argsz = argsz; + +if (ioctl(container->fd, VFIO_IOMMU_GET_INFO, *info)) { +g_free(*info); +*info = NULL; +return -errno; +} + +if (((*info)->argsz > argsz)) { +argsz = (*info)->argsz; +*info = g_realloc(*info, argsz); +goto again; +} + +return 0; +} + +static struct vfio_info_cap_header * +vfio_get_iommu_info_cap(struct vfio_iommu_type1_info *info, uint16_t id) +{ +struct vfio_info_cap_header *hdr; +void *ptr = info; + +if (!(info->flags & VFIO_IOMMU_INFO_CAPS)) { +return NULL; +} + +for (hdr = ptr + info->cap_offset; hdr != ptr; hdr = ptr + hdr->next) { +if (hdr->id == id) { +return hdr; +} +} + +return NULL; +} + +static void vfio_get_iommu_info_migration(VFIOContainer *container, + struct vfio_iommu_type1_info *info) +{ +struct vfio_info_cap_header *hdr; +struct vfio_iommu_type1_info_cap_migration *cap_mig; + +hdr = vfio_get_iommu_info_cap(info, VFIO_IOMMU_TYPE1_INFO_CAP_MIGRATION); +if (!hdr) { +return; +} + +cap_mig = container_of(hdr, struct 
vfio_iommu_type1_info_cap_migration, +header); + +/* + * cpu_physical_memory_set_dirty_lebitmap() expects pages in bitmap of + * TARGET_PAGE_SIZE to mark those dirty. + */ +if (cap_mig->pgsize_bitmap & TARGET_PAGE_SIZE) { +container->dirty_pages_supported = true; +container->max_dirty_bitmap_size = cap_mig->max_dirty_bitmap_size; +container->dirty_pgsizes = cap_mig->pgsize_bitmap; +} +} + static int vfio_connect_container(VFIOGroup *group, AddressSpace *as, Error **errp) { @@ -1297,6 +1366,7 @@ static int vfio_connect_container(VFIOGroup *group, AddressSpace *as, container->space = space; container->fd = fd; container->error = NULL; +container->dirty_pages_supported = false; QLIST_INIT(&container->giommu_list); QLIST_INIT(&container->hostwin_list); @@ -1309,7 +1379,7 @@ static int vfio_connect_container(VFIOGroup *group, AddressSpace *as, case VFIO_TYPE1v2_IOMMU: case VFIO_TYPE1_IOMMU: { -struct vfio_iommu_type1_info info; +struct vfio_iommu_type1_info *info; /* * FIXME: This assumes that a Type1 IOMMU can map any 64-bit @@ -1318,15 +1388,19 @@ static int vfio_connect_container(VFIOGroup *group, AddressSpace *as, * existing Type1 IOMMUs generally support any IOVA we're * going to actually try in practice. 
*/ -info.argsz = sizeof(info); -ret = ioctl(fd, VFIO_IOMMU_GET_INFO, &info); -/* Ignore errors */ -if (ret || !(info.flags & VFIO_IOMMU_INFO_PGSIZES)) { +ret = vfio_get_iommu_info(container, &info); + +if (ret || !(info->flags & VFIO_IOMMU_INFO_PGSIZES)) { /* Assume 4k IOVA page size */ -info.iova_pgsizes = 4096; +info->iova_pgsizes = 4096; } -vfio_host_win_add(container, 0, (hwaddr)-1, info.iova_pgsizes); -container->pgsizes = info.iova_pgsizes; +vfio_host_win_add(container, 0, (hwaddr)-1, info->iova_pgsizes); +container->pgsizes = info->iova_pgsizes; + +if (!ret) { +vfio_get_iommu_info_migration(container, info); +} +g_free(info); break; } case VFIO_SPAPR_TCE_v2_IOMMU: diff --git a/hw/vfio/migration.c b/hw/vfio/migration.c index 46d05d230e2a..ea5e0f1b8489 100644 --- a/hw/vfio/migration.c +++ b/hw/vfio/migration.c @@ -828,9 +828,14 @@ err: int vfio_migration_probe(VFIODevice *vbasedev, Error *
[PATCH v27 03/17] vfio: Add save and load functions for VFIO PCI devices
Added functions to save and restore PCI device specific data, specifically config space of PCI device. Signed-off-by: Kirti Wankhede Reviewed-by: Neo Jia --- hw/vfio/pci.c | 48 +++ include/hw/vfio/vfio-common.h | 2 ++ 2 files changed, 50 insertions(+) diff --git a/hw/vfio/pci.c b/hw/vfio/pci.c index bffd5bfe3b78..1036a5332772 100644 --- a/hw/vfio/pci.c +++ b/hw/vfio/pci.c @@ -41,6 +41,7 @@ #include "trace.h" #include "qapi/error.h" #include "migration/blocker.h" +#include "migration/qemu-file.h" #define TYPE_VFIO_PCI_NOHOTPLUG "vfio-pci-nohotplug" @@ -2401,11 +2402,58 @@ static Object *vfio_pci_get_object(VFIODevice *vbasedev) return OBJECT(vdev); } +static bool vfio_msix_enabled(void *opaque, int version_id) +{ +PCIDevice *pdev = opaque; + +return msix_enabled(pdev); +} + +const VMStateDescription vmstate_vfio_pci_config = { +.name = "VFIOPCIDevice", +.version_id = 1, +.minimum_version_id = 1, +.fields = (VMStateField[]) { +VMSTATE_PCI_DEVICE(pdev, VFIOPCIDevice), +VMSTATE_MSIX_TEST(pdev, VFIOPCIDevice, vfio_msix_enabled), +VMSTATE_END_OF_LIST() +} +}; + +static void vfio_pci_save_config(VFIODevice *vbasedev, QEMUFile *f) +{ +VFIOPCIDevice *vdev = container_of(vbasedev, VFIOPCIDevice, vbasedev); + +vmstate_save_state(f, &vmstate_vfio_pci_config, vdev, NULL); +} + +static int vfio_pci_load_config(VFIODevice *vbasedev, QEMUFile *f) +{ +VFIOPCIDevice *vdev = container_of(vbasedev, VFIOPCIDevice, vbasedev); +PCIDevice *pdev = &vdev->pdev; +int ret; + +ret = vmstate_load_state(f, &vmstate_vfio_pci_config, vdev, 1); +if (ret) { +return ret; +} + +if (msi_enabled(pdev)) { +vfio_msi_enable(vdev); +} else if (msix_enabled(pdev)) { +vfio_msix_enable(vdev); +} + +return ret; +} + static VFIODeviceOps vfio_pci_ops = { .vfio_compute_needs_reset = vfio_pci_compute_needs_reset, .vfio_hot_reset_multi = vfio_pci_hot_reset_multi, .vfio_eoi = vfio_intx_eoi, .vfio_get_object = vfio_pci_get_object, +.vfio_save_config = vfio_pci_save_config, +.vfio_load_config = vfio_pci_load_config, 
}; int vfio_populate_vga(VFIOPCIDevice *vdev, Error **errp) diff --git a/include/hw/vfio/vfio-common.h b/include/hw/vfio/vfio-common.h index fe99c36a693a..ba6169cd926e 100644 --- a/include/hw/vfio/vfio-common.h +++ b/include/hw/vfio/vfio-common.h @@ -120,6 +120,8 @@ struct VFIODeviceOps { int (*vfio_hot_reset_multi)(VFIODevice *vdev); void (*vfio_eoi)(VFIODevice *vdev); Object *(*vfio_get_object)(VFIODevice *vdev); +void (*vfio_save_config)(VFIODevice *vdev, QEMUFile *f); +int (*vfio_load_config)(VFIODevice *vdev, QEMUFile *f); }; typedef struct VFIOGroup { -- 2.7.0
[PATCH v27 09/17] vfio: Add load state functions to SaveVMHandlers
Sequence during _RESUMING device state: While data for this device is available, repeat below steps: a. read data_offset from where user application should write data. b. write data of data_size to migration region from data_offset. c. write data_size which indicates vendor driver that data is written in staging buffer. For user, data is opaque. User should write data in the same order as received. Signed-off-by: Kirti Wankhede Reviewed-by: Neo Jia Reviewed-by: Dr. David Alan Gilbert --- hw/vfio/migration.c | 192 +++ hw/vfio/trace-events | 3 + 2 files changed, 195 insertions(+) diff --git a/hw/vfio/migration.c b/hw/vfio/migration.c index 5506cef15d88..46d05d230e2a 100644 --- a/hw/vfio/migration.c +++ b/hw/vfio/migration.c @@ -257,6 +257,77 @@ static int vfio_save_buffer(QEMUFile *f, VFIODevice *vbasedev, uint64_t *size) return ret; } +static int vfio_load_buffer(QEMUFile *f, VFIODevice *vbasedev, +uint64_t data_size) +{ +VFIORegion *region = &vbasedev->migration->region; +uint64_t data_offset = 0, size, report_size; +int ret; + +do { +ret = vfio_mig_read(vbasedev, &data_offset, sizeof(data_offset), + region->fd_offset + VFIO_MIG_STRUCT_OFFSET(data_offset)); +if (ret < 0) { +return ret; +} + +if (data_offset + data_size > region->size) { +/* + * If data_size is greater than the data section of migration region + * then iterate the write buffer operation. This case can occur if + * size of migration region at destination is smaller than size of + * migration region at source. 
+ */ +report_size = size = region->size - data_offset; +data_size -= size; +} else { +report_size = size = data_size; +data_size = 0; +} + +trace_vfio_load_state_device_data(vbasedev->name, data_offset, size); + +while (size) { +void *buf; +uint64_t sec_size; +bool buf_alloc = false; + +buf = get_data_section_size(region, data_offset, size, &sec_size); + +if (!buf) { +buf = g_try_malloc(sec_size); +if (!buf) { +error_report("%s: Error allocating buffer ", __func__); +return -ENOMEM; +} +buf_alloc = true; +} + +qemu_get_buffer(f, buf, sec_size); + +if (buf_alloc) { +ret = vfio_mig_write(vbasedev, buf, sec_size, +region->fd_offset + data_offset); +g_free(buf); + +if (ret < 0) { +return ret; +} +} +size -= sec_size; +data_offset += sec_size; +} + +ret = vfio_mig_write(vbasedev, &report_size, sizeof(report_size), +region->fd_offset + VFIO_MIG_STRUCT_OFFSET(data_size)); +if (ret < 0) { +return ret; +} +} while (data_size); + +return 0; +} + static int vfio_update_pending(VFIODevice *vbasedev) { VFIOMigration *migration = vbasedev->migration; @@ -293,6 +364,33 @@ static int vfio_save_device_config_state(QEMUFile *f, void *opaque) return qemu_file_get_error(f); } +static int vfio_load_device_config_state(QEMUFile *f, void *opaque) +{ +VFIODevice *vbasedev = opaque; +uint64_t data; + +if (vbasedev->ops && vbasedev->ops->vfio_load_config) { +int ret; + +ret = vbasedev->ops->vfio_load_config(vbasedev, f); +if (ret) { +error_report("%s: Failed to load device config space", + vbasedev->name); +return ret; +} +} + +data = qemu_get_be64(f); +if (data != VFIO_MIG_FLAG_END_OF_STATE) { +error_report("%s: Failed loading device config space, " + "end flag incorrect 0x%"PRIx64, vbasedev->name, data); +return -EINVAL; +} + +trace_vfio_load_device_config_state(vbasedev->name); +return qemu_file_get_error(f); +} + /* -- */ static int vfio_save_setup(QEMUFile *f, void *opaque) @@ -477,12 +575,106 @@ static int vfio_save_complete_precopy(QEMUFile *f, void *opaque) return ret; } +static int 
vfio_load_setup(QEMUFile *f, void *opaque) +{ +VFIODevice *vbasedev = opaque; +VFIOMigration *migration = vbasedev->migration; +int ret = 0; + +if (migration->region.mmaps) { +ret = vfio_region_mmap(&migration->region); +if (ret) { +error_report("%s: Failed to mmap VFIO migration region %d: %s", +
[PATCH v27 04/17] vfio: Add migration region initialization and finalize function
Whether the VFIO device supports migration or not is decided based on the migration region query. If the migration region query is successful and the migration region initialization succeeds, then migration is supported; otherwise migration is blocked.

Signed-off-by: Kirti Wankhede
Reviewed-by: Neo Jia
Acked-by: Dr. David Alan Gilbert
---
 hw/vfio/meson.build           |   1 +
 hw/vfio/migration.c           | 129 ++++++++++++++++++++++++++++++++++++++++++
 hw/vfio/trace-events          |   3 +
 include/hw/vfio/vfio-common.h |   9 +++
 4 files changed, 142 insertions(+)
 create mode 100644 hw/vfio/migration.c

diff --git a/hw/vfio/meson.build b/hw/vfio/meson.build
index 37efa74018bc..da9af297a0c5 100644
--- a/hw/vfio/meson.build
+++ b/hw/vfio/meson.build
@@ -2,6 +2,7 @@ vfio_ss = ss.source_set()
 vfio_ss.add(files(
   'common.c',
   'spapr.c',
+  'migration.c',
 ))
 vfio_ss.add(when: 'CONFIG_VFIO_PCI', if_true: files(
   'display.c',
diff --git a/hw/vfio/migration.c b/hw/vfio/migration.c
new file mode 100644
index 000000000000..5f74a3ad1d72
--- /dev/null
+++ b/hw/vfio/migration.c
@@ -0,0 +1,129 @@
+/*
+ * Migration support for VFIO devices
+ *
+ * Copyright NVIDIA, Inc. 2020
+ *
+ * This work is licensed under the terms of the GNU GPL, version 2. See
+ * the COPYING file in the top-level directory.
+ */
+
+#include "qemu/osdep.h"
+#include <linux/vfio.h>
+
+#include "hw/vfio/vfio-common.h"
+#include "cpu.h"
+#include "migration/migration.h"
+#include "migration/qemu-file.h"
+#include "migration/register.h"
+#include "migration/blocker.h"
+#include "migration/misc.h"
+#include "qapi/error.h"
+#include "exec/ramlist.h"
+#include "exec/ram_addr.h"
+#include "pci.h"
+#include "trace.h"
+
+static void vfio_migration_region_exit(VFIODevice *vbasedev)
+{
+    VFIOMigration *migration = vbasedev->migration;
+
+    if (!migration) {
+        return;
+    }
+
+    if (migration->region.size) {
+        vfio_region_exit(&migration->region);
+        vfio_region_finalize(&migration->region);
+    }
+}
+
+static int vfio_migration_init(VFIODevice *vbasedev,
+                               struct vfio_region_info *info)
+{
+    int ret;
+    Object *obj;
+    VFIOMigration *migration;
+
+    if (!vbasedev->ops->vfio_get_object) {
+        return -EINVAL;
+    }
+
+    obj = vbasedev->ops->vfio_get_object(vbasedev);
+    if (!obj) {
+        return -EINVAL;
+    }
+
+    migration = g_new0(VFIOMigration, 1);
+
+    ret = vfio_region_setup(obj, vbasedev, &migration->region,
+                            info->index, "migration");
+    if (ret) {
+        error_report("%s: Failed to setup VFIO migration region %d: %s",
+                     vbasedev->name, info->index, strerror(-ret));
+        goto err;
+    }
+
+    if (!migration->region.size) {
+        error_report("%s: Invalid zero-sized VFIO migration region %d",
+                     vbasedev->name, info->index);
+        ret = -EINVAL;
+        goto err;
+    }
+
+    vbasedev->migration = migration;
+    return 0;
+
+err:
+    vfio_migration_region_exit(vbasedev);
+    g_free(migration);
+    return ret;
+}
+
+/* ---------------------------------------------------------------------- */
+
+int vfio_migration_probe(VFIODevice *vbasedev, Error **errp)
+{
+    struct vfio_region_info *info = NULL;
+    Error *local_err = NULL;
+    int ret;
+
+    ret = vfio_get_dev_region_info(vbasedev, VFIO_REGION_TYPE_MIGRATION,
+                                   VFIO_REGION_SUBTYPE_MIGRATION, &info);
+    if (ret) {
+        goto add_blocker;
+    }
+
+    ret = vfio_migration_init(vbasedev, info);
+    if (ret) {
+        goto add_blocker;
+    }
+
+    trace_vfio_migration_probe(vbasedev->name, info->index);
+    g_free(info);
+    return 0;
+
+add_blocker:
+    error_setg(&vbasedev->migration_blocker,
+               "VFIO device doesn't support migration");
+    g_free(info);
+
+    ret = migrate_add_blocker(vbasedev->migration_blocker, &local_err);
+    if (local_err) {
+        error_propagate(errp, local_err);
+        error_free(vbasedev->migration_blocker);
+        vbasedev->migration_blocker = NULL;
+    }
+    return ret;
+}
+
+void vfio_migration_finalize(VFIODevice *vbasedev)
+{
+    if (vbasedev->migration_blocker) {
+        migrate_del_blocker(vbasedev->migration_blocker);
+        error_free(vbasedev->migration_blocker);
+        vbasedev->migration_blocker = NULL;
+    }
+
+    vfio_migration_region_exit(vbasedev);
+    g_free(vbasedev->migration);
+}
diff --git a/hw/vfio/trace-events b/hw/vfio/trace-events
index a0c7b49a2ebc..9ced5ec6277c 100644
--- a/hw/vfio/trace-events
+++ b/hw/vfio/trace-events
@@ -145,3 +145,6 @@ vfio_display_edid_link_up(void)