Re: [Question] Can I start qemu-system-aarch64 with a vmlinux(ELF format)?
Hi Peter,

On 2024/2/29 17:44, Peter Maydell wrote:
On Thu, 29 Feb 2024 at 03:01, Kunkun Jiang wrote:
Hi Peter,
On 2024/2/27 23:28, Peter Maydell wrote:
On Tue, 27 Feb 2024 at 14:42, Kunkun Jiang via wrote:

Hi everybody,
I want to start qemu-system-aarch64 with a vmlinux, which is an ELF format file. arm_load_elf() is called in arm_setup_direct_kernel_boot(), so I thought the ELF format was supported.

No, you can't do this. The hw/arm/boot.c code assumes that ELF files are "bare metal" binaries, whereas uImage format, AArch64 Image format, and raw binary files are Linux kernels. Only the last three kinds of files will be started with the boot protocol the Linux kernel expects. For AArch64, pass the -kernel option the path to the Image file, not the vmlinux file.

Yes, it works fine using Image files. I would also like to ask again: is it that AArch64 does not support vmlinux, or that QEMU does not implement this capability?

As I said, it is because QEMU assumes that ELF files are bare metal images, not Linux kernel images.

Sorry to bother you again. I still have a few questions.
1. What does "bare metal images" mean? Are they used in TCG mode?
2. How does QEMU decide that an ELF file is a bare metal image? Can you point me to the corresponding code?
3. How can I make the hw/arm/boot.c code treat ELF files as Linux kernels?

Looking forward to your reply.

Thanks,
Kunkun Jiang
Re: [Question] Can I start qemu-system-aarch64 with a vmlinux(ELF format)?
Hi Peter,

On 2024/2/27 23:28, Peter Maydell wrote:
On Tue, 27 Feb 2024 at 14:42, Kunkun Jiang via wrote:

Hi everybody,
I want to start qemu-system-aarch64 with a vmlinux, which is an ELF format file. arm_load_elf() is called in arm_setup_direct_kernel_boot(), so I thought the ELF format was supported.

No, you can't do this. The hw/arm/boot.c code assumes that ELF files are "bare metal" binaries, whereas uImage format, AArch64 Image format, and raw binary files are Linux kernels. Only the last three kinds of files will be started with the boot protocol the Linux kernel expects. For AArch64, pass the -kernel option the path to the Image file, not the vmlinux file.

Yes, it works fine using Image files. I would also like to ask again: is it that AArch64 does not support vmlinux, or that QEMU does not implement this capability?

Thanks,
Kunkun Jiang
[Question] Can I start qemu-system-aarch64 with a vmlinux(ELF format)?
Hi everybody,

I want to start qemu-system-aarch64 with a vmlinux, which is an ELF format file. arm_load_elf() is called in arm_setup_direct_kernel_boot(), so I thought the ELF format was supported. But there's no output. Here is my command line:

qemu-system-aarch64 -machine virt,gic-version=3 -enable-kvm -smp 4 -m 1G \
    -cpu host -kernel vmlinux -initrd fs -append "xxx"

Am I using it the wrong way? Looking forward to your reply.

Thanks,
Kunkun Jiang
Re: [PATCH V9 32/46] vfio-pci: cpr part 2 (msi)
Hi Steve,

On 2023/7/10 23:43, Steven Sistare wrote:
On 7/5/2023 4:56 AM, Kunkun Jiang wrote:

Hi Steve,
I have a few questions about the MSI part of the vfio device.

In reboot mode, you mentioned "The guest drivers' suspend methods flush outstanding requests and re-initialize the devices". This means that, during the downtime, the vfio device does not generate interrupts, so no special processing is required for the MSI interrupts of the vfio device. Am I right?

Correct.

In cpr-exec mode, will the vfio device be "stopped"? If the vfio device is running all the time, it is possible to generate interrupts during the downtime. How are these interrupts dealt with?

The vfio device is not stopped, but its connection to the kvm instance is severed. Interrupts are pended in the vfio kernel state, and that state is preserved across exec, by preserving the vfio descriptors. After exec, qemu creates a new kvm instance, attaches vfio to it, and the interrupts are delivered.

In addition, ARM GICv4 provides support for the direct injection of vLPIs. Those interrupts are more difficult to handle. In this case, what should be done?

I have not studied this feature or tried it.

Have you analyzed the VT-d posted-interrupt feature? It is similar to the direct injection of vLPIs: both set up the interrupt injection channel by executing irq_bypass_register_consumer. According to the current kernel code, the connection between the vfio producer and the kvm consumer needs to be severed first, and then the connection between the vfio producer and the new kvm consumer is established. During the unbinding process, both ARM and x86 execute the *_set_vcpu_affinity callback to modify the interrupt delivery path.

For x86, cmpxchg_double is used in the code when stopping posted interrupts and falling back to remapping mode. Does this guarantee that interrupts will not be lost?

For ARM, a DISCARD command is first sent to the ITS, and then the interrupt delivery path for GICv3 is established. The DISCARD removes the pending interrupt, so interrupts that arrive before the path is re-established are silently discarded. Do you have any good ideas?

Look forward to your reply.
Kunkun Jiang

- Steve

Look forward to your reply.
Kunkun Jiang

On 2022/7/27 0:10, Steve Sistare wrote:

Finish cpr for vfio-pci MSI/MSI-X devices by preserving eventfd's and vector state.

Signed-off-by: Steve Sistare
---
 hw/vfio/pci.c | 119 +-
 1 file changed, 118 insertions(+), 1 deletion(-)

diff --git a/hw/vfio/pci.c b/hw/vfio/pci.c
index b5fd2ec..1d0e8db 100644
--- a/hw/vfio/pci.c
+++ b/hw/vfio/pci.c
@@ -54,17 +54,47 @@ static void vfio_disable_interrupts(VFIOPCIDevice *vdev);
 static void vfio_mmap_set_enabled(VFIOPCIDevice *vdev, bool enabled);
 static void vfio_msi_disable_common(VFIOPCIDevice *vdev);
 
+#define EVENT_FD_NAME(vdev, name) \
+    g_strdup_printf("%s_%s", (vdev)->vbasedev.name, (name))
+
+static void save_event_fd(VFIOPCIDevice *vdev, const char *name, int nr,
+                          EventNotifier *ev)
+{
+    int fd = event_notifier_get_fd(ev);
+
+    if (fd >= 0) {
+        g_autofree char *fdname = EVENT_FD_NAME(vdev, name);
+
+        cpr_resave_fd(fdname, nr, fd);
+    }
+}
+
+static int load_event_fd(VFIOPCIDevice *vdev, const char *name, int nr)
+{
+    g_autofree char *fdname = EVENT_FD_NAME(vdev, name);
+    return cpr_find_fd(fdname, nr);
+}
+
+static void delete_event_fd(VFIOPCIDevice *vdev, const char *name, int nr)
+{
+    g_autofree char *fdname = EVENT_FD_NAME(vdev, name);
+    cpr_delete_fd(fdname, nr);
+}
+
 /* Create new or reuse existing eventfd */
 static int vfio_notifier_init(VFIOPCIDevice *vdev, EventNotifier *e,
                               const char *name, int nr)
 {
-    int fd = -1;    /* placeholder until a subsequent patch */
     int ret = 0;
+    int fd = load_event_fd(vdev, name, nr);
 
     if (fd >= 0) {
        event_notifier_init_fd(e, fd);
    } else {
        ret = event_notifier_init(e, 0);
+        if (!ret) {
+            save_event_fd(vdev, name, nr, e);
+        }
    }
    return ret;
 }
@@ -72,6 +102,7 @@ static int vfio_notifier_init(VFIOPCIDevice *vdev, EventNotifier *e,
 static void vfio_notifier_cleanup(VFIOPCIDevice *vdev, EventNotifier *e,
                                   const char *name, int nr)
 {
+    delete_event_fd(vdev, name, nr);
     event_notifier_cleanup(e);
 }
@@ -512,6 +543,15 @@ static int vfio_msix_vector_do_use(PCIDevice *pdev, unsigned int nr,
     VFIOMSIVector *vector;
     int ret;
 
+    /*
+     * Ignore the callback from msix_set_vector_notifiers during resume.
+     * The necessary subset of these actions is called from vfio_claim_vectors
+     * during post load.
+     */
+    if (vdev->vbasedev.reused) {
+        return 0;
+    }
+
     trace_vfio_msix_vector_do_use(vdev->vbasedev
Re: [PATCH V9 32/46] vfio-pci: cpr part 2 (msi)
Hi Steve,

I have a few questions about the MSI part of the vfio device.

In reboot mode, you mentioned "The guest drivers' suspend methods flush outstanding requests and re-initialize the devices". This means that, during the downtime, the vfio device does not generate interrupts, so no special processing is required for the MSI interrupts of the vfio device. Am I right?

In cpr-exec mode, will the vfio device be "stopped"? If the vfio device is running all the time, it is possible to generate interrupts during the downtime. How are these interrupts dealt with?

In addition, ARM GICv4 provides support for the direct injection of vLPIs. Those interrupts are more difficult to handle. In this case, what should be done?

Look forward to your reply.
Kunkun Jiang

On 2022/7/27 0:10, Steve Sistare wrote:

Finish cpr for vfio-pci MSI/MSI-X devices by preserving eventfd's and vector state.

Signed-off-by: Steve Sistare
---
 hw/vfio/pci.c | 119 +-
 1 file changed, 118 insertions(+), 1 deletion(-)

diff --git a/hw/vfio/pci.c b/hw/vfio/pci.c
index b5fd2ec..1d0e8db 100644
--- a/hw/vfio/pci.c
+++ b/hw/vfio/pci.c
@@ -54,17 +54,47 @@ static void vfio_disable_interrupts(VFIOPCIDevice *vdev);
 static void vfio_mmap_set_enabled(VFIOPCIDevice *vdev, bool enabled);
 static void vfio_msi_disable_common(VFIOPCIDevice *vdev);
 
+#define EVENT_FD_NAME(vdev, name) \
+    g_strdup_printf("%s_%s", (vdev)->vbasedev.name, (name))
+
+static void save_event_fd(VFIOPCIDevice *vdev, const char *name, int nr,
+                          EventNotifier *ev)
+{
+    int fd = event_notifier_get_fd(ev);
+
+    if (fd >= 0) {
+        g_autofree char *fdname = EVENT_FD_NAME(vdev, name);
+
+        cpr_resave_fd(fdname, nr, fd);
+    }
+}
+
+static int load_event_fd(VFIOPCIDevice *vdev, const char *name, int nr)
+{
+    g_autofree char *fdname = EVENT_FD_NAME(vdev, name);
+    return cpr_find_fd(fdname, nr);
+}
+
+static void delete_event_fd(VFIOPCIDevice *vdev, const char *name, int nr)
+{
+    g_autofree char *fdname = EVENT_FD_NAME(vdev, name);
+    cpr_delete_fd(fdname, nr);
+}
+
 /* Create new or reuse existing eventfd */
 static int vfio_notifier_init(VFIOPCIDevice *vdev, EventNotifier *e,
                               const char *name, int nr)
 {
-    int fd = -1;    /* placeholder until a subsequent patch */
     int ret = 0;
+    int fd = load_event_fd(vdev, name, nr);
 
     if (fd >= 0) {
        event_notifier_init_fd(e, fd);
    } else {
        ret = event_notifier_init(e, 0);
+        if (!ret) {
+            save_event_fd(vdev, name, nr, e);
+        }
    }
    return ret;
 }
@@ -72,6 +102,7 @@ static int vfio_notifier_init(VFIOPCIDevice *vdev, EventNotifier *e,
 static void vfio_notifier_cleanup(VFIOPCIDevice *vdev, EventNotifier *e,
                                   const char *name, int nr)
 {
+    delete_event_fd(vdev, name, nr);
     event_notifier_cleanup(e);
 }
@@ -512,6 +543,15 @@ static int vfio_msix_vector_do_use(PCIDevice *pdev, unsigned int nr,
     VFIOMSIVector *vector;
     int ret;
 
+    /*
+     * Ignore the callback from msix_set_vector_notifiers during resume.
+     * The necessary subset of these actions is called from vfio_claim_vectors
+     * during post load.
+     */
+    if (vdev->vbasedev.reused) {
+        return 0;
+    }
+
     trace_vfio_msix_vector_do_use(vdev->vbasedev.name, nr);
 
     vector = &vdev->msi_vectors[nr];
@@ -2784,6 +2824,11 @@ static void vfio_register_err_notifier(VFIOPCIDevice *vdev)
     fd = event_notifier_get_fd(&vdev->err_notifier);
     qemu_set_fd_handler(fd, vfio_err_notifier_handler, NULL, vdev);
 
+    /* Do not alter irq_signaling during vfio_realize for cpr */
+    if (vdev->vbasedev.reused) {
+        return;
+    }
+
     if (vfio_set_irq_signaling(&vdev->vbasedev, VFIO_PCI_ERR_IRQ_INDEX, 0,
                                VFIO_IRQ_SET_ACTION_TRIGGER, fd, &err)) {
         error_reportf_err(err, VFIO_MSG_PREFIX, vdev->vbasedev.name);
@@ -2849,6 +2894,12 @@ static void vfio_register_req_notifier(VFIOPCIDevice *vdev)
     fd = event_notifier_get_fd(&vdev->req_notifier);
     qemu_set_fd_handler(fd, vfio_req_notifier_handler, NULL, vdev);
 
+    /* Do not alter irq_signaling during vfio_realize for cpr */
+    if (vdev->vbasedev.reused) {
+        vdev->req_enabled = true;
+        return;
+    }
+
     if (vfio_set_irq_signaling(&vdev->vbasedev, VFIO_PCI_REQ_IRQ_INDEX, 0,
                                VFIO_IRQ_SET_ACTION_TRIGGER, fd, &err)) {
         error_reportf_err(err, VFIO_MSG_PREFIX, vdev->vbasedev.name);
@@ -3357,6 +3408,46 @@ static Property vfio_pci_dev_properties[] = {
     DEFINE_PROP_END_OF_LIST(),
 };
 
+static void vfio_claim_vectors(VFIOPCIDevice *vdev, int nr_vectors, bool msix)
+{
+    int i, fd;
+    bool pending = false;
+    PCIDevice *pdev = &vdev->pdev;
+
+    vdev->nr_vectors = nr_vectors;
+    v
[PATCH] qmp: Add error proofing for query-acpi-ospm-status
The ACPI device may not implement the ospm_status callback. Executing the QMP command "query-acpi-ospm-status" will then cause a segmentation fault. Add error proofing and a log message to avoid such serious consequences.

Signed-off-by: Kunkun Jiang
---
 monitor/qmp-cmds.c | 7 ++-
 1 file changed, 6 insertions(+), 1 deletion(-)

diff --git a/monitor/qmp-cmds.c b/monitor/qmp-cmds.c
index bf22a8c5a6..7652613635 100644
--- a/monitor/qmp-cmds.c
+++ b/monitor/qmp-cmds.c
@@ -229,7 +229,12 @@ ACPIOSTInfoList *qmp_query_acpi_ospm_status(Error **errp)
         AcpiDeviceIfClass *adevc = ACPI_DEVICE_IF_GET_CLASS(obj);
         AcpiDeviceIf *adev = ACPI_DEVICE_IF(obj);
 
-        adevc->ospm_status(adev, &prev);
+        if (adevc->ospm_status) {
+            adevc->ospm_status(adev, &prev);
+        } else {
+            error_setg(errp, "command is not supported, "
+                       "ACPI device missing ospm_status callback");
+        }
     } else {
         error_setg(errp, "command is not supported, missing ACPI device");
     }
-- 
2.33.0
[PATCH] vfio/migration: Fix incorrect initialization value for parameters in VFIOMigration
The structure VFIOMigration of a VFIODevice is allocated and initialized in vfio_migration_init(). "device_state" and "vm_running" are initialized to 0, indicating that the VFIO device is in the _STOP state and the VM is not running. These initial values are incorrect. According to the protocol, the default state of a VFIO device is _RUNNING. And if a VFIO device is hot-plugged while the VM is running, "vm_running" should be 1. This patch fixes it.

Fixes: 02a7e71b1e5 ("vfio: Add VM state change handler to know state of VM")
Signed-off-by: Kunkun Jiang
---
 hw/vfio/migration.c | 2 ++
 1 file changed, 2 insertions(+)

diff --git a/hw/vfio/migration.c b/hw/vfio/migration.c
index a6ad1f8945..3de4252111 100644
--- a/hw/vfio/migration.c
+++ b/hw/vfio/migration.c
@@ -806,6 +806,8 @@ static int vfio_migration_init(VFIODevice *vbasedev,
     }
 
     vbasedev->migration = g_new0(VFIOMigration, 1);
+    vbasedev->migration->device_state = VFIO_DEVICE_STATE_RUNNING;
+    vbasedev->migration->vm_running = runstate_is_running();
 
     ret = vfio_region_setup(obj, vbasedev, &vbasedev->migration->region,
                             info->index, "migration");
-- 
2.27.0
[PATCH v3 0/2] vfio: Some fixes about vfio-pci MMIO BARs mapping
This series includes the following patches:
Patch 1:
- Added support for mmapping sub-page MMIO BARs after live migration to improve IO performance for some devices
Patch 2:
- Added a trace point to inform users when an MMIO RAM section cannot be mapped

History:
v2 -> v3:
- Modify commit message [Eric Auger]
v1 -> v2:
- Iterate sub-page BARs in vfio_pci_load_config and try to update them [Alex Williamson]

Kunkun Jiang (2):
  vfio/pci: Add support for mmapping sub-page MMIO BARs after live migration
  vfio/common: Add a trace point when a MMIO RAM section cannot be mapped

 hw/vfio/common.c | 7 +++
 hw/vfio/pci.c    | 19 ++-
 2 files changed, 25 insertions(+), 1 deletion(-)
-- 
2.23.0
[PATCH v3 1/2] vfio/pci: Add support for mmapping sub-page MMIO BARs after live migration
We can expand the MemoryRegions of sub-page MMIO BARs in vfio_pci_write_config() to improve IO performance for some devices. However, the MemoryRegions of the destination VM are not expanded any more after live migration, because their addresses have already been updated in vmstate_load_state() (vfio_pci_load_config) and vfio_sub_page_bar_update_mapping() will not be called. This may result in poor performance after live migration. So iterate over the BARs in vfio_pci_load_config() and try to update the sub-page BARs.

Reported-by: Nianyao Tang
Reported-by: Qixin Gan
Signed-off-by: Kunkun Jiang
---
 hw/vfio/pci.c | 19 ++-
 1 file changed, 18 insertions(+), 1 deletion(-)

diff --git a/hw/vfio/pci.c b/hw/vfio/pci.c
index 5cdf1d4298..7b45353ce2 100644
--- a/hw/vfio/pci.c
+++ b/hw/vfio/pci.c
@@ -2453,7 +2453,12 @@ static int vfio_pci_load_config(VFIODevice *vbasedev, QEMUFile *f)
 {
     VFIOPCIDevice *vdev = container_of(vbasedev, VFIOPCIDevice, vbasedev);
     PCIDevice *pdev = &vdev->pdev;
-    int ret;
+    pcibus_t old_addr[PCI_NUM_REGIONS - 1];
+    int bar, ret;
+
+    for (bar = 0; bar < PCI_ROM_SLOT; bar++) {
+        old_addr[bar] = pdev->io_regions[bar].addr;
+    }
 
     ret = vmstate_load_state(f, &vmstate_vfio_pci_config, vdev, 1);
     if (ret) {
@@ -2463,6 +2468,18 @@ static int vfio_pci_load_config(VFIODevice *vbasedev, QEMUFile *f)
     vfio_pci_write_config(pdev, PCI_COMMAND,
                           pci_get_word(pdev->config + PCI_COMMAND), 2);
 
+    for (bar = 0; bar < PCI_ROM_SLOT; bar++) {
+        /*
+         * The address may not be changed in some scenarios
+         * (e.g. the VF driver isn't loaded in VM).
+         */
+        if (old_addr[bar] != pdev->io_regions[bar].addr &&
+            vdev->bars[bar].region.size > 0 &&
+            vdev->bars[bar].region.size < qemu_real_host_page_size) {
+            vfio_sub_page_bar_update_mapping(pdev, bar);
+        }
+    }
+
     if (msi_enabled(pdev)) {
         vfio_msi_enable(vdev);
     } else if (msix_enabled(pdev)) {
-- 
2.23.0
[PATCH v3 2/2] vfio/common: Add a trace point when a MMIO RAM section cannot be mapped
The MSI-X structures of some devices and other non-MSI-X structures may be in the same BAR. They may share one host page, especially in the case of large page granularity, such as 64K. For example, the MSI-X table size of the 82599 NIC is 0x30 and its offset in BAR 3 (size 64KB) is 0x0. vfio_listener_region_add() will be called to map the remaining range (0x30-0xFFFF). If the host page size is 64KB, it will return early at 'int128_ge(int128_make64(iova), llend)' without any message. Let's add a trace point to inform users, like commit 5c08600547c0 ("vfio: Use a trace point when a RAM section cannot be DMA mapped") did.

Signed-off-by: Kunkun Jiang
---
 hw/vfio/common.c | 7 +++
 1 file changed, 7 insertions(+)

diff --git a/hw/vfio/common.c b/hw/vfio/common.c
index a784b219e6..dd387b0d39 100644
--- a/hw/vfio/common.c
+++ b/hw/vfio/common.c
@@ -893,6 +893,13 @@ static void vfio_listener_region_add(MemoryListener *listener,
     llend = int128_and(llend, int128_exts64(qemu_real_host_page_mask));
 
     if (int128_ge(int128_make64(iova), llend)) {
+        if (memory_region_is_ram_device(section->mr)) {
+            trace_vfio_listener_region_add_no_dma_map(
+                memory_region_name(section->mr),
+                section->offset_within_address_space,
+                int128_getlo(section->size),
+                qemu_real_host_page_size);
+        }
         return;
     }
     end = int128_get64(int128_sub(llend, int128_one()));
-- 
2.23.0
Re: [PATCH v2 1/2] vfio/pci: Fix vfio-pci sub-page MMIO BAR mmaping in live migration
Hi Eric,

On 2021/10/23 22:26, Eric Auger wrote:
Hi Kunkun,
On 10/22/21 12:01 PM, Kunkun Jiang wrote:
Hi Eric,
On 2021/10/22 0:15, Eric Auger wrote:
Hi Kunkun,
On 9/14/21 3:53 AM, Kunkun Jiang wrote:

We expand MemoryRegions of vfio-pci sub-page MMIO BARs to vfio_pci_write_config to improve IO performance.

s/to vfio_pci_write_config/ in vfio_pci_write_config()

Thank you for your review. I will correct it in v3.

The MemoryRegions of destination VM will not be expanded successful in live migration, because their addresses have

s/will not be expanded successful in live migration/are not expanded anymore after live migration (?) Is that the correct symptom?

You are right. They are not expanded anymore after live migration, not expanded unsuccessfully. I used the wrong words.

been updated in vmstate_load_state (vfio_pci_load_config). So iterate BARs in vfio_pci_write_config and try to update sub-page BARs.

Fixes: c5e2fb3ce4d (vfio: Add save and load functions for VFIO PCI devices)

is it an actual fix or an optimization?

I recently realized that this is actually an optimization. The VF driver in the VM uses assembly instructions that operate on two registers simultaneously, like stp and ldp. These instructions result in non-ISV data aborts, which cannot be handled well at the moment. If the driver doesn't use such instructions, not expanding the regions only affects performance. I will add this to the commit message in the next version.

Reported-by: Nianyao Tang
Reported-by: Qixin Gan
Signed-off-by: Kunkun Jiang
---
 hw/vfio/pci.c | 15 ++-
 1 file changed, 14 insertions(+), 1 deletion(-)

diff --git a/hw/vfio/pci.c b/hw/vfio/pci.c
index e1ea1d8a23..43c7e93153 100644
--- a/hw/vfio/pci.c
+++ b/hw/vfio/pci.c
@@ -2453,7 +2453,12 @@ static int vfio_pci_load_config(VFIODevice *vbasedev, QEMUFile *f)
 {
     VFIOPCIDevice *vdev = container_of(vbasedev, VFIOPCIDevice, vbasedev);
     PCIDevice *pdev = &vdev->pdev;
-    int ret;
+    pcibus_t old_addr[PCI_NUM_REGIONS - 1];
+    int bar, ret;
+
+    for (bar = 0; bar < PCI_ROM_SLOT; bar++) {
+        old_addr[bar] = pdev->io_regions[bar].addr;
+    }

what are those values before the vmstate_load_state ie. can't you do the

Are you referring to pdev->io_regions[bar].addr? All of the BARs' addr is PCI_BAR_UNMAPPED (~(pcibus_t)0) before the vmstate_load_state.

OK that was my assumption. What I meant is: in that case you always have old_addr[bar] != pdev->io_regions[bar].addr, right? If so, this check is not needed and you don't need old_addr at all. In the original code this was needed since one wanted to call vfio_sub_page_bar_update_mapping() only for the BAR base addresses that were changed during vfio_pci_write_config.

As far as I know, there is at least one case. If the VF driver is not loaded (insmod xxx.ko) in the VM, we will have old_addr[bar] == pdev->io_regions[bar].addr. vfio_sub_page_bar_update_mapping() will be called when 0 < BAR size < host page size, but it will not change anything in this case.

Thanks,
Kunkun Jiang

Thanks
Eric

vfio_sub_page_bar_update_mapping() unconditionally on old_addr[bar] != pdev->io_regions[bar].addr

The size of a BAR is a power of 2. A BAR whose size is greater than the host page size doesn't need to be expanded. Can you explain more? Maybe I misunderstood you.

Thanks,
Kunkun Jiang

 
     ret = vmstate_load_state(f, &vmstate_vfio_pci_config, vdev, 1);
     if (ret) {
@@ -2463,6 +2468,14 @@ static int vfio_pci_load_config(VFIODevice *vbasedev, QEMUFile *f)
     vfio_pci_write_config(pdev, PCI_COMMAND,
                           pci_get_word(pdev->config + PCI_COMMAND), 2);
 
+    for (bar = 0; bar < PCI_ROM_SLOT; bar++) {
+        if (old_addr[bar] != pdev->io_regions[bar].addr &&
+            vdev->bars[bar].region.size > 0 &&
+            vdev->bars[bar].region.size < qemu_real_host_page_size) {
+            vfio_sub_page_bar_update_mapping(pdev, bar);
+        }
+    }
+
     if (msi_enabled(pdev)) {
         vfio_msi_enable(vdev);
     } else if (msix_enabled(pdev)) {

Thanks
Eric
Re: [PATCH v2 2/2] vfio/common: Add trace point when a MMIO RAM section less than PAGE_SIZE
Hi Eric,

On 2021/10/22 1:02, Eric Auger wrote:
Hi Kunkun,
On 9/14/21 3:53 AM, Kunkun Jiang wrote:

The MSI-X structures of some devices and other non-MSI-X structures are in the same BAR. They may share one host page, especially in the

may be in the same bar?

You are right. So embarrassing.

case of large page granularity, such as 64K. For example, MSIX-Table size of 82599 NIC is 0x30 and the offset in Bar 3(size 64KB) is 0x0. If host page size is 64KB.

remove the '.' s/./, ? In that case wouldn't the symptom be the same with a 4kB page?

Host page size is different, iova is different.
iova = REAL_HOST_PAGE_ALIGN(section->offset_within_address_space);
If the host page size is 4KB, the iova will be smaller than llend. If the host page size is 64KB, the iova will be equal to llend.

vfio_listener_region_add() will be called to map the remaining range (0x30-0xFFFF). And it will return early at 'int128_ge(int128_make64(iova), llend)' and hasn't any message.

s/and hasn't any message/without any message

Ok, I will correct this in the next version.

Let's add a trace point to informed users like commit 5c08600547c0

s/informed/inform

Same for this.

("vfio: Use a trace point when a RAM section cannot be DMA mapped") did.

Signed-off-by: Kunkun Jiang
---
 hw/vfio/common.c | 7 +++
 1 file changed, 7 insertions(+)

diff --git a/hw/vfio/common.c b/hw/vfio/common.c
index 8728d4d5c2..2fc6213c0f 100644
--- a/hw/vfio/common.c
+++ b/hw/vfio/common.c
@@ -892,6 +892,13 @@ static void vfio_listener_region_add(MemoryListener *listener,
     llend = int128_and(llend, int128_exts64(qemu_real_host_page_mask));
 
     if (int128_ge(int128_make64(iova), llend)) {
+        if (memory_region_is_ram_device(section->mr)) {
+            trace_vfio_listener_region_add_no_dma_map(
+                memory_region_name(section->mr),
+                section->offset_within_address_space,
+                int128_getlo(section->size),
+                qemu_real_host_page_size);
+        }
         return;
     }
     end = int128_get64(int128_sub(llend, int128_one()));

By the way, if this patch is accepted, the following code in vfio_listener_region_add() may need to be deleted:

if (memory_region_is_ram_device(section->mr)) {
    hwaddr pgmask = (1ULL << ctz64(hostwin->iova_pgsizes)) - 1;

    if ((iova & pgmask) || (int128_get64(llsize) & pgmask)) {
        trace_vfio_listener_region_add_no_dma_map(
            memory_region_name(section->mr),
            section->offset_within_address_space,
            int128_getlo(section->size),
            pgmask + 1);
        return;
    }
}

What do you think?

Thanks,
Kunkun Jiang

Thanks
Eric
Re: [PATCH v2 1/2] vfio/pci: Fix vfio-pci sub-page MMIO BAR mmaping in live migration
Hi Eric,

On 2021/10/22 0:15, Eric Auger wrote:
Hi Kunkun,
On 9/14/21 3:53 AM, Kunkun Jiang wrote:

We expand MemoryRegions of vfio-pci sub-page MMIO BARs to vfio_pci_write_config to improve IO performance.

s/to vfio_pci_write_config/ in vfio_pci_write_config()

Thank you for your review. I will correct it in v3.

The MemoryRegions of destination VM will not be expanded successful in live migration, because their addresses have

s/will not be expanded successful in live migration/are not expanded anymore after live migration (?) Is that the correct symptom?

You are right. They are not expanded anymore after live migration, not expanded unsuccessfully. I used the wrong words.

been updated in vmstate_load_state (vfio_pci_load_config). So iterate BARs in vfio_pci_write_config and try to update sub-page BARs.

Fixes: c5e2fb3ce4d (vfio: Add save and load functions for VFIO PCI devices)

is it an actual fix or an optimization?

I recently realized that this is actually an optimization. The VF driver in the VM uses assembly instructions that operate on two registers simultaneously, like stp and ldp. These instructions result in non-ISV data aborts, which cannot be handled well at the moment. If the driver doesn't use such instructions, not expanding the regions only affects performance. I will add this to the commit message in the next version.

Reported-by: Nianyao Tang
Reported-by: Qixin Gan
Signed-off-by: Kunkun Jiang
---
 hw/vfio/pci.c | 15 ++-
 1 file changed, 14 insertions(+), 1 deletion(-)

diff --git a/hw/vfio/pci.c b/hw/vfio/pci.c
index e1ea1d8a23..43c7e93153 100644
--- a/hw/vfio/pci.c
+++ b/hw/vfio/pci.c
@@ -2453,7 +2453,12 @@ static int vfio_pci_load_config(VFIODevice *vbasedev, QEMUFile *f)
 {
     VFIOPCIDevice *vdev = container_of(vbasedev, VFIOPCIDevice, vbasedev);
     PCIDevice *pdev = &vdev->pdev;
-    int ret;
+    pcibus_t old_addr[PCI_NUM_REGIONS - 1];
+    int bar, ret;
+
+    for (bar = 0; bar < PCI_ROM_SLOT; bar++) {
+        old_addr[bar] = pdev->io_regions[bar].addr;
+    }

what are those values before the vmstate_load_state ie. can't you do the

Are you referring to pdev->io_regions[bar].addr? All of the BARs' addr is PCI_BAR_UNMAPPED (~(pcibus_t)0) before the vmstate_load_state.

vfio_sub_page_bar_update_mapping() unconditionally on old_addr[bar] != pdev->io_regions[bar].addr

The size of a BAR is a power of 2. A BAR whose size is greater than the host page size doesn't need to be expanded. Can you explain more? Maybe I misunderstood you.

Thanks,
Kunkun Jiang

 
     ret = vmstate_load_state(f, &vmstate_vfio_pci_config, vdev, 1);
     if (ret) {
@@ -2463,6 +2468,14 @@ static int vfio_pci_load_config(VFIODevice *vbasedev, QEMUFile *f)
     vfio_pci_write_config(pdev, PCI_COMMAND,
                           pci_get_word(pdev->config + PCI_COMMAND), 2);
 
+    for (bar = 0; bar < PCI_ROM_SLOT; bar++) {
+        if (old_addr[bar] != pdev->io_regions[bar].addr &&
+            vdev->bars[bar].region.size > 0 &&
+            vdev->bars[bar].region.size < qemu_real_host_page_size) {
+            vfio_sub_page_bar_update_mapping(pdev, bar);
+        }
+    }
+
     if (msi_enabled(pdev)) {
         vfio_msi_enable(vdev);
     } else if (msix_enabled(pdev)) {

Thanks
Eric
Re: [RFC v9 15/29] vfio: Set up nested stage mappings
Hi Eric,

On 2021/10/8 0:58, Eric Auger wrote:
Hi Kunkun,
On 4/14/21 3:45 AM, Kunkun Jiang wrote:
On 2021/4/13 20:57, Auger Eric wrote:
Hi Kunkun,
On 4/13/21 2:10 PM, Kunkun Jiang wrote:
Hi Eric,
On 2021/4/11 20:08, Eric Auger wrote:

In nested mode, legacy vfio_iommu_map_notify cannot be used as there is no "caching" mode and we do not trap on map. On Intel, vfio_iommu_map_notify was used to DMA map the RAM through the host single stage. With nested mode, we need to set up the stage 2 and the stage 1 separately. This patch introduces a prereg_listener to set up the stage 2 mapping.

The stage 1 mapping, owned by the guest, is passed to the host when the guest invalidates the stage 1 configuration, through a dedicated PCIPASIDOps callback. Guest IOTLB invalidations are cascaded down to the host through another IOMMU MR UNMAP notifier.

Signed-off-by: Eric Auger
---
v7 -> v8:
- properly handle new IOMMUTLBEntry fields and especially propagate DOMAIN and PASID based invalidations
v6 -> v7:
- remove PASID based invalidation
v5 -> v6:
- add error_report_err()
- remove the abort in case of nested stage case
v4 -> v5:
- use VFIO_IOMMU_SET_PASID_TABLE
- use PCIPASIDOps for config notification
v3 -> v4:
- use iommu_inv_pasid_info for ASID invalidation
v2 -> v3:
- use VFIO_IOMMU_ATTACH_PASID_TABLE
- new user API
- handle leaf
v1 -> v2:
- adapt to uapi changes
- pass the asid
- pass IOMMU_NOTIFIER_S1_CFG when initializing the config notifier
---
 hw/vfio/common.c     | 139 +-
 hw/vfio/pci.c        |  21 +++
 hw/vfio/trace-events |   2 +
 3 files changed, 157 insertions(+), 5 deletions(-)

diff --git a/hw/vfio/common.c b/hw/vfio/common.c
index 0cd7ef2139..e369d451e7 100644
--- a/hw/vfio/common.c
+++ b/hw/vfio/common.c
@@ -595,6 +595,73 @@ static bool vfio_get_xlat_addr(IOMMUTLBEntry *iotlb, void **vaddr,
     return true;
 }
 
+/* Propagate a guest IOTLB invalidation to the host (nested mode) */
+static void vfio_iommu_unmap_notify(IOMMUNotifier *n, IOMMUTLBEntry *iotlb)
+{
+    VFIOGuestIOMMU *giommu = container_of(n, VFIOGuestIOMMU, n);
+    struct vfio_iommu_type1_cache_invalidate ustruct = {};
+    VFIOContainer *container = giommu->container;
+    int ret;
+
+    assert(iotlb->perm == IOMMU_NONE);
+
+    ustruct.argsz = sizeof(ustruct);
+    ustruct.flags = 0;
+    ustruct.info.argsz = sizeof(struct iommu_cache_invalidate_info);
+    ustruct.info.version = IOMMU_CACHE_INVALIDATE_INFO_VERSION_1;
+    ustruct.info.cache = IOMMU_CACHE_INV_TYPE_IOTLB;
+
+    switch (iotlb->granularity) {
+    case IOMMU_INV_GRAN_DOMAIN:
+        ustruct.info.granularity = IOMMU_INV_GRANU_DOMAIN;
+        break;
+    case IOMMU_INV_GRAN_PASID:
+    {
+        struct iommu_inv_pasid_info *pasid_info;
+        int archid = -1;
+
+        pasid_info = &ustruct.info.granu.pasid_info;
+        ustruct.info.granularity = IOMMU_INV_GRANU_PASID;
+        if (iotlb->flags & IOMMU_INV_FLAGS_ARCHID) {
+            pasid_info->flags |= IOMMU_INV_ADDR_FLAGS_ARCHID;
+            archid = iotlb->arch_id;
+        }
+        pasid_info->archid = archid;
+        trace_vfio_iommu_asid_inv_iotlb(archid);
+        break;
+    }
+    case IOMMU_INV_GRAN_ADDR:
+    {
+        hwaddr start = iotlb->iova + giommu->iommu_offset;
+        struct iommu_inv_addr_info *addr_info;
+        size_t size = iotlb->addr_mask + 1;
+        int archid = -1;
+
+        addr_info = &ustruct.info.granu.addr_info;
+        ustruct.info.granularity = IOMMU_INV_GRANU_ADDR;
+        if (iotlb->leaf) {
+            addr_info->flags |= IOMMU_INV_ADDR_FLAGS_LEAF;
+        }
+        if (iotlb->flags & IOMMU_INV_FLAGS_ARCHID) {
+            addr_info->flags |= IOMMU_INV_ADDR_FLAGS_ARCHID;
+            archid = iotlb->arch_id;
+        }
+        addr_info->archid = archid;
+        addr_info->addr = start;
+        addr_info->granule_size = size;
+        addr_info->nb_granules = 1;
+        trace_vfio_iommu_addr_inv_iotlb(archid, start, size,
+                                        1, iotlb->leaf);
+        break;
+    }

Should we pass a size to the host kernel here, even if the vSMMU doesn't support RIL or the guest kernel doesn't use RIL?

It will cause a TLBI issue in this scenario: the guest kernel issues a TLBI command without "range" (tg = 0) to invalidate a 2M huge page. Then qemu passes the iova and size (4K) to the host kernel. Finally, the host kernel issues a TLBI command with "range" (4K), which cannot invalidate the TLB entry of the 2M huge page. (pSMMU supports RIL)

In that case the guest will loop over all 4K images belonging to the 2M huge page and invalidate each of them. This should turn into qemu notifications for each 4kB page, no? This is totally inefficient, hence

The guest will not loop over all 4K images belonging to the 2M huge page. The iommu_iotlb_gather->pgsize will be 2M, if a page is a 2M huge page. The gather->pgsize wil
Re: [PATCH v2 0/2] vfio: Some fixes about vfio-pci MMIO RAM mapping
Kindly ping,

Hi all, will this patch be picked up soon, or is there any other advice?

Thanks, Kunkun Jiang

On 2021/9/14 9:53, Kunkun Jiang wrote:
This series includes the patches below:
Patch 1:
- vfio/pci: Fix vfio-pci sub-page MMIO BAR mmapping in live migration
Patch 2:
- Added a trace point to inform users when a MMIO RAM section is less than PAGE_SIZE

History:
v1 -> v2:
- Iterate sub-page BARs in vfio_pci_load_config and try to update them [Alex Williamson]

Kunkun Jiang (2):
  vfio/pci: Fix vfio-pci sub-page MMIO BAR mmapping in live migration
  vfio/common: Add trace point when a MMIO RAM section less than PAGE_SIZE

 hw/vfio/common.c | 7 +++
 hw/vfio/pci.c    | 15 ++-
 2 files changed, 21 insertions(+), 1 deletion(-)
Re: [question] VFIO Device Migration: The vCPU may be paused during vfio device DMA in iommu nested stage mode && vSVA
Hi Kevin:

On 2021/9/24 14:47, Tian, Kevin wrote:
From: Kunkun Jiang Sent: Friday, September 24, 2021 2:19 PM

Hi all, I encountered a problem in a vfio device migration test. The vCPU may be paused during vfio-pci DMA in iommu nested stage mode && vSVA. This may lead to migration failure and other problems related to the device hardware and driver implementation. It may be a bit early to discuss this issue; after all, the iommu nested stage mode and vSVA are not yet mature. But judging from the current implementation, we will definitely encounter this problem in the future.

Yes, this is a known limitation to support migration with vSVA.

This is the current process of vSVA handling a translation fault in iommu nested stage mode (take SMMU as an example):

    guest os:            4. handle translation fault    5. send CMD_RESUME to vSMMU
    qemu (vfio/vsmmu):   3. inject fault into guest os  6. deliver response to host os
    host os (vfio/smmu): 2. notify qemu                 7. send CMD_RESUME to SMMU
    SMMU:                1. address translation fault   8. retry or terminate

The order is 1 ---> 8. Currently, qemu may pause the vCPU at any step. It is possible to pause the vCPU at steps 1-5, that is, in the middle of a DMA. This may lead to migration failure and other problems related to the device hardware and driver implementation. For example, the device status cannot be changed from RUNNING && SAVING to SAVING, because the device DMA is not over. As far as I can see, the vCPU should not be paused during a device IO process, such as DMA. However, live migration currently does not pay attention to the state of the vfio device when pausing the vCPU. And if the vCPU is not paused, the vfio device is always running. This looks like a *deadlock*.

Basically this requires:
1) stopping the vCPU after stopping the device (could selectively enable this sequence for vSVA);

How can we tell whether vSVA is enabled? In fact, as long as it is in iommu nested stage mode, there will be such a problem, whether it is vSVA or no-vSVA. In no-vSVA mode, a fault can also be generated by modifying the guest device driver.
2) when stopping the device, the driver should block new requests from the vCPU (queued to a pending list) and then drain all in-flight requests, including faults;
* blocking this further requires switching from the fast path to the slow trap-emulation path for the cmd portal before stopping the device;
3) save the pending requests in the vm image and replay them after the vm is resumed;
* finally disable blocking by switching back to the fast path for the cmd portal;

Is there any related patch sent out and discussed? I might have overlooked that. We may be able to discuss and finalize a specification for this problem.

Thanks, Kunkun Jiang

Do you have any ideas to solve this problem? Looking forward to your reply.

We verified that the above flow works in our internal POC.

Thanks, Kevin
[question] VFIO Device Migration: The vCPU may be paused during vfio device DMA in iommu nested stage mode && vSVA
Hi all, I encountered a problem in a vfio device migration test. The vCPU may be paused during vfio-pci DMA in iommu nested stage mode && vSVA. This may lead to migration failure and other problems related to the device hardware and driver implementation. It may be a bit early to discuss this issue; after all, the iommu nested stage mode and vSVA are not yet mature. But judging from the current implementation, we will definitely encounter this problem in the future.

This is the current process of vSVA handling a translation fault in iommu nested stage mode (take SMMU as an example):

    guest os:            4. handle translation fault    5. send CMD_RESUME to vSMMU
    qemu (vfio/vsmmu):   3. inject fault into guest os  6. deliver response to host os
    host os (vfio/smmu): 2. notify qemu                 7. send CMD_RESUME to SMMU
    SMMU:                1. address translation fault   8. retry or terminate

The order is 1 ---> 8. Currently, qemu may pause the vCPU at any step. It is possible to pause the vCPU at steps 1-5, that is, in the middle of a DMA. This may lead to migration failure and other problems related to the device hardware and driver implementation. For example, the device status cannot be changed from RUNNING && SAVING to SAVING, because the device DMA is not over. As far as I can see, the vCPU should not be paused during a device IO process, such as DMA. However, live migration currently does not pay attention to the state of the vfio device when pausing the vCPU. And if the vCPU is not paused, the vfio device is always running. This looks like a *deadlock*.

Do you have any ideas to solve this problem? Looking forward to your reply.

Thanks, Kunkun Jiang
[PATCH v2 2/2] vfio/common: Add trace point when a MMIO RAM section less than PAGE_SIZE
The MSI-X structures of some devices and other non-MSI-X structures are in the same BAR. They may share one host page, especially in the case of large page granularity, such as 64K.

For example, the MSIX-Table size of the 82599 NIC is 0x30 and its offset in Bar 3 (size 64KB) is 0x0. If the host page size is 64KB, vfio_listener_region_add() will be called to map the remaining range (0x30-0x). It will return early at 'int128_ge(int128_make64(iova), llend)' without printing any message. Let's add a trace point to inform users, as commit 5c08600547c0 ("vfio: Use a trace point when a RAM section cannot be DMA mapped") did.

Signed-off-by: Kunkun Jiang
---
 hw/vfio/common.c | 7 +++
 1 file changed, 7 insertions(+)

diff --git a/hw/vfio/common.c b/hw/vfio/common.c
index 8728d4d5c2..2fc6213c0f 100644
--- a/hw/vfio/common.c
+++ b/hw/vfio/common.c
@@ -892,6 +892,13 @@ static void vfio_listener_region_add(MemoryListener *listener,
     llend = int128_and(llend, int128_exts64(qemu_real_host_page_mask));
 
     if (int128_ge(int128_make64(iova), llend)) {
+        if (memory_region_is_ram_device(section->mr)) {
+            trace_vfio_listener_region_add_no_dma_map(
+                memory_region_name(section->mr),
+                section->offset_within_address_space,
+                int128_getlo(section->size),
+                qemu_real_host_page_size);
+        }
         return;
     }
     end = int128_get64(int128_sub(llend, int128_one()));
-- 
2.23.0
[PATCH v2 1/2] vfio/pci: Fix vfio-pci sub-page MMIO BAR mmapping in live migration
We expand the MemoryRegions of vfio-pci sub-page MMIO BARs in vfio_pci_write_config() to improve IO performance. The MemoryRegions of the destination VM will not be expanded successfully in live migration, because their addresses have been updated in vmstate_load_state (vfio_pci_load_config). So iterate over the BARs in vfio_pci_load_config and try to update the sub-page BARs.

Fixes: c5e2fb3ce4d (vfio: Add save and load functions for VFIO PCI devices)
Reported-by: Nianyao Tang
Reported-by: Qixin Gan
Signed-off-by: Kunkun Jiang
---
 hw/vfio/pci.c | 15 ++-
 1 file changed, 14 insertions(+), 1 deletion(-)

diff --git a/hw/vfio/pci.c b/hw/vfio/pci.c
index e1ea1d8a23..43c7e93153 100644
--- a/hw/vfio/pci.c
+++ b/hw/vfio/pci.c
@@ -2453,7 +2453,12 @@ static int vfio_pci_load_config(VFIODevice *vbasedev, QEMUFile *f)
 {
     VFIOPCIDevice *vdev = container_of(vbasedev, VFIOPCIDevice, vbasedev);
     PCIDevice *pdev = &vdev->pdev;
-    int ret;
+    pcibus_t old_addr[PCI_NUM_REGIONS - 1];
+    int bar, ret;
+
+    for (bar = 0; bar < PCI_ROM_SLOT; bar++) {
+        old_addr[bar] = pdev->io_regions[bar].addr;
+    }
 
     ret = vmstate_load_state(f, &vmstate_vfio_pci_config, vdev, 1);
     if (ret) {
@@ -2463,6 +2468,14 @@ static int vfio_pci_load_config(VFIODevice *vbasedev, QEMUFile *f)
     vfio_pci_write_config(pdev, PCI_COMMAND,
                           pci_get_word(pdev->config + PCI_COMMAND), 2);
 
+    for (bar = 0; bar < PCI_ROM_SLOT; bar++) {
+        if (old_addr[bar] != pdev->io_regions[bar].addr &&
+            vdev->bars[bar].region.size > 0 &&
+            vdev->bars[bar].region.size < qemu_real_host_page_size) {
+            vfio_sub_page_bar_update_mapping(pdev, bar);
+        }
+    }
+
     if (msi_enabled(pdev)) {
         vfio_msi_enable(vdev);
     } else if (msix_enabled(pdev)) {
-- 
2.23.0
[PATCH v2 0/2] vfio: Some fixes about vfio-pci MMIO RAM mapping
This series includes the patches below:
Patch 1:
- vfio/pci: Fix vfio-pci sub-page MMIO BAR mmapping in live migration
Patch 2:
- Added a trace point to inform users when a MMIO RAM section is less than PAGE_SIZE

History:
v1 -> v2:
- Iterate sub-page BARs in vfio_pci_load_config and try to update them [Alex Williamson]

Kunkun Jiang (2):
  vfio/pci: Fix vfio-pci sub-page MMIO BAR mmapping in live migration
  vfio/common: Add trace point when a MMIO RAM section less than PAGE_SIZE

 hw/vfio/common.c | 7 +++
 hw/vfio/pci.c    | 15 ++-
 2 files changed, 21 insertions(+), 1 deletion(-)

-- 
2.23.0
Re: [PATCH 1/2] vfio/pci: Fix vfio-pci sub-page MMIO BAR mmapping in live migration
On 2021/9/11 6:24, Alex Williamson wrote:
On Fri, 10 Sep 2021 16:33:12 +0800 Kunkun Jiang wrote:
Hi Alex,
On 2021/9/9 4:45, Alex Williamson wrote:
On Fri, 3 Sep 2021 17:36:10 +0800 Kunkun Jiang wrote:

We expand MemoryRegions of vfio-pci sub-page MMIO BARs in vfio_pci_write_config to improve IO performance. The MemoryRegions of the destination VM will not be expanded successfully in live migration, because their addresses have been updated in vmstate_load_state (vfio_pci_load_config).

What's the call path through vfio_pci_write_config() that you're relying on to get triggered to enable this, and why wouldn't we just walk all sub-page BARs in vfio_pci_load_config() to resolve the issue then? It's my understanding that we do this update in write-config because it's required that the VM sizes the BAR before using it, which is not the case when we resume from migration. Thanks,

Let's take an example:
AArch64 host page granularity: 64KB
PCI device: *Bar2 size 32KB* [mem 0x800020-0x800020 64bit pref]

When enabling Command register bit 1 (Memory Space), the code flow is as follows:

vfio_pci_write_config (addr: 4 val: 2 len: 2)
    // record the old address of each bar, 0x
    old_addr[bar] = pdev->io_regions[bar].addr;
    pci_default_write_config
        pci_update_mappings
            new_addr = pci_bar_address    // 0x800020
            r->addr = new_addr;
            memory_region_add_subregion_overlap
                ...
                *vfio_listener_region_add*
                    alignment check of the ram section address and size fails, return
                *kvm_region_add*
                    kvm_set_phys_mem
                        alignment check of the ram section address and size fails, return
    // old_addr[bar] != pdev->io_regions[bar].addr &&
    // 0 < vdev->bars[bar].region.size < qemu_real_host_page_size
    vfio_sub_page_bar_update_mapping
        *bar size = qemu_real_host_page_size*
        ...
        vfio_listener_region_add    -> map success
        kvm_region_add
            kvm_set_phys_mem        -> map success

In live migration, only the pci config data is sent to the destination VM. Therefore, we need to update the bar's size before the destination VM uses it.
In vfio_pci_load_config, the code flow is as follows:

vfio_pci_load_config
    vmstate_load_state
        *get_pci_config_device*
            pci_update_mappings
                ...
                // bar's addr is updated (0x800020), but bar's size is still 32KB, so the map fails
    vfio_pci_write_config
        // bar's addr will not be changed, so vfio_sub_page_bar_update_mapping won't be called

My idea is that removing the check 'old_addr[bar] != pdev->io_regions[bar].addr' doesn't affect the previous process. There's also a bar size check. In vfio_sub_page_bar_update_mapping, it will check whether the bar is mapped and page aligned.
1) If the bar's addr is 0x, it will not pass the vfio_sub_page_bar_update_mapping check.
2) If the bar's size has been updated, it will not pass the bar size check in vfio_pci_write_config.

The bar size check in vfio_pci_write_config() only tests if the vfio region is >0 and bars[bar].region.size == qemu_real_host_page_size) once we setup the sub-page support, I'm not convinced that's true. So yes, sub-page-update can reject invalid addresses and we already rely on it to do so, but the code being removed avoids that redundant writes to the BAR won't trigger redundant MemoryRegion manipulation. Maybe those are harmless, but that's not your argument for allowing it. OTOH, why wouldn't vfio_pci_load_config() iterate sub-page BARs and try to update them at that time? Thanks,

Like this? I've tested it, and it's okay.

vfio_pci_load_config
{
    for (bar = 0; bar < PCI_ROM_SLOT; bar++) {
        if (old_addr[bar] != pdev->io_regions[bar].addr &&
            vdev->bars[bar].region.size > 0 &&
            vdev->bars[bar].region.size < qemu_real_host_page_size) {
            vfio_sub_page_bar_update_mapping(pdev, bar);
        }
    }
}

Thanks, Kunkun Jiang

Alex

Remove the restriction on base address change in vfio_pci_write_config for correctly mmapping sub-page MMIO BARs. According to my analysis, the remaining parameter verification is enough.
Fixes: c5e2fb3ce4d (vfio: Add save and load functions for VFIO PCI devices) Reported-by: Nianyao Tang Reported-by: Qixin Gan Signed-off-by: Kunkun Jiang --- hw/vfio/pci.c | 8 +--- 1 file changed, 1 insertion(+), 7 deletions(-) diff --git a/hw/vfio/pci.c b/hw/vfio/pci.c index e1ea1d8a23..891b211ddf 100644 --- a/hw/vfio/pci.c +++ b/hw/vfio/pci.c @@ -1189,18 +1189,12 @@ void vfio_pci_write_config(PCIDevice *pdev, } } else if (ranges_overlap(addr, len, PCI_BASE_ADDRESS_0, 24) || range_covers_byte(addr, len, PCI_COMMAND)) { -pcibus_t old_ad
Re: [PATCH 2/2] vfio/common: Add trace point when a MMIO RAM section less than PAGE_SIZE
Hi Alex, On 2021/9/9 4:45, Alex Williamson wrote: On Fri, 3 Sep 2021 17:36:11 +0800 Kunkun Jiang wrote: The MSI-X structures of some devices and other non-MSI-X structures are in the same BAR. They may share one host page, especially in the case of large page granularity, suck as 64K. s/suck/such/ ok For example, MSIX-Table size of 82599 NIC is 0x30 and the offset in Bar 3(size 64KB) is 0x0. If host page size is 64KB. vfio_listenerregion_add() will be called to map the remaining range s/vfio_listenerregion_add/vfio_listener_region_add/ ok (0x30-0x). And it will return early at 'int128_ge((int128_make64(iova), llend))' and hasn't any message. Let's add a trace point to informed users like 5c08600547c did. Please use the following syntax for referencing previous commits (12-char sha1 is standard): commit 5c08600547c0 ("vfio: Use a trace point when a RAM section cannot be DMA mapped") Thank you for your reply. I will correct it in V2. Thanks, Kunkun Jiang Thanks, Alex Signed-off-by: Kunkun Jiang --- hw/vfio/common.c | 7 +++ 1 file changed, 7 insertions(+) diff --git a/hw/vfio/common.c b/hw/vfio/common.c index 8728d4d5c2..2fc6213c0f 100644 --- a/hw/vfio/common.c +++ b/hw/vfio/common.c @@ -892,6 +892,13 @@ static void vfio_listener_region_add(MemoryListener *listener, llend = int128_and(llend, int128_exts64(qemu_real_host_page_mask)); if (int128_ge(int128_make64(iova), llend)) { +if (memory_region_is_ram_device(section->mr)) { +trace_vfio_listener_region_add_no_dma_map( +memory_region_name(section->mr), +section->offset_within_address_space, +int128_getlo(section->size), +qemu_real_host_page_size); +} return; } end = int128_get64(int128_sub(llend, int128_one())); .
Re: [PATCH 1/2] vfio/pci: Fix vfio-pci sub-page MMIO BAR mmapping in live migration
Hi Alex, On 2021/9/9 4:45, Alex Williamson wrote: On Fri, 3 Sep 2021 17:36:10 +0800 Kunkun Jiang wrote: We expand MemoryRegions of vfio-pci sub-page MMIO BARs to vfio_pci_write_config to improve IO performance. The MemoryRegions of destination VM will not be expanded successful in live migration, because their addresses have been updated in vmstate_load_state (vfio_pci_load_config). What's the call path through vfio_pci_write_config() that you're relying on to get triggered to enable this and why wouldn't we just walk all sub-page BARs in vfio_pci_load_config() to resolve the issue then? It's my understanding that we do this update in write-config because it's required that the VM sizes the BAR before using it, which is not the case when we resume from migration. Thanks, Let's take an example: AArch64 host page granularity: 64KB PCI device: *Bar2 size 32KB* [mem 0x800020-0x800020 64bit pref] When enable Command register bit 1(Memory Space), the code flow is as follows: vfio_pci_write_config (addr: 4 val: 2 len: 2) // record the old address of each bar, 0x old_addr[bar] = pdev->io_regions[bar].addr; pci_default_write_config pci_update_mappings new_addr = pci_bar_address // 0x800020 r->addr = new_addr; memory_region_addr_subregion_overlap ... *vfio_listener_region_add* alignment check of the ram section address and size fail, return *kvm_region_add* kvm_set_phys_mem alignment check of the ram section address and size fail, return // old_addr[bar] != pdev->io_regions[bar].addr && // 0 < vdev->bars[bar].region.size < qemu_real_host_page_size vfio_sub_page_bar_update_mapping *bar size = qemu_real_host_page_size* ... vfio_listener_region_add map success kvm_region_add kvm_set_phys_mem map success In live migration, only pci config data is sent to the destination VM. Therefore, we need to update the bar's size before destination VM using it. 
In vfio_pci_load_config, the code flow is as follows: vfio_pci_load_config vmstate_load_state *get_pci_config_device* pci_update_mappings ... // bar's addr is updated(0x800020), but bar's size is still 32KB, so map failed vfio_pci_write_config // bar's addr will not be changed, so vfio_sub_page_bar_update_mapping won't be called My idea is that removing the check 'old_addr[bar] != pdev->io_regions[bar].addr' doesn't affect the previous process. There's also a bar size check. In vfio_sub_page_bar_update_mapping, it will check if bar is mapped and page aligned. 1) If bar's addr is 0x, it will not pass the vfio_sub_page_bar_update_mapping check. 2) If bar's size has been updated, it will not pass the bar size check in vfio_pci_write_config. Thanks, Kunkun Jiang Alex Remove the restriction on base address change in vfio_pci_write_config for correct mmapping sub-page MMIO BARs. Accroding to my analysis, the remaining parameter verification is enough. Fixes: c5e2fb3ce4d (vfio: Add save and load functions for VFIO PCI devices) Reported-by: Nianyao Tang Reported-by: Qixin Gan Signed-off-by: Kunkun Jiang --- hw/vfio/pci.c | 8 +--- 1 file changed, 1 insertion(+), 7 deletions(-) diff --git a/hw/vfio/pci.c b/hw/vfio/pci.c index e1ea1d8a23..891b211ddf 100644 --- a/hw/vfio/pci.c +++ b/hw/vfio/pci.c @@ -1189,18 +1189,12 @@ void vfio_pci_write_config(PCIDevice *pdev, } } else if (ranges_overlap(addr, len, PCI_BASE_ADDRESS_0, 24) || range_covers_byte(addr, len, PCI_COMMAND)) { -pcibus_t old_addr[PCI_NUM_REGIONS - 1]; int bar; -for (bar = 0; bar < PCI_ROM_SLOT; bar++) { -old_addr[bar] = pdev->io_regions[bar].addr; -} - pci_default_write_config(pdev, addr, val, len); for (bar = 0; bar < PCI_ROM_SLOT; bar++) { -if (old_addr[bar] != pdev->io_regions[bar].addr && -vdev->bars[bar].region.size > 0 && +if (vdev->bars[bar].region.size > 0 && vdev->bars[bar].region.size < qemu_real_host_page_size) { vfio_sub_page_bar_update_mapping(pdev, bar); } .
[PATCH 2/2] vfio/common: Add trace point when a MMIO RAM section less than PAGE_SIZE
The MSI-X structures of some devices and other non-MSI-X structures are in the same BAR. They may share one host page, especially in the case of large page granularity, suck as 64K. For example, MSIX-Table size of 82599 NIC is 0x30 and the offset in Bar 3(size 64KB) is 0x0. If host page size is 64KB. vfio_listenerregion_add() will be called to map the remaining range (0x30-0x). And it will return early at 'int128_ge((int128_make64(iova), llend))' and hasn't any message. Let's add a trace point to informed users like 5c08600547c did. Signed-off-by: Kunkun Jiang --- hw/vfio/common.c | 7 +++ 1 file changed, 7 insertions(+) diff --git a/hw/vfio/common.c b/hw/vfio/common.c index 8728d4d5c2..2fc6213c0f 100644 --- a/hw/vfio/common.c +++ b/hw/vfio/common.c @@ -892,6 +892,13 @@ static void vfio_listener_region_add(MemoryListener *listener, llend = int128_and(llend, int128_exts64(qemu_real_host_page_mask)); if (int128_ge(int128_make64(iova), llend)) { +if (memory_region_is_ram_device(section->mr)) { +trace_vfio_listener_region_add_no_dma_map( +memory_region_name(section->mr), +section->offset_within_address_space, +int128_getlo(section->size), +qemu_real_host_page_size); +} return; } end = int128_get64(int128_sub(llend, int128_one())); -- 2.23.0
[PATCH 0/2] vfio: Some fixes about vfio-pci MMIO RAM mapping
This series includes the patches below:
Patch 1:
- Deleted a check to fix vfio-pci sub-page MMIO BAR mmapping in live migration
Patch 2:
- Added a trace point to inform users when a MMIO RAM section is less than PAGE_SIZE

Kunkun Jiang (2):
  vfio/pci: Fix vfio-pci sub-page MMIO BAR mmapping in live migration
  vfio/common: Add trace point when a MMIO RAM section less than PAGE_SIZE

 hw/vfio/common.c | 7 +++
 hw/vfio/pci.c    | 8 +---
 2 files changed, 8 insertions(+), 7 deletions(-)

-- 
2.23.0
[PATCH 1/2] vfio/pci: Fix vfio-pci sub-page MMIO BAR mmapping in live migration
We expand the MemoryRegions of vfio-pci sub-page MMIO BARs in vfio_pci_write_config to improve IO performance. The MemoryRegions of the destination VM will not be expanded successfully in live migration, because their addresses have been updated in vmstate_load_state (vfio_pci_load_config). Remove the restriction on base address change in vfio_pci_write_config for correctly mmapping sub-page MMIO BARs. According to my analysis, the remaining parameter verification is enough.

Fixes: c5e2fb3ce4d (vfio: Add save and load functions for VFIO PCI devices)
Reported-by: Nianyao Tang
Reported-by: Qixin Gan
Signed-off-by: Kunkun Jiang
---
 hw/vfio/pci.c | 8 +---
 1 file changed, 1 insertion(+), 7 deletions(-)

diff --git a/hw/vfio/pci.c b/hw/vfio/pci.c
index e1ea1d8a23..891b211ddf 100644
--- a/hw/vfio/pci.c
+++ b/hw/vfio/pci.c
@@ -1189,18 +1189,12 @@ void vfio_pci_write_config(PCIDevice *pdev,
         }
     } else if (ranges_overlap(addr, len, PCI_BASE_ADDRESS_0, 24) ||
                range_covers_byte(addr, len, PCI_COMMAND)) {
-        pcibus_t old_addr[PCI_NUM_REGIONS - 1];
         int bar;
 
-        for (bar = 0; bar < PCI_ROM_SLOT; bar++) {
-            old_addr[bar] = pdev->io_regions[bar].addr;
-        }
-
         pci_default_write_config(pdev, addr, val, len);
 
         for (bar = 0; bar < PCI_ROM_SLOT; bar++) {
-            if (old_addr[bar] != pdev->io_regions[bar].addr &&
-                vdev->bars[bar].region.size > 0 &&
+            if (vdev->bars[bar].region.size > 0 &&
                 vdev->bars[bar].region.size < qemu_real_host_page_size) {
                 vfio_sub_page_bar_update_mapping(pdev, bar);
             }
-- 
2.23.0
Re: [RFC v9 14/29] vfio: Introduce helpers to DMA map/unmap a RAM section
Hi Eric,

On 2021/4/11 20:08, Eric Auger wrote:
Let's introduce two helpers that allow to DMA map/unmap a RAM section. Those helpers will be called for nested stage setup in another call site. Also the vfio_listener_region_add/del() structure may be clearer.

Signed-off-by: Eric Auger
---
v8 -> v9:
- rebase on top of 1eb7f642750c ("vfio: Support host translation granule size")
v5 -> v6:
- add Error **
---
 hw/vfio/common.c     | 199 +--
 hw/vfio/trace-events |   4 +-
 2 files changed, 119 insertions(+), 84 deletions(-)

diff --git a/hw/vfio/common.c b/hw/vfio/common.c
index a8f835328e..0cd7ef2139 100644
--- a/hw/vfio/common.c
+++ b/hw/vfio/common.c
@@ -662,13 +662,126 @@ hostwin_from_range(VFIOContainer *container, hwaddr iova, hwaddr end)
     return NULL;
 }
 
+static int vfio_dma_map_ram_section(VFIOContainer *container,
+                                    MemoryRegionSection *section, Error **err)
+{
+    VFIOHostDMAWindow *hostwin;
+    Int128 llend, llsize;
+    hwaddr iova, end;
+    void *vaddr;
+    int ret;
+
+    assert(memory_region_is_ram(section->mr));
+
+    iova = TARGET_PAGE_ALIGN(section->offset_within_address_space);
+    llend = int128_make64(section->offset_within_address_space);
+    llend = int128_add(llend, section->size);
+    llend = int128_and(llend, int128_exts64(TARGET_PAGE_MASK));

Is there any special meaning for using TARGET_PAGE_ALIGN here? REAL_HOST_PAGE_ALIGN is used in vfio_listener_region_add. And I think a check should be added here if using REAL_HOST_PAGE_ALIGN:

if (int128_ge(int128_make64(iova), llend)) {
    return;
}

It will cause an error log or an 'assert(r == a)' failure in int128_get64 when calling vfio_prereg_listener_region_add in some scenarios. Some devices' BARs may map MSI-X structures and others in one host page.

By the way, is this patch set going to be updated after "/dev/iommu" is sent out?
Thanks, Kunkun Jiang

+    end = int128_get64(int128_sub(llend, int128_one()));
+
+    vaddr = memory_region_get_ram_ptr(section->mr) +
+            section->offset_within_region +
+            (iova - section->offset_within_address_space);
+
+    hostwin = hostwin_from_range(container, iova, end);
+    if (!hostwin) {
+        error_setg(err, "Container %p can't map guest IOVA region"
+                   " 0x%"HWADDR_PRIx"..0x%"HWADDR_PRIx, container, iova, end);
+        return -EFAULT;
+    }
+
+    trace_vfio_dma_map_ram(iova, end, vaddr);
+
+    llsize = int128_sub(llend, int128_make64(iova));
+
+    if (memory_region_is_ram_device(section->mr)) {
+        hwaddr pgmask = (1ULL << ctz64(hostwin->iova_pgsizes)) - 1;
+
+        if ((iova & pgmask) || (int128_get64(llsize) & pgmask)) {
+            trace_vfio_listener_region_add_no_dma_map(
+                memory_region_name(section->mr),
+                section->offset_within_address_space,
+                int128_getlo(section->size),
+                pgmask + 1);
+            return 0;
+        }
+    }
+
+    ret = vfio_dma_map(container, iova, int128_get64(llsize),
+                       vaddr, section->readonly);
+    if (ret) {
+        error_setg(err, "vfio_dma_map(%p, 0x%"HWADDR_PRIx", "
+                   "0x%"HWADDR_PRIx", %p) = %d (%m)",
+                   container, iova, int128_get64(llsize), vaddr, ret);
+        if (memory_region_is_ram_device(section->mr)) {
+            /* Allow unexpected mappings not to be fatal for RAM devices */
+            error_report_err(*err);
+            return 0;
+        }
+        return ret;
+    }
+    return 0;
+}
+
+static void vfio_dma_unmap_ram_section(VFIOContainer *container,
+                                       MemoryRegionSection *section)
+{
+    Int128 llend, llsize;
+    hwaddr iova, end;
+    bool try_unmap = true;
+    int ret;
+
+    iova = REAL_HOST_PAGE_ALIGN(section->offset_within_address_space);
+    llend = int128_make64(section->offset_within_address_space);
+    llend = int128_add(llend, section->size);
+    llend = int128_and(llend, int128_exts64(qemu_real_host_page_mask));
+
+    if (int128_ge(int128_make64(iova), llend)) {
+        return;
+    }
+    end = int128_get64(int128_sub(llend, int128_one()));
+
+    llsize = int128_sub(llend, int128_make64(iova));
+
+    trace_vfio_dma_unmap_ram(iova, end);
+
+    if (memory_region_is_ram_device(section->mr)) {
+        hwaddr pgmask;
+        VFIOHostDMAWindow *hostwin = hostwin_from_range(container, iova, end);
+
+        assert(hostwin); /* or region_add() would have failed */
+
+        pgmask = (1ULL << ctz64(hostwin->iova_pgsizes)) - 1;
+        try_unmap = !((iova & pgmask) || (int128_get64(llsize) & pgmask));
+    }
+
+    if (try_unmap) {
+        if (int128_eq(llsize, int128_2_64())) {
+            /* The unmap ioctl doesn
[PATCH 1/2] vfio/common: Add trace point when a MMIO RAM section less than minimum size
I recently did some tests on the 82599 NIC and found a strange scenario. The MSIX-Table size of this NIC is 0x30 and its offset in Bar 3 (64KB) is 0x0. And the CPU page size is 64KB. region_add() will return early at 'int128_ge(int128_make64(iova), llend)' without printing any message. Let's add a trace point to inform users.

Signed-off-by: Kunkun Jiang
---
 hw/vfio/common.c | 7 +++
 1 file changed, 7 insertions(+)

diff --git a/hw/vfio/common.c b/hw/vfio/common.c
index 7d80f43e39..bbb8d1ea0c 100644
--- a/hw/vfio/common.c
+++ b/hw/vfio/common.c
@@ -892,6 +892,13 @@ static void vfio_listener_region_add(MemoryListener *listener,
     llend = int128_and(llend, int128_exts64(qemu_real_host_page_mask));
 
     if (int128_ge(int128_make64(iova), llend)) {
+        if (memory_region_is_ram_device(section->mr)) {
+            trace_vfio_listener_region_add_no_dma_map(
+                memory_region_name(section->mr),
+                section->offset_within_address_space,
+                int128_getlo(section->size),
+                qemu_real_host_page_size);
+        }
         return;
     }
     end = int128_get64(int128_sub(llend, int128_one()));
-- 
2.23.0
[PATCH 0/2] vfio/common: Some fixes about region_add/region_del
This series includes the patches below:
Patch 1:
- Add a trace point to inform users when a MMIO RAM section is less than the minimum size
Patch 2:
- Fix address alignment in region_add/region_del with the vfio iommu smallest page size

Kunkun Jiang (2):
  vfio/common: Add trace point when a MMIO RAM section less than minimum size
  vfio/common: Fix address alignment in region_add/region_del

 hw/vfio/common.c | 35 +++
 1 file changed, 27 insertions(+), 8 deletions(-)

-- 
2.23.0
[PATCH 2/2] vfio/common: Fix address alignment in region_add/region_del
The page sizes supported by the IOMMU may not match the CPU page size. For example, the CPU page size is 16KB, but the ARM SMMU may not support 16KB. So it is inappropriate to use qemu_real_host_page_mask in region_add/region_del. The vfio iommu page sizes are exposed via VFIO_IOMMU_GET_INFO, so use the smallest supported page size to align the address.

Fixes: 1eb7f642750 (vfio: Support host translation granule size)
Signed-off-by: Kunkun Jiang
---
 hw/vfio/common.c | 30 +-
 1 file changed, 21 insertions(+), 9 deletions(-)

diff --git a/hw/vfio/common.c b/hw/vfio/common.c
index bbb8d1ea0c..62f338cd78 100644
--- a/hw/vfio/common.c
+++ b/hw/vfio/common.c
@@ -859,10 +859,13 @@ static void vfio_unregister_ram_discard_listener(VFIOContainer *container,
     g_free(vrdl);
 }
 
+#define VFIO_IOMMU_MIN_PAGE_ALIGN(addr, pgsize) ROUND_UP((addr), (pgsize))
+
 static void vfio_listener_region_add(MemoryListener *listener,
                                      MemoryRegionSection *section)
 {
     VFIOContainer *container = container_of(listener, VFIOContainer, listener);
+    uint64_t vfio_iommu_min_page_size, vfio_iommu_min_page_mask;
     hwaddr iova, end;
     Int128 llend, llsize;
     void *vaddr;
@@ -879,17 +882,21 @@ static void vfio_listener_region_add(MemoryListener *listener,
         return;
     }
 
+    vfio_iommu_min_page_size = 1 << ctz64(container->pgsizes);
+    vfio_iommu_min_page_mask = ~(vfio_iommu_min_page_size - 1);
+
     if (unlikely((section->offset_within_address_space &
-                  ~qemu_real_host_page_mask) !=
-                 (section->offset_within_region & ~qemu_real_host_page_mask))) {
+                  ~vfio_iommu_min_page_mask) !=
+                 (section->offset_within_region & ~vfio_iommu_min_page_mask))) {
         error_report("%s received unaligned region", __func__);
         return;
     }
 
-    iova = REAL_HOST_PAGE_ALIGN(section->offset_within_address_space);
+    iova = VFIO_IOMMU_MIN_PAGE_ALIGN(section->offset_within_address_space,
+                                     vfio_iommu_min_page_size);
     llend = int128_make64(section->offset_within_address_space);
     llend = int128_add(llend, section->size);
-    llend = int128_and(llend, int128_exts64(qemu_real_host_page_mask));
+    llend = int128_and(llend, int128_exts64(vfio_iommu_min_page_mask));
 
     if (int128_ge(int128_make64(iova), llend)) {
         if (memory_region_is_ram_device(section->mr)) {
@@ -897,7 +904,7 @@ static void vfio_listener_region_add(MemoryListener *listener,
                 memory_region_name(section->mr),
                 section->offset_within_address_space,
                 int128_getlo(section->size),
-                qemu_real_host_page_size);
+                vfio_iommu_min_page_size);
         }
         return;
     }
@@ -1102,6 +1109,7 @@ static void vfio_listener_region_del(MemoryListener *listener,
                                      MemoryRegionSection *section)
 {
     VFIOContainer *container = container_of(listener, VFIOContainer, listener);
+    uint64_t vfio_iommu_min_page_size, vfio_iommu_min_page_mask;
     hwaddr iova, end;
     Int128 llend, llsize;
     int ret;
@@ -1115,9 +1123,12 @@ static void vfio_listener_region_del(MemoryListener *listener,
         return;
     }
 
+    vfio_iommu_min_page_size = 1 << ctz64(container->pgsizes);
+    vfio_iommu_min_page_mask = ~(vfio_iommu_min_page_size - 1);
+
     if (unlikely((section->offset_within_address_space &
-                  ~qemu_real_host_page_mask) !=
-                 (section->offset_within_region & ~qemu_real_host_page_mask))) {
+                  ~vfio_iommu_min_page_mask) !=
+                 (section->offset_within_region & ~vfio_iommu_min_page_mask))) {
         error_report("%s received unaligned region", __func__);
         return;
     }
@@ -1145,10 +1156,11 @@ static void vfio_listener_region_del(MemoryListener *listener,
      */
     }
 
-    iova = REAL_HOST_PAGE_ALIGN(section->offset_within_address_space);
+    iova = VFIO_IOMMU_MIN_PAGE_ALIGN(section->offset_within_address_space,
+                                     vfio_iommu_min_page_size);
     llend = int128_make64(section->offset_within_address_space);
     llend = int128_add(llend, section->size);
-    llend = int128_and(llend, int128_exts64(qemu_real_host_page_mask));
+    llend = int128_and(llend, int128_exts64(vfio_iommu_min_page_mask));
 
     if (int128_ge(int128_make64(iova), llend)) {
         return;
-- 
2.23.0
Re: [question] Shall we flush ITS tables into guest RAM when shutdown the VM?
On 2021/7/6 22:27, Eric Auger wrote: Hi Dave, On 7/6/21 4:19 PM, Dr. David Alan Gilbert wrote: * Eric Auger (eric.au...@redhat.com) wrote: Hi, On 7/6/21 10:18 AM, Kunkun Jiang wrote: Hi Eric, On 2021/6/30 17:16, Eric Auger wrote: On 6/30/21 3:38 AM, Kunkun Jiang wrote: On 2021/6/30 4:14, Eric Auger wrote: Hi Kunkun, On 6/29/21 11:33 AM, Kunkun Jiang wrote: Hi all, Accroding to the patch cddafd8f353d2d251b1a5c6c948a577a85838582, our original intention is to flush the ITS tables into guest RAM at the point RUN_STATE_FINISH_MIGRATE, but sometimes the VM gets stopped before migration launch so let's simply flush the tables each time the VM gets stopped. But I encountered an error when I shut down the virtual machine. qemu-system-aarch64: KVM_SET_DEVICE_ATTR failed: Group 4 attr 0x0001: Permission denied Shall we need to flush ITS tables into guest RAM when 'shutdown' the VM? Or do you think this error is normal? yes we determined in the past this was the right moment do save the tables "with a live migration the guest is still running after the RAM has been first saved, and so the tables may still change during the iterative RAM save. You would actually need to do this at just the point we stop the VM before the final RAM save; that *might* be possible by using a notifier hook a vm run state change to RUN_STATE_FINISH_MIGRATE - if you can do the changes just as the migration flips into that mode it *should* work. " said David. But sometimes as the commit msg says, the VM is stopped before the migration launch - I do not remember the exact scenario tbh -. Well, I initially wanted to know more about this scenario to determine whether a normal shutdown would fall into it. I think it was for save/restore use case. In that case you need to flush the KVM cache in memory on VM shutdown. Sorry for late reply. Can we distinguish from the 'RunState'? When we stop the VM, the RunState will be set. 
There are many types of RunState, such as RUN_STATE_FINISH_MIGRATE/RUN_STATE_SAVE_VM/ RUN_STATE_SHUTDOWN/RUN_STATE_GUEST_PANICKED, etc. Maybe RUN_STATE_SHUTDOWN doesn't belong to the save/restore use case, right? Adding Dave, Juan and Peter in the loop for migration expertise. At the moment we save the ARM ITS MSI controller tables whenever the VM gets stopped. Saving the tables from KVM caches into the guest RAM is needed for migration and save/restore use cases. However with GICv4 this fails at KVM level because some MSIs are forwarded and saving their state is not supported with GICv4. While GICv4 migration is not supported we would like the VM to work properly, ie. be stoppable without taking care of table saving. So could we be more precise and identify the save/restore and migration use cases instead of saving the tables on each VM shutdown? During the precopy migration (not sure about others), we do: static void migration_completion(MigrationState *s) { ret = vm_stop_force_state(RUN_STATE_FINISH_MIGRATE); ... ret = qemu_savevm_state_complete_precopy(s->to_dst_file, false, inactivate); so I think we do have that state there to hook off. That's consistent with what you suggested in the past and what is logged in the commit message of cddafd8f353d2d251b1a5c6c948a577a85838582 ("hw/intc/arm_gicv3_its: Implement state save/restore"). However, does save/restore enter that state? If I remember correctly that's why I decided to do the save on each VM stop instead. My idea is to distinguish between the save/restore and migration use cases and other scenarios based on the 'RunState'. RUN_STATE_FINISH_MIGRATE and RUN_STATE_COLO should be the right cases, I think. I'm not sure RUN_STATE_SAVE_VM is a right case, which is used in save_snapshot. Do you have a better way to identify it? By the way, it just gets an -EACCES error and doesn't cause any other problems. If we don't have a good and easy way to solve it, maybe we can ignore it.
Thanks, Kunkun Jiang The tables are saved into guest RAM, so we need the CPUs and devices to be stopped, but we need the guest RAM to be saved after the ITS save operation. Yeah, so what should happen is that you: a) Iterate RAM a lot b) You stop everything -> Flushes remaining changes into RAM c) Transmit device state and last bits of RAM changes. so that flush should happen at (b). That's correct. Thanks Eric Dave Kunkun, by the way you currently just get an error from qemu, ie. qemu does not exit? Couldn't we just ignore the -EACCES error? Thanks Eric In my opinion, when the virtual machine is shut down normally, flushing the ITS tables is not necessary. If we can't tell the difference between 'normal shutdown' and the above scenario, then this 'error' is inevitable. So each time the VM is stopped we flush the caches into guest RAM. This error occurs in the following scenario: Kunpeng 920
Re: [question] Shall we flush ITS tables into guest RAM when shutdown the VM?
Hi Eric, On 2021/7/6 21:52, Eric Auger wrote: Hi, On 7/6/21 10:18 AM, Kunkun Jiang wrote: Hi Eric, On 2021/6/30 17:16, Eric Auger wrote: On 6/30/21 3:38 AM, Kunkun Jiang wrote: On 2021/6/30 4:14, Eric Auger wrote: Hi Kunkun, On 6/29/21 11:33 AM, Kunkun Jiang wrote: Hi all, Accroding to the patch cddafd8f353d2d251b1a5c6c948a577a85838582, our original intention is to flush the ITS tables into guest RAM at the point RUN_STATE_FINISH_MIGRATE, but sometimes the VM gets stopped before migration launch so let's simply flush the tables each time the VM gets stopped. But I encountered an error when I shut down the virtual machine. qemu-system-aarch64: KVM_SET_DEVICE_ATTR failed: Group 4 attr 0x0001: Permission denied Shall we need to flush ITS tables into guest RAM when 'shutdown' the VM? Or do you think this error is normal? yes we determined in the past this was the right moment do save the tables "with a live migration the guest is still running after the RAM has been first saved, and so the tables may still change during the iterative RAM save. You would actually need to do this at just the point we stop the VM before the final RAM save; that *might* be possible by using a notifier hook a vm run state change to RUN_STATE_FINISH_MIGRATE - if you can do the changes just as the migration flips into that mode it *should* work. " said David. But sometimes as the commit msg says, the VM is stopped before the migration launch - I do not remember the exact scenario tbh -. Well, I initially wanted to know more about this scenario to determine whether a normal shutdown would fall into it. I think it was for save/restore use case. In that case you need to flush the KVM cache in memory on VM shutdown. Sorry for late reply. Can we distinguish from the 'RunState'? When we stop the VM, the RunState will be set. There are many types of RunState, such as RUN_STATE_FINISH_MIGRATE/RUN_STATE_SAVE_VM/ RUN_STATE_SHUTDOWN/RUN_STATE_GUEST_PANICKED, etc. 
Maybe RUN_STATE_SHUTDOWN doesn't belong to the save/restore use case, right? Adding Dave, Juan and Peter in the loop for migration expertise. At the moment we save the ARM ITS MSI controller tables whenever the VM gets stopped. Saving the tables from KVM caches into the guest RAM is needed for migration and save/restore use cases. However with GICv4 this fails at KVM level because some MSIs are forwarded and saving their state is not supported with GICv4. While GICv4 migration is not supported we would like the VM to work properly, ie. be stoppable without taking care of table saving. So could we be more precise and identify the save/restore and migration use cases instead of saving the tables on each VM shutdown? The tables are saved into guest RAM, so we need the CPUs and devices to be stopped, but we need the guest RAM to be saved after the ITS save operation. Kunkun, by the way you currently just get an error from qemu, ie. qemu does not exit? Couldn't we just ignore the -EACCES error? Sorry for the late reply. There is something wrong with my mailbox, so I didn't receive your email... Yeah, I just get an error. I just wanted to know if you have a good way to solve it. It's not causing any other problems, so ignoring it should be OK. Thanks, Kunkun Jiang Thanks Eric In my opinion, when the virtual machine is shut down normally, flushing the ITS tables is not necessary. If we can't tell the difference between 'normal shutdown' and the above scenario, then this 'error' is inevitable. So each time the VM is stopped we flush the caches into guest RAM. This error occurs in the following scenario: Kunpeng 920, GICv4 enabled, with a Hisilicon SEC accelerator passed through to the VM. The flow is as follows: QEMU: vm_shutdown do_vm_stop(RUN_STATE_SHUTDOWN) vm_state_notify ...
vm_change_state_handler (hw/intc/arm_gicv3_its_kvm.c) kvm_device_access Kernel: vgic_its_save_tables_v0 vgic_its_save_device_tables vgic_its_save_itt There is such a code in vgic_its_save_itt(): /* * If an LPI carries the HW bit, this means that this * interrupt is controlled by GICv4, and we do not * have direct access to that state without GICv4.1. * Let's simply fail the save operation... */ if (ite->irq->hw && !kvm_vgic_global_state.has_gicv4_1) return -EACCES; Maybe we miss a piece of code for 4.0 that unsets the forwarding. The only way to handle this is to make sure ite->irq->hw is not set on shutdown, no? It's not going to return -EACCES here, if we unset the forwarding first. But this may cause problems in save/restore scenarios. The GICv4 architecture doesn't give any guarantee that the pending state is written into the pending table. Thanks, Kunkun Jiang Thanks Eric As far as I understand you need a v4.1 to migrate, following Shenming's series [PATCH v5 0/6] KVM: arm64: Add VLPI migration
Re: [PATCH 1/2] qapi/run-state: Add a new shutdown cause 'migration-completed'
On 2021/7/6 18:27, Dr. David Alan Gilbert wrote: * Kunkun Jiang (jiangkun...@huawei.com) wrote: Hi Daniel, On 2021/7/5 20:48, Daniel P. Berrangé wrote: On Mon, Jul 05, 2021 at 08:36:52PM +0800, Kunkun Jiang wrote: In the current version, the source QEMU process does not automatic exit after a successful migration. Additional action is required, such as sending { "execute": "quit" } or ctrl+c. For simplify, add a new shutdown cause 'migration-completed' to exit the source QEMU process after a successful migration. IIUC, 'STATUS_COMPLETED' state is entered on the source host once it has finished sending all VM state, and thus does not guarantee that the target host has successfully received and loaded all VM state. Thanks for your reply. If the target host doesn't successfully receive and load all VM state, we can send { "execute": "cont" } to resume the soruce in time to ensure that VM will not lost? Yes, that's pretty common at the moment; the failed migration can happen at lots of different points: a) The last part of the actual migration stream/loading the devices - that's pretty easy, since the destination hasn't actually got the full migration stream. b) If the migration itself completes, but then the management system then tries to reconfigure the networking/storage on the destination, and something goes wrong in that, then it can roll that back and cont on the source. So, it's a pretty common type of failure/recovery - the management application has to be a bit careful not to do anything destructive until as late as possible, so it knows it can switch back. Okay, I see. Typically a mgmt app will need to directly confirm that the target host QEMU has succesfully started running, before it will tell the source QEMU to quit. 'a mgmt app', such as libvirt? 
Yes, it's currently libvirt that does that; but any of the control things could (it's just libvirt has been going long enough so it knows about lots and lots of nasty cases of migration failure, and recovering properly). Can you explain why did you want to get the source to automatically quit? In a real setup where does it help? Sorry, my thoughts on live migration scenarios are not comprehensive enough. Thanks, Kunkun Jiang Dave Thanks, Kunkun Jiang So, AFAICT, this automatic exit after STATUS_COMPLETED is not safe and could lead to total loss of the running VM in error scenarios. Signed-off-by: Kunkun Jiang --- migration/migration.c | 1 + qapi/run-state.json | 4 +++- 2 files changed, 4 insertions(+), 1 deletion(-) diff --git a/migration/migration.c b/migration/migration.c index 4228635d18..16782c93c2 100644 --- a/migration/migration.c +++ b/migration/migration.c @@ -3539,6 +3539,7 @@ static void migration_iteration_finish(MigrationState *s) case MIGRATION_STATUS_COMPLETED: migration_calculate_complete(s); runstate_set(RUN_STATE_POSTMIGRATE); +qemu_system_shutdown_request(SHUTDOWN_CAUSE_MIGRATION_COMPLETED); break; case MIGRATION_STATUS_ACTIVE: diff --git a/qapi/run-state.json b/qapi/run-state.json index 43d66d700f..66aaef4e2b 100644 --- a/qapi/run-state.json +++ b/qapi/run-state.json @@ -86,12 +86,14 @@ # ignores --no-reboot. This is useful for sanitizing # hypercalls on s390 that are used during kexec/kdump/boot # +# @migration-completed: Reaction to the successful migration +# ## { 'enum': 'ShutdownCause', # Beware, shutdown_caused_by_guest() depends on enumeration order 'data': [ 'none', 'host-error', 'host-qmp-quit', 'host-qmp-system-reset', 'host-signal', 'host-ui', 'guest-shutdown', 'guest-reset', -'guest-panic', 'subsystem-reset'] } +'guest-panic', 'subsystem-reset', 'migration-completed'] } ## # @StatusInfo: -- 2.23.0 Regards, Daniel
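The recovery flow Dave describes can be sketched as a QMP exchange. The commands here are real QMP commands, but the sequencing and the dest-side status value are illustrative of what a management app like libvirt does, not a verbatim transcript:

```
# Migration stream completed on the source, but the destination
# fails to start (case (a) or (b) in Dave's list):
(source) -> { "execute": "query-migrate" }
(source) <- { "return": { "status": "completed" } }
(dest)   -> { "execute": "query-status" }
(dest)   <- { "return": { "status": "paused", "running": false } }
# Roll back: resume the source instead of quitting it.
(source) -> { "execute": "cont" }
(source) <- { "return": {} }
```

Only after the destination confirms it is actually running does the management app issue `{ "execute": "quit" }` on the source, which is why an automatic source exit at STATUS_COMPLETED closes off this recovery path.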
Re: [question] Shall we flush ITS tables into guest RAM when shutdown the VM?
Hi Eric, On 2021/6/30 17:16, Eric Auger wrote: On 6/30/21 3:38 AM, Kunkun Jiang wrote: On 2021/6/30 4:14, Eric Auger wrote: Hi Kunkun, On 6/29/21 11:33 AM, Kunkun Jiang wrote: Hi all, Accroding to the patch cddafd8f353d2d251b1a5c6c948a577a85838582, our original intention is to flush the ITS tables into guest RAM at the point RUN_STATE_FINISH_MIGRATE, but sometimes the VM gets stopped before migration launch so let's simply flush the tables each time the VM gets stopped. But I encountered an error when I shut down the virtual machine. qemu-system-aarch64: KVM_SET_DEVICE_ATTR failed: Group 4 attr 0x0001: Permission denied Shall we need to flush ITS tables into guest RAM when 'shutdown' the VM? Or do you think this error is normal? yes we determined in the past this was the right moment do save the tables "with a live migration the guest is still running after the RAM has been first saved, and so the tables may still change during the iterative RAM save. You would actually need to do this at just the point we stop the VM before the final RAM save; that *might* be possible by using a notifier hook a vm run state change to RUN_STATE_FINISH_MIGRATE - if you can do the changes just as the migration flips into that mode it *should* work. " said David. But sometimes as the commit msg says, the VM is stopped before the migration launch - I do not remember the exact scenario tbh -. Well, I initially wanted to know more about this scenario to determine whether a normal shutdown would fall into it. I think it was for save/restore use case. In that case you need to flush the KVM cache in memory on VM shutdown. Sorry for late reply. Can we distinguish from the 'RunState'? When we stop the VM, the RunState will be set. There are many types of RunState, such as RUN_STATE_FINISH_MIGRATE/RUN_STATE_SAVE_VM/ RUN_STATE_SHUTDOWN/RUN_STATE_GUEST_PANICKED, etc. Maybe RUN_STATE_SHUTDOWN doesn't belong to save/restore use case, right? 
In my opinion, when the virtual machine is normally shutdown, flushing the ITS tables is not necessary. If we can't tell the difference between 'normal shutdown' and the above scenario, then this 'error' is inevitable. So each time the VM is stopped we flush the caches into guest RAM. This error occurs in the following scenario: Kunpeng 920 、enable GICv4、passthrough a accelerator Hisilicon SEC to the VM. The flow is as follows: QEMU: vm_shutdown do_vm_stop(RUN_STATE_SHUTDOWN) vm_state_notify ... vm_change_state_handler (hw/intc/arm_gicv3_its_kvm.c) kvm_device_access Kernel: vgic_its_save_tables_v0 vgic_its_save_device_tables vgic_its_save_itt There is such a code in vgic_its_save_itt(): /* * If an LPI carries the HW bit, this means that this * interrupt is controlled by GICv4, and we do not * have direct access to that state without GICv4.1. * Let's simply fail the save operation... */ if (ite->irq->hw && !kvm_vgic_global_state.has_gicv4_1) return -EACCES; Maybe we miss a piece of code for 4.0 that unsets the forwarding. The only way to handle this is to make sure ite->irq->hw is not set on shutdown, no? It's not going to return -EACCES here, if we unset the forwarding first. But this may cause problems in save/restore scenarios. The GICv4 architecture doesn't give any guarantee that the pending state is written into the pending table. Thanks, Kunkun Jiang Thanks Eric As far as I understand you need a v4.1 to migrate, following Shenming's series [PATCH v5 0/6] KVM: arm64: Add VLPI migration support on GICv4.1 Maybe sync with him? Yes, GICv4 does not support live migrate. Thanks, Kunkun Jiang Thanks Eric Looking forward to your reply. Thanks, Kunkun Jiang . .
Re: [PATCH 1/2] qapi/run-state: Add a new shutdown cause 'migration-completed'
Hi Daniel, On 2021/7/5 20:48, Daniel P. Berrangé wrote: On Mon, Jul 05, 2021 at 08:36:52PM +0800, Kunkun Jiang wrote: In the current version, the source QEMU process does not automatically exit after a successful migration. Additional action is required, such as sending { "execute": "quit" } or ctrl+c. For simplicity, add a new shutdown cause 'migration-completed' to exit the source QEMU process after a successful migration. IIUC, 'STATUS_COMPLETED' state is entered on the source host once it has finished sending all VM state, and thus does not guarantee that the target host has successfully received and loaded all VM state. Thanks for your reply. If the target host doesn't successfully receive and load all VM state, can we send { "execute": "cont" } to resume the source in time to ensure that the VM will not be lost? Typically a mgmt app will need to directly confirm that the target host QEMU has successfully started running, before it will tell the source QEMU to quit. 'a mgmt app', such as libvirt? Thanks, Kunkun Jiang So, AFAICT, this automatic exit after STATUS_COMPLETED is not safe and could lead to total loss of the running VM in error scenarios. Signed-off-by: Kunkun Jiang --- migration/migration.c | 1 + qapi/run-state.json | 4 +++- 2 files changed, 4 insertions(+), 1 deletion(-) diff --git a/migration/migration.c b/migration/migration.c index 4228635d18..16782c93c2 100644 --- a/migration/migration.c +++ b/migration/migration.c @@ -3539,6 +3539,7 @@ static void migration_iteration_finish(MigrationState *s) case MIGRATION_STATUS_COMPLETED: migration_calculate_complete(s); runstate_set(RUN_STATE_POSTMIGRATE); +qemu_system_shutdown_request(SHUTDOWN_CAUSE_MIGRATION_COMPLETED); break; case MIGRATION_STATUS_ACTIVE: diff --git a/qapi/run-state.json b/qapi/run-state.json index 43d66d700f..66aaef4e2b 100644 --- a/qapi/run-state.json +++ b/qapi/run-state.json @@ -86,12 +86,14 @@ # ignores --no-reboot. This is useful for sanitizing # hypercalls on s390 that are used during kexec/kdump/boot # +# @migration-completed: Reaction to the successful migration +# ## { 'enum': 'ShutdownCause', # Beware, shutdown_caused_by_guest() depends on enumeration order 'data': [ 'none', 'host-error', 'host-qmp-quit', 'host-qmp-system-reset', 'host-signal', 'host-ui', 'guest-shutdown', 'guest-reset', -'guest-panic', 'subsystem-reset'] } +'guest-panic', 'subsystem-reset', 'migration-completed'] } ## # @StatusInfo: -- 2.23.0 Regards, Daniel
This is useful for sanitizing # hypercalls on s390 that are used during kexec/kdump/boot # +# @migration-completed: Reaction to the successful migration +# ## { 'enum': 'ShutdownCause', # Beware, shutdown_caused_by_guest() depends on enumeration order 'data': [ 'none', 'host-error', 'host-qmp-quit', 'host-qmp-system-reset', 'host-signal', 'host-ui', 'guest-shutdown', 'guest-reset', -'guest-panic', 'subsystem-reset'] } +'guest-panic', 'subsystem-reset', 'migration-completed'] } ## # @StatusInfo: -- 2.23.0 Regards, Daniel
[PATCH 0/2] Auto exit source QEMU process after a successful migration
Hi all, This series includes the following patches: Patch 1: - add a new shutdown cause 'migration-completed', used to automatically exit the source QEMU process after a successful migration Patch 2: - add a new migration capability 'auto-quit' to control whether to automatically exit the source QEMU process after a successful migration Kunkun Jiang (2): qapi/run-state: Add a new shutdown cause 'migration-completed' qapi/migration: Add a new migration capability 'auto-quit' migration/migration.c | 13 + migration/migration.h | 1 + qapi/migration.json | 6 +- qapi/run-state.json | 4 +++- 4 files changed, 22 insertions(+), 2 deletions(-) -- 2.23.0
[PATCH 2/2] qapi/migration: Add a new migration capability 'auto-quit'
For compatibility, a new migration capability 'auto-quit' is added to control the exit of source QEMU after a successful migration. Signed-off-by: Kunkun Jiang --- migration/migration.c | 14 +- migration/migration.h | 1 + qapi/migration.json | 6 +- 3 files changed, 19 insertions(+), 2 deletions(-) diff --git a/migration/migration.c b/migration/migration.c index 16782c93c2..82ad6d35b2 100644 --- a/migration/migration.c +++ b/migration/migration.c @@ -2567,6 +2567,15 @@ bool migrate_background_snapshot(void) return s->enabled_capabilities[MIGRATION_CAPABILITY_BACKGROUND_SNAPSHOT]; } +bool migrate_auto_quit(void) +{ +MigrationState *s; + +s = migrate_get_current(); + +return s->enabled_capabilities[MIGRATION_CAPABILITY_AUTO_QUIT]; +} + /* migration thread support */ /* * Something bad happened to the RP stream, mark an error @@ -3539,7 +3548,10 @@ static void migration_iteration_finish(MigrationState *s) case MIGRATION_STATUS_COMPLETED: migration_calculate_complete(s); runstate_set(RUN_STATE_POSTMIGRATE); -qemu_system_shutdown_request(SHUTDOWN_CAUSE_MIGRATION_COMPLETED); + +if (migrate_auto_quit()) { +qemu_system_shutdown_request(SHUTDOWN_CAUSE_MIGRATION_COMPLETED); +} break; case MIGRATION_STATUS_ACTIVE: diff --git a/migration/migration.h b/migration/migration.h index 2ebb740dfa..a72b178d35 100644 --- a/migration/migration.h +++ b/migration/migration.h @@ -349,6 +349,7 @@ int migrate_decompress_threads(void); bool migrate_use_events(void); bool migrate_postcopy_blocktime(void); bool migrate_background_snapshot(void); +bool migrate_auto_quit(void); /* Sending on the return path - generic and then for each message type */ void migrate_send_rp_shut(MigrationIncomingState *mis, diff --git a/qapi/migration.json b/qapi/migration.json index 1124a2dda8..26c1a52c56 100644 --- a/qapi/migration.json +++ b/qapi/migration.json @@ -452,6 +452,9 @@ # procedure starts. The VM RAM is saved with running VM. 
# (since 6.0) # +# @auto-quit: If enabled, QEMU process will exit after a successful migration. +# (since 6.1) +# # Since: 1.2 ## { 'enum': 'MigrationCapability', @@ -459,7 +462,8 @@ 'compress', 'events', 'postcopy-ram', 'x-colo', 'release-ram', 'block', 'return-path', 'pause-before-switchover', 'multifd', 'dirty-bitmaps', 'postcopy-blocktime', 'late-block-activate', - 'x-ignore-shared', 'validate-uuid', 'background-snapshot'] } + 'x-ignore-shared', 'validate-uuid', 'background-snapshot', + 'auto-quit'] } ## # @MigrationCapabilityStatus: -- 2.23.0
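Assuming this series were applied, the new capability would be toggled like any other migration capability over QMP. The `migrate-set-capabilities` and `migrate` commands are real QMP; the `auto-quit` capability itself only exists with this (unmerged) patch:

```
-> { "execute": "migrate-set-capabilities",
     "arguments": { "capabilities": [
         { "capability": "auto-quit", "state": true } ] } }
<- { "return": {} }
-> { "execute": "migrate",
     "arguments": { "uri": "tcp:destination:4444" } }
<- { "return": {} }
```

With the capability enabled, reaching MIGRATION_STATUS_COMPLETED would trigger the SHUTDOWN_CAUSE_MIGRATION_COMPLETED shutdown request from patch 1; with it disabled (the default), behaviour is unchanged.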
[PATCH 1/2] qapi/run-state: Add a new shutdown cause 'migration-completed'
In the current version, the source QEMU process does not automatically exit after a successful migration. Additional action is required, such as sending { "execute": "quit" } or ctrl+c. For simplicity, add a new shutdown cause 'migration-completed' to exit the source QEMU process after a successful migration. Signed-off-by: Kunkun Jiang --- migration/migration.c | 1 + qapi/run-state.json | 4 +++- 2 files changed, 4 insertions(+), 1 deletion(-) diff --git a/migration/migration.c b/migration/migration.c index 4228635d18..16782c93c2 100644 --- a/migration/migration.c +++ b/migration/migration.c @@ -3539,6 +3539,7 @@ static void migration_iteration_finish(MigrationState *s) case MIGRATION_STATUS_COMPLETED: migration_calculate_complete(s); runstate_set(RUN_STATE_POSTMIGRATE); +qemu_system_shutdown_request(SHUTDOWN_CAUSE_MIGRATION_COMPLETED); break; case MIGRATION_STATUS_ACTIVE: diff --git a/qapi/run-state.json b/qapi/run-state.json index 43d66d700f..66aaef4e2b 100644 --- a/qapi/run-state.json +++ b/qapi/run-state.json @@ -86,12 +86,14 @@ # ignores --no-reboot. This is useful for sanitizing # hypercalls on s390 that are used during kexec/kdump/boot # +# @migration-completed: Reaction to the successful migration +# ## { 'enum': 'ShutdownCause', # Beware, shutdown_caused_by_guest() depends on enumeration order 'data': [ 'none', 'host-error', 'host-qmp-quit', 'host-qmp-system-reset', 'host-signal', 'host-ui', 'guest-shutdown', 'guest-reset', -'guest-panic', 'subsystem-reset'] } +'guest-panic', 'subsystem-reset', 'migration-completed'] } ## # @StatusInfo: -- 2.23.0
Re: [question] Shall we flush ITS tables into guest RAM when shutdown the VM?
On 2021/6/30 4:14, Eric Auger wrote: Hi Kunkun, On 6/29/21 11:33 AM, Kunkun Jiang wrote: Hi all, Accroding to the patch cddafd8f353d2d251b1a5c6c948a577a85838582, our original intention is to flush the ITS tables into guest RAM at the point RUN_STATE_FINISH_MIGRATE, but sometimes the VM gets stopped before migration launch so let's simply flush the tables each time the VM gets stopped. But I encountered an error when I shut down the virtual machine. qemu-system-aarch64: KVM_SET_DEVICE_ATTR failed: Group 4 attr 0x0001: Permission denied Shall we need to flush ITS tables into guest RAM when 'shutdown' the VM? Or do you think this error is normal? yes we determined in the past this was the right moment do save the tables "with a live migration the guest is still running after the RAM has been first saved, and so the tables may still change during the iterative RAM save. You would actually need to do this at just the point we stop the VM before the final RAM save; that *might* be possible by using a notifier hook a vm run state change to RUN_STATE_FINISH_MIGRATE - if you can do the changes just as the migration flips into that mode it *should* work. " said David. But sometimes as the commit msg says, the VM is stopped before the migration launch - I do not remember the exact scenario tbh -. Well, I initially wanted to know more about this scenario to determine whether a normal shutdown would fall into it. In my opinion, when the virtual machine is normally shutdown, flushing the ITS tables is not necessary. If we can't tell the difference between 'normal shutdown' and the above scenario, then this 'error' is inevitable. So each time the VM is stopped we flush the caches into guest RAM. This error occurs in the following scenario: Kunpeng 920 、enable GICv4、passthrough a accelerator Hisilicon SEC to the VM. The flow is as follows: QEMU: vm_shutdown do_vm_stop(RUN_STATE_SHUTDOWN) vm_state_notify ... 
vm_change_state_handler (hw/intc/arm_gicv3_its_kvm.c) kvm_device_access Kernel: vgic_its_save_tables_v0 vgic_its_save_device_tables vgic_its_save_itt There is such a code in vgic_its_save_itt(): /* * If an LPI carries the HW bit, this means that this * interrupt is controlled by GICv4, and we do not * have direct access to that state without GICv4.1. * Let's simply fail the save operation... */ if (ite->irq->hw && !kvm_vgic_global_state.has_gicv4_1) return -EACCES; As far as I understand you need a v4.1 to migrate, following Shenming's series [PATCH v5 0/6] KVM: arm64: Add VLPI migration support on GICv4.1 Maybe sync with him? Yes, GICv4 does not support live migrate. Thanks, Kunkun Jiang Thanks Eric Looking forward to your reply. Thanks, Kunkun Jiang .
[question] Shall we flush ITS tables into guest RAM when shutdown the VM?
Hi all, According to the patch cddafd8f353d2d251b1a5c6c948a577a85838582, our original intention was to flush the ITS tables into guest RAM at the point of RUN_STATE_FINISH_MIGRATE, but sometimes the VM gets stopped before migration launch, so let's simply flush the tables each time the VM gets stopped. But I encountered an error when I shut down the virtual machine. qemu-system-aarch64: KVM_SET_DEVICE_ATTR failed: Group 4 attr 0x0001: Permission denied Do we need to flush the ITS tables into guest RAM when we 'shutdown' the VM? Or do you think this error is normal? This error occurs in the following scenario: Kunpeng 920, GICv4 enabled, with a Hisilicon SEC accelerator passed through to the VM. The flow is as follows: QEMU: vm_shutdown do_vm_stop(RUN_STATE_SHUTDOWN) vm_state_notify ... vm_change_state_handler (hw/intc/arm_gicv3_its_kvm.c) kvm_device_access Kernel: vgic_its_save_tables_v0 vgic_its_save_device_tables vgic_its_save_itt There is code like this in vgic_its_save_itt(): /* * If an LPI carries the HW bit, this means that this * interrupt is controlled by GICv4, and we do not * have direct access to that state without GICv4.1. * Let's simply fail the save operation... */ if (ite->irq->hw && !kvm_vgic_global_state.has_gicv4_1) return -EACCES; Looking forward to your reply. Thanks, Kunkun Jiang
Re: [PATCH] vfio: Fix unregister SaveVMHandler in vfio_migration_finalize
Kindly ping, Hi everyone, Will this patch be picked up soon, or is there any other work for me to do? Best Regards, Kunkun Jiang On 2021/5/27 20:31, Kunkun Jiang wrote: In the vfio_migration_init(), the SaveVMHandler is registered for VFIO device. But it lacks the operation of 'unregister'. It will lead to 'Segmentation fault (core dumped)' in qemu_savevm_state_setup(), if performing live migration after a VFIO device is hot deleted. Fixes: 7c2f5f75f94 (vfio: Register SaveVMHandlers for VFIO device) Reported-by: Qixin Gan Signed-off-by: Kunkun Jiang --- hw/vfio/migration.c | 1 + 1 file changed, 1 insertion(+) diff --git a/hw/vfio/migration.c b/hw/vfio/migration.c index 201642d75e..ef397ebe6c 100644 --- a/hw/vfio/migration.c +++ b/hw/vfio/migration.c @@ -892,6 +892,7 @@ void vfio_migration_finalize(VFIODevice *vbasedev) remove_migration_state_change_notifier(>migration_state); qemu_del_vm_change_state_handler(migration->vm_state); +unregister_savevm(VMSTATE_IF(vbasedev->dev), "vfio", vbasedev); vfio_migration_exit(vbasedev); }
Re: [PATCH] vfio: Fix unregister SaveVMHandler in vfio_migration_finalize
Hi Philippe, On 2021/5/27 21:44, Philippe Mathieu-Daudé wrote: On 5/27/21 2:31 PM, Kunkun Jiang wrote: In the vfio_migration_init(), the SaveVMHandler is registered for VFIO device. But it lacks the operation of 'unregister'. It will lead to 'Segmentation fault (core dumped)' in qemu_savevm_state_setup(), if performing live migration after a VFIO device is hot deleted. Fixes: 7c2f5f75f94 (vfio: Register SaveVMHandlers for VFIO device) Reported-by: Qixin Gan Signed-off-by: Kunkun Jiang Cc: qemu-sta...@nongnu.org --- hw/vfio/migration.c | 1 + 1 file changed, 1 insertion(+) diff --git a/hw/vfio/migration.c b/hw/vfio/migration.c index 201642d75e..ef397ebe6c 100644 --- a/hw/vfio/migration.c +++ b/hw/vfio/migration.c @@ -892,6 +892,7 @@ void vfio_migration_finalize(VFIODevice *vbasedev) remove_migration_state_change_notifier(>migration_state); qemu_del_vm_change_state_handler(migration->vm_state); +unregister_savevm(VMSTATE_IF(vbasedev->dev), "vfio", vbasedev); Hmm what about devices using "%s/vfio" id? The unregister_savevm() needs 'VMSTATEIf *obj'. If we pass a non-null 'obj' to unregister_svevm(), it will handle the devices using "%s/vfio" id with the following code: if (obj) { char *oid = vmstate_if_get_id(obj); if (oid) { pstrcpy(id, sizeof(id), oid); pstrcat(id, sizeof(id), "/"); g_free(oid); } } pstrcat(id, sizeof(id), idstr); By the way, I'm puzzled that register_savevm_live() and unregister_savevm() handle devices using "%s/vfio" id differently. So I learned the commit history of register_savevm_live() and unregister_savevm(). In the beginning, both them need 'DeviceState *dev', which are replaced with VMStateIf in 3cad405babb. Later in ce62df5378b, the 'dev' was removed, because no caller of register_savevm_live() need to pass a non-null 'dev' at that time. So now the vfio devices need to handle the 'id' first and then call register_savevm_live(). I am wondering whether we need to add 'VMSTATEIf *obj' in register_savevm_live(). What do you think of this? 
Thanks, Kunkun Jiang vfio_migration_exit(vbasedev); } .
[PATCH] vfio: Fix unregister SaveVMHandler in vfio_migration_finalize
In vfio_migration_init(), the SaveVMHandlers is registered for the VFIO device, but the corresponding 'unregister' operation is missing. This leads to a 'Segmentation fault (core dumped)' in qemu_savevm_state_setup() if live migration is performed after a VFIO device has been hot deleted. Fixes: 7c2f5f75f94 (vfio: Register SaveVMHandlers for VFIO device) Reported-by: Qixin Gan Signed-off-by: Kunkun Jiang --- hw/vfio/migration.c | 1 + 1 file changed, 1 insertion(+) diff --git a/hw/vfio/migration.c b/hw/vfio/migration.c index 201642d75e..ef397ebe6c 100644 --- a/hw/vfio/migration.c +++ b/hw/vfio/migration.c @@ -892,6 +892,7 @@ void vfio_migration_finalize(VFIODevice *vbasedev) remove_migration_state_change_notifier(&migration->migration_state); qemu_del_vm_change_state_handler(migration->vm_state); +unregister_savevm(VMSTATE_IF(vbasedev->dev), "vfio", vbasedev); vfio_migration_exit(vbasedev); } -- 2.23.0
Re: [RFC v9 24/29] hw/arm/smmuv3: Fill the IOTLBEntry leaf field on NH_VA invalidation
Hi Eric, On 2021/4/11 20:09, Eric Auger wrote: Let's propagate the leaf attribute throughout the invalidation path. This hint is used to reduce the scope of the invalidations to the last level of translation. Not enforcing it induces large performance penalties in nested mode. Signed-off-by: Eric Auger --- hw/arm/smmuv3.c | 9 +++++---- 1 file changed, 5 insertions(+), 4 deletions(-) diff --git a/hw/arm/smmuv3.c b/hw/arm/smmuv3.c index 7beb55cd89..74a6408146 100644 --- a/hw/arm/smmuv3.c +++ b/hw/arm/smmuv3.c @@ -799,7 +799,7 @@ epilogue: static void smmuv3_notify_iova(IOMMUMemoryRegion *mr, IOMMUNotifier *n, int asid, dma_addr_t iova, - uint8_t tg, uint64_t num_pages) + uint8_t tg, uint64_t num_pages, bool leaf) { SMMUDevice *sdev = container_of(mr, SMMUDevice, iommu); IOMMUTLBEvent event = {}; @@ -834,6 +834,7 @@ static void smmuv3_notify_iova(IOMMUMemoryRegion *mr, event.entry.perm = IOMMU_NONE; event.entry.flags = IOMMU_INV_FLAGS_ARCHID; event.entry.arch_id = asid; +event.entry.leaf = leaf; memory_region_notify_iommu_one(n, &event); } @@ -865,7 +866,7 @@ static void smmuv3_notify_asid(IOMMUMemoryRegion *mr, /* invalidate an asid/iova range tuple in all mr's */ static void smmuv3_inv_notifiers_iova(SMMUState *s, int asid, dma_addr_t iova, - uint8_t tg, uint64_t num_pages) + uint8_t tg, uint64_t num_pages, bool leaf) { SMMUDevice *sdev; Does the parameter 'leaf' need to be added to the trace?
trace_smmuv3_inv_notifiers_iova(mr->parent_obj.name, asid, iova, tg, num_pages); Thanks, Kunkun Jiang @@ -877,7 +878,7 @@ static void smmuv3_inv_notifiers_iova(SMMUState *s, int asid, dma_addr_t iova, tg, num_pages); IOMMU_NOTIFIER_FOREACH(n, mr) { -smmuv3_notify_iova(mr, n, asid, iova, tg, num_pages); +smmuv3_notify_iova(mr, n, asid, iova, tg, num_pages, leaf); } } } @@ -915,7 +916,7 @@ static void smmuv3_s1_range_inval(SMMUState *s, Cmd *cmd) count = mask + 1; trace_smmuv3_s1_range_inval(vmid, asid, addr, tg, count, ttl, leaf); -smmuv3_inv_notifiers_iova(s, asid, addr, tg, count); +smmuv3_inv_notifiers_iova(s, asid, addr, tg, count, leaf); smmu_iotlb_inv_iova(s, asid, addr, tg, count, ttl); num_pages -= count;
Re: [RFC PATCH v2 0/4] Add migration support for VFIO PCI devices in SMMUv3 nested stage mode
Hi all, This series has been updated to v3.[1] Any comments and reviews are welcome. Thanks, Kunkun Jiang [1] [RFC PATCH v3 0/4] Add migration support for VFIO PCI devices in SMMUv3 nested mode https://lore.kernel.org/qemu-devel/20210511020816.2905-1-jiangkun...@huawei.com/ On 2021/3/31 18:12, Kunkun Jiang wrote: Hi all, Since the SMMUv3's nested translation stages[1] have been introduced by Eric, we need to pay attention to the migration of VFIO PCI devices in SMMUv3 nested stage mode. At present, it is not yet supported in QEMU. There are two problems in the existing framework. First, the current way to get dirty pages is not applicable to nested stage mode. Because of the "Caching Mode", VTD can map the RAM through the host single stage (giova->hpa). "vfio_listener_log_sync" gets dirty pages by passing the "giova" to the kernel for each RAM section of the mapped memory region. In nested stage mode, we set up the stage 2 (gpa->hpa) and the stage 1 (giova->gpa) separately. So the current way of getting dirty pages is inapplicable in nested stage mode. Second, we also need to pass the stage 1 configurations to the destination host after the migration. In Eric's series, the stage 1 configuration is passed to the host on each STE update for the devices that set the PASID PciOps. The configuration is applied at the physical level, but the data of the physical level will not be sent to the destination host. So we have to pass the stage 1 configurations to the destination host after the migration. Best Regards, Kunkun Jiang [1] [RFC,v8,00/28] vSMMUv3/pSMMUv3 2 stage VFIO integration http://patchwork.ozlabs.org/project/qemu-devel/cover/20210225105233.650545-1-eric.au...@redhat.com/ This Patch set includes patches as below: Patch 1-2: - Refactor the vfio_listener_log_sync and add a new function to get dirty pages in nested stage mode.
Patch 3: - Added global_log_start/stop interface in vfio_memory_prereg_listener Patch 4: - Added the post_load function to vmstate_smmuv3 for passing stage 1 configuration to the destination host after the migration. History: v1 -> v2: - Add global_log_start/stop interface in vfio_memory_prereg_listener - Add support for repass stage 1 configs with multiple CDs to the host Kunkun Jiang (4): vfio: Introduce helpers to mark dirty pages of a RAM section vfio: Add vfio_prereg_listener_log_sync in nested stage vfio: Add vfio_prereg_listener_global_log_start/stop in nested stage hw/arm/smmuv3: Post-load stage 1 configurations to the host hw/arm/smmuv3.c | 62 +++ hw/arm/trace-events | 1 + hw/vfio/common.c| 71 - 3 files changed, 126 insertions(+), 8 deletions(-)
[RFC PATCH v3 1/4] vfio: Introduce helpers to mark dirty pages of a RAM section
Extract part of the code from vfio_sync_dirty_bitmap to form a new helper, which allows marking dirty pages of a RAM section. This helper will be called for the nested stage. Signed-off-by: Kunkun Jiang --- hw/vfio/common.c | 22 ++++++++++++++-------- 1 file changed, 14 insertions(+), 8 deletions(-) diff --git a/hw/vfio/common.c b/hw/vfio/common.c index dc8372c772..9fb8d44a6d 100644 --- a/hw/vfio/common.c +++ b/hw/vfio/common.c @@ -1271,6 +1271,19 @@ err_out: return ret; } +static int vfio_dma_sync_ram_section_dirty_bitmap(VFIOContainer *container, + MemoryRegionSection *section) +{ +ram_addr_t ram_addr; + +ram_addr = memory_region_get_ram_addr(section->mr) + + section->offset_within_region; + +return vfio_get_dirty_bitmap(container, +REAL_HOST_PAGE_ALIGN(section->offset_within_address_space), +int128_get64(section->size), ram_addr); +} + typedef struct { IOMMUNotifier n; VFIOGuestIOMMU *giommu; @@ -1312,8 +1325,6 @@ static void vfio_iommu_map_dirty_notify(IOMMUNotifier *n, IOMMUTLBEntry *iotlb) static int vfio_sync_dirty_bitmap(VFIOContainer *container, MemoryRegionSection *section) { -ram_addr_t ram_addr; - if (memory_region_is_iommu(section->mr)) { VFIOGuestIOMMU *giommu; @@ -1342,12 +1353,7 @@ static int vfio_sync_dirty_bitmap(VFIOContainer *container, return 0; } -ram_addr = memory_region_get_ram_addr(section->mr) + - section->offset_within_region; - -return vfio_get_dirty_bitmap(container, - REAL_HOST_PAGE_ALIGN(section->offset_within_address_space), - int128_get64(section->size), ram_addr); +return vfio_dma_sync_ram_section_dirty_bitmap(container, section); } static void vfio_listener_log_sync(MemoryListener *listener, -- 2.23.0
[RFC PATCH v3 4/4] hw/arm/smmuv3: Post-load stage 1 configurations to the host
In nested mode, we call the set_pasid_table() callback on each STE update to pass the guest stage 1 configuration to the host and apply it at the physical level. In the case of live migration, we need to call set_pasid_table() manually to load the guest stage 1 configurations to the host. If this operation fails, the migration fails. Signed-off-by: Kunkun Jiang --- hw/arm/smmuv3.c | 33 ++++++++++++++++++++++++++++----- 1 file changed, 28 insertions(+), 5 deletions(-) diff --git a/hw/arm/smmuv3.c b/hw/arm/smmuv3.c index ca690513e6..ac1de572f3 100644 --- a/hw/arm/smmuv3.c +++ b/hw/arm/smmuv3.c @@ -929,7 +929,7 @@ static void smmuv3_s1_range_inval(SMMUState *s, Cmd *cmd) } } -static void smmuv3_notify_config_change(SMMUState *bs, uint32_t sid) +static int smmuv3_notify_config_change(SMMUState *bs, uint32_t sid) { #ifdef __linux__ IOMMUMemoryRegion *mr = smmu_iommu_mr(bs, sid); @@ -938,9 +938,10 @@ static void smmuv3_notify_config_change(SMMUState *bs, uint32_t sid) IOMMUConfig iommu_config = {}; SMMUTransCfg *cfg; SMMUDevice *sdev; +int ret; if (!mr) { -return; +return 0; } sdev = container_of(mr, SMMUDevice, iommu); @@ -949,13 +950,13 @@ static void smmuv3_notify_config_change(SMMUState *bs, uint32_t sid) smmuv3_flush_config(sdev); if (!pci_device_is_pasid_ops_set(sdev->bus, sdev->devfn)) { -return; +return 0; } cfg = smmuv3_get_config(sdev, &event); if (!cfg) { -return; +return 0; } iommu_config.pasid_cfg.argsz = sizeof(struct iommu_pasid_table_config); @@ -977,10 +978,13 @@ static void smmuv3_notify_config_change(SMMUState *bs, uint32_t sid) iommu_config.pasid_cfg.config, iommu_config.pasid_cfg.base_ptr); -if (pci_device_set_pasid_table(sdev->bus, sdev->devfn, &iommu_config)) { +ret = pci_device_set_pasid_table(sdev->bus, sdev->devfn, &iommu_config); +if (ret) { error_report("Failed to pass PASID table to host for iommu mr %s (%m)", mr->parent_obj.name); } + +return ret; #endif } @@ -1545,6 +1549,24 @@ static void smmu_realize(DeviceState *d, Error **errp) smmu_init_irq(s, dev); } +static int smmuv3_post_load(void *opaque, int version_id) +{ +SMMUv3State *s3 = opaque; +SMMUState *s = &(s3->smmu_state); +SMMUDevice *sdev; +int ret = 0; + +QLIST_FOREACH(sdev, &s->devices_with_notifiers, next) { +uint32_t sid = smmu_get_sid(sdev); +ret = smmuv3_notify_config_change(s, sid); +if (ret) { +break; +} +} + +return ret; +} + static const VMStateDescription vmstate_smmuv3_queue = { .name = "smmuv3_queue", .version_id = 1, @@ -1563,6 +1585,7 @@ static const VMStateDescription vmstate_smmuv3 = { .version_id = 1, .minimum_version_id = 1, .priority = MIG_PRI_IOMMU, +.post_load = smmuv3_post_load, .fields = (VMStateField[]) { VMSTATE_UINT32(features, SMMUv3State), VMSTATE_UINT8(sid_size, SMMUv3State), -- 2.23.0
[RFC PATCH v3 2/4] vfio: Add vfio_prereg_listener_log_sync in nested stage
In nested mode, we set up the stage 2 (gpa->hpa) and stage 1 (giova->gpa) separately by vfio_prereg_listener_region_add() and vfio_listener_region_add(). So when marking dirty pages we just need to pay attention to stage 2 mappings. Legacy vfio_listener_log_sync cannot be used in nested stage. This patch adds vfio_prereg_listener_log_sync to mark dirty pages in nested mode. Signed-off-by: Kunkun Jiang --- hw/vfio/common.c | 27 +++++++++++++++++++++++++++ 1 file changed, 27 insertions(+) diff --git a/hw/vfio/common.c b/hw/vfio/common.c index 9fb8d44a6d..149e535a75 100644 --- a/hw/vfio/common.c +++ b/hw/vfio/common.c @@ -1284,6 +1284,22 @@ static int vfio_dma_sync_ram_section_dirty_bitmap(VFIOContainer *container, int128_get64(section->size), ram_addr); } +static void vfio_prereg_listener_log_sync(MemoryListener *listener, + MemoryRegionSection *section) +{ +VFIOContainer *container = +container_of(listener, VFIOContainer, prereg_listener); + +if (!memory_region_is_ram(section->mr) || +!container->dirty_pages_supported) { +return; +} + +if (vfio_devices_all_dirty_tracking(container)) { +vfio_dma_sync_ram_section_dirty_bitmap(container, section); +} +} + typedef struct { IOMMUNotifier n; VFIOGuestIOMMU *giommu; @@ -1328,6 +1344,16 @@ static int vfio_sync_dirty_bitmap(VFIOContainer *container, if (memory_region_is_iommu(section->mr)) { VFIOGuestIOMMU *giommu; +/* + * In nested mode, stage 2 (gpa->hpa) and stage 1 (giova->gpa) are + * set up separately. It is inappropriate to pass 'giova' to kernel + * to get dirty pages. We only need to focus on stage 2 mapping when + * marking dirty pages.
+ */ +if (container->iommu_type == VFIO_TYPE1_NESTING_IOMMU) { +return 0; +} + QLIST_FOREACH(giommu, &container->giommu_list, giommu_next) { if (MEMORY_REGION(giommu->iommu) == section->mr && giommu->n.start == section->offset_within_region) { @@ -1382,6 +1408,7 @@ static const MemoryListener vfio_memory_listener = { static MemoryListener vfio_memory_prereg_listener = { .region_add = vfio_prereg_listener_region_add, .region_del = vfio_prereg_listener_region_del, +.log_sync = vfio_prereg_listener_log_sync, }; static void vfio_listener_release(VFIOContainer *container) -- 2.23.0
[RFC PATCH v3 3/4] vfio: Add vfio_prereg_listener_global_log_start/stop in nested stage
In nested mode, we set up the stage 2 and stage 1 separately. In my opinion, vfio_memory_prereg_listener is used for stage 2 and vfio_memory_listener is used for stage 1. So it feels weird to call the global_log_start/stop interface in vfio_memory_listener to switch dirty tracking, although this won't cause any errors. Adding the global_log_start/stop interface in vfio_memory_prereg_listener can separate stage 2 from stage 1. Signed-off-by: Kunkun Jiang --- hw/vfio/common.c | 24 ++++++++++++++++++++++++ 1 file changed, 24 insertions(+) diff --git a/hw/vfio/common.c b/hw/vfio/common.c index 149e535a75..6e004fc00f 100644 --- a/hw/vfio/common.c +++ b/hw/vfio/common.c @@ -1209,6 +1209,17 @@ static void vfio_listener_log_global_start(MemoryListener *listener) { VFIOContainer *container = container_of(listener, VFIOContainer, listener); +/* For nested mode, vfio_prereg_listener is used to start dirty tracking */ +if (container->iommu_type != VFIO_TYPE1_NESTING_IOMMU) { +vfio_set_dirty_page_tracking(container, true); +} +} + +static void vfio_prereg_listener_log_global_start(MemoryListener *listener) +{ +VFIOContainer *container = +container_of(listener, VFIOContainer, prereg_listener); + vfio_set_dirty_page_tracking(container, true); } @@ -1216,6 +1227,17 @@ static void vfio_listener_log_global_stop(MemoryListener *listener) { VFIOContainer *container = container_of(listener, VFIOContainer, listener); +/* For nested mode, vfio_prereg_listener is used to stop dirty tracking */ +if (container->iommu_type != VFIO_TYPE1_NESTING_IOMMU) { +vfio_set_dirty_page_tracking(container, false); +} +} + +static void vfio_prereg_listener_log_global_stop(MemoryListener *listener) +{ +VFIOContainer *container = +container_of(listener, VFIOContainer, prereg_listener); + vfio_set_dirty_page_tracking(container, false); } @@ -1408,6 +1430,8 @@ static const MemoryListener vfio_memory_listener = { static MemoryListener vfio_memory_prereg_listener = { .region_add = vfio_prereg_listener_region_add, .region_del =
vfio_prereg_listener_region_del, +.log_global_start = vfio_prereg_listener_log_global_start, +.log_global_stop = vfio_prereg_listener_log_global_stop, .log_sync = vfio_prereg_listener_log_sync, }; -- 2.23.0
[RFC PATCH v3 0/4] Add migration support for VFIO PCI devices in SMMUv3 nested mode
Hi all, Since the SMMUv3's nested translation stages have been introduced by Eric, we need to pay attention to the migration of VFIO PCI devices in SMMUv3 nested stage mode. At present, it is not yet supported in QEMU. There are two problems in the existing framework. First, the current way to get dirty pages is not applicable to nested stage mode. Because of the "Caching Mode", VTD can map the RAM through the host single stage (giova->hpa). "vfio_listener_log_sync" gets dirty pages by passing the "giova" to the kernel for each RAM section of the mapped memory region. In nested stage mode, we set up the stage 2 (gpa->hpa) and the stage 1 (giova->gpa) separately. So the current way of getting dirty pages is inapplicable in nested stage mode. Second, we also need to pass the stage 1 configurations to the destination host after the migration. In Eric's series, the stage 1 configuration is passed to the host on each STE update for the devices that set the PASID PciOps. The configuration is applied at the physical level, but the data of the physical level will not be sent to the destination host. So we have to pass the stage 1 configurations to the destination host after the migration.
This Patch set includes patches as below: Patch 1-2: - Refactor the vfio_listener_log_sync and added a new function to get dirty pages for nested mode Patch 3: - Added global_log_start/stop interface in vfio_memory_prereg_listener for nested mode Patch 4: - Added the post_load function to vmstate_smmuv3 for passing stage 1 configuration to the destination host after the migration Best regards, Kunkun Jiang History: v2 -> v3: - Rebase to v9 of Eric's series 'vSMMUv3/pSMMUv3 2 stage VFIO integration'[1] - Delete smmuv3_manual_set_pci_device_pasid_table() and reuse smmuv3_notify_config_change() [Eric] v1 -> v2: - Add global_log_start/stop interface in vfio_memory_prereg_listener - Add support for repass stage 1 configs with multiple CDs to the host [1] [RFC v9 00/29] vSMMUv3/pSMMUv3 2 stage VFIO integration https://lore.kernel.org/qemu-devel/20210411120912.15770-1-eric.au...@redhat.com/ Kunkun Jiang (4): vfio: Introduce helpers to mark dirty pages of a RAM section vfio: Add vfio_prereg_listener_log_sync in nested stage vfio: Add vfio_prereg_listener_global_log_start/stop in nested stage hw/arm/smmuv3: Post-load stage 1 configurations to the host hw/arm/smmuv3.c | 33 ++ hw/vfio/common.c | 73 ++-- 2 files changed, 93 insertions(+), 13 deletions(-) -- 2.23.0
Re: [RFC PATCH 0/3] vfio/migration: Support manual clear vfio dirty log
Hi all, Sorry for my carelessness. This is the v2 of this series. Thanks, Kunkun Jiang On 2021/5/8 17:31, Kunkun Jiang wrote: In the past, we cleared the dirty log immediately after syncing it to userspace. This may cause redundant dirty handling if userspace handles the dirty log iteratively: after vfio clears the dirty log, new dirty bits start to be generated. These new dirty bits will be reported to userspace even if they were generated before userspace handled the same dirty pages. Since a new dirty log tracking method for vfio based on iommu hwdbm[1] has been introduced in the kernel, adding a new capability named VFIO_DIRTY_LOG_MANUAL_CLEAR, we can eliminate some redundant dirty handling by supporting it. This series includes patches as below: Patch 1: - updated linux-headers/linux/vfio.h from the kernel side Patch 2: - introduced 'struct VFIODMARange' to describe a range of a given DMA mapping with respect to a VFIO_IOMMU_MAP_DMA operation Patch 3: - implemented the operation to manually clear the vfio dirty log, which can eliminate some redundant dirty handling History: v1 -> v2: - Add a new ioctl VFIO_IOMMU_DIRTY_PAGES_FLAG_GET_BITMAP_NOCLEAR to get the vfio dirty log when manual clear is supported. Thanks, Kunkun Jiang [1] IOMMU part: https://lore.kernel.org/linux-iommu/20210507102211.8836-1-zhukeqi...@huawei.com/ VFIO part: https://lore.kernel.org/kvm/20210507103608.39440-1-zhukeqi...@huawei.com/ Zenghui Yu (3): linux-headers: update against 5.12 and "manual clear vfio dirty log" series vfio: Maintain DMA mapping range for the container vfio/migration: Add support for manual clear vfio dirty log hw/vfio/common.c | 211 -- include/hw/vfio/vfio-common.h | 10 ++ linux-headers/linux/vfio.h| 61 +- 3 files changed, 273 insertions(+), 9 deletions(-)
[RFC PATCH 2/3] vfio: Maintain DMA mapping range for the container
From: Zenghui Yu When synchronizing the dirty bitmap from kernel VFIO we do it in a per-iova-range fashion and allocate the userspace bitmap for each ioctl. This patch introduces `struct VFIODMARange` to describe a range of a given DMA mapping with respect to a VFIO_IOMMU_MAP_DMA operation, and makes the bitmap cache of this range persistent so that we don't need to g_try_malloc0() every time. Note that the new structure is almost a copy of `struct vfio_iommu_type1_dma_map`, but is only used internally by QEMU. More importantly, the cached per-iova-range dirty bitmap will be further used when we add support for CLEAR_BITMAP, where it guarantees we don't clear any unknown dirty bits; otherwise that could be a severe data loss issue for the migration code. It's pretty intuitive to maintain a bitmap per container since we perform log_sync at this granule. But I don't know how to deal with things like memory hot-{un}plug, sparse DMA mappings, etc. Suggestions welcome. * yet something to-do: - can't work with guest viommu - no locks - etc [ The idea and even the commit message are largely inherited from kvm side. See commit 9f4bf4baa8b820c7930e23c9566c9493db7e1d25.
] Signed-off-by: Zenghui Yu Signed-off-by: Kunkun Jiang --- hw/vfio/common.c | 62 +++ include/hw/vfio/vfio-common.h | 9 + 2 files changed, 65 insertions(+), 6 deletions(-) diff --git a/hw/vfio/common.c b/hw/vfio/common.c index ae5654fcdb..b8b6418e69 100644 --- a/hw/vfio/common.c +++ b/hw/vfio/common.c @@ -421,6 +421,29 @@ unmap_exit: return ret; } +static VFIODMARange *vfio_lookup_match_range(VFIOContainer *container, +hwaddr start_addr, hwaddr size) +{ +VFIODMARange *qrange; + +QLIST_FOREACH(qrange, &container->dma_list, next) { +if (qrange->iova == start_addr && qrange->size == size) { +return qrange; +} +} +return NULL; +} + +static void vfio_dma_range_init_dirty_bitmap(VFIODMARange *qrange) +{ +uint64_t pages, size; + +pages = REAL_HOST_PAGE_ALIGN(qrange->size) / qemu_real_host_page_size; +size = ROUND_UP(pages, sizeof(__u64) * BITS_PER_BYTE) / BITS_PER_BYTE; + +qrange->bitmap = g_malloc0(size); +} + /* * DMA - Mapping and unmapping for the "type1" IOMMU interface used on x86 */ @@ -434,12 +457,29 @@ static int vfio_dma_unmap(VFIOContainer *container, .iova = iova, .size = size, }; +VFIODMARange *qrange; if (iotlb && container->dirty_pages_supported && vfio_devices_all_running_and_saving(container)) { return vfio_dma_unmap_bitmap(container, iova, size, iotlb); } +/* + * unregister the DMA range + * + * It seems that the memory layer will give us the same section as the one + * used in region_add(). Otherwise it'll be complicated to manipulate the + * bitmap across region_{add,del}. Is there any guarantee? + * + * But there is really not such a restriction on the kernel interface + * (VFIO_IOMMU_DIRTY_PAGES_FLAG_{UN}MAP_DMA, etc).
+qrange = vfio_lookup_match_range(container, iova, size); +assert(qrange); +g_free(qrange->bitmap); +QLIST_REMOVE(qrange, next); +g_free(qrange); + while (ioctl(container->fd, VFIO_IOMMU_UNMAP_DMA, &unmap)) { /* * The type1 backend has an off-by-one bug in the kernel (71a7d3d78e3c @@ -476,6 +516,14 @@ static int vfio_dma_map(VFIOContainer *container, hwaddr iova, .iova = iova, .size = size, }; +VFIODMARange *qrange; + +qrange = g_malloc0(sizeof(*qrange)); +qrange->iova = iova; +qrange->size = size; +QLIST_INSERT_HEAD(&container->dma_list, qrange, next); +/* XXX allocate the dirty bitmap on demand */ +vfio_dma_range_init_dirty_bitmap(qrange); if (!readonly) { map.flags |= VFIO_DMA_MAP_FLAG_WRITE; @@ -1023,9 +1071,14 @@ static int vfio_get_dirty_bitmap(VFIOContainer *container, uint64_t iova, { struct vfio_iommu_type1_dirty_bitmap *dbitmap; struct vfio_iommu_type1_dirty_bitmap_get *range; +VFIODMARange *qrange; uint64_t pages; int ret; +qrange = vfio_lookup_match_range(container, iova, size); +/* the same as vfio_dma_unmap() */ +assert(qrange); + dbitmap = g_malloc0(sizeof(*dbitmap) + sizeof(*range)); dbitmap->argsz = sizeof(*dbitmap) + sizeof(*range); @@ -1044,11 +1097,8 @@ static int vfio_get_dirty_bitmap(VFIOContainer *container, uint64_t iova, pages = REAL_HOST_PAGE_ALIGN(range->size) / qemu_real_host_page_size; range->bitmap.size = ROUND_UP(pages, sizeof(__u64) * BITS_PER_BYTE) / BITS_PER_BYTE; -range->bitmap.data = g_try_malloc0(range->bitmap.size); -if (!range->bitmap.data) { -ret = -ENOMEM; -goto err_out; -}
[RFC PATCH 1/3] linux-headers: update against 5.12 and "manual clear vfio dirty log" series
From: Zenghui Yu The new capability VFIO_DIRTY_LOG_MANUAL_CLEAR and the new ioctl VFIO_IOMMU_DIRTY_PAGES_FLAG_GET_BITMAP_NOCLEAR and VFIO_IOMMU_DIRTY_PAGES_FLAG_CLEAR_BITMAP have been introduced in the kernel, update the header to add them. Signed-off-by: Zenghui Yu Signed-off-by: Kunkun Jiang --- linux-headers/linux/vfio.h | 61 +- 1 file changed, 60 insertions(+), 1 deletion(-) diff --git a/linux-headers/linux/vfio.h b/linux-headers/linux/vfio.h index 609099e455..7e918c697f 100644 --- a/linux-headers/linux/vfio.h +++ b/linux-headers/linux/vfio.h @@ -46,6 +46,20 @@ */ #define VFIO_NOIOMMU_IOMMU 8 +/* Supports VFIO_DMA_UNMAP_FLAG_ALL */ +#define VFIO_UNMAP_ALL 9 + +/* Supports the vaddr flag for DMA map and unmap */ +#define VFIO_UPDATE_VADDR 10 + +/* + * The vfio_iommu driver may support user clears dirty log manually, which means + * dirty log is not cleared automatically after dirty log is copied to userspace, + * it's user's duty to clear dirty log. Note: when user queries this extension + * and vfio_iommu driver supports it, then it is enabled. + */ +#define VFIO_DIRTY_LOG_MANUAL_CLEAR11 + /* * The IOCTL interface is designed for extensibility by embedding the * structure length (argsz) and flags into structures passed between @@ -1074,12 +1088,22 @@ struct vfio_iommu_type1_info_dma_avail { * * Map process virtual addresses to IO virtual addresses using the * provided struct vfio_dma_map. Caller sets argsz. READ &/ WRITE required. + * + * If flags & VFIO_DMA_MAP_FLAG_VADDR, update the base vaddr for iova, and + * unblock translation of host virtual addresses in the iova range. The vaddr + * must have previously been invalidated with VFIO_DMA_UNMAP_FLAG_VADDR. To + * maintain memory consistency within the user application, the updated vaddr + * must address the same memory object as originally mapped. Failure to do so + * will result in user memory corruption and/or device misbehavior. iova and + * size must match those in the original MAP_DMA call. 
Protection is not + * changed, and the READ & WRITE flags must be 0. */ struct vfio_iommu_type1_dma_map { __u32 argsz; __u32 flags; #define VFIO_DMA_MAP_FLAG_READ (1 << 0)/* readable from device */ #define VFIO_DMA_MAP_FLAG_WRITE (1 << 1) /* writable from device */ +#define VFIO_DMA_MAP_FLAG_VADDR (1 << 2) __u64 vaddr; /* Process virtual address */ __u64 iova; /* IO virtual address */ __u64 size; /* Size of mapping (bytes) */ @@ -1102,6 +1126,7 @@ struct vfio_bitmap { * field. No guarantee is made to the user that arbitrary unmaps of iova * or size different from those used in the original mapping call will * succeed. + * * VFIO_DMA_UNMAP_FLAG_GET_DIRTY_BITMAP should be set to get the dirty bitmap * before unmapping IO virtual addresses. When this flag is set, the user must * provide a struct vfio_bitmap in data[]. User must provide zero-allocated @@ -,11 +1136,21 @@ struct vfio_bitmap { * indicates that the page at that offset from iova is dirty. A Bitmap of the * pages in the range of unmapped size is returned in the user-provided * vfio_bitmap.data. + * + * If flags & VFIO_DMA_UNMAP_FLAG_ALL, unmap all addresses. iova and size + * must be 0. This cannot be combined with the get-dirty-bitmap flag. + * + * If flags & VFIO_DMA_UNMAP_FLAG_VADDR, do not unmap, but invalidate host + * virtual addresses in the iova range. Tasks that attempt to translate an + * iova's vaddr will block. DMA to already-mapped pages continues. This + * cannot be combined with the get-dirty-bitmap flag. */ struct vfio_iommu_type1_dma_unmap { __u32 argsz; __u32 flags; #define VFIO_DMA_UNMAP_FLAG_GET_DIRTY_BITMAP (1 << 0) +#define VFIO_DMA_UNMAP_FLAG_ALL (1 << 1) +#define VFIO_DMA_UNMAP_FLAG_VADDR (1 << 2) __u64 iova; /* IO virtual address */ __u64 size; /* Size of mapping (bytes) */ __u8data[]; @@ -1161,7 +1196,29 @@ struct vfio_iommu_type1_dma_unmap { * actual bitmap. If dirty pages logging is not enabled, an error will be * returned. 
* - * Only one of the flags _START, _STOP and _GET may be specified at a time. + * The VFIO_IOMMU_DIRTY_PAGES_FLAG_GET_BITMAP_NOCLEAR flag is almost same as + * VFIO_IOMMU_DIRTY_PAGES_FLAG_GET_BITMAP, except that it requires underlying + * dirty bitmap is not cleared automatically. The user can clear it manually by + * calling the IOCTL with VFIO_IOMMU_DIRTY_PAGES_FLAG_CLEAR_BITMAP flag set. + * + * Calling the IOCTL with VFIO_IOMMU_DIRTY_PAGES_FLAG_CLEAR_BITMAP flag set, + * instructs the IOMMU driver to clear the dirty status of pages in a bitmap + * for IOMMU
[RFC PATCH 3/3] vfio/migration: Add support for manual clear vfio dirty log
From: Zenghui Yu The new capability VFIO_DIRTY_LOG_MANUAL_CLEAR and the new ioctl VFIO_IOMMU_DIRTY_PAGES_FLAG_GET_BITMAP_NOCLEAR and VFIO_IOMMU_DIRTY_PAGES_FLAG_CLEAR_BITMAP have been introduced in the kernel, tweak the userspace side to use them. Check if the kernel supports VFIO_DIRTY_LOG_MANUAL_CLEAR and provide the log_clear() hook for vfio_memory_listener. If the kernel supports it, deliver the clear message to the kernel. Signed-off-by: Zenghui Yu Signed-off-by: Kunkun Jiang --- hw/vfio/common.c | 149 +- include/hw/vfio/vfio-common.h | 1 + 2 files changed, 148 insertions(+), 2 deletions(-) diff --git a/hw/vfio/common.c b/hw/vfio/common.c index b8b6418e69..9c41a36a61 100644 --- a/hw/vfio/common.c +++ b/hw/vfio/common.c @@ -1082,7 +1082,9 @@ static int vfio_get_dirty_bitmap(VFIOContainer *container, uint64_t iova, dbitmap = g_malloc0(sizeof(*dbitmap) + sizeof(*range)); dbitmap->argsz = sizeof(*dbitmap) + sizeof(*range); -dbitmap->flags = VFIO_IOMMU_DIRTY_PAGES_FLAG_GET_BITMAP; +dbitmap->flags = container->dirty_log_manual_clear ? + VFIO_IOMMU_DIRTY_PAGES_FLAG_GET_BITMAP_NOCLEAR : + VFIO_IOMMU_DIRTY_PAGES_FLAG_GET_BITMAP; range = (struct vfio_iommu_type1_dirty_bitmap_get *)&dbitmap->data; range->iova = iova; range->size = size; @@ -1213,12 +1215,148 @@ static void vfio_listener_log_sync(MemoryListener *listener, } } +/* + * I'm not sure if there's any alignment requirement for the CLEAR_BITMAP + * ioctl. But copy from kvm side and align {start, size} with 64 pages. + * + * I think the code can be simplified a lot if no alignment requirement.
+ */ +#define VFIO_CLEAR_LOG_SHIFT 6 +#define VFIO_CLEAR_LOG_ALIGN (qemu_real_host_page_size << VFIO_CLEAR_LOG_SHIFT) +#define VFIO_CLEAR_LOG_MASK (-VFIO_CLEAR_LOG_ALIGN) + +static int vfio_log_clear_one_range(VFIOContainer *container, +VFIODMARange *qrange, uint64_t start, uint64_t size) +{ +struct vfio_iommu_type1_dirty_bitmap *dbitmap; +struct vfio_iommu_type1_dirty_bitmap_get *range; + +dbitmap = g_malloc0(sizeof(*dbitmap) + sizeof(*range)); + +dbitmap->argsz = sizeof(*dbitmap) + sizeof(*range); +dbitmap->flags = VFIO_IOMMU_DIRTY_PAGES_FLAG_CLEAR_BITMAP; +range = (struct vfio_iommu_type1_dirty_bitmap_get *)&dbitmap->data; + +/* + * Now let's deal with the actual bitmap, which is almost the same + * as the kvm side. + */ +uint64_t end, bmap_start, start_delta, bmap_npages; +unsigned long *bmap_clear = NULL, psize = qemu_real_host_page_size; +int ret; + +bmap_start = start & VFIO_CLEAR_LOG_MASK; +start_delta = start - bmap_start; +bmap_start /= psize; + +bmap_npages = DIV_ROUND_UP(size + start_delta, VFIO_CLEAR_LOG_ALIGN) +<< VFIO_CLEAR_LOG_SHIFT; +end = qrange->size / psize; +if (bmap_npages > end - bmap_start) { +bmap_npages = end - bmap_start; +} +start_delta /= psize; + +if (start_delta) { +bmap_clear = bitmap_new(bmap_npages); +bitmap_copy_with_src_offset(bmap_clear, qrange->bitmap, +bmap_start, start_delta + size / psize); +bitmap_clear(bmap_clear, 0, start_delta); +range->bitmap.data = (__u64 *)bmap_clear; +} else { +range->bitmap.data = (__u64 *)(qrange->bitmap + BIT_WORD(bmap_start)); +} + +range->iova = qrange->iova + bmap_start * psize; +range->size = bmap_npages * psize; +range->bitmap.size = ROUND_UP(bmap_npages, sizeof(__u64) * BITS_PER_BYTE) / + BITS_PER_BYTE; +range->bitmap.pgsize = qemu_real_host_page_size; + +ret = ioctl(container->fd, VFIO_IOMMU_DIRTY_PAGES, dbitmap); +if (ret) { +error_report("Failed to clear dirty log for iova: 0x%"PRIx64 +" size: 0x%"PRIx64" err: %d", (uint64_t)range->iova, +(uint64_t)range->size, errno); +goto err_out; +} +
+bitmap_clear(qrange->bitmap, bmap_start + start_delta, size / psize); +err_out: +g_free(bmap_clear); +g_free(dbitmap); +return 0; +} + +static int vfio_physical_log_clear(VFIOContainer *container, + MemoryRegionSection *section) +{ +uint64_t start, size, offset, count; +VFIODMARange *qrange; +int ret = 0; + +if (!container->dirty_log_manual_clear) { +/* No need to do explicit clear */ +return ret; +} + +start = section->offset_within_address_space; +size = int128_get64(section->size); + +if (!size) { +return ret; +} + +QLIST_FOREACH(qrange, &container->dma_list, next) { +/* + * Discard ranges that do not overlap the section (e.g., the + * Memory BAR regions of the device) + */ +if (qrange->i
[RFC PATCH 0/3] vfio/migration: Support manual clear vfio dirty log
In the past, we cleared the dirty log immediately after syncing it to userspace. This may cause redundant dirty handling if userspace handles the dirty log iteratively: after vfio clears the dirty log, new dirty log starts to be generated, and the new log will be reported to userspace even if it covers pages dirtied before userspace handled the previous report. Since a new dirty log tracking method for vfio, based on iommu hwdbm [1], has been introduced in the kernel, together with a new capability named VFIO_DIRTY_LOG_MANUAL_CLEAR, we can eliminate some redundant dirty handling by supporting it.

This series includes the following patches:
Patch 1: update linux-headers/linux/vfio.h from the kernel side
Patch 2: introduce 'struct VFIODMARange' to describe a range of a given DMA mapping, corresponding to a VFIO_IOMMU_MAP_DMA operation
Patch 3: implement the operation to manually clear the vfio dirty log, which can eliminate some redundant dirty handling

History:
v1 -> v2:
- Add a new ioctl VFIO_IOMMU_DIRTY_PAGES_FLAG_GET_BITMAP_NOCLEAR to get the vfio dirty log when manual clear is supported.

Thanks,
Kunkun Jiang

[1] IOMMU part: https://lore.kernel.org/linux-iommu/20210507102211.8836-1-zhukeqi...@huawei.com/
VFIO part: https://lore.kernel.org/kvm/20210507103608.39440-1-zhukeqi...@huawei.com/

Zenghui Yu (3):
  linux-headers: update against 5.12 and "manual clear vfio dirty log" series
  vfio: Maintain DMA mapping range for the container
  vfio/migration: Add support for manual clear vfio dirty log

 hw/vfio/common.c              | 211 ++++++++++++++++++++++++++++++++---
 include/hw/vfio/vfio-common.h |  10 ++
 linux-headers/linux/vfio.h    |  61 ++++++++-
 3 files changed, 273 insertions(+), 9 deletions(-)

--
2.23.0
Re: [question] The source cannot recover, if the destination fails in the last round of live migration
Hi Dave, On 2021/5/6 21:05, Dr. David Alan Gilbert wrote: * Kunkun Jiang (jiangkun...@huawei.com) wrote: Hi all, Hi, Recently I have been learning about live migration, and I have a question about the last round. When the pending_size is less than the threshold, migration enters the last round and calls migration_completion(). It stops the source and sends the remaining dirty pages and the devices' state information to the destination. The destination loads this information and starts the VM. If there is an error at the destination at this point, it exits directly, and the source cannot detect the error and recover, because the source never calls migration_detect_error(). Is my understanding correct? Should the source wait for the result of the last round from the destination? Try setting the 'return-path' migration capability on both the source and destination; I think that option will cause the destination to send an OK/error at the end and the source to wait for it. Thank you for your reply! The 'return-path' migration capability solved my problem. But why not make it the default? In my opinion, it is a basic capability of live migration: we need it to judge whether the last round succeeded on the destination in the 'precopy' case. Looking forward to your reply. Thanks, Kunkun Jiang Dave Thanks, Kunkun Jiang
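For reference, the capability Dave mentions can be enabled from QMP on both the source and the destination before starting the migration (the equivalent HMP command is `migrate_set_capability return-path on`):

```json
{ "execute": "migrate-set-capabilities",
  "arguments": { "capabilities": [
      { "capability": "return-path", "state": true } ] } }
```

With this set, the destination sends a final OK/error back over the return channel, and the source's completion path waits for it instead of assuming success.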
[question] The source cannot recover, if the destination fails in the last round of live migration
Hi all, Recently I have been learning about live migration, and I have a question about the last round. When the pending_size is less than the threshold, migration enters the last round and calls migration_completion(). It stops the source and sends the remaining dirty pages and the devices' state information to the destination. The destination loads this information and starts the VM. If there is an error at the destination at this point, it exits directly, and the source cannot detect the error and recover, because the source never calls migration_detect_error(). Is my understanding correct? Should the source wait for the result of the last round from the destination? Thanks, Kunkun Jiang
Re: [RFC v9 15/29] vfio: Set up nested stage mappings
Hi Eric, On 2021/4/27 3:16, Auger Eric wrote: Hi Kunkun, On 4/15/21 4:03 AM, Kunkun Jiang wrote: Hi Eric, On 2021/4/14 16:05, Auger Eric wrote: Hi Kunkun, On 4/14/21 3:45 AM, Kunkun Jiang wrote: On 2021/4/13 20:57, Auger Eric wrote: Hi Kunkun, On 4/13/21 2:10 PM, Kunkun Jiang wrote: Hi Eric, On 2021/4/11 20:08, Eric Auger wrote: In nested mode, legacy vfio_iommu_map_notify cannot be used as there is no "caching" mode and we do not trap on map. On Intel, vfio_iommu_map_notify was used to DMA map the RAM through the host single stage. With nested mode, we need to setup the stage 2 and the stage 1 separately. This patch introduces a prereg_listener to setup the stage 2 mapping. The stage 1 mapping, owned by the guest, is passed to the host when the guest invalidates the stage 1 configuration, through a dedicated PCIPASIDOps callback. Guest IOTLB invalidations are cascaded downto the host through another IOMMU MR UNMAP notifier. Signed-off-by: Eric Auger --- v7 -> v8: - properly handle new IOMMUTLBEntry fields and especially propagate DOMAIN and PASID based invalidations v6 -> v7: - remove PASID based invalidation v5 -> v6: - add error_report_err() - remove the abort in case of nested stage case v4 -> v5: - use VFIO_IOMMU_SET_PASID_TABLE - use PCIPASIDOps for config notification v3 -> v4: - use iommu_inv_pasid_info for ASID invalidation v2 -> v3: - use VFIO_IOMMU_ATTACH_PASID_TABLE - new user API - handle leaf v1 -> v2: - adapt to uapi changes - pass the asid - pass IOMMU_NOTIFIER_S1_CFG when initializing the config notifier --- hw/vfio/common.c | 139 +-- hw/vfio/pci.c | 21 +++ hw/vfio/trace-events | 2 + 3 files changed, 157 insertions(+), 5 deletions(-) diff --git a/hw/vfio/common.c b/hw/vfio/common.c index 0cd7ef2139..e369d451e7 100644 --- a/hw/vfio/common.c +++ b/hw/vfio/common.c @@ -595,6 +595,73 @@ static bool vfio_get_xlat_addr(IOMMUTLBEntry *iotlb, void **vaddr, return true; } +/* Propagate a guest IOTLB invalidation to the host (nested mode) */ +static void 
vfio_iommu_unmap_notify(IOMMUNotifier *n, IOMMUTLBEntry *iotlb)
+{
+    VFIOGuestIOMMU *giommu = container_of(n, VFIOGuestIOMMU, n);
+    struct vfio_iommu_type1_cache_invalidate ustruct = {};
+    VFIOContainer *container = giommu->container;
+    int ret;
+
+    assert(iotlb->perm == IOMMU_NONE);
+
+    ustruct.argsz = sizeof(ustruct);
+    ustruct.flags = 0;
+    ustruct.info.argsz = sizeof(struct iommu_cache_invalidate_info);
+    ustruct.info.version = IOMMU_CACHE_INVALIDATE_INFO_VERSION_1;
+    ustruct.info.cache = IOMMU_CACHE_INV_TYPE_IOTLB;
+
+    switch (iotlb->granularity) {
+    case IOMMU_INV_GRAN_DOMAIN:
+        ustruct.info.granularity = IOMMU_INV_GRANU_DOMAIN;
+        break;
+    case IOMMU_INV_GRAN_PASID:
+    {
+        struct iommu_inv_pasid_info *pasid_info;
+        int archid = -1;
+
+        pasid_info = &ustruct.info.granu.pasid_info;
+        ustruct.info.granularity = IOMMU_INV_GRANU_PASID;
+        if (iotlb->flags & IOMMU_INV_FLAGS_ARCHID) {
+            pasid_info->flags |= IOMMU_INV_ADDR_FLAGS_ARCHID;
+            archid = iotlb->arch_id;
+        }
+        pasid_info->archid = archid;
+        trace_vfio_iommu_asid_inv_iotlb(archid);
+        break;
+    }
+    case IOMMU_INV_GRAN_ADDR:
+    {
+        hwaddr start = iotlb->iova + giommu->iommu_offset;
+        struct iommu_inv_addr_info *addr_info;
+        size_t size = iotlb->addr_mask + 1;
+        int archid = -1;
+
+        addr_info = &ustruct.info.granu.addr_info;
+        ustruct.info.granularity = IOMMU_INV_GRANU_ADDR;
+        if (iotlb->leaf) {
+            addr_info->flags |= IOMMU_INV_ADDR_FLAGS_LEAF;
+        }
+        if (iotlb->flags & IOMMU_INV_FLAGS_ARCHID) {
+            addr_info->flags |= IOMMU_INV_ADDR_FLAGS_ARCHID;
+            archid = iotlb->arch_id;
+        }
+        addr_info->archid = archid;
+        addr_info->addr = start;
+        addr_info->granule_size = size;
+        addr_info->nb_granules = 1;
+        trace_vfio_iommu_addr_inv_iotlb(archid, start, size,
+                                        1, iotlb->leaf);
+        break;
+    }
Should we pass a size to host kernel here, even if vSMMU doesn't support RIL or guest kernel doesn't use RIL? It will cause TLBI issue in this scenario: Guest kernel issues a TLBI cmd without "range" (tg = 0) to invalidate a 2M huge page.
Then qemu passed the iova and size (4K) to host kernel. Finally, host kernel issues a TLBI cmd with "range" (4K) which can not invalidate the TLB entry of 2M huge page. (pSMMU supports RIL) In that case the guest will loop over all 4K images belonging to the 2M huge page and invalidate each of them. This should turn into qemu notifications for each 4kB page, no? This is totally inefficient, hence The guest will not loop over all 4K images belonging to th
Re: [RFC v9 14/29] vfio: Introduce helpers to DMA map/unmap a RAM section
+        if (ret) {
+            error_report("vfio_dma_unmap(%p, 0x%"HWADDR_PRIx", "
+                         "0x%"HWADDR_PRIx") = %d (%m)",
+                         container, iova, int128_get64(llsize), ret);
+        }
+    }
+}
+
 static void vfio_listener_region_add(MemoryListener *listener,
                                      MemoryRegionSection *section)
 {
     VFIOContainer *container = container_of(listener, VFIOContainer, listener);
     hwaddr iova, end;
-    Int128 llend, llsize;
-    void *vaddr;
+    Int128 llend;
     int ret;
     VFIOHostDMAWindow *hostwin;
     Error *err = NULL;
@@ -814,39 +927,7 @@ static void vfio_listener_region_add(MemoryListener *listener,
     }

     /* Here we assume that memory_region_is_ram(section->mr)==true */
-
-    vaddr = memory_region_get_ram_ptr(section->mr) +
-            section->offset_within_region +
-            (iova - section->offset_within_address_space);
-
-    trace_vfio_listener_region_add_ram(iova, end, vaddr);
-
-    llsize = int128_sub(llend, int128_make64(iova));
-
-    if (memory_region_is_ram_device(section->mr)) {
-        hwaddr pgmask = (1ULL << ctz64(hostwin->iova_pgsizes)) - 1;
-
-        if ((iova & pgmask) || (int128_get64(llsize) & pgmask)) {
-            trace_vfio_listener_region_add_no_dma_map(
-                memory_region_name(section->mr),
-                section->offset_within_address_space,
-                int128_getlo(section->size),
-                pgmask + 1);
-            return;
-        }
-    }
-
-    ret = vfio_dma_map(container, iova, int128_get64(llsize),
-                       vaddr, section->readonly);
-    if (ret) {
-        error_setg(&err, "vfio_dma_map(%p, 0x%"HWADDR_PRIx", "
-                   "0x%"HWADDR_PRIx", %p) = %d (%m)",
-                   container, iova, int128_get64(llsize), vaddr, ret);
-        if (memory_region_is_ram_device(section->mr)) {
-            /* Allow unexpected mappings not to be fatal for RAM devices */
-            error_report_err(err);
-            return;
-        }
+    if (vfio_dma_map_ram_section(container, section, &err)) {
         goto fail;
     }
@@ -880,10 +961,6 @@ static void vfio_listener_region_del(MemoryListener *listener,
                                      MemoryRegionSection *section)
 {
     VFIOContainer *container = container_of(listener, VFIOContainer, listener);
-    hwaddr iova, end;
-    Int128 llend, llsize;
-    int ret;
-    bool try_unmap = true;

     if (vfio_listener_skipped_section(section)) {
trace_vfio_listener_region_del_skip(
@@ -923,49 +1000,7 @@ static void vfio_listener_region_del(MemoryListener *listener,
      */
 }

There is some code in vfio_listener_region_del() that just doesn't show up here. I post it below.

if (memory_region_is_iommu(section->mr)) {
    VFIOGuestIOMMU *giommu;

    QLIST_FOREACH(giommu, &container->giommu_list, giommu_next) {
        if (MEMORY_REGION(giommu->iommu) == section->mr &&
            giommu->n.start == section->offset_within_region) {
            memory_region_unregister_iommu_notifier(section->mr, &giommu->n);
            QLIST_REMOVE(giommu, giommu_next);
            g_free(giommu);
            break;
        }
    }
    /*
     * FIXME: We assume the one big unmap below is adequate to
     * remove any individual page mappings in the IOMMU which
     * might have been copied into VFIO. This works for a page table
     * based IOMMU where a big unmap flattens a large range of IO-PTEs.
     * That may not be true for all IOMMU types.
     */
}

I think we need a check here: if it is nested mode, just return after g_free(giommu). In nested mode, stage 2 (gpa->hpa) and stage 1 (giova->gpa) are set up separately. When hot-deleting a PCI device, we are going to call both vfio_listener_region_del() and vfio_prereg_listener_region_del(), so it is not appropriate to unmap the RAM section in vfio_listener_region_del(). The RAM section will be unmapped in vfio_prereg_listener_region_del().

Thanks, Kunkun Jiang

-    iova = REAL_HOST_PAGE_ALIGN(section->offset_within_address_space);
-    llend = int128_make64(section->offset_within_address_space);
-    llend = int128_add(llend, section->size);
-    llend = int128_and(llend, int128_exts64(qemu_real_host_page_mask));
-
-    if (int128_ge(int128_make64(iova), llend)) {
-        return;
-    }
-    end = int128_get64(int128_sub(llend, int128_one()));
-
-    llsize = int128_sub(llend, int128_make64(iova));
-
-    trace_vfio_listener_region_del(iova, end);
-
-    if (memory_region_is_ram_device(section->mr)) {
-        hwaddr pgmask;
-        VFIOHostDMAWindow *hostwin = hostwin_from_range(container, iova, end);
-
-        assert(hostwin); /* or region_add() would have failed */
-
-        pgmask = (1ULL &
Re: [RFC v9 15/29] vfio: Set up nested stage mappings
Hi Eric, On 2021/4/26 20:30, Auger Eric wrote: Hi Kunkun, On 4/14/21 3:45 AM, Kunkun Jiang wrote: On 2021/4/13 20:57, Auger Eric wrote: Hi Kunkun, On 4/13/21 2:10 PM, Kunkun Jiang wrote: Hi Eric, On 2021/4/11 20:08, Eric Auger wrote: In nested mode, legacy vfio_iommu_map_notify cannot be used as there is no "caching" mode and we do not trap on map. On Intel, vfio_iommu_map_notify was used to DMA map the RAM through the host single stage. With nested mode, we need to setup the stage 2 and the stage 1 separately. This patch introduces a prereg_listener to setup the stage 2 mapping. The stage 1 mapping, owned by the guest, is passed to the host when the guest invalidates the stage 1 configuration, through a dedicated PCIPASIDOps callback. Guest IOTLB invalidations are cascaded downto the host through another IOMMU MR UNMAP notifier. Signed-off-by: Eric Auger --- v7 -> v8: - properly handle new IOMMUTLBEntry fields and especially propagate DOMAIN and PASID based invalidations v6 -> v7: - remove PASID based invalidation v5 -> v6: - add error_report_err() - remove the abort in case of nested stage case v4 -> v5: - use VFIO_IOMMU_SET_PASID_TABLE - use PCIPASIDOps for config notification v3 -> v4: - use iommu_inv_pasid_info for ASID invalidation v2 -> v3: - use VFIO_IOMMU_ATTACH_PASID_TABLE - new user API - handle leaf v1 -> v2: - adapt to uapi changes - pass the asid - pass IOMMU_NOTIFIER_S1_CFG when initializing the config notifier --- hw/vfio/common.c | 139 +-- hw/vfio/pci.c | 21 +++ hw/vfio/trace-events | 2 + 3 files changed, 157 insertions(+), 5 deletions(-) diff --git a/hw/vfio/common.c b/hw/vfio/common.c index 0cd7ef2139..e369d451e7 100644 --- a/hw/vfio/common.c +++ b/hw/vfio/common.c @@ -595,6 +595,73 @@ static bool vfio_get_xlat_addr(IOMMUTLBEntry *iotlb, void **vaddr, return true; } +/* Propagate a guest IOTLB invalidation to the host (nested mode) */ +static void vfio_iommu_unmap_notify(IOMMUNotifier *n, IOMMUTLBEntry *iotlb) +{ + VFIOGuestIOMMU *giommu = 
container_of(n, VFIOGuestIOMMU, n);
+    struct vfio_iommu_type1_cache_invalidate ustruct = {};
+    VFIOContainer *container = giommu->container;
+    int ret;
+
+    assert(iotlb->perm == IOMMU_NONE);
+
+    ustruct.argsz = sizeof(ustruct);
+    ustruct.flags = 0;
+    ustruct.info.argsz = sizeof(struct iommu_cache_invalidate_info);
+    ustruct.info.version = IOMMU_CACHE_INVALIDATE_INFO_VERSION_1;
+    ustruct.info.cache = IOMMU_CACHE_INV_TYPE_IOTLB;
+
+    switch (iotlb->granularity) {
+    case IOMMU_INV_GRAN_DOMAIN:
+        ustruct.info.granularity = IOMMU_INV_GRANU_DOMAIN;
+        break;
+    case IOMMU_INV_GRAN_PASID:
+    {
+        struct iommu_inv_pasid_info *pasid_info;
+        int archid = -1;
+
+        pasid_info = &ustruct.info.granu.pasid_info;
+        ustruct.info.granularity = IOMMU_INV_GRANU_PASID;
+        if (iotlb->flags & IOMMU_INV_FLAGS_ARCHID) {
+            pasid_info->flags |= IOMMU_INV_ADDR_FLAGS_ARCHID;
+            archid = iotlb->arch_id;
+        }
+        pasid_info->archid = archid;
+        trace_vfio_iommu_asid_inv_iotlb(archid);
+        break;
+    }
+    case IOMMU_INV_GRAN_ADDR:
+    {
+        hwaddr start = iotlb->iova + giommu->iommu_offset;
+        struct iommu_inv_addr_info *addr_info;
+        size_t size = iotlb->addr_mask + 1;
+        int archid = -1;
+
+        addr_info = &ustruct.info.granu.addr_info;
+        ustruct.info.granularity = IOMMU_INV_GRANU_ADDR;
+        if (iotlb->leaf) {
+            addr_info->flags |= IOMMU_INV_ADDR_FLAGS_LEAF;
+        }
+        if (iotlb->flags & IOMMU_INV_FLAGS_ARCHID) {
+            addr_info->flags |= IOMMU_INV_ADDR_FLAGS_ARCHID;
+            archid = iotlb->arch_id;
+        }
+        addr_info->archid = archid;
+        addr_info->addr = start;
+        addr_info->granule_size = size;
+        addr_info->nb_granules = 1;
+        trace_vfio_iommu_addr_inv_iotlb(archid, start, size,
+                                        1, iotlb->leaf);
+        break;
+    }
Should we pass a size to host kernel here, even if vSMMU doesn't support RIL or guest kernel doesn't use RIL? It will cause TLBI issue in this scenario: Guest kernel issues a TLBI cmd without "range" (tg = 0) to invalidate a 2M huge page. Then qemu passed the iova and size (4K) to host kernel.
Finally, host kernel issues a TLBI cmd with "range" (4K) which can not invalidate the TLB entry of 2M huge page. (pSMMU supports RIL) In that case the guest will loop over all 4K images belonging to the 2M huge page and invalidate each of them. This should turn into qemu notifications for each 4kB page, no? This is totally inefficient, hence The guest will not loop over all 4K images belonging to the 2M huge page. The iommu_iotlb_gather->pgsize will be 2M, if a page is 2M huge page. The gather->pgsize wil
Re: [RFC v9 15/29] vfio: Set up nested stage mappings
Hi Eric, On 2021/4/14 16:05, Auger Eric wrote: Hi Kunkun, On 4/14/21 3:45 AM, Kunkun Jiang wrote: On 2021/4/13 20:57, Auger Eric wrote: Hi Kunkun, On 4/13/21 2:10 PM, Kunkun Jiang wrote: Hi Eric, On 2021/4/11 20:08, Eric Auger wrote: In nested mode, legacy vfio_iommu_map_notify cannot be used as there is no "caching" mode and we do not trap on map. On Intel, vfio_iommu_map_notify was used to DMA map the RAM through the host single stage. With nested mode, we need to setup the stage 2 and the stage 1 separately. This patch introduces a prereg_listener to setup the stage 2 mapping. The stage 1 mapping, owned by the guest, is passed to the host when the guest invalidates the stage 1 configuration, through a dedicated PCIPASIDOps callback. Guest IOTLB invalidations are cascaded downto the host through another IOMMU MR UNMAP notifier. Signed-off-by: Eric Auger --- v7 -> v8: - properly handle new IOMMUTLBEntry fields and especially propagate DOMAIN and PASID based invalidations v6 -> v7: - remove PASID based invalidation v5 -> v6: - add error_report_err() - remove the abort in case of nested stage case v4 -> v5: - use VFIO_IOMMU_SET_PASID_TABLE - use PCIPASIDOps for config notification v3 -> v4: - use iommu_inv_pasid_info for ASID invalidation v2 -> v3: - use VFIO_IOMMU_ATTACH_PASID_TABLE - new user API - handle leaf v1 -> v2: - adapt to uapi changes - pass the asid - pass IOMMU_NOTIFIER_S1_CFG when initializing the config notifier --- hw/vfio/common.c | 139 +-- hw/vfio/pci.c | 21 +++ hw/vfio/trace-events | 2 + 3 files changed, 157 insertions(+), 5 deletions(-) diff --git a/hw/vfio/common.c b/hw/vfio/common.c index 0cd7ef2139..e369d451e7 100644 --- a/hw/vfio/common.c +++ b/hw/vfio/common.c @@ -595,6 +595,73 @@ static bool vfio_get_xlat_addr(IOMMUTLBEntry *iotlb, void **vaddr, return true; } +/* Propagate a guest IOTLB invalidation to the host (nested mode) */ +static void vfio_iommu_unmap_notify(IOMMUNotifier *n, IOMMUTLBEntry *iotlb) +{ + VFIOGuestIOMMU *giommu = 
container_of(n, VFIOGuestIOMMU, n);
+    struct vfio_iommu_type1_cache_invalidate ustruct = {};
+    VFIOContainer *container = giommu->container;
+    int ret;
+
+    assert(iotlb->perm == IOMMU_NONE);
+
+    ustruct.argsz = sizeof(ustruct);
+    ustruct.flags = 0;
+    ustruct.info.argsz = sizeof(struct iommu_cache_invalidate_info);
+    ustruct.info.version = IOMMU_CACHE_INVALIDATE_INFO_VERSION_1;
+    ustruct.info.cache = IOMMU_CACHE_INV_TYPE_IOTLB;
+
+    switch (iotlb->granularity) {
+    case IOMMU_INV_GRAN_DOMAIN:
+        ustruct.info.granularity = IOMMU_INV_GRANU_DOMAIN;
+        break;
+    case IOMMU_INV_GRAN_PASID:
+    {
+        struct iommu_inv_pasid_info *pasid_info;
+        int archid = -1;
+
+        pasid_info = &ustruct.info.granu.pasid_info;
+        ustruct.info.granularity = IOMMU_INV_GRANU_PASID;
+        if (iotlb->flags & IOMMU_INV_FLAGS_ARCHID) {
+            pasid_info->flags |= IOMMU_INV_ADDR_FLAGS_ARCHID;
+            archid = iotlb->arch_id;
+        }
+        pasid_info->archid = archid;
+        trace_vfio_iommu_asid_inv_iotlb(archid);
+        break;
+    }
+    case IOMMU_INV_GRAN_ADDR:
+    {
+        hwaddr start = iotlb->iova + giommu->iommu_offset;
+        struct iommu_inv_addr_info *addr_info;
+        size_t size = iotlb->addr_mask + 1;
+        int archid = -1;
+
+        addr_info = &ustruct.info.granu.addr_info;
+        ustruct.info.granularity = IOMMU_INV_GRANU_ADDR;
+        if (iotlb->leaf) {
+            addr_info->flags |= IOMMU_INV_ADDR_FLAGS_LEAF;
+        }
+        if (iotlb->flags & IOMMU_INV_FLAGS_ARCHID) {
+            addr_info->flags |= IOMMU_INV_ADDR_FLAGS_ARCHID;
+            archid = iotlb->arch_id;
+        }
+        addr_info->archid = archid;
+        addr_info->addr = start;
+        addr_info->granule_size = size;
+        addr_info->nb_granules = 1;
+        trace_vfio_iommu_addr_inv_iotlb(archid, start, size,
+                                        1, iotlb->leaf);
+        break;
+    }
Should we pass a size to host kernel here, even if vSMMU doesn't support RIL or guest kernel doesn't use RIL? It will cause TLBI issue in this scenario: Guest kernel issues a TLBI cmd without "range" (tg = 0) to invalidate a 2M huge page. Then qemu passed the iova and size (4K) to host kernel.
Finally, host kernel issues a TLBI cmd with "range" (4K) which can not invalidate the TLB entry of 2M huge page. (pSMMU supports RIL) In that case the guest will loop over all 4K images belonging to the 2M huge page and invalidate each of them. This should turn into qemu notifications for each 4kB page, no? This is totally inefficient, hence The guest will not loop over all 4K images belonging to the 2M huge page. The iommu_iotlb_gather->pgsize will be 2M, if a page is 2M huge page. The gather->pgsize wil
Re: [RFC v9 15/29] vfio: Set up nested stage mappings
On 2021/4/13 20:57, Auger Eric wrote: Hi Kunkun, On 4/13/21 2:10 PM, Kunkun Jiang wrote: Hi Eric, On 2021/4/11 20:08, Eric Auger wrote: In nested mode, legacy vfio_iommu_map_notify cannot be used as there is no "caching" mode and we do not trap on map. On Intel, vfio_iommu_map_notify was used to DMA map the RAM through the host single stage. With nested mode, we need to setup the stage 2 and the stage 1 separately. This patch introduces a prereg_listener to setup the stage 2 mapping. The stage 1 mapping, owned by the guest, is passed to the host when the guest invalidates the stage 1 configuration, through a dedicated PCIPASIDOps callback. Guest IOTLB invalidations are cascaded downto the host through another IOMMU MR UNMAP notifier. Signed-off-by: Eric Auger --- v7 -> v8: - properly handle new IOMMUTLBEntry fields and especially propagate DOMAIN and PASID based invalidations v6 -> v7: - remove PASID based invalidation v5 -> v6: - add error_report_err() - remove the abort in case of nested stage case v4 -> v5: - use VFIO_IOMMU_SET_PASID_TABLE - use PCIPASIDOps for config notification v3 -> v4: - use iommu_inv_pasid_info for ASID invalidation v2 -> v3: - use VFIO_IOMMU_ATTACH_PASID_TABLE - new user API - handle leaf v1 -> v2: - adapt to uapi changes - pass the asid - pass IOMMU_NOTIFIER_S1_CFG when initializing the config notifier --- hw/vfio/common.c | 139 +-- hw/vfio/pci.c | 21 +++ hw/vfio/trace-events | 2 + 3 files changed, 157 insertions(+), 5 deletions(-) diff --git a/hw/vfio/common.c b/hw/vfio/common.c index 0cd7ef2139..e369d451e7 100644 --- a/hw/vfio/common.c +++ b/hw/vfio/common.c @@ -595,6 +595,73 @@ static bool vfio_get_xlat_addr(IOMMUTLBEntry *iotlb, void **vaddr, return true; } +/* Propagate a guest IOTLB invalidation to the host (nested mode) */ +static void vfio_iommu_unmap_notify(IOMMUNotifier *n, IOMMUTLBEntry *iotlb) +{ + VFIOGuestIOMMU *giommu = container_of(n, VFIOGuestIOMMU, n); + struct vfio_iommu_type1_cache_invalidate ustruct = {}; + 
VFIOContainer *container = giommu->container;
+    int ret;
+
+    assert(iotlb->perm == IOMMU_NONE);
+
+    ustruct.argsz = sizeof(ustruct);
+    ustruct.flags = 0;
+    ustruct.info.argsz = sizeof(struct iommu_cache_invalidate_info);
+    ustruct.info.version = IOMMU_CACHE_INVALIDATE_INFO_VERSION_1;
+    ustruct.info.cache = IOMMU_CACHE_INV_TYPE_IOTLB;
+
+    switch (iotlb->granularity) {
+    case IOMMU_INV_GRAN_DOMAIN:
+        ustruct.info.granularity = IOMMU_INV_GRANU_DOMAIN;
+        break;
+    case IOMMU_INV_GRAN_PASID:
+    {
+        struct iommu_inv_pasid_info *pasid_info;
+        int archid = -1;
+
+        pasid_info = &ustruct.info.granu.pasid_info;
+        ustruct.info.granularity = IOMMU_INV_GRANU_PASID;
+        if (iotlb->flags & IOMMU_INV_FLAGS_ARCHID) {
+            pasid_info->flags |= IOMMU_INV_ADDR_FLAGS_ARCHID;
+            archid = iotlb->arch_id;
+        }
+        pasid_info->archid = archid;
+        trace_vfio_iommu_asid_inv_iotlb(archid);
+        break;
+    }
+    case IOMMU_INV_GRAN_ADDR:
+    {
+        hwaddr start = iotlb->iova + giommu->iommu_offset;
+        struct iommu_inv_addr_info *addr_info;
+        size_t size = iotlb->addr_mask + 1;
+        int archid = -1;
+
+        addr_info = &ustruct.info.granu.addr_info;
+        ustruct.info.granularity = IOMMU_INV_GRANU_ADDR;
+        if (iotlb->leaf) {
+            addr_info->flags |= IOMMU_INV_ADDR_FLAGS_LEAF;
+        }
+        if (iotlb->flags & IOMMU_INV_FLAGS_ARCHID) {
+            addr_info->flags |= IOMMU_INV_ADDR_FLAGS_ARCHID;
+            archid = iotlb->arch_id;
+        }
+        addr_info->archid = archid;
+        addr_info->addr = start;
+        addr_info->granule_size = size;
+        addr_info->nb_granules = 1;
+        trace_vfio_iommu_addr_inv_iotlb(archid, start, size,
+                                        1, iotlb->leaf);
+        break;
+    }
Should we pass a size to host kernel here, even if vSMMU doesn't support RIL or guest kernel doesn't use RIL? It will cause TLBI issue in this scenario: Guest kernel issues a TLBI cmd without "range" (tg = 0) to invalidate a 2M huge page. Then qemu passed the iova and size (4K) to host kernel. Finally, host kernel issues a TLBI cmd with "range" (4K) which can not invalidate the TLB entry of 2M huge page.
(pSMMU supports RIL) In that case the guest will loop over all 4K images belonging to the 2M huge page and invalidate each of them. This should turn into qemu notifications for each 4kB page, no? This is totally inefficient, hence The guest will not loop over all 4K images belonging to the 2M huge page. The iommu_iotlb_gather->pgsize will be 2M, if a page is 2M huge page. The gather->pgsize will be passed to __arm_smmu_tlb_inv_range as "granule": iommu_iotlb_gather_add_
Re: [RFC v9 15/29] vfio: Set up nested stage mappings
Hi Eric, On 2021/4/11 20:08, Eric Auger wrote: In nested mode, legacy vfio_iommu_map_notify cannot be used as there is no "caching" mode and we do not trap on map. On Intel, vfio_iommu_map_notify was used to DMA map the RAM through the host single stage. With nested mode, we need to setup the stage 2 and the stage 1 separately. This patch introduces a prereg_listener to setup the stage 2 mapping. The stage 1 mapping, owned by the guest, is passed to the host when the guest invalidates the stage 1 configuration, through a dedicated PCIPASIDOps callback. Guest IOTLB invalidations are cascaded downto the host through another IOMMU MR UNMAP notifier. Signed-off-by: Eric Auger --- v7 -> v8: - properly handle new IOMMUTLBEntry fields and especially propagate DOMAIN and PASID based invalidations v6 -> v7: - remove PASID based invalidation v5 -> v6: - add error_report_err() - remove the abort in case of nested stage case v4 -> v5: - use VFIO_IOMMU_SET_PASID_TABLE - use PCIPASIDOps for config notification v3 -> v4: - use iommu_inv_pasid_info for ASID invalidation v2 -> v3: - use VFIO_IOMMU_ATTACH_PASID_TABLE - new user API - handle leaf v1 -> v2: - adapt to uapi changes - pass the asid - pass IOMMU_NOTIFIER_S1_CFG when initializing the config notifier --- hw/vfio/common.c | 139 +-- hw/vfio/pci.c| 21 +++ hw/vfio/trace-events | 2 + 3 files changed, 157 insertions(+), 5 deletions(-) diff --git a/hw/vfio/common.c b/hw/vfio/common.c index 0cd7ef2139..e369d451e7 100644 --- a/hw/vfio/common.c +++ b/hw/vfio/common.c @@ -595,6 +595,73 @@ static bool vfio_get_xlat_addr(IOMMUTLBEntry *iotlb, void **vaddr, return true; } +/* Propagate a guest IOTLB invalidation to the host (nested mode) */ +static void vfio_iommu_unmap_notify(IOMMUNotifier *n, IOMMUTLBEntry *iotlb) +{ +VFIOGuestIOMMU *giommu = container_of(n, VFIOGuestIOMMU, n); +struct vfio_iommu_type1_cache_invalidate ustruct = {}; +VFIOContainer *container = giommu->container; +int ret; + +assert(iotlb->perm == IOMMU_NONE); + 
+    ustruct.argsz = sizeof(ustruct);
+    ustruct.flags = 0;
+    ustruct.info.argsz = sizeof(struct iommu_cache_invalidate_info);
+    ustruct.info.version = IOMMU_CACHE_INVALIDATE_INFO_VERSION_1;
+    ustruct.info.cache = IOMMU_CACHE_INV_TYPE_IOTLB;
+
+    switch (iotlb->granularity) {
+    case IOMMU_INV_GRAN_DOMAIN:
+        ustruct.info.granularity = IOMMU_INV_GRANU_DOMAIN;
+        break;
+    case IOMMU_INV_GRAN_PASID:
+    {
+        struct iommu_inv_pasid_info *pasid_info;
+        int archid = -1;
+
+        pasid_info = &ustruct.info.granu.pasid_info;
+        ustruct.info.granularity = IOMMU_INV_GRANU_PASID;
+        if (iotlb->flags & IOMMU_INV_FLAGS_ARCHID) {
+            pasid_info->flags |= IOMMU_INV_ADDR_FLAGS_ARCHID;
+            archid = iotlb->arch_id;
+        }
+        pasid_info->archid = archid;
+        trace_vfio_iommu_asid_inv_iotlb(archid);
+        break;
+    }
+    case IOMMU_INV_GRAN_ADDR:
+    {
+        hwaddr start = iotlb->iova + giommu->iommu_offset;
+        struct iommu_inv_addr_info *addr_info;
+        size_t size = iotlb->addr_mask + 1;
+        int archid = -1;
+
+        addr_info = &ustruct.info.granu.addr_info;
+        ustruct.info.granularity = IOMMU_INV_GRANU_ADDR;
+        if (iotlb->leaf) {
+            addr_info->flags |= IOMMU_INV_ADDR_FLAGS_LEAF;
+        }
+        if (iotlb->flags & IOMMU_INV_FLAGS_ARCHID) {
+            addr_info->flags |= IOMMU_INV_ADDR_FLAGS_ARCHID;
+            archid = iotlb->arch_id;
+        }
+        addr_info->archid = archid;
+        addr_info->addr = start;
+        addr_info->granule_size = size;
+        addr_info->nb_granules = 1;
+        trace_vfio_iommu_addr_inv_iotlb(archid, start, size,
+                                        1, iotlb->leaf);
+        break;
+    }
Should we pass a size to host kernel here, even if vSMMU doesn't support RIL or guest kernel doesn't use RIL? It will cause TLBI issue in this scenario: Guest kernel issues a TLBI cmd without "range" (tg = 0) to invalidate a 2M huge page. Then qemu passed the iova and size (4K) to host kernel. Finally, host kernel issues a TLBI cmd with "range" (4K) which can not invalidate the TLB entry of 2M huge page.
(pSMMU supports RIL)

Thanks, Kunkun Jiang

+    }
+
+    ret = ioctl(container->fd, VFIO_IOMMU_CACHE_INVALIDATE, &ustruct);
+    if (ret) {
+        error_report("%p: failed to invalidate CACHE (%d)", container, ret);
+    }
+}
+
 static void vfio_iommu_map_notify(IOMMUNotifier *n, IOMMUTLBEntry *iotlb)
 {
     VFIOGuestIOMMU *giommu = container_of(n, VFIOGuestIOMMU, n);
@@ -776,6 +843,35 @@ static void vfio_dma_unmap_ram_section(VFIOContainer *container,
     }
 }

+static void vfio_prereg_listener_region_add(MemoryListener *listener,
+    MemoryRe
Re: [RFC PATCH 3/3] hw/arm/smmuv3: Post-load stage 1 configurations to the host
Hi Eric, On 2021/4/12 16:34, Auger Eric wrote: Hi Kunkun, On 2/19/21 10:42 AM, Kunkun Jiang wrote: In nested mode, we call the set_pasid_table() callback on each STE update to pass the guest stage 1 configuration to the host and apply it at physical level. In the case of live migration, we need to manual call the s/manual/manually You are right. set_pasid_table() to load the guest stage 1 configurations to the host. If this operation is fail, the migration is fail. s/If this operation is fail, the migration is fail/If this operation fails, the migration fails. Thanks for your careful review. Signed-off-by: Kunkun Jiang --- hw/arm/smmuv3.c | 60 + hw/arm/trace-events | 1 + 2 files changed, 61 insertions(+) diff --git a/hw/arm/smmuv3.c b/hw/arm/smmuv3.c index 6c6ed84e78..94ca15375c 100644 --- a/hw/arm/smmuv3.c +++ b/hw/arm/smmuv3.c @@ -1473,6 +1473,65 @@ static void smmu_realize(DeviceState *d, Error **errp) smmu_init_irq(s, dev); } +static int smmuv3_manual_set_pci_device_pasid_table(SMMUDevice *sdev) Can't you retrieve the associated sid and then call smmuv3_notify_config_change() Agreed. It can reuse the code of smmuv3_notify_config_change(). I will modify it in the next version. But I need a return value to judge whether it is successful, which may need to modify the type of smmuv3_notify_config_change(). 
Thanks, Kunkun Jiang

+{
+#ifdef __linux__
+    IOMMUMemoryRegion *mr = &(sdev->iommu);
+    int sid = smmu_get_sid(sdev);
+    SMMUEventInfo event = {.type = SMMU_EVT_NONE, .sid = sid,
+                           .inval_ste_allowed = true};
+    IOMMUConfig iommu_config = {};
+    SMMUTransCfg *cfg;
+    int ret = -1;
+
+    cfg = smmuv3_get_config(sdev, &event);
+    if (!cfg) {
+        return ret;
+    }
+
+    iommu_config.pasid_cfg.argsz = sizeof(struct iommu_pasid_table_config);
+    iommu_config.pasid_cfg.version = PASID_TABLE_CFG_VERSION_1;
+    iommu_config.pasid_cfg.format = IOMMU_PASID_FORMAT_SMMUV3;
+    iommu_config.pasid_cfg.base_ptr = cfg->s1ctxptr;
+    iommu_config.pasid_cfg.pasid_bits = 0;
+    iommu_config.pasid_cfg.vendor_data.smmuv3.version = PASID_TABLE_SMMUV3_CFG_VERSION_1;
+
+    if (cfg->disabled || cfg->bypassed) {
+        iommu_config.pasid_cfg.config = IOMMU_PASID_CONFIG_BYPASS;
+    } else if (cfg->aborted) {
+        iommu_config.pasid_cfg.config = IOMMU_PASID_CONFIG_ABORT;
+    } else {
+        iommu_config.pasid_cfg.config = IOMMU_PASID_CONFIG_TRANSLATE;
+    }
+
+    ret = pci_device_set_pasid_table(sdev->bus, sdev->devfn, &iommu_config);
+    if (ret) {
+        error_report("Failed to pass PASID table to host for iommu mr %s (%m)",
+                     mr->parent_obj.name);
+    }
+
+    return ret;
+#endif
+}
+
+static int smmuv3_post_load(void *opaque, int version_id)
+{
+    SMMUv3State *s3 = opaque;
+    SMMUState *s = &(s3->smmu_state);
+    SMMUDevice *sdev;
+    int ret = 0;
+
+    QLIST_FOREACH(sdev, &s->devices_with_notifiers, next) {
+        trace_smmuv3_post_load_sdev(sdev->devfn, sdev->iommu.parent_obj.name);
+        ret = smmuv3_manual_set_pci_device_pasid_table(sdev);
+        if (ret) {
+            break;
+        }
+    }
+
+    return ret;
+}
+
 static const VMStateDescription vmstate_smmuv3_queue = {
     .name = "smmuv3_queue",
     .version_id = 1,
@@ -1491,6 +1550,7 @@ static const VMStateDescription vmstate_smmuv3 = {
     .version_id = 1,
     .minimum_version_id = 1,
     .priority = MIG_PRI_IOMMU,
+    .post_load = smmuv3_post_load,
     .fields = (VMStateField[]) {
         VMSTATE_UINT32(features, SMMUv3State),
         VMSTATE_UINT8(sid_size, SMMUv3State),
diff --git a/hw/arm/trace-events
b/hw/arm/trace-events index 35e562ab74..caa864dd72 100644 --- a/hw/arm/trace-events +++ b/hw/arm/trace-events @@ -53,4 +53,5 @@ smmuv3_notify_flag_add(const char *iommu) "ADD SMMUNotifier node for iommu mr=%s smmuv3_notify_flag_del(const char *iommu) "DEL SMMUNotifier node for iommu mr=%s" smmuv3_inv_notifiers_iova(const char *name, uint16_t asid, uint64_t iova, uint8_t tg, uint64_t num_pages) "iommu mr=%s asid=%d iova=0x%"PRIx64" tg=%d num_pages=0x%"PRIx64 smmuv3_notify_config_change(const char *name, uint8_t config, uint64_t s1ctxptr) "iommu mr=%s config=%d s1ctxptr=0x%"PRIx64 +smmuv3_post_load_sdev(int devfn, const char *name) "sdev devfn=%d iommu mr=%s" Thanks Eric .
Re: [RFC PATCH 2/3] vfio: Add vfio_prereg_listener_log_sync in nested stage
Hi Eric, On 2021/4/8 21:56, Auger Eric wrote: Hi Kunkun, On 2/19/21 10:42 AM, Kunkun Jiang wrote: On Intel, the DMA mapped through the host single stage. Instead we set up the stage 2 and stage 1 separately in nested mode as there is no "Caching Mode". You need to rewrite the above sentences, Missing ARM and also the 1st sentences misses a verb. Thanks for your review! I will fix it in the next version. Legacy vfio_listener_log_sync cannot be used in nested stage as we don't need to pay close attention to stage 1 mapping. This patch adds vfio_prereg_listener_log_sync to mark dirty pages in nested mode. Signed-off-by: Kunkun Jiang --- hw/vfio/common.c | 25 + 1 file changed, 25 insertions(+) diff --git a/hw/vfio/common.c b/hw/vfio/common.c index 7c50905856..af333e0dee 100644 --- a/hw/vfio/common.c +++ b/hw/vfio/common.c @@ -1216,6 +1216,22 @@ static int vfio_dma_sync_ram_section_dirty_bitmap(VFIOContainer *container, int128_get64(section->size), ram_addr); } +static void vfio_prereg_listener_log_sync(MemoryListener *listener, + MemoryRegionSection *section) +{ +VFIOContainer *container = +container_of(listener, VFIOContainer, prereg_listener); + +if (!memory_region_is_ram(section->mr) || +!container->dirty_pages_supported) { +return; +} + +if (vfio_devices_all_saving(container)) { I fail to see where is this defined? Keqian modified vfio_devices_all_saving to vfio_devices_all_dirty_tracking in 758b96b61d5. When I posted this series of patches, it was vfio_devices_all_saving. In v2[1], I have updated it based on the latest qemu. [1] https://lore.kernel.org/qemu-devel/20210331101259.2153-3-jiangkun...@huawei.com/ Thanks, Kunkun Jiang +vfio_dma_sync_ram_section_dirty_bitmap(container, section); +} +} + typedef struct { IOMMUNotifier n; VFIOGuestIOMMU *giommu; @@ -1260,6 +1276,14 @@ static int vfio_sync_dirty_bitmap(VFIOContainer *container, if (memory_region_is_iommu(section->mr)) { VFIOGuestIOMMU *giommu; +/* + * In nested mode, stage 2 and stage 1 are set up separately. 
We + * only need to focus on stage 2 mapping when marking dirty pages. + */ +if (container->iommu_type == VFIO_TYPE1_NESTING_IOMMU) { +return 0; +} + QLIST_FOREACH(giommu, &container->giommu_list, giommu_next) { if (MEMORY_REGION(giommu->iommu) == section->mr && giommu->n.start == section->offset_within_region) { @@ -1312,6 +1336,7 @@ static const MemoryListener vfio_memory_listener = { static MemoryListener vfio_memory_prereg_listener = { .region_add = vfio_prereg_listener_region_add, .region_del = vfio_prereg_listener_region_del, +.log_sync = vfio_prereg_listener_log_sync, }; static void vfio_listener_release(VFIOContainer *container) Thanks Eric .
Re: [RFC PATCH 0/3] Add migration support for VFIO PCI devices in SMMUv3 nested stage mode
Hi Eric, On 2021/4/12 16:40, Auger Eric wrote: Hi Kunkun, On 2/19/21 10:42 AM, Kunkun Jiang wrote: Hi all, Since the SMMUv3's nested translation stages[1] has been introduced by Eric, we need to pay attention to the migration of VFIO PCI devices in SMMUv3 nested stage mode. At present, it is not yet supported in QEMU. There are two problems in the existing framework. First, the current way to get dirty pages is not applicable to nested stage mode. Because of the "Caching Mode", VTD can map the RAM through the host single stage (giova->hpa). "vfio_listener_log_sync" gets dirty pages by transferring "giova" to the kernel for the RAM block section of mapped MMIO region. In nested stage mode, we setup the stage 2 (gpa->hpa) and the stage 1(giova->gpa) separately. So it is inapplicable to get dirty pages by the current way in nested stage mode. Second, it also need to pass stage 1 configurations to the destination host after the migration. In Eric's patch, it passes the stage 1 configuration to the host on each STE update for the devices set the PASID PciOps. The configuration will be applied at physical level. But the data of physical level will not be sent to the destination host. So we have to pass stage 1 configurations to the destination host after the migration. This Patch set includes patches as below: Patch 1-2: - Refactor the vfio_listener_log_sync and added a new function to get dirty pages in nested stage mode. Patch 3: - Added the post_load function to vmstate_smmuv3 for passing stage 1 configuration to the destination host after the migration. @Eric, Could you please add this Patch set to your future version of "vSMMUv3/pSMMUv3 2 stage VFIO integration", if you think this Patch set makes sense? :) First of all, thank you for working on this. As you may have noticed I sent a new RFC version yesterday (without including this). When time allows, you may have a look at the comments I posted on your series. 
I don't think I can test it at the moment so I may prefer to keep it separate. Also be aware that the QEMU integration of nested has not received many comments yet and is likely to evolve. The priority is to get some R-b's on the kernel pieces, especially the SMMU part. Until this dependency is resolved, things can't move forward, I am afraid. Thanks Eric Yes, I saw your latest version and comments. Thanks again for your review. I will try my best to test your patches and come up with some useful ideas. In the future, this series will be updated along with your nested series. If conditions permit, I hope you can test it and give me some comments. Best regards, Kunkun Jiang Best Regards Kunkun Jiang [1] [RFC,v7,00/26] vSMMUv3/pSMMUv3 2 stage VFIO integration http://patchwork.ozlabs.org/project/qemu-devel/cover/20201116181349.11908-1-eric.au...@redhat.com/ Kunkun Jiang (3): vfio: Introduce helpers to mark dirty pages of a RAM section vfio: Add vfio_prereg_listener_log_sync in nested stage hw/arm/smmuv3: Post-load stage 1 configurations to the host hw/arm/smmuv3.c | 60 + hw/arm/trace-events | 1 + hw/vfio/common.c| 47 +-- 3 files changed, 100 insertions(+), 8 deletions(-) .
Re: [RFC PATCH 1/3] vfio: Introduce helpers to mark dirty pages of a RAM section
Hi Eric, On 2021/4/8 21:46, Auger Eric wrote: Hi Kunkun, On 2/19/21 10:42 AM, Kunkun Jiang wrote: Extract part of the code from vfio_sync_dirty_bitmap to form a new helper, which allows to mark dirty pages of a RAM section. This helper will be called for nested stage. Signed-off-by: Kunkun Jiang --- hw/vfio/common.c | 22 ++ 1 file changed, 14 insertions(+), 8 deletions(-) diff --git a/hw/vfio/common.c b/hw/vfio/common.c index 9225f10722..7c50905856 100644 --- a/hw/vfio/common.c +++ b/hw/vfio/common.c @@ -1203,6 +1203,19 @@ err_out: return ret; } +static int vfio_dma_sync_ram_section_dirty_bitmap(VFIOContainer *container, + MemoryRegionSection *section) +{ +ram_addr_t ram_addr; + +ram_addr = memory_region_get_ram_addr(section->mr) + + section->offset_within_region; + +return vfio_get_dirty_bitmap(container, + TARGET_PAGE_ALIGN(section->offset_within_address_space), + int128_get64(section->size), ram_addr); +} + typedef struct { IOMMUNotifier n; VFIOGuestIOMMU *giommu; @@ -1244,8 +1257,6 @@ static void vfio_iommu_map_dirty_notify(IOMMUNotifier *n, IOMMUTLBEntry *iotlb) static int vfio_sync_dirty_bitmap(VFIOContainer *container, MemoryRegionSection *section) { -ram_addr_t ram_addr; - if (memory_region_is_iommu(section->mr)) { VFIOGuestIOMMU *giommu; @@ -1274,12 +1285,7 @@ static int vfio_sync_dirty_bitmap(VFIOContainer *container, return 0; } -ram_addr = memory_region_get_ram_addr(section->mr) + - section->offset_within_region; - -return vfio_get_dirty_bitmap(container, - TARGET_PAGE_ALIGN(section->offset_within_address_space), this is now REAL_HOST_PAGE_ALIGN Thanks Eric Sorry for late replay. Yes, it is REAL_HOST_PAGE_ALIGN now which modified by a patch I sent earlier. 
I posted v2 a few days ago and I have modified TARGET_PAGE_ALIGN to REAL_HOST_PAGE_ALIGN.[1] [1] https://lore.kernel.org/qemu-devel/20210331101259.2153-2-jiangkun...@huawei.com/ Thanks, Kunkun Jiang - int128_get64(section->size), ram_addr); +return vfio_dma_sync_ram_section_dirty_bitmap(container, section); } static void vfio_listerner_log_sync(MemoryListener *listener, .
Re: A question about the translation granule size supported by the vSMMU
Hi Eric, On 2021/4/8 15:27, Auger Eric wrote: Hi Kunkun, On 4/7/21 11:26 AM, Kunkun Jiang wrote: Hi Eric, On 2021/4/7 3:50, Auger Eric wrote: Hi Kunkun, On 3/27/21 3:24 AM, Kunkun Jiang wrote: Hi all, Recently, I did some tests on SMMU nested mode. Here is a question about the translation granule size supported by vSMMU. There is such a code in SMMUv3_init_regs(): /* 4K and 64K granule support */ s->idr[5] = FIELD_DP32(s->idr[5], IDR5, GRAN4K, 1); s->idr[5] = FIELD_DP32(s->idr[5], IDR5, GRAN64K, 1); s->idr[5] = FIELD_DP32(s->idr[5], IDR5, OAS, SMMU_IDR5_OAS); /* 44 bits */ Why is the 16K granule not supported? I modified the code to support it and did not encounter any problems in the test. Although 4K and 64K minimal granules are "strongly recommended", I think vSMMU should still support 16K. Are there other reasons why 16K is not supported here? no there aren't any. The main reasons were 16KB support is optional and supporting it increases the test matrix. Also it seems quite a few machines I have access to do support 16KB granule. On the others I get "EFI stub: ERROR: This 16 KB granular kernel is not supported by your CPU". Nevertheless I am not opposed to support it as it seems to work without trouble. Just need to have an extra look at implied validity checks but there shouldn't be much. Thanks Eric Yes, you are right. In my opinion, it is necessary to check whether pSMMU supports 16K to avoid the situation I mentioned below. In SMMU nested mode, if vSMMU supports 16K and set pasid table to pSMMU, it may get errors when pSMMU does translation table walk if pSMMU doesn't support 16K (not tested). Do you think we need to add an interface to get some pSMMU info?> Maybe my consideration was superfluous. No it is not. At qemu level we have memory_region_iommu_set_page_size_mask() that is called from the VFIO device. It allows to pass such info to the IOMMU device (qemu b91774984249). iommu_set_page_size_mask() cb needs to be implemented at SMMU QEMU device level. 
Also [PATCH 0/2] Domain nesting info for arm-smmu may allow to return other constraints from the pSMMU. Thanks Eric Ok, it makes sense to me. I am glad to test them if these patches are ready. Thanks, Kunkun Jiang Thanks, Kunkun Jiang When in SMMU nested mode, it may get errors if pSMMU doesn't support 16K but vSMMU supports 16K. But we can get some settings of pSMMU to avoid this situation. I found some discussions between Eric and Linu about this [1], but this idea does not seem to be implemented. [1] https://lists.gnu.org/archive/html/qemu-arm/2017-09/msg00149.html Best regards, Kunkun Jiang . .
Re: [PATCH v4 0/2] Some modification about ram_save_host_page()
Hi Dave, On 2021/4/7 1:14, Dr. David Alan Gilbert wrote: * Kunkun Jiang (jiangkun...@huawei.com) wrote: Kindly ping, Hi David Alan Gilbert, Will this series be picked up soon, or is there any other work for me to do? You don't need to do anything, but it did miss the cutoff for soft freeze; since it's an optimisation not a fix; it's still on my list so it'll go in just as soon as 6.1 opens up. Dave Okay, I see. Thank you. Best Regards, Kunkun Jiang Best Regards, Kunkun Jiang On 2021/3/16 20:57, Kunkun Jiang wrote: Hi all, This series include patches as below: Patch 1: - reduce unnecessary rate limiting in ram_save_host_page() Patch 2: - optimized ram_save_host_page() by using migration_bitmap_find_dirty() to find dirty pages History: v3 -> v4: - Remove the modification to ram_save_host_page() comment [Peter Xu] - Remove the renaming of tmppages v2 -> v3: - Reduce unnecessary rate limiting if nothing is sent in the current iteration [David Edmondson] - Invert the body of the loop in ram_save_host_page() [David Edmondson] v1 -> v2: - Modify ram_save_host_page() comment [David Edmondson] - Remove 'goto' [David Edmondson] Kunkun Jiang (2): migration/ram: Reduce unnecessary rate limiting migration/ram: Optimize ram_save_host_page() migration/ram.c | 34 +++--- 1 file changed, 19 insertions(+), 15 deletions(-)
Re: A question about the translation granule size supported by the vSMMU
Hi Eric, On 2021/4/7 3:50, Auger Eric wrote: Hi Kunkun, On 3/27/21 3:24 AM, Kunkun Jiang wrote: Hi all, Recently, I did some tests on SMMU nested mode. Here is a question about the translation granule size supported by vSMMU. There is such a code in SMMUv3_init_regs(): /* 4K and 64K granule support */ s->idr[5] = FIELD_DP32(s->idr[5], IDR5, GRAN4K, 1); s->idr[5] = FIELD_DP32(s->idr[5], IDR5, GRAN64K, 1); s->idr[5] = FIELD_DP32(s->idr[5], IDR5, OAS, SMMU_IDR5_OAS); /* 44 bits */ Why is the 16K granule not supported? I modified the code to support it and did not encounter any problems in the test. Although 4K and 64K minimal granules are "strongly recommended", I think vSMMU should still support 16K. Are there other reasons why 16K is not supported here? no there aren't any. The main reasons were 16KB support is optional and supporting it increases the test matrix. Also it seems quite a few machines I have access to do support 16KB granule. On the others I get "EFI stub: ERROR: This 16 KB granular kernel is not supported by your CPU". Nevertheless I am not opposed to support it as it seems to work without trouble. Just need to have an extra look at implied validity checks but there shouldn't be much. Thanks Eric Yes, you are right. In my opinion, it is necessary to check whether pSMMU supports 16K to avoid the situation I mentioned below. In SMMU nested mode, if vSMMU supports 16K and set pasid table to pSMMU, it may get errors when pSMMU does translation table walk if pSMMU doesn't support 16K (not tested). Do you think we need to add an interface to get some pSMMU info? Maybe my consideration was superfluous. Thanks, Kunkun Jiang When in SMMU nested mode, it may get errors if pSMMU doesn't support 16K but vSMMU supports 16K. But we can get some settings of pSMMU to avoid this situation. I found some discussions between Eric and Linu about this [1], but this idea does not seem to be implemented. 
[1] https://lists.gnu.org/archive/html/qemu-arm/2017-09/msg00149.html Best regards, Kunkun Jiang .
Re: [PATCH v4 0/2] Some modification about ram_save_host_page()
Kindly ping, Hi David Alan Gilbert, Will this series be picked up soon, or is there any other work for me to do? Best Regards, Kunkun Jiang On 2021/3/16 20:57, Kunkun Jiang wrote: Hi all, This series include patches as below: Patch 1: - reduce unnecessary rate limiting in ram_save_host_page() Patch 2: - optimized ram_save_host_page() by using migration_bitmap_find_dirty() to find dirty pages History: v3 -> v4: - Remove the modification to ram_save_host_page() comment [Peter Xu] - Remove the renaming of tmppages v2 -> v3: - Reduce unnecessary rate limiting if nothing is sent in the current iteration [David Edmondson] - Invert the body of the loop in ram_save_host_page() [David Edmondson] v1 -> v2: - Modify ram_save_host_page() comment [David Edmondson] - Remove 'goto' [David Edmondson] Kunkun Jiang (2): migration/ram: Reduce unnecessary rate limiting migration/ram: Optimize ram_save_host_page() migration/ram.c | 34 +++--- 1 file changed, 19 insertions(+), 15 deletions(-)
[RFC PATCH v2 2/4] vfio: Add vfio_prereg_listener_log_sync in nested stage
On Intel, the DMA is mapped through the host single stage. On ARM, instead, we set up stage 2 and stage 1 separately in nested mode as there is no "Caching Mode". Legacy vfio_listener_log_sync cannot be used in nested stage as we don't need to pay close attention to stage 1 mapping. This patch adds vfio_prereg_listener_log_sync to mark dirty pages in nested mode. Signed-off-by: Kunkun Jiang --- hw/vfio/common.c | 27 +++ 1 file changed, 27 insertions(+) diff --git a/hw/vfio/common.c b/hw/vfio/common.c index 3117979307..86722814d4 100644 --- a/hw/vfio/common.c +++ b/hw/vfio/common.c @@ -1284,6 +1284,22 @@ static int vfio_dma_sync_ram_section_dirty_bitmap(VFIOContainer *container, int128_get64(section->size), ram_addr); } +static void vfio_prereg_listener_log_sync(MemoryListener *listener, + MemoryRegionSection *section) +{ +VFIOContainer *container = +container_of(listener, VFIOContainer, prereg_listener); + +if (!memory_region_is_ram(section->mr) || +!container->dirty_pages_supported) { +return; +} + +if (vfio_devices_all_dirty_tracking(container)) { +vfio_dma_sync_ram_section_dirty_bitmap(container, section); +} +} + typedef struct { IOMMUNotifier n; VFIOGuestIOMMU *giommu; @@ -1328,6 +1344,16 @@ static int vfio_sync_dirty_bitmap(VFIOContainer *container, if (memory_region_is_iommu(section->mr)) { VFIOGuestIOMMU *giommu; +/* + * In nested mode, stage 2 (gpa->hpa) and stage 1 (giova->gpa) are + * set up separately. It is inappropriate to pass 'giova' to kernel + * to get dirty pages. We only need to focus on stage 2 mapping when + * marking dirty pages. 
+ */ +if (container->iommu_type == VFIO_TYPE1_NESTING_IOMMU) { +return 0; +} + QLIST_FOREACH(giommu, &container->giommu_list, giommu_next) { if (MEMORY_REGION(giommu->iommu) == section->mr && giommu->n.start == section->offset_within_region) { @@ -1382,6 +1408,7 @@ static const MemoryListener vfio_memory_listener = { static MemoryListener vfio_memory_prereg_listener = { .region_add = vfio_prereg_listener_region_add, .region_del = vfio_prereg_listener_region_del, +.log_sync = vfio_prereg_listener_log_sync, }; static void vfio_listener_release(VFIOContainer *container) -- 2.23.0
[RFC PATCH v2 3/4] vfio: Add vfio_prereg_listener_global_log_start/stop in nested stage
In nested mode, we set up the stage 2 and stage 1 separately. In my opinion, vfio_memory_prereg_listener is used for stage 2 and vfio_memory_listener is used for stage 1. So it feels weird to call the global_log_start/stop interface in vfio_memory_listener to switch dirty tracking, although this won't cause any errors. Adding a global_log_start/stop interface in vfio_memory_prereg_listener separates stage 2 from stage 1. Signed-off-by: Kunkun Jiang --- hw/vfio/common.c | 22 ++ 1 file changed, 22 insertions(+) diff --git a/hw/vfio/common.c b/hw/vfio/common.c index 86722814d4..efea252e46 100644 --- a/hw/vfio/common.c +++ b/hw/vfio/common.c @@ -1209,6 +1209,16 @@ static void vfio_listener_log_global_start(MemoryListener *listener) { VFIOContainer *container = container_of(listener, VFIOContainer, listener); +if (container->iommu_type != VFIO_TYPE1_NESTING_IOMMU) { +vfio_set_dirty_page_tracking(container, true); +} +} + +static void vfio_prereg_listener_log_global_start(MemoryListener *listener) +{ +VFIOContainer *container = +container_of(listener, VFIOContainer, prereg_listener); + vfio_set_dirty_page_tracking(container, true); } @@ -1216,6 +1226,16 @@ static void vfio_listener_log_global_stop(MemoryListener *listener) { VFIOContainer *container = container_of(listener, VFIOContainer, listener); +if (container->iommu_type != VFIO_TYPE1_NESTING_IOMMU) { +vfio_set_dirty_page_tracking(container, false); +} +} + +static void vfio_prereg_listener_log_global_stop(MemoryListener *listener) +{ +VFIOContainer *container = +container_of(listener, VFIOContainer, prereg_listener); + vfio_set_dirty_page_tracking(container, false); } @@ -1408,6 +1428,8 @@ static const MemoryListener vfio_memory_listener = { static MemoryListener vfio_memory_prereg_listener = { .region_add = vfio_prereg_listener_region_add, .region_del = vfio_prereg_listener_region_del, +.log_global_start = vfio_prereg_listener_log_global_start, +.log_global_stop = vfio_prereg_listener_log_global_stop, .log_sync = 
vfio_prereg_listener_log_sync, }; -- 2.23.0
[RFC PATCH v2 0/4] Add migration support for VFIO PCI devices in SMMUv3 nested stage mode
Hi all, Since the SMMUv3's nested translation stages[1] have been introduced by Eric, we need to pay attention to the migration of VFIO PCI devices in SMMUv3 nested stage mode. At present, it is not yet supported in QEMU. There are two problems in the existing framework. First, the current way to get dirty pages is not applicable to nested stage mode. Because of the "Caching Mode", VTD can map the RAM through the host single stage (giova->hpa). "vfio_listener_log_sync" gets dirty pages by transferring "giova" to the kernel for the RAM block section of mapped MMIO region. In nested stage mode, we set up stage 2 (gpa->hpa) and stage 1 (giova->gpa) separately. So the current way of getting dirty pages is inapplicable in nested stage mode. Second, it also needs to pass stage 1 configurations to the destination host after the migration. In Eric's patch, it passes the stage 1 configuration to the host on each STE update for the devices that set the PASID PciOps. The configuration will be applied at physical level. But the data of physical level will not be sent to the destination host. So we have to pass stage 1 configurations to the destination host after the migration. Best Regards, Kunkun Jiang [1] [RFC,v8,00/28] vSMMUv3/pSMMUv3 2 stage VFIO integration http://patchwork.ozlabs.org/project/qemu-devel/cover/20210225105233.650545-1-eric.au...@redhat.com/ This Patch set includes patches as below: Patch 1-2: - Refactor the vfio_listener_log_sync and added a new function to get dirty pages in nested stage mode. Patch 3: - Added global_log_start/stop interface in vfio_memory_prereg_listener Patch 4: - Added the post_load function to vmstate_smmuv3 for passing stage 1 configuration to the destination host after the migration. 
History: v1 -> v2: - Add global_log_start/stop interface in vfio_memory_prereg_listener - Add support for repass stage 1 configs with multiple CDs to the host Kunkun Jiang (4): vfio: Introduce helpers to mark dirty pages of a RAM section vfio: Add vfio_prereg_listener_log_sync in nested stage vfio: Add vfio_prereg_listener_global_log_start/stop in nested stage hw/arm/smmuv3: Post-load stage 1 configurations to the host hw/arm/smmuv3.c | 62 +++ hw/arm/trace-events | 1 + hw/vfio/common.c| 71 - 3 files changed, 126 insertions(+), 8 deletions(-) -- 2.23.0
[RFC PATCH v2 4/4] hw/arm/smmuv3: Post-load stage 1 configurations to the host
In nested mode, we call the set_pasid_table() callback on each STE update to pass the guest stage 1 configuration to the host and apply it at physical level. In the case of live migration, we need to manually call the set_pasid_table() to load the guest stage 1 configurations to the host. If this operation fails, the migration fails. Signed-off-by: Kunkun Jiang --- hw/arm/smmuv3.c | 62 + hw/arm/trace-events | 1 + 2 files changed, 63 insertions(+) diff --git a/hw/arm/smmuv3.c b/hw/arm/smmuv3.c index 55aa6ad874..4d28ca3777 100644 --- a/hw/arm/smmuv3.c +++ b/hw/arm/smmuv3.c @@ -1534,6 +1534,67 @@ static void smmu_realize(DeviceState *d, Error **errp) smmu_init_irq(s, dev); } +static int smmuv3_manual_set_pci_device_pasid_table(SMMUDevice *sdev) +{ +#ifdef __linux__ +IOMMUMemoryRegion *mr = &(sdev->iommu); +int sid = smmu_get_sid(sdev); +SMMUEventInfo event = {.type = SMMU_EVT_NONE, .sid = sid, + .inval_ste_allowed = true}; +IOMMUConfig iommu_config = {}; +SMMUTransCfg *cfg; +int ret = -1; + +cfg = smmuv3_get_config(sdev, &event); +if (!cfg) { +return ret; +} + +iommu_config.pasid_cfg.argsz = sizeof(struct iommu_pasid_table_config); +iommu_config.pasid_cfg.version = PASID_TABLE_CFG_VERSION_1; +iommu_config.pasid_cfg.format = IOMMU_PASID_FORMAT_SMMUV3; +iommu_config.pasid_cfg.base_ptr = cfg->s1ctxptr; +iommu_config.pasid_cfg.pasid_bits = cfg->s1cdmax; +iommu_config.pasid_cfg.vendor_data.smmuv3.version = PASID_TABLE_SMMUV3_CFG_VERSION_1; +iommu_config.pasid_cfg.vendor_data.smmuv3.s1fmt = cfg->s1fmt; +iommu_config.pasid_cfg.vendor_data.smmuv3.s1dss = cfg->s1dss; + +if (cfg->disabled || cfg->bypassed) { +iommu_config.pasid_cfg.config = IOMMU_PASID_CONFIG_BYPASS; +} else if (cfg->aborted) { +iommu_config.pasid_cfg.config = IOMMU_PASID_CONFIG_ABORT; +} else { +iommu_config.pasid_cfg.config = IOMMU_PASID_CONFIG_TRANSLATE; +} + +ret = pci_device_set_pasid_table(sdev->bus, sdev->devfn, &iommu_config); +if (ret) { +error_report("Failed to pass PASID table to host for iommu mr %s (%m)", + 
mr->parent_obj.name); +} + +return ret; +#endif +} + +static int smmuv3_post_load(void *opaque, int version_id) +{ +SMMUv3State *s3 = opaque; +SMMUState *s = &(s3->smmu_state); +SMMUDevice *sdev; +int ret = 0; + +QLIST_FOREACH(sdev, &s->devices_with_notifiers, next) { +trace_smmuv3_post_load_sdev(sdev->devfn, sdev->iommu.parent_obj.name); +ret = smmuv3_manual_set_pci_device_pasid_table(sdev); +if (ret) { +break; +} +} + +return ret; +} + static const VMStateDescription vmstate_smmuv3_queue = { .name = "smmuv3_queue", .version_id = 1, @@ -1552,6 +1613,7 @@ static const VMStateDescription vmstate_smmuv3 = { .version_id = 1, .minimum_version_id = 1, .priority = MIG_PRI_IOMMU, +.post_load = smmuv3_post_load, .fields = (VMStateField[]) { VMSTATE_UINT32(features, SMMUv3State), VMSTATE_UINT8(sid_size, SMMUv3State), diff --git a/hw/arm/trace-events b/hw/arm/trace-events index b0b0030d24..2f093286ec 100644 --- a/hw/arm/trace-events +++ b/hw/arm/trace-events @@ -54,4 +54,5 @@ smmuv3_notify_flag_add(const char *iommu) "ADD SMMUNotifier node for iommu mr=%s smmuv3_notify_flag_del(const char *iommu) "DEL SMMUNotifier node for iommu mr=%s" smmuv3_inv_notifiers_iova(const char *name, uint16_t asid, uint64_t iova, uint8_t tg, uint64_t num_pages) "iommu mr=%s asid=%d iova=0x%"PRIx64" tg=%d num_pages=0x%"PRIx64 smmuv3_notify_config_change(const char *name, uint8_t config, uint64_t s1ctxptr) "iommu mr=%s config=%d s1ctxptr=0x%"PRIx64 +smmuv3_post_load_sdev(int devfn, const char *name) "sdev devfn=%d iommu mr=%s" -- 2.23.0
[RFC PATCH v2 1/4] vfio: Introduce helpers to mark dirty pages of a RAM section
Extract part of the code from vfio_sync_dirty_bitmap to form a new helper, which allows to mark dirty pages of a RAM section. This helper will be called for nested stage. Signed-off-by: Kunkun Jiang --- hw/vfio/common.c | 22 ++ 1 file changed, 14 insertions(+), 8 deletions(-) diff --git a/hw/vfio/common.c b/hw/vfio/common.c index 7da1e95b43..3117979307 100644 --- a/hw/vfio/common.c +++ b/hw/vfio/common.c @@ -1271,6 +1271,19 @@ err_out: return ret; } +static int vfio_dma_sync_ram_section_dirty_bitmap(VFIOContainer *container, + MemoryRegionSection *section) +{ +ram_addr_t ram_addr; + +ram_addr = memory_region_get_ram_addr(section->mr) + + section->offset_within_region; + +return vfio_get_dirty_bitmap(container, +REAL_HOST_PAGE_ALIGN(section->offset_within_address_space), +int128_get64(section->size), ram_addr); +} + typedef struct { IOMMUNotifier n; VFIOGuestIOMMU *giommu; @@ -1312,8 +1325,6 @@ static void vfio_iommu_map_dirty_notify(IOMMUNotifier *n, IOMMUTLBEntry *iotlb) static int vfio_sync_dirty_bitmap(VFIOContainer *container, MemoryRegionSection *section) { -ram_addr_t ram_addr; - if (memory_region_is_iommu(section->mr)) { VFIOGuestIOMMU *giommu; @@ -1342,12 +1353,7 @@ static int vfio_sync_dirty_bitmap(VFIOContainer *container, return 0; } -ram_addr = memory_region_get_ram_addr(section->mr) + - section->offset_within_region; - -return vfio_get_dirty_bitmap(container, - REAL_HOST_PAGE_ALIGN(section->offset_within_address_space), - int128_get64(section->size), ram_addr); +return vfio_dma_sync_ram_section_dirty_bitmap(container, section); } static void vfio_listener_log_sync(MemoryListener *listener, -- 2.23.0
[PATCH] hw/arm/smmuv3: Support 16K translation granule
The driver can query some bits in SMMUv3 IDR5 to learn which translation granules are supported. Arm recommends that SMMUv3 implementations support at least 4K and 64K granules. But in the vSMMUv3, there seems to be no reason not to support the 16K translation granule. In addition, if 16K is not supported, vSVA will fail to be enabled in the future for 16K guest kernels. So it'd be better to support it. Signed-off-by: Kunkun Jiang --- hw/arm/smmuv3.c | 6 -- 1 file changed, 4 insertions(+), 2 deletions(-) diff --git a/hw/arm/smmuv3.c b/hw/arm/smmuv3.c index 3b87324ce2..0a483b0bab 100644 --- a/hw/arm/smmuv3.c +++ b/hw/arm/smmuv3.c @@ -259,8 +259,9 @@ static void smmuv3_init_regs(SMMUv3State *s) s->idr[3] = FIELD_DP32(s->idr[3], IDR3, RIL, 1); s->idr[3] = FIELD_DP32(s->idr[3], IDR3, HAD, 1); - /* 4K and 64K granule support */ +/* 4K, 16K and 64K granule support */ s->idr[5] = FIELD_DP32(s->idr[5], IDR5, GRAN4K, 1); +s->idr[5] = FIELD_DP32(s->idr[5], IDR5, GRAN16K, 1); s->idr[5] = FIELD_DP32(s->idr[5], IDR5, GRAN64K, 1); s->idr[5] = FIELD_DP32(s->idr[5], IDR5, OAS, SMMU_IDR5_OAS); /* 44 bits */ @@ -503,7 +504,8 @@ static int decode_cd(SMMUTransCfg *cfg, CD *cd, SMMUEventInfo *event) tg = CD_TG(cd, i); tt->granule_sz = tg2granule(tg, i); -if ((tt->granule_sz != 12 && tt->granule_sz != 16) || CD_ENDI(cd)) { +if ((tt->granule_sz != 12 && tt->granule_sz != 14 && + tt->granule_sz != 16) || CD_ENDI(cd)) { goto bad_cd; } -- 2.19.1
A question about the translation granule size supported by the vSMMU
Hi all, Recently, I did some tests on SMMU nested mode. Here is a question about the translation granule size supported by vSMMU. There is such a code in SMMUv3_init_regs(): /* 4K and 64K granule support */ s->idr[5] = FIELD_DP32(s->idr[5], IDR5, GRAN4K, 1); s->idr[5] = FIELD_DP32(s->idr[5], IDR5, GRAN64K, 1); s->idr[5] = FIELD_DP32(s->idr[5], IDR5, OAS, SMMU_IDR5_OAS); /* 44 bits */ Why is the 16K granule not supported? I modified the code to support it and did not encounter any problems in the test. Although 4K and 64K minimal granules are "strongly recommended", I think vSMMU should still support 16K. Are there other reasons why 16K is not supported here? When in SMMU nested mode, it may get errors if pSMMU doesn't support 16K but vSMMU supports 16K. But we can get some settings of pSMMU to avoid this situation. I found some discussions between Eric and Linu about this [1], but this idea does not seem to be implemented. [1] https://lists.gnu.org/archive/html/qemu-arm/2017-09/msg00149.html Best regards, Kunkun Jiang
Re: [RFC PATCH 0/3] vfio/migration: Support manual clear vfio dirty log
On 2021/3/18 20:36, Tian, Kevin wrote: From: Kunkun Jiang Sent: Thursday, March 18, 2021 8:29 PM Hi Kevin, On 2021/3/18 17:04, Tian, Kevin wrote: From: Kunkun Jiang Sent: Thursday, March 18, 2021 3:59 PM Hi Kevin, On 2021/3/18 14:28, Tian, Kevin wrote: From: Kunkun Jiang Sent: Wednesday, March 10, 2021 5:41 PM Hi all, In the past, we clear dirty log immediately after sync dirty log to userspace. This may cause redundant dirty handling if userspace handles dirty log iteratively: After vfio clears dirty log, new dirty log starts to generate. These new dirty log will be reported to userspace even if they are generated before userspace handles the same dirty page. Since a new dirty log tracking method for vfio based on iommu hwdbm[1] has been introduced in the kernel and added a new capability named VFIO_DIRTY_LOG_MANUAL_CLEAR, we can eliminate some redundant dirty handling by supporting it. Is there any performance data showing the benefit of this new method? Current dirty log tracking method for VFIO: [1] All pages marked dirty if not all iommu_groups have pinned_scope [2] pinned pages by various vendor drivers if all iommu_groups have pinned scope Both methods are coarse-grained and can not determine which pages are really dirty. Each round may mark the pages that are not really dirty as dirty and send them to the destination. ( It might be better if the range of the pinned_scope was smaller. ) This will result in a waste of resources. HWDBM is short for Hardware Dirty Bit Management. (e.g. smmuv3 HTTU, Hardware Translation Table Update) About SMMU HTTU: HTTU is a feature of ARM SMMUv3, it can update access flag or/and dirty state of the TTD (Translation Table Descriptor) by hardware. With HTTU, stage 1 TTD is classified into 3 types:

                      DBM bit   AP[2] (readonly bit)
   1. writable_clean     1            1
   2. writable_dirty     1            0
   3. readonly           0            1

If HTTU_HD (manage dirty state) is enabled, smmu can change TTD from writable_clean to writable_dirty. 
Then software can scan TTD to sync dirty state into dirty bitmap. With this feature, we can track the dirty log of DMA continuously and precisely. The capability of VFIO_DIRTY_LOG_MANUAL_CLEAR is similar to that on the KVM side. We add this new log_clear() interface only to split the old log_sync() into two separated procedures: - use log_sync() to collect the collection only, and, - use log_clear() to clear the dirty bitmap. If you're interested in this new method, you can take a look at our set of patches. [1] https://lore.kernel.org/linux-iommu/20210310090614.26668-1- zhukeqi...@huawei.com/ I know what you are doing. Intel is also working on VT-d dirty bit support based on above link. What I'm curious is the actual performance gain with this optimization. KVM doing that is one good reference, but IOMMU has different characteristics (e.g. longer invalidation latency) compared to CPU MMU. It's always good to understand what a so-called optimization can actually optimize in a context different from where it's originally proved. Thanks Kevin My understanding is that this is a new method, which is quite different from the previous two. So can you explain in more detail what performance data you want? Thanks, Kunkun Jiang When you have HTTU enabled, compare the migration efficiency with and without this manual clear interface. Thanks Kevin Hi Kevin, Sorry for late reply. I tested it on our FPGA in two scenarios. [1] perform limited times of DMA on a fixed ram area [2] perform infinite DMA on a fixed ram area In scenario [1], we can clearly see that lesser data is being transmitted with this manual clear interface. For example, a total of 10 DMA are performed, the amount of transferred data is the sum of 6 times. This depends on whether the device performs a DMA on the dirty ram area between log_sync() and log_clear(). In scenario [2], with or without this manual clear interface, it doesn't make a big difference. 
Every time we call log_sync(), the fixed ram area is dirty again, so the interface makes no difference there. In general scenarios, though, it can reduce the amount of transmitted data. In addition, regarding the difference between the MMU and the IOMMU (e.g. longer invalidation latency) that you mentioned last time, I think it has no effect on this manual clear interface. Currently we invalidate the IOMMU TLB with iommu_flush_iotlb_all(), which takes less time than multiple range-based invalidations. Thanks, Kunkun Jiang
Re: [RFC PATCH 0/3] vfio/migration: Support manual clear vfio dirty log
On 2021/3/18 20:36, Tian, Kevin wrote: From: Kunkun Jiang Sent: Thursday, March 18, 2021 8:29 PM Hi Kevin, On 2021/3/18 17:04, Tian, Kevin wrote: From: Kunkun Jiang Sent: Thursday, March 18, 2021 3:59 PM Hi Kevin, On 2021/3/18 14:28, Tian, Kevin wrote: From: Kunkun Jiang Sent: Wednesday, March 10, 2021 5:41 PM Hi all, In the past, we clear dirty log immediately after sync dirty log to userspace. This may cause redundant dirty handling if userspace handles dirty log iteratively: After vfio clears dirty log, new dirty log starts to generate. These new dirty log will be reported to userspace even if they are generated before userspace handles the same dirty page. Since a new dirty log tracking method for vfio based on iommu hwdbm[1] has been introduced in the kernel and added a new capability named VFIO_DIRTY_LOG_MANUAL_CLEAR, we can eliminate some redundant dirty handling by supporting it. Is there any performance data showing the benefit of this new method? Current dirty log tracking method for VFIO: [1] All pages marked dirty if not all iommu_groups have pinned_scope [2] pinned pages by various vendor drivers if all iommu_groups have pinned scope Both methods are coarse-grained and can not determine which pages are really dirty. Each round may mark the pages that are not really dirty as dirty and send them to the destination. ( It might be better if the range of the pinned_scope was smaller. ) This will result in a waste of resources. HWDBM is short for Hardware Dirty Bit Management. (e.g. smmuv3 HTTU, Hardware Translation Table Update) About SMMU HTTU: HTTU is a feature of ARM SMMUv3, it can update access flag or/and dirty state of the TTD (Translation Table Descriptor) by hardware. With HTTU, stage1 TTD is classified into 3 types: DBM bit AP[2](readonly bit) 1. writable_clean 1 1 2. writable_dirty 1 0 3. readonly 0 1 If HTTU_HD (manage dirty state) is enabled, smmu can change TTD from writable_clean to writable_dirty. 
Then software can scan TTD to sync dirty state into dirty bitmap. With this feature, we can track the dirty log of DMA continuously and precisely. The capability of VFIO_DIRTY_LOG_MANUAL_CLEAR is similar to that on the KVM side. We add this new log_clear() interface only to split the old log_sync() into two separated procedures: - use log_sync() to collect the collection only, and, - use log_clear() to clear the dirty bitmap. If you're interested in this new method, you can take a look at our set of patches. [1] https://lore.kernel.org/linux-iommu/20210310090614.26668-1- zhukeqi...@huawei.com/ I know what you are doing. Intel is also working on VT-d dirty bit support based on above link. What I'm curious is the actual performance gain with this optimization. KVM doing that is one good reference, but IOMMU has different characteristics (e.g. longer invalidation latency) compared to CPU MMU. It's always good to understand what a so-called optimization can actually optimize in a context different from where it's originally proved. Thanks Kevin My understanding is that this is a new method, which is quite different from the previous two. So can you explain in more detail what performance data you want? Thanks, Kunkun Jiang When you have HTTU enabled, compare the migration efficiency with and without this manual clear interface. Thanks Kevin Oh, I see. From a theoretical analysis, this manual clear interface will be effective. I will perform some tests on it and send out the results. Thanks, Kunkun Jiang
Re: [RFC PATCH 0/3] vfio/migration: Support manual clear vfio dirty log
Hi Kevin, On 2021/3/18 17:04, Tian, Kevin wrote: From: Kunkun Jiang Sent: Thursday, March 18, 2021 3:59 PM Hi Kevin, On 2021/3/18 14:28, Tian, Kevin wrote: From: Kunkun Jiang Sent: Wednesday, March 10, 2021 5:41 PM Hi all, In the past, we clear dirty log immediately after sync dirty log to userspace. This may cause redundant dirty handling if userspace handles dirty log iteratively: After vfio clears dirty log, new dirty log starts to generate. These new dirty log will be reported to userspace even if they are generated before userspace handles the same dirty page. Since a new dirty log tracking method for vfio based on iommu hwdbm[1] has been introduced in the kernel and added a new capability named VFIO_DIRTY_LOG_MANUAL_CLEAR, we can eliminate some redundant dirty handling by supporting it. Is there any performance data showing the benefit of this new method? Current dirty log tracking method for VFIO: [1] All pages marked dirty if not all iommu_groups have pinned_scope [2] pinned pages by various vendor drivers if all iommu_groups have pinned scope Both methods are coarse-grained and can not determine which pages are really dirty. Each round may mark the pages that are not really dirty as dirty and send them to the destination. ( It might be better if the range of the pinned_scope was smaller. ) This will result in a waste of resources. HWDBM is short for Hardware Dirty Bit Management. (e.g. smmuv3 HTTU, Hardware Translation Table Update) About SMMU HTTU: HTTU is a feature of ARM SMMUv3, it can update access flag or/and dirty state of the TTD (Translation Table Descriptor) by hardware. With HTTU, stage1 TTD is classified into 3 types: DBM bit AP[2](readonly bit) 1. writable_clean 1 1 2. writable_dirty 1 0 3. readonly 0 1 If HTTU_HD (manage dirty state) is enabled, smmu can change TTD from writable_clean to writable_dirty. Then software can scan TTD to sync dirty state into dirty bitmap. 
With this feature, we can track the dirty log of DMA continuously and precisely. The capability of VFIO_DIRTY_LOG_MANUAL_CLEAR is similar to that on the KVM side. We add this new log_clear() interface only to split the old log_sync() into two separated procedures: - use log_sync() to collect the collection only, and, - use log_clear() to clear the dirty bitmap. If you're interested in this new method, you can take a look at our set of patches. [1] https://lore.kernel.org/linux-iommu/20210310090614.26668-1- zhukeqi...@huawei.com/ I know what you are doing. Intel is also working on VT-d dirty bit support based on above link. What I'm curious is the actual performance gain with this optimization. KVM doing that is one good reference, but IOMMU has different characteristics (e.g. longer invalidation latency) compared to CPU MMU. It's always good to understand what a so-called optimization can actually optimize in a context different from where it's originally proved. Thanks Kevin My understanding is that this is a new method, which is quite different from the previous two. So can you explain in more detail what performance data you want? Thanks, Kunkun Jiang
Re: [RFC PATCH 0/3] vfio/migration: Support manual clear vfio dirty log
Hi Kevin, On 2021/3/18 14:28, Tian, Kevin wrote: From: Kunkun Jiang Sent: Wednesday, March 10, 2021 5:41 PM Hi all, In the past, we clear dirty log immediately after sync dirty log to userspace. This may cause redundant dirty handling if userspace handles dirty log iteratively: After vfio clears dirty log, new dirty log starts to generate. These new dirty log will be reported to userspace even if they are generated before userspace handles the same dirty page. Since a new dirty log tracking method for vfio based on iommu hwdbm[1] has been introduced in the kernel and added a new capability named VFIO_DIRTY_LOG_MANUAL_CLEAR, we can eliminate some redundant dirty handling by supporting it. Is there any performance data showing the benefit of this new method? Current dirty log tracking method for VFIO: [1] All pages marked dirty if not all iommu_groups have pinned_scope [2] pinned pages by various vendor drivers if all iommu_groups have pinned scope Both methods are coarse-grained and can not determine which pages are really dirty. Each round may mark the pages that are not really dirty as dirty and send them to the destination. ( It might be better if the range of the pinned_scope was smaller. ) This will result in a waste of resources. HWDBM is short for Hardware Dirty Bit Management. (e.g. smmuv3 HTTU, Hardware Translation Table Update) About SMMU HTTU: HTTU is a feature of ARM SMMUv3, it can update access flag or/and dirty state of the TTD (Translation Table Descriptor) by hardware. With HTTU, stage1 TTD is classified into 3 types: DBM bit AP[2](readonly bit) 1. writable_clean 1 1 2. writable_dirty 1 0 3. readonly 0 1 If HTTU_HD (manage dirty state) is enabled, smmu can change TTD from writable_clean to writable_dirty. Then software can scan TTD to sync dirty state into dirty bitmap. With this feature, we can track the dirty log of DMA continuously and precisely. The capability of VFIO_DIRTY_LOG_MANUAL_CLEAR is similar to that on the KVM side. 
We add this new log_clear() interface only to split the old log_sync() into two separated procedures: - use log_sync() to collect the collection only, and, - use log_clear() to clear the dirty bitmap. If you're interested in this new method, you can take a look at our set of patches. [1] https://lore.kernel.org/linux-iommu/20210310090614.26668-1-zhukeqi...@huawei.com/ Best regards, Kunkun Jiang This series include patches as below: Patch 1: - updated the linux-headers/linux/vfio.h from kernel side Patch 2: - introduced 'struct VFIODMARange' to describe a range of the given DMA mapping and with respect to a VFIO_IOMMU_MAP_DMA operation Patch 3: - implemented the operation to manual clear vfio dirty log, which can eliminate some redundant dirty handling Thanks, Kunkun Jiang [1] https://lore.kernel.org/linux-iommu/20210310090614.26668-1- zhukeqi...@huawei.com/T/#mb168c9738ecd3d8794e2da14f970545d5820f 863 Zenghui Yu (3): linux-headers: update against 5.12-rc2 and "vfio log clear" series vfio: Maintain DMA mapping range for the container vfio/migration: Support VFIO_IOMMU_DIRTY_PAGES_FLAG_CLEAR_BITMAP hw/vfio/common.c | 207 -- include/hw/vfio/vfio-common.h | 10 ++ linux-headers/linux/vfio.h| 55 - 3 files changed, 264 insertions(+), 8 deletions(-) -- 2.23.0 .
Re: [RFC PATCH 0/3] vfio/migration: Support manual clear vfio dirty log
kindly ping, Any comments and reviews are welcome. Thanks, Kunkun Jiang On 2021/3/10 17:41, Kunkun Jiang wrote: Hi all, In the past, we clear dirty log immediately after sync dirty log to userspace. This may cause redundant dirty handling if userspace handles dirty log iteratively: After vfio clears dirty log, new dirty log starts to generate. These new dirty log will be reported to userspace even if they are generated before userspace handles the same dirty page. Since a new dirty log tracking method for vfio based on iommu hwdbm[1] has been introduced in the kernel and added a new capability named VFIO_DIRTY_LOG_MANUAL_CLEAR, we can eliminate some redundant dirty handling by supporting it. This series include patches as below: Patch 1: - updated the linux-headers/linux/vfio.h from kernel side Patch 2: - introduced 'struct VFIODMARange' to describe a range of the given DMA mapping and with respect to a VFIO_IOMMU_MAP_DMA operation Patch 3: - implemented the operation to manual clear vfio dirty log, which can eliminate some redundant dirty handling Thanks, Kunkun Jiang [1] https://lore.kernel.org/linux-iommu/20210310090614.26668-1-zhukeqi...@huawei.com/T/#mb168c9738ecd3d8794e2da14f970545d5820f863 Zenghui Yu (3): linux-headers: update against 5.12-rc2 and "vfio log clear" series vfio: Maintain DMA mapping range for the container vfio/migration: Support VFIO_IOMMU_DIRTY_PAGES_FLAG_CLEAR_BITMAP hw/vfio/common.c | 207 -- include/hw/vfio/vfio-common.h | 10 ++ linux-headers/linux/vfio.h| 55 - 3 files changed, 264 insertions(+), 8 deletions(-)
Re: [PATCH v4 1/2] migration/ram: Reduce unnecessary rate limiting
Hi Peter, On 2021/3/17 5:39, Peter Xu wrote: On Tue, Mar 16, 2021 at 08:57:15PM +0800, Kunkun Jiang wrote: When the host page is a huge page and something is sent in the current iteration, migration_rate_limit() should be executed. If not, it can be omitted. Signed-off-by: Keqian Zhu Signed-off-by: Kunkun Jiang Reviewed-by: David Edmondson --- migration/ram.c | 9 +++-- 1 file changed, 7 insertions(+), 2 deletions(-) diff --git a/migration/ram.c b/migration/ram.c index 72143da0ac..3eb5b0d7a7 100644 --- a/migration/ram.c +++ b/migration/ram.c @@ -2015,8 +2015,13 @@ static int ram_save_host_page(RAMState *rs, PageSearchStatus *pss, pages += tmppages; pss->page++; -/* Allow rate limiting to happen in the middle of huge pages */ -migration_rate_limit(); +/* + * Allow rate limiting to happen in the middle of huge pages if + * something is sent in the current iteration. + */ +if (pagesize_bits > 1 && tmppages > 0) { +migration_rate_limit(); +} Sorry I'm still not a fan of this - I'd even prefer calling that once more just to make sure it won't be forgotten to be called.. Not to say it's merely a noop. I'll leave this to Dave.. Maybe I'm too harsh! :) You are very serious and meticulous. I like your character very much. This patch was previously reviewed by David, so I would like to know his opinion. @David Hi David, what is your opinion on this patch? Thanks, Kunkun Jiang
[PATCH v4 2/2] migration/ram: Optimize ram_save_host_page()
Starting from pss->page, ram_save_host_page() will check every page and send the dirty pages up to the end of the current host page or the boundary of used_length of the block. If the host page size is a huge page, the step "check" will take a lot of time. It will improve performance to use migration_bitmap_find_dirty(). Tested on Kunpeng 920; VM parameters: 1U 4G (page size 1G) The time of ram_save_host_page() in the last round of ram saving: before optimize: 9250us after optimize: 34us Signed-off-by: Keqian Zhu Signed-off-by: Kunkun Jiang --- migration/ram.c | 39 +++ 1 file changed, 19 insertions(+), 20 deletions(-) diff --git a/migration/ram.c b/migration/ram.c index 3eb5b0d7a7..1d9c24aa85 100644 --- a/migration/ram.c +++ b/migration/ram.c @@ -1993,6 +1993,8 @@ static int ram_save_host_page(RAMState *rs, PageSearchStatus *pss, int tmppages, pages = 0; size_t pagesize_bits = qemu_ram_pagesize(pss->block) >> TARGET_PAGE_BITS; +unsigned long hostpage_boundary = +QEMU_ALIGN_UP(pss->page + 1, pagesize_bits); unsigned long start_page = pss->page; int res; @@ -2003,30 +2005,27 @@ static int ram_save_host_page(RAMState *rs, PageSearchStatus *pss, do { /* Check the pages is dirty and if it is send it */ -if (!migration_bitmap_clear_dirty(rs, pss->block, pss->page)) { -pss->page++; -continue; -} - -tmppages = ram_save_target_page(rs, pss, last_stage); -if (tmppages < 0) { -return tmppages; -} +if (migration_bitmap_clear_dirty(rs, pss->block, pss->page)) { +tmppages = ram_save_target_page(rs, pss, last_stage); +if (tmppages < 0) { +return tmppages; +} -pages += tmppages; -pss->page++; -/* - * Allow rate limiting to happen in the middle of huge pages if - * something is sent in the current iteration. - */ -if (pagesize_bits > 1 && tmppages > 0) { -migration_rate_limit(); +pages += tmppages; +/* + * Allow rate limiting to happen in the middle of huge pages if + * something is sent in the current iteration. 
+ */ +if (pagesize_bits > 1 && tmppages > 0) { +migration_rate_limit(); +} } -} while ((pss->page & (pagesize_bits - 1)) && +pss->page = migration_bitmap_find_dirty(rs, pss->block, pss->page); +} while ((pss->page < hostpage_boundary) && offset_in_ramblock(pss->block, ((ram_addr_t)pss->page) << TARGET_PAGE_BITS)); -/* The offset we leave with is the last one we looked at */ -pss->page--; +/* The offset we leave with is the min boundary of host page and block */ +pss->page = MIN(pss->page, hostpage_boundary) - 1; res = ram_save_release_protection(rs, pss, start_page); return (res < 0 ? res : pages); -- 2.23.0
[PATCH v4 1/2] migration/ram: Reduce unnecessary rate limiting
When the host page is a huge page and something is sent in the current iteration, migration_rate_limit() should be executed. If not, it can be omitted. Signed-off-by: Keqian Zhu Signed-off-by: Kunkun Jiang Reviewed-by: David Edmondson --- migration/ram.c | 9 +++-- 1 file changed, 7 insertions(+), 2 deletions(-) diff --git a/migration/ram.c b/migration/ram.c index 72143da0ac..3eb5b0d7a7 100644 --- a/migration/ram.c +++ b/migration/ram.c @@ -2015,8 +2015,13 @@ static int ram_save_host_page(RAMState *rs, PageSearchStatus *pss, pages += tmppages; pss->page++; -/* Allow rate limiting to happen in the middle of huge pages */ -migration_rate_limit(); +/* + * Allow rate limiting to happen in the middle of huge pages if + * something is sent in the current iteration. + */ +if (pagesize_bits > 1 && tmppages > 0) { +migration_rate_limit(); +} } while ((pss->page & (pagesize_bits - 1)) && offset_in_ramblock(pss->block, ((ram_addr_t)pss->page) << TARGET_PAGE_BITS)); -- 2.23.0
[PATCH v4 0/2] Some modification about ram_save_host_page()
Hi all, This series include patches as below: Patch 1: - reduce unnecessary rate limiting in ram_save_host_page() Patch 2: - optimized ram_save_host_page() by using migration_bitmap_find_dirty() to find dirty pages History: v3 -> v4: - Remove the modification to ram_save_host_page() comment [Peter Xu] - Remove the renaming of tmppages v2 -> v3: - Reduce unnecessary rate limiting if nothing is sent in the current iteration [David Edmondson] - Invert the body of the loop in ram_save_host_page() [David Edmondson] v1 -> v2: - Modify ram_save_host_page() comment [David Edmondson] - Remove 'goto' [David Edmondson] Kunkun Jiang (2): migration/ram: Reduce unnecessary rate limiting migration/ram: Optimize ram_save_host_page() migration/ram.c | 34 +++--- 1 file changed, 19 insertions(+), 15 deletions(-) -- 2.23.0
[RFC PATCH 2/3] vfio: Maintain DMA mapping range for the container
From: Zenghui Yu When synchronizing dirty bitmap from kernel VFIO we do it in a per-iova-range fashion and we allocate the userspace bitmap for each of the ioctl. This patch introduces `struct VFIODMARange` to describe a range of the given DMA mapping with respect to a VFIO_IOMMU_MAP_DMA operation, and make the bitmap cache of this range be persistent so that we don't need to g_try_malloc0() every time. Note that the new structure is almost a copy of `struct vfio_iommu_type1_dma_map` but only internally used by QEMU. More importantly, the cached per-iova-range dirty bitmap will be further used when we want to add support for the CLEAR_BITMAP and this cached bitmap will be used to guarantee we don't clear any unknown dirty bits otherwise that can be a severe data loss issue for migration code. It's pretty intuitive to maintain a bitmap per container since we perform log_sync at this granule. But I don't know how to deal with things like memory hot-{un}plug, sparse DMA mappings, etc. Suggestions welcome. * yet something to-do: - can't work with guest viommu - no locks - etc [ The idea and even the commit message are largely inherited from kvm side. See commit 9f4bf4baa8b820c7930e23c9566c9493db7e1d25. 
] Signed-off-by: Zenghui Yu Signed-off-by: Kunkun Jiang --- hw/vfio/common.c | 62 +++ include/hw/vfio/vfio-common.h | 9 + 2 files changed, 65 insertions(+), 6 deletions(-) diff --git a/hw/vfio/common.c b/hw/vfio/common.c index c0ff20f0a2..e7cb0e2b23 100644 --- a/hw/vfio/common.c +++ b/hw/vfio/common.c @@ -426,6 +426,29 @@ unmap_exit: return ret; } +static VFIODMARange *vfio_lookup_match_range(VFIOContainer *container, +hwaddr start_addr, hwaddr size) +{ +VFIODMARange *qrange; + +QLIST_FOREACH(qrange, >dma_list, next) { +if (qrange->iova == start_addr && qrange->size == size) { +return qrange; +} +} +return NULL; +} + +static void vfio_dma_range_init_dirty_bitmap(VFIODMARange *qrange) +{ +uint64_t pages, size; + +pages = REAL_HOST_PAGE_ALIGN(qrange->size) / qemu_real_host_page_size; +size = ROUND_UP(pages, sizeof(__u64) * BITS_PER_BYTE) / BITS_PER_BYTE; + +qrange->bitmap = g_malloc0(size); +} + /* * DMA - Mapping and unmapping for the "type1" IOMMU interface used on x86 */ @@ -439,12 +462,29 @@ static int vfio_dma_unmap(VFIOContainer *container, .iova = iova, .size = size, }; +VFIODMARange *qrange; if (iotlb && container->dirty_pages_supported && vfio_devices_all_running_and_saving(container)) { return vfio_dma_unmap_bitmap(container, iova, size, iotlb); } +/* + * unregister the DMA range + * + * It seems that the memory layer will give us the same section as the one + * used in region_add(). Otherwise it'll be complicated to manipulate the + * bitmap across region_{add,del}. Is there any guarantee? + * + * But there is really not such a restriction on the kernel interface + * (VFIO_IOMMU_DIRTY_PAGES_FLAG_{UN}MAP_DMA, etc). 
+ */ +qrange = vfio_lookup_match_range(container, iova, size); +assert(qrange); +g_free(qrange->bitmap); +QLIST_REMOVE(qrange, next); +g_free(qrange); + while (ioctl(container->fd, VFIO_IOMMU_UNMAP_DMA, )) { /* * The type1 backend has an off-by-one bug in the kernel (71a7d3d78e3c @@ -481,6 +521,14 @@ static int vfio_dma_map(VFIOContainer *container, hwaddr iova, .iova = iova, .size = size, }; +VFIODMARange *qrange; + +qrange = g_malloc0(sizeof(*qrange)); +qrange->iova = iova; +qrange->size = size; +QLIST_INSERT_HEAD(>dma_list, qrange, next); +/* XXX allocate the dirty bitmap on demand */ +vfio_dma_range_init_dirty_bitmap(qrange); if (!readonly) { map.flags |= VFIO_DMA_MAP_FLAG_WRITE; @@ -992,9 +1040,14 @@ static int vfio_get_dirty_bitmap(VFIOContainer *container, uint64_t iova, { struct vfio_iommu_type1_dirty_bitmap *dbitmap; struct vfio_iommu_type1_dirty_bitmap_get *range; +VFIODMARange *qrange; uint64_t pages; int ret; +qrange = vfio_lookup_match_range(container, iova, size); +/* the same as vfio_dma_unmap() */ +assert(qrange); + dbitmap = g_malloc0(sizeof(*dbitmap) + sizeof(*range)); dbitmap->argsz = sizeof(*dbitmap) + sizeof(*range); @@ -1013,11 +1066,8 @@ static int vfio_get_dirty_bitmap(VFIOContainer *container, uint64_t iova, pages = REAL_HOST_PAGE_ALIGN(range->size) / qemu_real_host_page_size; range->bitmap.size = ROUND_UP(pages, sizeof(__u64) * BITS_PER_BYTE) / BITS_PER_BYTE; -range->bitmap.data = g_try_malloc0(range->bitmap.size); -if (!range->bitmap.data) { -ret = -ENOMEM; -goto err_out; -}
[RFC PATCH 3/3] vfio/migration: Support VFIO_IOMMU_DIRTY_PAGES_FLAG_CLEAR_BITMAP
From: Zenghui Yu The new capability VFIO_DIRTY_LOG_MANUAL_CLEAR and the new ioctl VFIO_IOMMU_DIRTY_PAGES_FLAG_CLEAR_BITMAP have been introduced in the kernel, tweak the userspace side to use them. Check if the kernel supports VFIO_DIRTY_LOG_MANUAL_CLEAR and provide the log_clear() hook for vfio_memory_listener. If the kernel supports it, deliever the clear message to kernel. Signed-off-by: Zenghui Yu Signed-off-by: Kunkun Jiang --- hw/vfio/common.c | 145 +- include/hw/vfio/vfio-common.h | 1 + 2 files changed, 145 insertions(+), 1 deletion(-) diff --git a/hw/vfio/common.c b/hw/vfio/common.c index e7cb0e2b23..713515b86a 100644 --- a/hw/vfio/common.c +++ b/hw/vfio/common.c @@ -1182,10 +1182,146 @@ static void vfio_listener_log_sync(MemoryListener *listener, } } +/* + * I'm not sure if there's any alignment requirement for the CLEAR_BITMAP + * ioctl. But copy from kvm side and align {start, size} with 64 pages. + * + * I think the code can be simplified a lot if no alignment requirement. + */ +#define VFIO_CLEAR_LOG_SHIFT 6 +#define VFIO_CLEAR_LOG_ALIGN (qemu_real_host_page_size << VFIO_CLEAR_LOG_SHIFT) +#define VFIO_CLEAR_LOG_MASK (-VFIO_CLEAR_LOG_ALIGN) + +static int vfio_log_clear_one_range(VFIOContainer *container, +VFIODMARange *qrange, uint64_t start, uint64_t size) +{ +struct vfio_iommu_type1_dirty_bitmap *dbitmap; +struct vfio_iommu_type1_dirty_bitmap_get *range; + +dbitmap = g_malloc0(sizeof(*dbitmap) + sizeof(*range)); + +dbitmap->argsz = sizeof(*dbitmap) + sizeof(*range); +dbitmap->flags = VFIO_IOMMU_DIRTY_PAGES_FLAG_CLEAR_BITMAP; +range = (struct vfio_iommu_type1_dirty_bitmap_get *)>data; + +/* + * Now let's deal with the actual bitmap, which is almost the same + * as the kvm side. 
+ */ +uint64_t end, bmap_start, start_delta, bmap_npages; +unsigned long *bmap_clear = NULL, psize = qemu_real_host_page_size; +int ret; + +bmap_start = start & VFIO_CLEAR_LOG_MASK; +start_delta = start - bmap_start; +bmap_start /= psize; + +bmap_npages = DIV_ROUND_UP(size + start_delta, VFIO_CLEAR_LOG_ALIGN) +<< VFIO_CLEAR_LOG_SHIFT; +end = qrange->size / psize; +if (bmap_npages > end - bmap_start) { +bmap_npages = end - bmap_start; +} +start_delta /= psize; + +if (start_delta) { +bmap_clear = bitmap_new(bmap_npages); +bitmap_copy_with_src_offset(bmap_clear, qrange->bitmap, +bmap_start, start_delta + size / psize); +bitmap_clear(bmap_clear, 0, start_delta); +range->bitmap.data = (__u64 *)bmap_clear; +} else { +range->bitmap.data = (__u64 *)(qrange->bitmap + BIT_WORD(bmap_start)); +} + +range->iova = qrange->iova + bmap_start * psize; +range->size = bmap_npages * psize; +range->bitmap.size = ROUND_UP(bmap_npages, sizeof(__u64) * BITS_PER_BYTE) / + BITS_PER_BYTE; +range->bitmap.pgsize = qemu_real_host_page_size; + +ret = ioctl(container->fd, VFIO_IOMMU_DIRTY_PAGES, dbitmap); +if (ret) { +error_report("Failed to clear dirty log for iova: 0x%"PRIx64 +" size: 0x%"PRIx64" err: %d", (uint64_t)range->iova, +(uint64_t)range->size, errno); +goto err_out; +} + +bitmap_clear(qrange->bitmap, bmap_start + start_delta, size / psize); +err_out: +g_free(bmap_clear); +g_free(dbitmap); +return 0; +} + +static int vfio_physical_log_clear(VFIOContainer *container, + MemoryRegionSection *section) +{ +uint64_t start, size, offset, count; +VFIODMARange *qrange; +int ret = 0; + +if (!container->dirty_log_manual_clear) { +/* No need to do explicit clear */ +return ret; +} + +start = section->offset_within_address_space; +size = int128_get64(section->size); + +if (!size) { +return ret; +} + +QLIST_FOREACH(qrange, >dma_list, next) { +/* + * Discard ranges that do not overlap the section (e.g., the + * Memory BAR regions of the device) + */ +if (qrange->iova > start + size - 1 || +start > 
qrange->iova + qrange->size - 1) { +continue; +} + +if (start >= qrange->iova) { +/* The range starts before section or is aligned to it. */ +offset = start - qrange->iova; +count = MIN(qrange->size - offset, size); +} else { +/* The range starts after section. */ +offset = 0; +count = MIN(qrange->size, size - (qrange->iova - start)); +} +ret = vfio_log_clear_one_range(container, qrange, offset, count); +
[RFC PATCH 1/3] linux-headers: update against 5.12-rc2 and "vfio log clear" series
From: Zenghui Yu The new capability VFIO_DIRTY_LOG_MANUAL_CLEAR and the new ioctl VFIO_IOMMU_DIRTY_PAGES_FLAG_CLEAR_BITMAP have been introduced in the kernel, update the header to add them. Signed-off-by: Zenghui Yu Signed-off-by: Kunkun Jiang --- linux-headers/linux/vfio.h | 55 +- 1 file changed, 54 insertions(+), 1 deletion(-) diff --git a/linux-headers/linux/vfio.h b/linux-headers/linux/vfio.h index 609099e455..9836c5442f 100644 --- a/linux-headers/linux/vfio.h +++ b/linux-headers/linux/vfio.h @@ -46,6 +46,20 @@ */ #define VFIO_NOIOMMU_IOMMU 8 +/* Supports VFIO_DMA_UNMAP_FLAG_ALL */ +#define VFIO_UNMAP_ALL 9 + +/* Supports the vaddr flag for DMA map and unmap */ +#define VFIO_UPDATE_VADDR 10 + +/* + * The vfio_iommu driver may support user clears dirty log manually, which means + * dirty log is not cleared automatically after dirty log is copied to userspace, + * it's user's duty to clear dirty log. Note: when user queries this extension + * and vfio_iommu driver supports it, then it is enabled. + */ +#define VFIO_DIRTY_LOG_MANUAL_CLEAR11 + /* * The IOCTL interface is designed for extensibility by embedding the * structure length (argsz) and flags into structures passed between @@ -1074,12 +1088,22 @@ struct vfio_iommu_type1_info_dma_avail { * * Map process virtual addresses to IO virtual addresses using the * provided struct vfio_dma_map. Caller sets argsz. READ &/ WRITE required. + * + * If flags & VFIO_DMA_MAP_FLAG_VADDR, update the base vaddr for iova, and + * unblock translation of host virtual addresses in the iova range. The vaddr + * must have previously been invalidated with VFIO_DMA_UNMAP_FLAG_VADDR. To + * maintain memory consistency within the user application, the updated vaddr + * must address the same memory object as originally mapped. Failure to do so + * will result in user memory corruption and/or device misbehavior. iova and + * size must match those in the original MAP_DMA call. 
Protection is not + * changed, and the READ & WRITE flags must be 0. */ struct vfio_iommu_type1_dma_map { __u32 argsz; __u32 flags; #define VFIO_DMA_MAP_FLAG_READ (1 << 0)/* readable from device */ #define VFIO_DMA_MAP_FLAG_WRITE (1 << 1) /* writable from device */ +#define VFIO_DMA_MAP_FLAG_VADDR (1 << 2) __u64 vaddr; /* Process virtual address */ __u64 iova; /* IO virtual address */ __u64 size; /* Size of mapping (bytes) */ @@ -1102,6 +1126,7 @@ struct vfio_bitmap { * field. No guarantee is made to the user that arbitrary unmaps of iova * or size different from those used in the original mapping call will * succeed. + * * VFIO_DMA_UNMAP_FLAG_GET_DIRTY_BITMAP should be set to get the dirty bitmap * before unmapping IO virtual addresses. When this flag is set, the user must * provide a struct vfio_bitmap in data[]. User must provide zero-allocated @@ -,11 +1136,21 @@ struct vfio_bitmap { * indicates that the page at that offset from iova is dirty. A Bitmap of the * pages in the range of unmapped size is returned in the user-provided * vfio_bitmap.data. + * + * If flags & VFIO_DMA_UNMAP_FLAG_ALL, unmap all addresses. iova and size + * must be 0. This cannot be combined with the get-dirty-bitmap flag. + * + * If flags & VFIO_DMA_UNMAP_FLAG_VADDR, do not unmap, but invalidate host + * virtual addresses in the iova range. Tasks that attempt to translate an + * iova's vaddr will block. DMA to already-mapped pages continues. This + * cannot be combined with the get-dirty-bitmap flag. */ struct vfio_iommu_type1_dma_unmap { __u32 argsz; __u32 flags; #define VFIO_DMA_UNMAP_FLAG_GET_DIRTY_BITMAP (1 << 0) +#define VFIO_DMA_UNMAP_FLAG_ALL (1 << 1) +#define VFIO_DMA_UNMAP_FLAG_VADDR (1 << 2) __u64 iova; /* IO virtual address */ __u64 size; /* Size of mapping (bytes) */ __u8data[]; @@ -1161,7 +1196,24 @@ struct vfio_iommu_type1_dma_unmap { * actual bitmap. If dirty pages logging is not enabled, an error will be * returned. 
* - * Only one of the flags _START, _STOP and _GET may be specified at a time. + * Calling the IOCTL with VFIO_IOMMU_DIRTY_PAGES_FLAG_CLEAR_BITMAP flag set, + * instructs the IOMMU driver to clear the dirty status of pages in a bitmap + * for IOMMU container for a given IOVA range. The user must specify the IOVA + * range, the bitmap and the pgsize through the structure + * vfio_iommu_type1_dirty_bitmap_get in the data[] portion. This interface + * supports clearing a bitmap of the smallest supported pgsize only and can be + * modified in future to clear a bitmap of any specified supported pgsize. The + * user must
[RFC PATCH 0/3] vfio/migration: Support manual clear vfio dirty log
Hi all, Previously, the dirty log was cleared immediately after it was synced to userspace. This can cause redundant dirty handling when userspace processes the dirty log iteratively: once vfio clears the log, new dirty state starts to accumulate, and that new state is reported to userspace even when it was generated before userspace had handled the same dirty page. Since the kernel has introduced a new dirty log tracking method for vfio based on iommu hwdbm[1], along with a new capability named VFIO_DIRTY_LOG_MANUAL_CLEAR, we can eliminate some of this redundant dirty handling by supporting it. This series includes the following patches: Patch 1: - update linux-headers/linux/vfio.h from the kernel side Patch 2: - introduce 'struct VFIODMARange' to describe a range of a given DMA mapping with respect to a VFIO_IOMMU_MAP_DMA operation Patch 3: - implement the operation to manually clear the vfio dirty log, which eliminates some redundant dirty handling Thanks, Kunkun Jiang [1] https://lore.kernel.org/linux-iommu/20210310090614.26668-1-zhukeqi...@huawei.com/T/#mb168c9738ecd3d8794e2da14f970545d5820f863 Zenghui Yu (3): linux-headers: update against 5.12-rc2 and "vfio log clear" series vfio: Maintain DMA mapping range for the container vfio/migration: Support VFIO_IOMMU_DIRTY_PAGES_FLAG_CLEAR_BITMAP hw/vfio/common.c | 207 -- include/hw/vfio/vfio-common.h | 10 ++ linux-headers/linux/vfio.h| 55 - 3 files changed, 264 insertions(+), 8 deletions(-) -- 2.23.0
Re: [PATCH] vfio: Support host translation granule size
Hi Alex, On 2021/3/10 7:17, Alex Williamson wrote: On Thu, 4 Mar 2021 21:34:46 +0800 Kunkun Jiang wrote: The cpu_physical_memory_set_dirty_lebitmap() can quickly deal with the dirty pages of memory by bitmap-traveling, regardless of whether the bitmap is aligned correctly or not. cpu_physical_memory_set_dirty_lebitmap() supports pages in bitmap of host page size. So it'd better to set bitmap_pgsize to host page size to support more translation granule sizes. Fixes: 87ea529c502 (vfio: Get migration capability flags for container) Signed-off-by: Kunkun Jiang --- hw/vfio/common.c | 44 ++-- 1 file changed, 22 insertions(+), 22 deletions(-) diff --git a/hw/vfio/common.c b/hw/vfio/common.c index 6ff1daa763..69fb5083a4 100644 --- a/hw/vfio/common.c +++ b/hw/vfio/common.c @@ -378,7 +378,7 @@ static int vfio_dma_unmap_bitmap(VFIOContainer *container, { struct vfio_iommu_type1_dma_unmap *unmap; struct vfio_bitmap *bitmap; -uint64_t pages = TARGET_PAGE_ALIGN(size) >> TARGET_PAGE_BITS; +uint64_t pages = REAL_HOST_PAGE_ALIGN(size) / qemu_real_host_page_size; int ret; unmap = g_malloc0(sizeof(*unmap) + sizeof(*bitmap)); @@ -390,12 +390,12 @@ static int vfio_dma_unmap_bitmap(VFIOContainer *container, bitmap = (struct vfio_bitmap *)>data; /* - * cpu_physical_memory_set_dirty_lebitmap() expects pages in bitmap of - * TARGET_PAGE_SIZE to mark those dirty. Hence set bitmap_pgsize to - * TARGET_PAGE_SIZE. + * cpu_physical_memory_set_dirty_lebitmap() supports pages in bitmap of + * qemu_real_host_page_size to mark those dirty. Hence set bitmap_pgsize + * to qemu_real_host_page_size. I don't see that this change is well supported by the code, cpu_physical_memory_set_dirty_lebitmap() seems to operate on Yes, cpu_physical_memory_set_dirty_lebitmap() is finally to operate on TARGET_PAGE_SIZE. But actually it supports pages in bitmap of qemu_real_host_page_size to mark those dirty. 
It uses "hpratio = qemu_real_host_page_size / TARGET_PAGE_SIZE" to adapt to different translation granule sizes (e.g. 4K, 2M, 1G). TARGET_PAGE_SIZE, and the next three patch chunks take a detour through memory listener code that seem unrelated to the change described in the commit log. This claims to fix something, what is actually broken? Thanks, Alex Commit 87ea529c502 ("vfio: Get migration capability flags for container") introduced the bug. The code at [1] (marked below) requires the host page size to be TARGET_PAGE_SIZE (e.g. 4K) in order to set container->dirty_pages_supported = true. It is inappropriate to limit the page size to TARGET_PAGE_SIZE. Best Regards Kunkun Jiang */ -bitmap->pgsize = TARGET_PAGE_SIZE; +bitmap->pgsize = qemu_real_host_page_size; bitmap->size = ROUND_UP(pages, sizeof(__u64) * BITS_PER_BYTE) / BITS_PER_BYTE; @@ -674,16 +674,16 @@ static void vfio_listener_region_add(MemoryListener *listener, return; } -if (unlikely((section->offset_within_address_space & ~TARGET_PAGE_MASK) != - (section->offset_within_region & ~TARGET_PAGE_MASK))) { +if (unlikely((section->offset_within_address_space & ~qemu_real_host_page_mask) != + (section->offset_within_region & ~qemu_real_host_page_mask))) { error_report("%s received unaligned region", __func__); return; } -iova = TARGET_PAGE_ALIGN(section->offset_within_address_space); +iova = REAL_HOST_PAGE_ALIGN(section->offset_within_address_space); llend = int128_make64(section->offset_within_address_space); llend = int128_add(llend, section->size); -llend = int128_and(llend, int128_exts64(TARGET_PAGE_MASK)); +llend = int128_and(llend, int128_exts64(qemu_real_host_page_mask)); if (int128_ge(int128_make64(iova), llend)) { return; @@ -892,8 +892,8 @@ static void vfio_listener_region_del(MemoryListener *listener, return; } -if (unlikely((section->offset_within_address_space &
~qemu_real_host_page_mask) != + (section->offset_within_region & ~qemu_real_host_page_mask))) { error_report("%s received unaligned region", __func__); return; } @@ -921,10 +921,10 @@ static void vfio_listener_region_del(MemoryListener *listener, */ } -iova = TARGET_PAGE_ALIGN(section->offset_within_address_space); +iova = REAL_HOST_PAGE_ALIGN(section->offset_within_address_space); llend = int128_make64(section->offset_within_address_space); llend = int128_add(llend, section->size); -llend = int128_and(llend
Re: [PATCH v3 2/3] migration/ram: Reduce unnecessary rate limiting
Hi, On 2021/3/10 0:15, Peter Xu wrote: On Tue, Mar 09, 2021 at 10:33:04PM +0800, Kunkun Jiang wrote: Hi, On 2021/3/9 5:12, Peter Xu wrote: On Mon, Mar 08, 2021 at 06:34:58PM +0800, Kunkun Jiang wrote: Hi, On 2021/3/5 22:22, Peter Xu wrote: Kunkun, On Fri, Mar 05, 2021 at 03:50:34PM +0800, Kunkun Jiang wrote: When the host page is a huge page and something is sent in the current iteration, the migration_rate_limit() should be executed. If not, this function can be omitted to save time. Rename tmppages to pages_this_iteration to express its meaning more clearly. Signed-off-by: Keqian Zhu Signed-off-by: Kunkun Jiang --- migration/ram.c | 21 ++--- 1 file changed, 14 insertions(+), 7 deletions(-) diff --git a/migration/ram.c b/migration/ram.c index a168da5cdd..9fc5b2997c 100644 --- a/migration/ram.c +++ b/migration/ram.c @@ -1988,7 +1988,7 @@ static int ram_save_target_page(RAMState *rs, PageSearchStatus *pss, static int ram_save_host_page(RAMState *rs, PageSearchStatus *pss, bool last_stage) { -int tmppages, pages = 0; +int pages = 0; size_t pagesize_bits = qemu_ram_pagesize(pss->block) >> TARGET_PAGE_BITS; unsigned long start_page = pss->page; @@ -2000,21 +2000,28 @@ static int ram_save_host_page(RAMState *rs, PageSearchStatus *pss, } do { +int pages_this_iteration = 0; + /* Check if the page is dirty and send it if it is */ if (!migration_bitmap_clear_dirty(rs, pss->block, pss->page)) { pss->page++; continue; } -tmppages = ram_save_target_page(rs, pss, last_stage); -if (tmppages < 0) { -return tmppages; +pages_this_iteration = ram_save_target_page(rs, pss, last_stage); +if (pages_this_iteration < 0) { +return pages_this_iteration; } -pages += tmppages; +pages += pages_this_iteration; To me, both names are okay, it's just that the new name doesn't really provide a lot more new information, while it's even longer... 
Since you seem to prefer cleaning up tmppages, I'm actually thinking whether it should be called as "pages" at all since ram_save_target_page() majorly only returns either 1 if succeeded or <0 if error. There's only one very corner case of xbzrle where it can return 0 in save_xbzrle_page(): if (encoded_len == 0) { trace_save_xbzrle_page_skipping(); return 0; } I think it means the page didn't change at all, then I'm also wondering maybe it can also return 1 showing one page migrated (though actually skipped!) which should still be fine for the callers, e.g., ram_find_and_save_block() who will finally check this "pages" value. So I think _maybe_ that's a nicer cleanup to change that "return 0" to "return 1", then another patch to make the return value to be (1) return 0 if page saved, or (2) return <0 if error. Then here in ram_save_host_page() tmppages can be renamed to "ret" or "succeed". Thanks for your advice. change "return 0" to "return 1" would have a slight effect on 'rs->target_page_count += pages' in ram_save_iterate(). This may lead to consider more complex situations. What do you think of this? I don't think we should change the meaning of ram_save_host_page()'s return value, but only ram_save_target_page(); ram_save_host_page() could return >1 for huge pages. Thanks, I am not sure I have got your point. If I change "return 0" to "return 1" (actually skipped), the "pages" in the ram_save_host_page() will add 1(original add 0). Then it will end the loop in the ram_find_and_save_block. Frankly I even think it's a bug - when return 0 it could mean the xbzrle page is the same as before even if dirty bit set (the page just got written with the same data, that's why dirty bit set but xbzrle calculated with no diff). However it shouldn't mean "done==1" which is a sign of "migration complete" imho.. Thanks for your explanation, I get it. And then in the ram_save_iterate(), the modify may have a effect on "rs->target_page_count += page". 
I haven't done enough research on this part of code. Do you think this change will have a big impact? Maybe, but I don't expect it to change anything real. If to change it we'd definitely better smoke xbzrle a bit. It's a pure idea I got in mind to cleanup the code, but feel free to ignore it too. For your current series, I think the last patch is the most appealing. So maybe we can focus on that first. Thanks, You are right. The change here may be not worth it. Thanks, Kunkun Jiang
Re: [PATCH v3 2/3] migration/ram: Reduce unnecessary rate limiting
Hi, On 2021/3/9 5:12, Peter Xu wrote: On Mon, Mar 08, 2021 at 06:34:58PM +0800, Kunkun Jiang wrote: Hi, On 2021/3/5 22:22, Peter Xu wrote: Kunkun, On Fri, Mar 05, 2021 at 03:50:34PM +0800, Kunkun Jiang wrote: When the host page is a huge page and something is sent in the current iteration, the migration_rate_limit() should be executed. If not, this function can be omitted to save time. Rename tmppages to pages_this_iteration to express its meaning more clearly. Signed-off-by: Keqian Zhu Signed-off-by: Kunkun Jiang --- migration/ram.c | 21 ++--- 1 file changed, 14 insertions(+), 7 deletions(-) diff --git a/migration/ram.c b/migration/ram.c index a168da5cdd..9fc5b2997c 100644 --- a/migration/ram.c +++ b/migration/ram.c @@ -1988,7 +1988,7 @@ static int ram_save_target_page(RAMState *rs, PageSearchStatus *pss, static int ram_save_host_page(RAMState *rs, PageSearchStatus *pss, bool last_stage) { -int tmppages, pages = 0; +int pages = 0; size_t pagesize_bits = qemu_ram_pagesize(pss->block) >> TARGET_PAGE_BITS; unsigned long start_page = pss->page; @@ -2000,21 +2000,28 @@ static int ram_save_host_page(RAMState *rs, PageSearchStatus *pss, } do { +int pages_this_iteration = 0; + /* Check if the page is dirty and send it if it is */ if (!migration_bitmap_clear_dirty(rs, pss->block, pss->page)) { pss->page++; continue; } -tmppages = ram_save_target_page(rs, pss, last_stage); -if (tmppages < 0) { -return tmppages; +pages_this_iteration = ram_save_target_page(rs, pss, last_stage); +if (pages_this_iteration < 0) { +return pages_this_iteration; } -pages += tmppages; +pages += pages_this_iteration; To me, both names are okay, it's just that the new name doesn't really provide a lot more new information, while it's even longer... Since you seem to prefer cleaning up tmppages, I'm actually thinking whether it should be called as "pages" at all since ram_save_target_page() majorly only returns either 1 if succeeded or <0 if error. 
There's only one very corner case of xbzrle where it can return 0 in save_xbzrle_page(): if (encoded_len == 0) { trace_save_xbzrle_page_skipping(); return 0; } I think it means the page didn't change at all, then I'm also wondering maybe it can also return 1 showing one page migrated (though actually skipped!) which should still be fine for the callers, e.g., ram_find_and_save_block() who will finally check this "pages" value. So I think _maybe_ that's a nicer cleanup to change that "return 0" to "return 1", then another patch to make the return value to be (1) return 0 if page saved, or (2) return <0 if error. Then here in ram_save_host_page() tmppages can be renamed to "ret" or "succeed". Thanks for your advice. change "return 0" to "return 1" would have a slight effect on 'rs->target_page_count += pages' in ram_save_iterate(). This may lead to consider more complex situations. What do you think of this? I don't think we should change the meaning of ram_save_host_page()'s return value, but only ram_save_target_page(); ram_save_host_page() could return >1 for huge pages. Thanks, I am not sure I have got your point. If I change "return 0" to "return 1" (actually skipped), the "pages" in the ram_save_host_page() will add 1(original add 0). Then it will end the loop in the ram_find_and_save_block. And then in the ram_save_iterate(), the modify may have a effect on "rs->target_page_count += page". I haven't done enough research on this part of code. Do you think this change will have a big impact? Thanks, Kunkun Jiang
Re: [PATCH v3 1/3] migration/ram: Modify the code comment of ram_save_host_page()
Hi, On 2021/3/9 5:03, Peter Xu wrote: On Mon, Mar 08, 2021 at 06:33:56PM +0800, Kunkun Jiang wrote: Hi, Peter On 2021/3/5 21:59, Peter Xu wrote: On Fri, Mar 05, 2021 at 03:50:33PM +0800, Kunkun Jiang wrote: The ram_save_host_page() has been modified several times since its birth. But the comment hasn't been modified as it should be. It'd better to modify the comment to explain ram_save_host_page() more clearly. Signed-off-by: Kunkun Jiang --- migration/ram.c | 16 +++- 1 file changed, 7 insertions(+), 9 deletions(-) diff --git a/migration/ram.c b/migration/ram.c index 72143da0ac..a168da5cdd 100644 --- a/migration/ram.c +++ b/migration/ram.c @@ -1970,15 +1970,13 @@ static int ram_save_target_page(RAMState *rs, PageSearchStatus *pss, } /** - * ram_save_host_page: save a whole host page + * ram_save_host_page: save a whole host page or the rest of a RAMBlock * - * Starting at *offset send pages up to the end of the current host - * page. It's valid for the initial offset to point into the middle of - * a host page in which case the remainder of the hostpage is sent. - * Only dirty target pages are sent. Note that the host page size may - * be a huge page for this block. - * The saving stops at the boundary of the used_length of the block - * if the RAMBlock isn't a multiple of the host page size. [1] + * Send dirty pages between pss->page and either the end of that page + * or the used_length of the RAMBlock, whichever is smaller. + * + * Note that if the host page is a huge page, pss->page may be in the + * middle of that page. It reads more like a shorter version of original comment.. Could you point out what's the major difference? Since the original comment looks still good to me. Sorry for late reply. The reason I want to modify the comment is that I think some parts of the comment don't match the code. 
(1) The brief description of ram_save_host_page() omits the case where ram_save_host_page() will send dirty pages between pss->page and the used_length of the block, if the RAMBlock isn't a multiple of the host page size. It seems it is mentioned at [1] above. (2) '*offset' should be replaced with pss->page. This one looks right as you mentioned. So taking the chance of optimizing ram_save_host_page(), I modified the comment. This version of the comment was suggested by @David Edmondson. Compared with the original comment, I think it looks concise. I think changing incorrect comments is nice; it's just that we should still avoid rewriting comments with similar contents. * * Returns the number of pages written or negative on error * @@ -2002,7 +2000,7 @@ static int ram_save_host_page(RAMState *rs, PageSearchStatus *pss, } do { -/* Check the pages is dirty and if it is send it */ +/* Check if the page is dirty and send it if it is */ This line fixes some English issues, it seems. Looks okay, but normally I won't change it unless the wording was too weird, so as to keep the git record or the original lines. Here the original looks still okay, no strong opinion. Thanks, Yes, the original reads weird to me. Same reason as above, I modified this line. If you think there is no need to modify the original comment, I do not insist on changing it. As I mentioned I don't really have a strong preference, so feel free to keep your own will. :) I would only suggest avoiding rewriting chunks with similar meanings. Thanks, Thank you for your advice, I will pay more attention to it in the future. Best regards, Kunkun Jiang
Re: [PATCH v3 3/3] migration/ram: Optimize ram_save_host_page()
Hi, On 2021/3/9 5:36, Peter Xu wrote: On Mon, Mar 08, 2021 at 09:58:02PM +0800, Kunkun Jiang wrote: Hi, On 2021/3/5 22:30, Peter Xu wrote: On Fri, Mar 05, 2021 at 03:50:35PM +0800, Kunkun Jiang wrote: Starting from pss->page, ram_save_host_page() will check every page and send the dirty pages up to the end of the current host page or the boundary of used_length of the block. If the host page size is a huge page, the step "check" will take a lot of time. This will improve performance to use migration_bitmap_find_dirty(). Is there any measurement done? I tested it on Kunpeng 920. VM params: 1U 4G( page size 1G). The time of ram_save_host_page() in the last round of ram saving: before optimize: 9250us after optimize: 34us Looks like an idle VM, but still this is a great improvement. Would you mind add this into the commit message too? Ok, I will add it in the next version. This looks like an optimization, but to me it seems to have changed a lot context that it doesn't need to... Do you think it'll also work to just look up dirty again and update pss->page properly if migration_bitmap_clear_dirty() returned zero? Thanks, This just inverted the body of the loop, suggested by @David Edmondson. Here is the v2[1]. Do you mean to change it like this? [1]: http://patchwork.ozlabs.org/project/qemu-devel/patch/20210301082132.1107-4-jiangkun...@huawei.com/ I see, then it's okay - But indeed I still prefer your previous version. :) Thanks, Both versions are fine to me. This version may make the final code slightly cleaner, I think. Thanks, Kunkun Jiang
Re: [PATCH v3 3/3] migration/ram: Optimize ram_save_host_page()
Hi, On 2021/3/5 22:30, Peter Xu wrote: On Fri, Mar 05, 2021 at 03:50:35PM +0800, Kunkun Jiang wrote: Starting from pss->page, ram_save_host_page() will check every page and send the dirty pages up to the end of the current host page or the boundary of used_length of the block. If the host page size is a huge page, the step "check" will take a lot of time. This will improve performance to use migration_bitmap_find_dirty(). Is there any measurement done? I tested it on Kunpeng 920. VM params: 1U 4G( page size 1G). The time of ram_save_host_page() in the last round of ram saving: before optimize: 9250us after optimize: 34us This looks like an optimization, but to me it seems to have changed a lot context that it doesn't need to... Do you think it'll also work to just look up dirty again and update pss->page properly if migration_bitmap_clear_dirty() returned zero? Thanks, This just inverted the body of the loop, suggested by @David Edmondson. Here is the v2[1]. Do you mean to change it like this? 
[1]: http://patchwork.ozlabs.org/project/qemu-devel/patch/20210301082132.1107-4-jiangkun...@huawei.com/ Thanks, Kunkun Jiang Signed-off-by: Keqian Zhu Signed-off-by: Kunkun Jiang --- migration/ram.c | 39 +++ 1 file changed, 19 insertions(+), 20 deletions(-) diff --git a/migration/ram.c b/migration/ram.c index 9fc5b2997c..28215aefe4 100644 --- a/migration/ram.c +++ b/migration/ram.c @@ -1991,6 +1991,8 @@ static int ram_save_host_page(RAMState *rs, PageSearchStatus *pss, int pages = 0; size_t pagesize_bits = qemu_ram_pagesize(pss->block) >> TARGET_PAGE_BITS; +unsigned long hostpage_boundary = +QEMU_ALIGN_UP(pss->page + 1, pagesize_bits); unsigned long start_page = pss->page; int res; @@ -2003,30 +2005,27 @@ static int ram_save_host_page(RAMState *rs, PageSearchStatus *pss, int pages_this_iteration = 0; /* Check if the page is dirty and send it if it is */ -if (!migration_bitmap_clear_dirty(rs, pss->block, pss->page)) { -pss->page++; -continue; -} - -pages_this_iteration = ram_save_target_page(rs, pss, last_stage); -if (pages_this_iteration < 0) { -return pages_this_iteration; -} +if (migration_bitmap_clear_dirty(rs, pss->block, pss->page)) { +pages_this_iteration = ram_save_target_page(rs, pss, last_stage); +if (pages_this_iteration < 0) { +return pages_this_iteration; +} -pages += pages_this_iteration; -pss->page++; -/* - * Allow rate limiting to happen in the middle of huge pages if - * something is sent in the current iteration. - */ -if (pagesize_bits > 1 && pages_this_iteration > 0) { -migration_rate_limit(); +pages += pages_this_iteration; +/* + * Allow rate limiting to happen in the middle of huge pages if + * something is sent in the current iteration. 
+ */ +if (pagesize_bits > 1 && pages_this_iteration > 0) { +migration_rate_limit(); +} } -} while ((pss->page & (pagesize_bits - 1)) && +pss->page = migration_bitmap_find_dirty(rs, pss->block, pss->page); +} while ((pss->page < hostpage_boundary) && offset_in_ramblock(pss->block, ((ram_addr_t)pss->page) << TARGET_PAGE_BITS)); -/* The offset we leave with is the last one we looked at */ -pss->page--; +/* The offset we leave with is the min boundary of host page and block */ +pss->page = MIN(pss->page, hostpage_boundary) - 1; res = ram_save_release_protection(rs, pss, start_page); return (res < 0 ? res : pages); -- 2.23.0
Re: [PATCH v3 2/3] migration/ram: Reduce unnecessary rate limiting
Hi, On 2021/3/5 22:22, Peter Xu wrote: Kunkun, On Fri, Mar 05, 2021 at 03:50:34PM +0800, Kunkun Jiang wrote: When the host page is a huge page and something is sent in the current iteration, the migration_rate_limit() should be executed. If not, this function can be omitted to save time. Rename tmppages to pages_this_iteration to express its meaning more clearly. Signed-off-by: Keqian Zhu Signed-off-by: Kunkun Jiang --- migration/ram.c | 21 ++--- 1 file changed, 14 insertions(+), 7 deletions(-) diff --git a/migration/ram.c b/migration/ram.c index a168da5cdd..9fc5b2997c 100644 --- a/migration/ram.c +++ b/migration/ram.c @@ -1988,7 +1988,7 @@ static int ram_save_target_page(RAMState *rs, PageSearchStatus *pss, static int ram_save_host_page(RAMState *rs, PageSearchStatus *pss, bool last_stage) { -int tmppages, pages = 0; +int pages = 0; size_t pagesize_bits = qemu_ram_pagesize(pss->block) >> TARGET_PAGE_BITS; unsigned long start_page = pss->page; @@ -2000,21 +2000,28 @@ static int ram_save_host_page(RAMState *rs, PageSearchStatus *pss, } do { +int pages_this_iteration = 0; + /* Check if the page is dirty and send it if it is */ if (!migration_bitmap_clear_dirty(rs, pss->block, pss->page)) { pss->page++; continue; } -tmppages = ram_save_target_page(rs, pss, last_stage); -if (tmppages < 0) { -return tmppages; +pages_this_iteration = ram_save_target_page(rs, pss, last_stage); +if (pages_this_iteration < 0) { +return pages_this_iteration; } -pages += tmppages; +pages += pages_this_iteration; To me, both names are okay, it's just that the new name doesn't really provide a lot more new information, while it's even longer... Since you seem to prefer cleaning up tmppages, I'm actually thinking whether it should be called as "pages" at all since ram_save_target_page() majorly only returns either 1 if succeeded or <0 if error. 
There's only one very corner case of xbzrle where it can return 0 in save_xbzrle_page(): if (encoded_len == 0) { trace_save_xbzrle_page_skipping(); return 0; } I think it means the page didn't change at all, then I'm also wondering maybe it can also return 1 showing one page migrated (though actually skipped!) which should still be fine for the callers, e.g., ram_find_and_save_block() who will finally check this "pages" value. So I think _maybe_ that's a nicer cleanup to change that "return 0" to "return 1", then another patch to make the return value to be (1) return 0 if page saved, or (2) return <0 if error. Then here in ram_save_host_page() tmppages can be renamed to "ret" or "succeed". Thanks for your advice. change "return 0" to "return 1" would have a slight effect on 'rs->target_page_count += pages' in ram_save_iterate(). This may lead to consider more complex situations. What do you think of this? pss->page++; -/* Allow rate limiting to happen in the middle of huge pages */ -migration_rate_limit(); +/* + * Allow rate limiting to happen in the middle of huge pages if + * something is sent in the current iteration. + */ +if (pagesize_bits > 1 && pages_this_iteration > 0) { +migration_rate_limit(); +} I don't know whether this matters either.. Two calls in there: migration_update_counters(s, now); qemu_file_rate_limit(s->to_dst_file); migration_update_counters() is mostly a noop for 99.9% cases. Looks still okay... Thanks, I think even these two calls shouldn't be called if the host page size isn't a huge page or nothing is sent in the current iteration. Thanks, Kunkun Jiang } while ((pss->page & (pagesize_bits - 1)) && offset_in_ramblock(pss->block, ((ram_addr_t)pss->page) << TARGET_PAGE_BITS)); -- 2.23.0
Re: [PATCH v3 1/3] migration/ram: Modify the code comment of ram_save_host_page()
Hi, Peter On 2021/3/5 21:59, Peter Xu wrote: On Fri, Mar 05, 2021 at 03:50:33PM +0800, Kunkun Jiang wrote: The ram_save_host_page() has been modified several times since its birth. But the comment hasn't been modified as it should be. It'd better to modify the comment to explain ram_save_host_page() more clearly. Signed-off-by: Kunkun Jiang --- migration/ram.c | 16 +++- 1 file changed, 7 insertions(+), 9 deletions(-) diff --git a/migration/ram.c b/migration/ram.c index 72143da0ac..a168da5cdd 100644 --- a/migration/ram.c +++ b/migration/ram.c @@ -1970,15 +1970,13 @@ static int ram_save_target_page(RAMState *rs, PageSearchStatus *pss, } /** - * ram_save_host_page: save a whole host page + * ram_save_host_page: save a whole host page or the rest of a RAMBlock * - * Starting at *offset send pages up to the end of the current host - * page. It's valid for the initial offset to point into the middle of - * a host page in which case the remainder of the hostpage is sent. - * Only dirty target pages are sent. Note that the host page size may - * be a huge page for this block. - * The saving stops at the boundary of the used_length of the block - * if the RAMBlock isn't a multiple of the host page size. + * Send dirty pages between pss->page and either the end of that page + * or the used_length of the RAMBlock, whichever is smaller. + * + * Note that if the host page is a huge page, pss->page may be in the + * middle of that page. It reads more like a shorter version of original comment.. Could you point out what's the major difference? Since the original comment looks still good to me. Sorry for late reply. The reason I want to modify the comment is that I think some parts of the comment don't match the code. (1) The brief description of ram_save_host_page() missed a situation that ram_save_host_page() will send dirty pages between pass->page and the used_length of the block, if the RAMBlock isn't a multiple of the host page size. 
(2) '*offset' should be replaced with pss->page. So taking the chance of optimizing ram_save_host_page(), I modified the comment. This version comment is suggested by @David Edmondson. Compared with the original comment, I think it looks concise. * * Returns the number of pages written or negative on error * @@ -2002,7 +2000,7 @@ static int ram_save_host_page(RAMState *rs, PageSearchStatus *pss, } do { -/* Check the pages is dirty and if it is send it */ +/* Check if the page is dirty and send it if it is */ This line fixes some English issues, it seems. Looks okay, but normally I won't change it unless the wording was too weird, so as to keep the git record or the original lines. Here the original looks still okay, no strong opinion. Thanks, Yes, the original reads weird to me. Same reason as above, I modified this line. If you think there is no need to modify the original commit, I do not insist on changing it. Thanks, Kunkun Jiang if (!migration_bitmap_clear_dirty(rs, pss->block, pss->page)) { pss->page++; continue; -- 2.23.0
[PATCH v3 3/3] migration/ram: Optimize ram_save_host_page()
Starting from pss->page, ram_save_host_page() will check every page and send the dirty pages up to the end of the current host page or the boundary of used_length of the block. If the host page size is a huge page, the step "check" will take a lot of time. This will improve performance to use migration_bitmap_find_dirty(). Signed-off-by: Keqian Zhu Signed-off-by: Kunkun Jiang --- migration/ram.c | 39 +++ 1 file changed, 19 insertions(+), 20 deletions(-) diff --git a/migration/ram.c b/migration/ram.c index 9fc5b2997c..28215aefe4 100644 --- a/migration/ram.c +++ b/migration/ram.c @@ -1991,6 +1991,8 @@ static int ram_save_host_page(RAMState *rs, PageSearchStatus *pss, int pages = 0; size_t pagesize_bits = qemu_ram_pagesize(pss->block) >> TARGET_PAGE_BITS; +unsigned long hostpage_boundary = +QEMU_ALIGN_UP(pss->page + 1, pagesize_bits); unsigned long start_page = pss->page; int res; @@ -2003,30 +2005,27 @@ static int ram_save_host_page(RAMState *rs, PageSearchStatus *pss, int pages_this_iteration = 0; /* Check if the page is dirty and send it if it is */ -if (!migration_bitmap_clear_dirty(rs, pss->block, pss->page)) { -pss->page++; -continue; -} - -pages_this_iteration = ram_save_target_page(rs, pss, last_stage); -if (pages_this_iteration < 0) { -return pages_this_iteration; -} +if (migration_bitmap_clear_dirty(rs, pss->block, pss->page)) { +pages_this_iteration = ram_save_target_page(rs, pss, last_stage); +if (pages_this_iteration < 0) { +return pages_this_iteration; +} -pages += pages_this_iteration; -pss->page++; -/* - * Allow rate limiting to happen in the middle of huge pages if - * something is sent in the current iteration. - */ -if (pagesize_bits > 1 && pages_this_iteration > 0) { -migration_rate_limit(); +pages += pages_this_iteration; +/* + * Allow rate limiting to happen in the middle of huge pages if + * something is sent in the current iteration. 
+ */ +if (pagesize_bits > 1 && pages_this_iteration > 0) { +migration_rate_limit(); +} } -} while ((pss->page & (pagesize_bits - 1)) && +pss->page = migration_bitmap_find_dirty(rs, pss->block, pss->page); +} while ((pss->page < hostpage_boundary) && offset_in_ramblock(pss->block, ((ram_addr_t)pss->page) << TARGET_PAGE_BITS)); -/* The offset we leave with is the last one we looked at */ -pss->page--; +/* The offset we leave with is the min boundary of host page and block */ +pss->page = MIN(pss->page, hostpage_boundary) - 1; res = ram_save_release_protection(rs, pss, start_page); return (res < 0 ? res : pages); -- 2.23.0