Re: [Question] Can I start qemu-system-aarch64 with a vmlinux(ELF format)?

2024-03-05 Thread Kunkun Jiang via

Hi Peter,

On 2024/2/29 17:44, Peter Maydell wrote:

On Thu, 29 Feb 2024 at 03:01, Kunkun Jiang  wrote:

Hi Peter,

On 2024/2/27 23:28, Peter Maydell wrote:

On Tue, 27 Feb 2024 at 14:42, Kunkun Jiang via  wrote:

Hi everybody,

I want to start qemu-system-aarch64 with a vmlinux,
which is an ELF format file. The arm_load_elf() is
implemented in arm_setup_direct_kernel_boot(). So I
thought it was supporting the ELF format file.

No, you can't do this. The hw/arm/boot.c code assumes
that ELF files are "bare metal" binaries, whereas
uImage format, AArch64 Image format, and raw binary files
are Linux kernels. Only the last three kinds of files will
be started with the boot protocol the Linux kernel expects.

For AArch64, pass the -kernel option the path to the Image
file, not the vmlinux file.

Yes, it works fine using Image files.
I would also like to ask again, is it because AArch64 does not
support vmlinux, or is it because qemu does not implement
this capability?

As I said, it is because QEMU assumes that ELF files are
bare metal images, not Linux kernel images.
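For reference, the decision Peter describes lives in hw/arm/boot.c
(arm_setup_direct_kernel_boot(), which the original question already points at).
The sketch below is paraphrased from memory rather than an exact excerpt, so
treat the helper names other than arm_load_elf() as approximations; the point
is that an ELF which loads successfully never sets is_linux, while the uImage,
AArch64 Image and raw-binary paths do.

    /* Paraphrased sketch of arm_setup_direct_kernel_boot(), hw/arm/boot.c */
    kernel_size = arm_load_elf(info, &elf_entry, &image_low_addr,
                               &image_high_addr, elf_machine, as);
    /* ELF loaded: treated as a bare-metal binary, is_linux stays false. */

    if (kernel_size < 0) {
        /* Not an ELF: try uImage, which reports is_linux for us. */
        kernel_size = load_uimage_as(info->kernel_filename, &entry, &loadaddr,
                                     &is_linux, NULL, NULL, as);
    }
    if (arm_feature(&cpu->env, ARM_FEATURE_AARCH64) && kernel_size < 0) {
        /* AArch64 Image format (or raw binary): assumed to be Linux. */
        kernel_size = load_aarch64_image(info->kernel_filename,
                                         info->loader_start, &entry, as);
        is_linux = 1;
    }
    ...
    /* Only the non-ELF loaders ever set is_linux, so only they get the
     * Linux boot protocol (DTB pointer in a register, and so on). */
    info->is_linux = is_linux;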


Sorry to bother you again. I still have a few questions.
1. What does "bare metal images" mean? Are they used in tcg mode?

2. How does QEMU decide that an ELF file is a bare-metal image? Can you post
the corresponding code?

3. How can I make the hw/arm/boot.c code treat ELF files as Linux
kernels?

Looking forward to your reply.

Thanks,
Kunkun Jiang




Re: [Question] Can I start qemu-system-aarch64 with a vmlinux(ELF format)?

2024-02-28 Thread Kunkun Jiang via

Hi Peter,

On 2024/2/27 23:28, Peter Maydell wrote:

On Tue, 27 Feb 2024 at 14:42, Kunkun Jiang via  wrote:

Hi everybody,

I want to start qemu-system-aarch64 with a vmlinux,
which is an ELF format file. The arm_load_elf() is
implemented in arm_setup_direct_kernel_boot(). So I
thought it was supporting the ELF format file.

No, you can't do this. The hw/arm/boot.c code assumes
that ELF files are "bare metal" binaries, whereas
uImage format, AArch64 Image format, and raw binary files
are Linux kernels. Only the last three kinds of files will
be started with the boot protocol the Linux kernel expects.

For AArch64, pass the -kernel option the path to the Image
file, not the vmlinux file.


Yes, it works fine using Image files.
I would also like to ask again, is it because AArch64 does not
support vmlinux, or is it because qemu does not implement
this capability?

Thanks,
Kunkun Jiang




[Question] Can I start qemu-system-aarch64 with a vmlinux(ELF format)?

2024-02-27 Thread Kunkun Jiang via

Hi everybody,

I want to start qemu-system-aarch64 with a vmlinux,
which is an ELF format file. The arm_load_elf() is
implemented in arm_setup_direct_kernel_boot(). So I
thought it was supporting the ELF format file.
But there's no output.
Here is my command line:

qemu-system-aarch64 -machine virt,gic-version=3 -enable-kvm -smp 4 -m 1G \
    -cpu host -kernel vmlinux -initrd fs -append "xxx"


Am I using it the wrong way?
Looking forward to your reply.

Thanks,
Kunkun Jiang



Re: [PATCH V9 32/46] vfio-pci: cpr part 2 (msi)

2023-07-13 Thread Kunkun Jiang via

Hi Steve,

On 2023/7/10 23:43, Steven Sistare wrote:

On 7/5/2023 4:56 AM, Kunkun Jiang wrote:

Hi Steve,

I have a few questions about the msi part of the vfio device.
In the reboot mode, you mentioned "The guest drivers' suspend methods
flush outstanding requests and re-initialize the devices". This means
that, during the downtime, the vfio device does not generate interrupts.
So no special processing is required for the msi interrupts of the vfio
device. Am I right?

Correct.


In the cpr-exec mode, will the vfio device be "stopped"? If the vfio device
is running all the time, it is possible for it to generate interrupts during
the downtime. How should these interrupts be dealt with?

The vfio device is not stopped, but its connection to the kvm instance is
severed. Interrupts are pended in the vfio kernel state, and that state is
preserved across exec, by preserving the vfio descriptors. After exec, qemu
creates a new kvm instance, attaches vfio to it, and the interrupts are
delivered.
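As a side illustration of the mechanism (plain Linux, not QEMU or the cpr
series; all names here are made up for the example): a file descriptor that
is not marked close-on-exec survives exec(), so pending eventfd state is
still readable in the new program image, which is essentially what
"preserving the vfio descriptors" relies on.

/* Toy demonstration: an eventfd created without EFD_CLOEXEC keeps its
 * pending count across exec(). */
#include <stdio.h>
#include <stdlib.h>
#include <stdint.h>
#include <unistd.h>
#include <sys/eventfd.h>

int main(int argc, char **argv)
{
    if (argc > 1) {                          /* the re-exec'ed image */
        int fd = atoi(argv[1]);
        uint64_t cnt = 0;
        if (read(fd, &cnt, sizeof(cnt)) == sizeof(cnt)) {
            printf("after exec: pending count = %llu\n",
                   (unsigned long long)cnt);
        }
        return 0;
    }

    int fd = eventfd(0, 0);                  /* no EFD_CLOEXEC: fd survives exec */
    uint64_t one = 1;
    write(fd, &one, sizeof(one));            /* "pend an interrupt" before exec */

    char fdstr[16];
    snprintf(fdstr, sizeof(fdstr), "%d", fd);
    execl("/proc/self/exe", argv[0], fdstr, (char *)NULL);
    perror("execl");
    return 1;
}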


In addition, ARM GICv4 provides support for the direct injection of vLPIs,
which are more difficult to handle. What should be done in this case?

I have not studied this feature or tried it.
Have you analyzed the VT-d posted-interrupt feature? It is similar to the
direct injection of vLPIs: both establish the interrupt injection channel
by executing irq_bypass_register_consumer().

According to the current kernel code, the connection between the vfio
producer and the old kvm consumer must first be severed, and then the
connection between the vfio producer and the new kvm consumer is
established. During the unbinding process, both ARM and x86 execute the
*_set_vcpu_affinity callback to modify the interrupt reporting channel.
For x86, in the process of stopping posted interrupts and going back to
remapping mode, cmpxchg_double is used in the code. Does this guarantee
that interrupts will not be lost?

For ARM, it will first send a DISCARD command to the ITS and then
establish the interrupt reporting channel for GICv3. The DISCARD removes
the pending interrupt, so interrupts that arrive before the channel is
re-established are silently discarded.
Do you guys have any good ideas?
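To make the ordering above concrete, here is a minimal kernel-side sketch,
not code from the cpr series, using the upstream irqbypass API as I recall
it; the helper name rebind_consumer() is made up for illustration.

/* Sketch of the rebind sequence being discussed. */
#include <linux/irqbypass.h>

static void rebind_consumer(struct irq_bypass_consumer *old_kvm_consumer,
                            struct irq_bypass_consumer *new_kvm_consumer)
{
    /*
     * Step 1: tearing down the old connection runs the consumer's
     * del_producer callback, which ends up in *_set_vcpu_affinity:
     * x86 falls back to remapping mode, ARM sends DISCARD to the ITS
     * and routes the LPI through the normal GICv3 path again.
     */
    irq_bypass_unregister_consumer(old_kvm_consumer);

    /*
     * Step 2: registering the new consumer runs add_producer and
     * re-establishes the direct-injection channel.  On ARM, interrupts
     * that fire between step 1 and step 2 are the ones at risk: the
     * DISCARD dropped their pending state.
     */
    irq_bypass_register_consumer(new_kvm_consumer);
}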

Look forward to your reply.
Kunkun Jiang


- Steve


Look forward to your reply.

Kunkun Jiang

On 2022/7/27 0:10, Steve Sistare wrote:

Finish cpr for vfio-pci MSI/MSI-X devices by preserving eventfd's and
vector state.

Signed-off-by: Steve Sistare 
---
   hw/vfio/pci.c | 119 
+-
   1 file changed, 118 insertions(+), 1 deletion(-)

diff --git a/hw/vfio/pci.c b/hw/vfio/pci.c
index b5fd2ec..1d0e8db 100644
--- a/hw/vfio/pci.c
+++ b/hw/vfio/pci.c
@@ -54,17 +54,47 @@ static void vfio_disable_interrupts(VFIOPCIDevice *vdev);
   static void vfio_mmap_set_enabled(VFIOPCIDevice *vdev, bool enabled);
   static void vfio_msi_disable_common(VFIOPCIDevice *vdev);
   +#define EVENT_FD_NAME(vdev, name)   \
+    g_strdup_printf("%s_%s", (vdev)->vbasedev.name, (name))
+
+static void save_event_fd(VFIOPCIDevice *vdev, const char *name, int nr,
+  EventNotifier *ev)
+{
+    int fd = event_notifier_get_fd(ev);
+
+    if (fd >= 0) {
+    g_autofree char *fdname = EVENT_FD_NAME(vdev, name);
+
+    cpr_resave_fd(fdname, nr, fd);
+    }
+}
+
+static int load_event_fd(VFIOPCIDevice *vdev, const char *name, int nr)
+{
+    g_autofree char *fdname = EVENT_FD_NAME(vdev, name);
+    return cpr_find_fd(fdname, nr);
+}
+
+static void delete_event_fd(VFIOPCIDevice *vdev, const char *name, int nr)
+{
+    g_autofree char *fdname = EVENT_FD_NAME(vdev, name);
+    cpr_delete_fd(fdname, nr);
+}
+
   /* Create new or reuse existing eventfd */
   static int vfio_notifier_init(VFIOPCIDevice *vdev, EventNotifier *e,
     const char *name, int nr)
   {
-    int fd = -1;   /* placeholder until a subsequent patch */
   int ret = 0;
+    int fd = load_event_fd(vdev, name, nr);
     if (fd >= 0) {
   event_notifier_init_fd(e, fd);
   } else {
   ret = event_notifier_init(e, 0);
+    if (!ret) {
+    save_event_fd(vdev, name, nr, e);
+    }
   }
   return ret;
   }
@@ -72,6 +102,7 @@ static int vfio_notifier_init(VFIOPCIDevice *vdev, 
EventNotifier *e,
   static void vfio_notifier_cleanup(VFIOPCIDevice *vdev, EventNotifier *e,
     const char *name, int nr)
   {
+    delete_event_fd(vdev, name, nr);
   event_notifier_cleanup(e);
   }
   @@ -512,6 +543,15 @@ static int vfio_msix_vector_do_use(PCIDevice *pdev, 
unsigned int nr,
   VFIOMSIVector *vector;
   int ret;
   +    /*
+ * Ignore the callback from msix_set_vector_notifiers during resume.
+ * The necessary subset of these actions is called from vfio_claim_vectors
+ * during post load.
+ */
+    if (vdev->vbasedev.reused) {
+    return 0;
+    }
+
   trace_vfio_msix_vector_do_use(vdev->vbasedev

Re: [PATCH V9 32/46] vfio-pci: cpr part 2 (msi)

2023-07-05 Thread Kunkun Jiang via

Hi Steve,

I have a few questions about the msi part of the vfio device.
In the reboot mode, you mentioned "The guest drivers' suspend methods
flush outstanding requests and re-initialize the devices". This means
that, during the downtime, the vfio device does not generate interrupts.
So no special processing is required for the msi interrupts of the vfio
device. Am I right?

In the cpr-exec mode, will the vfio device be "stopped"? If the vfio device
is running all the time, it is possible for it to generate interrupts during
the downtime. How should these interrupts be dealt with?

In addition, ARM GICv4 provides support for the direct injection of vLPIs,
which are more difficult to handle. What should be done in this case?

Look forward to your reply.

Kunkun Jiang

On 2022/7/27 0:10, Steve Sistare wrote:

Finish cpr for vfio-pci MSI/MSI-X devices by preserving eventfd's and
vector state.

Signed-off-by: Steve Sistare 
---
  hw/vfio/pci.c | 119 +-
  1 file changed, 118 insertions(+), 1 deletion(-)

diff --git a/hw/vfio/pci.c b/hw/vfio/pci.c
index b5fd2ec..1d0e8db 100644
--- a/hw/vfio/pci.c
+++ b/hw/vfio/pci.c
@@ -54,17 +54,47 @@ static void vfio_disable_interrupts(VFIOPCIDevice *vdev);
  static void vfio_mmap_set_enabled(VFIOPCIDevice *vdev, bool enabled);
  static void vfio_msi_disable_common(VFIOPCIDevice *vdev);
  
+#define EVENT_FD_NAME(vdev, name)   \

+g_strdup_printf("%s_%s", (vdev)->vbasedev.name, (name))
+
+static void save_event_fd(VFIOPCIDevice *vdev, const char *name, int nr,
+  EventNotifier *ev)
+{
+int fd = event_notifier_get_fd(ev);
+
+if (fd >= 0) {
+g_autofree char *fdname = EVENT_FD_NAME(vdev, name);
+
+cpr_resave_fd(fdname, nr, fd);
+}
+}
+
+static int load_event_fd(VFIOPCIDevice *vdev, const char *name, int nr)
+{
+g_autofree char *fdname = EVENT_FD_NAME(vdev, name);
+return cpr_find_fd(fdname, nr);
+}
+
+static void delete_event_fd(VFIOPCIDevice *vdev, const char *name, int nr)
+{
+g_autofree char *fdname = EVENT_FD_NAME(vdev, name);
+cpr_delete_fd(fdname, nr);
+}
+
  /* Create new or reuse existing eventfd */
  static int vfio_notifier_init(VFIOPCIDevice *vdev, EventNotifier *e,
const char *name, int nr)
  {
-int fd = -1;   /* placeholder until a subsequent patch */
  int ret = 0;
+int fd = load_event_fd(vdev, name, nr);
  
  if (fd >= 0) {

  event_notifier_init_fd(e, fd);
  } else {
  ret = event_notifier_init(e, 0);
+if (!ret) {
+save_event_fd(vdev, name, nr, e);
+}
  }
  return ret;
  }
@@ -72,6 +102,7 @@ static int vfio_notifier_init(VFIOPCIDevice *vdev, 
EventNotifier *e,
  static void vfio_notifier_cleanup(VFIOPCIDevice *vdev, EventNotifier *e,
const char *name, int nr)
  {
+delete_event_fd(vdev, name, nr);
  event_notifier_cleanup(e);
  }
  
@@ -512,6 +543,15 @@ static int vfio_msix_vector_do_use(PCIDevice *pdev, unsigned int nr,

  VFIOMSIVector *vector;
  int ret;
  
+/*

+ * Ignore the callback from msix_set_vector_notifiers during resume.
+ * The necessary subset of these actions is called from vfio_claim_vectors
+ * during post load.
+ */
+if (vdev->vbasedev.reused) {
+return 0;
+}
+
  trace_vfio_msix_vector_do_use(vdev->vbasedev.name, nr);
  
  vector = &vdev->msi_vectors[nr];

@@ -2784,6 +2824,11 @@ static void vfio_register_err_notifier(VFIOPCIDevice 
*vdev)
  fd = event_notifier_get_fd(&vdev->err_notifier);
  qemu_set_fd_handler(fd, vfio_err_notifier_handler, NULL, vdev);
  
+/* Do not alter irq_signaling during vfio_realize for cpr */

+if (vdev->vbasedev.reused) {
+return;
+}
+
  if (vfio_set_irq_signaling(&vdev->vbasedev, VFIO_PCI_ERR_IRQ_INDEX, 0,
 VFIO_IRQ_SET_ACTION_TRIGGER, fd, &err)) {
  error_reportf_err(err, VFIO_MSG_PREFIX, vdev->vbasedev.name);
@@ -2849,6 +2894,12 @@ static void vfio_register_req_notifier(VFIOPCIDevice 
*vdev)
  fd = event_notifier_get_fd(&vdev->req_notifier);
  qemu_set_fd_handler(fd, vfio_req_notifier_handler, NULL, vdev);
  
+/* Do not alter irq_signaling during vfio_realize for cpr */

+if (vdev->vbasedev.reused) {
+vdev->req_enabled = true;
+return;
+}
+
  if (vfio_set_irq_signaling(&vdev->vbasedev, VFIO_PCI_REQ_IRQ_INDEX, 0,
 VFIO_IRQ_SET_ACTION_TRIGGER, fd, &err)) {
  error_reportf_err(err, VFIO_MSG_PREFIX, vdev->vbasedev.name);
@@ -3357,6 +3408,46 @@ static Property vfio_pci_dev_properties[] = {
  DEFINE_PROP_END_OF_LIST(),
  };
  
+static void vfio_claim_vectors(VFIOPCIDevice *vdev, int nr_vectors, bool msix)

+{
+int i, fd;
+bool pending = false;
+PCIDevice *pdev = &vdev->pdev;
+
+vdev->nr_vectors = nr_vectors;
+v

[PATCH] qmp: Add error proofing for query-acpi-ospm-status

2023-02-07 Thread Kunkun Jiang via
The ACPI device may not implement the ospm_status callback. Executing
the qmp command "query-acpi-ospm-status" will then cause a segmentation
fault. Add error proofing and report an error to avoid such serious
consequences.

Signed-off-by: Kunkun Jiang 
---
 monitor/qmp-cmds.c | 7 ++-
 1 file changed, 6 insertions(+), 1 deletion(-)

diff --git a/monitor/qmp-cmds.c b/monitor/qmp-cmds.c
index bf22a8c5a6..7652613635 100644
--- a/monitor/qmp-cmds.c
+++ b/monitor/qmp-cmds.c
@@ -229,7 +229,12 @@ ACPIOSTInfoList *qmp_query_acpi_ospm_status(Error **errp)
 AcpiDeviceIfClass *adevc = ACPI_DEVICE_IF_GET_CLASS(obj);
 AcpiDeviceIf *adev = ACPI_DEVICE_IF(obj);
 
-adevc->ospm_status(adev, &prev);
+if (adevc->ospm_status) {
+adevc->ospm_status(adev, &prev);
+} else {
+error_setg(errp, "command is not supported, "
+ "ACPI device missing ospm_status callback");
+}
 } else {
 error_setg(errp, "command is not supported, missing ACPI device");
 }
-- 
2.33.0
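For context (not part of the patch): the callback guarded above is normally
installed in an ACPI device's class_init, roughly as in the sketch below.
The my_acpi_device_* names are illustrative only; a device that never makes
this assignment leaves the class pointer NULL, which is the case the patch
now reports as an error instead of crashing on.

static void my_acpi_device_ospm_status(AcpiDeviceIf *adev,
                                       ACPIOSTInfoList ***list)
{
    /* append this device's _OST information to *list */
}

static void my_acpi_device_class_init(ObjectClass *klass, void *data)
{
    AcpiDeviceIfClass *adevc = ACPI_DEVICE_IF_CLASS(klass);

    /* Without this assignment, adevc->ospm_status stays NULL. */
    adevc->ospm_status = my_acpi_device_ospm_status;
}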




[PATCH] vfio/migration: Fix incorrect initialization value for parameters in VFIOMigration

2022-07-10 Thread Kunkun Jiang via
The structure VFIOMigration of a VFIODevice is allocated and initialized
in vfio_migration_init(). "device_state" and "vm_running" are initialized
to 0, indicating that the VFIO device is in the _STOP state and the VM is
not running. This initialization value is incorrect. According to the
protocol, the default state of a VFIO device is _RUNNING. And if a VFIO
device is hot-plugged while the VM is running, "vm_running" should be 1.
This patch fixes it.

Fixes: 02a7e71b1e5 (vfio: Add VM state change handler to know state of VM)
Signed-off-by: Kunkun Jiang 
---
 hw/vfio/migration.c | 2 ++
 1 file changed, 2 insertions(+)

diff --git a/hw/vfio/migration.c b/hw/vfio/migration.c
index a6ad1f8945..3de4252111 100644
--- a/hw/vfio/migration.c
+++ b/hw/vfio/migration.c
@@ -806,6 +806,8 @@ static int vfio_migration_init(VFIODevice *vbasedev,
 }
 
 vbasedev->migration = g_new0(VFIOMigration, 1);
+vbasedev->migration->device_state = VFIO_DEVICE_STATE_RUNNING;
+vbasedev->migration->vm_running = runstate_is_running();
 
 ret = vfio_region_setup(obj, vbasedev, &vbasedev->migration->region,
 info->index, "migration");
-- 
2.27.0




[PATCH v3 0/2] vfio: Some fixes about vfio-pci MMIO BARs mapping

2021-10-27 Thread Kunkun Jiang
This series include patches as below:

Patch 1:
- Added support for mmapping sub-page MMIO BAR after live migration to improve
  IO performance for some devices

Patch 2:
- Added a trace point to inform users when a MMIO RAM section cannot be mapped

History:

v2 -> v3:
- Modify commit message [Eric Auger]

v1 -> v2:
- Iterate sub-page BARs in vfio_pci_load_config and try to update them
  [Alex Williamson]

Kunkun Jiang (2):
  vfio/pci: Add support for mmapping sub-page MMIO BARs after live
migration
  vfio/common: Add a trace point when a MMIO RAM section cannot be
mapped

 hw/vfio/common.c |  7 +++
 hw/vfio/pci.c| 19 ++-
 2 files changed, 25 insertions(+), 1 deletion(-)

-- 
2.23.0




[PATCH v3 1/2] vfio/pci: Add support for mmapping sub-page MMIO BARs after live migration

2021-10-27 Thread Kunkun Jiang
We can expand MemoryRegions of sub-page MMIO BARs in
vfio_pci_write_config() to improve IO performance for some
devices. However, the MemoryRegions of the destination VM are
not expanded any more after live migration, because their
addresses have been updated in vmstate_load_state()
(vfio_pci_load_config) and vfio_sub_page_bar_update_mapping()
will not be called.

This may result in poor performance after live migration.
So iterate BARs in vfio_pci_load_config() and try to update
sub-page BARs.

Reported-by: Nianyao Tang 
Reported-by: Qixin Gan 
Signed-off-by: Kunkun Jiang 
---
 hw/vfio/pci.c | 19 ++-
 1 file changed, 18 insertions(+), 1 deletion(-)

diff --git a/hw/vfio/pci.c b/hw/vfio/pci.c
index 5cdf1d4298..7b45353ce2 100644
--- a/hw/vfio/pci.c
+++ b/hw/vfio/pci.c
@@ -2453,7 +2453,12 @@ static int vfio_pci_load_config(VFIODevice *vbasedev, 
QEMUFile *f)
 {
 VFIOPCIDevice *vdev = container_of(vbasedev, VFIOPCIDevice, vbasedev);
 PCIDevice *pdev = &vdev->pdev;
-int ret;
+pcibus_t old_addr[PCI_NUM_REGIONS - 1];
+int bar, ret;
+
+for (bar = 0; bar < PCI_ROM_SLOT; bar++) {
+old_addr[bar] = pdev->io_regions[bar].addr;
+}
 
 ret = vmstate_load_state(f, &vmstate_vfio_pci_config, vdev, 1);
 if (ret) {
@@ -2463,6 +2468,18 @@ static int vfio_pci_load_config(VFIODevice *vbasedev, 
QEMUFile *f)
 vfio_pci_write_config(pdev, PCI_COMMAND,
   pci_get_word(pdev->config + PCI_COMMAND), 2);
 
+for (bar = 0; bar < PCI_ROM_SLOT; bar++) {
+/*
+ * The address may not be changed in some scenarios
+ * (e.g. the VF driver isn't loaded in VM).
+ */
+if (old_addr[bar] != pdev->io_regions[bar].addr &&
+vdev->bars[bar].region.size > 0 &&
+vdev->bars[bar].region.size < qemu_real_host_page_size) {
+vfio_sub_page_bar_update_mapping(pdev, bar);
+}
+}
+
 if (msi_enabled(pdev)) {
 vfio_msi_enable(vdev);
 } else if (msix_enabled(pdev)) {
-- 
2.23.0




[PATCH v3 2/2] vfio/common: Add a trace point when a MMIO RAM section cannot be mapped

2021-10-27 Thread Kunkun Jiang
The MSI-X structures of some devices and other non-MSI-X structures
may be in the same BAR. They may share one host page, especially in
the case of large page granularity, such as 64K.

For example, MSIX-Table size of 82599 NIC is 0x30 and the offset in
Bar 3(size 64KB) is 0x0. vfio_listener_region_add() will be called
to map the remaining range (0x30-0x). If host page size is 64KB,
it will return early at 'int128_ge((int128_make64(iova), llend))'
without any message. Let's add a trace point to inform users like commit
5c08600547c0 ("vfio: Use a trace point when a RAM section cannot be DMA mapped")
did.

Signed-off-by: Kunkun Jiang 
---
 hw/vfio/common.c | 7 +++
 1 file changed, 7 insertions(+)

diff --git a/hw/vfio/common.c b/hw/vfio/common.c
index a784b219e6..dd387b0d39 100644
--- a/hw/vfio/common.c
+++ b/hw/vfio/common.c
@@ -893,6 +893,13 @@ static void vfio_listener_region_add(MemoryListener 
*listener,
 llend = int128_and(llend, int128_exts64(qemu_real_host_page_mask));
 
 if (int128_ge(int128_make64(iova), llend)) {
+if (memory_region_is_ram_device(section->mr)) {
+trace_vfio_listener_region_add_no_dma_map(
+memory_region_name(section->mr),
+section->offset_within_address_space,
+int128_getlo(section->size),
+qemu_real_host_page_size);
+}
 return;
 }
 end = int128_get64(int128_sub(llend, int128_one()));
-- 
2.23.0




Re: [PATCH v2 1/2] vfio/pci: Fix vfio-pci sub-page MMIO BAR mmaping in live migration

2021-10-23 Thread Kunkun Jiang

Hi Eric,

On 2021/10/23 22:26, Eric Auger wrote:

Hi Kunkun,

On 10/22/21 12:01 PM, Kunkun Jiang wrote:

Hi Eric,

On 2021/10/22 0:15, Eric Auger wrote:

Hi Kunkun,

On 9/14/21 3:53 AM, Kunkun Jiang wrote:

We expand MemoryRegions of vfio-pci sub-page MMIO BARs to
vfio_pci_write_config to improve IO performance.

s/to vfio_pci_write_config/ in vfio_pci_write_config()

Thank you for your review. I will correct it in v3.

The MemoryRegions of destination VM will not be expanded
successful in live migration, because their addresses have

s/will not be expanded successful in live migration/are not expanded
anymore after live migration (?) Is that the correct symptom?

You are right. They are not expanded anymore after live migration,
not expanded unsuccessfully. I used the wrong words.

been updated in vmstate_load_state (vfio_pci_load_config).

So iterate BARs in vfio_pci_write_config and try to update
sub-page BARs.

Fixes: c5e2fb3ce4d (vfio: Add save and load functions for VFIO PCI
devices)

is it an actual fix or an optimization?

I recently realized that this is actually an optimization.

The VF driver in the VM uses assembly instructions that operate on two
registers simultaneously, such as stp and ldp. These instructions would
result in non-ISV data aborts, which cannot be handled well at the moment.

If the driver doesn't use such instructions, not expanding
only affects performance.

I will add these to the commit message in the next version.

Reported-by: Nianyao Tang 
Reported-by: Qixin Gan 
Signed-off-by: Kunkun Jiang 
---
   hw/vfio/pci.c | 15 ++-
   1 file changed, 14 insertions(+), 1 deletion(-)

diff --git a/hw/vfio/pci.c b/hw/vfio/pci.c
index e1ea1d8a23..43c7e93153 100644
--- a/hw/vfio/pci.c
+++ b/hw/vfio/pci.c
@@ -2453,7 +2453,12 @@ static int vfio_pci_load_config(VFIODevice
*vbasedev, QEMUFile *f)
   {
   VFIOPCIDevice *vdev = container_of(vbasedev, VFIOPCIDevice,
vbasedev);
   PCIDevice *pdev = &vdev->pdev;
-    int ret;
+    pcibus_t old_addr[PCI_NUM_REGIONS - 1];
+    int bar, ret;
+
+    for (bar = 0; bar < PCI_ROM_SLOT; bar++) {
+    old_addr[bar] = pdev->io_regions[bar].addr;
+    }

what are those values before the vmstate_load_state ie. can't you do the

Are you referring to pdev->io_regions[bar].addr? All of the BAR addresses are
PCI_BAR_UNMAPPED (~(pcibus_t)0) before the vmstate_load_state.

OK, that was my assumption. What I meant is that in that case you always have
old_addr[bar] != pdev->io_regions[bar].addr, right? If that is the case, this
check is not needed and you don't need old_addr at all.
In the original code this was needed since one wanted to call
vfio_sub_page_bar_update_mapping() only for the bar base addresses that
were changed during vfio_pci_write_config().

As far as I know, there is at least one case. If the VF driver is not loaded
(insmod xxx.ko) in the VM, we will have old_addr[bar] == 
pdev->io_regions[bar].addr.
The vfio_sub_page_bar_update_mapping() will be called when 0 < bar size 
< host page size.
But vfio_sub_page_bar_update_mapping() will not change anything in this 
case.


Thanks,
Kunkun Jiang


Thanks

Eric


vfio_sub_page_bar_update_mapping() unconditionally on old_addr[bar] !=
pdev->io_regions[bar].addr

The size of a BAR is a power of 2. A BAR whose size is greater than the host
page size doesn't need to be expanded.

Can you explain more? Maybe I misunderstood you.

Thanks,
Kunkun Jiang

     ret = vmstate_load_state(f, &vmstate_vfio_pci_config, vdev, 1);
   if (ret) {
@@ -2463,6 +2468,14 @@ static int vfio_pci_load_config(VFIODevice
*vbasedev, QEMUFile *f)
   vfio_pci_write_config(pdev, PCI_COMMAND,
     pci_get_word(pdev->config +
PCI_COMMAND), 2);
   +    for (bar = 0; bar < PCI_ROM_SLOT; bar++) {
+    if (old_addr[bar] != pdev->io_regions[bar].addr &&
+    vdev->bars[bar].region.size > 0 &&
+    vdev->bars[bar].region.size < qemu_real_host_page_size) {
+    vfio_sub_page_bar_update_mapping(pdev, bar);
+    }
+    }
+
   if (msi_enabled(pdev)) {
   vfio_msi_enable(vdev);
   } else if (msix_enabled(pdev)) {

Thanks

Eric

.

.




Re: [PATCH v2 2/2] vfio/common: Add trace point when a MMIO RAM section less than PAGE_SIZE

2021-10-22 Thread Kunkun Jiang

Hi Eric,

On 2021/10/22 1:02, Eric Auger wrote:

Hi Kunkun,

On 9/14/21 3:53 AM, Kunkun Jiang wrote:

The MSI-X structures of some devices and other non-MSI-X structures
are in the same BAR. They may share one host page, especially in the

may be in the same bar?

You are right. So embarrassing.

case of large page granularity, such as 64K.

For example, MSIX-Table size of 82599 NIC is 0x30 and the offset in
Bar 3(size 64KB) is 0x0. If host page size is 64KB.

remove the '.'

s/./, ?

In that case wouldn't the symptom be the same with a 4kB page?

The host page size is different, so the iova is different.

iova = REAL_HOST_PAGE_ALIGN(section->offset_within_address_space);

If the host page size is 4KB, the iova will be smaller than llend.
If the host page size is 64KB, the iova will be equal to llend.
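A worked example of that arithmetic, standalone C with the page sizes assumed
for illustration; ALIGN_UP/ALIGN_DOWN stand in for REAL_HOST_PAGE_ALIGN and
the qemu_real_host_page_mask rounding of llend, and the values mirror the
82599 case from the commit message.

#include <assert.h>
#include <stdint.h>

#define ALIGN_UP(x, sz)    (((x) + (sz) - 1) & ~((uint64_t)(sz) - 1))
#define ALIGN_DOWN(x, sz)  ((x) & ~((uint64_t)(sz) - 1))

int main(void)
{
    /* Remaining (non-MSI-X) part of the 64KB BAR 3: 0x30 .. 0xffff */
    uint64_t start = 0x30, end = 0x10000;

    /* 4KB host pages: iova (0x1000) < llend (0x10000), so the section
     * is mapped (minus the first page). */
    assert(ALIGN_UP(start, 0x1000) < ALIGN_DOWN(end, 0x1000));

    /* 64KB host pages: iova (0x10000) == llend (0x10000), so
     * vfio_listener_region_add() returns early, silently. */
    assert(ALIGN_UP(start, 0x10000) >= ALIGN_DOWN(end, 0x10000));
    return 0;
}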


vfio_listener_region_add() will be called to map the remaining range
(0x30-0x). And it will return early at
'int128_ge((int128_make64(iova), llend))' and hasn't any message.

s/and hasn't any message /without any message


Ok, I will correct this in next version.


Let's add a trace point to informed users like commit 5c08600547c0

s/informed/inform

Same for this.

("vfio: Use a trace point when a RAM section cannot be DMA mapped")
did.

Signed-off-by: Kunkun Jiang 
---
  hw/vfio/common.c | 7 +++
  1 file changed, 7 insertions(+)

diff --git a/hw/vfio/common.c b/hw/vfio/common.c
index 8728d4d5c2..2fc6213c0f 100644
--- a/hw/vfio/common.c
+++ b/hw/vfio/common.c
@@ -892,6 +892,13 @@ static void vfio_listener_region_add(MemoryListener 
*listener,
  llend = int128_and(llend, int128_exts64(qemu_real_host_page_mask));
  
  if (int128_ge(int128_make64(iova), llend)) {

+if (memory_region_is_ram_device(section->mr)) {
+trace_vfio_listener_region_add_no_dma_map(
+memory_region_name(section->mr),
+section->offset_within_address_space,
+int128_getlo(section->size),
+qemu_real_host_page_size);
+}
  return;
  }
  end = int128_get64(int128_sub(llend, int128_one()));

By the way, if this patch is accepted, the following code in
vfio_listener_region_add() may need to be deleted:

    if (memory_region_is_ram_device(section->mr)) {
    hwaddr pgmask = (1ULL << ctz64(hostwin->iova_pgsizes)) - 1;

    if ((iova & pgmask) || (int128_get64(llsize) & pgmask)) {
    trace_vfio_listener_region_add_no_dma_map(
    memory_region_name(section->mr),
    section->offset_within_address_space,
    int128_getlo(section->size),
    pgmask + 1);
    return;
    }
    }

What do you think?

Thanks,
Kunkun Jiang

Thanks

Eric

.




Re: [PATCH v2 1/2] vfio/pci: Fix vfio-pci sub-page MMIO BAR mmaping in live migration

2021-10-22 Thread Kunkun Jiang

Hi Eric,

On 2021/10/22 0:15, Eric Auger wrote:

Hi Kunkun,

On 9/14/21 3:53 AM, Kunkun Jiang wrote:

We expand MemoryRegions of vfio-pci sub-page MMIO BARs to
vfio_pci_write_config to improve IO performance.

s/to vfio_pci_write_config/ in vfio_pci_write_config()

Thank you for your review. I will correct it in v3.

The MemoryRegions of destination VM will not be expanded
successful in live migration, because their addresses have

s/will not be expanded successful in live migration/are not expanded
anymore after live migration (?) Is that the correct symptom?

You are right. They are not expanded anymore after live migration,
not expanded unsuccessfully. I used the wrong words.

been updated in vmstate_load_state (vfio_pci_load_config).

So iterate BARs in vfio_pci_write_config and try to update
sub-page BARs.

Fixes: c5e2fb3ce4d (vfio: Add save and load functions for VFIO PCI devices)

is it an actual fix or an optimization?

I recently realized that this is actually an optimization.

The VF driver in the VM uses assembly instructions that operate on two
registers simultaneously, such as stp and ldp. These instructions would
result in non-ISV data aborts, which cannot be handled well at the moment.

If the driver doesn't use such instructions, not expanding
only affects performance.

I will add these to the commit message in the next version.

Reported-by: Nianyao Tang 
Reported-by: Qixin Gan 
Signed-off-by: Kunkun Jiang 
---
  hw/vfio/pci.c | 15 ++-
  1 file changed, 14 insertions(+), 1 deletion(-)

diff --git a/hw/vfio/pci.c b/hw/vfio/pci.c
index e1ea1d8a23..43c7e93153 100644
--- a/hw/vfio/pci.c
+++ b/hw/vfio/pci.c
@@ -2453,7 +2453,12 @@ static int vfio_pci_load_config(VFIODevice *vbasedev, 
QEMUFile *f)
  {
  VFIOPCIDevice *vdev = container_of(vbasedev, VFIOPCIDevice, vbasedev);
   PCIDevice *pdev = &vdev->pdev;
-int ret;
+pcibus_t old_addr[PCI_NUM_REGIONS - 1];
+int bar, ret;
+
+for (bar = 0; bar < PCI_ROM_SLOT; bar++) {
+old_addr[bar] = pdev->io_regions[bar].addr;
+}

what are those values before the vmstate_load_state ie. can't you do the

Are you referring to pdev->io_regions[bar].addr? All of the BAR addresses are
PCI_BAR_UNMAPPED (~(pcibus_t)0) before the vmstate_load_state.

vfio_sub_page_bar_update_mapping() unconditionally on old_addr[bar] !=
pdev->io_regions[bar].addr

The size of a BAR is a power of 2. A BAR whose size is greater than the host
page size doesn't need to be expanded.

Can you explain more? Maybe I misunderstood you.

Thanks,
Kunkun Jiang
  
   ret = vmstate_load_state(f, &vmstate_vfio_pci_config, vdev, 1);

  if (ret) {
@@ -2463,6 +2468,14 @@ static int vfio_pci_load_config(VFIODevice *vbasedev, 
QEMUFile *f)
  vfio_pci_write_config(pdev, PCI_COMMAND,
pci_get_word(pdev->config + PCI_COMMAND), 2);
  
+for (bar = 0; bar < PCI_ROM_SLOT; bar++) {

+if (old_addr[bar] != pdev->io_regions[bar].addr &&
+vdev->bars[bar].region.size > 0 &&
+vdev->bars[bar].region.size < qemu_real_host_page_size) {
+vfio_sub_page_bar_update_mapping(pdev, bar);
+}
+}
+
  if (msi_enabled(pdev)) {
  vfio_msi_enable(vdev);
  } else if (msix_enabled(pdev)) {

Thanks

Eric

.




Re: [RFC v9 15/29] vfio: Set up nested stage mappings

2021-10-07 Thread Kunkun Jiang

Hi Eric,

On 2021/10/8 0:58, Eric Auger wrote:

Hi Kunkun,

On 4/14/21 3:45 AM, Kunkun Jiang wrote:

On 2021/4/13 20:57, Auger Eric wrote:

Hi Kunkun,

On 4/13/21 2:10 PM, Kunkun Jiang wrote:

Hi Eric,

On 2021/4/11 20:08, Eric Auger wrote:

In nested mode, legacy vfio_iommu_map_notify cannot be used as
there is no "caching" mode and we do not trap on map.

On Intel, vfio_iommu_map_notify was used to DMA map the RAM
through the host single stage.

With nested mode, we need to setup the stage 2 and the stage 1
separately. This patch introduces a prereg_listener to setup
the stage 2 mapping.

The stage 1 mapping, owned by the guest, is passed to the host
when the guest invalidates the stage 1 configuration, through
a dedicated PCIPASIDOps callback. Guest IOTLB invalidations
are cascaded downto the host through another IOMMU MR UNMAP
notifier.

Signed-off-by: Eric Auger 

---

v7 -> v8:
- properly handle new IOMMUTLBEntry fields and especially
     propagate DOMAIN and PASID based invalidations

v6 -> v7:
- remove PASID based invalidation

v5 -> v6:
- add error_report_err()
- remove the abort in case of nested stage case

v4 -> v5:
- use VFIO_IOMMU_SET_PASID_TABLE
- use PCIPASIDOps for config notification

v3 -> v4:
- use iommu_inv_pasid_info for ASID invalidation

v2 -> v3:
- use VFIO_IOMMU_ATTACH_PASID_TABLE
- new user API
- handle leaf

v1 -> v2:
- adapt to uapi changes
- pass the asid
- pass IOMMU_NOTIFIER_S1_CFG when initializing the config notifier
---
    hw/vfio/common.c | 139
+--
    hw/vfio/pci.c    |  21 +++
    hw/vfio/trace-events |   2 +
    3 files changed, 157 insertions(+), 5 deletions(-)

diff --git a/hw/vfio/common.c b/hw/vfio/common.c
index 0cd7ef2139..e369d451e7 100644
--- a/hw/vfio/common.c
+++ b/hw/vfio/common.c
@@ -595,6 +595,73 @@ static bool vfio_get_xlat_addr(IOMMUTLBEntry
*iotlb, void **vaddr,
    return true;
    }
    +/* Propagate a guest IOTLB invalidation to the host (nested
mode) */
+static void vfio_iommu_unmap_notify(IOMMUNotifier *n, IOMMUTLBEntry
*iotlb)
+{
+    VFIOGuestIOMMU *giommu = container_of(n, VFIOGuestIOMMU, n);
+    struct vfio_iommu_type1_cache_invalidate ustruct = {};
+    VFIOContainer *container = giommu->container;
+    int ret;
+
+    assert(iotlb->perm == IOMMU_NONE);
+
+    ustruct.argsz = sizeof(ustruct);
+    ustruct.flags = 0;
+    ustruct.info.argsz = sizeof(struct iommu_cache_invalidate_info);
+    ustruct.info.version = IOMMU_CACHE_INVALIDATE_INFO_VERSION_1;
+    ustruct.info.cache = IOMMU_CACHE_INV_TYPE_IOTLB;
+
+    switch (iotlb->granularity) {
+    case IOMMU_INV_GRAN_DOMAIN:
+    ustruct.info.granularity = IOMMU_INV_GRANU_DOMAIN;
+    break;
+    case IOMMU_INV_GRAN_PASID:
+    {
+    struct iommu_inv_pasid_info *pasid_info;
+    int archid = -1;
+
+    pasid_info = &ustruct.info.granu.pasid_info;
+    ustruct.info.granularity = IOMMU_INV_GRANU_PASID;
+    if (iotlb->flags & IOMMU_INV_FLAGS_ARCHID) {
+    pasid_info->flags |= IOMMU_INV_ADDR_FLAGS_ARCHID;
+    archid = iotlb->arch_id;
+    }
+    pasid_info->archid = archid;
+    trace_vfio_iommu_asid_inv_iotlb(archid);
+    break;
+    }
+    case IOMMU_INV_GRAN_ADDR:
+    {
+    hwaddr start = iotlb->iova + giommu->iommu_offset;
+    struct iommu_inv_addr_info *addr_info;
+    size_t size = iotlb->addr_mask + 1;
+    int archid = -1;
+
+    addr_info = &ustruct.info.granu.addr_info;
+    ustruct.info.granularity = IOMMU_INV_GRANU_ADDR;
+    if (iotlb->leaf) {
+    addr_info->flags |= IOMMU_INV_ADDR_FLAGS_LEAF;
+    }
+    if (iotlb->flags & IOMMU_INV_FLAGS_ARCHID) {
+    addr_info->flags |= IOMMU_INV_ADDR_FLAGS_ARCHID;
+    archid = iotlb->arch_id;
+    }
+    addr_info->archid = archid;
+    addr_info->addr = start;
+    addr_info->granule_size = size;
+    addr_info->nb_granules = 1;
+    trace_vfio_iommu_addr_inv_iotlb(archid, start, size,
+    1, iotlb->leaf);
+    break;
+    }

Should we pass a size to the host kernel here, even if the vSMMU doesn't
support RIL or the guest kernel doesn't use RIL?

It will cause a TLBI issue in this scenario: the guest kernel issues a
TLBI cmd without "range" (tg = 0) to invalidate a 2M huge page. Then qemu
passes the iova and size (4K) to the host kernel. Finally, the host kernel
issues a TLBI cmd with "range" (4K), which cannot invalidate the TLB entry
of the 2M huge page. (pSMMU supports RIL)

In that case the guest will loop over all 4K images belonging to the 2M
huge page and invalidate each of them. This should turn into qemu
notifications for each 4kB page, no? This is totally inefficient, hence

The guest will not loop over all 4K images belonging to the 2M huge page.
The iommu_iotlb_gather->pgsize will be 2M, if a page is 2M huge page. The
gather->pgsize wil

Re: [PATCH v2 0/2] vfio: Some fixes about vfio-pci MMIO RAM mapping

2021-10-07 Thread Kunkun Jiang

Kindly ping,

Hi all,

Will this patch be picked up soon, or is there any other advice?

Thanks,
Kunkun Jiang

On 2021/9/14 9:53, Kunkun Jiang wrote:

This series include patches as below:

Patch 1:
- vfio/pci: Fix vfio-pci sub-page MMIO BAR mmaping in live migration

Patch 2:
- Added a trace point to inform users when a MMIO RAM section is less than
PAGE_SIZE

History:

v1 -> v2:
- Iterate sub-page BARs in vfio_pci_load_config and try to update them
  [Alex Williamson]

Kunkun Jiang (2):
   vfio/pci: Fix vfio-pci sub-page MMIO BAR mmaping in live migration
   vfio/common: Add trace point when a MMIO RAM section less than
 PAGE_SIZE

  hw/vfio/common.c |  7 +++
  hw/vfio/pci.c| 15 ++-
  2 files changed, 21 insertions(+), 1 deletion(-)






Re: [question] VFIO Device Migration: The vCPU may be paused during vfio device DMA in iommu nested stage mode && vSVA

2021-09-27 Thread Kunkun Jiang

Hi Kevin:

On 2021/9/24 14:47, Tian, Kevin wrote:

From: Kunkun Jiang 
Sent: Friday, September 24, 2021 2:19 PM

Hi all,

I encountered a problem in vfio device migration testing. The
vCPU may be paused during vfio-pci DMA in iommu nested
stage mode && vSVA. This may lead to migration failure and
other problems related to device hardware and driver
implementation.

It may be a bit early to discuss this issue, after all, the iommu
nested stage mode and vSVA are not yet mature. But judging
from the current implementation, we will definitely encounter
this problem in the future.

Yes, this is a known limitation to support migration with vSVA.


This is the current process of vSVA handling a translation fault
in iommu nested stage mode (take SMMU as an example):

guest os              4. handle translation fault    5. send CMD_RESUME to vSMMU

qemu (vfio/vsmmu)     3. inject fault into guest os  6. deliver response to host os

host os (vfio/smmu)   2. notify the qemu             7. send CMD_RESUME to SMMU

SMMU                  1. address translation fault   8. retry or terminate

The order is 1 ---> 8.

Currently, qemu may pause the vCPU at any step. It is possible to
pause the vCPU at steps 1-5, that is, in the middle of a DMA. This may
lead to migration failure and other problems related to device hardware
and driver implementation. For example, the device status
cannot be changed from RUNNING && SAVING to SAVING,
because the device DMA is not over.

As far as I can see, the vCPU should not be paused during a device
IO process, such as DMA. However, currently live migration
does not pay attention to the state of the vfio device when pausing
the vCPU. And if the vCPU is not paused, the vfio device is
always running. This looks like a *deadlock*.

Basically this requires:

1) stopping vCPU after stopping device (could selectively enable
this sequence for vSVA);

How do we tell whether vSVA is in use?
In fact, as long as we are in iommu nested stage mode, there will
be such a problem, whether it is vSVA or no-vSVA. In no-vSVA mode,
a fault can also be generated by modifying the guest device driver.


2) when stopping device, the driver should block new requests
from vCPU (queued to a pending list) and then drain all in-fly
requests including faults;
 * to block this further requires switching from fast-path to
slow trap-emulation path for the cmd portal before stopping
the device;

3) save the pending requests in the vm image and replay them
after the vm is resumed;
 * finally disable blocking by switching back to the fast-path for
the cmd portal;

Has any related patch been sent out and discussed? I might have
overlooked it.

We may be able to discuss and finalize a specification for this
problem.

Thanks,
Kunkun Jiang

Do you have any ideas to solve this problem?
Looking forward to your reply.


We verified that the above flow works in our internal POC.

Thanks
Kevin






[question] VFIO Device Migration: The vCPU may be paused during vfio device DMA in iommu nested stage mode && vSVA

2021-09-24 Thread Kunkun Jiang

Hi all,

I encountered a problem in vfio device migration testing. The
vCPU may be paused during vfio-pci DMA in iommu nested
stage mode && vSVA. This may lead to migration failure and
other problems related to device hardware and driver
implementation.

It may be a bit early to discuss this issue, after all, the iommu
nested stage mode and vSVA are not yet mature. But judging
from the current implementation, we will definitely encounter
this problem in the future.

This is the current process of vSVA handling a translation fault
in iommu nested stage mode (take SMMU as an example):

guest os              4. handle translation fault    5. send CMD_RESUME to vSMMU

qemu (vfio/vsmmu)     3. inject fault into guest os  6. deliver response to host os

host os (vfio/smmu)   2. notify the qemu             7. send CMD_RESUME to SMMU

SMMU                  1. address translation fault   8. retry or terminate

The order is 1 ---> 8.

Currently, qemu may pause the vCPU at any step. It is possible to
pause the vCPU at steps 1-5, that is, in the middle of a DMA. This may
lead to migration failure and other problems related to device hardware
and driver implementation. For example, the device status
cannot be changed from RUNNING && SAVING to SAVING,
because the device DMA is not over.

As far as I can see, the vCPU should not be paused during a device
IO process, such as DMA. However, currently live migration
does not pay attention to the state of the vfio device when pausing
the vCPU. And if the vCPU is not paused, the vfio device is
always running. This looks like a *deadlock*.

Do you have any ideas to solve this problem?
Looking forward to your reply.

Thanks,
Kunkun Jiang






[PATCH v2 2/2] vfio/common: Add trace point when a MMIO RAM section less than PAGE_SIZE

2021-09-13 Thread Kunkun Jiang
The MSI-X structures of some devices and other non-MSI-X structures
are in the same BAR. They may share one host page, especially in the
case of large page granularity, such as 64K.

For example, MSIX-Table size of 82599 NIC is 0x30 and the offset in
Bar 3(size 64KB) is 0x0. If host page size is 64KB.
vfio_listener_region_add() will be called to map the remaining range
(0x30-0x). And it will return early at
'int128_ge((int128_make64(iova), llend))' and hasn't any message.
Let's add a trace point to informed users like commit 5c08600547c0
("vfio: Use a trace point when a RAM section cannot be DMA mapped")
did.

Signed-off-by: Kunkun Jiang 
---
 hw/vfio/common.c | 7 +++
 1 file changed, 7 insertions(+)

diff --git a/hw/vfio/common.c b/hw/vfio/common.c
index 8728d4d5c2..2fc6213c0f 100644
--- a/hw/vfio/common.c
+++ b/hw/vfio/common.c
@@ -892,6 +892,13 @@ static void vfio_listener_region_add(MemoryListener 
*listener,
 llend = int128_and(llend, int128_exts64(qemu_real_host_page_mask));
 
 if (int128_ge(int128_make64(iova), llend)) {
+if (memory_region_is_ram_device(section->mr)) {
+trace_vfio_listener_region_add_no_dma_map(
+memory_region_name(section->mr),
+section->offset_within_address_space,
+int128_getlo(section->size),
+qemu_real_host_page_size);
+}
 return;
 }
 end = int128_get64(int128_sub(llend, int128_one()));
-- 
2.23.0




[PATCH v2 1/2] vfio/pci: Fix vfio-pci sub-page MMIO BAR mmaping in live migration

2021-09-13 Thread Kunkun Jiang
We expand MemoryRegions of vfio-pci sub-page MMIO BARs to
vfio_pci_write_config to improve IO performance.
The MemoryRegions of destination VM will not be expanded
successful in live migration, because their addresses have
been updated in vmstate_load_state (vfio_pci_load_config).

So iterate BARs in vfio_pci_write_config and try to update
sub-page BARs.

Fixes: c5e2fb3ce4d (vfio: Add save and load functions for VFIO PCI devices)
Reported-by: Nianyao Tang 
Reported-by: Qixin Gan 
Signed-off-by: Kunkun Jiang 
---
 hw/vfio/pci.c | 15 ++-
 1 file changed, 14 insertions(+), 1 deletion(-)

diff --git a/hw/vfio/pci.c b/hw/vfio/pci.c
index e1ea1d8a23..43c7e93153 100644
--- a/hw/vfio/pci.c
+++ b/hw/vfio/pci.c
@@ -2453,7 +2453,12 @@ static int vfio_pci_load_config(VFIODevice *vbasedev, 
QEMUFile *f)
 {
 VFIOPCIDevice *vdev = container_of(vbasedev, VFIOPCIDevice, vbasedev);
 PCIDevice *pdev = &vdev->pdev;
-int ret;
+pcibus_t old_addr[PCI_NUM_REGIONS - 1];
+int bar, ret;
+
+for (bar = 0; bar < PCI_ROM_SLOT; bar++) {
+old_addr[bar] = pdev->io_regions[bar].addr;
+}
 
 ret = vmstate_load_state(f, &vmstate_vfio_pci_config, vdev, 1);
 if (ret) {
@@ -2463,6 +2468,14 @@ static int vfio_pci_load_config(VFIODevice *vbasedev, 
QEMUFile *f)
 vfio_pci_write_config(pdev, PCI_COMMAND,
   pci_get_word(pdev->config + PCI_COMMAND), 2);
 
+for (bar = 0; bar < PCI_ROM_SLOT; bar++) {
+if (old_addr[bar] != pdev->io_regions[bar].addr &&
+vdev->bars[bar].region.size > 0 &&
+vdev->bars[bar].region.size < qemu_real_host_page_size) {
+vfio_sub_page_bar_update_mapping(pdev, bar);
+}
+}
+
 if (msi_enabled(pdev)) {
 vfio_msi_enable(vdev);
 } else if (msix_enabled(pdev)) {
-- 
2.23.0




[PATCH v2 0/2] vfio: Some fixes about vfio-pci MMIO RAM mapping

2021-09-13 Thread Kunkun Jiang
This series include patches as below:

Patch 1:
- vfio/pci: Fix vfio-pci sub-page MMIO BAR mmaping in live migration

Patch 2:
- Added a trace point to inform users when a MMIO RAM section is less than
PAGE_SIZE

History:

v1 -> v2:
- Iterate sub-page BARs in vfio_pci_load_config and try to update them
  [Alex Williamson]

Kunkun Jiang (2):
  vfio/pci: Fix vfio-pci sub-page MMIO BAR mmaping in live migration
  vfio/common: Add trace point when a MMIO RAM section less than
PAGE_SIZE

 hw/vfio/common.c |  7 +++
 hw/vfio/pci.c| 15 ++-
 2 files changed, 21 insertions(+), 1 deletion(-)

-- 
2.23.0




Re: [PATCH 1/2] vfio/pci: Fix vfio-pci sub-page MMIO BAR mmaping in live migration

2021-09-12 Thread Kunkun Jiang

On 2021/9/11 6:24, Alex Williamson wrote:

On Fri, 10 Sep 2021 16:33:12 +0800
Kunkun Jiang  wrote:


Hi Alex,

On 2021/9/9 4:45, Alex Williamson wrote:

On Fri, 3 Sep 2021 17:36:10 +0800
Kunkun Jiang  wrote:
  

We expand MemoryRegions of vfio-pci sub-page MMIO BARs to
vfio_pci_write_config to improve IO performance.
The MemoryRegions of destination VM will not be expanded
successful in live migration, because their addresses have
been updated in vmstate_load_state (vfio_pci_load_config).

What's the call path through vfio_pci_write_config() that you're
relying on to get triggered to enable this and why wouldn't we just
walk all sub-page BARs in vfio_pci_load_config() to resolve the issue
then?  It's my understanding that we do this update in write-config
because it's required that the VM sizes the BAR before using it, which
is not the case when we resume from migration.  Thanks,

Let's take an example:

AArch64
host page granularity: 64KB
PCI device: *Bar2 size 32KB* [mem 0x800020-0x800020 64bit pref]

When enable Command register bit 1(Memory Space), the code flow is
as follows:

vfio_pci_write_config (addr: 4 val: 2 len: 2)
      // record the old address of each bar, 0x
      old_addr[bar] = pdev->io_regions[bar].addr;
      pci_default_write_config
      pci_update_mappings
      new_addr = pci_bar_address    // 0x800020
      r->addr = new_addr;
      memory_region_add_subregion_overlap
      ...
*vfio_listener_region_add*
      alignment check of the ram section address and size
fail, return
*kvm_region_add*
      kvm_set_phys_mem
      alignment check of the ram section address and
size fail, return

      // old_addr[bar] != pdev->io_regions[bar].addr &&
      // 0 < vdev->bars[bar].region.size < qemu_real_host_page_size
      vfio_sub_page_bar_update_mapping
*bar size = qemu_real_host_page_size*
      ...
      vfio_listener_region_add
      map success
      kvm_region_add
      kvm_set_phys_mem
      map success

In live migration, only pci config data is sent to the destination VM.
Therefore, we need to update the bar's size before the destination VM
uses it.

In vfio_pci_load_config, the code flow is as follows:

vfio_pci_load_config
      vmstate_load_state
      *get_pci_config_device*
      pci_update_mappings
      ...
      // bar's addr is updated(0x800020), but bar's size
is still 32KB, so map failed
      vfio_pci_write_config
      // bar's addr will not be changed, so
vfio_sub_page_bar_update_mapping won't be called

My idea is that removing the check 'old_addr[bar] !=
pdev->io_regions[bar].addr' doesn't
affect the previous process. There's also a bar size check. In
vfio_sub_page_bar_update_mapping,
it will check if bar is mapped and page aligned.
1) If bar's addr is 0x, it will not pass the
vfio_sub_page_bar_update_mapping check.
2) If bar's size has been updated, it will not pass the bar size check
in vfio_pci_write_config.

The bar size check in vfio_pci_write_config() only tests if the vfio
region size is > 0 and < qemu_real_host_page_size; given that we resize it
(bars[bar].region.size == qemu_real_host_page_size) once we setup
the sub-page support, I'm not convinced that's true.

So yes, sub-page-update can reject invalid addresses and we already
rely on it to do so, but the code being removed avoids that redundant
writes to the BAR won't trigger redundant MemoryRegion manipulation.
Maybe those are harmless, but that's not your argument for allowing it.

OTOH, why wouldn't vfio_pci_load_config() iterate sub-page BARs and try
to update them at that time?  Thanks,

Like this? I've tested it, and it's okay.

vfio_pci_load_config
{
 for (bar = 0; bar < PCI_ROM_SLOT; bar++) {
     if (old_addr[bar] != pdev->io_regions[bar].addr &&
 vdev->bars[bar].region.size > 0 &&
 vdev->bars[bar].region.size < qemu_real_host_page_size) {
 vfio_sub_page_bar_update_mapping(pdev, bar);
 }
 }
}


Thanks,
Kunkun Jiang


Alex


 

Remove the restriction on base address change in
vfio_pci_write_config for correct mmapping sub-page MMIO
BARs. According to my analysis, the remaining parameter
verification is enough.

Fixes: c5e2fb3ce4d (vfio: Add save and load functions for VFIO PCI devices)
Reported-by: Nianyao Tang 
Reported-by: Qixin Gan 
Signed-off-by: Kunkun Jiang 
---
   hw/vfio/pci.c | 8 +---
   1 file changed, 1 insertion(+), 7 deletions(-)

diff --git a/hw/vfio/pci.c b/hw/vfio/pci.c
index e1ea1d8a23..891b211ddf 100644
--- a/hw/vfio/pci.c
+++ b/hw/vfio/pci.c
@@ -1189,18 +1189,12 @@ void vfio_pci_write_config(PCIDevice *pdev,
   }
   } else if (ranges_overlap(addr, len, PCI_BASE_ADDRESS_0, 24) ||
   range_covers_byte(addr, len, PCI_COMMAND)) {
-pcibus_t old_ad

Re: [PATCH 2/2] vfio/common: Add trace point when a MMIO RAM section less than PAGE_SIZE

2021-09-10 Thread Kunkun Jiang

Hi Alex,

On 2021/9/9 4:45, Alex Williamson wrote:

On Fri, 3 Sep 2021 17:36:11 +0800
Kunkun Jiang  wrote:


The MSI-X structures of some devices and other non-MSI-X structures
are in the same BAR. They may share one host page, especially in the
case of large page granularity, suck as 64K.

s/suck/such/

ok

For example, MSIX-Table size of 82599 NIC is 0x30 and the offset in
Bar 3(size 64KB) is 0x0. If host page size is 64KB.
vfio_listenerregion_add() will be called to map the remaining range

s/vfio_listenerregion_add/vfio_listener_region_add/

ok

(0x30-0x). And it will return early at
'int128_ge((int128_make64(iova), llend))' and hasn't any message.
Let's add a trace point to informed users like 5c08600547c did.

Please use the following syntax for referencing previous commits
(12-char sha1 is standard):

   commit 5c08600547c0 ("vfio: Use a trace point when a RAM section
   cannot be DMA mapped")

Thank you for your reply. I will correct it in V2.

Thanks,
Kunkun Jiang

Thanks,
Alex


Signed-off-by: Kunkun Jiang 
---
  hw/vfio/common.c | 7 +++
  1 file changed, 7 insertions(+)

diff --git a/hw/vfio/common.c b/hw/vfio/common.c
index 8728d4d5c2..2fc6213c0f 100644
--- a/hw/vfio/common.c
+++ b/hw/vfio/common.c
@@ -892,6 +892,13 @@ static void vfio_listener_region_add(MemoryListener 
*listener,
  llend = int128_and(llend, int128_exts64(qemu_real_host_page_mask));
  
  if (int128_ge(int128_make64(iova), llend)) {

+if (memory_region_is_ram_device(section->mr)) {
+trace_vfio_listener_region_add_no_dma_map(
+memory_region_name(section->mr),
+section->offset_within_address_space,
+int128_getlo(section->size),
+qemu_real_host_page_size);
+}
  return;
  }
  end = int128_get64(int128_sub(llend, int128_one()));

.






Re: [PATCH 1/2] vfio/pci: Fix vfio-pci sub-page MMIO BAR mmaping in live migration

2021-09-10 Thread Kunkun Jiang

Hi Alex,

On 2021/9/9 4:45, Alex Williamson wrote:

On Fri, 3 Sep 2021 17:36:10 +0800
Kunkun Jiang  wrote:


We expand MemoryRegions of vfio-pci sub-page MMIO BARs to
vfio_pci_write_config to improve IO performance.
The MemoryRegions of destination VM will not be expanded
successful in live migration, because their addresses have
been updated in vmstate_load_state (vfio_pci_load_config).

What's the call path through vfio_pci_write_config() that you're
relying on to get triggered to enable this and why wouldn't we just
walk all sub-page BARs in vfio_pci_load_config() to resolve the issue
then?  It's my understanding that we do this update in write-config
because it's required that the VM sizes the BAR before using it, which
is not the case when we resume from migration.  Thanks,

Let's take an example:

AArch64
host page granularity: 64KB
PCI device: *Bar2 size 32KB* [mem 0x800020-0x800020 64bit pref]

When enable Command register bit 1(Memory Space), the code flow is
as follows:

vfio_pci_write_config (addr: 4 val: 2 len: 2)
    // record the old address of each bar, 0x
    old_addr[bar] = pdev->io_regions[bar].addr;
    pci_default_write_config
    pci_update_mappings
    new_addr = pci_bar_address    // 0x800020
    r->addr = new_addr;
    memory_region_add_subregion_overlap
    ...
*vfio_listener_region_add*
    alignment check of the ram section address and size 
fail, return

*kvm_region_add*
    kvm_set_phys_mem
    alignment check of the ram section address and 
size fail, return


    // old_addr[bar] != pdev->io_regions[bar].addr &&
    // 0 < vdev->bars[bar].region.size < qemu_real_host_page_size
    vfio_sub_page_bar_update_mapping
*bar size = qemu_real_host_page_size*
    ...
    vfio_listener_region_add
    map success
    kvm_region_add
    kvm_set_phys_mem
    map success

In live migration, only pci config data is sent to the destination VM.
Therefore, we need to update the bar's size before the destination VM
uses it.

In vfio_pci_load_config, the code flow is as follows:

vfio_pci_load_config
    vmstate_load_state
    *get_pci_config_device*
    pci_update_mappings
    ...
    // bar's addr is updated(0x800020), but bar's size 
is still 32KB, so map failed

    vfio_pci_write_config
    // bar's addr will not be changed, so 
vfio_sub_page_bar_update_mapping won't be called


My idea is that removing the check 'old_addr[bar] != 
pdev->io_regions[bar].addr' doesn't
affect the previous process. There's also a bar size check. In 
vfio_sub_page_bar_update_mapping,

it will check if bar is mapped and page aligned.
1) If bar's addr is 0x, it will not pass the 
vfio_sub_page_bar_update_mapping check.
2) If bar's size has been updated, it will not pass the bar size check 
in vfio_pci_write_config.


Thanks,
Kunkun Jiang


Alex
  

Remove the restriction on base address change in
vfio_pci_write_config for correct mmapping sub-page MMIO
BARs. According to my analysis, the remaining parameter
verification is enough.

Fixes: c5e2fb3ce4d (vfio: Add save and load functions for VFIO PCI devices)
Reported-by: Nianyao Tang 
Reported-by: Qixin Gan 
Signed-off-by: Kunkun Jiang 
---
  hw/vfio/pci.c | 8 +---
  1 file changed, 1 insertion(+), 7 deletions(-)

diff --git a/hw/vfio/pci.c b/hw/vfio/pci.c
index e1ea1d8a23..891b211ddf 100644
--- a/hw/vfio/pci.c
+++ b/hw/vfio/pci.c
@@ -1189,18 +1189,12 @@ void vfio_pci_write_config(PCIDevice *pdev,
  }
  } else if (ranges_overlap(addr, len, PCI_BASE_ADDRESS_0, 24) ||
  range_covers_byte(addr, len, PCI_COMMAND)) {
-pcibus_t old_addr[PCI_NUM_REGIONS - 1];
  int bar;
  
-for (bar = 0; bar < PCI_ROM_SLOT; bar++) {

-old_addr[bar] = pdev->io_regions[bar].addr;
-}
-
  pci_default_write_config(pdev, addr, val, len);
  
  for (bar = 0; bar < PCI_ROM_SLOT; bar++) {

-if (old_addr[bar] != pdev->io_regions[bar].addr &&
-vdev->bars[bar].region.size > 0 &&
+if (vdev->bars[bar].region.size > 0 &&
  vdev->bars[bar].region.size < qemu_real_host_page_size) {
  vfio_sub_page_bar_update_mapping(pdev, bar);
  }

.






[PATCH 2/2] vfio/common: Add trace point when a MMIO RAM section less than PAGE_SIZE

2021-09-03 Thread Kunkun Jiang
The MSI-X structures of some devices and other non-MSI-X structures
are in the same BAR. They may share one host page, especially in the
case of large page granularity, suck as 64K.

For example, MSIX-Table size of 82599 NIC is 0x30 and the offset in
Bar 3(size 64KB) is 0x0. If host page size is 64KB.
vfio_listenerregion_add() will be called to map the remaining range
(0x30-0x). And it will return early at
'int128_ge((int128_make64(iova), llend))' and hasn't any message.
Let's add a trace point to informed users like 5c08600547c did.

Signed-off-by: Kunkun Jiang 
---
 hw/vfio/common.c | 7 +++
 1 file changed, 7 insertions(+)

diff --git a/hw/vfio/common.c b/hw/vfio/common.c
index 8728d4d5c2..2fc6213c0f 100644
--- a/hw/vfio/common.c
+++ b/hw/vfio/common.c
@@ -892,6 +892,13 @@ static void vfio_listener_region_add(MemoryListener 
*listener,
 llend = int128_and(llend, int128_exts64(qemu_real_host_page_mask));
 
 if (int128_ge(int128_make64(iova), llend)) {
+if (memory_region_is_ram_device(section->mr)) {
+trace_vfio_listener_region_add_no_dma_map(
+memory_region_name(section->mr),
+section->offset_within_address_space,
+int128_getlo(section->size),
+qemu_real_host_page_size);
+}
 return;
 }
 end = int128_get64(int128_sub(llend, int128_one()));
-- 
2.23.0




[PATCH 0/2] vfio: Some fixes about vfio-pci MMIO RAM mapping

2021-09-03 Thread Kunkun Jiang
This series include patches as below:

Patch 1:
- Deleted a check to fix vfio-pci sub-page MMIO BAR mmaping in live migration

Patch 2:
- Added a trace point to inform users when a MMIO RAM section is less than
PAGE_SIZE

Kunkun Jiang (2):
  vfio/pci: Fix vfio-pci sub-page MMIO BAR mmaping in live migration
  vfio/common: Add trace point when a MMIO RAM section less than
PAGE_SIZE

 hw/vfio/common.c | 7 +++
 hw/vfio/pci.c| 8 +---
 2 files changed, 8 insertions(+), 7 deletions(-)

-- 
2.23.0




[PATCH 1/2] vfio/pci: Fix vfio-pci sub-page MMIO BAR mmaping in live migration

2021-09-03 Thread Kunkun Jiang
We expand MemoryRegions of vfio-pci sub-page MMIO BARs to
vfio_pci_write_config to improve IO performance.
The MemoryRegions of destination VM will not be expanded
successful in live migration, because their addresses have
been updated in vmstate_load_state (vfio_pci_load_config).

Remove the restriction on base address change in
vfio_pci_write_config for correct mmapping sub-page MMIO
BARs. According to my analysis, the remaining parameter
verification is enough.

Fixes: c5e2fb3ce4d (vfio: Add save and load functions for VFIO PCI devices)
Reported-by: Nianyao Tang 
Reported-by: Qixin Gan 
Signed-off-by: Kunkun Jiang 
---
 hw/vfio/pci.c | 8 +---
 1 file changed, 1 insertion(+), 7 deletions(-)

diff --git a/hw/vfio/pci.c b/hw/vfio/pci.c
index e1ea1d8a23..891b211ddf 100644
--- a/hw/vfio/pci.c
+++ b/hw/vfio/pci.c
@@ -1189,18 +1189,12 @@ void vfio_pci_write_config(PCIDevice *pdev,
 }
 } else if (ranges_overlap(addr, len, PCI_BASE_ADDRESS_0, 24) ||
 range_covers_byte(addr, len, PCI_COMMAND)) {
-pcibus_t old_addr[PCI_NUM_REGIONS - 1];
 int bar;
 
-for (bar = 0; bar < PCI_ROM_SLOT; bar++) {
-old_addr[bar] = pdev->io_regions[bar].addr;
-}
-
 pci_default_write_config(pdev, addr, val, len);
 
 for (bar = 0; bar < PCI_ROM_SLOT; bar++) {
-if (old_addr[bar] != pdev->io_regions[bar].addr &&
-vdev->bars[bar].region.size > 0 &&
+if (vdev->bars[bar].region.size > 0 &&
 vdev->bars[bar].region.size < qemu_real_host_page_size) {
 vfio_sub_page_bar_update_mapping(pdev, bar);
 }
-- 
2.23.0




Re: [RFC v9 14/29] vfio: Introduce helpers to DMA map/unmap a RAM section

2021-09-03 Thread Kunkun Jiang

Hi Eric,

On 2021/4/11 20:08, Eric Auger wrote:

Let's introduce two helpers that allow DMA mapping/unmapping a RAM
section. Those helpers will be called for nested stage setup from
another call site. This also makes the vfio_listener_region_add/del()
structure clearer.

Signed-off-by: Eric Auger 

---

v8 -> v9
- rebase on top of
   1eb7f642750c ("vfio: Support host translation granule size")

v5 -> v6:
- add Error **
---
  hw/vfio/common.c | 199 +--
  hw/vfio/trace-events |   4 +-
  2 files changed, 119 insertions(+), 84 deletions(-)

diff --git a/hw/vfio/common.c b/hw/vfio/common.c
index a8f835328e..0cd7ef2139 100644
--- a/hw/vfio/common.c
+++ b/hw/vfio/common.c
@@ -662,13 +662,126 @@ hostwin_from_range(VFIOContainer *container, hwaddr 
iova, hwaddr end)
  return NULL;
  }
  
+static int vfio_dma_map_ram_section(VFIOContainer *container,

+MemoryRegionSection *section, Error **err)
+{
+VFIOHostDMAWindow *hostwin;
+Int128 llend, llsize;
+hwaddr iova, end;
+void *vaddr;
+int ret;
+
+assert(memory_region_is_ram(section->mr));
+
+iova = TARGET_PAGE_ALIGN(section->offset_within_address_space);
+llend = int128_make64(section->offset_within_address_space);
+llend = int128_add(llend, section->size);
+llend = int128_and(llend, int128_exts64(TARGET_PAGE_MASK));

Is there any special meaning in using TARGET_PAGE_ALIGN here?
REAL_HOST_PAGE_ALIGN is used in vfio_listener_region_add().

And I think a check should be added here if REAL_HOST_PAGE_ALIGN is used:

if (int128_ge(int128_make64(iova), llend)) {
 return;
}


Otherwise it will cause an error log, or trip the 'assert(r == a)' in
int128_get64(), when vfio_prereg_listener_region_add is called in some
scenarios. Some devices' BARs may map MSI-X structures and other
structures in the same host page.

By the way, is this patch set going to be updated after "/dev/iommu" is
sent out?

Thanks,
Kunkun Jiang


+end = int128_get64(int128_sub(llend, int128_one()));
+
+vaddr = memory_region_get_ram_ptr(section->mr) +
+section->offset_within_region +
+(iova - section->offset_within_address_space);
+
+hostwin = hostwin_from_range(container, iova, end);
+if (!hostwin) {
+error_setg(err, "Container %p can't map guest IOVA region"
+   " 0x%"HWADDR_PRIx"..0x%"HWADDR_PRIx, container, iova, end);
+return -EFAULT;
+}
+
+trace_vfio_dma_map_ram(iova, end, vaddr);
+
+llsize = int128_sub(llend, int128_make64(iova));
+
+if (memory_region_is_ram_device(section->mr)) {
+hwaddr pgmask = (1ULL << ctz64(hostwin->iova_pgsizes)) - 1;
+
+if ((iova & pgmask) || (int128_get64(llsize) & pgmask)) {
+trace_vfio_listener_region_add_no_dma_map(
+memory_region_name(section->mr),
+section->offset_within_address_space,
+int128_getlo(section->size),
+pgmask + 1);
+return 0;
+}
+}
+
+ret = vfio_dma_map(container, iova, int128_get64(llsize),
+   vaddr, section->readonly);
+if (ret) {
+error_setg(err, "vfio_dma_map(%p, 0x%"HWADDR_PRIx", "
+   "0x%"HWADDR_PRIx", %p) = %d (%m)",
+   container, iova, int128_get64(llsize), vaddr, ret);
+if (memory_region_is_ram_device(section->mr)) {
+/* Allow unexpected mappings not to be fatal for RAM devices */
+error_report_err(*err);
+return 0;
+}
+return ret;
+}
+return 0;
+}
+
+static void vfio_dma_unmap_ram_section(VFIOContainer *container,
+   MemoryRegionSection *section)
+{
+Int128 llend, llsize;
+hwaddr iova, end;
+bool try_unmap = true;
+int ret;
+
+iova = REAL_HOST_PAGE_ALIGN(section->offset_within_address_space);
+llend = int128_make64(section->offset_within_address_space);
+llend = int128_add(llend, section->size);
+llend = int128_and(llend, int128_exts64(qemu_real_host_page_mask));
+
+if (int128_ge(int128_make64(iova), llend)) {
+return;
+}
+end = int128_get64(int128_sub(llend, int128_one()));
+
+llsize = int128_sub(llend, int128_make64(iova));
+
+trace_vfio_dma_unmap_ram(iova, end);
+
+if (memory_region_is_ram_device(section->mr)) {
+hwaddr pgmask;
+VFIOHostDMAWindow *hostwin = hostwin_from_range(container, iova, end);
+
+assert(hostwin); /* or region_add() would have failed */
+
+pgmask = (1ULL << ctz64(hostwin->iova_pgsizes)) - 1;
+try_unmap = !((iova & pgmask) || (int128_get64(llsize) & pgmask));
+}
+
+if (try_unmap) {
+if (int128_eq(llsize, int128_2_64())) {
+/* The unmap ioctl doesn

[PATCH 1/2] vfio/common: Add trace point when an MMIO RAM section is smaller than the minimum size

2021-08-23 Thread Kunkun Jiang
I recently did some tests on the 82599 NIC and found a strange
scenario. The MSI-X table size of this NIC is 0x30 and its offset
in BAR 3 (64KB) is 0x0, and the CPU page size is 64KB. region_add()
will return early at 'int128_ge(int128_make64(iova), llend)' without
printing any message. Let's add a trace point to inform users.

Signed-off-by: Kunkun Jiang 
---
 hw/vfio/common.c | 7 +++
 1 file changed, 7 insertions(+)

diff --git a/hw/vfio/common.c b/hw/vfio/common.c
index 7d80f43e39..bbb8d1ea0c 100644
--- a/hw/vfio/common.c
+++ b/hw/vfio/common.c
@@ -892,6 +892,13 @@ static void vfio_listener_region_add(MemoryListener 
*listener,
 llend = int128_and(llend, int128_exts64(qemu_real_host_page_mask));
 
 if (int128_ge(int128_make64(iova), llend)) {
+if (memory_region_is_ram_device(section->mr)) {
+trace_vfio_listener_region_add_no_dma_map(
+memory_region_name(section->mr),
+section->offset_within_address_space,
+int128_getlo(section->size),
+qemu_real_host_page_size);
+}
 return;
 }
 end = int128_get64(int128_sub(llend, int128_one()));
-- 
2.23.0




[PATCH 0/2] vfio/common: Some fixes about region_add/region_del

2021-08-23 Thread Kunkun Jiang
This series includes the following patches:

Patch 1:
- Add a trace point to inform users when an MMIO RAM section is smaller than the
minimum size

Patch 2:
- Fix address alignment in region_add/region_del with the vfio iommu smallest page
size

Kunkun Jiang (2):
  vfio/common: Add trace point when an MMIO RAM section is smaller than the
minimum size
  vfio/common: Fix address alignment in region_add/region_del

 hw/vfio/common.c | 35 +++
 1 file changed, 27 insertions(+), 8 deletions(-)

-- 
2.23.0




[PATCH 2/2] vfio/common: Fix address alignment in region_add/region_del

2021-08-23 Thread Kunkun Jiang
The page sizes supported by IOMMU may not match the CPU page size.
For example, the CPU page size is 16KB, but ARM SMMU may not support
16KB. So it is inappropriate to use qemu_real_host_page_mask in
region_add/region_del.

The vfio iommu page sizes are exposed via VFIO_IOMMU_GET_INFO, so use
the smallest supported page size to align the address.
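
As a stand-alone sketch (plain C, not QEMU code) of the idea, the smallest
supported page size is just the lowest bit set in the iova_pgsizes bitmap
reported by VFIO_IOMMU_GET_INFO; the example bitmaps below are made up:

#include <stdint.h>
#include <stdio.h>

static uint64_t min_pgsize(uint64_t iova_pgsizes)
{
    return 1ULL << __builtin_ctzll(iova_pgsizes);   /* lowest set bit */
}

int main(void)
{
    uint64_t a = (1ULL << 12) | (1ULL << 21) | (1ULL << 30);  /* 4K | 2M | 1G */
    uint64_t b = (1ULL << 16) | (1ULL << 29);                 /* 64K | 512M   */

    printf("min page size: 0x%llx\n", (unsigned long long)min_pgsize(a)); /* 0x1000  */
    printf("min page size: 0x%llx\n", (unsigned long long)min_pgsize(b)); /* 0x10000 */
    return 0;
}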

Fixes: 1eb7f642750 (vfio: Support host translation granule size)
Signed-off-by: Kunkun Jiang 
---
 hw/vfio/common.c | 30 +-
 1 file changed, 21 insertions(+), 9 deletions(-)

diff --git a/hw/vfio/common.c b/hw/vfio/common.c
index bbb8d1ea0c..62f338cd78 100644
--- a/hw/vfio/common.c
+++ b/hw/vfio/common.c
@@ -859,10 +859,13 @@ static void 
vfio_unregister_ram_discard_listener(VFIOContainer *container,
 g_free(vrdl);
 }
 
+#define VFIO_IOMMU_MIN_PAGE_ALIGN(addr, pgsize) ROUND_UP((addr), (pgsize))
+
 static void vfio_listener_region_add(MemoryListener *listener,
  MemoryRegionSection *section)
 {
 VFIOContainer *container = container_of(listener, VFIOContainer, listener);
+uint64_t vfio_iommu_min_page_size, vfio_iommu_min_page_mask;
 hwaddr iova, end;
 Int128 llend, llsize;
 void *vaddr;
@@ -879,17 +882,21 @@ static void vfio_listener_region_add(MemoryListener 
*listener,
 return;
 }
 
+vfio_iommu_min_page_size = 1 << ctz64(container->pgsizes);
+vfio_iommu_min_page_mask = ~(vfio_iommu_min_page_size - 1);
+
 if (unlikely((section->offset_within_address_space &
-  ~qemu_real_host_page_mask) !=
- (section->offset_within_region & ~qemu_real_host_page_mask))) 
{
+  ~vfio_iommu_min_page_mask) !=
+ (section->offset_within_region & ~vfio_iommu_min_page_mask))) 
{
 error_report("%s received unaligned region", __func__);
 return;
 }
 
-iova = REAL_HOST_PAGE_ALIGN(section->offset_within_address_space);
+iova = VFIO_IOMMU_MIN_PAGE_ALIGN(section->offset_within_address_space,
+ vfio_iommu_min_page_size);
 llend = int128_make64(section->offset_within_address_space);
 llend = int128_add(llend, section->size);
-llend = int128_and(llend, int128_exts64(qemu_real_host_page_mask));
+llend = int128_and(llend, int128_exts64(vfio_iommu_min_page_mask));
 
 if (int128_ge(int128_make64(iova), llend)) {
 if (memory_region_is_ram_device(section->mr)) {
@@ -897,7 +904,7 @@ static void vfio_listener_region_add(MemoryListener 
*listener,
 memory_region_name(section->mr),
 section->offset_within_address_space,
 int128_getlo(section->size),
-qemu_real_host_page_size);
+vfio_iommu_min_page_size);
 }
 return;
 }
@@ -1102,6 +1109,7 @@ static void vfio_listener_region_del(MemoryListener 
*listener,
  MemoryRegionSection *section)
 {
 VFIOContainer *container = container_of(listener, VFIOContainer, listener);
+uint64_t vfio_iommu_min_page_size, vfio_iommu_min_page_mask;
 hwaddr iova, end;
 Int128 llend, llsize;
 int ret;
@@ -1115,9 +1123,12 @@ static void vfio_listener_region_del(MemoryListener 
*listener,
 return;
 }
 
+vfio_iommu_min_page_size = 1 << ctz64(container->pgsizes);
+vfio_iommu_min_page_mask = ~(vfio_iommu_min_page_size - 1);
+
 if (unlikely((section->offset_within_address_space &
-  ~qemu_real_host_page_mask) !=
- (section->offset_within_region & ~qemu_real_host_page_mask))) 
{
+  ~vfio_iommu_min_page_mask) !=
+ (section->offset_within_region & ~vfio_iommu_min_page_mask))) 
{
 error_report("%s received unaligned region", __func__);
 return;
 }
@@ -1145,10 +1156,11 @@ static void vfio_listener_region_del(MemoryListener 
*listener,
  */
 }
 
-iova = REAL_HOST_PAGE_ALIGN(section->offset_within_address_space);
+iova = VFIO_IOMMU_MIN_PAGE_ALIGN(section->offset_within_address_space,
+ vfio_iommu_min_page_size);
 llend = int128_make64(section->offset_within_address_space);
 llend = int128_add(llend, section->size);
-llend = int128_and(llend, int128_exts64(qemu_real_host_page_mask));
+llend = int128_and(llend, int128_exts64(vfio_iommu_min_page_mask));
 
 if (int128_ge(int128_make64(iova), llend)) {
 return;
-- 
2.23.0




Re: [question] Shall we flush ITS tables into guest RAM when shutting down the VM?

2021-07-26 Thread Kunkun Jiang

On 2021/7/6 22:27, Eric Auger wrote:

Hi Dave,

On 7/6/21 4:19 PM, Dr. David Alan Gilbert wrote:

* Eric Auger (eric.au...@redhat.com) wrote:

Hi,

On 7/6/21 10:18 AM, Kunkun Jiang wrote:

Hi Eric,

On 2021/6/30 17:16, Eric Auger wrote:

On 6/30/21 3:38 AM, Kunkun Jiang wrote:

On 2021/6/30 4:14, Eric Auger wrote:

Hi Kunkun,

On 6/29/21 11:33 AM, Kunkun Jiang wrote:

Hi all,

According to the patch cddafd8f353d2d251b1a5c6c948a577a85838582,
our original intention is to flush the ITS tables into guest RAM at
the point
RUN_STATE_FINISH_MIGRATE, but sometimes the VM gets stopped before
migration launch so let's simply flush the tables each time the VM
gets
stopped.

But I encountered an error when I shut down the virtual machine.


qemu-system-aarch64: KVM_SET_DEVICE_ATTR failed: Group 4 attr
0x0001: Permission denied

Shall we need to flush ITS tables into guest RAM when 'shutdown' the
VM?
Or do you think this error is normal?

yes we determined in the past this was the right moment do save the
tables

"with a live migration the guest is still running after
the RAM has been first saved, and so the tables may still change
during the iterative RAM save. You would actually need to do this
at just the point we stop the VM before the final RAM save; that
*might*
be possible by using a notifier hook a vm run state change to
RUN_STATE_FINISH_MIGRATE
- if you can do the changes just as the migration flips into that mode
it *should* work. " said David.

But sometimes as the commit msg says, the VM is stopped before the
migration launch - I do not remember the exact scenario tbh -.

Well, I initially wanted to know more about this scenario to determine
whether
a normal shutdown would fall into it.

I think it was for save/restore use case. In that case you need to flush
the KVM cache in memory on VM shutdown.

Sorry for late reply.

Can we distinguish from the 'RunState'?
When we stop the VM, the RunState will be set. There are many types of
RunState, such as RUN_STATE_FINISH_MIGRATE/RUN_STATE_SAVE_VM/
RUN_STATE_SHUTDOWN/RUN_STATE_GUEST_PANICKED, etc.

Maybe RUN_STATE_SHUTDOWN doesn't belong to save/restore use case,
right?

Adding Dave, Juan and Peter in the loop for migration expertise.

At the moment we save the ARM ITS MSI controller tables whenever the VM
gets stopped. Saving the tables from KVM caches into the guest RAM is
needed for migration and save/restore use cases.
However with GICv4 this fails at KVM level because some MSIs are
forwarded and saving their state is not supported with GICv4.

While GICv4 migration is not supported we would like the VM to work
properly, ie. being stoppable without taking care of table saving.

So could we be more precise and identify the save/restore and migration
use cases instead of saving the tables on each VM shutdown.

During the precopy migration (not sure about others), we do:

static void migration_completion(MigrationState *s)
{

 ret = vm_stop_force_state(RUN_STATE_FINISH_MIGRATE);
...
 ret = qemu_savevm_state_complete_precopy(s->to_dst_file, false,
  inactivate);

so I think we do have that state there to hook off.

That's consistent with what you suggested in the past and with what is logged
in the commit message of
cddafd8f353d2d251b1a5c6c948a577a85838582 ("hw/intc/arm_gicv3_its: Implement state
save/restore").

However, does save/restore enter that state? If I remember correctly, that's
why I decided to do the save on each VM stop instead.


My idea is to distinguish the save/restore and migration use cases from
other scenarios based on the 'RunState'. RUN_STATE_FINISH_MIGRATE and
RUN_STATE_COLO should be the right cases, I think. I'm not sure
RUN_STATE_SAVE_VM is a right case; it is used in save_snapshot.

Do you have a better way to identify it?
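
For illustration only, a rough sketch of what such a filter could look like in
the ITS vm change state handler (hw/intc/arm_gicv3_its_kvm.c). This is not a
tested change, and which states to accept is exactly the open question here,
so treat the condition as an assumption:

/* sketch only, not a tested patch */
static void vm_change_state_handler(void *opaque, bool running,
                                    RunState state)
{
    GICv3ITSState *s = (GICv3ITSState *)opaque;
    Error *err = NULL;

    if (running) {
        return;
    }

    /* assumed filter: only flush the tables for migration/save use cases */
    if (state != RUN_STATE_FINISH_MIGRATE &&
        state != RUN_STATE_SAVE_VM &&
        state != RUN_STATE_COLO) {
        return;
    }

    kvm_device_access(s->dev_fd, KVM_DEV_ARM_VGIC_GRP_CTRL,
                      KVM_DEV_ARM_ITS_SAVE_TABLES, NULL, true, &err);
    if (err) {
        error_report_err(err);
    }
}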

By the way, it just gets an -EACCES error and doesn't cause any other
problems. If we don't have a good and easy way to solve it, maybe we can
ignore it.

Thanks,
Kunkun Jiang



The tables are saved into guest RAM, so we need the CPUs and devices to
be stopped, but we need the guest RAM to be saved after the ITS save
operation.

Yeh so what should happen is that you:
a) Iterate RAM a lot
b) You stop everything
  -> Flushes remaining changes into RAM
c) Transmit device state and last bits of RAM changes.

so that flush should happen at (b).

That's correct.

Thanks

Eric

Dave


Kunkun, by the way, you currently just get an error from qemu, i.e. qemu
does not exit? Couldn't we just ignore the -EACCES error?

Thanks

Eric





In my opinion, when the virtual machine is normally shutdown, flushing
the
ITS tables is not necessary. If we can't tell the difference between
'normal shutdown'
and the above scenario, then this 'error' is inevitable.

So each time the VM is stopped we flush the caches into guest RAM.




This error occurs in the following scenario:
Kunpeng 920, GICv4 enabled, passthrough of a Hisilicon SEC accelerator to the VM.

Re: [question] Shall we flush ITS tables into guest RAM when shutting down the VM?

2021-07-26 Thread Kunkun Jiang

Hi Eric,

On 2021/7/6 21:52, Eric Auger wrote:

Hi,

On 7/6/21 10:18 AM, Kunkun Jiang wrote:

Hi Eric,

On 2021/6/30 17:16, Eric Auger wrote:

On 6/30/21 3:38 AM, Kunkun Jiang wrote:

On 2021/6/30 4:14, Eric Auger wrote:

Hi Kunkun,

On 6/29/21 11:33 AM, Kunkun Jiang wrote:

Hi all,

According to the patch cddafd8f353d2d251b1a5c6c948a577a85838582,
our original intention is to flush the ITS tables into guest RAM at
the point
RUN_STATE_FINISH_MIGRATE, but sometimes the VM gets stopped before
migration launch so let's simply flush the tables each time the VM
gets
stopped.

But I encountered an error when I shut down the virtual machine.


qemu-system-aarch64: KVM_SET_DEVICE_ATTR failed: Group 4 attr
0x0001: Permission denied

Shall we need to flush ITS tables into guest RAM when 'shutdown' the
VM?
Or do you think this error is normal?

yes we determined in the past this was the right moment do save the
tables

"with a live migration the guest is still running after
the RAM has been first saved, and so the tables may still change
during the iterative RAM save. You would actually need to do this
at just the point we stop the VM before the final RAM save; that
*might*
be possible by using a notifier hook a vm run state change to
RUN_STATE_FINISH_MIGRATE
- if you can do the changes just as the migration flips into that mode
it *should* work. " said David.

But sometimes as the commit msg says, the VM is stopped before the
migration launch - I do not remember the exact scenario tbh -.

Well, I initially wanted to know more about this scenario to determine
whether
a normal shutdown would fall into it.

I think it was for save/restore use case. In that case you need to flush
the KVM cache in memory on VM shutdown.

Sorry for late reply.

Can we distinguish from the 'RunState'?
When we stop the VM, the RunState will be set. There are many types of
RunState, such as RUN_STATE_FINISH_MIGRATE/RUN_STATE_SAVE_VM/
RUN_STATE_SHUTDOWN/RUN_STATE_GUEST_PANICKED, etc.

Maybe RUN_STATE_SHUTDOWN doesn't belong to save/restore use case,
right?

Adding Dave, Juan and Peter in the loop for migration expertise.

At the moment we save the ARM ITS MSI controller tables whenever the VM
gets stopped. Saving the tables from KVM caches into the guest RAM is
needed for migration and save/restore use cases.
However with GICv4 this fails at KVM level because some MSIs are
forwarded and saving their state is not supported with GICv4.

While GICv4 migration is not supported we would like the VM to work
properly, ie. being stoppable without taking care of table saving.

So could we be more precise and identify the save/restore and migration
use cases instead of saving the tables on each VM shutdown.

The tables are saved into guest RAM, so we need the CPUs and devices to
be stopped, but we need the guest RAM to be saved after the ITS save
operation.

Kunkun, by the way, you currently just get an error from qemu, i.e. qemu
does not exit? Couldn't we just ignore the -EACCES error?

Sorry for late reply. There is something wrong with my mailbox,
so I didn't receive your email...

Yeah, just get an error. I just wanted to know if you have a good
way to solve it. It's not causing any other problems, so ignoring
it should be OK.

Thanks,
Kunkun Jiang


Thanks

Eric





In my opinion, when the virtual machine is normally shutdown, flushing
the
ITS tables is not necessary. If we can't tell the difference between
'normal shutdown'
and the above scenario, then this 'error' is inevitable.

So each time the VM is stopped we flush the caches into guest RAM.




This error occurs in the following scenario:
Kunpeng 920, GICv4 enabled, passthrough of a Hisilicon SEC accelerator to
the VM.

The flow is as follows:

QEMU:
vm_shutdown
   do_vm_stop(RUN_STATE_SHUTDOWN)
       vm_state_notify
           ...
           vm_change_state_handler (hw/intc/arm_gicv3_its_kvm.c)
               kvm_device_access

Kernel:
   vgic_its_save_tables_v0
       vgic_its_save_device_tables
           vgic_its_save_itt

There is such a code in vgic_its_save_itt():

/*
    * If an LPI carries the HW bit, this means that this
    * interrupt is controlled by GICv4, and we do not
    * have direct access to that state without GICv4.1.
    * Let's simply fail the save operation...
    */
if (ite->irq->hw && !kvm_vgic_global_state.has_gicv4_1)
     return -EACCES;

Maybe we miss a piece of code for 4.0 that unsets the forwarding. The
only way to handle this is to make sure  ite->irq->hw is not set on
shutdown, no?

It's not going to return -EACCES here, if we unset the forwarding
first. But
this may cause problems in save/restore scenarios. The GICv4 architecture
doesn't give any guarantee that the pending state is written into the
pending table.

Thanks,
Kunkun Jiang

Thanks

Eric

As far as I understand you need a v4.1 to migrate,
following Shenming's series
[PATCH v5 0/6] KVM: arm64: Add VLPI migration support on GICv4.1

Re: [PATCH 1/2] qapi/run-state: Add a new shutdown cause 'migration-completed'

2021-07-06 Thread Kunkun Jiang

On 2021/7/6 18:27, Dr. David Alan Gilbert wrote:

* Kunkun Jiang (jiangkun...@huawei.com) wrote:

Hi Daniel,

On 2021/7/5 20:48, Daniel P. Berrangé wrote:

On Mon, Jul 05, 2021 at 08:36:52PM +0800, Kunkun Jiang wrote:

In the current version, the source QEMU process does not automatically
exit after a successful migration. Additional action is required,
such as sending { "execute": "quit" } or pressing ctrl+c. To simplify this, add
a new shutdown cause 'migration-completed' to exit the source QEMU
process after a successful migration.

IIUC, 'STATUS_COMPLETED' state is entered on the source host
once it has finished sending all VM state, and thus does not
guarantee that the target host has successfully received and
loaded all VM state.

Thanks for your reply.

If the target host doesn't successfully receive and load all VM state,
we can send { "execute": "cont" } to resume the source in time to
ensure that the VM will not be lost?

Yes, that's pretty common at the moment;  the failed migration can
happen at lots of different points:
   a) The last part of the actual migration stream/loading the devices
 - that's pretty easy, since the destination hasn't actually got
 the full migration stream.

   b) If the migration itself completes, but the management system
 then tries to reconfigure the networking/storage on the destination,
 and something goes wrong in that, then it can roll that back and
 cont on the source.

So, it's a pretty common type of failure/recovery  - the management
application has to be a bit careful not to do anything destructive
until as late as possible, so it knows it can switch back.

Okay, I see.

Typically a mgmt app will need to directly confirm that the
target host QEMU has successfully started running, before it
will tell the source QEMU to quit.

'a mgmt app', such as libvirt?

Yes, it's currently libvirt that does that; but any of the control
things could (it's just libvirt has been going long enough so it knows
about lots and lots of nasty cases of migration failure, and recovering
properly).

Can you explain why you wanted to get the source to automatically
quit? In a real setup, where does it help?

Sorry, my thoughts on live migration scenarios are not comprehensive enough.

Thanks,
Kunkun Jiang

Dave



Thanks,
Kunkun Jiang

So, AFAICT, this automatic exit after STATUS_COMPLETED is
not safe and could lead to total loss of the running VM in
error scenarios.




Signed-off-by: Kunkun Jiang 
---
   migration/migration.c | 1 +
   qapi/run-state.json   | 4 +++-
   2 files changed, 4 insertions(+), 1 deletion(-)

diff --git a/migration/migration.c b/migration/migration.c
index 4228635d18..16782c93c2 100644
--- a/migration/migration.c
+++ b/migration/migration.c
@@ -3539,6 +3539,7 @@ static void migration_iteration_finish(MigrationState *s)
   case MIGRATION_STATUS_COMPLETED:
   migration_calculate_complete(s);
   runstate_set(RUN_STATE_POSTMIGRATE);
+qemu_system_shutdown_request(SHUTDOWN_CAUSE_MIGRATION_COMPLETED);
   break;
   case MIGRATION_STATUS_ACTIVE:
diff --git a/qapi/run-state.json b/qapi/run-state.json
index 43d66d700f..66aaef4e2b 100644
--- a/qapi/run-state.json
+++ b/qapi/run-state.json
@@ -86,12 +86,14 @@
   #   ignores --no-reboot. This is useful for sanitizing
   #   hypercalls on s390 that are used during kexec/kdump/boot
   #
+# @migration-completed: Reaction to the successful migration
+#
   ##
   { 'enum': 'ShutdownCause',
 # Beware, shutdown_caused_by_guest() depends on enumeration order
 'data': [ 'none', 'host-error', 'host-qmp-quit', 'host-qmp-system-reset',
   'host-signal', 'host-ui', 'guest-shutdown', 'guest-reset',
-'guest-panic', 'subsystem-reset'] }
+'guest-panic', 'subsystem-reset', 'migration-completed'] }
   ##
   # @StatusInfo:
--
2.23.0



Regards,
Daniel







Re: [question] Shall we flush ITS tables into guest RAM when shutting down the VM?

2021-07-06 Thread Kunkun Jiang

Hi Eric,

On 2021/6/30 17:16, Eric Auger wrote:


On 6/30/21 3:38 AM, Kunkun Jiang wrote:

On 2021/6/30 4:14, Eric Auger wrote:

Hi Kunkun,

On 6/29/21 11:33 AM, Kunkun Jiang wrote:

Hi all,

According to the patch cddafd8f353d2d251b1a5c6c948a577a85838582,
our original intention is to flush the ITS tables into guest RAM at
the point
RUN_STATE_FINISH_MIGRATE, but sometimes the VM gets stopped before
migration launch so let's simply flush the tables each time the VM gets
stopped.

But I encountered an error when I shut down the virtual machine.


qemu-system-aarch64: KVM_SET_DEVICE_ATTR failed: Group 4 attr
0x0001: Permission denied

Shall we need to flush ITS tables into guest RAM when 'shutdown' the
VM?
Or do you think this error is normal?

yes we determined in the past this was the right moment do save the
tables

"with a live migration the guest is still running after
the RAM has been first saved, and so the tables may still change
during the iterative RAM save. You would actually need to do this
at just the point we stop the VM before the final RAM save; that *might*
be possible by using a notifier hook a vm run state change to
RUN_STATE_FINISH_MIGRATE
- if you can do the changes just as the migration flips into that mode
it *should* work. " said David.

But sometimes as the commit msg says, the VM is stopped before the
migration launch - I do not remember the exact scenario tbh -.

Well, I initially wanted to know more about this scenario to determine
whether
a normal shutdown would fall into it.

I think it was for save/restore use case. In that case you need to flush
the KVM cache in memory on VM shutdown.

Sorry for late reply.

Can we distinguish from the 'RunState'?
When we stop the VM, the RunState will be set. There are many types of
RunState, such as RUN_STATE_FINISH_MIGRATE/RUN_STATE_SAVE_VM/
RUN_STATE_SHUTDOWN/RUN_STATE_GUEST_PANICKED, etc.

Maybe RUN_STATE_SHUTDOWN doesn't belong to save/restore use case,
right?

In my opinion, when the virtual machine is normally shutdown, flushing
the
ITS tables is not necessary. If we can't tell the difference between
'normal shutdown'
and the above scenario, then this 'error' is inevitable.

So each time the VM is stopped we flush the caches into guest RAM.




This error occurs in the following scenario:
Kunpeng 920, GICv4 enabled, passthrough of a Hisilicon SEC accelerator to
the VM.

The flow is as follows:

QEMU:
vm_shutdown
  do_vm_stop(RUN_STATE_SHUTDOWN)
      vm_state_notify
          ...
          vm_change_state_handler (hw/intc/arm_gicv3_its_kvm.c)
              kvm_device_access

Kernel:
  vgic_its_save_tables_v0
      vgic_its_save_device_tables
          vgic_its_save_itt

There is such a code in vgic_its_save_itt():

/*
   * If an LPI carries the HW bit, this means that this
   * interrupt is controlled by GICv4, and we do not
   * have direct access to that state without GICv4.1.
   * Let's simply fail the save operation...
   */
if (ite->irq->hw && !kvm_vgic_global_state.has_gicv4_1)
    return -EACCES;

Maybe we miss a piece of code for 4.0 that unsets the forwarding. The
only way to handle this is to make sure  ite->irq->hw is not set on
shutdown, no?

It's not going to return -EACCES here, if we unset the forwarding first. But
this may cause problems in save/restore scenarios. The GICv4 architecture
doesn't give any guarantee that the pending state is written into the
pending table.

Thanks,
Kunkun Jiang

Thanks

Eric

As far as I understand you need a v4.1 to migrate,
following Shenming's series
[PATCH v5 0/6] KVM: arm64: Add VLPI migration support on GICv4.1
Maybe sync with him?

Yes, GICv4 does not support live migration.

Thanks,
Kunkun Jiang

Thanks

Eric



Looking forward to your reply.

Thanks,
Kunkun Jiang

















Re: [PATCH 1/2] qapi/run-state: Add a new shutdown cause 'migration-completed'

2021-07-05 Thread Kunkun Jiang

Hi Daniel,

On 2021/7/5 20:48, Daniel P. Berrangé wrote:

On Mon, Jul 05, 2021 at 08:36:52PM +0800, Kunkun Jiang wrote:

In the current version, the source QEMU process does not automatically
exit after a successful migration. Additional action is required,
such as sending { "execute": "quit" } or pressing ctrl+c. To simplify this, add
a new shutdown cause 'migration-completed' to exit the source QEMU
process after a successful migration.

IIUC, 'STATUS_COMPLETED' state is entered on the source host
once it has finished sending all VM state, and thus does not
guarantee that the target host has successfully received and
loaded all VM state.

Thanks for your reply.

If the target host doesn't successfully receive and load all VM state,
we can send { "execute": "cont" } to resume the source in time to
ensure that the VM will not be lost?

Typically a mgmt app will need to directly confirm that the
target host QEMU has successfully started running, before it
will tell the source QEMU to quit.

'a mgmt app', such as libvirt?

Thanks,
Kunkun Jiang

So, AFAICT, this automatic exit after STATUS_COMPLETED is
not safe and could lead to total loss of the running VM in
error scenarios.




Signed-off-by: Kunkun Jiang 
---
  migration/migration.c | 1 +
  qapi/run-state.json   | 4 +++-
  2 files changed, 4 insertions(+), 1 deletion(-)

diff --git a/migration/migration.c b/migration/migration.c
index 4228635d18..16782c93c2 100644
--- a/migration/migration.c
+++ b/migration/migration.c
@@ -3539,6 +3539,7 @@ static void migration_iteration_finish(MigrationState *s)
  case MIGRATION_STATUS_COMPLETED:
  migration_calculate_complete(s);
  runstate_set(RUN_STATE_POSTMIGRATE);
+qemu_system_shutdown_request(SHUTDOWN_CAUSE_MIGRATION_COMPLETED);
  break;
  
  case MIGRATION_STATUS_ACTIVE:

diff --git a/qapi/run-state.json b/qapi/run-state.json
index 43d66d700f..66aaef4e2b 100644
--- a/qapi/run-state.json
+++ b/qapi/run-state.json
@@ -86,12 +86,14 @@
  #   ignores --no-reboot. This is useful for sanitizing
  #   hypercalls on s390 that are used during kexec/kdump/boot
  #
+# @migration-completed: Reaction to the successful migration
+#
  ##
  { 'enum': 'ShutdownCause',
# Beware, shutdown_caused_by_guest() depends on enumeration order
'data': [ 'none', 'host-error', 'host-qmp-quit', 'host-qmp-system-reset',
  'host-signal', 'host-ui', 'guest-shutdown', 'guest-reset',
-'guest-panic', 'subsystem-reset'] }
+'guest-panic', 'subsystem-reset', 'migration-completed'] }
  
  ##

  # @StatusInfo:
--
2.23.0



Regards,
Daniel






[PATCH 0/2] Auto exit source QEMU process after a successful migration

2021-07-05 Thread Kunkun Jiang
Hi all,

This series includes the following patches:
Patch 1:
- add a new shutdown cause 'migration-completed', which is used to automatically
  exit the source QEMU process after a successful migration

Patch 2:
- add a new migration capability 'auto-quit' to control whether to automatically
  exit the source QEMU process after a successful migration

Kunkun Jiang (2):
  qapi/run-state: Add a new shutdown cause 'migration-completed'
  qapi/migration: Add a new migration capability 'auto-quit'

 migration/migration.c | 13 +
 migration/migration.h |  1 +
 qapi/migration.json   |  6 +-
 qapi/run-state.json   |  4 +++-
 4 files changed, 22 insertions(+), 2 deletions(-)

-- 
2.23.0




[PATCH 2/2] qapi/migration: Add a new migration capability 'auto-quit'

2021-07-05 Thread Kunkun Jiang
For compatibility, a new migration capability 'auto-quit' is added
to control the exit of the source QEMU process after a successful migration.

Signed-off-by: Kunkun Jiang 
---
 migration/migration.c | 14 +-
 migration/migration.h |  1 +
 qapi/migration.json   |  6 +-
 3 files changed, 19 insertions(+), 2 deletions(-)

diff --git a/migration/migration.c b/migration/migration.c
index 16782c93c2..82ad6d35b2 100644
--- a/migration/migration.c
+++ b/migration/migration.c
@@ -2567,6 +2567,15 @@ bool migrate_background_snapshot(void)
 return s->enabled_capabilities[MIGRATION_CAPABILITY_BACKGROUND_SNAPSHOT];
 }
 
+bool migrate_auto_quit(void)
+{
+MigrationState *s;
+
+s = migrate_get_current();
+
+return s->enabled_capabilities[MIGRATION_CAPABILITY_AUTO_QUIT];
+}
+
 /* migration thread support */
 /*
  * Something bad happened to the RP stream, mark an error
@@ -3539,7 +3548,10 @@ static void migration_iteration_finish(MigrationState *s)
 case MIGRATION_STATUS_COMPLETED:
 migration_calculate_complete(s);
 runstate_set(RUN_STATE_POSTMIGRATE);
-qemu_system_shutdown_request(SHUTDOWN_CAUSE_MIGRATION_COMPLETED);
+
+if (migrate_auto_quit()) {
+qemu_system_shutdown_request(SHUTDOWN_CAUSE_MIGRATION_COMPLETED);
+}
 break;
 
 case MIGRATION_STATUS_ACTIVE:
diff --git a/migration/migration.h b/migration/migration.h
index 2ebb740dfa..a72b178d35 100644
--- a/migration/migration.h
+++ b/migration/migration.h
@@ -349,6 +349,7 @@ int migrate_decompress_threads(void);
 bool migrate_use_events(void);
 bool migrate_postcopy_blocktime(void);
 bool migrate_background_snapshot(void);
+bool migrate_auto_quit(void);
 
 /* Sending on the return path - generic and then for each message type */
 void migrate_send_rp_shut(MigrationIncomingState *mis,
diff --git a/qapi/migration.json b/qapi/migration.json
index 1124a2dda8..26c1a52c56 100644
--- a/qapi/migration.json
+++ b/qapi/migration.json
@@ -452,6 +452,9 @@
 #   procedure starts. The VM RAM is saved with running VM.
 #   (since 6.0)
 #
+# @auto-quit: If enabled, QEMU process will exit after a successful migration.
+# (since 6.1)
+#
 # Since: 1.2
 ##
 { 'enum': 'MigrationCapability',
@@ -459,7 +462,8 @@
'compress', 'events', 'postcopy-ram', 'x-colo', 'release-ram',
'block', 'return-path', 'pause-before-switchover', 'multifd',
'dirty-bitmaps', 'postcopy-blocktime', 'late-block-activate',
-   'x-ignore-shared', 'validate-uuid', 'background-snapshot'] }
+   'x-ignore-shared', 'validate-uuid', 'background-snapshot',
+   'auto-quit'] }
 
 ##
 # @MigrationCapabilityStatus:
-- 
2.23.0




[PATCH 1/2] qapi/run-state: Add a new shutdown cause 'migration-completed'

2021-07-05 Thread Kunkun Jiang
In the current version, the source QEMU process does not automatically
exit after a successful migration. Additional action is required,
such as sending { "execute": "quit" } or pressing ctrl+c. To simplify this, add
a new shutdown cause 'migration-completed' to exit the source QEMU
process after a successful migration.

Signed-off-by: Kunkun Jiang 
---
 migration/migration.c | 1 +
 qapi/run-state.json   | 4 +++-
 2 files changed, 4 insertions(+), 1 deletion(-)

diff --git a/migration/migration.c b/migration/migration.c
index 4228635d18..16782c93c2 100644
--- a/migration/migration.c
+++ b/migration/migration.c
@@ -3539,6 +3539,7 @@ static void migration_iteration_finish(MigrationState *s)
 case MIGRATION_STATUS_COMPLETED:
 migration_calculate_complete(s);
 runstate_set(RUN_STATE_POSTMIGRATE);
+qemu_system_shutdown_request(SHUTDOWN_CAUSE_MIGRATION_COMPLETED);
 break;
 
 case MIGRATION_STATUS_ACTIVE:
diff --git a/qapi/run-state.json b/qapi/run-state.json
index 43d66d700f..66aaef4e2b 100644
--- a/qapi/run-state.json
+++ b/qapi/run-state.json
@@ -86,12 +86,14 @@
 #   ignores --no-reboot. This is useful for sanitizing
 #   hypercalls on s390 that are used during kexec/kdump/boot
 #
+# @migration-completed: Reaction to the successful migration
+#
 ##
 { 'enum': 'ShutdownCause',
   # Beware, shutdown_caused_by_guest() depends on enumeration order
   'data': [ 'none', 'host-error', 'host-qmp-quit', 'host-qmp-system-reset',
 'host-signal', 'host-ui', 'guest-shutdown', 'guest-reset',
-'guest-panic', 'subsystem-reset'] }
+'guest-panic', 'subsystem-reset', 'migration-completed'] }
 
 ##
 # @StatusInfo:
-- 
2.23.0




Re: [question] Shall we flush ITS tables into guest RAM when shutting down the VM?

2021-06-29 Thread Kunkun Jiang

On 2021/6/30 4:14, Eric Auger wrote:

Hi Kunkun,

On 6/29/21 11:33 AM, Kunkun Jiang wrote:

Hi all,

According to the patch cddafd8f353d2d251b1a5c6c948a577a85838582,
our original intention is to flush the ITS tables into guest RAM at
the point
RUN_STATE_FINISH_MIGRATE, but sometimes the VM gets stopped before
migration launch so let's simply flush the tables each time the VM gets
stopped.

But I encountered an error when I shut down the virtual machine.


qemu-system-aarch64: KVM_SET_DEVICE_ATTR failed: Group 4 attr
0x0001: Permission denied

Shall we need to flush ITS tables into guest RAM when 'shutdown' the VM?
Or do you think this error is normal?

yes we determined in the past this was the right moment do save the tables

"with a live migration the guest is still running after
the RAM has been first saved, and so the tables may still change
during the iterative RAM save. You would actually need to do this
at just the point we stop the VM before the final RAM save; that *might*
be possible by using a notifier hook a vm run state change to 
RUN_STATE_FINISH_MIGRATE
- if you can do the changes just as the migration flips into that mode
it *should* work. " said David.

But sometimes as the commit msg says, the VM is stopped before the migration 
launch - I do not remember the exact scenario tbh -.
Well, I initially wanted to know more about this scenario to determine 
whether

a normal shutdown would fall into it.
In my opinion, when the virtual machine is normally shut down, flushing the
ITS tables is not necessary. If we can't tell the difference between a
'normal shutdown' and the above scenario, then this 'error' is inevitable.

So each time the VM is stopped we flush the caches into guest RAM.




This error occurs in the following scenario:
Kunpeng 920, GICv4 enabled, passthrough of a Hisilicon SEC accelerator to
the VM.

The flow is as follows:

QEMU:
vm_shutdown
     do_vm_stop(RUN_STATE_SHUTDOWN)
         vm_state_notify
             ...
             vm_change_state_handler (hw/intc/arm_gicv3_its_kvm.c)
                 kvm_device_access

Kernel:
     vgic_its_save_tables_v0
         vgic_its_save_device_tables
             vgic_its_save_itt

There is such a code in vgic_its_save_itt():

/*
  * If an LPI carries the HW bit, this means that this
  * interrupt is controlled by GICv4, and we do not
  * have direct access to that state without GICv4.1.
  * Let's simply fail the save operation...
  */
if (ite->irq->hw && !kvm_vgic_global_state.has_gicv4_1)
   return -EACCES;

As far as I understand you need a v4.1 to migrate,
following Shenming's series
[PATCH v5 0/6] KVM: arm64: Add VLPI migration support on GICv4.1
Maybe sync with him?

Yes, GICv4 does not support live migration.

Thanks,
Kunkun Jiang


Thanks

Eric



Looking forward to your reply.

Thanks,
Kunkun Jiang














[question] Shall we flush ITS tables into guest RAM when shutting down the VM?

2021-06-29 Thread Kunkun Jiang

Hi all,

According to the patch cddafd8f353d2d251b1a5c6c948a577a85838582,
our original intention is to flush the ITS tables into guest RAM at the 
point

RUN_STATE_FINISH_MIGRATE, but sometimes the VM gets stopped before
migration launch so let's simply flush the tables each time the VM gets
stopped.

But I encountered an error when I shut down the virtual machine.

qemu-system-aarch64: KVM_SET_DEVICE_ATTR failed: Group 4 attr 
0x0001: Permission denied


Shall we need to flush ITS tables into guest RAM when 'shutdown' the VM?
Or do you think this error is normal?

This error occurs in the following scenario:
Kunpeng 920, GICv4 enabled, passthrough of a Hisilicon SEC accelerator to the
VM.


The flow is as follows:

QEMU:
vm_shutdown
    do_vm_stop(RUN_STATE_SHUTDOWN)
        vm_state_notify
            ...
            vm_change_state_handler (hw/intc/arm_gicv3_its_kvm.c)
                kvm_device_access

Kernel:
    vgic_its_save_tables_v0
        vgic_its_save_device_tables
            vgic_its_save_itt

There is such a code in vgic_its_save_itt():

/*
 * If an LPI carries the HW bit, this means that this
 * interrupt is controlled by GICv4, and we do not
 * have direct access to that state without GICv4.1.
 * Let's simply fail the save operation...
 */
if (ite->irq->hw && !kvm_vgic_global_state.has_gicv4_1)
  return -EACCES;


Looking forward to your reply.

Thanks,
Kunkun Jiang









Re: [PATCH] vfio: Fix unregister SaveVMHandler in vfio_migration_finalize

2021-06-15 Thread Kunkun Jiang

Kindly ping,

Hi everyone,

Will this patch be picked up soon, or is there any other work for me to do?

Best Regards,
Kunkun Jiang

On 2021/5/27 20:31, Kunkun Jiang wrote:

In vfio_migration_init(), the SaveVMHandlers are registered for the
VFIO device, but the corresponding 'unregister' operation is missing.
This leads to a 'Segmentation fault (core dumped)' in
qemu_savevm_state_setup() if a live migration is performed after a
VFIO device has been hot-unplugged.

Fixes: 7c2f5f75f94 (vfio: Register SaveVMHandlers for VFIO device)
Reported-by: Qixin Gan 
Signed-off-by: Kunkun Jiang 
---
  hw/vfio/migration.c | 1 +
  1 file changed, 1 insertion(+)

diff --git a/hw/vfio/migration.c b/hw/vfio/migration.c
index 201642d75e..ef397ebe6c 100644
--- a/hw/vfio/migration.c
+++ b/hw/vfio/migration.c
@@ -892,6 +892,7 @@ void vfio_migration_finalize(VFIODevice *vbasedev)
  
  remove_migration_state_change_notifier(&migration->migration_state);

  qemu_del_vm_change_state_handler(migration->vm_state);
+unregister_savevm(VMSTATE_IF(vbasedev->dev), "vfio", vbasedev);
  vfio_migration_exit(vbasedev);
  }
  






Re: [PATCH] vfio: Fix unregister SaveVMHandler in vfio_migration_finalize

2021-05-27 Thread Kunkun Jiang

Hi Philippe,

On 2021/5/27 21:44, Philippe Mathieu-Daudé wrote:

On 5/27/21 2:31 PM, Kunkun Jiang wrote:

In vfio_migration_init(), the SaveVMHandlers are registered for the
VFIO device, but the corresponding 'unregister' operation is missing.
This leads to a 'Segmentation fault (core dumped)' in
qemu_savevm_state_setup() if a live migration is performed after a
VFIO device has been hot-unplugged.

Fixes: 7c2f5f75f94 (vfio: Register SaveVMHandlers for VFIO device)
Reported-by: Qixin Gan 
Signed-off-by: Kunkun Jiang 

Cc: qemu-sta...@nongnu.org


---
  hw/vfio/migration.c | 1 +
  1 file changed, 1 insertion(+)

diff --git a/hw/vfio/migration.c b/hw/vfio/migration.c
index 201642d75e..ef397ebe6c 100644
--- a/hw/vfio/migration.c
+++ b/hw/vfio/migration.c
@@ -892,6 +892,7 @@ void vfio_migration_finalize(VFIODevice *vbasedev)
  
  remove_migration_state_change_notifier(&migration->migration_state);

  qemu_del_vm_change_state_handler(migration->vm_state);
+unregister_savevm(VMSTATE_IF(vbasedev->dev), "vfio", vbasedev);

Hmm what about devices using "%s/vfio" id?

unregister_savevm() needs a 'VMStateIf *obj'. If we pass a non-null 'obj'
to unregister_savevm(), it will handle the devices using a "%s/vfio" id with
the following code:

    if (obj) {
    char *oid = vmstate_if_get_id(obj);
    if (oid) {
    pstrcpy(id, sizeof(id), oid);
    pstrcat(id, sizeof(id), "/");
    g_free(oid);
    }
    }
    pstrcat(id, sizeof(id), idstr);


By the way, I'm puzzled that register_savevm_live() and unregister_savevm()
handle devices using a "%s/vfio" id differently. So I looked at the commit
history of register_savevm_live() and unregister_savevm().

In the beginning, both of them took a 'DeviceState *dev', which was replaced
with VMStateIf in 3cad405babb. Later, in ce62df5378b, the 'dev' parameter was
removed, because no caller of register_savevm_live() needed to pass a
non-null 'dev' at that time.

So now the vfio code needs to build the 'id' first and then call
register_savevm_live(). I am wondering whether we need to add a
'VMStateIf *obj' parameter to register_savevm_live(). What do you think?
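
For reference, this is roughly how the caller side currently builds the id
before register_savevm_live(); it is an abridged sketch from memory of
hw/vfio/migration.c, so treat the details as approximate:

/* abridged sketch of the id handling in vfio_migration_init() */
    char id[256] = "";
    char *oid = vmstate_if_get_id(VMSTATE_IF(vbasedev->dev));

    if (oid) {
        char *path = g_strdup_printf("%s/vfio", oid);  /* "<device id>/vfio" */
        strpadcpy(id, sizeof(id), path, '\0');
        g_free(path);
        g_free(oid);
    } else {
        strpadcpy(id, sizeof(id), "vfio", '\0');
    }

    register_savevm_live(id, VMSTATE_INSTANCE_ID_ANY, 1,
                         &savevm_vfio_handlers, vbasedev);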

Thanks,
Kunkun Jiang



  vfio_migration_exit(vbasedev);
  }
  








[PATCH] vfio: Fix unregister SaveVMHandler in vfio_migration_finalize

2021-05-27 Thread Kunkun Jiang
In vfio_migration_init(), the SaveVMHandlers are registered for the
VFIO device, but the corresponding 'unregister' operation is missing.
This leads to a 'Segmentation fault (core dumped)' in
qemu_savevm_state_setup() if a live migration is performed after a
VFIO device has been hot-unplugged.
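
To show why the missing unregister bites, here is a stand-alone model (plain C,
not QEMU code) of the dangling handler entry; the id string and sizes are
made up:

#include <stdio.h>
#include <stdlib.h>

typedef struct Handler {
    char idstr[32];
    void *opaque;              /* the device state the callbacks expect */
    struct Handler *next;
} Handler;

static Handler *handlers;

static void register_handler(const char *idstr, void *opaque)
{
    Handler *h = calloc(1, sizeof(*h));
    snprintf(h->idstr, sizeof(h->idstr), "%s", idstr);
    h->opaque = opaque;
    h->next = handlers;
    handlers = h;
}

int main(void)
{
    void *vbasedev = calloc(1, 128);        /* models the VFIO device state */
    register_handler("0000:01:00.0/vfio", vbasedev);

    free(vbasedev);                         /* hot unplug, but no unregister */

    /* models qemu_savevm_state_setup(): each entry still holds a stale opaque */
    for (Handler *h = handlers; h; h = h->next) {
        printf("would call save_setup(%s) with freed opaque %p\n",
               h->idstr, h->opaque);
    }
    return 0;
}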

Fixes: 7c2f5f75f94 (vfio: Register SaveVMHandlers for VFIO device)
Reported-by: Qixin Gan 
Signed-off-by: Kunkun Jiang 
---
 hw/vfio/migration.c | 1 +
 1 file changed, 1 insertion(+)

diff --git a/hw/vfio/migration.c b/hw/vfio/migration.c
index 201642d75e..ef397ebe6c 100644
--- a/hw/vfio/migration.c
+++ b/hw/vfio/migration.c
@@ -892,6 +892,7 @@ void vfio_migration_finalize(VFIODevice *vbasedev)
 
 remove_migration_state_change_notifier(&migration->migration_state);
 qemu_del_vm_change_state_handler(migration->vm_state);
+unregister_savevm(VMSTATE_IF(vbasedev->dev), "vfio", vbasedev);
 vfio_migration_exit(vbasedev);
 }
 
-- 
2.23.0




Re: [RFC v9 24/29] hw/arm/smmuv3: Fill the IOTLBEntry leaf field on NH_VA invalidation

2021-05-13 Thread Kunkun Jiang

Hi Eric,

On 2021/4/11 20:09, Eric Auger wrote:

Let's propagate the leaf attribute throughout the invalidation path.
This hint is used to reduce the scope of the invalidations to the
last level of translation. Not enforcing it induces large performance
penalties in nested mode.

Signed-off-by: Eric Auger 
---
  hw/arm/smmuv3.c | 9 +
  1 file changed, 5 insertions(+), 4 deletions(-)

diff --git a/hw/arm/smmuv3.c b/hw/arm/smmuv3.c
index 7beb55cd89..74a6408146 100644
--- a/hw/arm/smmuv3.c
+++ b/hw/arm/smmuv3.c
@@ -799,7 +799,7 @@ epilogue:
  static void smmuv3_notify_iova(IOMMUMemoryRegion *mr,
 IOMMUNotifier *n,
 int asid, dma_addr_t iova,
-   uint8_t tg, uint64_t num_pages)
+   uint8_t tg, uint64_t num_pages, bool leaf)
  {
  SMMUDevice *sdev = container_of(mr, SMMUDevice, iommu);
  IOMMUTLBEvent event = {};
@@ -834,6 +834,7 @@ static void smmuv3_notify_iova(IOMMUMemoryRegion *mr,
  event.entry.perm = IOMMU_NONE;
  event.entry.flags = IOMMU_INV_FLAGS_ARCHID;
  event.entry.arch_id = asid;
+event.entry.leaf = leaf;
  
  memory_region_notify_iommu_one(n, );

  }
@@ -865,7 +866,7 @@ static void smmuv3_notify_asid(IOMMUMemoryRegion *mr,
  
  /* invalidate an asid/iova range tuple in all mr's */

  static void smmuv3_inv_notifiers_iova(SMMUState *s, int asid, dma_addr_t iova,
-  uint8_t tg, uint64_t num_pages)
+  uint8_t tg, uint64_t num_pages, bool 
leaf)
  {
  SMMUDevice *sdev;

Does the parameter 'leaf' need to be added to the trace?

trace_smmuv3_inv_notifiers_iova(mr->parent_obj.name, asid, iova,
    tg, num_pages);


Thanks,
Kunkun Jiang
  
@@ -877,7 +878,7 @@ static void smmuv3_inv_notifiers_iova(SMMUState *s, int asid, dma_addr_t iova,

  tg, num_pages);
  
  IOMMU_NOTIFIER_FOREACH(n, mr) {

-smmuv3_notify_iova(mr, n, asid, iova, tg, num_pages);
+smmuv3_notify_iova(mr, n, asid, iova, tg, num_pages, leaf);
  }
  }
  }
@@ -915,7 +916,7 @@ static void smmuv3_s1_range_inval(SMMUState *s, Cmd *cmd)
  count = mask + 1;
  
  trace_smmuv3_s1_range_inval(vmid, asid, addr, tg, count, ttl, leaf);

-smmuv3_inv_notifiers_iova(s, asid, addr, tg, count);
+smmuv3_inv_notifiers_iova(s, asid, addr, tg, count, leaf);
  smmu_iotlb_inv_iova(s, asid, addr, tg, count, ttl);
  
  num_pages -= count;






Re: [RFC PATCH v2 0/4] Add migration support for VFIO PCI devices in SMMUv3 nested stage mode

2021-05-10 Thread Kunkun Jiang

Hi all,

This series has been updated to v3.[1]
Any comments and reviews are welcome.

Thanks,
Kunkun Jiang

[1] [RFC PATCH v3 0/4] Add migration support for VFIO PCI devices in 
SMMUv3 nested mode

https://lore.kernel.org/qemu-devel/20210511020816.2905-1-jiangkun...@huawei.com/

On 2021/3/31 18:12, Kunkun Jiang wrote:

Hi all,

Since the SMMUv3 nested translation stages[1] have been introduced by Eric, we
need to pay attention to the migration of VFIO PCI devices in SMMUv3 nested
stage mode. At present, it is not yet supported in QEMU. There are two problems
in the existing framework.

First, the current way to get dirty pages is not applicable to nested stage
mode. Because of the "Caching Mode", VTD can map the RAM through the host single
stage (giova->hpa). "vfio_listener_log_sync" gets dirty pages by passing "giova"
to the kernel for the RAM block section of the mapped MMIO region. In nested
stage mode, we set up the stage 2 (gpa->hpa) and the stage 1 (giova->gpa)
separately, so the current way of getting dirty pages does not work.
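
To make the two-stage picture concrete, here is a stand-alone model (plain C,
not QEMU code) with made-up addresses; the point is only that the container's
dirty bitmap covers the stage 2 mapping, so the sync has to be keyed by gpa,
not by giova:

#include <stdint.h>
#include <stdio.h>

/* stage 1, owned by the guest: giova -> gpa */
static uint64_t stage1_translate(uint64_t giova) { return giova + 0x40000000; }

/* stage 2, owned by the host: gpa -> hpa */
static uint64_t stage2_translate(uint64_t gpa)   { return gpa + 0x80000000; }

int main(void)
{
    uint64_t giova = 0x1000;
    uint64_t gpa   = stage1_translate(giova);
    uint64_t hpa   = stage2_translate(gpa);

    printf("giova 0x%llx -> gpa 0x%llx -> hpa 0x%llx\n",
           (unsigned long long)giova, (unsigned long long)gpa,
           (unsigned long long)hpa);

    /*
     * The kernel tracks dirty pages against the stage 2 (gpa -> hpa) mapping,
     * i.e. what the prereg listener registered, so log_sync must pass gpa
     * ranges rather than the giova ranges seen by the stage 1 listener.
     */
    return 0;
}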

Second, we also need to pass the stage 1 configurations to the destination host
after the migration. In Eric's patch, the stage 1 configuration is passed to the
host on each STE update for the devices that set the PASID PciOps, and it is
applied at the physical level. But this physical-level state is not sent to the
destination host. So we have to pass the stage 1 configurations to the
destination host after the migration.

Best Regards,
Kunkun Jiang

[1] [RFC,v8,00/28] vSMMUv3/pSMMUv3 2 stage VFIO integration
http://patchwork.ozlabs.org/project/qemu-devel/cover/20210225105233.650545-1-eric.au...@redhat.com/

This patch set includes the following patches:

Patch 1-2:
- Refactor vfio_listener_log_sync and add a new function to get dirty pages
in nested stage mode.

Patch 3:
- Added global_log_start/stop interface in vfio_memory_prereg_listener

Patch 4:
- Added the post_load function to vmstate_smmuv3 for passing stage 1 
configuration
to the destination host after the migration.

History:

v1 -> v2:
- Add global_log_start/stop interface in vfio_memory_prereg_listener
- Add support for repass stage 1 configs with multiple CDs to the host

Kunkun Jiang (4):
   vfio: Introduce helpers to mark dirty pages of a RAM section
   vfio: Add vfio_prereg_listener_log_sync in nested stage
   vfio: Add vfio_prereg_listener_global_log_start/stop in nested stage
   hw/arm/smmuv3: Post-load stage 1 configurations to the host

  hw/arm/smmuv3.c | 62 +++
  hw/arm/trace-events |  1 +
  hw/vfio/common.c| 71 -
  3 files changed, 126 insertions(+), 8 deletions(-)






[RFC PATCH v3 1/4] vfio: Introduce helpers to mark dirty pages of a RAM section

2021-05-10 Thread Kunkun Jiang
Extract part of the code from vfio_sync_dirty_bitmap to form a
new helper, which allows marking the dirty pages of a RAM section.
This helper will be called for the nested stage case.

Signed-off-by: Kunkun Jiang 
---
 hw/vfio/common.c | 22 ++
 1 file changed, 14 insertions(+), 8 deletions(-)

diff --git a/hw/vfio/common.c b/hw/vfio/common.c
index dc8372c772..9fb8d44a6d 100644
--- a/hw/vfio/common.c
+++ b/hw/vfio/common.c
@@ -1271,6 +1271,19 @@ err_out:
 return ret;
 }
 
+static int vfio_dma_sync_ram_section_dirty_bitmap(VFIOContainer *container,
+  MemoryRegionSection *section)
+{
+ram_addr_t ram_addr;
+
+ram_addr = memory_region_get_ram_addr(section->mr) +
+   section->offset_within_region;
+
+return vfio_get_dirty_bitmap(container,
+REAL_HOST_PAGE_ALIGN(section->offset_within_address_space),
+int128_get64(section->size), ram_addr);
+}
+
 typedef struct {
 IOMMUNotifier n;
 VFIOGuestIOMMU *giommu;
@@ -1312,8 +1325,6 @@ static void vfio_iommu_map_dirty_notify(IOMMUNotifier *n, 
IOMMUTLBEntry *iotlb)
 static int vfio_sync_dirty_bitmap(VFIOContainer *container,
   MemoryRegionSection *section)
 {
-ram_addr_t ram_addr;
-
 if (memory_region_is_iommu(section->mr)) {
 VFIOGuestIOMMU *giommu;
 
@@ -1342,12 +1353,7 @@ static int vfio_sync_dirty_bitmap(VFIOContainer 
*container,
 return 0;
 }
 
-ram_addr = memory_region_get_ram_addr(section->mr) +
-   section->offset_within_region;
-
-return vfio_get_dirty_bitmap(container,
-   REAL_HOST_PAGE_ALIGN(section->offset_within_address_space),
-   int128_get64(section->size), ram_addr);
+return vfio_dma_sync_ram_section_dirty_bitmap(container, section);
 }
 
 static void vfio_listener_log_sync(MemoryListener *listener,
-- 
2.23.0




[RFC PATCH v3 4/4] hw/arm/smmuv3: Post-load stage 1 configurations to the host

2021-05-10 Thread Kunkun Jiang
In nested mode, we call the set_pasid_table() callback on each
STE update to pass the guest stage 1 configuration to the host
and apply it at the physical level.

In the case of live migration, we need to manually call
set_pasid_table() to load the guest stage 1 configurations into
the host. If this operation fails, the migration fails.

Signed-off-by: Kunkun Jiang 
---
 hw/arm/smmuv3.c | 33 -
 1 file changed, 28 insertions(+), 5 deletions(-)

diff --git a/hw/arm/smmuv3.c b/hw/arm/smmuv3.c
index ca690513e6..ac1de572f3 100644
--- a/hw/arm/smmuv3.c
+++ b/hw/arm/smmuv3.c
@@ -929,7 +929,7 @@ static void smmuv3_s1_range_inval(SMMUState *s, Cmd *cmd)
 }
 }
 
-static void smmuv3_notify_config_change(SMMUState *bs, uint32_t sid)
+static int smmuv3_notify_config_change(SMMUState *bs, uint32_t sid)
 {
 #ifdef __linux__
 IOMMUMemoryRegion *mr = smmu_iommu_mr(bs, sid);
@@ -938,9 +938,10 @@ static void smmuv3_notify_config_change(SMMUState *bs, 
uint32_t sid)
 IOMMUConfig iommu_config = {};
 SMMUTransCfg *cfg;
 SMMUDevice *sdev;
+int ret;
 
 if (!mr) {
-return;
+return 0;
 }
 
 sdev = container_of(mr, SMMUDevice, iommu);
@@ -949,13 +950,13 @@ static void smmuv3_notify_config_change(SMMUState *bs, 
uint32_t sid)
 smmuv3_flush_config(sdev);
 
 if (!pci_device_is_pasid_ops_set(sdev->bus, sdev->devfn)) {
-return;
+return 0;
 }
 
 cfg = smmuv3_get_config(sdev, &event);
 
 if (!cfg) {
-return;
+return 0;
 }
 
 iommu_config.pasid_cfg.argsz = sizeof(struct iommu_pasid_table_config);
@@ -977,10 +978,13 @@ static void smmuv3_notify_config_change(SMMUState *bs, 
uint32_t sid)
   iommu_config.pasid_cfg.config,
   iommu_config.pasid_cfg.base_ptr);
 
-if (pci_device_set_pasid_table(sdev->bus, sdev->devfn, &iommu_config)) {
+ret = pci_device_set_pasid_table(sdev->bus, sdev->devfn, &iommu_config);
+if (ret) {
 error_report("Failed to pass PASID table to host for iommu mr %s (%m)",
  mr->parent_obj.name);
 }
+
+return ret;
 #endif
 }
 
@@ -1545,6 +1549,24 @@ static void smmu_realize(DeviceState *d, Error **errp)
 smmu_init_irq(s, dev);
 }
 
+static int smmuv3_post_load(void *opaque, int version_id)
+{
+SMMUv3State *s3 = opaque;
+SMMUState *s = &(s3->smmu_state);
+SMMUDevice *sdev;
+int ret = 0;
+
+QLIST_FOREACH(sdev, &s->devices_with_notifiers, next) {
+uint32_t sid = smmu_get_sid(sdev);
+ret = smmuv3_notify_config_change(s, sid);
+if (ret) {
+break;
+}
+}
+
+return ret;
+}
+
 static const VMStateDescription vmstate_smmuv3_queue = {
 .name = "smmuv3_queue",
 .version_id = 1,
@@ -1563,6 +1585,7 @@ static const VMStateDescription vmstate_smmuv3 = {
 .version_id = 1,
 .minimum_version_id = 1,
 .priority = MIG_PRI_IOMMU,
+.post_load = smmuv3_post_load,
 .fields = (VMStateField[]) {
 VMSTATE_UINT32(features, SMMUv3State),
 VMSTATE_UINT8(sid_size, SMMUv3State),
-- 
2.23.0




[RFC PATCH v3 2/4] vfio: Add vfio_prereg_listener_log_sync in nested stage

2021-05-10 Thread Kunkun Jiang
In nested mode, we set up the stage 2 (gpa->hpa) and stage 1
(giova->gpa) mappings separately with vfio_prereg_listener_region_add()
and vfio_listener_region_add(). So when marking dirty pages
we just need to pay attention to stage 2 mappings.

The legacy vfio_listener_log_sync cannot be used in the nested stage.
This patch adds vfio_prereg_listener_log_sync to mark dirty
pages in nested mode.

Signed-off-by: Kunkun Jiang 
---
 hw/vfio/common.c | 27 +++
 1 file changed, 27 insertions(+)

diff --git a/hw/vfio/common.c b/hw/vfio/common.c
index 9fb8d44a6d..149e535a75 100644
--- a/hw/vfio/common.c
+++ b/hw/vfio/common.c
@@ -1284,6 +1284,22 @@ static int 
vfio_dma_sync_ram_section_dirty_bitmap(VFIOContainer *container,
 int128_get64(section->size), ram_addr);
 }
 
+static void vfio_prereg_listener_log_sync(MemoryListener *listener,
+  MemoryRegionSection *section)
+{
+VFIOContainer *container =
+container_of(listener, VFIOContainer, prereg_listener);
+
+if (!memory_region_is_ram(section->mr) ||
+!container->dirty_pages_supported) {
+return;
+}
+
+if (vfio_devices_all_dirty_tracking(container)) {
+vfio_dma_sync_ram_section_dirty_bitmap(container, section);
+}
+}
+
 typedef struct {
 IOMMUNotifier n;
 VFIOGuestIOMMU *giommu;
@@ -1328,6 +1344,16 @@ static int vfio_sync_dirty_bitmap(VFIOContainer 
*container,
 if (memory_region_is_iommu(section->mr)) {
 VFIOGuestIOMMU *giommu;
 
+/*
+ * In nested mode, stage 2 (gpa->hpa) and stage 1 (giova->gpa) are
+ * set up separately. It is inappropriate to pass 'giova' to kernel
+ * to get dirty pages. We only need to focus on stage 2 mapping when
+ * marking dirty pages.
+ */
+if (container->iommu_type == VFIO_TYPE1_NESTING_IOMMU) {
+return 0;
+}
+
 QLIST_FOREACH(giommu, &container->giommu_list, giommu_next) {
 if (MEMORY_REGION(giommu->iommu) == section->mr &&
 giommu->n.start == section->offset_within_region) {
@@ -1382,6 +1408,7 @@ static const MemoryListener vfio_memory_listener = {
 static MemoryListener vfio_memory_prereg_listener = {
 .region_add = vfio_prereg_listener_region_add,
 .region_del = vfio_prereg_listener_region_del,
+.log_sync = vfio_prereg_listener_log_sync,
 };
 
 static void vfio_listener_release(VFIOContainer *container)
-- 
2.23.0




[RFC PATCH v3 3/4] vfio: Add vfio_prereg_listener_global_log_start/stop in nested stage

2021-05-10 Thread Kunkun Jiang
In nested mode, we set up the stage 2 and stage 1 mappings separately.
In my opinion, vfio_memory_prereg_listener is used for stage 2 and
vfio_memory_listener is used for stage 1. So it feels weird to call
the global_log_start/stop interface in vfio_memory_listener to switch
dirty tracking, although this won't cause any errors. Adding the
global_log_start/stop interfaces to vfio_memory_prereg_listener
separates stage 2 handling from stage 1.

Signed-off-by: Kunkun Jiang 
---
 hw/vfio/common.c | 24 
 1 file changed, 24 insertions(+)

diff --git a/hw/vfio/common.c b/hw/vfio/common.c
index 149e535a75..6e004fc00f 100644
--- a/hw/vfio/common.c
+++ b/hw/vfio/common.c
@@ -1209,6 +1209,17 @@ static void 
vfio_listener_log_global_start(MemoryListener *listener)
 {
 VFIOContainer *container = container_of(listener, VFIOContainer, listener);
 
+/* For nested mode, vfio_prereg_listener is used to start dirty tracking */
+if (container->iommu_type != VFIO_TYPE1_NESTING_IOMMU) {
+vfio_set_dirty_page_tracking(container, true);
+}
+}
+
+static void vfio_prereg_listener_log_global_start(MemoryListener *listener)
+{
+VFIOContainer *container =
+container_of(listener, VFIOContainer, prereg_listener);
+
 vfio_set_dirty_page_tracking(container, true);
 }
 
@@ -1216,6 +1227,17 @@ static void vfio_listener_log_global_stop(MemoryListener 
*listener)
 {
 VFIOContainer *container = container_of(listener, VFIOContainer, listener);
 
+/* For nested mode, vfio_prereg_listener is used to stop dirty tracking */
+if (container->iommu_type != VFIO_TYPE1_NESTING_IOMMU) {
+vfio_set_dirty_page_tracking(container, false);
+}
+}
+
+static void vfio_prereg_listener_log_global_stop(MemoryListener *listener)
+{
+VFIOContainer *container =
+container_of(listener, VFIOContainer, prereg_listener);
+
 vfio_set_dirty_page_tracking(container, false);
 }
 
@@ -1408,6 +1430,8 @@ static const MemoryListener vfio_memory_listener = {
 static MemoryListener vfio_memory_prereg_listener = {
 .region_add = vfio_prereg_listener_region_add,
 .region_del = vfio_prereg_listener_region_del,
+.log_global_start = vfio_prereg_listener_log_global_start,
+.log_global_stop = vfio_prereg_listener_log_global_stop,
 .log_sync = vfio_prereg_listener_log_sync,
 };
 
-- 
2.23.0




[RFC PATCH v3 0/4] Add migration support for VFIO PCI devices in SMMUv3 nested mode

2021-05-10 Thread Kunkun Jiang
Hi all,

Since SMMUv3 nested translation stage support has been introduced by Eric, we
need to pay attention to the migration of VFIO PCI devices in SMMUv3 nested
stage mode. At present, it is not yet supported in QEMU. There are two problems
in the existing framework.

First, the current way to get dirty pages is not applicable to nested stage
mode. Because of the "Caching Mode", VTD can map the RAM through the host
single stage (giova->hpa). "vfio_listener_log_sync" gets dirty pages by
transferring "giova" to the kernel for the RAM block section of a mapped MMIO
region. In nested stage mode, we set up the stage 2 (gpa->hpa) and the stage 1
(giova->gpa) separately. So the current way of getting dirty pages is
inapplicable in nested stage mode.

Second, we also need to pass the stage 1 configurations to the destination host
after the migration. In Eric's series, the stage 1 configuration is passed to
the host on each STE update for the devices that set the PASID PciOps, and it
is applied at the physical level. But this physical-level state is not sent to
the destination host. So we have to pass the stage 1 configurations to the
destination host after the migration.

This Patch set includes patches as below:
Patch 1-2:
- Refactor the vfio_listener_log_sync and added a new function to get dirty
pages for nested mode

Patch 3:
- Added global_log_start/stop interface in vfio_memory_prereg_listener for
nested mode

Patch 4:
- Added the post_load function to vmstate_smmuv3 for passing stage 1
configuration to the destination host after the migration

Best regards,
Kunkun Jiang

History:

v2 -> v3:
- Rebase to v9 of Eric's series 'vSMMUv3/pSMMUv3 2 stage VFIO integration'[1]
- Delete smmuv3_manual_set_pci_device_pasid_table() and reuse
  smmuv3_notify_config_change() [Eric]

v1 -> v2:
- Add global_log_start/stop interface in vfio_memory_prereg_listener
- Add support for repass stage 1 configs with multiple CDs to the host

[1] [RFC v9 00/29] vSMMUv3/pSMMUv3 2 stage VFIO integration
https://lore.kernel.org/qemu-devel/20210411120912.15770-1-eric.au...@redhat.com/

Kunkun Jiang (4):
  vfio: Introduce helpers to mark dirty pages of a RAM section
  vfio: Add vfio_prereg_listener_log_sync in nested stage
  vfio: Add vfio_prereg_listener_global_log_start/stop in nested stage
  hw/arm/smmuv3: Post-load stage 1 configurations to the host

 hw/arm/smmuv3.c  | 33 ++
 hw/vfio/common.c | 73 ++--
 2 files changed, 93 insertions(+), 13 deletions(-)

-- 
2.23.0




Re: [RFC PATCH 0/3] vfio/migration: Support manual clear vfio dirty log

2021-05-10 Thread Kunkun Jiang

Hi all,

Sorry for my carelessness.
This is the v2 of this series.

Thanks,
Kunkun Jiang

On 2021/5/8 17:31, Kunkun Jiang wrote:

In the past, we cleared the dirty log immediately after syncing it to
userspace. This may cause redundant dirty handling if userspace
handles the dirty log iteratively:

After vfio clears the dirty log, new dirty log entries start to be
generated. These new entries will be reported to userspace even if they
were generated before userspace handled the same dirty pages.

Since a new dirty log tracking method for vfio based on iommu hwdbm[1]
has been introduced in the kernel, adding a new capability named
VFIO_DIRTY_LOG_MANUAL_CLEAR, we can eliminate some redundant dirty
handling by supporting it.

This series includes the following patches:
Patch 1:
- updated the linux-headers/linux/vfio.h from kernel side

Patch 2:
- introduced 'struct VFIODMARange' to describe a range of the given DMA
   mapping and with respect to a VFIO_IOMMU_MAP_DMA operation

Patch 3:
- implemented the operation to manual clear vfio dirty log, which can
   eliminate some redundant dirty handling

History:
v1 -> v2:
- Add a new ioctl flag, VFIO_IOMMU_DIRTY_PAGES_FLAG_GET_BITMAP_NOCLEAR, to get
   the vfio dirty log when manual clear is supported.

Thanks,
Kunkun Jiang

[1]
IOMMU part: 
https://lore.kernel.org/linux-iommu/20210507102211.8836-1-zhukeqi...@huawei.com/
VFIO part: 
https://lore.kernel.org/kvm/20210507103608.39440-1-zhukeqi...@huawei.com/

Zenghui Yu (3):
   linux-headers: update against 5.12 and "manual clear vfio dirty log"
 series
   vfio: Maintain DMA mapping range for the container
   vfio/migration: Add support for manual clear vfio dirty log

  hw/vfio/common.c  | 211 --
  include/hw/vfio/vfio-common.h |  10 ++
  linux-headers/linux/vfio.h|  61 +-
  3 files changed, 273 insertions(+), 9 deletions(-)






[RFC PATCH 2/3] vfio: Maintain DMA mapping range for the container

2021-05-08 Thread Kunkun Jiang
From: Zenghui Yu 

When synchronizing dirty bitmap from kernel VFIO we do it in a
per-iova-range fashion and we allocate the userspace bitmap for each of the
ioctl. This patch introduces `struct VFIODMARange` to describe a range of
the given DMA mapping with respect to a VFIO_IOMMU_MAP_DMA operation, and
make the bitmap cache of this range be persistent so that we don't need to
g_try_malloc0() every time. Note that the new structure is almost a copy of
`struct vfio_iommu_type1_dma_map` but only internally used by QEMU.

More importantly, the cached per-iova-range dirty bitmap will be further
used when we want to add support for the CLEAR_BITMAP and this cached
bitmap will be used to guarantee we don't clear any unknown dirty bits
otherwise that can be a severe data loss issue for migration code.

It's pretty intuitive to maintain a bitmap per container since we perform
log_sync at this granule. But I don't know how to deal with things like
memory hot-{un}plug, sparse DMA mappings, etc. Suggestions welcome.

* yet something to-do:
  - can't work with guest viommu
  - no locks
  - etc

[ The idea and even the commit message are largely inherited from kvm side.
  See commit 9f4bf4baa8b820c7930e23c9566c9493db7e1d25. ]

Signed-off-by: Zenghui Yu 
Signed-off-by: Kunkun Jiang 
---
 hw/vfio/common.c  | 62 +++
 include/hw/vfio/vfio-common.h |  9 +
 2 files changed, 65 insertions(+), 6 deletions(-)

diff --git a/hw/vfio/common.c b/hw/vfio/common.c
index ae5654fcdb..b8b6418e69 100644
--- a/hw/vfio/common.c
+++ b/hw/vfio/common.c
@@ -421,6 +421,29 @@ unmap_exit:
 return ret;
 }
 
+static VFIODMARange *vfio_lookup_match_range(VFIOContainer *container,
+hwaddr start_addr, hwaddr size)
+{
+VFIODMARange *qrange;
+
+QLIST_FOREACH(qrange, >dma_list, next) {
+if (qrange->iova == start_addr && qrange->size == size) {
+return qrange;
+}
+}
+return NULL;
+}
+
+static void vfio_dma_range_init_dirty_bitmap(VFIODMARange *qrange)
+{
+uint64_t pages, size;
+
+pages = REAL_HOST_PAGE_ALIGN(qrange->size) / qemu_real_host_page_size;
+size = ROUND_UP(pages, sizeof(__u64) * BITS_PER_BYTE) / BITS_PER_BYTE;
+
+qrange->bitmap = g_malloc0(size);
+}
+
 /*
  * DMA - Mapping and unmapping for the "type1" IOMMU interface used on x86
  */
@@ -434,12 +457,29 @@ static int vfio_dma_unmap(VFIOContainer *container,
 .iova = iova,
 .size = size,
 };
+VFIODMARange *qrange;
 
 if (iotlb && container->dirty_pages_supported &&
 vfio_devices_all_running_and_saving(container)) {
 return vfio_dma_unmap_bitmap(container, iova, size, iotlb);
 }
 
+/*
+ * unregister the DMA range
+ *
+ * It seems that the memory layer will give us the same section as the one
+ * used in region_add(). Otherwise it'll be complicated to manipulate the
+ * bitmap across region_{add,del}. Is there any guarantee?
+ *
+ * But there is really not such a restriction on the kernel interface
+ * (VFIO_IOMMU_DIRTY_PAGES_FLAG_{UN}MAP_DMA, etc).
+ */
+qrange = vfio_lookup_match_range(container, iova, size);
+assert(qrange);
+g_free(qrange->bitmap);
+QLIST_REMOVE(qrange, next);
+g_free(qrange);
+
 while (ioctl(container->fd, VFIO_IOMMU_UNMAP_DMA, )) {
 /*
  * The type1 backend has an off-by-one bug in the kernel (71a7d3d78e3c
@@ -476,6 +516,14 @@ static int vfio_dma_map(VFIOContainer *container, hwaddr 
iova,
 .iova = iova,
 .size = size,
 };
+VFIODMARange *qrange;
+
+qrange = g_malloc0(sizeof(*qrange));
+qrange->iova = iova;
+qrange->size = size;
+QLIST_INSERT_HEAD(>dma_list, qrange, next);
+/* XXX allocate the dirty bitmap on demand */
+vfio_dma_range_init_dirty_bitmap(qrange);
 
 if (!readonly) {
 map.flags |= VFIO_DMA_MAP_FLAG_WRITE;
@@ -1023,9 +1071,14 @@ static int vfio_get_dirty_bitmap(VFIOContainer 
*container, uint64_t iova,
 {
 struct vfio_iommu_type1_dirty_bitmap *dbitmap;
 struct vfio_iommu_type1_dirty_bitmap_get *range;
+VFIODMARange *qrange;
 uint64_t pages;
 int ret;
 
+qrange = vfio_lookup_match_range(container, iova, size);
+/* the same as vfio_dma_unmap() */
+assert(qrange);
+
 dbitmap = g_malloc0(sizeof(*dbitmap) + sizeof(*range));
 
 dbitmap->argsz = sizeof(*dbitmap) + sizeof(*range);
@@ -1044,11 +1097,8 @@ static int vfio_get_dirty_bitmap(VFIOContainer 
*container, uint64_t iova,
 pages = REAL_HOST_PAGE_ALIGN(range->size) / qemu_real_host_page_size;
 range->bitmap.size = ROUND_UP(pages, sizeof(__u64) * BITS_PER_BYTE) /
  BITS_PER_BYTE;
-range->bitmap.data = g_try_malloc0(range->bitmap.size);
-if (!range->bitmap.data) {
-ret = -ENOMEM;
-goto err_out;
-}

[RFC PATCH 1/3] linux-headers: update against 5.12 and "manual clear vfio dirty log" series

2021-05-08 Thread Kunkun Jiang
From: Zenghui Yu 

The new capability VFIO_DIRTY_LOG_MANUAL_CLEAR and the new ioctl flags
VFIO_IOMMU_DIRTY_PAGES_FLAG_GET_BITMAP_NOCLEAR and
VFIO_IOMMU_DIRTY_PAGES_FLAG_CLEAR_BITMAP have been introduced in
the kernel; update the header to add them.

Signed-off-by: Zenghui Yu 
Signed-off-by: Kunkun Jiang 
---
 linux-headers/linux/vfio.h | 61 +-
 1 file changed, 60 insertions(+), 1 deletion(-)

diff --git a/linux-headers/linux/vfio.h b/linux-headers/linux/vfio.h
index 609099e455..7e918c697f 100644
--- a/linux-headers/linux/vfio.h
+++ b/linux-headers/linux/vfio.h
@@ -46,6 +46,20 @@
  */
 #define VFIO_NOIOMMU_IOMMU 8
 
+/* Supports VFIO_DMA_UNMAP_FLAG_ALL */
+#define VFIO_UNMAP_ALL 9
+
+/* Supports the vaddr flag for DMA map and unmap */
+#define VFIO_UPDATE_VADDR  10
+
+/*
+ * The vfio_iommu driver may support user clears dirty log manually, which 
means
+ * dirty log is not cleared automatically after dirty log is copied to 
userspace,
+ * it's user's duty to clear dirty log. Note: when user queries this extension
+ * and vfio_iommu driver supports it, then it is enabled.
+ */
+#define VFIO_DIRTY_LOG_MANUAL_CLEAR11
+
 /*
  * The IOCTL interface is designed for extensibility by embedding the
  * structure length (argsz) and flags into structures passed between
@@ -1074,12 +1088,22 @@ struct vfio_iommu_type1_info_dma_avail {
  *
  * Map process virtual addresses to IO virtual addresses using the
  * provided struct vfio_dma_map. Caller sets argsz. READ &/ WRITE required.
+ *
+ * If flags & VFIO_DMA_MAP_FLAG_VADDR, update the base vaddr for iova, and
+ * unblock translation of host virtual addresses in the iova range.  The vaddr
+ * must have previously been invalidated with VFIO_DMA_UNMAP_FLAG_VADDR.  To
+ * maintain memory consistency within the user application, the updated vaddr
+ * must address the same memory object as originally mapped.  Failure to do so
+ * will result in user memory corruption and/or device misbehavior.  iova and
+ * size must match those in the original MAP_DMA call.  Protection is not
+ * changed, and the READ & WRITE flags must be 0.
  */
 struct vfio_iommu_type1_dma_map {
__u32   argsz;
__u32   flags;
 #define VFIO_DMA_MAP_FLAG_READ (1 << 0)/* readable from device 
*/
 #define VFIO_DMA_MAP_FLAG_WRITE (1 << 1)   /* writable from device */
+#define VFIO_DMA_MAP_FLAG_VADDR (1 << 2)
__u64   vaddr;  /* Process virtual address */
__u64   iova;   /* IO virtual address */
__u64   size;   /* Size of mapping (bytes) */
@@ -1102,6 +1126,7 @@ struct vfio_bitmap {
  * field.  No guarantee is made to the user that arbitrary unmaps of iova
  * or size different from those used in the original mapping call will
  * succeed.
+ *
  * VFIO_DMA_UNMAP_FLAG_GET_DIRTY_BITMAP should be set to get the dirty bitmap
  * before unmapping IO virtual addresses. When this flag is set, the user must
  * provide a struct vfio_bitmap in data[]. User must provide zero-allocated
@@ -,11 +1136,21 @@ struct vfio_bitmap {
  * indicates that the page at that offset from iova is dirty. A Bitmap of the
  * pages in the range of unmapped size is returned in the user-provided
  * vfio_bitmap.data.
+ *
+ * If flags & VFIO_DMA_UNMAP_FLAG_ALL, unmap all addresses.  iova and size
+ * must be 0.  This cannot be combined with the get-dirty-bitmap flag.
+ *
+ * If flags & VFIO_DMA_UNMAP_FLAG_VADDR, do not unmap, but invalidate host
+ * virtual addresses in the iova range.  Tasks that attempt to translate an
+ * iova's vaddr will block.  DMA to already-mapped pages continues.  This
+ * cannot be combined with the get-dirty-bitmap flag.
  */
 struct vfio_iommu_type1_dma_unmap {
__u32   argsz;
__u32   flags;
 #define VFIO_DMA_UNMAP_FLAG_GET_DIRTY_BITMAP (1 << 0)
+#define VFIO_DMA_UNMAP_FLAG_ALL (1 << 1)
+#define VFIO_DMA_UNMAP_FLAG_VADDR   (1 << 2)
__u64   iova;   /* IO virtual address */
__u64   size;   /* Size of mapping (bytes) */
__u8data[];
@@ -1161,7 +1196,29 @@ struct vfio_iommu_type1_dma_unmap {
  * actual bitmap. If dirty pages logging is not enabled, an error will be
  * returned.
  *
- * Only one of the flags _START, _STOP and _GET may be specified at a time.
+ * The VFIO_IOMMU_DIRTY_PAGES_FLAG_GET_BITMAP_NOCLEAR flag is almost same as
+ * VFIO_IOMMU_DIRTY_PAGES_FLAG_GET_BITMAP, except that it requires underlying
+ * dirty bitmap is not cleared automatically. The user can clear it manually by
+ * calling the IOCTL with VFIO_IOMMU_DIRTY_PAGES_FLAG_CLEAR_BITMAP flag set.
+ *
+ * Calling the IOCTL with VFIO_IOMMU_DIRTY_PAGES_FLAG_CLEAR_BITMAP flag set,
+ * instructs the IOMMU driver to clear the dirty status of pages in a bitmap
+ * for IOMMU

[RFC PATCH 3/3] vfio/migration: Add support for manual clear vfio dirty log

2021-05-08 Thread Kunkun Jiang
From: Zenghui Yu 

The new capability VFIO_DIRTY_LOG_MANUAL_CLEAR and the new ioctl flags
VFIO_IOMMU_DIRTY_PAGES_FLAG_GET_BITMAP_NOCLEAR and
VFIO_IOMMU_DIRTY_PAGES_FLAG_CLEAR_BITMAP have been introduced in
the kernel; tweak the userspace side to use them.

Check if the kernel supports VFIO_DIRTY_LOG_MANUAL_CLEAR and
provide the log_clear() hook for vfio_memory_listener. If the
kernel supports it, deliver the clear request to the kernel.

Signed-off-by: Zenghui Yu 
Signed-off-by: Kunkun Jiang 
---
 hw/vfio/common.c  | 149 +-
 include/hw/vfio/vfio-common.h |   1 +
 2 files changed, 148 insertions(+), 2 deletions(-)

diff --git a/hw/vfio/common.c b/hw/vfio/common.c
index b8b6418e69..9c41a36a61 100644
--- a/hw/vfio/common.c
+++ b/hw/vfio/common.c
@@ -1082,7 +1082,9 @@ static int vfio_get_dirty_bitmap(VFIOContainer 
*container, uint64_t iova,
 dbitmap = g_malloc0(sizeof(*dbitmap) + sizeof(*range));
 
 dbitmap->argsz = sizeof(*dbitmap) + sizeof(*range);
-dbitmap->flags = VFIO_IOMMU_DIRTY_PAGES_FLAG_GET_BITMAP;
+dbitmap->flags = container->dirty_log_manual_clear ?
+ VFIO_IOMMU_DIRTY_PAGES_FLAG_GET_BITMAP_NOCLEAR :
+ VFIO_IOMMU_DIRTY_PAGES_FLAG_GET_BITMAP;
 range = (struct vfio_iommu_type1_dirty_bitmap_get *)>data;
 range->iova = iova;
 range->size = size;
@@ -1213,12 +1215,148 @@ static void vfio_listener_log_sync(MemoryListener 
*listener,
 }
 }
 
+/*
+ * I'm not sure if there's any alignment requirement for the CLEAR_BITMAP
+ * ioctl. But copy from kvm side and align {start, size} with 64 pages.
+ *
+ * I think the code can be simplified a lot if no alignment requirement.
+ */
+#define VFIO_CLEAR_LOG_SHIFT  6
+#define VFIO_CLEAR_LOG_ALIGN  (qemu_real_host_page_size << 
VFIO_CLEAR_LOG_SHIFT)
+#define VFIO_CLEAR_LOG_MASK   (-VFIO_CLEAR_LOG_ALIGN)
+
+static int vfio_log_clear_one_range(VFIOContainer *container,
+VFIODMARange *qrange, uint64_t start, uint64_t size)
+{
+struct vfio_iommu_type1_dirty_bitmap *dbitmap;
+struct vfio_iommu_type1_dirty_bitmap_get *range;
+
+dbitmap = g_malloc0(sizeof(*dbitmap) + sizeof(*range));
+
+dbitmap->argsz = sizeof(*dbitmap) + sizeof(*range);
+dbitmap->flags = VFIO_IOMMU_DIRTY_PAGES_FLAG_CLEAR_BITMAP;
+range = (struct vfio_iommu_type1_dirty_bitmap_get *)>data;
+
+/*
+ * Now let's deal with the actual bitmap, which is almost the same
+ * as the kvm side.
+ */
+uint64_t end, bmap_start, start_delta, bmap_npages;
+unsigned long *bmap_clear = NULL, psize = qemu_real_host_page_size;
+int ret;
+
+bmap_start = start & VFIO_CLEAR_LOG_MASK;
+start_delta = start - bmap_start;
+bmap_start /= psize;
+
+bmap_npages = DIV_ROUND_UP(size + start_delta, VFIO_CLEAR_LOG_ALIGN)
+<< VFIO_CLEAR_LOG_SHIFT;
+end = qrange->size / psize;
+if (bmap_npages > end - bmap_start) {
+bmap_npages = end - bmap_start;
+}
+start_delta /= psize;
+
+if (start_delta) {
+bmap_clear = bitmap_new(bmap_npages);
+bitmap_copy_with_src_offset(bmap_clear, qrange->bitmap,
+bmap_start, start_delta + size / psize);
+bitmap_clear(bmap_clear, 0, start_delta);
+range->bitmap.data = (__u64 *)bmap_clear;
+} else {
+range->bitmap.data = (__u64 *)(qrange->bitmap + BIT_WORD(bmap_start));
+}
+
+range->iova = qrange->iova + bmap_start * psize;
+range->size = bmap_npages * psize;
+range->bitmap.size = ROUND_UP(bmap_npages, sizeof(__u64) * BITS_PER_BYTE) /
+   BITS_PER_BYTE;
+range->bitmap.pgsize = qemu_real_host_page_size;
+
+ret = ioctl(container->fd, VFIO_IOMMU_DIRTY_PAGES, dbitmap);
+if (ret) {
+error_report("Failed to clear dirty log for iova: 0x%"PRIx64
+" size: 0x%"PRIx64" err: %d", (uint64_t)range->iova,
+(uint64_t)range->size, errno);
+goto err_out;
+}
+
+bitmap_clear(qrange->bitmap, bmap_start + start_delta, size / psize);
+err_out:
+g_free(bmap_clear);
+g_free(dbitmap);
+return 0;
+}
+
+static int vfio_physical_log_clear(VFIOContainer *container,
+   MemoryRegionSection *section)
+{
+uint64_t start, size, offset, count;
+VFIODMARange *qrange;
+int ret = 0;
+
+if (!container->dirty_log_manual_clear) {
+/* No need to do explicit clear */
+return ret;
+}
+
+start = section->offset_within_address_space;
+size = int128_get64(section->size);
+
+if (!size) {
+return ret;
+}
+
+QLIST_FOREACH(qrange, >dma_list, next) {
+/*
+ * Discard ranges that do not overlap the section (e.g., the
+ * Memory BAR regions of the device)
+ */
+if (qrange->i

[RFC PATCH 0/3] vfio/migration: Support manual clear vfio dirty log

2021-05-08 Thread Kunkun Jiang
In the past, we cleared the dirty log immediately after syncing it to
userspace. This may cause redundant dirty handling if userspace
handles the dirty log iteratively:

After vfio clears the dirty log, new dirty log entries start to be
generated. These new entries will be reported to userspace even if they
were generated before userspace handled the same dirty pages.

Since a new dirty log tracking method for vfio based on iommu hwdbm[1]
has been introduced in the kernel, adding a new capability named
VFIO_DIRTY_LOG_MANUAL_CLEAR, we can eliminate some redundant dirty
handling by supporting it.
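
For reference, the intended userspace flow is roughly the following. This is
only a sketch built on the uAPI added in patch 1: container_fd, iova/size and
the bitmap buffer are placeholders, a 4K host page size is assumed, and error
handling is omitted. Manual clear itself is enabled by querying
VFIO_CHECK_EXTENSION with VFIO_DIRTY_LOG_MANUAL_CLEAR beforehand.

#include <stdlib.h>
#include <sys/ioctl.h>
#include <linux/vfio.h>

/* Sync the dirty log of one mapped range without clearing it in the kernel,
 * handle the dirty pages, then clear exactly what was handled. */
static void sync_then_clear(int container_fd, __u64 iova, __u64 size,
                            __u64 *bitmap, __u64 bitmap_bytes)
{
    struct vfio_iommu_type1_dirty_bitmap *db;
    struct vfio_iommu_type1_dirty_bitmap_get *range;

    db = calloc(1, sizeof(*db) + sizeof(*range));
    db->argsz = sizeof(*db) + sizeof(*range);
    range = (struct vfio_iommu_type1_dirty_bitmap_get *)&db->data;
    range->iova = iova;
    range->size = size;
    range->bitmap.pgsize = 4096;        /* assumed host page size */
    range->bitmap.size = bitmap_bytes;
    range->bitmap.data = bitmap;

    /* 1. Get the dirty log; the kernel does not clear it. */
    db->flags = VFIO_IOMMU_DIRTY_PAGES_FLAG_GET_BITMAP_NOCLEAR;
    ioctl(container_fd, VFIO_IOMMU_DIRTY_PAGES, db);

    /* 2. ... handle (e.g. send) the pages marked dirty in 'bitmap' ... */

    /* 3. Clear the dirty status of exactly the pages we just handled. */
    db->flags = VFIO_IOMMU_DIRTY_PAGES_FLAG_CLEAR_BITMAP;
    ioctl(container_fd, VFIO_IOMMU_DIRTY_PAGES, db);

    free(db);
}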

This series includes the following patches:
Patch 1:
- updated the linux-headers/linux/vfio.h from kernel side

Patch 2:
- introduced 'struct VFIODMARange' to describe a range of the given DMA
  mapping and with respect to a VFIO_IOMMU_MAP_DMA operation

Patch 3:
- implemented the operation to manual clear vfio dirty log, which can
  eliminate some redundant dirty handling

History:
v1 -> v2:
- Add a new ioctl flag, VFIO_IOMMU_DIRTY_PAGES_FLAG_GET_BITMAP_NOCLEAR, to get
  the vfio dirty log when manual clear is supported.

Thanks,
Kunkun Jiang

[1]
IOMMU part: 
https://lore.kernel.org/linux-iommu/20210507102211.8836-1-zhukeqi...@huawei.com/
VFIO part: 
https://lore.kernel.org/kvm/20210507103608.39440-1-zhukeqi...@huawei.com/

Zenghui Yu (3):
  linux-headers: update against 5.12 and "manual clear vfio dirty log"
series
  vfio: Maintain DMA mapping range for the container
  vfio/migration: Add support for manual clear vfio dirty log

 hw/vfio/common.c  | 211 --
 include/hw/vfio/vfio-common.h |  10 ++
 linux-headers/linux/vfio.h|  61 +-
 3 files changed, 273 insertions(+), 9 deletions(-)

-- 
2.23.0




Re: [question] The source cannot recover, if the destination fails in the last round of live migration

2021-05-07 Thread Kunkun Jiang

Hi Dave,

On 2021/5/6 21:05, Dr. David Alan Gilbert wrote:

* Kunkun Jiang (jiangkun...@huawei.com) wrote:

Hi all,

Hi,


Recently I have been learning about the live migration code.
I have a question about the last round.

When the pending_size is less than the threshold, migration enters
the last round and calls migration_completion(). It stops the
source and sends the remaining dirty pages and the devices' state
information to the destination. The destination loads this
information and starts the VM.

If there is an error at the destination at this point, the destination
exits directly, and the source will not be able to detect the error
and recover, because the source never calls
migration_detect_error().

Is my understanding correct?
Should the source wait for the result of the last round from the destination?

Try setting the 'return-path' migration capability on both the source
and destination; I think it's that option that will cause the destination to
send an OK/error at the end and the source to wait for it.

Thank you for your reply!
The 'return-path' migration capability answered my question.
(For reference, it can be enabled with the HMP command
"migrate_set_capability return-path on" on both the source and the
destination, or with the QMP migrate-set-capabilities command.)

But why not enable it by default? In my opinion, it is a basic
capability for live migration: we need it to judge whether the last
round succeeded on the destination when migrating in 'precopy' mode.

Looking forward to your reply.

Thanks,
Kunkun Jiang

Dave


Thanks,
Kunkun Jiang







[question] The source cannot recover, if the destination fails in the last round of live migration

2021-05-06 Thread Kunkun Jiang

Hi all,

Recently I have been learning about the live migration code.
I have a question about the last round.

When the pending_size is less than the threshold, migration enters
the last round and calls migration_completion(). It stops the
source and sends the remaining dirty pages and the devices' state
information to the destination. The destination loads this
information and starts the VM.

If there is an error at the destination at this point, the destination
exits directly, and the source will not be able to detect the error
and recover, because the source never calls
migration_detect_error().

Is my understanding correct?
Should the source wait for the result of the last round from the destination?

Thanks,
Kunkun Jiang





Re: [RFC v9 15/29] vfio: Set up nested stage mappings

2021-04-28 Thread Kunkun Jiang

Hi Eric,

On 2021/4/27 3:16, Auger Eric wrote:

Hi Kunkun,

On 4/15/21 4:03 AM, Kunkun Jiang wrote:

Hi Eric,

On 2021/4/14 16:05, Auger Eric wrote:

Hi Kunkun,

On 4/14/21 3:45 AM, Kunkun Jiang wrote:

On 2021/4/13 20:57, Auger Eric wrote:

Hi Kunkun,

On 4/13/21 2:10 PM, Kunkun Jiang wrote:

Hi Eric,

On 2021/4/11 20:08, Eric Auger wrote:

In nested mode, legacy vfio_iommu_map_notify cannot be used as
there is no "caching" mode and we do not trap on map.

On Intel, vfio_iommu_map_notify was used to DMA map the RAM
through the host single stage.

With nested mode, we need to setup the stage 2 and the stage 1
separately. This patch introduces a prereg_listener to setup
the stage 2 mapping.

The stage 1 mapping, owned by the guest, is passed to the host
when the guest invalidates the stage 1 configuration, through
a dedicated PCIPASIDOps callback. Guest IOTLB invalidations
are cascaded downto the host through another IOMMU MR UNMAP
notifier.

Signed-off-by: Eric Auger 

---

v7 -> v8:
- properly handle new IOMMUTLBEntry fields and especially
  propagate DOMAIN and PASID based invalidations

v6 -> v7:
- remove PASID based invalidation

v5 -> v6:
- add error_report_err()
- remove the abort in case of nested stage case

v4 -> v5:
- use VFIO_IOMMU_SET_PASID_TABLE
- use PCIPASIDOps for config notification

v3 -> v4:
- use iommu_inv_pasid_info for ASID invalidation

v2 -> v3:
- use VFIO_IOMMU_ATTACH_PASID_TABLE
- new user API
- handle leaf

v1 -> v2:
- adapt to uapi changes
- pass the asid
- pass IOMMU_NOTIFIER_S1_CFG when initializing the config notifier
---
     hw/vfio/common.c | 139
+--
     hw/vfio/pci.c    |  21 +++
     hw/vfio/trace-events |   2 +
     3 files changed, 157 insertions(+), 5 deletions(-)

diff --git a/hw/vfio/common.c b/hw/vfio/common.c
index 0cd7ef2139..e369d451e7 100644
--- a/hw/vfio/common.c
+++ b/hw/vfio/common.c
@@ -595,6 +595,73 @@ static bool vfio_get_xlat_addr(IOMMUTLBEntry
*iotlb, void **vaddr,
     return true;
     }
     +/* Propagate a guest IOTLB invalidation to the host (nested
mode) */
+static void vfio_iommu_unmap_notify(IOMMUNotifier *n, IOMMUTLBEntry
*iotlb)
+{
+    VFIOGuestIOMMU *giommu = container_of(n, VFIOGuestIOMMU, n);
+    struct vfio_iommu_type1_cache_invalidate ustruct = {};
+    VFIOContainer *container = giommu->container;
+    int ret;
+
+    assert(iotlb->perm == IOMMU_NONE);
+
+    ustruct.argsz = sizeof(ustruct);
+    ustruct.flags = 0;
+    ustruct.info.argsz = sizeof(struct iommu_cache_invalidate_info);
+    ustruct.info.version = IOMMU_CACHE_INVALIDATE_INFO_VERSION_1;
+    ustruct.info.cache = IOMMU_CACHE_INV_TYPE_IOTLB;
+
+    switch (iotlb->granularity) {
+    case IOMMU_INV_GRAN_DOMAIN:
+    ustruct.info.granularity = IOMMU_INV_GRANU_DOMAIN;
+    break;
+    case IOMMU_INV_GRAN_PASID:
+    {
+    struct iommu_inv_pasid_info *pasid_info;
+    int archid = -1;
+
+    pasid_info = _info;
+    ustruct.info.granularity = IOMMU_INV_GRANU_PASID;
+    if (iotlb->flags & IOMMU_INV_FLAGS_ARCHID) {
+    pasid_info->flags |= IOMMU_INV_ADDR_FLAGS_ARCHID;
+    archid = iotlb->arch_id;
+    }
+    pasid_info->archid = archid;
+    trace_vfio_iommu_asid_inv_iotlb(archid);
+    break;
+    }
+    case IOMMU_INV_GRAN_ADDR:
+    {
+    hwaddr start = iotlb->iova + giommu->iommu_offset;
+    struct iommu_inv_addr_info *addr_info;
+    size_t size = iotlb->addr_mask + 1;
+    int archid = -1;
+
+    addr_info = _info;
+    ustruct.info.granularity = IOMMU_INV_GRANU_ADDR;
+    if (iotlb->leaf) {
+    addr_info->flags |= IOMMU_INV_ADDR_FLAGS_LEAF;
+    }
+    if (iotlb->flags & IOMMU_INV_FLAGS_ARCHID) {
+    addr_info->flags |= IOMMU_INV_ADDR_FLAGS_ARCHID;
+    archid = iotlb->arch_id;
+    }
+    addr_info->archid = archid;
+    addr_info->addr = start;
+    addr_info->granule_size = size;
+    addr_info->nb_granules = 1;
+    trace_vfio_iommu_addr_inv_iotlb(archid, start, size,
+    1, iotlb->leaf);
+    break;
+    }

Should we pass a size to  host kernel here, even if vSMMU doesn't
support
RIL or guest kernel doesn't use RIL?

It will cause TLBI issue in  this scenario: Guest kernel issues a
TLBI cmd
without "range" (tg = 0) to invalidate a 2M huge page. Then qemu
passed
the iova and size (4K) to host kernel. Finally, host kernel issues a
TLBI cmd
with "range" (4K) which can not invalidate the TLB entry of 2M huge
page.
(pSMMU supports RIL)

In that case the guest will loop over all 4K images belonging to the 2M
huge page and invalidate each of them. This should turn into qemu
notifications for each 4kB page, no? This is totally inefficient, hence

The guest will not loop over all 4K images belonging to th

Re: [RFC v9 14/29] vfio: Introduce helpers to DMA map/unmap a RAM section

2021-04-27 Thread Kunkun Jiang
+if (ret) {
+error_report("vfio_dma_unmap(%p, 0x%"HWADDR_PRIx", "
+ "0x%"HWADDR_PRIx") = %d (%m)",
+ container, iova, int128_get64(llsize), ret);
+}
+}
+}
+
  static void vfio_listener_region_add(MemoryListener *listener,
   MemoryRegionSection *section)
  {
  VFIOContainer *container = container_of(listener, VFIOContainer, 
listener);
  hwaddr iova, end;
-Int128 llend, llsize;
-void *vaddr;
+Int128 llend;
  int ret;
  VFIOHostDMAWindow *hostwin;
  Error *err = NULL;
@@ -814,39 +927,7 @@ static void vfio_listener_region_add(MemoryListener 
*listener,
  }
  
  /* Here we assume that memory_region_is_ram(section->mr)==true */

-
-vaddr = memory_region_get_ram_ptr(section->mr) +
-section->offset_within_region +
-(iova - section->offset_within_address_space);
-
-trace_vfio_listener_region_add_ram(iova, end, vaddr);
-
-llsize = int128_sub(llend, int128_make64(iova));
-
-if (memory_region_is_ram_device(section->mr)) {
-hwaddr pgmask = (1ULL << ctz64(hostwin->iova_pgsizes)) - 1;
-
-if ((iova & pgmask) || (int128_get64(llsize) & pgmask)) {
-trace_vfio_listener_region_add_no_dma_map(
-memory_region_name(section->mr),
-section->offset_within_address_space,
-int128_getlo(section->size),
-pgmask + 1);
-return;
-}
-}
-
-ret = vfio_dma_map(container, iova, int128_get64(llsize),
-   vaddr, section->readonly);
-if (ret) {
-error_setg(, "vfio_dma_map(%p, 0x%"HWADDR_PRIx", "
-   "0x%"HWADDR_PRIx", %p) = %d (%m)",
-   container, iova, int128_get64(llsize), vaddr, ret);
-if (memory_region_is_ram_device(section->mr)) {
-/* Allow unexpected mappings not to be fatal for RAM devices */
-error_report_err(err);
-return;
-}
+if (vfio_dma_map_ram_section(container, section, )) {
  goto fail;
  }
  
@@ -880,10 +961,6 @@ static void vfio_listener_region_del(MemoryListener *listener,

   MemoryRegionSection *section)
  {
  VFIOContainer *container = container_of(listener, VFIOContainer, 
listener);
-hwaddr iova, end;
-Int128 llend, llsize;
-int ret;
-bool try_unmap = true;
  
  if (vfio_listener_skipped_section(section)) {

  trace_vfio_listener_region_del_skip(
@@ -923,49 +1000,7 @@ static void vfio_listener_region_del(MemoryListener 
*listener,
   */
  }
Some code from vfio_listener_region_del() does not show up in the
quoted diff context.

I paste it below.


if (memory_region_is_iommu(section->mr)) {
    VFIOGuestIOMMU *giommu;

    QLIST_FOREACH(giommu, >giommu_list, giommu_next) {
    if (MEMORY_REGION(giommu->iommu) == section->mr &&
    giommu->n.start == section->offset_within_region) {
memory_region_unregister_iommu_notifier(section->mr,
>n);
    QLIST_REMOVE(giommu, giommu_next);
    g_free(giommu);
    break;
    }
    }

    /*
 * FIXME: We assume the one big unmap below is adequate to
 * remove any individual page mappings in the IOMMU which
 * might have been copied into VFIO. This works for a page table
 * based IOMMU where a big unmap flattens a large range of 
IO-PTEs.

 * That may not be true for all IOMMU types.
 */
    }

I think we need a check here: if it is nested mode, just return after
g_free(giommu), because in nested mode stage 2 (gpa->hpa) and
stage 1 (giova->gpa) are set up separately.

When hot-unplugging a PCI device, both vfio_listener_region_del()
and vfio_prereg_listener_region_del() will be called.
So it's not appropriate to unmap the RAM section in
vfio_listener_region_del(); the RAM section will be unmapped in
vfio_prereg_listener_region_del().
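
Something like the following, just a sketch on top of this patch to
illustrate what I mean (the exact placement after the notifier teardown
is my assumption):

@@ static void vfio_listener_region_del(MemoryListener *listener,
                 g_free(giommu);
                 break;
             }
         }
+
+        /*
+         * In nested mode, stage 2 (gpa->hpa) and stage 1 (giova->gpa) are
+         * set up separately, so don't do the big unmap below here; the RAM
+         * section is unmapped by vfio_prereg_listener_region_del() instead.
+         */
+        if (container->iommu_type == VFIO_TYPE1_NESTING_IOMMU) {
+            return;
+        }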

Thanks,
Kunkun Jiang
  
-iova = REAL_HOST_PAGE_ALIGN(section->offset_within_address_space);

-llend = int128_make64(section->offset_within_address_space);
-llend = int128_add(llend, section->size);
-llend = int128_and(llend, int128_exts64(qemu_real_host_page_mask));
-
-if (int128_ge(int128_make64(iova), llend)) {
-return;
-}
-end = int128_get64(int128_sub(llend, int128_one()));
-
-llsize = int128_sub(llend, int128_make64(iova));
-
-trace_vfio_listener_region_del(iova, end);
-
-if (memory_region_is_ram_device(section->mr)) {
-hwaddr pgmask;
-VFIOHostDMAWindow *hostwin = hostwin_from_range(container, iova, end);
-
-assert(hostwin); /* or region_add() would have failed */
-
-pgmask = (1ULL &

Re: [RFC v9 15/29] vfio: Set up nested stage mappings

2021-04-27 Thread Kunkun Jiang

Hi Eric,

On 2021/4/26 20:30, Auger Eric wrote:

Hi Kunkun,

On 4/14/21 3:45 AM, Kunkun Jiang wrote:

On 2021/4/13 20:57, Auger Eric wrote:

Hi Kunkun,

On 4/13/21 2:10 PM, Kunkun Jiang wrote:

Hi Eric,

On 2021/4/11 20:08, Eric Auger wrote:

In nested mode, legacy vfio_iommu_map_notify cannot be used as
there is no "caching" mode and we do not trap on map.

On Intel, vfio_iommu_map_notify was used to DMA map the RAM
through the host single stage.

With nested mode, we need to setup the stage 2 and the stage 1
separately. This patch introduces a prereg_listener to setup
the stage 2 mapping.

The stage 1 mapping, owned by the guest, is passed to the host
when the guest invalidates the stage 1 configuration, through
a dedicated PCIPASIDOps callback. Guest IOTLB invalidations
are cascaded downto the host through another IOMMU MR UNMAP
notifier.

Signed-off-by: Eric Auger 

---

v7 -> v8:
- properly handle new IOMMUTLBEntry fields and especially
     propagate DOMAIN and PASID based invalidations

v6 -> v7:
- remove PASID based invalidation

v5 -> v6:
- add error_report_err()
- remove the abort in case of nested stage case

v4 -> v5:
- use VFIO_IOMMU_SET_PASID_TABLE
- use PCIPASIDOps for config notification

v3 -> v4:
- use iommu_inv_pasid_info for ASID invalidation

v2 -> v3:
- use VFIO_IOMMU_ATTACH_PASID_TABLE
- new user API
- handle leaf

v1 -> v2:
- adapt to uapi changes
- pass the asid
- pass IOMMU_NOTIFIER_S1_CFG when initializing the config notifier
---
    hw/vfio/common.c | 139
+--
    hw/vfio/pci.c    |  21 +++
    hw/vfio/trace-events |   2 +
    3 files changed, 157 insertions(+), 5 deletions(-)

diff --git a/hw/vfio/common.c b/hw/vfio/common.c
index 0cd7ef2139..e369d451e7 100644
--- a/hw/vfio/common.c
+++ b/hw/vfio/common.c
@@ -595,6 +595,73 @@ static bool vfio_get_xlat_addr(IOMMUTLBEntry
*iotlb, void **vaddr,
    return true;
    }
    +/* Propagate a guest IOTLB invalidation to the host (nested
mode) */
+static void vfio_iommu_unmap_notify(IOMMUNotifier *n, IOMMUTLBEntry
*iotlb)
+{
+    VFIOGuestIOMMU *giommu = container_of(n, VFIOGuestIOMMU, n);
+    struct vfio_iommu_type1_cache_invalidate ustruct = {};
+    VFIOContainer *container = giommu->container;
+    int ret;
+
+    assert(iotlb->perm == IOMMU_NONE);
+
+    ustruct.argsz = sizeof(ustruct);
+    ustruct.flags = 0;
+    ustruct.info.argsz = sizeof(struct iommu_cache_invalidate_info);
+    ustruct.info.version = IOMMU_CACHE_INVALIDATE_INFO_VERSION_1;
+    ustruct.info.cache = IOMMU_CACHE_INV_TYPE_IOTLB;
+
+    switch (iotlb->granularity) {
+    case IOMMU_INV_GRAN_DOMAIN:
+    ustruct.info.granularity = IOMMU_INV_GRANU_DOMAIN;
+    break;
+    case IOMMU_INV_GRAN_PASID:
+    {
+    struct iommu_inv_pasid_info *pasid_info;
+    int archid = -1;
+
+    pasid_info = _info;
+    ustruct.info.granularity = IOMMU_INV_GRANU_PASID;
+    if (iotlb->flags & IOMMU_INV_FLAGS_ARCHID) {
+    pasid_info->flags |= IOMMU_INV_ADDR_FLAGS_ARCHID;
+    archid = iotlb->arch_id;
+    }
+    pasid_info->archid = archid;
+    trace_vfio_iommu_asid_inv_iotlb(archid);
+    break;
+    }
+    case IOMMU_INV_GRAN_ADDR:
+    {
+    hwaddr start = iotlb->iova + giommu->iommu_offset;
+    struct iommu_inv_addr_info *addr_info;
+    size_t size = iotlb->addr_mask + 1;
+    int archid = -1;
+
+    addr_info = _info;
+    ustruct.info.granularity = IOMMU_INV_GRANU_ADDR;
+    if (iotlb->leaf) {
+    addr_info->flags |= IOMMU_INV_ADDR_FLAGS_LEAF;
+    }
+    if (iotlb->flags & IOMMU_INV_FLAGS_ARCHID) {
+    addr_info->flags |= IOMMU_INV_ADDR_FLAGS_ARCHID;
+    archid = iotlb->arch_id;
+    }
+    addr_info->archid = archid;
+    addr_info->addr = start;
+    addr_info->granule_size = size;
+    addr_info->nb_granules = 1;
+    trace_vfio_iommu_addr_inv_iotlb(archid, start, size,
+    1, iotlb->leaf);
+    break;
+    }

Should we pass a size to  host kernel here, even if vSMMU doesn't
support
RIL or guest kernel doesn't use RIL?

It will cause TLBI issue in  this scenario: Guest kernel issues a
TLBI cmd
without "range" (tg = 0) to invalidate a 2M huge page. Then qemu passed
the iova and size (4K) to host kernel. Finally, host kernel issues a
TLBI cmd
with "range" (4K) which can not invalidate the TLB entry of 2M huge
page.
(pSMMU supports RIL)

In that case the guest will loop over all 4K images belonging to the 2M
huge page and invalidate each of them. This should turn into qemu
notifications for each 4kB page, no? This is totally inefficient, hence

The guest will not loop over all 4K images belonging to the 2M huge page.
The iommu_iotlb_gather->pgsize will be 2M, if a page is 2M huge page. The
gather->pgsize wil

Re: [RFC v9 15/29] vfio: Set up nested stage mappings

2021-04-14 Thread Kunkun Jiang

Hi Eric,

On 2021/4/14 16:05, Auger Eric wrote:

Hi Kunkun,

On 4/14/21 3:45 AM, Kunkun Jiang wrote:

On 2021/4/13 20:57, Auger Eric wrote:

Hi Kunkun,

On 4/13/21 2:10 PM, Kunkun Jiang wrote:

Hi Eric,

On 2021/4/11 20:08, Eric Auger wrote:

In nested mode, legacy vfio_iommu_map_notify cannot be used as
there is no "caching" mode and we do not trap on map.

On Intel, vfio_iommu_map_notify was used to DMA map the RAM
through the host single stage.

With nested mode, we need to setup the stage 2 and the stage 1
separately. This patch introduces a prereg_listener to setup
the stage 2 mapping.

The stage 1 mapping, owned by the guest, is passed to the host
when the guest invalidates the stage 1 configuration, through
a dedicated PCIPASIDOps callback. Guest IOTLB invalidations
are cascaded downto the host through another IOMMU MR UNMAP
notifier.

Signed-off-by: Eric Auger 

---

v7 -> v8:
- properly handle new IOMMUTLBEntry fields and especially
     propagate DOMAIN and PASID based invalidations

v6 -> v7:
- remove PASID based invalidation

v5 -> v6:
- add error_report_err()
- remove the abort in case of nested stage case

v4 -> v5:
- use VFIO_IOMMU_SET_PASID_TABLE
- use PCIPASIDOps for config notification

v3 -> v4:
- use iommu_inv_pasid_info for ASID invalidation

v2 -> v3:
- use VFIO_IOMMU_ATTACH_PASID_TABLE
- new user API
- handle leaf

v1 -> v2:
- adapt to uapi changes
- pass the asid
- pass IOMMU_NOTIFIER_S1_CFG when initializing the config notifier
---
    hw/vfio/common.c | 139
+--
    hw/vfio/pci.c    |  21 +++
    hw/vfio/trace-events |   2 +
    3 files changed, 157 insertions(+), 5 deletions(-)

diff --git a/hw/vfio/common.c b/hw/vfio/common.c
index 0cd7ef2139..e369d451e7 100644
--- a/hw/vfio/common.c
+++ b/hw/vfio/common.c
@@ -595,6 +595,73 @@ static bool vfio_get_xlat_addr(IOMMUTLBEntry
*iotlb, void **vaddr,
    return true;
    }
    +/* Propagate a guest IOTLB invalidation to the host (nested
mode) */
+static void vfio_iommu_unmap_notify(IOMMUNotifier *n, IOMMUTLBEntry
*iotlb)
+{
+    VFIOGuestIOMMU *giommu = container_of(n, VFIOGuestIOMMU, n);
+    struct vfio_iommu_type1_cache_invalidate ustruct = {};
+    VFIOContainer *container = giommu->container;
+    int ret;
+
+    assert(iotlb->perm == IOMMU_NONE);
+
+    ustruct.argsz = sizeof(ustruct);
+    ustruct.flags = 0;
+    ustruct.info.argsz = sizeof(struct iommu_cache_invalidate_info);
+    ustruct.info.version = IOMMU_CACHE_INVALIDATE_INFO_VERSION_1;
+    ustruct.info.cache = IOMMU_CACHE_INV_TYPE_IOTLB;
+
+    switch (iotlb->granularity) {
+    case IOMMU_INV_GRAN_DOMAIN:
+    ustruct.info.granularity = IOMMU_INV_GRANU_DOMAIN;
+    break;
+    case IOMMU_INV_GRAN_PASID:
+    {
+    struct iommu_inv_pasid_info *pasid_info;
+    int archid = -1;
+
+    pasid_info = _info;
+    ustruct.info.granularity = IOMMU_INV_GRANU_PASID;
+    if (iotlb->flags & IOMMU_INV_FLAGS_ARCHID) {
+    pasid_info->flags |= IOMMU_INV_ADDR_FLAGS_ARCHID;
+    archid = iotlb->arch_id;
+    }
+    pasid_info->archid = archid;
+    trace_vfio_iommu_asid_inv_iotlb(archid);
+    break;
+    }
+    case IOMMU_INV_GRAN_ADDR:
+    {
+    hwaddr start = iotlb->iova + giommu->iommu_offset;
+    struct iommu_inv_addr_info *addr_info;
+    size_t size = iotlb->addr_mask + 1;
+    int archid = -1;
+
+    addr_info = _info;
+    ustruct.info.granularity = IOMMU_INV_GRANU_ADDR;
+    if (iotlb->leaf) {
+    addr_info->flags |= IOMMU_INV_ADDR_FLAGS_LEAF;
+    }
+    if (iotlb->flags & IOMMU_INV_FLAGS_ARCHID) {
+    addr_info->flags |= IOMMU_INV_ADDR_FLAGS_ARCHID;
+    archid = iotlb->arch_id;
+    }
+    addr_info->archid = archid;
+    addr_info->addr = start;
+    addr_info->granule_size = size;
+    addr_info->nb_granules = 1;
+    trace_vfio_iommu_addr_inv_iotlb(archid, start, size,
+    1, iotlb->leaf);
+    break;
+    }

Should we pass a size to  host kernel here, even if vSMMU doesn't
support
RIL or guest kernel doesn't use RIL?

It will cause TLBI issue in  this scenario: Guest kernel issues a
TLBI cmd
without "range" (tg = 0) to invalidate a 2M huge page. Then qemu passed
the iova and size (4K) to host kernel. Finally, host kernel issues a
TLBI cmd
with "range" (4K) which can not invalidate the TLB entry of 2M huge
page.
(pSMMU supports RIL)

In that case the guest will loop over all 4K images belonging to the 2M
huge page and invalidate each of them. This should turn into qemu
notifications for each 4kB page, no? This is totally inefficient, hence

The guest will not loop over all 4K images belonging to the 2M huge page.
The iommu_iotlb_gather->pgsize will be 2M, if a page is 2M huge page. The
gather->pgsize wil

Re: [RFC v9 15/29] vfio: Set up nested stage mappings

2021-04-13 Thread Kunkun Jiang

On 2021/4/13 20:57, Auger Eric wrote:

Hi Kunkun,

On 4/13/21 2:10 PM, Kunkun Jiang wrote:

Hi Eric,

On 2021/4/11 20:08, Eric Auger wrote:

In nested mode, legacy vfio_iommu_map_notify cannot be used as
there is no "caching" mode and we do not trap on map.

On Intel, vfio_iommu_map_notify was used to DMA map the RAM
through the host single stage.

With nested mode, we need to setup the stage 2 and the stage 1
separately. This patch introduces a prereg_listener to setup
the stage 2 mapping.

The stage 1 mapping, owned by the guest, is passed to the host
when the guest invalidates the stage 1 configuration, through
a dedicated PCIPASIDOps callback. Guest IOTLB invalidations
are cascaded downto the host through another IOMMU MR UNMAP
notifier.

Signed-off-by: Eric Auger 

---

v7 -> v8:
- properly handle new IOMMUTLBEntry fields and especially
    propagate DOMAIN and PASID based invalidations

v6 -> v7:
- remove PASID based invalidation

v5 -> v6:
- add error_report_err()
- remove the abort in case of nested stage case

v4 -> v5:
- use VFIO_IOMMU_SET_PASID_TABLE
- use PCIPASIDOps for config notification

v3 -> v4:
- use iommu_inv_pasid_info for ASID invalidation

v2 -> v3:
- use VFIO_IOMMU_ATTACH_PASID_TABLE
- new user API
- handle leaf

v1 -> v2:
- adapt to uapi changes
- pass the asid
- pass IOMMU_NOTIFIER_S1_CFG when initializing the config notifier
---
   hw/vfio/common.c | 139 +--
   hw/vfio/pci.c    |  21 +++
   hw/vfio/trace-events |   2 +
   3 files changed, 157 insertions(+), 5 deletions(-)

diff --git a/hw/vfio/common.c b/hw/vfio/common.c
index 0cd7ef2139..e369d451e7 100644
--- a/hw/vfio/common.c
+++ b/hw/vfio/common.c
@@ -595,6 +595,73 @@ static bool vfio_get_xlat_addr(IOMMUTLBEntry
*iotlb, void **vaddr,
   return true;
   }
   +/* Propagate a guest IOTLB invalidation to the host (nested mode) */
+static void vfio_iommu_unmap_notify(IOMMUNotifier *n, IOMMUTLBEntry
*iotlb)
+{
+    VFIOGuestIOMMU *giommu = container_of(n, VFIOGuestIOMMU, n);
+    struct vfio_iommu_type1_cache_invalidate ustruct = {};
+    VFIOContainer *container = giommu->container;
+    int ret;
+
+    assert(iotlb->perm == IOMMU_NONE);
+
+    ustruct.argsz = sizeof(ustruct);
+    ustruct.flags = 0;
+    ustruct.info.argsz = sizeof(struct iommu_cache_invalidate_info);
+    ustruct.info.version = IOMMU_CACHE_INVALIDATE_INFO_VERSION_1;
+    ustruct.info.cache = IOMMU_CACHE_INV_TYPE_IOTLB;
+
+    switch (iotlb->granularity) {
+    case IOMMU_INV_GRAN_DOMAIN:
+    ustruct.info.granularity = IOMMU_INV_GRANU_DOMAIN;
+    break;
+    case IOMMU_INV_GRAN_PASID:
+    {
+    struct iommu_inv_pasid_info *pasid_info;
+    int archid = -1;
+
+    pasid_info = _info;
+    ustruct.info.granularity = IOMMU_INV_GRANU_PASID;
+    if (iotlb->flags & IOMMU_INV_FLAGS_ARCHID) {
+    pasid_info->flags |= IOMMU_INV_ADDR_FLAGS_ARCHID;
+    archid = iotlb->arch_id;
+    }
+    pasid_info->archid = archid;
+    trace_vfio_iommu_asid_inv_iotlb(archid);
+    break;
+    }
+    case IOMMU_INV_GRAN_ADDR:
+    {
+    hwaddr start = iotlb->iova + giommu->iommu_offset;
+    struct iommu_inv_addr_info *addr_info;
+    size_t size = iotlb->addr_mask + 1;
+    int archid = -1;
+
+    addr_info = _info;
+    ustruct.info.granularity = IOMMU_INV_GRANU_ADDR;
+    if (iotlb->leaf) {
+    addr_info->flags |= IOMMU_INV_ADDR_FLAGS_LEAF;
+    }
+    if (iotlb->flags & IOMMU_INV_FLAGS_ARCHID) {
+    addr_info->flags |= IOMMU_INV_ADDR_FLAGS_ARCHID;
+    archid = iotlb->arch_id;
+    }
+    addr_info->archid = archid;
+    addr_info->addr = start;
+    addr_info->granule_size = size;
+    addr_info->nb_granules = 1;
+    trace_vfio_iommu_addr_inv_iotlb(archid, start, size,
+    1, iotlb->leaf);
+    break;
+    }

Should we pass a size to  host kernel here, even if vSMMU doesn't support
RIL or guest kernel doesn't use RIL?

It will cause TLBI issue in  this scenario: Guest kernel issues a TLBI cmd
without "range" (tg = 0) to invalidate a 2M huge page. Then qemu passed
the iova and size (4K) to host kernel. Finally, host kernel issues a
TLBI cmd
with "range" (4K) which can not invalidate the TLB entry of 2M huge page.
(pSMMU supports RIL)

In that case the guest will loop over all 4K images belonging to the 2M
huge page and invalidate each of them. This should turn into qemu
notifications for each 4kB page, no? This is totally inefficient, hence

The guest will not loop over all 4K images belonging to the 2M huge page.
The iommu_iotlb_gather->pgsize will be 2M, if a page is 2M huge page. The
gather->pgsize will be passed to __arm_smmu_tlb_inv_range as "granule":

iommu_iotlb_gather_add_

Re: [RFC v9 15/29] vfio: Set up nested stage mappings

2021-04-13 Thread Kunkun Jiang

Hi Eric,

On 2021/4/11 20:08, Eric Auger wrote:

In nested mode, legacy vfio_iommu_map_notify cannot be used as
there is no "caching" mode and we do not trap on map.

On Intel, vfio_iommu_map_notify was used to DMA map the RAM
through the host single stage.

With nested mode, we need to setup the stage 2 and the stage 1
separately. This patch introduces a prereg_listener to setup
the stage 2 mapping.

The stage 1 mapping, owned by the guest, is passed to the host
when the guest invalidates the stage 1 configuration, through
a dedicated PCIPASIDOps callback. Guest IOTLB invalidations
are cascaded downto the host through another IOMMU MR UNMAP
notifier.

Signed-off-by: Eric Auger 

---

v7 -> v8:
- properly handle new IOMMUTLBEntry fields and especially
   propagate DOMAIN and PASID based invalidations

v6 -> v7:
- remove PASID based invalidation

v5 -> v6:
- add error_report_err()
- remove the abort in case of nested stage case

v4 -> v5:
- use VFIO_IOMMU_SET_PASID_TABLE
- use PCIPASIDOps for config notification

v3 -> v4:
- use iommu_inv_pasid_info for ASID invalidation

v2 -> v3:
- use VFIO_IOMMU_ATTACH_PASID_TABLE
- new user API
- handle leaf

v1 -> v2:
- adapt to uapi changes
- pass the asid
- pass IOMMU_NOTIFIER_S1_CFG when initializing the config notifier
---
  hw/vfio/common.c | 139 +--
  hw/vfio/pci.c|  21 +++
  hw/vfio/trace-events |   2 +
  3 files changed, 157 insertions(+), 5 deletions(-)

diff --git a/hw/vfio/common.c b/hw/vfio/common.c
index 0cd7ef2139..e369d451e7 100644
--- a/hw/vfio/common.c
+++ b/hw/vfio/common.c
@@ -595,6 +595,73 @@ static bool vfio_get_xlat_addr(IOMMUTLBEntry *iotlb, void 
**vaddr,
  return true;
  }
  
+/* Propagate a guest IOTLB invalidation to the host (nested mode) */

+static void vfio_iommu_unmap_notify(IOMMUNotifier *n, IOMMUTLBEntry *iotlb)
+{
+VFIOGuestIOMMU *giommu = container_of(n, VFIOGuestIOMMU, n);
+struct vfio_iommu_type1_cache_invalidate ustruct = {};
+VFIOContainer *container = giommu->container;
+int ret;
+
+assert(iotlb->perm == IOMMU_NONE);
+
+ustruct.argsz = sizeof(ustruct);
+ustruct.flags = 0;
+ustruct.info.argsz = sizeof(struct iommu_cache_invalidate_info);
+ustruct.info.version = IOMMU_CACHE_INVALIDATE_INFO_VERSION_1;
+ustruct.info.cache = IOMMU_CACHE_INV_TYPE_IOTLB;
+
+switch (iotlb->granularity) {
+case IOMMU_INV_GRAN_DOMAIN:
+ustruct.info.granularity = IOMMU_INV_GRANU_DOMAIN;
+break;
+case IOMMU_INV_GRAN_PASID:
+{
+struct iommu_inv_pasid_info *pasid_info;
+int archid = -1;
+
+pasid_info = _info;
+ustruct.info.granularity = IOMMU_INV_GRANU_PASID;
+if (iotlb->flags & IOMMU_INV_FLAGS_ARCHID) {
+pasid_info->flags |= IOMMU_INV_ADDR_FLAGS_ARCHID;
+archid = iotlb->arch_id;
+}
+pasid_info->archid = archid;
+trace_vfio_iommu_asid_inv_iotlb(archid);
+break;
+}
+case IOMMU_INV_GRAN_ADDR:
+{
+hwaddr start = iotlb->iova + giommu->iommu_offset;
+struct iommu_inv_addr_info *addr_info;
+size_t size = iotlb->addr_mask + 1;
+int archid = -1;
+
+addr_info = _info;
+ustruct.info.granularity = IOMMU_INV_GRANU_ADDR;
+if (iotlb->leaf) {
+addr_info->flags |= IOMMU_INV_ADDR_FLAGS_LEAF;
+}
+if (iotlb->flags & IOMMU_INV_FLAGS_ARCHID) {
+addr_info->flags |= IOMMU_INV_ADDR_FLAGS_ARCHID;
+archid = iotlb->arch_id;
+}
+addr_info->archid = archid;
+addr_info->addr = start;
+addr_info->granule_size = size;
+addr_info->nb_granules = 1;
+trace_vfio_iommu_addr_inv_iotlb(archid, start, size,
+1, iotlb->leaf);
+break;
+}

Should we pass a size to  host kernel here, even if vSMMU doesn't support
RIL or guest kernel doesn't use RIL?

It will cause TLBI issue in  this scenario: Guest kernel issues a TLBI cmd
without "range" (tg = 0) to invalidate a 2M huge page. Then qemu passed
the iova and size (4K) to host kernel. Finally, host kernel issues a 
TLBI cmd

with "range" (4K) which can not invalidate the TLB entry of 2M huge page.
(pSMMU supports RIL)

Thanks,
Kunkun Jiang

+}
+
+ret = ioctl(container->fd, VFIO_IOMMU_CACHE_INVALIDATE, );
+if (ret) {
+error_report("%p: failed to invalidate CACHE (%d)", container, ret);
+}
+}
+
  static void vfio_iommu_map_notify(IOMMUNotifier *n, IOMMUTLBEntry *iotlb)
  {
  VFIOGuestIOMMU *giommu = container_of(n, VFIOGuestIOMMU, n);
@@ -776,6 +843,35 @@ static void vfio_dma_unmap_ram_section(VFIOContainer 
*container,
  }
  }
  
+static void vfio_prereg_listener_region_add(MemoryListener *listener,

+MemoryRe

Re: [RFC PATCH 3/3] hw/arm/smmuv3: Post-load stage 1 configurations to the host

2021-04-12 Thread Kunkun Jiang

Hi Eric,

On 2021/4/12 16:34, Auger Eric wrote:

Hi Kunkun,
On 2/19/21 10:42 AM, Kunkun Jiang wrote:

In nested mode, we call the set_pasid_table() callback on each STE
update to pass the guest stage 1 configuration to the host and
apply it at physical level.

In the case of live migration, we need to manual call the

s/manual/manually

You are right.

set_pasid_table() to load the guest stage 1 configurations to the
host. If this operation is fail, the migration is fail.

s/If this operation is fail, the migration is fail/If this operation
fails, the migration fails.

Thanks for your careful review.

Signed-off-by: Kunkun Jiang 
---
  hw/arm/smmuv3.c | 60 +
  hw/arm/trace-events |  1 +
  2 files changed, 61 insertions(+)

diff --git a/hw/arm/smmuv3.c b/hw/arm/smmuv3.c
index 6c6ed84e78..94ca15375c 100644
--- a/hw/arm/smmuv3.c
+++ b/hw/arm/smmuv3.c
@@ -1473,6 +1473,65 @@ static void smmu_realize(DeviceState *d, Error **errp)
  smmu_init_irq(s, dev);
  }
  
+static int smmuv3_manual_set_pci_device_pasid_table(SMMUDevice *sdev)

Can't you retrieve the associated sid and then call
smmuv3_notify_config_change()

Agreed. It can reuse the code of smmuv3_notify_config_change().
I will modify it in the next version.

But I need a return value to judge whether the operation succeeded, which
may require changing the return type of smmuv3_notify_config_change().
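
Roughly something like this (just a sketch, assuming
smmuv3_notify_config_change() is reworked to return an int):

static int smmuv3_post_load(void *opaque, int version_id)
{
    SMMUv3State *s3 = opaque;
    SMMUState *s = &(s3->smmu_state);
    SMMUDevice *sdev;
    int ret = 0;

    QLIST_FOREACH(sdev, &s->devices_with_notifiers, next) {
        uint32_t sid = smmu_get_sid(sdev);

        /* Re-send this device's stage 1 config to the host via the STE path */
        ret = smmuv3_notify_config_change(s, sid);
        if (ret) {
            break;
        }
    }

    return ret;
}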

Thanks,
Kunkun Jiang

+{
+#ifdef __linux__
+IOMMUMemoryRegion *mr = &(sdev->iommu);
+int sid = smmu_get_sid(sdev);
+SMMUEventInfo event = {.type = SMMU_EVT_NONE, .sid = sid,
+   .inval_ste_allowed = true};
+IOMMUConfig iommu_config = {};
+SMMUTransCfg *cfg;
+int ret = -1;
+
+cfg = smmuv3_get_config(sdev, );
+if (!cfg) {
+return ret;
+}
+
+iommu_config.pasid_cfg.argsz = sizeof(struct iommu_pasid_table_config);
+iommu_config.pasid_cfg.version = PASID_TABLE_CFG_VERSION_1;
+iommu_config.pasid_cfg.format = IOMMU_PASID_FORMAT_SMMUV3;
+iommu_config.pasid_cfg.base_ptr = cfg->s1ctxptr;
+iommu_config.pasid_cfg.pasid_bits = 0;
+iommu_config.pasid_cfg.vendor_data.smmuv3.version = 
PASID_TABLE_SMMUV3_CFG_VERSION_1;
+
+if (cfg->disabled || cfg->bypassed) {
+iommu_config.pasid_cfg.config = IOMMU_PASID_CONFIG_BYPASS;
+} else if (cfg->aborted) {
+iommu_config.pasid_cfg.config = IOMMU_PASID_CONFIG_ABORT;
+} else {
+iommu_config.pasid_cfg.config = IOMMU_PASID_CONFIG_TRANSLATE;
+}
+
+ret = pci_device_set_pasid_table(sdev->bus, sdev->devfn, _config);
+if (ret) {
+error_report("Failed to pass PASID table to host for iommu mr %s (%m)",
+ mr->parent_obj.name);
+}
+
+return ret;
+#endif
+}
+
+static int smmuv3_post_load(void *opaque, int version_id)
+{
+SMMUv3State *s3 = opaque;
+SMMUState *s = &(s3->smmu_state);
+SMMUDevice *sdev;
+int ret = 0;
+
+QLIST_FOREACH(sdev, >devices_with_notifiers, next) {
+trace_smmuv3_post_load_sdev(sdev->devfn, sdev->iommu.parent_obj.name);
+ret = smmuv3_manual_set_pci_device_pasid_table(sdev);
+if (ret) {
+break;
+}
+}
+
+return ret;
+}
+
  static const VMStateDescription vmstate_smmuv3_queue = {
  .name = "smmuv3_queue",
  .version_id = 1,
@@ -1491,6 +1550,7 @@ static const VMStateDescription vmstate_smmuv3 = {
  .version_id = 1,
  .minimum_version_id = 1,
  .priority = MIG_PRI_IOMMU,
+.post_load = smmuv3_post_load,
  .fields = (VMStateField[]) {
  VMSTATE_UINT32(features, SMMUv3State),
  VMSTATE_UINT8(sid_size, SMMUv3State),
diff --git a/hw/arm/trace-events b/hw/arm/trace-events
index 35e562ab74..caa864dd72 100644
--- a/hw/arm/trace-events
+++ b/hw/arm/trace-events
@@ -53,4 +53,5 @@ smmuv3_notify_flag_add(const char *iommu) "ADD SMMUNotifier 
node for iommu mr=%s
  smmuv3_notify_flag_del(const char *iommu) "DEL SMMUNotifier node for iommu 
mr=%s"
  smmuv3_inv_notifiers_iova(const char *name, uint16_t asid, uint64_t iova, uint8_t tg, uint64_t 
num_pages) "iommu mr=%s asid=%d iova=0x%"PRIx64" tg=%d num_pages=0x%"PRIx64
  smmuv3_notify_config_change(const char *name, uint8_t config, uint64_t s1ctxptr) 
"iommu mr=%s config=%d s1ctxptr=0x%"PRIx64
+smmuv3_post_load_sdev(int devfn, const char *name) "sdev devfn=%d iommu 
mr=%s"PRIx64
  


Thanks

Eric

.






Re: [RFC PATCH 2/3] vfio: Add vfio_prereg_listener_log_sync in nested stage

2021-04-12 Thread Kunkun Jiang

Hi Eric,

On 2021/4/8 21:56, Auger Eric wrote:

Hi Kunkun,

On 2/19/21 10:42 AM, Kunkun Jiang wrote:

On Intel, the DMA mapped through the host single stage. Instead
we set up the stage 2 and stage 1 separately in nested mode as there
is no "Caching Mode".

You need to rewrite the above sentences: ARM is missing, and the first
sentence misses a verb.

Thanks for your review! I will fix it in the next version.

Legacy vfio_listener_log_sync cannot be used in nested stage as we
don't need to pay close attention to stage 1 mapping. This patch adds
vfio_prereg_listener_log_sync to mark dirty pages in nested mode.

Signed-off-by: Kunkun Jiang 
---
  hw/vfio/common.c | 25 +
  1 file changed, 25 insertions(+)

diff --git a/hw/vfio/common.c b/hw/vfio/common.c
index 7c50905856..af333e0dee 100644
--- a/hw/vfio/common.c
+++ b/hw/vfio/common.c
@@ -1216,6 +1216,22 @@ static int 
vfio_dma_sync_ram_section_dirty_bitmap(VFIOContainer *container,
 int128_get64(section->size), ram_addr);
  }
  
+static void vfio_prereg_listener_log_sync(MemoryListener *listener,

+  MemoryRegionSection *section)
+{
+VFIOContainer *container =
+container_of(listener, VFIOContainer, prereg_listener);
+
+if (!memory_region_is_ram(section->mr) ||
+!container->dirty_pages_supported) {
+return;
+}
+
+if (vfio_devices_all_saving(container)) {

I fail to see where is this defined?

Keqian modified vfio_devices_all_saving to vfio_devices_all_dirty_tracking
in 758b96b61d5.

When I posted this series of patches, it was vfio_devices_all_saving.
In v2 [1], I have updated it based on the latest QEMU.

[1] 
https://lore.kernel.org/qemu-devel/20210331101259.2153-3-jiangkun...@huawei.com/


Thanks,
Kunkun Jiang

+vfio_dma_sync_ram_section_dirty_bitmap(container, section);
+}
+}
+
  typedef struct {
  IOMMUNotifier n;
  VFIOGuestIOMMU *giommu;
@@ -1260,6 +1276,14 @@ static int vfio_sync_dirty_bitmap(VFIOContainer 
*container,
  if (memory_region_is_iommu(section->mr)) {
  VFIOGuestIOMMU *giommu;
  
+/*

+ * In nested mode, stage 2 and stage 1 are set up separately. We
+ * only need to focus on stage 2 mapping when marking dirty pages.
+ */
+if (container->iommu_type == VFIO_TYPE1_NESTING_IOMMU) {
+return 0;
+}
+
  QLIST_FOREACH(giommu, &container->giommu_list, giommu_next) {
  if (MEMORY_REGION(giommu->iommu) == section->mr &&
  giommu->n.start == section->offset_within_region) {
@@ -1312,6 +1336,7 @@ static const MemoryListener vfio_memory_listener = {
  static MemoryListener vfio_memory_prereg_listener = {
  .region_add = vfio_prereg_listener_region_add,
  .region_del = vfio_prereg_listener_region_del,
+.log_sync = vfio_prereg_listener_log_sync,
  };
  
  static void vfio_listener_release(VFIOContainer *container)



Thanks

Eric

.






Re: [RFC PATCH 0/3] Add migration support for VFIO PCI devices in SMMUv3 nested stage mode

2021-04-12 Thread Kunkun Jiang

Hi Eric,

On 2021/4/12 16:40, Auger Eric wrote:

Hi Kunkun,

On 2/19/21 10:42 AM, Kunkun Jiang wrote:

Hi all,

Since the SMMUv3's nested translation stages[1] has been introduced by Eric, we
need to pay attention to the migration of VFIO PCI devices in SMMUv3 nested 
stage
mode. At present, it is not yet supported in QEMU. There are two problems in the
existing framework.

First, the current way to get dirty pages is not applicable to nested stage 
mode.
Because of the "Caching Mode", VTD can map the RAM through the host single stage
(giova->hpa). "vfio_listener_log_sync" gets dirty pages by transferring "giova"
to the kernel for the RAM block section of mapped MMIO region. In nested stage
mode, we setup the stage 2 (gpa->hpa) and the stage 1(giova->gpa) separately. So
it is inapplicable to get dirty pages by the current way in nested stage mode.

Second, it also need to pass stage 1 configurations to the destination host 
after
the migration. In Eric's patch, it passes the stage 1 configuration to the host 
on
each STE update for the devices set the PASID PciOps. The configuration will be
applied at physical level. But the data of physical level will not be sent to 
the
destination host. So we have to pass stage 1 configurations to the destination
host after the migration.

This Patch set includes patches as below:
Patch 1-2:
- Refactor the vfio_listener_log_sync and added a new function to get dirty 
pages
in nested stage mode.

Patch 3:
- Added the post_load function to vmstate_smmuv3 for passing stage 1 
configuration
to the destination host after the migration.

@Eric, Could you please add this Patch set to your future version of
"vSMMUv3/pSMMUv3 2 stage VFIO integration", if you think this Patch set makes 
sense? :)

First of all, thank you for working on this. As you may have noticed I
sent a new RFC version yesterday (without including this). When time
allows, you may have a look at the comments I posted on your series. I
don't think I can test it at the moment so I may prefer to keep it
separate. Also be aware that the QEMU integration of nested has not
received many comments yet and is likely to evolve. The priority is to
get some R-b's on the kernel pieces, especially the SMMU part. Until this
dependency is resolved, things can't move forward, I am afraid.

Thanks

Eric

Yes, I saw your latest version and comments. Thanks again for your review.

I will try my best to test your patches and come up with some useful ideas.
In the future, this series will be updated along with your nested series.
If conditions permit, I hope you can test it and give me some comments.

Best regards,
Kunkun Jiang

Best Regards
Kunkun Jiang

[1] [RFC,v7,00/26] vSMMUv3/pSMMUv3 2 stage VFIO integration
http://patchwork.ozlabs.org/project/qemu-devel/cover/20201116181349.11908-1-eric.au...@redhat.com/

Kunkun Jiang (3):
   vfio: Introduce helpers to mark dirty pages of a RAM section
   vfio: Add vfio_prereg_listener_log_sync in nested stage
   hw/arm/smmuv3: Post-load stage 1 configurations to the host

  hw/arm/smmuv3.c | 60 +
  hw/arm/trace-events |  1 +
  hw/vfio/common.c| 47 +--
  3 files changed, 100 insertions(+), 8 deletions(-)


.






Re: [RFC PATCH 1/3] vfio: Introduce helpers to mark dirty pages of a RAM section

2021-04-12 Thread Kunkun Jiang

Hi Eric,

On 2021/4/8 21:46, Auger Eric wrote:

Hi Kunkun,

On 2/19/21 10:42 AM, Kunkun Jiang wrote:

Extract part of the code from vfio_sync_dirty_bitmap to form a
new helper, which allows to mark dirty pages of a RAM section.
This helper will be called for nested stage.

Signed-off-by: Kunkun Jiang 
---
  hw/vfio/common.c | 22 ++
  1 file changed, 14 insertions(+), 8 deletions(-)

diff --git a/hw/vfio/common.c b/hw/vfio/common.c
index 9225f10722..7c50905856 100644
--- a/hw/vfio/common.c
+++ b/hw/vfio/common.c
@@ -1203,6 +1203,19 @@ err_out:
  return ret;
  }
  
+static int vfio_dma_sync_ram_section_dirty_bitmap(VFIOContainer *container,

+  MemoryRegionSection *section)
+{
+ram_addr_t ram_addr;
+
+ram_addr = memory_region_get_ram_addr(section->mr) +
+   section->offset_within_region;
+
+return vfio_get_dirty_bitmap(container,
+   TARGET_PAGE_ALIGN(section->offset_within_address_space),
+   int128_get64(section->size), ram_addr);
+}
+
  typedef struct {
  IOMMUNotifier n;
  VFIOGuestIOMMU *giommu;
@@ -1244,8 +1257,6 @@ static void vfio_iommu_map_dirty_notify(IOMMUNotifier *n, 
IOMMUTLBEntry *iotlb)
  static int vfio_sync_dirty_bitmap(VFIOContainer *container,
MemoryRegionSection *section)
  {
-ram_addr_t ram_addr;
-
  if (memory_region_is_iommu(section->mr)) {
  VFIOGuestIOMMU *giommu;
  
@@ -1274,12 +1285,7 @@ static int vfio_sync_dirty_bitmap(VFIOContainer *container,

  return 0;
  }
  
-ram_addr = memory_region_get_ram_addr(section->mr) +

-   section->offset_within_region;
-
-return vfio_get_dirty_bitmap(container,
-   TARGET_PAGE_ALIGN(section->offset_within_address_space),

this is now REAL_HOST_PAGE_ALIGN

Thanks

Eric

Sorry for the late reply.

Yes, it is REAL_HOST_PAGE_ALIGN now, which was modified by a patch I sent
earlier.

I posted v2 a few days ago and have modified TARGET_PAGE_ALIGN to
REAL_HOST_PAGE_ALIGN. [1]

[1] 
https://lore.kernel.org/qemu-devel/20210331101259.2153-2-jiangkun...@huawei.com/


Thanks,
Kunkun Jiang

-   int128_get64(section->size), ram_addr);
+return vfio_dma_sync_ram_section_dirty_bitmap(container, section);
  }
  
  static void vfio_listerner_log_sync(MemoryListener *listener,



.






Re: A question about the translation granule size supported by the vSMMU

2021-04-09 Thread Kunkun Jiang

Hi Eric,

On 2021/4/8 15:27, Auger Eric wrote:

Hi Kunkun,

On 4/7/21 11:26 AM, Kunkun Jiang wrote:

Hi Eric,

On 2021/4/7 3:50, Auger Eric wrote:

Hi Kunkun,

On 3/27/21 3:24 AM, Kunkun Jiang wrote:

Hi all,

Recently, I did some tests on SMMU nested mode. Here is
a question about the translation granule size supported by
vSMMU.

There is such a code in SMMUv3_init_regs():


     /* 4K and 64K granule support */
  s->idr[5] = FIELD_DP32(s->idr[5], IDR5, GRAN4K, 1);
  s->idr[5] = FIELD_DP32(s->idr[5], IDR5, GRAN64K, 1);
  s->idr[5] = FIELD_DP32(s->idr[5], IDR5, OAS, SMMU_IDR5_OAS); /* 44
bits */

Why is the 16K granule not supported? I modified the code
to support it and did not encounter any problems in the
test. Although 4K and 64K minimal granules are "strongly
recommended", I think vSMMU should still support 16K.
Are there other reasons why 16K is not supported here?

no there aren't any. The main reasons were 16KB support is optional and
supporting it increases the test matrix. Also it seems quite a few
machines I have access to do support 16KB granule. On the others I get

"EFI stub: ERROR: This 16 KB granular kernel is not supported by your
CPU".

Nevertheless I am not opposed to support it as it seems to work without
trouble. Just need to have an extra look at implied validity checks but
there shouldn't be much.

Thanks

Eric

Yes, you are right. In my opinion, it is necessary to check whether pSMMU
supports 16K to avoid the situation I mentioned below.
In SMMU nested mode, if the vSMMU supports 16K and sets the pasid table to
the pSMMU, errors may occur when the pSMMU does the translation table walk
if it doesn't support 16K (not tested). Do you think we need to add
an interface to get some pSMMU info?
Maybe my consideration was superfluous.

No it is not. At qemu level we have
memory_region_iommu_set_page_size_mask() that is called from the VFIO
device. It allows to pass such info to the IOMMU device (qemu
b91774984249).

iommu_set_page_size_mask() cb needs to be implemented at SMMU QEMU
device level. Also [PATCH 0/2] Domain nesting info for arm-smmu may
allow to return other constraints from the pSMMU.

Thanks

Eric

Ok, it makes sense to me.
I am glad to test them if these patches are ready.
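
For illustration, here is a rough sketch of what I have in mind (not a posted
patch; the callback wiring and any names not appearing in this thread are my
assumptions). Once the host page size mask reaches the vSMMU through an
iommu_set_page_size_mask() implementation, the granules the pSMMU cannot
handle could be dropped from IDR5 before it is exposed to the guest:

/* Rough sketch only: filter the advertised granules against the host mask. */
static void smmuv3_filter_granules(SMMUv3State *s, uint64_t host_psize_mask)
{
    if (!(host_psize_mask & (1ULL << 12))) {                 /* no 4K on pSMMU  */
        s->idr[5] = FIELD_DP32(s->idr[5], IDR5, GRAN4K, 0);
    }
    if (!(host_psize_mask & (1ULL << 14))) {                 /* no 16K on pSMMU */
        s->idr[5] = FIELD_DP32(s->idr[5], IDR5, GRAN16K, 0);
    }
    if (!(host_psize_mask & (1ULL << 16))) {                 /* no 64K on pSMMU */
        s->idr[5] = FIELD_DP32(s->idr[5], IDR5, GRAN64K, 0);
    }
}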

Thanks,
Kunkun Jiang

Thanks,
Kunkun Jiang

When in SMMU nested mode, it may get errors if pSMMU
doesn't support 16K but vSMMU supports 16K. But we
can get some settings of pSMMU to avoid this situation.
I found some discussions between Eric and Linu about
this [1], but this idea does not seem to be implemented.

[1] https://lists.gnu.org/archive/html/qemu-arm/2017-09/msg00149.html

Best regards,
Kunkun Jiang


.



.






Re: [PATCH v4 0/2] Some modification about ram_save_host_page()

2021-04-07 Thread Kunkun Jiang

Hi Dave,

On 2021/4/7 1:14, Dr. David Alan Gilbert wrote:

* Kunkun Jiang (jiangkun...@huawei.com) wrote:

Kindly ping,

Hi David Alan Gilbert,
Will this series be picked up soon, or is there any other work for me to do?

You don't need to do anything, but it did miss the cutoff for soft
freeze; since it's an optimisation not a fix; it's still on my list so
it'll go in just as soon as 6.1 opens up.

Dave

Okay, I see. Thank you.

Best Regards,
Kunkun Jiang

Best Regards,
Kunkun Jiang

On 2021/3/16 20:57, Kunkun Jiang wrote:

Hi all,

This series include patches as below:
Patch 1:
- reduce unnecessary rate limiting in ram_save_host_page()

Patch 2:
- optimized ram_save_host_page() by using migration_bitmap_find_dirty() to find
dirty pages

History:

v3 -> v4:
- Remove the modification to ram_save_host_page() comment [Peter Xu]
- Remove the renaming of tmppages

v2 -> v3:
- Reduce unnecessary rate limiting if nothing is sent in the current iteration 
[David Edmondson]
- Invert the body of the loop in ram_save_host_page() [David Edmondson]

v1 -> v2:
- Modify ram_save_host_page() comment [David Edmondson]
- Remove 'goto' [David Edmondson]

Kunkun Jiang (2):
migration/ram: Reduce unnecessary rate limiting
migration/ram: Optimize ram_save_host_page()

   migration/ram.c | 34 +++---
   1 file changed, 19 insertions(+), 15 deletions(-)






Re: A question about the translation granule size supported by the vSMMU

2021-04-07 Thread Kunkun Jiang

Hi Eric,

On 2021/4/7 3:50, Auger Eric wrote:

Hi Kunkun,

On 3/27/21 3:24 AM, Kunkun Jiang wrote:

Hi all,

Recently, I did some tests on SMMU nested mode. Here is
a question about the translation granule size supported by
vSMMU.

There is such a code in SMMUv3_init_regs():


    /* 4K and 64K granule support */
     s->idr[5] = FIELD_DP32(s->idr[5], IDR5, GRAN4K, 1);
     s->idr[5] = FIELD_DP32(s->idr[5], IDR5, GRAN64K, 1);
     s->idr[5] = FIELD_DP32(s->idr[5], IDR5, OAS, SMMU_IDR5_OAS); /* 44
bits */

Why is the 16K granule not supported? I modified the code
to support it and did not encounter any problems in the
test. Although 4K and 64K minimal granules are "strongly
recommended", I think vSMMU should still support 16K.
Are there other reasons why 16K is not supported here?

no there aren't any. The main reasons were 16KB support is optional and
supporting it increases the test matrix. Also it seems quite a few
machines I have access to do support 16KB granule. On the others I get

"EFI stub: ERROR: This 16 KB granular kernel is not supported by your CPU".

Nevertheless I am not opposed to support it as it seems to work without
trouble. Just need to have an extra look at implied validity checks but
there shouldn't be much.

Thanks

Eric

Yes, you are right. In my opinion, it is necessary to check whether pSMMU
supports 16K to avoid the situation I mentioned below.
In SMMU nested mode, if the vSMMU supports 16K and sets the pasid table to
the pSMMU, errors may occur when the pSMMU does the translation table walk
if it doesn't support 16K (not tested). Do you think we need to add
an interface to get some pSMMU info?

Maybe my consideration was superfluous.

Thanks,
Kunkun Jiang

When in SMMU nested mode, it may get errors if pSMMU
doesn't support 16K but vSMMU supports 16K. But we
can get some settings of pSMMU to avoid this situation.
I found some discussions between Eric and Linu about
this [1], but this idea does not seem to be implemented.

[1] https://lists.gnu.org/archive/html/qemu-arm/2017-09/msg00149.html

Best regards,
Kunkun Jiang



.






Re: [PATCH v4 0/2] Some modification about ram_save_host_page()

2021-04-06 Thread Kunkun Jiang

Kindly ping,

Hi David Alan Gilbert,
Will this series be picked up soon, or is there any other work for me to do?

Best Regards,
Kunkun Jiang

On 2021/3/16 20:57, Kunkun Jiang wrote:

Hi all,

This series include patches as below:
Patch 1:
- reduce unnecessary rate limiting in ram_save_host_page()

Patch 2:
- optimized ram_save_host_page() by using migration_bitmap_find_dirty() to find
dirty pages

History:

v3 -> v4:
- Remove the modification to ram_save_host_page() comment [Peter Xu]
- Remove the renaming of tmppages

v2 -> v3:
- Reduce unnecessary rate limiting if nothing is sent in the current iteration 
[David Edmondson]
- Invert the body of the loop in ram_save_host_page() [David Edmondson]

v1 -> v2:
- Modify ram_save_host_page() comment [David Edmondson]
- Remove 'goto' [David Edmondson]

Kunkun Jiang (2):
   migration/ram: Reduce unnecessary rate limiting
   migration/ram: Optimize ram_save_host_page()

  migration/ram.c | 34 +++---
  1 file changed, 19 insertions(+), 15 deletions(-)






[RFC PATCH v2 2/4] vfio: Add vfio_prereg_listener_log_sync in nested stage

2021-03-31 Thread Kunkun Jiang
On Intel, the DMA is mapped through the host single stage. On ARM, instead,
we set up stage 2 and stage 1 separately in nested mode, as there
is no "Caching Mode".

Legacy vfio_listener_log_sync cannot be used in nested stage as we
don't need to pay close attention to stage 1 mapping. This patch adds
vfio_prereg_listener_log_sync to mark dirty pages in nested mode.

Signed-off-by: Kunkun Jiang 
---
 hw/vfio/common.c | 27 +++
 1 file changed, 27 insertions(+)

diff --git a/hw/vfio/common.c b/hw/vfio/common.c
index 3117979307..86722814d4 100644
--- a/hw/vfio/common.c
+++ b/hw/vfio/common.c
@@ -1284,6 +1284,22 @@ static int 
vfio_dma_sync_ram_section_dirty_bitmap(VFIOContainer *container,
 int128_get64(section->size), ram_addr);
 }
 
+static void vfio_prereg_listener_log_sync(MemoryListener *listener,
+  MemoryRegionSection *section)
+{
+VFIOContainer *container =
+container_of(listener, VFIOContainer, prereg_listener);
+
+if (!memory_region_is_ram(section->mr) ||
+!container->dirty_pages_supported) {
+return;
+}
+
+if (vfio_devices_all_dirty_tracking(container)) {
+vfio_dma_sync_ram_section_dirty_bitmap(container, section);
+}
+}
+
 typedef struct {
 IOMMUNotifier n;
 VFIOGuestIOMMU *giommu;
@@ -1328,6 +1344,16 @@ static int vfio_sync_dirty_bitmap(VFIOContainer 
*container,
 if (memory_region_is_iommu(section->mr)) {
 VFIOGuestIOMMU *giommu;
 
+/*
+ * In nested mode, stage 2 (gpa->hpa) and stage 1 (giova->gpa) are
+ * set up separately. It is inappropriate to pass 'giova' to kernel
+ * to get dirty pages. We only need to focus on stage 2 mapping when
+ * marking dirty pages.
+ */
+if (container->iommu_type == VFIO_TYPE1_NESTING_IOMMU) {
+return 0;
+}
+
 QLIST_FOREACH(giommu, &container->giommu_list, giommu_next) {
 if (MEMORY_REGION(giommu->iommu) == section->mr &&
 giommu->n.start == section->offset_within_region) {
@@ -1382,6 +1408,7 @@ static const MemoryListener vfio_memory_listener = {
 static MemoryListener vfio_memory_prereg_listener = {
 .region_add = vfio_prereg_listener_region_add,
 .region_del = vfio_prereg_listener_region_del,
+.log_sync = vfio_prereg_listener_log_sync,
 };
 
 static void vfio_listener_release(VFIOContainer *container)
-- 
2.23.0




[RFC PATCH v2 3/4] vfio: Add vfio_prereg_listener_global_log_start/stop in nested stage

2021-03-31 Thread Kunkun Jiang
In nested mode, we set up the stage 2 and stage 1 separately. In my
opinion, vfio_memory_prereg_listener is used for stage 2 and
vfio_memory_listener is used for stage 1. So it feels weird to call
the global_log_start/stop interface in vfio_memory_listener to switch
dirty tracking, although this won't cause any errors. Adding a
global_log_start/stop interface to vfio_memory_prereg_listener
can separate stage 2 from stage 1.

Signed-off-by: Kunkun Jiang 
---
 hw/vfio/common.c | 22 ++
 1 file changed, 22 insertions(+)

diff --git a/hw/vfio/common.c b/hw/vfio/common.c
index 86722814d4..efea252e46 100644
--- a/hw/vfio/common.c
+++ b/hw/vfio/common.c
@@ -1209,6 +1209,16 @@ static void 
vfio_listener_log_global_start(MemoryListener *listener)
 {
 VFIOContainer *container = container_of(listener, VFIOContainer, listener);
 
+if (container->iommu_type != VFIO_TYPE1_NESTING_IOMMU) {
+vfio_set_dirty_page_tracking(container, true);
+}
+}
+
+static void vfio_prereg_listener_log_global_start(MemoryListener *listener)
+{
+VFIOContainer *container =
+container_of(listener, VFIOContainer, prereg_listener);
+
 vfio_set_dirty_page_tracking(container, true);
 }
 
@@ -1216,6 +1226,16 @@ static void vfio_listener_log_global_stop(MemoryListener 
*listener)
 {
 VFIOContainer *container = container_of(listener, VFIOContainer, listener);
 
+if (container->iommu_type != VFIO_TYPE1_NESTING_IOMMU) {
+vfio_set_dirty_page_tracking(container, false);
+}
+}
+
+static void vfio_prereg_listener_log_global_stop(MemoryListener *listener)
+{
+VFIOContainer *container =
+container_of(listener, VFIOContainer, prereg_listener);
+
 vfio_set_dirty_page_tracking(container, false);
 }
 
@@ -1408,6 +1428,8 @@ static const MemoryListener vfio_memory_listener = {
 static MemoryListener vfio_memory_prereg_listener = {
 .region_add = vfio_prereg_listener_region_add,
 .region_del = vfio_prereg_listener_region_del,
+.log_global_start = vfio_prereg_listener_log_global_start,
+.log_global_stop = vfio_prereg_listener_log_global_stop,
 .log_sync = vfio_prereg_listener_log_sync,
 };
 
-- 
2.23.0




[RFC PATCH v2 0/4] Add migration support for VFIO PCI devices in SMMUv3 nested stage mode

2021-03-31 Thread Kunkun Jiang
Hi all,

Since the SMMUv3 nested translation stages [1] have been introduced by Eric, we
need to pay attention to the migration of VFIO PCI devices in SMMUv3 nested stage
mode. At present, it is not yet supported in QEMU. There are two problems in the
existing framework.

First, the current way to get dirty pages is not applicable to nested stage mode.
Because of the "Caching Mode", VTD can map the RAM through the host single stage
(giova->hpa). "vfio_listener_log_sync" gets dirty pages by passing "giova" to the
kernel for the RAM block section of the mapped MMIO region. In nested stage mode,
we set up stage 2 (gpa->hpa) and stage 1 (giova->gpa) separately, so the current
way of getting dirty pages is inapplicable.

Second, we also need to pass the stage 1 configurations to the destination host
after the migration. In Eric's patch, the stage 1 configuration is passed to the
host on each STE update for the devices that set the PASID PciOps, and it is
applied at the physical level. But the data at the physical level will not be sent
to the destination host. So we have to pass the stage 1 configurations to the
destination host after the migration.

Best Regards,
Kunkun Jiang

[1] [RFC,v8,00/28] vSMMUv3/pSMMUv3 2 stage VFIO integration
http://patchwork.ozlabs.org/project/qemu-devel/cover/20210225105233.650545-1-eric.au...@redhat.com/

This Patch set includes patches as below:
Patch 1-2:
- Refactor the vfio_listener_log_sync and added a new function to get dirty 
pages
in nested stage mode.

Patch 3:
- Added global_log_start/stop interface in vfio_memory_prereg_listener

Patch 4:
- Added the post_load function to vmstate_smmuv3 for passing stage 1 
configuration
to the destination host after the migration.

History:

v1 -> v2:
- Add global_log_start/stop interface in vfio_memory_prereg_listener
- Add support for repass stage 1 configs with multiple CDs to the host

Kunkun Jiang (4):
  vfio: Introduce helpers to mark dirty pages of a RAM section
  vfio: Add vfio_prereg_listener_log_sync in nested stage
  vfio: Add vfio_prereg_listener_global_log_start/stop in nested stage
  hw/arm/smmuv3: Post-load stage 1 configurations to the host

 hw/arm/smmuv3.c | 62 +++
 hw/arm/trace-events |  1 +
 hw/vfio/common.c| 71 -
 3 files changed, 126 insertions(+), 8 deletions(-)

-- 
2.23.0




[RFC PATCH v2 4/4] hw/arm/smmuv3: Post-load stage 1 configurations to the host

2021-03-31 Thread Kunkun Jiang
In nested mode, we call the set_pasid_table() callback on each STE
update to pass the guest stage 1 configuration to the host and
apply it at physical level.

In the case of live migration, we need to manually call
set_pasid_table() to load the guest stage 1 configurations into the
host. If this operation fails, the migration fails.

Signed-off-by: Kunkun Jiang 
---
 hw/arm/smmuv3.c | 62 +
 hw/arm/trace-events |  1 +
 2 files changed, 63 insertions(+)

diff --git a/hw/arm/smmuv3.c b/hw/arm/smmuv3.c
index 55aa6ad874..4d28ca3777 100644
--- a/hw/arm/smmuv3.c
+++ b/hw/arm/smmuv3.c
@@ -1534,6 +1534,67 @@ static void smmu_realize(DeviceState *d, Error **errp)
 smmu_init_irq(s, dev);
 }
 
+static int smmuv3_manual_set_pci_device_pasid_table(SMMUDevice *sdev)
+{
+#ifdef __linux__
+IOMMUMemoryRegion *mr = &(sdev->iommu);
+int sid = smmu_get_sid(sdev);
+SMMUEventInfo event = {.type = SMMU_EVT_NONE, .sid = sid,
+   .inval_ste_allowed = true};
+IOMMUConfig iommu_config = {};
+SMMUTransCfg *cfg;
+int ret = -1;
+
+cfg = smmuv3_get_config(sdev, &event);
+if (!cfg) {
+return ret;
+}
+
+iommu_config.pasid_cfg.argsz = sizeof(struct iommu_pasid_table_config);
+iommu_config.pasid_cfg.version = PASID_TABLE_CFG_VERSION_1;
+iommu_config.pasid_cfg.format = IOMMU_PASID_FORMAT_SMMUV3;
+iommu_config.pasid_cfg.base_ptr = cfg->s1ctxptr;
+iommu_config.pasid_cfg.pasid_bits = cfg->s1cdmax;
+iommu_config.pasid_cfg.vendor_data.smmuv3.version = 
PASID_TABLE_SMMUV3_CFG_VERSION_1;
+iommu_config.pasid_cfg.vendor_data.smmuv3.s1fmt = cfg->s1fmt;
+iommu_config.pasid_cfg.vendor_data.smmuv3.s1dss = cfg->s1dss;
+
+if (cfg->disabled || cfg->bypassed) {
+iommu_config.pasid_cfg.config = IOMMU_PASID_CONFIG_BYPASS;
+} else if (cfg->aborted) {
+iommu_config.pasid_cfg.config = IOMMU_PASID_CONFIG_ABORT;
+} else {
+iommu_config.pasid_cfg.config = IOMMU_PASID_CONFIG_TRANSLATE;
+}
+
+ret = pci_device_set_pasid_table(sdev->bus, sdev->devfn, &iommu_config);
+if (ret) {
+error_report("Failed to pass PASID table to host for iommu mr %s (%m)",
+ mr->parent_obj.name);
+}
+
+return ret;
+#endif
+}
+
+static int smmuv3_post_load(void *opaque, int version_id)
+{
+SMMUv3State *s3 = opaque;
+SMMUState *s = &(s3->smmu_state);
+SMMUDevice *sdev;
+int ret = 0;
+
+QLIST_FOREACH(sdev, &s->devices_with_notifiers, next) {
+trace_smmuv3_post_load_sdev(sdev->devfn, sdev->iommu.parent_obj.name);
+ret = smmuv3_manual_set_pci_device_pasid_table(sdev);
+if (ret) {
+break;
+}
+}
+
+return ret;
+}
+
 static const VMStateDescription vmstate_smmuv3_queue = {
 .name = "smmuv3_queue",
 .version_id = 1,
@@ -1552,6 +1613,7 @@ static const VMStateDescription vmstate_smmuv3 = {
 .version_id = 1,
 .minimum_version_id = 1,
 .priority = MIG_PRI_IOMMU,
+.post_load = smmuv3_post_load,
 .fields = (VMStateField[]) {
 VMSTATE_UINT32(features, SMMUv3State),
 VMSTATE_UINT8(sid_size, SMMUv3State),
diff --git a/hw/arm/trace-events b/hw/arm/trace-events
index b0b0030d24..2f093286ec 100644
--- a/hw/arm/trace-events
+++ b/hw/arm/trace-events
@@ -54,4 +54,5 @@ smmuv3_notify_flag_add(const char *iommu) "ADD SMMUNotifier 
node for iommu mr=%s
 smmuv3_notify_flag_del(const char *iommu) "DEL SMMUNotifier node for iommu 
mr=%s"
 smmuv3_inv_notifiers_iova(const char *name, uint16_t asid, uint64_t iova, 
uint8_t tg, uint64_t num_pages) "iommu mr=%s asid=%d iova=0x%"PRIx64" tg=%d 
num_pages=0x%"PRIx64
 smmuv3_notify_config_change(const char *name, uint8_t config, uint64_t 
s1ctxptr) "iommu mr=%s config=%d s1ctxptr=0x%"PRIx64
+smmuv3_post_load_sdev(int devfn, const char *name) "sdev devfn=%d iommu 
mr=%s"PRIx64
 
-- 
2.23.0




[RFC PATCH v2 1/4] vfio: Introduce helpers to mark dirty pages of a RAM section

2021-03-31 Thread Kunkun Jiang
Extract part of the code from vfio_sync_dirty_bitmap to form a
new helper, which allows to mark dirty pages of a RAM section.
This helper will be called for nested stage.

Signed-off-by: Kunkun Jiang 
---
 hw/vfio/common.c | 22 ++
 1 file changed, 14 insertions(+), 8 deletions(-)

diff --git a/hw/vfio/common.c b/hw/vfio/common.c
index 7da1e95b43..3117979307 100644
--- a/hw/vfio/common.c
+++ b/hw/vfio/common.c
@@ -1271,6 +1271,19 @@ err_out:
 return ret;
 }
 
+static int vfio_dma_sync_ram_section_dirty_bitmap(VFIOContainer *container,
+  MemoryRegionSection *section)
+{
+ram_addr_t ram_addr;
+
+ram_addr = memory_region_get_ram_addr(section->mr) +
+   section->offset_within_region;
+
+return vfio_get_dirty_bitmap(container,
+REAL_HOST_PAGE_ALIGN(section->offset_within_address_space),
+int128_get64(section->size), ram_addr);
+}
+
 typedef struct {
 IOMMUNotifier n;
 VFIOGuestIOMMU *giommu;
@@ -1312,8 +1325,6 @@ static void vfio_iommu_map_dirty_notify(IOMMUNotifier *n, 
IOMMUTLBEntry *iotlb)
 static int vfio_sync_dirty_bitmap(VFIOContainer *container,
   MemoryRegionSection *section)
 {
-ram_addr_t ram_addr;
-
 if (memory_region_is_iommu(section->mr)) {
 VFIOGuestIOMMU *giommu;
 
@@ -1342,12 +1353,7 @@ static int vfio_sync_dirty_bitmap(VFIOContainer 
*container,
 return 0;
 }
 
-ram_addr = memory_region_get_ram_addr(section->mr) +
-   section->offset_within_region;
-
-return vfio_get_dirty_bitmap(container,
-   REAL_HOST_PAGE_ALIGN(section->offset_within_address_space),
-   int128_get64(section->size), ram_addr);
+return vfio_dma_sync_ram_section_dirty_bitmap(container, section);
 }
 
 static void vfio_listener_log_sync(MemoryListener *listener,
-- 
2.23.0




[PATCH] hw/arm/smmuv3: Support 16K translation granule

2021-03-31 Thread Kunkun Jiang
The driver can query some bits in SMMUv3 IDR5 to learn which
translation granules are supported. Arm recommends that SMMUv3
implementations support at least 4K and 64K granules. But in
the vSMMUv3, there seems to be no reason not to support the 16K
translation granule. In addition, if 16K is not supported,
vSVA will fail to be enabled in the future for a 16K guest
kernel. So it is better to support it.

Signed-off-by: Kunkun Jiang 
---
 hw/arm/smmuv3.c | 6 --
 1 file changed, 4 insertions(+), 2 deletions(-)

diff --git a/hw/arm/smmuv3.c b/hw/arm/smmuv3.c
index 3b87324ce2..0a483b0bab 100644
--- a/hw/arm/smmuv3.c
+++ b/hw/arm/smmuv3.c
@@ -259,8 +259,9 @@ static void smmuv3_init_regs(SMMUv3State *s)
 s->idr[3] = FIELD_DP32(s->idr[3], IDR3, RIL, 1);
 s->idr[3] = FIELD_DP32(s->idr[3], IDR3, HAD, 1);
 
-   /* 4K and 64K granule support */
+/* 4K, 16K and 64K granule support */
 s->idr[5] = FIELD_DP32(s->idr[5], IDR5, GRAN4K, 1);
+s->idr[5] = FIELD_DP32(s->idr[5], IDR5, GRAN16K, 1);
 s->idr[5] = FIELD_DP32(s->idr[5], IDR5, GRAN64K, 1);
 s->idr[5] = FIELD_DP32(s->idr[5], IDR5, OAS, SMMU_IDR5_OAS); /* 44 bits */
 
@@ -503,7 +504,8 @@ static int decode_cd(SMMUTransCfg *cfg, CD *cd, 
SMMUEventInfo *event)
 
 tg = CD_TG(cd, i);
 tt->granule_sz = tg2granule(tg, i);
-if ((tt->granule_sz != 12 && tt->granule_sz != 16) || CD_ENDI(cd)) {
+if ((tt->granule_sz != 12 && tt->granule_sz != 14 &&
+ tt->granule_sz != 16) || CD_ENDI(cd)) {
 goto bad_cd;
 }
 
-- 
2.19.1




A question about the translation granule size supported by the vSMMU

2021-03-26 Thread Kunkun Jiang

Hi all,

Recently, I did some tests on SMMU nested mode. Here is
a question about the translation granule size supported by
vSMMU.

There is such a code in SMMUv3_init_regs():


   /* 4K and 64K granule support */
    s->idr[5] = FIELD_DP32(s->idr[5], IDR5, GRAN4K, 1);
    s->idr[5] = FIELD_DP32(s->idr[5], IDR5, GRAN64K, 1);
    s->idr[5] = FIELD_DP32(s->idr[5], IDR5, OAS, SMMU_IDR5_OAS); /* 44 
bits */

Why is the 16K granule not supported? I modified the code
to support it and did not encounter any problems in the
test. Although 4K and 64K minimal granules are "strongly
recommended", I think vSMMU should still support 16K.
Are there other reasons why 16K is not supported here?

In SMMU nested mode, errors may occur if the pSMMU
doesn't support 16K but the vSMMU does. However, we
could query some pSMMU settings to avoid this situation.
I found some discussions between Eric and Linu about
this [1], but this idea does not seem to be implemented.

[1] https://lists.gnu.org/archive/html/qemu-arm/2017-09/msg00149.html

Best regards,
Kunkun Jiang




Re: [RFC PATCH 0/3] vfio/migration: Support manual clear vfio dirty log

2021-03-22 Thread Kunkun Jiang

On 2021/3/18 20:36, Tian, Kevin wrote:

From: Kunkun Jiang 
Sent: Thursday, March 18, 2021 8:29 PM

Hi Kevin,

On 2021/3/18 17:04, Tian, Kevin wrote:

From: Kunkun Jiang 
Sent: Thursday, March 18, 2021 3:59 PM

Hi Kevin,

On 2021/3/18 14:28, Tian, Kevin wrote:

From: Kunkun Jiang
Sent: Wednesday, March 10, 2021 5:41 PM

Hi all,

In the past, we clear dirty log immediately after sync dirty log to
userspace. This may cause redundant dirty handling if userspace
handles dirty log iteratively:

After vfio clears dirty log, new dirty log starts to generate. These
new dirty log will be reported to userspace even if they are generated
before userspace handles the same dirty page.

Since a new dirty log tracking method for vfio based on iommu

hwdbm[1]

has been introduced in the kernel and added a new capability named
VFIO_DIRTY_LOG_MANUAL_CLEAR, we can eliminate some redundant

dirty

handling by supporting it.

Is there any performance data showing the benefit of this new method?


Current dirty log tracking method for VFIO:
[1] All pages marked dirty if not all iommu_groups have pinned_scope
[2] pinned pages by various vendor drivers if all iommu_groups have
pinned scope

Both methods are coarse-grained and can not determine which pages are
really dirty. Each round may mark the pages that are not really dirty as
dirty
and send them to the destination. ( It might be better if the range of the
pinned_scope was smaller. ) This will result in a waste of resources.

HWDBM is short for Hardware Dirty Bit Management.
(e.g. smmuv3 HTTU, Hardware Translation Table Update)

About SMMU HTTU:
HTTU is a feature of ARM SMMUv3, it can update access flag or/and dirty
state of the TTD (Translation Table Descriptor) by hardware.

With HTTU, stage1 TTD is classified into 3 types:
                   DBM bit   AP[2] (readonly bit)
1. writable_clean     1           1
2. writable_dirty     1           0
3. readonly           0           1

If HTTU_HD (manage dirty state) is enabled, smmu can change TTD from
writable_clean to writable_dirty. Then software can scan TTD to sync dirty
state into dirty bitmap. With this feature, we can track the dirty log of
DMA continuously and precisely.

The capability of VFIO_DIRTY_LOG_MANUAL_CLEAR is similar to that on
the KVM side. We add this new log_clear() interface only to split the old
log_sync() into two separated procedures:

- use log_sync() to collect the collection only, and,
- use log_clear() to clear the dirty bitmap.

If you're interested in this new method, you can take a look at our set of
patches.
[1]
https://lore.kernel.org/linux-iommu/20210310090614.26668-1-
zhukeqi...@huawei.com/


I know what you are doing. Intel is also working on VT-d dirty bit support
based on above link. What I'm curious is the actual performance gain
with this optimization. KVM doing that is one good reference, but IOMMU
has different characteristics (e.g. longer invalidation latency) compared to
CPU MMU. It's always good to understand what a so-called optimization
can actually optimize in a context different from where it's originally

proved.

Thanks
Kevin

My understanding is that this is a new method, which is quite different
from the
previous two. So can you explain in more detail what performance data
you want?

Thanks,
Kunkun Jiang

When you have HTTU enabled, compare the migration efficiency with and
without this manual clear interface.

Thanks
Kevin


Hi Kevin,

Sorry for late reply.

I tested it on our FPGA in two scenarios.
[1] perform limited times of DMA on a fixed ram area
[2] perform infinite DMA on a fixed ram area

In scenario [1], we can clearly see that less data is transmitted
with this manual clear interface. For example, when a total of 10 DMA
operations are performed, the amount of transferred data is the sum of
only 6 of them. This depends on whether the device performs a DMA to the
dirty RAM area between log_sync() and log_clear().

In scenario [2], with or without this manual clear interface, it doesn't
make a big difference. Every time we call log_sync(), the fixed ram
area is dirty.

So, in general scenarios, it can reduce the amount of transmitted data.

In addition, regarding the difference between MMU and IOMMU (e.g.
longer invalidation latency) you mentioned last time, I think it has no
effect on this manual clear interface. Currently, the way we invalidate
the IOMMU TLB is iommu_flush_iotlb_all(). It takes less time than multiple
range-based invalidations.

Thanks,
Kunkun Jiang




Re: [RFC PATCH 0/3] vfio/migration: Support manual clear vfio dirty log

2021-03-18 Thread Kunkun Jiang

On 2021/3/18 20:36, Tian, Kevin wrote:

From: Kunkun Jiang 
Sent: Thursday, March 18, 2021 8:29 PM

Hi Kevin,

On 2021/3/18 17:04, Tian, Kevin wrote:

From: Kunkun Jiang 
Sent: Thursday, March 18, 2021 3:59 PM

Hi Kevin,

On 2021/3/18 14:28, Tian, Kevin wrote:

From: Kunkun Jiang
Sent: Wednesday, March 10, 2021 5:41 PM

Hi all,

In the past, we clear dirty log immediately after sync dirty log to
userspace. This may cause redundant dirty handling if userspace
handles dirty log iteratively:

After vfio clears dirty log, new dirty log starts to generate. These
new dirty log will be reported to userspace even if they are generated
before userspace handles the same dirty page.

Since a new dirty log tracking method for vfio based on iommu

hwdbm[1]

has been introduced in the kernel and added a new capability named
VFIO_DIRTY_LOG_MANUAL_CLEAR, we can eliminate some redundant

dirty

handling by supporting it.

Is there any performance data showing the benefit of this new method?


Current dirty log tracking method for VFIO:
[1] All pages marked dirty if not all iommu_groups have pinned_scope
[2] pinned pages by various vendor drivers if all iommu_groups have
pinned scope

Both methods are coarse-grained and can not determine which pages are
really dirty. Each round may mark the pages that are not really dirty as
dirty
and send them to the destination. ( It might be better if the range of the
pinned_scope was smaller. ) This will result in a waste of resources.

HWDBM is short for Hardware Dirty Bit Management.
(e.g. smmuv3 HTTU, Hardware Translation Table Update)

About SMMU HTTU:
HTTU is a feature of ARM SMMUv3, it can update access flag or/and dirty
state of the TTD (Translation Table Descriptor) by hardware.

With HTTU, stage1 TTD is classified into 3 types:
                   DBM bit   AP[2] (readonly bit)
1. writable_clean     1           1
2. writable_dirty     1           0
3. readonly           0           1

If HTTU_HD (manage dirty state) is enabled, smmu can change TTD from
writable_clean to writable_dirty. Then software can scan TTD to sync dirty
state into dirty bitmap. With this feature, we can track the dirty log of
DMA continuously and precisely.

The capability of VFIO_DIRTY_LOG_MANUAL_CLEAR is similar to that on
the KVM side. We add this new log_clear() interface only to split the old
log_sync() into two separated procedures:

- use log_sync() to collect the collection only, and,
- use log_clear() to clear the dirty bitmap.

If you're interested in this new method, you can take a look at our set of
patches.
[1]
https://lore.kernel.org/linux-iommu/20210310090614.26668-1-
zhukeqi...@huawei.com/


I know what you are doing. Intel is also working on VT-d dirty bit support
based on above link. What I'm curious is the actual performance gain
with this optimization. KVM doing that is one good reference, but IOMMU
has different characteristics (e.g. longer invalidation latency) compared to
CPU MMU. It's always good to understand what a so-called optimization
can actually optimize in a context different from where it's originally

proved.

Thanks
Kevin

My understanding is that this is a new method, which is quite different
from the
previous two. So can you explain in more detail what performance data
you want?

Thanks,
Kunkun Jiang

When you have HTTU enabled, compare the migration efficiency with and
without this manual clear interface.

Thanks
Kevin


Oh, I see. From a theoretical analysis, this manual clear interface will be
effective.  I will perform some tests on it and send out the results.

Thanks,
Kunkun Jiang




Re: [RFC PATCH 0/3] vfio/migration: Support manual clear vfio dirty log

2021-03-18 Thread Kunkun Jiang

Hi Kevin,

On 2021/3/18 17:04, Tian, Kevin wrote:

From: Kunkun Jiang 
Sent: Thursday, March 18, 2021 3:59 PM

Hi Kevin,

On 2021/3/18 14:28, Tian, Kevin wrote:

From: Kunkun Jiang
Sent: Wednesday, March 10, 2021 5:41 PM

Hi all,

In the past, we clear dirty log immediately after sync dirty log to
userspace. This may cause redundant dirty handling if userspace
handles dirty log iteratively:

After vfio clears dirty log, new dirty log starts to generate. These
new dirty log will be reported to userspace even if they are generated
before userspace handles the same dirty page.

Since a new dirty log tracking method for vfio based on iommu hwdbm[1]
has been introduced in the kernel and added a new capability named
VFIO_DIRTY_LOG_MANUAL_CLEAR, we can eliminate some redundant

dirty

handling by supporting it.

Is there any performance data showing the benefit of this new method?


Current dirty log tracking method for VFIO:
[1] All pages marked dirty if not all iommu_groups have pinned_scope
[2] pinned pages by various vendor drivers if all iommu_groups have
pinned scope

Both methods are coarse-grained and can not determine which pages are
really dirty. Each round may mark the pages that are not really dirty as
dirty
and send them to the destination. ( It might be better if the range of the
pinned_scope was smaller. ) This will result in a waste of resources.

HWDBM is short for Hardware Dirty Bit Management.
(e.g. smmuv3 HTTU, Hardware Translation Table Update)

About SMMU HTTU:
HTTU is a feature of ARM SMMUv3, it can update access flag or/and dirty
state of the TTD (Translation Table Descriptor) by hardware.

With HTTU, stage1 TTD is classified into 3 types:
                   DBM bit   AP[2] (readonly bit)
1. writable_clean     1           1
2. writable_dirty     1           0
3. readonly           0           1

If HTTU_HD (manage dirty state) is enabled, smmu can change TTD from
writable_clean to writable_dirty. Then software can scan TTD to sync dirty
state into dirty bitmap. With this feature, we can track the dirty log of
DMA continuously and precisely.

The capability of VFIO_DIRTY_LOG_MANUAL_CLEAR is similar to that on
the KVM side. We add this new log_clear() interface only to split the old
log_sync() into two separated procedures:

- use log_sync() to collect the collection only, and,
- use log_clear() to clear the dirty bitmap.

If you're interested in this new method, you can take a look at our set of
patches.
[1]
https://lore.kernel.org/linux-iommu/20210310090614.26668-1-
zhukeqi...@huawei.com/


I know what you are doing. Intel is also working on VT-d dirty bit support
based on above link. What I'm curious is the actual performance gain
with this optimization. KVM doing that is one good reference, but IOMMU
has different characteristics (e.g. longer invalidation latency) compared to
CPU MMU. It's always good to understand what a so-called optimization
can actually optimize in a context different from where it's originally proved.

Thanks
Kevin


My understanding is that this is a new method, which is quite different 
from the
previous two. So can you explain in more detail what performance data 
you want?


Thanks,
Kunkun Jiang




Re: [RFC PATCH 0/3] vfio/migration: Support manual clear vfio dirty log

2021-03-18 Thread Kunkun Jiang

Hi Kevin,

On 2021/3/18 14:28, Tian, Kevin wrote:

From: Kunkun Jiang
Sent: Wednesday, March 10, 2021 5:41 PM

Hi all,

In the past, we clear dirty log immediately after sync dirty log to
userspace. This may cause redundant dirty handling if userspace
handles dirty log iteratively:

After vfio clears dirty log, new dirty log starts to generate. These
new dirty log will be reported to userspace even if they are generated
before userspace handles the same dirty page.

Since a new dirty log tracking method for vfio based on iommu hwdbm[1]
has been introduced in the kernel and added a new capability named
VFIO_DIRTY_LOG_MANUAL_CLEAR, we can eliminate some redundant dirty
handling by supporting it.

Is there any performance data showing the benefit of this new method?


Current dirty log tracking method for VFIO:
[1] All pages marked dirty if not all iommu_groups have pinned_scope
[2] pinned pages by various vendor drivers if all iommu_groups have 
pinned scope


Both methods are coarse-grained and cannot determine which pages are
really dirty. Each round may mark pages that are not really dirty as dirty
and send them to the destination. (It might be better if the range of the
pinned_scope was smaller.) This will result in a waste of resources.

HWDBM is short for Hardware Dirty Bit Management.
(e.g. smmuv3 HTTU, Hardware Translation Table Update)

About SMMU HTTU:
HTTU is a feature of ARM SMMUv3, it can update access flag or/and dirty
state of the TTD (Translation Table Descriptor) by hardware.

With HTTU, stage1 TTD is classified into 3 types:
                   DBM bit   AP[2] (readonly bit)
1. writable_clean     1           1
2. writable_dirty     1           0
3. readonly           0           1

If HTTU_HD (manage dirty state) is enabled, smmu can change TTD from
writable_clean to writable_dirty. Then software can scan TTD to sync dirty
state into dirty bitmap. With this feature, we can track the dirty log of
DMA continuously and precisely.
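
To make the scan step concrete, here is an illustrative sketch only (a
kernel-side idea, not code from this series; bit positions and names are
written out just for the example):

/* Harvest HTTU dirty state from stage-1 leaf TTDs (illustrative sketch). */
#define TTD_DBM  (1ULL << 51)   /* DBM: dirty bit modifier */
#define TTD_AP2  (1ULL << 7)    /* AP[2]: read-only bit */

static void sync_dirty_from_ttds(uint64_t *ttd, unsigned long nr_pages,
                                 uint64_t *bitmap)
{
    unsigned long i;

    for (i = 0; i < nr_pages; i++) {
        uint64_t pte = ttd[i];

        /* writable_dirty: DBM == 1 and AP[2] == 0 */
        if ((pte & TTD_DBM) && !(pte & TTD_AP2)) {
            bitmap[i / 64] |= 1ULL << (i % 64);  /* record the dirty page  */
            ttd[i] = pte | TTD_AP2;              /* back to writable_clean */
        }
    }
}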

The capability of VFIO_DIRTY_LOG_MANUAL_CLEAR is similar to that on
the KVM side. We add this new log_clear() interface only to split the old
log_sync() into two separate procedures:

- use log_sync() to collect the dirty log only, and,
- use log_clear() to clear the dirty bitmap.
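
Roughly, the intended flow per DMA range looks like this (pseudo-code;
vfio_dirty_ioctl() and send_dirty_pages() are placeholders, only the ioctl
flags come from the kernel interface):

static void handle_dirty_range(VFIOContainer *container, VFIODMARange *qrange)
{
    /* log_sync(): read the dirty bitmap, the kernel no longer clears it here */
    vfio_dirty_ioctl(container, VFIO_IOMMU_DIRTY_PAGES_FLAG_GET_BITMAP, qrange);

    /*
     * log_clear(): drop exactly the bits we are about to handle. A write that
     * lands after the old clear-at-sync point but before the page is processed
     * no longer produces a redundant dirty bit for data we send anyway.
     */
    vfio_dirty_ioctl(container, VFIO_IOMMU_DIRTY_PAGES_FLAG_CLEAR_BITMAP, qrange);
    send_dirty_pages(qrange);
}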

If you're interested in this new method, you can take a look at our set of
patches.
[1] 
https://lore.kernel.org/linux-iommu/20210310090614.26668-1-zhukeqi...@huawei.com/


Best regards,
Kunkun Jiang


This series include patches as below:
Patch 1:
- updated the linux-headers/linux/vfio.h from kernel side

Patch 2:
- introduced 'struct VFIODMARange' to describe a range of the given DMA
   mapping and with respect to a VFIO_IOMMU_MAP_DMA operation

Patch 3:
- implemented the operation to manual clear vfio dirty log, which can
   eliminate some redundant dirty handling

Thanks,
Kunkun Jiang

[1] https://lore.kernel.org/linux-iommu/20210310090614.26668-1-
zhukeqi...@huawei.com/T/#mb168c9738ecd3d8794e2da14f970545d5820f
863

Zenghui Yu (3):
   linux-headers: update against 5.12-rc2 and "vfio log clear" series
   vfio: Maintain DMA mapping range for the container
   vfio/migration: Support VFIO_IOMMU_DIRTY_PAGES_FLAG_CLEAR_BITMAP

  hw/vfio/common.c  | 207 --
  include/hw/vfio/vfio-common.h |  10 ++
  linux-headers/linux/vfio.h|  55 -
  3 files changed, 264 insertions(+), 8 deletions(-)

--
2.23.0



.






Re: [RFC PATCH 0/3] vfio/migration: Support manual clear vfio dirty log

2021-03-18 Thread Kunkun Jiang

kindly ping,
Any comments and reviews are welcome.

Thanks,
Kunkun Jiang

On 2021/3/10 17:41, Kunkun Jiang wrote:

Hi all,

In the past, we clear dirty log immediately after sync dirty log to
userspace. This may cause redundant dirty handling if userspace
handles dirty log iteratively:

After vfio clears dirty log, new dirty log starts to generate. These
new dirty log will be reported to userspace even if they are generated
before userspace handles the same dirty page.

Since a new dirty log tracking method for vfio based on iommu hwdbm[1]
has been introduced in the kernel and added a new capability named
VFIO_DIRTY_LOG_MANUAL_CLEAR, we can eliminate some redundant dirty
handling by supporting it.

This series include patches as below:
Patch 1:
- updated the linux-headers/linux/vfio.h from kernel side

Patch 2:
- introduced 'struct VFIODMARange' to describe a range of the given DMA
   mapping and with respect to a VFIO_IOMMU_MAP_DMA operation

Patch 3:
- implemented the operation to manual clear vfio dirty log, which can
   eliminate some redundant dirty handling

Thanks,
Kunkun Jiang

[1] 
https://lore.kernel.org/linux-iommu/20210310090614.26668-1-zhukeqi...@huawei.com/T/#mb168c9738ecd3d8794e2da14f970545d5820f863

Zenghui Yu (3):
   linux-headers: update against 5.12-rc2 and "vfio log clear" series
   vfio: Maintain DMA mapping range for the container
   vfio/migration: Support VFIO_IOMMU_DIRTY_PAGES_FLAG_CLEAR_BITMAP

  hw/vfio/common.c  | 207 --
  include/hw/vfio/vfio-common.h |  10 ++
  linux-headers/linux/vfio.h|  55 -
  3 files changed, 264 insertions(+), 8 deletions(-)






Re: [PATCH v4 1/2] migration/ram: Reduce unnecessary rate limiting

2021-03-16 Thread Kunkun Jiang

Hi  Peter,

On 2021/3/17 5:39, Peter Xu wrote:

On Tue, Mar 16, 2021 at 08:57:15PM +0800, Kunkun Jiang wrote:

When the host page is a huge page and something is sent in the
current iteration, migration_rate_limit() should be executed.
If not, it can be omitted.

Signed-off-by: Keqian Zhu 
Signed-off-by: Kunkun Jiang 
Reviewed-by: David Edmondson 
---
  migration/ram.c | 9 +++--
  1 file changed, 7 insertions(+), 2 deletions(-)

diff --git a/migration/ram.c b/migration/ram.c
index 72143da0ac..3eb5b0d7a7 100644
--- a/migration/ram.c
+++ b/migration/ram.c
@@ -2015,8 +2015,13 @@ static int ram_save_host_page(RAMState *rs, 
PageSearchStatus *pss,
  
  pages += tmppages;

  pss->page++;
-/* Allow rate limiting to happen in the middle of huge pages */
-migration_rate_limit();
+/*
+ * Allow rate limiting to happen in the middle of huge pages if
+ * something is sent in the current iteration.
+ */
+if (pagesize_bits > 1 && tmppages > 0) {
+migration_rate_limit();
+}

Sorry I'm still not a fan of this - I'd even prefer calling that once more just
to make sure it won't be forgotten to be called..  Not to say it's merely a 
noop.

I'll leave this to Dave..  Maybe I'm too harsh! :)


You are very serious and meticulous. I like your character very much.
This patch was previously reviewed by David, so I would like to know
his opinion.

@David
Hi David, what is your opinion on this patch?

Thanks,
Kunkun Jiang



[PATCH v4 2/2] migration/ram: Optimize ram_save_host_page()

2021-03-16 Thread Kunkun Jiang
Starting from pss->page, ram_save_host_page() will check every page
and send the dirty pages up to the end of the current host page or
the boundary of used_length of the block. If the host page size is
a huge page, the step "check" will take a lot of time.

It will improve performance to use migration_bitmap_find_dirty().

Tested on Kunpeng 920; VM parameters: 1U 4G (page size 1G)
The time of ram_save_host_page() in the last round of ram saving:
before optimization: 9250us, after optimization: 34us

Signed-off-by: Keqian Zhu 
Signed-off-by: Kunkun Jiang 
---
 migration/ram.c | 39 +++
 1 file changed, 19 insertions(+), 20 deletions(-)

diff --git a/migration/ram.c b/migration/ram.c
index 3eb5b0d7a7..1d9c24aa85 100644
--- a/migration/ram.c
+++ b/migration/ram.c
@@ -1993,6 +1993,8 @@ static int ram_save_host_page(RAMState *rs, 
PageSearchStatus *pss,
 int tmppages, pages = 0;
 size_t pagesize_bits =
 qemu_ram_pagesize(pss->block) >> TARGET_PAGE_BITS;
+unsigned long hostpage_boundary =
+QEMU_ALIGN_UP(pss->page + 1, pagesize_bits);
 unsigned long start_page = pss->page;
 int res;
 
@@ -2003,30 +2005,27 @@ static int ram_save_host_page(RAMState *rs, 
PageSearchStatus *pss,
 
 do {
 /* Check the pages is dirty and if it is send it */
-if (!migration_bitmap_clear_dirty(rs, pss->block, pss->page)) {
-pss->page++;
-continue;
-}
-
-tmppages = ram_save_target_page(rs, pss, last_stage);
-if (tmppages < 0) {
-return tmppages;
-}
+if (migration_bitmap_clear_dirty(rs, pss->block, pss->page)) {
+tmppages = ram_save_target_page(rs, pss, last_stage);
+if (tmppages < 0) {
+return tmppages;
+}
 
-pages += tmppages;
-pss->page++;
-/*
- * Allow rate limiting to happen in the middle of huge pages if
- * something is sent in the current iteration.
- */
-if (pagesize_bits > 1 && tmppages > 0) {
-migration_rate_limit();
+pages += tmppages;
+/*
+ * Allow rate limiting to happen in the middle of huge pages if
+ * something is sent in the current iteration.
+ */
+if (pagesize_bits > 1 && tmppages > 0) {
+migration_rate_limit();
+}
 }
-} while ((pss->page & (pagesize_bits - 1)) &&
+pss->page = migration_bitmap_find_dirty(rs, pss->block, pss->page);
+} while ((pss->page < hostpage_boundary) &&
  offset_in_ramblock(pss->block,
 ((ram_addr_t)pss->page) << TARGET_PAGE_BITS));
-/* The offset we leave with is the last one we looked at */
-pss->page--;
+/* The offset we leave with is the min boundary of host page and block */
+pss->page = MIN(pss->page, hostpage_boundary) - 1;
 
 res = ram_save_release_protection(rs, pss, start_page);
 return (res < 0 ? res : pages);
-- 
2.23.0




[PATCH v4 1/2] migration/ram: Reduce unnecessary rate limiting

2021-03-16 Thread Kunkun Jiang
When the host page is a huge page and something is sent in the
current iteration, migration_rate_limit() should be executed.
If not, it can be omitted.

Signed-off-by: Keqian Zhu 
Signed-off-by: Kunkun Jiang 
Reviewed-by: David Edmondson 
---
 migration/ram.c | 9 +++--
 1 file changed, 7 insertions(+), 2 deletions(-)

diff --git a/migration/ram.c b/migration/ram.c
index 72143da0ac..3eb5b0d7a7 100644
--- a/migration/ram.c
+++ b/migration/ram.c
@@ -2015,8 +2015,13 @@ static int ram_save_host_page(RAMState *rs, 
PageSearchStatus *pss,
 
 pages += tmppages;
 pss->page++;
-/* Allow rate limiting to happen in the middle of huge pages */
-migration_rate_limit();
+/*
+ * Allow rate limiting to happen in the middle of huge pages if
+ * something is sent in the current iteration.
+ */
+if (pagesize_bits > 1 && tmppages > 0) {
+migration_rate_limit();
+}
 } while ((pss->page & (pagesize_bits - 1)) &&
  offset_in_ramblock(pss->block,
 ((ram_addr_t)pss->page) << TARGET_PAGE_BITS));
-- 
2.23.0




[PATCH v4 0/2] Some modification about ram_save_host_page()

2021-03-16 Thread Kunkun Jiang
Hi all,

This series include patches as below:
Patch 1:
- reduce unnecessary rate limiting in ram_save_host_page()

Patch 2:
- optimized ram_save_host_page() by using migration_bitmap_find_dirty() to find
dirty pages

History:

v3 -> v4:
- Remove the modification to ram_save_host_page() comment [Peter Xu]
- Remove the renaming of tmppages 

v2 -> v3:
- Reduce unnecessary rate limiting if nothing is sent in the current iteration 
[David Edmondson]
- Invert the body of the loop in ram_save_host_page() [David Edmondson]

v1 -> v2:
- Modify ram_save_host_page() comment [David Edmondson]
- Remove 'goto' [David Edmondson]

Kunkun Jiang (2):
  migration/ram: Reduce unnecessary rate limiting
  migration/ram: Optimize ram_save_host_page()

 migration/ram.c | 34 +++---
 1 file changed, 19 insertions(+), 15 deletions(-)

-- 
2.23.0




[RFC PATCH 2/3] vfio: Maintain DMA mapping range for the container

2021-03-10 Thread Kunkun Jiang
From: Zenghui Yu 

When synchronizing the dirty bitmap from kernel VFIO we do it in a
per-iova-range fashion and we allocate the userspace bitmap for each of the
ioctls. This patch introduces `struct VFIODMARange` to describe a range of
the given DMA mapping with respect to a VFIO_IOMMU_MAP_DMA operation, and
makes the bitmap cache of this range persistent so that we don't need to
g_try_malloc0() every time. Note that the new structure is almost a copy of
`struct vfio_iommu_type1_dma_map` but is only used internally by QEMU.
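
For reference, a reconstruction of the structure described above, based on the
fields used in the hunks below (iova, size, bitmap, next); the real definition
in include/hw/vfio/vfio-common.h is not shown in this mail, so the field types
here are assumptions:

typedef struct VFIODMARange {
    QLIST_ENTRY(VFIODMARange) next;   /* linked into container->dma_list */
    hwaddr iova;                      /* start of the DMA mapping        */
    hwaddr size;                      /* length of the DMA mapping       */
    void *bitmap;                     /* persistent dirty-bitmap cache   */
} VFIODMARange;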

More importantly, the cached per-iova-range dirty bitmap will be further
used when we want to add support for the CLEAR_BITMAP and this cached
bitmap will be used to guarantee we don't clear any unknown dirty bits
otherwise that can be a severe data loss issue for migration code.

It's pretty intuitive to maintain a bitmap per container since we perform
log_sync at this granule. But I don't know how to deal with things like
memory hot-{un}plug, sparse DMA mappings, etc. Suggestions welcome.

* yet something to-do:
  - can't work with guest viommu
  - no locks
  - etc

[ The idea and even the commit message are largely inherited from kvm side.
  See commit 9f4bf4baa8b820c7930e23c9566c9493db7e1d25. ]

Signed-off-by: Zenghui Yu 
Signed-off-by: Kunkun Jiang 
---
 hw/vfio/common.c  | 62 +++
 include/hw/vfio/vfio-common.h |  9 +
 2 files changed, 65 insertions(+), 6 deletions(-)

diff --git a/hw/vfio/common.c b/hw/vfio/common.c
index c0ff20f0a2..e7cb0e2b23 100644
--- a/hw/vfio/common.c
+++ b/hw/vfio/common.c
@@ -426,6 +426,29 @@ unmap_exit:
 return ret;
 }
 
+static VFIODMARange *vfio_lookup_match_range(VFIOContainer *container,
+hwaddr start_addr, hwaddr size)
+{
+VFIODMARange *qrange;
+
+QLIST_FOREACH(qrange, &container->dma_list, next) {
+if (qrange->iova == start_addr && qrange->size == size) {
+return qrange;
+}
+}
+return NULL;
+}
+
+static void vfio_dma_range_init_dirty_bitmap(VFIODMARange *qrange)
+{
+uint64_t pages, size;
+
+pages = REAL_HOST_PAGE_ALIGN(qrange->size) / qemu_real_host_page_size;
+size = ROUND_UP(pages, sizeof(__u64) * BITS_PER_BYTE) / BITS_PER_BYTE;
+
+qrange->bitmap = g_malloc0(size);
+}
+
 /*
  * DMA - Mapping and unmapping for the "type1" IOMMU interface used on x86
  */
@@ -439,12 +462,29 @@ static int vfio_dma_unmap(VFIOContainer *container,
 .iova = iova,
 .size = size,
 };
+VFIODMARange *qrange;
 
 if (iotlb && container->dirty_pages_supported &&
 vfio_devices_all_running_and_saving(container)) {
 return vfio_dma_unmap_bitmap(container, iova, size, iotlb);
 }
 
+/*
+ * unregister the DMA range
+ *
+ * It seems that the memory layer will give us the same section as the one
+ * used in region_add(). Otherwise it'll be complicated to manipulate the
+ * bitmap across region_{add,del}. Is there any guarantee?
+ *
+ * But there is really not such a restriction on the kernel interface
+ * (VFIO_IOMMU_DIRTY_PAGES_FLAG_{UN}MAP_DMA, etc).
+ */
+qrange = vfio_lookup_match_range(container, iova, size);
+assert(qrange);
+g_free(qrange->bitmap);
+QLIST_REMOVE(qrange, next);
+g_free(qrange);
+
 while (ioctl(container->fd, VFIO_IOMMU_UNMAP_DMA, &unmap)) {
 /*
  * The type1 backend has an off-by-one bug in the kernel (71a7d3d78e3c
@@ -481,6 +521,14 @@ static int vfio_dma_map(VFIOContainer *container, hwaddr iova,
 .iova = iova,
 .size = size,
 };
+VFIODMARange *qrange;
+
+qrange = g_malloc0(sizeof(*qrange));
+qrange->iova = iova;
+qrange->size = size;
+QLIST_INSERT_HEAD(>dma_list, qrange, next);
+/* XXX allocate the dirty bitmap on demand */
+vfio_dma_range_init_dirty_bitmap(qrange);
 
 if (!readonly) {
 map.flags |= VFIO_DMA_MAP_FLAG_WRITE;
@@ -992,9 +1040,14 @@ static int vfio_get_dirty_bitmap(VFIOContainer *container, uint64_t iova,
 {
 struct vfio_iommu_type1_dirty_bitmap *dbitmap;
 struct vfio_iommu_type1_dirty_bitmap_get *range;
+VFIODMARange *qrange;
 uint64_t pages;
 int ret;
 
+qrange = vfio_lookup_match_range(container, iova, size);
+/* the same as vfio_dma_unmap() */
+assert(qrange);
+
 dbitmap = g_malloc0(sizeof(*dbitmap) + sizeof(*range));
 
 dbitmap->argsz = sizeof(*dbitmap) + sizeof(*range);
@@ -1013,11 +1066,8 @@ static int vfio_get_dirty_bitmap(VFIOContainer *container, uint64_t iova,
 pages = REAL_HOST_PAGE_ALIGN(range->size) / qemu_real_host_page_size;
 range->bitmap.size = ROUND_UP(pages, sizeof(__u64) * BITS_PER_BYTE) /
  BITS_PER_BYTE;
-range->bitmap.data = g_try_malloc0(range->bitmap.size);
-if (!range->bitmap.data) {
-ret = -ENOMEM;
-goto err_out;
-}

[RFC PATCH 3/3] vfio/migration: Support VFIO_IOMMU_DIRTY_PAGES_FLAG_CLEAR_BITMAP

2021-03-10 Thread Kunkun Jiang
From: Zenghui Yu 

The new capability VFIO_DIRTY_LOG_MANUAL_CLEAR and the new ioctl
VFIO_IOMMU_DIRTY_PAGES_FLAG_CLEAR_BITMAP have been introduced in
the kernel, tweak the userspace side to use them.

Check if the kernel supports VFIO_DIRTY_LOG_MANUAL_CLEAR and
provide the log_clear() hook for vfio_memory_listener. If the
kernel supports it, deliver the clear message to the kernel.
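
As a rough illustration of the hookup (a sketch only -- the full hunk
that registers the callback is truncated from this excerpt), the new
hook slots into the existing vfio_memory_listener next to log_sync:

    /* Sketch: forward log_clear events to the container handler. */
    static void vfio_listener_log_clear(MemoryListener *listener,
                                        MemoryRegionSection *section)
    {
        VFIOContainer *container = container_of(listener, VFIOContainer, listener);

        if (vfio_listener_skipped_section(section)) {
            return;
        }
        vfio_physical_log_clear(container, section);
    }

    static const MemoryListener vfio_memory_listener = {
        /* ... existing callbacks ... */
        .log_sync  = vfio_listener_log_sync,
        .log_clear = vfio_listener_log_clear,
    };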

Signed-off-by: Zenghui Yu 
Signed-off-by: Kunkun Jiang 
---
 hw/vfio/common.c  | 145 +-
 include/hw/vfio/vfio-common.h |   1 +
 2 files changed, 145 insertions(+), 1 deletion(-)

diff --git a/hw/vfio/common.c b/hw/vfio/common.c
index e7cb0e2b23..713515b86a 100644
--- a/hw/vfio/common.c
+++ b/hw/vfio/common.c
@@ -1182,10 +1182,146 @@ static void vfio_listener_log_sync(MemoryListener *listener,
 }
 }
 
+/*
+ * I'm not sure if there's any alignment requirement for the CLEAR_BITMAP
+ * ioctl. But copy from kvm side and align {start, size} with 64 pages.
+ *
+ * I think the code can be simplified a lot if no alignment requirement.
+ */
+#define VFIO_CLEAR_LOG_SHIFT  6
+#define VFIO_CLEAR_LOG_ALIGN  (qemu_real_host_page_size << VFIO_CLEAR_LOG_SHIFT)
+#define VFIO_CLEAR_LOG_MASK   (-VFIO_CLEAR_LOG_ALIGN)
+
+static int vfio_log_clear_one_range(VFIOContainer *container,
+VFIODMARange *qrange, uint64_t start, uint64_t size)
+{
+struct vfio_iommu_type1_dirty_bitmap *dbitmap;
+struct vfio_iommu_type1_dirty_bitmap_get *range;
+
+dbitmap = g_malloc0(sizeof(*dbitmap) + sizeof(*range));
+
+dbitmap->argsz = sizeof(*dbitmap) + sizeof(*range);
+dbitmap->flags = VFIO_IOMMU_DIRTY_PAGES_FLAG_CLEAR_BITMAP;
+range = (struct vfio_iommu_type1_dirty_bitmap_get *)>data;
+
+/*
+ * Now let's deal with the actual bitmap, which is almost the same
+ * as the kvm side.
+ */
+uint64_t end, bmap_start, start_delta, bmap_npages;
+unsigned long *bmap_clear = NULL, psize = qemu_real_host_page_size;
+int ret;
+
+bmap_start = start & VFIO_CLEAR_LOG_MASK;
+start_delta = start - bmap_start;
+bmap_start /= psize;
+
+bmap_npages = DIV_ROUND_UP(size + start_delta, VFIO_CLEAR_LOG_ALIGN)
+<< VFIO_CLEAR_LOG_SHIFT;
+end = qrange->size / psize;
+if (bmap_npages > end - bmap_start) {
+bmap_npages = end - bmap_start;
+}
+start_delta /= psize;
+
+if (start_delta) {
+bmap_clear = bitmap_new(bmap_npages);
+bitmap_copy_with_src_offset(bmap_clear, qrange->bitmap,
+bmap_start, start_delta + size / psize);
+bitmap_clear(bmap_clear, 0, start_delta);
+range->bitmap.data = (__u64 *)bmap_clear;
+} else {
+range->bitmap.data = (__u64 *)(qrange->bitmap + BIT_WORD(bmap_start));
+}
+
+range->iova = qrange->iova + bmap_start * psize;
+range->size = bmap_npages * psize;
+range->bitmap.size = ROUND_UP(bmap_npages, sizeof(__u64) * BITS_PER_BYTE) /
+   BITS_PER_BYTE;
+range->bitmap.pgsize = qemu_real_host_page_size;
+
+ret = ioctl(container->fd, VFIO_IOMMU_DIRTY_PAGES, dbitmap);
+if (ret) {
+error_report("Failed to clear dirty log for iova: 0x%"PRIx64
+" size: 0x%"PRIx64" err: %d", (uint64_t)range->iova,
+(uint64_t)range->size, errno);
+goto err_out;
+}
+
+bitmap_clear(qrange->bitmap, bmap_start + start_delta, size / psize);
+err_out:
+g_free(bmap_clear);
+g_free(dbitmap);
+return 0;
+}
+
+static int vfio_physical_log_clear(VFIOContainer *container,
+   MemoryRegionSection *section)
+{
+uint64_t start, size, offset, count;
+VFIODMARange *qrange;
+int ret = 0;
+
+if (!container->dirty_log_manual_clear) {
+/* No need to do explicit clear */
+return ret;
+}
+
+start = section->offset_within_address_space;
+size = int128_get64(section->size);
+
+if (!size) {
+return ret;
+}
+
+QLIST_FOREACH(qrange, >dma_list, next) {
+/*
+ * Discard ranges that do not overlap the section (e.g., the
+ * Memory BAR regions of the device)
+ */
+if (qrange->iova > start + size - 1 ||
+start > qrange->iova + qrange->size - 1) {
+continue;
+}
+
+if (start >= qrange->iova) {
+/* The range starts before section or is aligned to it. */
+offset = start - qrange->iova;
+count = MIN(qrange->size - offset, size);
+} else {
+/* The range starts after section. */
+offset = 0;
+count = MIN(qrange->size, size - (qrange->iova - start));
+}
+ret = vfio_log_clear_one_range(container, qrange, offset, count);
+ 

[RFC PATCH 1/3] linux-headers: update against 5.12-rc2 and "vfio log clear" series

2021-03-10 Thread Kunkun Jiang
From: Zenghui Yu 

The new capability VFIO_DIRTY_LOG_MANUAL_CLEAR and the new ioctl
VFIO_IOMMU_DIRTY_PAGES_FLAG_CLEAR_BITMAP have been introduced in
the kernel, update the header to add them.
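
Userspace discovers (and, per the comment added below, thereby enables)
the capability with the usual extension check; a minimal sketch,
assuming a container fd and a kernel carrying this series:

    #include <stdbool.h>
    #include <sys/ioctl.h>
    #include <linux/vfio.h>

    /* Returns true if the vfio_iommu driver supports manual clearing. */
    static bool vfio_manual_clear_supported(int container_fd)
    {
        return ioctl(container_fd, VFIO_CHECK_EXTENSION,
                     VFIO_DIRTY_LOG_MANUAL_CLEAR) > 0;
    }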

Signed-off-by: Zenghui Yu 
Signed-off-by: Kunkun Jiang 
---
 linux-headers/linux/vfio.h | 55 +-
 1 file changed, 54 insertions(+), 1 deletion(-)

diff --git a/linux-headers/linux/vfio.h b/linux-headers/linux/vfio.h
index 609099e455..9836c5442f 100644
--- a/linux-headers/linux/vfio.h
+++ b/linux-headers/linux/vfio.h
@@ -46,6 +46,20 @@
  */
 #define VFIO_NOIOMMU_IOMMU 8
 
+/* Supports VFIO_DMA_UNMAP_FLAG_ALL */
+#define VFIO_UNMAP_ALL 9
+
+/* Supports the vaddr flag for DMA map and unmap */
+#define VFIO_UPDATE_VADDR  10
+
+/*
+ * The vfio_iommu driver may support user clears dirty log manually, which means
+ * dirty log is not cleared automatically after dirty log is copied to userspace,
+ * it's user's duty to clear dirty log. Note: when user queries this extension
+ * and vfio_iommu driver supports it, then it is enabled.
+ */
+#define VFIO_DIRTY_LOG_MANUAL_CLEAR11
+
 /*
  * The IOCTL interface is designed for extensibility by embedding the
  * structure length (argsz) and flags into structures passed between
@@ -1074,12 +1088,22 @@ struct vfio_iommu_type1_info_dma_avail {
  *
  * Map process virtual addresses to IO virtual addresses using the
  * provided struct vfio_dma_map. Caller sets argsz. READ &/ WRITE required.
+ *
+ * If flags & VFIO_DMA_MAP_FLAG_VADDR, update the base vaddr for iova, and
+ * unblock translation of host virtual addresses in the iova range.  The vaddr
+ * must have previously been invalidated with VFIO_DMA_UNMAP_FLAG_VADDR.  To
+ * maintain memory consistency within the user application, the updated vaddr
+ * must address the same memory object as originally mapped.  Failure to do so
+ * will result in user memory corruption and/or device misbehavior.  iova and
+ * size must match those in the original MAP_DMA call.  Protection is not
+ * changed, and the READ & WRITE flags must be 0.
  */
 struct vfio_iommu_type1_dma_map {
__u32   argsz;
__u32   flags;
 #define VFIO_DMA_MAP_FLAG_READ (1 << 0)/* readable from device */
 #define VFIO_DMA_MAP_FLAG_WRITE (1 << 1)   /* writable from device */
+#define VFIO_DMA_MAP_FLAG_VADDR (1 << 2)
__u64   vaddr;  /* Process virtual address */
__u64   iova;   /* IO virtual address */
__u64   size;   /* Size of mapping (bytes) */
@@ -1102,6 +1126,7 @@ struct vfio_bitmap {
  * field.  No guarantee is made to the user that arbitrary unmaps of iova
  * or size different from those used in the original mapping call will
  * succeed.
+ *
  * VFIO_DMA_UNMAP_FLAG_GET_DIRTY_BITMAP should be set to get the dirty bitmap
  * before unmapping IO virtual addresses. When this flag is set, the user must
  * provide a struct vfio_bitmap in data[]. User must provide zero-allocated
@@ -1111,11 +1136,21 @@ struct vfio_bitmap {
  * indicates that the page at that offset from iova is dirty. A Bitmap of the
  * pages in the range of unmapped size is returned in the user-provided
  * vfio_bitmap.data.
+ *
+ * If flags & VFIO_DMA_UNMAP_FLAG_ALL, unmap all addresses.  iova and size
+ * must be 0.  This cannot be combined with the get-dirty-bitmap flag.
+ *
+ * If flags & VFIO_DMA_UNMAP_FLAG_VADDR, do not unmap, but invalidate host
+ * virtual addresses in the iova range.  Tasks that attempt to translate an
+ * iova's vaddr will block.  DMA to already-mapped pages continues.  This
+ * cannot be combined with the get-dirty-bitmap flag.
  */
 struct vfio_iommu_type1_dma_unmap {
__u32   argsz;
__u32   flags;
 #define VFIO_DMA_UNMAP_FLAG_GET_DIRTY_BITMAP (1 << 0)
+#define VFIO_DMA_UNMAP_FLAG_ALL (1 << 1)
+#define VFIO_DMA_UNMAP_FLAG_VADDR   (1 << 2)
__u64   iova;   /* IO virtual address */
__u64   size;   /* Size of mapping (bytes) */
__u8data[];
@@ -1161,7 +1196,24 @@ struct vfio_iommu_type1_dma_unmap {
  * actual bitmap. If dirty pages logging is not enabled, an error will be
  * returned.
  *
- * Only one of the flags _START, _STOP and _GET may be specified at a time.
+ * Calling the IOCTL with VFIO_IOMMU_DIRTY_PAGES_FLAG_CLEAR_BITMAP flag set,
+ * instructs the IOMMU driver to clear the dirty status of pages in a bitmap
+ * for IOMMU container for a given IOVA range. The user must specify the IOVA
+ * range, the bitmap and the pgsize through the structure
+ * vfio_iommu_type1_dirty_bitmap_get in the data[] portion. This interface
+ * supports clearing a bitmap of the smallest supported pgsize only and can be
+ * modified in future to clear a bitmap of any specified supported pgsize. The
+ * user must

[RFC PATCH 0/3] vfio/migration: Support manual clear vfio dirty log

2021-03-10 Thread Kunkun Jiang
Hi all,

In the past, we cleared the dirty log immediately after syncing it to
userspace. This may cause redundant dirty handling if userspace
handles the dirty log iteratively:

After vfio clears the dirty log, a new dirty log starts to be generated.
This new dirty log will be reported to userspace even if it was generated
before userspace handled the same dirty page.

Since a new dirty log tracking method for vfio based on iommu hwdbm[1]
has been introduced in the kernel, adding a new capability named
VFIO_DIRTY_LOG_MANUAL_CLEAR, we can eliminate some redundant dirty
handling by supporting it.
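
The intended flow with manual clear enabled looks roughly like the
sketch below (a hedged illustration only: CLEAR_BITMAP is the flag
proposed by this series, the helper name is made up, and error handling
is omitted):

    /* Sketch of the sync-then-clear flow with manual clear enabled.
     * Assumes db carries a vfio_iommu_type1_dirty_bitmap_get range in
     * data[], as in the existing GET_BITMAP usage. */
    #include <sys/ioctl.h>
    #include <linux/vfio.h>

    static void sync_then_clear_range(int container_fd,
                                      struct vfio_iommu_type1_dirty_bitmap *db)
    {
        /* 1. Fetch the dirty bitmap; with manual clear the kernel no
         *    longer clears it automatically. */
        db->flags = VFIO_IOMMU_DIRTY_PAGES_FLAG_GET_BITMAP;
        ioctl(container_fd, VFIO_IOMMU_DIRTY_PAGES, db);

        /* ... migrate the pages reported dirty above ... */

        /* 2. Only after those pages have been handled, clear their bits. */
        db->flags = VFIO_IOMMU_DIRTY_PAGES_FLAG_CLEAR_BITMAP;
        ioctl(container_fd, VFIO_IOMMU_DIRTY_PAGES, db);
    }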

This series includes the following patches:
Patch 1:
- updated the linux-headers/linux/vfio.h from kernel side

Patch 2:
- introduced 'struct VFIODMARange' to describe a range of the given DMA
  mapping with respect to a VFIO_IOMMU_MAP_DMA operation

Patch 3:
- implemented the operation to manual clear vfio dirty log, which can
  eliminate some redundant dirty handling

Thanks,
Kunkun Jiang

[1] 
https://lore.kernel.org/linux-iommu/20210310090614.26668-1-zhukeqi...@huawei.com/T/#mb168c9738ecd3d8794e2da14f970545d5820f863

Zenghui Yu (3):
  linux-headers: update against 5.12-rc2 and "vfio log clear" series
  vfio: Maintain DMA mapping range for the container
  vfio/migration: Support VFIO_IOMMU_DIRTY_PAGES_FLAG_CLEAR_BITMAP

 hw/vfio/common.c  | 207 --
 include/hw/vfio/vfio-common.h |  10 ++
 linux-headers/linux/vfio.h|  55 -
 3 files changed, 264 insertions(+), 8 deletions(-)

-- 
2.23.0




Re: [PATCH] vfio: Support host translation granule size

2021-03-09 Thread Kunkun Jiang

Hi Alex,

On 2021/3/10 7:17, Alex Williamson wrote:

On Thu, 4 Mar 2021 21:34:46 +0800
Kunkun Jiang  wrote:


The cpu_physical_memory_set_dirty_lebitmap() can quickly deal with
the dirty pages of memory by bitmap-traveling, regardless of whether
the bitmap is aligned correctly or not.

cpu_physical_memory_set_dirty_lebitmap() supports pages in bitmap of
host page size. So it'd better to set bitmap_pgsize to host page size
to support more translation granule sizes.

Fixes: 87ea529c502 (vfio: Get migration capability flags for container)
Signed-off-by: Kunkun Jiang 
---
  hw/vfio/common.c | 44 ++--
  1 file changed, 22 insertions(+), 22 deletions(-)

diff --git a/hw/vfio/common.c b/hw/vfio/common.c
index 6ff1daa763..69fb5083a4 100644
--- a/hw/vfio/common.c
+++ b/hw/vfio/common.c
@@ -378,7 +378,7 @@ static int vfio_dma_unmap_bitmap(VFIOContainer *container,
  {
  struct vfio_iommu_type1_dma_unmap *unmap;
  struct vfio_bitmap *bitmap;
-uint64_t pages = TARGET_PAGE_ALIGN(size) >> TARGET_PAGE_BITS;
+uint64_t pages = REAL_HOST_PAGE_ALIGN(size) / qemu_real_host_page_size;
  int ret;
  
  unmap = g_malloc0(sizeof(*unmap) + sizeof(*bitmap));

@@ -390,12 +390,12 @@ static int vfio_dma_unmap_bitmap(VFIOContainer *container,
  bitmap = (struct vfio_bitmap *)>data;
  
  /*

- * cpu_physical_memory_set_dirty_lebitmap() expects pages in bitmap of
- * TARGET_PAGE_SIZE to mark those dirty. Hence set bitmap_pgsize to
- * TARGET_PAGE_SIZE.
+ * cpu_physical_memory_set_dirty_lebitmap() supports pages in bitmap of
+ * qemu_real_host_page_size to mark those dirty. Hence set bitmap_pgsize
+ * to qemu_real_host_page_size.


I don't see that this change is well supported by the code,
cpu_physical_memory_set_dirty_lebitmap() seems to operate on

Yes, cpu_physical_memory_set_dirty_lebitmap() ultimately operates on
TARGET_PAGE_SIZE. But it accepts a bitmap whose pages are of
qemu_real_host_page_size and marks those dirty. It uses
"hpratio = qemu_real_host_page_size / TARGET_PAGE_SIZE" to adapt to
different translation granule sizes (e.g. 4K, 2M, 1G).
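
For illustration, a tiny standalone sketch (not QEMU code; names and
sizes are made up) of the hpratio expansion described above: one set bit
in a host-page-granule bitmap marks hpratio target pages dirty.

    /* Toy illustration of the hpratio expansion (not QEMU code). */
    #include <stdio.h>

    #define TARGET_PAGE_SIZE 4096UL

    static void mark_dirty(unsigned long host_page_size,
                           const unsigned char *host_bitmap,
                           unsigned long nr_host_pages,
                           unsigned char *target_dirty /* one byte per target page */)
    {
        unsigned long hpratio = host_page_size / TARGET_PAGE_SIZE; /* 512 for 2M */
        unsigned long i, j;

        for (i = 0; i < nr_host_pages; i++) {
            if (host_bitmap[i / 8] & (1u << (i % 8))) {
                for (j = 0; j < hpratio; j++) {
                    target_dirty[i * hpratio + j] = 1;  /* dirty every target page */
                }
            }
        }
    }

    int main(void)
    {
        unsigned char host_bitmap[1] = { 0x01 };    /* first 2M host page dirty */
        static unsigned char target_dirty[512];

        mark_dirty(2 * 1024 * 1024, host_bitmap, 1, target_dirty);
        printf("first/last target page dirty: %d %d\n",
               target_dirty[0], target_dirty[511]);
        return 0;
    }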

TARGET_PAGE_SIZE, and the next three patch chunks take a detour through
memory listener code that seem unrelated to the change described in the
commit log.  This claims to fix something, what is actually broken?
Thanks,

Alex

The patch 87ea529c502 (vfio: Get migration capability flags for container)
is where the bug starts. The code at [1] (marked below) requires the host
page size to be TARGET_PAGE_SIZE (e.g. 4K) in order to set
container->dirty_pages_supported = true. It is inappropriate to limit the
page size to TARGET_PAGE_SIZE.
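
For reference, the check in question has roughly the following shape
(reconstructed from memory as a sketch, not quoted from the tree):

    /* Approximate shape of the check in vfio_get_iommu_info_migration():
     * dirty page tracking is only enabled when the IOMMU reports support
     * for TARGET_PAGE_SIZE-sized dirty bitmap pages. */
    if (cap_mig->pgsize_bitmap & TARGET_PAGE_SIZE) {
        container->dirty_pages_supported = true;
    }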

Best Regards

Kunkun Jiang


   */
  
-bitmap->pgsize = TARGET_PAGE_SIZE;

+bitmap->pgsize = qemu_real_host_page_size;
  bitmap->size = ROUND_UP(pages, sizeof(__u64) * BITS_PER_BYTE) /
 BITS_PER_BYTE;
  
@@ -674,16 +674,16 @@ static void vfio_listener_region_add(MemoryListener *listener,

  return;
  }
  
-if (unlikely((section->offset_within_address_space & ~TARGET_PAGE_MASK) !=

- (section->offset_within_region & ~TARGET_PAGE_MASK))) {
+if (unlikely((section->offset_within_address_space & ~qemu_real_host_page_mask) !=
+ (section->offset_within_region & ~qemu_real_host_page_mask))) {
  error_report("%s received unaligned region", __func__);
  return;
  }
  
-iova = TARGET_PAGE_ALIGN(section->offset_within_address_space);

+iova = REAL_HOST_PAGE_ALIGN(section->offset_within_address_space);
  llend = int128_make64(section->offset_within_address_space);
  llend = int128_add(llend, section->size);
-llend = int128_and(llend, int128_exts64(TARGET_PAGE_MASK));
+llend = int128_and(llend, int128_exts64(qemu_real_host_page_mask));
  
  if (int128_ge(int128_make64(iova), llend)) {

  return;
@@ -892,8 +892,8 @@ static void vfio_listener_region_del(MemoryListener 
*listener,
  return;
  }
  
-if (unlikely((section->offset_within_address_space & ~TARGET_PAGE_MASK) !=

- (section->offset_within_region & ~TARGET_PAGE_MASK))) {
+if (unlikely((section->offset_within_address_space & ~qemu_real_host_page_mask) !=
+ (section->offset_within_region & ~qemu_real_host_page_mask))) {
  error_report("%s received unaligned region", __func__);
  return;
  }
@@ -921,10 +921,10 @@ static void vfio_listener_region_del(MemoryListener 
*listener,
   */
  }
  
-iova = TARGET_PAGE_ALIGN(section->offset_within_address_space);

+iova = REAL_HOST_PAGE_ALIGN(section->offset_within_address_space);
  llend = int128_make64(section->offset_within_address_space);
  llend = int128_add(llend, section->size);
-llend = int128_and(llend

Re: [PATCH v3 2/3] migration/ram: Reduce unnecessary rate limiting

2021-03-09 Thread Kunkun Jiang

Hi,

On 2021/3/10 0:15, Peter Xu wrote:

On Tue, Mar 09, 2021 at 10:33:04PM +0800, Kunkun Jiang wrote:

Hi,

On 2021/3/9 5:12, Peter Xu wrote:

On Mon, Mar 08, 2021 at 06:34:58PM +0800, Kunkun Jiang wrote:

Hi,

On 2021/3/5 22:22, Peter Xu wrote:

Kunkun,

On Fri, Mar 05, 2021 at 03:50:34PM +0800, Kunkun Jiang wrote:

When the host page is a huge page and something is sent in the
current iteration, the migration_rate_limit() should be executed.
If not, this function can be omitted to save time.

Rename tmppages to pages_this_iteration to express its meaning
more clearly.

Signed-off-by: Keqian Zhu 
Signed-off-by: Kunkun Jiang 
---
migration/ram.c | 21 ++---
1 file changed, 14 insertions(+), 7 deletions(-)

diff --git a/migration/ram.c b/migration/ram.c
index a168da5cdd..9fc5b2997c 100644
--- a/migration/ram.c
+++ b/migration/ram.c
@@ -1988,7 +1988,7 @@ static int ram_save_target_page(RAMState *rs, PageSearchStatus *pss,
static int ram_save_host_page(RAMState *rs, PageSearchStatus *pss,
  bool last_stage)
{
-int tmppages, pages = 0;
+int pages = 0;
size_t pagesize_bits =
qemu_ram_pagesize(pss->block) >> TARGET_PAGE_BITS;
unsigned long start_page = pss->page;
@@ -2000,21 +2000,28 @@ static int ram_save_host_page(RAMState *rs, PageSearchStatus *pss,
}
do {
+int pages_this_iteration = 0;
+
/* Check if the page is dirty and send it if it is */
if (!migration_bitmap_clear_dirty(rs, pss->block, pss->page)) {
pss->page++;
continue;
}
-tmppages = ram_save_target_page(rs, pss, last_stage);
-if (tmppages < 0) {
-return tmppages;
+pages_this_iteration = ram_save_target_page(rs, pss, last_stage);
+if (pages_this_iteration < 0) {
+return pages_this_iteration;
}
-pages += tmppages;
+pages += pages_this_iteration;

To me, both names are okay, it's just that the new name doesn't really provide
a lot more new information, while it's even longer...

Since you seem to prefer cleaning up tmppages, I'm actually thinking whether
it should be called as "pages" at all since ram_save_target_page() majorly only
returns either 1 if succeeded or <0 if error.  There's only one very corner
case of xbzrle where it can return 0 in save_xbzrle_page():

   if (encoded_len == 0) {
   trace_save_xbzrle_page_skipping();
   return 0;
   }

I think it means the page didn't change at all, then I'm also wondering maybe
it can also return 1 showing one page migrated (though actually skipped!) which
should still be fine for the callers, e.g., ram_find_and_save_block() who will
finally check this "pages" value.

So I think _maybe_ that's a nicer cleanup to change that "return 0" to "return
1", then another patch to make the return value to be (1) return 0 if page
saved, or (2) return <0 if error.  Then here in ram_save_host_page() tmppages
can be renamed to "ret" or "succeed".

Thanks for your advice.
Changing "return 0" to "return 1" would have a slight effect on
'rs->target_page_count += pages'
in ram_save_iterate(). This may require considering more complex situations.
What do you think of
this?

I don't think we should change the meaning of ram_save_host_page()'s return
value, but only ram_save_target_page(); ram_save_host_page() could return >1
for huge pages.  Thanks,


I am not sure I have got your point. If I change "return 0" to "return 1"
(actually skipped),
the "pages" in ram_save_host_page() will increase by 1 (originally by 0). Then it
will end the
loop in ram_find_and_save_block().

Frankly I even think it's a bug - when return 0 it could mean the xbzrle page
is the same as before even if dirty bit set (the page just got written with the
same data, that's why dirty bit set but xbzrle calculated with no diff).
However it shouldn't mean "done==1" which is a sign of "migration complete"
imho..

Thanks for your explanation, I get it.

And then in ram_save_iterate(), the
change may
have an effect on "rs->target_page_count += pages".
I haven't done enough research on this part of the code. Do you think this
change will have
a big impact?

Maybe, but I don't expect it to change anything real.  If to change it we'd
definitely better smoke xbzrle a bit.  It's a pure idea I got in mind to
cleanup the code, but feel free to ignore it too.

For your current series, I think the last patch is the most appealing.  So
maybe we can focus on that first.

Thanks,


You are right. The change here may not be worth it.

Thanks,

Kunkun Jiang




Re: [PATCH v3 2/3] migration/ram: Reduce unnecessary rate limiting

2021-03-09 Thread Kunkun Jiang

Hi,

On 2021/3/9 5:12, Peter Xu wrote:

On Mon, Mar 08, 2021 at 06:34:58PM +0800, Kunkun Jiang wrote:

Hi,

On 2021/3/5 22:22, Peter Xu wrote:

Kunkun,

On Fri, Mar 05, 2021 at 03:50:34PM +0800, Kunkun Jiang wrote:

When the host page is a huge page and something is sent in the
current iteration, the migration_rate_limit() should be executed.
If not, this function can be omitted to save time.

Rename tmppages to pages_this_iteration to express its meaning
more clearly.

Signed-off-by: Keqian Zhu 
Signed-off-by: Kunkun Jiang 
---
   migration/ram.c | 21 ++---
   1 file changed, 14 insertions(+), 7 deletions(-)

diff --git a/migration/ram.c b/migration/ram.c
index a168da5cdd..9fc5b2997c 100644
--- a/migration/ram.c
+++ b/migration/ram.c
@@ -1988,7 +1988,7 @@ static int ram_save_target_page(RAMState *rs, PageSearchStatus *pss,
   static int ram_save_host_page(RAMState *rs, PageSearchStatus *pss,
 bool last_stage)
   {
-int tmppages, pages = 0;
+int pages = 0;
   size_t pagesize_bits =
   qemu_ram_pagesize(pss->block) >> TARGET_PAGE_BITS;
   unsigned long start_page = pss->page;
@@ -2000,21 +2000,28 @@ static int ram_save_host_page(RAMState *rs, PageSearchStatus *pss,
   }
   do {
+int pages_this_iteration = 0;
+
   /* Check if the page is dirty and send it if it is */
   if (!migration_bitmap_clear_dirty(rs, pss->block, pss->page)) {
   pss->page++;
   continue;
   }
-tmppages = ram_save_target_page(rs, pss, last_stage);
-if (tmppages < 0) {
-return tmppages;
+pages_this_iteration = ram_save_target_page(rs, pss, last_stage);
+if (pages_this_iteration < 0) {
+return pages_this_iteration;
   }
-pages += tmppages;
+pages += pages_this_iteration;

To me, both names are okay, it's just that the new name doesn't really provide
a lot more new information, while it's even longer...

Since you seem to prefer cleaning up tmppages, I'm actually thinking whether
it should be called as "pages" at all since ram_save_target_page() majorly only
returns either 1 if succeeded or <0 if error.  There's only one very corner
case of xbzrle where it can return 0 in save_xbzrle_page():

  if (encoded_len == 0) {
  trace_save_xbzrle_page_skipping();
  return 0;
  }

I think it means the page didn't change at all, then I'm also wondering maybe
it can also return 1 showing one page migrated (though actually skipped!) which
should still be fine for the callers, e.g., ram_find_and_save_block() who will
finally check this "pages" value.

So I think _maybe_ that's a nicer cleanup to change that "return 0" to "return
1", then another patch to make the return value to be (1) return 0 if page
saved, or (2) return <0 if error.  Then here in ram_save_host_page() tmppages
can be renamed to "ret" or "succeed".

Thanks for your advice.
Changing "return 0" to "return 1" would have a slight effect on
'rs->target_page_count += pages'
in ram_save_iterate(). This may require considering more complex situations.
What do you think of
this?

I don't think we should change the meaning of ram_save_host_page()'s return
value, but only ram_save_target_page(); ram_save_host_page() could return >1
for huge pages.  Thanks,

I am not sure I have got your point. If I change "return 0" to "return 1"
(actually skipped), the "pages" in ram_save_host_page() will increase by 1
(originally by 0). Then it will end the loop in ram_find_and_save_block().
And then in ram_save_iterate(), the change may have an effect on
"rs->target_page_count += pages".
I haven't done enough research on this part of the code. Do you think this
change will have a big impact?

Thanks,

Kunkun Jiang




Re: [PATCH v3 1/3] migration/ram: Modify the code comment of ram_save_host_page()

2021-03-09 Thread Kunkun Jiang

Hi,

On 2021/3/9 5:03, Peter Xu wrote:

On Mon, Mar 08, 2021 at 06:33:56PM +0800, Kunkun Jiang wrote:

Hi, Peter

On 2021/3/5 21:59, Peter Xu wrote:

On Fri, Mar 05, 2021 at 03:50:33PM +0800, Kunkun Jiang wrote:

The ram_save_host_page() has been modified several times
since its birth. But the comment hasn't been modified as it should
be. It'd better to modify the comment to explain ram_save_host_page()
more clearly.

Signed-off-by: Kunkun Jiang 
---
   migration/ram.c | 16 +++-
   1 file changed, 7 insertions(+), 9 deletions(-)

diff --git a/migration/ram.c b/migration/ram.c
index 72143da0ac..a168da5cdd 100644
--- a/migration/ram.c
+++ b/migration/ram.c
@@ -1970,15 +1970,13 @@ static int ram_save_target_page(RAMState *rs, PageSearchStatus *pss,
   }
   /**
- * ram_save_host_page: save a whole host page
+ * ram_save_host_page: save a whole host page or the rest of a RAMBlock
*
- * Starting at *offset send pages up to the end of the current host
- * page. It's valid for the initial offset to point into the middle of
- * a host page in which case the remainder of the hostpage is sent.
- * Only dirty target pages are sent. Note that the host page size may
- * be a huge page for this block.
- * The saving stops at the boundary of the used_length of the block
- * if the RAMBlock isn't a multiple of the host page size.

[1]


+ * Send dirty pages between pss->page and either the end of that page
+ * or the used_length of the RAMBlock, whichever is smaller.
+ *
+ * Note that if the host page is a huge page, pss->page may be in the
+ * middle of that page.

It reads more like a shorter version of original comment..  Could you point out
what's the major difference?  Since the original comment looks still good to me.

Sorry for the late reply.
The reason I want to modify the comment is that I think some parts of the
comment
don't match the code. (1) The brief description of ram_save_host_page()
missed a
situation: ram_save_host_page() will send dirty pages between pss->page
and
the used_length of the block, if the RAMBlock isn't a multiple of the host
page
size.

It seems it mentioned at [1] above.


(2) '*offset' should be replaced with pss->page.

This one looks right as you mentioned.


So taking the chance of optimizing ram_save_host_page(), I modified the
comment.
This version of the comment was suggested by @David Edmondson. Compared with the
original
comment, I think it is more concise.

I think changing incorrect comments is nice; it's just that we should still
avoid rewriting comments with similar contents.


*
* Returns the number of pages written or negative on error
*
@@ -2002,7 +2000,7 @@ static int ram_save_host_page(RAMState *rs, PageSearchStatus *pss,
   }
   do {
-/* Check the pages is dirty and if it is send it */
+/* Check if the page is dirty and send it if it is */

This line fixes some English issues, it seems.  Looks okay, but normally I
won't change it unless the wording was too weird, so as to keep the git record
or the original lines.  Here the original looks still okay, no strong opinion.

Thanks,

Yes, the original reads weird to me. Same reason as above, I modified this
line.

If you think there is no need to modify the original commit, I do not insist
on
changing it.

As I mentioned I don't really have a strong preference, so feel free to keep
your own will. :) I would only suggest avoiding rewriting chunks of similar
meaning.

Thanks,


Thank you for your advice, I will pay more attention to it in the future.

Best regards,

Kunkun Jiang




Re: [PATCH v3 3/3] migration/ram: Optimize ram_save_host_page()

2021-03-09 Thread Kunkun Jiang

Hi,

On 2021/3/9 5:36, Peter Xu wrote:

On Mon, Mar 08, 2021 at 09:58:02PM +0800, Kunkun Jiang wrote:

Hi,

On 2021/3/5 22:30, Peter Xu wrote:

On Fri, Mar 05, 2021 at 03:50:35PM +0800, Kunkun Jiang wrote:

Starting from pss->page, ram_save_host_page() will check every page
and send the dirty pages up to the end of the current host page or
the boundary of used_length of the block. If the host page size is
a huge page, the step "check" will take a lot of time.

This will improve performance to use migration_bitmap_find_dirty().

Is there any measurement done?

I tested it on Kunpeng 920.  VM params: 1U 4G( page size 1G).
The time of ram_save_host_page() in the last round of ram saving:
before optimize: 9250us   after optimize: 34us

Looks like an idle VM, but still this is a great improvement.  Would you mind
add this into the commit message too?

Ok, I will add it in the next version.

This looks like an optimization, but to me it seems to have changed a lot
context that it doesn't need to... Do you think it'll also work to just look up
dirty again and update pss->page properly if migration_bitmap_clear_dirty()
returned zero?

Thanks,

This just inverted the body of the loop, suggested by @David Edmondson.
Here is the v2[1]. Do you mean to change it like this?

[1]: 
http://patchwork.ozlabs.org/project/qemu-devel/patch/20210301082132.1107-4-jiangkun...@huawei.com/

I see, then it's okay - But indeed I still prefer your previous version. :)

Thanks,

Both versions are fine to me. This version may make the final code 
slightly cleaner, I think.


Thanks,

Kunkun Jiang




Re: [PATCH v3 3/3] migration/ram: Optimize ram_save_host_page()

2021-03-08 Thread Kunkun Jiang

Hi,

On 2021/3/5 22:30, Peter Xu wrote:

On Fri, Mar 05, 2021 at 03:50:35PM +0800, Kunkun Jiang wrote:

Starting from pss->page, ram_save_host_page() will check every page
and send the dirty pages up to the end of the current host page or
the boundary of used_length of the block. If the host page size is
a huge page, the step "check" will take a lot of time.

This will improve performance to use migration_bitmap_find_dirty().

Is there any measurement done?

I tested it on Kunpeng 920.  VM params: 1U 4G( page size 1G).
The time of ram_save_host_page() in the last round of ram saving:
before optimize: 9250us   after optimize: 34us

This looks like an optimization, but to me it seems to have changed a lot
context that it doesn't need to... Do you think it'll also work to just look up
dirty again and update pss->page properly if migration_bitmap_clear_dirty()
returned zero?

Thanks,

This just inverted the body of the loop, suggested by @David Edmondson.
Here is the v2[1]. Do you mean to change it like this?

[1]: 
http://patchwork.ozlabs.org/project/qemu-devel/patch/20210301082132.1107-4-jiangkun...@huawei.com/


Thanks,
Kunkun Jiang

Signed-off-by: Keqian Zhu 
Signed-off-by: Kunkun Jiang 
---
  migration/ram.c | 39 +++
  1 file changed, 19 insertions(+), 20 deletions(-)

diff --git a/migration/ram.c b/migration/ram.c
index 9fc5b2997c..28215aefe4 100644
--- a/migration/ram.c
+++ b/migration/ram.c
@@ -1991,6 +1991,8 @@ static int ram_save_host_page(RAMState *rs, PageSearchStatus *pss,
  int pages = 0;
  size_t pagesize_bits =
  qemu_ram_pagesize(pss->block) >> TARGET_PAGE_BITS;
+unsigned long hostpage_boundary =
+QEMU_ALIGN_UP(pss->page + 1, pagesize_bits);
  unsigned long start_page = pss->page;
  int res;
  
@@ -2003,30 +2005,27 @@ static int ram_save_host_page(RAMState *rs, PageSearchStatus *pss,

  int pages_this_iteration = 0;
  
  /* Check if the page is dirty and send it if it is */

-if (!migration_bitmap_clear_dirty(rs, pss->block, pss->page)) {
-pss->page++;
-continue;
-}
-
-pages_this_iteration = ram_save_target_page(rs, pss, last_stage);
-if (pages_this_iteration < 0) {
-return pages_this_iteration;
-}
+if (migration_bitmap_clear_dirty(rs, pss->block, pss->page)) {
+pages_this_iteration = ram_save_target_page(rs, pss, last_stage);
+if (pages_this_iteration < 0) {
+return pages_this_iteration;
+}
  
-pages += pages_this_iteration;

-pss->page++;
-/*
- * Allow rate limiting to happen in the middle of huge pages if
- * something is sent in the current iteration.
- */
-if (pagesize_bits > 1 && pages_this_iteration > 0) {
-migration_rate_limit();
+pages += pages_this_iteration;
+/*
+ * Allow rate limiting to happen in the middle of huge pages if
+ * something is sent in the current iteration.
+ */
+if (pagesize_bits > 1 && pages_this_iteration > 0) {
+migration_rate_limit();
+}
  }
-} while ((pss->page & (pagesize_bits - 1)) &&
+pss->page = migration_bitmap_find_dirty(rs, pss->block, pss->page);
+} while ((pss->page < hostpage_boundary) &&
   offset_in_ramblock(pss->block,
  ((ram_addr_t)pss->page) << TARGET_PAGE_BITS));
-/* The offset we leave with is the last one we looked at */
-pss->page--;
+/* The offset we leave with is the min boundary of host page and block */
+pss->page = MIN(pss->page, hostpage_boundary) - 1;
  
  res = ram_save_release_protection(rs, pss, start_page);

  return (res < 0 ? res : pages);
--
2.23.0






Re: [PATCH v3 2/3] migration/ram: Reduce unnecessary rate limiting

2021-03-08 Thread Kunkun Jiang

Hi,

On 2021/3/5 22:22, Peter Xu wrote:

Kunkun,

On Fri, Mar 05, 2021 at 03:50:34PM +0800, Kunkun Jiang wrote:

When the host page is a huge page and something is sent in the
current iteration, the migration_rate_limit() should be executed.
If not, this function can be omitted to save time.

Rename tmppages to pages_this_iteration to express its meaning
more clearly.

Signed-off-by: Keqian Zhu 
Signed-off-by: Kunkun Jiang 
---
  migration/ram.c | 21 ++---
  1 file changed, 14 insertions(+), 7 deletions(-)

diff --git a/migration/ram.c b/migration/ram.c
index a168da5cdd..9fc5b2997c 100644
--- a/migration/ram.c
+++ b/migration/ram.c
@@ -1988,7 +1988,7 @@ static int ram_save_target_page(RAMState *rs, PageSearchStatus *pss,
  static int ram_save_host_page(RAMState *rs, PageSearchStatus *pss,
bool last_stage)
  {
-int tmppages, pages = 0;
+int pages = 0;
  size_t pagesize_bits =
  qemu_ram_pagesize(pss->block) >> TARGET_PAGE_BITS;
  unsigned long start_page = pss->page;
@@ -2000,21 +2000,28 @@ static int ram_save_host_page(RAMState *rs, PageSearchStatus *pss,
  }
  
  do {

+int pages_this_iteration = 0;
+
  /* Check if the page is dirty and send it if it is */
  if (!migration_bitmap_clear_dirty(rs, pss->block, pss->page)) {
  pss->page++;
  continue;
  }
  
-tmppages = ram_save_target_page(rs, pss, last_stage);

-if (tmppages < 0) {
-return tmppages;
+pages_this_iteration = ram_save_target_page(rs, pss, last_stage);
+if (pages_this_iteration < 0) {
+return pages_this_iteration;
  }
  
-pages += tmppages;

+pages += pages_this_iteration;

To me, both names are okay, it's just that the new name doesn't really provide
a lot more new information, while it's even longer...

Since you seem to prefer cleaning up tmppages, I'm actually thinking whether
it should be called as "pages" at all since ram_save_target_page() majorly only
returns either 1 if succeeded or <0 if error.  There's only one very corner
case of xbzrle where it can return 0 in save_xbzrle_page():

 if (encoded_len == 0) {
 trace_save_xbzrle_page_skipping();
 return 0;
 }

I think it means the page didn't change at all, then I'm also wondering maybe
it can also return 1 showing one page migrated (though actually skipped!) which
should still be fine for the callers, e.g., ram_find_and_save_block() who will
finally check this "pages" value.

So I think _maybe_ that's a nicer cleanup to change that "return 0" to "return
1", then another patch to make the return value to be (1) return 0 if page
saved, or (2) return <0 if error.  Then here in ram_save_host_page() tmppages
can be renamed to "ret" or "succeed".

Thanks for your advice.
Changing "return 0" to "return 1" would have a slight effect on
'rs->target_page_count += pages'
in ram_save_iterate(). This may require considering more complex
situations. What do you think of
this?

  pss->page++;
-/* Allow rate limiting to happen in the middle of huge pages */
-migration_rate_limit();
+/*
+ * Allow rate limiting to happen in the middle of huge pages if
+ * something is sent in the current iteration.
+ */
+if (pagesize_bits > 1 && pages_this_iteration > 0) {
+migration_rate_limit();
+}

I don't know whether this matters either..  Two calls in there:

 migration_update_counters(s, now);
 qemu_file_rate_limit(s->to_dst_file);

migration_update_counters() is mostly a noop for 99.9% cases. Looks still 
okay...

Thanks,
I think even these two calls shouldn't be called if the host page size 
isn't a huge page or

nothing is sent in the current iteration.

Thanks,
Kunkun Jiang

  } while ((pss->page & (pagesize_bits - 1)) &&
   offset_in_ramblock(pss->block,
  ((ram_addr_t)pss->page) << TARGET_PAGE_BITS));
--
2.23.0






Re: [PATCH v3 1/3] migration/ram: Modify the code comment of ram_save_host_page()

2021-03-08 Thread Kunkun Jiang

Hi, Peter

On 2021/3/5 21:59, Peter Xu wrote:

On Fri, Mar 05, 2021 at 03:50:33PM +0800, Kunkun Jiang wrote:

The ram_save_host_page() has been modified several times
since its birth. But the comment hasn't been modified as it should
be. It'd better to modify the comment to explain ram_save_host_page()
more clearly.

Signed-off-by: Kunkun Jiang 
---
  migration/ram.c | 16 +++-
  1 file changed, 7 insertions(+), 9 deletions(-)

diff --git a/migration/ram.c b/migration/ram.c
index 72143da0ac..a168da5cdd 100644
--- a/migration/ram.c
+++ b/migration/ram.c
@@ -1970,15 +1970,13 @@ static int ram_save_target_page(RAMState *rs, PageSearchStatus *pss,
  }
  
  /**

- * ram_save_host_page: save a whole host page
+ * ram_save_host_page: save a whole host page or the rest of a RAMBlock
   *
- * Starting at *offset send pages up to the end of the current host
- * page. It's valid for the initial offset to point into the middle of
- * a host page in which case the remainder of the hostpage is sent.
- * Only dirty target pages are sent. Note that the host page size may
- * be a huge page for this block.
- * The saving stops at the boundary of the used_length of the block
- * if the RAMBlock isn't a multiple of the host page size.
+ * Send dirty pages between pss->page and either the end of that page
+ * or the used_length of the RAMBlock, whichever is smaller.
+ *
+ * Note that if the host page is a huge page, pss->page may be in the
+ * middle of that page.

It reads more like a shorter version of original comment..  Could you point out
what's the major difference?  Since the original comment looks still good to me.

Sorry for the late reply.
The reason I want to modify the comment is that I think some parts of
the comment
don't match the code. (1) The brief description of ram_save_host_page()
missed a
situation: ram_save_host_page() will send dirty pages between
pss->page and
the used_length of the block, if the RAMBlock isn't a multiple of the
host page
size. (2) '*offset' should be replaced with pss->page.

So taking the chance of optimizing ram_save_host_page(), I modified the
comment.
This version of the comment was suggested by @David Edmondson. Compared with
the original

comment, I think it is more concise.

   *
   * Returns the number of pages written or negative on error
   *
@@ -2002,7 +2000,7 @@ static int ram_save_host_page(RAMState *rs, PageSearchStatus *pss,
  }
  
  do {

-/* Check the pages is dirty and if it is send it */
+/* Check if the page is dirty and send it if it is */

This line fixes some English issues, it seems.  Looks okay, but normally I
won't change it unless the wording was too weird, so as to keep the git record
or the original lines.  Here the original looks still okay, no strong opinion.

Thanks,
Yes, the original reads weird to me. Same reason as above, I modified 
this line.


If you think there is no need to modify the original commit, I do not 
insist on

changing it.

Thanks,
Kunkun Jiang

  if (!migration_bitmap_clear_dirty(rs, pss->block, pss->page)) {
  pss->page++;
  continue;
--
2.23.0






[PATCH v3 3/3] migration/ram: Optimize ram_save_host_page()

2021-03-04 Thread Kunkun Jiang
Starting from pss->page, ram_save_host_page() will check every page
and send the dirty pages up to the end of the current host page or
the boundary of used_length of the block. If the host page size is
a huge page, the step "check" will take a lot of time.

Using migration_bitmap_find_dirty() to skip over the clean pages improves performance.
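
To make the idea concrete, here is a toy, standalone sketch (placeholder
names, not the QEMU API) of driving the loop by jumping from one dirty
bit to the next, which is what migration_bitmap_find_dirty() is used for
in the hunk below, instead of testing every target page of a (possibly
1G) host page:

    /* Toy illustration only; find_next_dirty() stands in for
     * migration_bitmap_find_dirty(). */
    #include <stdint.h>
    #include <stdio.h>

    #define NPAGES 64                    /* target pages in the toy host page */

    static int find_next_dirty(uint64_t bmap, int start)
    {
        int i;

        for (i = start; i < NPAGES; i++) {
            if (bmap & (1ULL << i)) {
                return i;
            }
        }
        return NPAGES;                   /* no dirty page left */
    }

    int main(void)
    {
        uint64_t bmap = (1ULL << 3) | (1ULL << 40);   /* two dirty pages */
        int page, sent = 0;

        for (page = find_next_dirty(bmap, 0); page < NPAGES;
             page = find_next_dirty(bmap, page + 1)) {
            sent++;                      /* "send" the dirty page */
        }
        printf("dirty pages sent: %d\n", sent);
        return 0;
    }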

Signed-off-by: Keqian Zhu 
Signed-off-by: Kunkun Jiang 
---
 migration/ram.c | 39 +++
 1 file changed, 19 insertions(+), 20 deletions(-)

diff --git a/migration/ram.c b/migration/ram.c
index 9fc5b2997c..28215aefe4 100644
--- a/migration/ram.c
+++ b/migration/ram.c
@@ -1991,6 +1991,8 @@ static int ram_save_host_page(RAMState *rs, PageSearchStatus *pss,
 int pages = 0;
 size_t pagesize_bits =
 qemu_ram_pagesize(pss->block) >> TARGET_PAGE_BITS;
+unsigned long hostpage_boundary =
+QEMU_ALIGN_UP(pss->page + 1, pagesize_bits);
 unsigned long start_page = pss->page;
 int res;
 
@@ -2003,30 +2005,27 @@ static int ram_save_host_page(RAMState *rs, PageSearchStatus *pss,
 int pages_this_iteration = 0;
 
 /* Check if the page is dirty and send it if it is */
-if (!migration_bitmap_clear_dirty(rs, pss->block, pss->page)) {
-pss->page++;
-continue;
-}
-
-pages_this_iteration = ram_save_target_page(rs, pss, last_stage);
-if (pages_this_iteration < 0) {
-return pages_this_iteration;
-}
+if (migration_bitmap_clear_dirty(rs, pss->block, pss->page)) {
+pages_this_iteration = ram_save_target_page(rs, pss, last_stage);
+if (pages_this_iteration < 0) {
+return pages_this_iteration;
+}
 
-pages += pages_this_iteration;
-pss->page++;
-/*
- * Allow rate limiting to happen in the middle of huge pages if
- * something is sent in the current iteration.
- */
-if (pagesize_bits > 1 && pages_this_iteration > 0) {
-migration_rate_limit();
+pages += pages_this_iteration;
+/*
+ * Allow rate limiting to happen in the middle of huge pages if
+ * something is sent in the current iteration.
+ */
+if (pagesize_bits > 1 && pages_this_iteration > 0) {
+migration_rate_limit();
+}
 }
-} while ((pss->page & (pagesize_bits - 1)) &&
+pss->page = migration_bitmap_find_dirty(rs, pss->block, pss->page);
+} while ((pss->page < hostpage_boundary) &&
  offset_in_ramblock(pss->block,
 ((ram_addr_t)pss->page) << TARGET_PAGE_BITS));
-/* The offset we leave with is the last one we looked at */
-pss->page--;
+/* The offset we leave with is the min boundary of host page and block */
+pss->page = MIN(pss->page, hostpage_boundary) - 1;
 
 res = ram_save_release_protection(rs, pss, start_page);
 return (res < 0 ? res : pages);
-- 
2.23.0



